Messages from 158275

Article: 158275
Subject: Re: Correlator of a big antenna array on FPGA
From: GaborSzakacs <gabor@alacron.com>
Date: Thu, 01 Oct 2015 16:14:37 -0400

rickman wrote:
> On 10/1/2015 3:15 PM, ste3191 wrote:
>>> On 9/30/2015 8:13 AM, ste3191 wrote:
>>>> Hi, I have a serious problem with the architecture of a correlator
>>>> for a planar antenna array (16 x 16).
>>>> Theoretically I can't implement the normal expression sum(X*X^H)
>>>> because I would obtain a covariance matrix of 256 x 256. So I am
>>>> considering the spatial smoothing technique, which averages
>>>> overlapped subarrays and has the advantage of a smaller covariance
>>>> matrix. This is correct, but it is a slow technique! I need an
>>>> efficient and fast method to compute the covariance matrix on an
>>>> FPGA, with as few multipliers as possible. In fact, for a 16 x 16
>>>> covariance matrix I need about 6000 multipliers! So I have looked
>>>> at correlators based on hard limiting (sign + XOR + counter) at
>>>> this link:
>>>>
>>>> https://www.google.it/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CDYQFjACahUKEwjwsKjT257IAhVlgXIKHQKFCWw&url=http%3A%2F%2Fhandle.dtic.mil%2F100.2%2FADA337434&usg=AFQjCNG5QUylZORV9KFHYizyu1QJZSBM5A&bvm=bv.103627116,d.d2s&cad=rja
>>>>
>>>> but I don't know if this technique is right; in Simulink the
>>>> results are very different from those of a normal correlator.
>>>> Can someone help me?
>>>
>>> Even though your solution will be implemented in an FPGA, I'm not sure
>>> the FPGA group is the best place to ask this question since it is about
>>> the algorithm more than the FPGA implementation.  I am cross posting to
>>> the DSP group to see if anyone there has experience with it.
>>>
>>> That said, you don't say what your data rate and processing rates are.
>>> How often do you need to run this calculation?  If it is slow enough
>>> you can use the same multipliers for many computations to produce one
>>> result.  Or will this be run on every data sample at a high rate?
>>>
>>> -- 
>>>
>>> Rick
>>
>> Yes, the sampling rate is higher than 80 MSPS and I can't share resources.
>> I posted it on the DSP forum but nobody has answered yet.
> 
> Yes, I saw that.  Looks like you beat me to it.  lol
> 
> I don't know where else to seek advice.  Maybe talk to the FPGA vendors?
> I know they have various expertise in applications.  Is this something
> you will end up building?  If so, and it uses a lot of resources, you
> should be able to get some application support.
> 
> You know, 80 MHz is not so fast for multiplies or adds.  The multiplier
> block in most newer FPGAs will run at 100's of MHz.  So you certainly
> should be able to multiplex the multiplier unit by 4x or more.  But that
> really doesn't solve your problem if you want to do it on a single chip.
> I haven't looked at the high end, but I'm pretty sure they don't put
> 1500 multipliers on a chip.  But it may put you in the ballpark where
> you can do this with a small handful of large FPGAs.  Very pricey though.
> 

Actually you can get up to 1,920 DSP slices on a Kintex-7 and
considerably more on the Virtex-7 and Virtex Ultrascale devices,
however a "multiplier" may eat more than one DSP slice depending
on the number of bits you want.  On the other hand they are supposed
to run at 500 MHz in these parts.

-- 
Gabor
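
As a rough check on the multiplier counts being traded back and forth here, the sketch below estimates how many real multiplies one covariance update R += x * x^H costs per sample, and how far time-multiplexing a fast DSP block reduces the number of physical multipliers. The function name, the figure of 4 real multiplies per complex product, and the choice to compute only the upper triangle of the Hermitian matrix are assumptions for illustration; the counts quoted in the thread may rest on different assumptions (operand width, smoothing structure).

```python
import math

def covariance_multipliers(n_elements, sample_rate_hz, dsp_clock_hz,
                           real_mults_per_complex=4):
    """Estimate multipliers for one R += x * x^H update per input sample.

    Assumes only the upper triangle (diagonal included) of the Hermitian
    matrix is computed, and that each physical multiplier can be reused
    floor(dsp_clock / sample_rate) times within one sample period.
    Returns (real multiplies per sample, reuse factor, physical multipliers).
    """
    unique_entries = n_elements * (n_elements + 1) // 2
    real_mults = unique_entries * real_mults_per_complex
    reuse = max(1, int(dsp_clock_hz // sample_rate_hz))
    return real_mults, reuse, math.ceil(real_mults / reuse)

if __name__ == "__main__":
    for n in (16, 256):
        total, reuse, physical = covariance_multipliers(n, 80e6, 500e6)
        print(f"{n:3d} elements: {total} real multiplies per sample, "
              f"{physical} multipliers at {reuse}x reuse")
```

With the 80 MSPS and 500 MHz figures from the thread, this gives 544 real multiplies per sample (91 multipliers at 6x reuse) for 16 elements and 131,584 (21,931 multipliers) for the full 256-element array. It ignores the adders, accumulators and routing that a real design would also need.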

Article: 158276
Subject: Re: Correlator of a big antenna array on FPGA
From: rickman <gnuarm@gmail.com>
Date: Thu, 1 Oct 2015 21:05:06 -0400

On 10/1/2015 4:14 PM, GaborSzakacs wrote:
> Actually you can get up to 1,920 DSP slices on a Kintex-7 and
> considerably more on the Virtex-7 and Virtex Ultrascale devices,
> however a "multiplier" may eat more than one DSP slice depending
> on the number of bits you want.  On the other hand they are supposed
> to run at 500 MHz in these parts.

Are those the $1000 chips?  I worked for a test equipment company once 
and they used a $1500 chip in a product that sold for over $100 k.  They 
initially only used about 20% of the part so they could add more stuff 
as upgrades.  Lots of margin in a $100k product just like there's lots 
of margin in a $1500 chip.

-- 

Rick

Article: 158277
Subject: Re: Correlator of a big antenna array on FPGA
From: GaborSzakacs <gabor@alacron.com>
Date: Fri, 02 Oct 2015 09:20:04 -0400

rickman wrote:
> Are those the $1000 chips?  I worked for a test equipment company once 
> and they used a $1500 chip in a product that sold for over $100 k.  They 
> initially only used about 20% of the part so they could add more stuff 
> as upgrades.  Lots of margin in a $100k product just like there's lots 
> of margin in a $1500 chip.
> 

The list price for the XC7K410T, which has 1,540 DSP slices starts at
about $1,300.  A DSP slice includes a 25 x 18 bit signed multiplier.
The list price (you can see it at Digikey) for the largest Kintex-7 is
around $3,000.  Virtex-7 is more expensive.  I'm not suggesting this as
a solution unless there's no other way, including using several devices
which often saves money over using the largest available ones.  On the
other hand you suggested that you can't get 1,500 multipliers in an
FPGA, and I was just pointing out that in fact you can get that many and
even more if you have the money to pay for it.  If you can figure out
how to partition the design into say 3 or 4 pieces, you can use an
XC7K160T with 600 DSP units starting at about $210 each.  This seems
to be the sweet spot (for now) in price per DSP in that series.  An
Artix XC7A200T is in the same price range with a bit more logic and
740 DSP slices, but the fabric is a bit slower in that series.

My guess is that Altera has a range of parts with similar multiplier
counts, since they generally compete head to head with Xilinx and at
this point the Xilinx 7-series is old news.

-- 
Gabor
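
The partitioning argument above is easy to check with the list prices quoted in this post. The sketch below is only arithmetic on those two quoted figures; the helper name and the 1,500-multiplier target (taken from earlier in the thread) are illustrative, and it assumes one DSP slice per multiplier, which the thread notes may not hold for wide operands.

```python
import math

# (DSP slices, approximate list price in USD) as quoted in the post above.
PARTS = {
    "XC7K410T": (1540, 1300.0),
    "XC7K160T": (600, 210.0),
}

def cost_for(dsp_needed, part):
    """Return (device count, total list price) to reach dsp_needed slices."""
    slices, price = PARTS[part]
    count = math.ceil(dsp_needed / slices)
    return count, count * price

if __name__ == "__main__":
    for name, (slices, price) in PARTS.items():
        count, total = cost_for(1500, name)
        print(f"{name}: ${price / slices:.2f} per DSP slice, "
              f"{count} device(s) for 1500 slices = ${total:.0f}")
```

On these numbers, three XC7K160T parts cover 1,500 slices for about $630 against about $1,300 for one XC7K410T, before counting board area, inter-chip links and the extra design effort that partitioning brings.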

Article: 158278
Subject: Re: Correlator of a big antenna array on FPGA
From: "ste3191" <107600@FPGARelated>
Date: Fri, 02 Oct 2015 09:11:11 -0500

>The list price for the XC7K410T, which has 1,540 DSP slices starts at
>about $1,300.  A DSP slice includes a 25 x 18 bit signed multiplier.
>The list price (you can see it at Digikey) for the largest Kintex-7 is
>around $3,000.  Virtex-7 is more expensive.  I'm not suggesting this as
>a solution unless there's no other way, including using several devices
>which often saves money over using the largest available ones.  On the
>other hand you suggested that you can't get 1,500 multipliers in an
>FPGA, and I was just pointing out that in fact you can get that many and
>even more if you have the money to pay for it.  If you can figure out
>how to partition the design into say 3 or 4 pieces, you can use an
>XC7K160T with 600 DSP units starting at about $210 each.  This seems
>to be the sweet spot (for now) in price per DSP in that series.  An
>Artix XC7A200T is in the same price range with a bit more logic and
>740 DSP slices, but the fabric is a bit slower in that series.
>
>My guess is that Altera has a range of parts with similar multiplier
>counts, since they generally compete head to head with Xilinx and at
>this point the Xilinx 7-series is old news.
>
>-- 
>Gabor

One Virtex-7 model has more than 3,000 multipliers running at 600 MHz,
but the problem isn't the price; it is how to compute or estimate the
large matrix efficiently.
---------------------------------------
Posted through http://www.FPGARelated.com
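
For the hard-limiting correlator raised earlier in this thread, here is a minimal sketch of a sign + XOR + counter correlator for two real-valued channels, followed by the arcsine-law (Van Vleck) correction that applies to zero-mean, jointly Gaussian inputs. One possible reason a one-bit correlator disagrees with a full-precision one in simulation is that its raw output is a compressed correlation coefficient rather than a covariance: amplitude information is discarded, and the value needs the sine mapping before it can be compared. All function names and the test signal are made up for illustration; complex (I/Q) samples and the method in the linked report are not modelled.

```python
import math
import random

def sign_xor_correlation(x, y):
    """One-bit correlator: XOR the sign bits and count the disagreements."""
    mismatches = sum((a < 0) != (b < 0) for a, b in zip(x, y))
    # Map the mismatch count onto a clipped correlation in [-1, 1].
    return 1.0 - 2.0 * mismatches / len(x)

def van_vleck(r_clipped):
    """Arcsine-law correction for zero-mean jointly Gaussian inputs."""
    return math.sin(0.5 * math.pi * r_clipped)

def correlated_gaussian_pairs(rho, n, seed=0):
    """Generate n sample pairs with correlation coefficient rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(u)
        ys.append(rho * u + math.sqrt(1 - rho * rho) * v)
    return xs, ys

if __name__ == "__main__":
    x, y = correlated_gaussian_pairs(rho=0.5, n=100_000)
    raw = sign_xor_correlation(x, y)
    print(f"clipped: {raw:.3f}  corrected: {van_vleck(raw):.3f}")
```

For a true correlation of 0.5 the clipped value should sit near 1/3 and the corrected value near 0.5. In hardware the XOR and counter replace the multiplier and accumulator; the sine mapping can be applied once per matrix entry after integration rather than per sample.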

Article: 158279
Subject: Re: Correlator of a big antenna array on FPGA
From: rickman <gnuarm@gmail.com>
Date: Fri, 2 Oct 2015 16:50:48 -0400

On 10/2/2015 9:20 AM, GaborSzakacs wrote:
>
> The list price for the XC7K410T, which has 1,540 DSP slices starts at
> about $1,300.  A DSP slice includes a 25 x 18 bit signed multiplier.
> The list price (you can see it at Digikey) for the largest Kintex-7 is
> around $3,000.  Virtex-7 is more expensive.  I'm not suggesting this as
> a solution unless there's no other way, including using several devices
> which often saves money over using the largest available ones.  On the
> other hand you suggested that you can't get 1,500 multipliers in an
> FPGA, and I was just pointing out that in fact you can get that many and
> even more if you have the money to pay for it.  If you can figure out
> how to partition the design into say 3 or 4 pieces, you can use an
> XC7K160T with 600 DSP units starting at about $210 each.  This seems
> to be the sweet spot (for now) in price per DSP in that series.  An
> Artix XC7A200T is in the same price range with a bit more logic and
> 740 DSP slices, but the fabric is a bit slower in that series.
>
> My guess is that Altera has a range of parts with similar multiplier
> counts, since they generally compete head to head with Xilinx and at
> this point the Xilinx 7-series is old news.

Yes, thank you for bringing me up to date.  I tend to work at the lower 
end where you are happy if the parts *have* multipliers.  lol

-- 

Rick

Article: 158280
Subject: System On Chip From Microsemi
From: rickman <gnuarm@gmail.com>
Date: Sat, 3 Oct 2015 02:01:22 -0400

I guess I stopped looking at the Microsemi products some time back.  The 
SOC devices put out by Actel were ok, but the price was up there even 
for the smallest one, around $50.  I was looking on Digikey and it seems 
their prices have come down and the new SmartFusion2 devices are even 
lower.  The cheapest part is $16 qty at Digikey.

These SOCs don't have any analog unless you consider the crystal clock 
to be analog, lol.  Still, they have all the digital stuff you might 
want and are a lot cheaper than the Xilinx and Altera SOC lines.

BTW, I found that Mouser doesn't seem to carry Xilinx anymore.  When I 
search on Xilinx on the Mouser site they bring up the Altera page, lol.

-- 

Rick

Article: 158281
Subject: Re: DDR* SDRAM modules for simulation
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Sat, 3 Oct 2015 12:29:25 -0700 (PDT)

I did a couple of DDR controllers and used Verilog models from Micron
and they worked really well.  I think I made one change to the source.
The model was really slow, so I put in a Modelsim directive to allocate
the main array as a sparse matrix, so it would only allocate RAM (on the
simulating computer) as it was accessed.  The models caught all sorts of
obscure errors, like not waiting 3.5 cycles to access a row in the same
bank of a row that was accessed within the last fortnight or whatever.
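
The sparse-allocation idea can be illustrated outside the simulator. The sketch below is not the Modelsim directive mentioned above (which the post does not name); it is a hypothetical Python model showing why storing only the written words keeps host memory small when a large DRAM is touched at a few addresses.

```python
class SparseMemory:
    """Memory model that stores only the words that have been written."""

    def __init__(self, depth, default=0):
        self.depth = depth        # number of addressable words
        self.default = default    # value returned for never-written words
        self._words = {}          # address -> value, grows on demand

    def _check(self, addr):
        if not 0 <= addr < self.depth:
            raise IndexError(f"address {addr:#x} out of range")

    def write(self, addr, value):
        self._check(addr)
        self._words[addr] = value

    def read(self, addr):
        self._check(addr)
        return self._words.get(addr, self.default)

    def allocated(self):
        """Number of words actually held in host memory."""
        return len(self._words)

if __name__ == "__main__":
    mem = SparseMemory(depth=1 << 30)   # a large modelled array
    mem.write(0x12345678, 0xDEADBEEF)
    print(hex(mem.read(0x12345678)), mem.read(0), mem.allocated())
```

A dense model would allocate all 2^30 words up front; here the host holds one entry after one write.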

Article: 158282
Subject: Re: Question about partial multiplication result in transposed FIR filter
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Sat, 3 Oct 2015 12:35:03 -0700 (PDT)

You might be referring to the technique of expressing a number in CSD (Canonical Signed Digits), which reduces the number of nonzero bits in a number.
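
For anyone unfamiliar with CSD, a small sketch of the recoding follows. It converts a non-negative integer coefficient into digits from {-1, 0, +1} with no two adjacent nonzero digits, so a constant multiplier needs fewer add/subtract stages. The function names are illustrative, and negative or fractional coefficients are left out.

```python
def to_csd(value):
    """Return the CSD digits (-1, 0 or +1) of value, least significant first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while value:
        if value & 1:
            # Low bits 01 give +1; low bits 11 give -1 and carry upward.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits

def from_csd(digits):
    """Rebuild the integer from CSD digits (least significant first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

if __name__ == "__main__":
    for coeff in (7, 23, 119):
        csd = to_csd(coeff)
        nonzero = sum(1 for d in csd if d)
        print(coeff, bin(coeff), csd, f"{nonzero} nonzero digits")
```

For example, 7 is 111 in binary (three nonzero bits) but 8 - 1 in CSD (two nonzero digits), so a multiply by 7 becomes one shift and one subtract.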

Article: 158283
Subject: Re: Question about partial multiplication result in transposed FIR filter
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Sat, 3 Oct 2015 12:37:50 -0700 (PDT)

I just built an optimized Galois Field vector multiplier, which
multiplies a vector by a scalar.  As an experiment, I split it into two
parts, one that was common to all elements of the vector, and then the
parts that were unique to each element of the vector.  I had assumed
this is something the synthesizer would do anyway, but I was surprised
to find that writing it the way I did cut down the number of LUTs by a
big margin.
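
One way such a split could look is sketched below; this is a guess at the structure, not the design described in the post. It assumes GF(2^8) with the polynomial 0x11B, which the post does not specify. The common part computes the scalar times each power of x once; the per-element part only XORs together the shared terms selected by that element's bits.

```python
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed; the post names no field)
M = 8         # field is GF(2^M)

def xtime(a):
    """Multiply a field element by x, reducing modulo POLY."""
    a <<= 1
    return a ^ POLY if a & (1 << M) else a

def scalar_powers(c):
    """Common part: c * x^i for i = 0..M-1, computed once per scalar."""
    powers = []
    for _ in range(M):
        powers.append(c)
        c = xtime(c)
    return powers

def gf_scale_vector(vec, c):
    """Per-element part: XOR the shared terms selected by each element's bits."""
    powers = scalar_powers(c)
    out = []
    for v in vec:
        acc = 0
        for i in range(M):
            if (v >> i) & 1:
                acc ^= powers[i]
        out.append(acc)
    return out

def gf_mul(a, b):
    """Reference shift-and-add multiply, used only to check the result."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a = xtime(a)
        b >>= 1
    return acc

if __name__ == "__main__":
    vec = [0x83, 0x13, 0x01, 0x00]
    print([hex(p) for p in gf_scale_vector(vec, 0x57)])
```

In hardware terms, the reduction logic in `scalar_powers` is built once and fanned out, and each vector element costs only an AND/XOR tree. Whether a synthesizer finds this sharing on its own depends on how the multiplier is written.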

Article: 158284
Subject: Re: Correlator of a big antenna array on FPGA
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Sat, 3 Oct 2015 12:42:23 -0700 (PDT)

I've had to work at the low end, where the part is always full and I
have to fake multiplication with lookup tables.  Now I'm at the other
end, where the volumes are low and the customer doesn't care about FPGA
price so the parts are huge.  They must cost a fortune.  I still waste a
lot of time on PAR issues, but it's wonderful having more gates, DSPs,
and blockRAMs than I could ever need.

Article: 158285
Subject: Re: Question about partial multiplication result in transposed FIR
From: rickman <gnuarm@gmail.com>
Date: Sat, 3 Oct 2015 15:51:34 -0400

On 10/3/2015 3:37 PM, Kevin Neilson wrote:
> I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar.  As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector.  I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.

Why is it using LUTs instead of multipliers?  Are these numbers too 
small and too many to use the multipliers efficiently?

-- 

Rick

Article: 158286
Subject: Re: Correlator of a big antenna array on FPGA
From: rickman <gnuarm@gmail.com>
Date: Sat, 3 Oct 2015 15:56:18 -0400

On 10/3/2015 3:42 PM, Kevin Neilson wrote:
> I've had to work at the low end, where the part is always full and I have to fake multiplication with lookup tables.  Now I'm at the other end, where the volumes are low and the customer doesn't care about FPGA price so the parts are huge.  They must cost a fortune.  I still waste a lot of time of PAR issues, but it's wonderful having more gates, DSPs, and blockRAMs than I could ever need.

Personally I enjoy the challenge of fitting tight designs.  To me trying 
to get a part to meet timing is not as much fun as getting a part to fit 
the device.  I find timing analysis to be very tedious as you get 
literally hundreds of failed path reports from what is basically the 
same endpoints, just many variations.  This makes it hard to see the 
next longer path that is also failing.  Reminds me of debugging a 
program one mistake at a time in the old days when my first pass would 
have many bugs... and the new days too sometimes.   lol

Fitting can have very interesting tradeoffs.  Often they are algorithmic 
and require learning new ways of calculating results.  I find that very 
interesting.

-- 

Rick

Article: 158287
Subject: Re: System On Chip From Microsemi
From: zoomboom718@gmail.com
Date: Sat, 3 Oct 2015 18:39:34 -0700 (PDT)

I agree, the SmartFusion2 devices are actually very competitive and they
have some strengths that are absent from their competition.  They are
flash-based, which has some distinct benefits:
* No external configuration device is required.
* The device is instant-on, i.e. you do not have this long dead
  configuration period.
* The flash gate architecture provides inherent single-event-upset (SEU)
  immunity.  For long term reliability, SRAM-based FPGA's are vulnerable
  to error events when struck by cosmic rays.  This is a strength,
  especially in aeronautics/space.
* When all the clocks are stopped, the flash architecture consumes very
  little standby current.  In general, these devices are super low power.

For an SoC, the SmartFusion2 security features are really superior.
Secure key storage with active mesh protection and tamper detection,
embedded AES256/SHA256, the physically uncloneable function (PUF) and
the true random number generator of the S devices are ideal for secure
machine-to-machine communication and for protecting IP.
Also, they have certified anti-key-hacking mechanisms like differential
power analysis (DPA) resistance.  Read up on how FPGA keys can be
compromised with DPA:
http://www.microsemi.com/document-portal/doc_download/131563-protecting-fpgas-from-power-analysis

In terms of development, Avnet recently started selling a super low-cost
development board, the SmartFusion2 KickStart kit, that has a 10k gate
SoC with a 166MHz Cortex M3 - the M2S010S.  It is only $59.95 and is a
little USB-powered module in the Arduino form factor with some sensors
and PMODs for expansion.  Their reference design examples make it pretty
quick to get up and running.
http://www.em.avnet.com/en-us/design/drc/Pages/Microsemi-SmartFusion2-KickStart-Development-Kit.aspx

SR

Article: 158288
Subject: Re: System On Chip From Microsemi
From: Meshenger <zoomboom718@gmail.com>
Date: Sat, 3 Oct 2015 18:55:18 -0700 (PDT)

I agree, the SmartFusion2 devices are actually very competitive and they
have some strengths that are absent from their competition.  They are
flash-based, which has some distinct benefits:
* No external configuration device is required.
* The device is instant-on, i.e. you do not have this long dead
  configuration period.
* The flash gate architecture provides inherent single-event-upset (SEU)
  immunity.  For long term reliability, SRAM-based FPGA's are vulnerable
  to error events when struck by cosmic rays.  This is a strength,
  especially in aeronautics/space.
* When all the clocks are stopped, the flash architecture consumes very
  little standby current.  In general, these devices are super low power.

For an SoC, the SmartFusion2 security features are really superior.
Secure key storage with active mesh protection and tamper detection,
embedded AES256/SHA256, the physically uncloneable function (PUF) and
the true random number generator of the S devices are ideal for secure
machine-to-machine communication and for protecting IP.
Also, they have certified anti-key-hacking mechanisms like differential
power analysis (DPA) resistance.  Read up on how FPGA keys can be
compromised with DPA:
http://www.microsemi.com/document-portal/doc_download/131563-protecting-fpgas-from-power-analysis

In terms of development, Avnet recently started selling a super low-cost
development board, the SmartFusion2 KickStart kit, that has a 10k gate
SoC with a 166MHz Cortex M3 - the M2S010S.  It is only $59.95 and is a
little USB-powered module in the Arduino form factor with some sensors
and PMODs for expansion. There is a Bluetooth LE module on the board and
some Android & Windows demo's. Their reference design examples make it
pretty quick to get up and running.
http://www.em.avnet.com/en-us/design/drc/Pages/Microsemi-SmartFusion2-KickStart-Development-Kit.aspx

SR

Article: 158289
Subject: Re: System On Chip From Microsemi
From: rickman <gnuarm@gmail.com>
Date: Sat, 3 Oct 2015 22:23:19 -0400

On 10/3/2015 9:39 PM, zoomboom718@gmail.com wrote:
> I agree, the SmartFusion2 devices are actually very competitive and they have some strengths that are absent from their competition.  They are flash-based, which has some distinct benefits:
> * No external configuration device is required.
> * The device is instant-on, i.e. you do not have this long dead configuration period.
> * The flash gate architecture provides inherent single-event-upset (SEU) immunity.  For long term reliability, SRAM-based FPGA's are vulnerable to error events when struck by cosmic rays.  This is a strength, especially in aeronautics/space.

Not trying to be retarded, as I have not checked the data sheet on this, 
but isn't the FPGA fabric SRAM based and loaded (albeit more quickly 
than a serial config) from the internal Flash?  I'm more familiar with 
Lattice Flash parts and that's what they do.  Instead of large fractions 
of a second the config time is a couple of ms.  The SRAM allows you to 
change the config from JTAG without flashing the part.  Don't the 
MicroSemi parts do that too?

I know Actel (now MicroSemi) is *very* familiar with the aerospace 
market.  I expect that is a large part of their sales.

> * When all the clocks are stopped, the flash architecture consumes very little standby current.  In general, these devices are super low power.

I *did* glance at the data sheet about this.  They are better than parts 
from the big two, but Lattice has parts that are much better than the 
numbers I saw.  Still, this is an SoC and Lattice isn't there yet.

> For an SoC, the SmartFusion2 security features are really superior.  Secure key storage with active mesh protection and tamper detection, embedded AES256/SHA256, the physically uncloneable function (PUF) and the true random number generator of the S devices are ideal for secure machine-to-machine communication and for protecting IP.
> Also, they have certified anti-key-hacking mechanisms like differential power analysis (DPA) resistance.  Read up on how FPGA keys can be compromised with DPA:
> http://www.microsemi.com/document-portal/doc_download/131563-protecting-fpgas-from-power-analysis

I won't say I understand it, but I have seen some things about this.  But 
"true random number generator"???  My understanding is this is virtually 
impossible.  I haven't read about this.  Is it based on noise from a 
diode or something?  I recall a researcher trying that and it was good, 
but he could never find the source of a long term DC bias.

> In terms of development, Avnet recently started selling a super low-cost development board, the SmartFusion2 KickStart kit, that has a 10k gate SoC with a 166MHz Cortex M3 - the M2S010S.  It is only $59.95 and is a little USB-powered module in the Arduino form factor with some sensors and PMODs for expansion.  Their reference design examples make it pretty quick to get up and running.
> http://www.em.avnet.com/en-us/design/drc/Pages/Microsemi-SmartFusion2-KickStart-Development-Kit.aspx

I saw that and thought, DARN IT!  Often the manufacturer produces rather 
expensive eval boards (which MicroSemi did in this case) but Avnet spun 
a low cost one.  I was hoping to find a market for a new product.  But 
the low cost board is lacking a lot of I/O features like Ethernet. 
Maybe there is some potential for an add on board to bring it up to eval 
board functionality.  They have a development board that looks like it 
has every bell and whistle in the book!  I don't think I *want* to 
duplicate that.

I wish I had an app for this device.  I also wish it were a bit cheaper 
still.

BTW, they have training on the KickStart kit including a kit for $100. 
I'm not sure why they are asking for the $40 above the cost of the kit. 
  That isn't even paying the trainer to show up!

I'm having a little trouble finding much info on board routing the VF256 
package.  It seems to not be included in most of their info.  I guess it 
would be the same as the VF400 package?  Seems every chip maker has a 
different name for the same package.

-- 

Rick

Article: 158290
Subject: Re: Question about partial multiplication result in transposed FIR
From: Allan Herriman <allanherriman@hotmail.com>
Date: 04 Oct 2015 06:01:08 GMT
Links: << >> << T >> << A >>

On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:

> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>> I just built an optimized Galois Field vector multiplier, which
>> multiplies a vector by a scalar.  As an experiment, I split it into two
>> parts, one that was common to all elements of the vector, and then the
>> parts that were unique to each element of the vector.  I had assumed
>> this is something the synthesizer would do anyway, but I was surprised
>> to find that writing it the way I did cut down the number of LUTs by a
>> big margin.
> 
> Why is it using LUTs instead of multipliers?  Are these numbers too
> small and too many to use the multipliers efficiently?

Kevin is talking about Galois Field multipliers.  The integer multiplier 
blocks are useless for that.

I occasionally implement cryptographic primitives in FPGAs.  These often 
use a combination of linear mixing functions and (other stuff which 
provides nonlinearity, which I don't need to talk about here).  The 
linear mixing functions often contain GF multiplications (sometimes by 
constants).

It's possible to express the linear mixing functions in HDL behaviourally 
(e.g. including things that are recognisably GF multipliers), and it's 
also possible to express them as a sea of XOR gates.

My experience using the latest tools from Xilinx is that for a particular 
128 bit mixing function I was getting three times as many levels of logic 
from the behavioural source as I was from the sea of XOR gates 
source, even though both described the same function.

BTW, one runs into similar problems when calculating CRCs of wide buses.  
The CRCs also simplify to XOR trees, but in this case we are calculating 
the remainder after a division, rather than a multiplication.  (And yes, 
the integer DSP blocks are useless for this too.)

Regards,
Allan
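
A minimal sketch of the wide-bus CRC point above, in Python rather than
HDL.  The CRC parameters are assumptions chosen only for illustration
and do not come from the thread: CRC-8 with polynomial x^8 + x^2 + x + 1
(0x07), MSB first, a 16-bit bus, no reflection and no final XOR.
Because the update is linear over GF(2), each next-state bit is the XOR
(parity) of a fixed subset of the current state bits and data bits, and
that subset is what a synthesizer would have to turn into an XOR tree.

```python
WIDTH = 8      # CRC register width (assumed)
POLY = 0x07    # generator polynomial without the x^8 term (assumed)
BUS = 16       # data bits consumed per clock (assumed)

def crc_serial(state, data, nbits=BUS):
    """Reference model: bit-serial LFSR, one data bit per step, MSB first."""
    for i in reversed(range(nbits)):
        feedback = ((state >> (WIDTH - 1)) & 1) ^ ((data >> i) & 1)
        state = (state << 1) & ((1 << WIDTH) - 1)
        if feedback:
            state ^= POLY
    return state

def build_masks():
    """Probe the linear update with one-hot inputs.

    Returns two lists indexed by next-state bit k: the state bits and
    the data bits that are XORed together to form that bit.
    """
    state_masks = [0] * WIDTH
    data_masks = [0] * WIDTH
    for j in range(WIDTH):
        out = crc_serial(1 << j, 0)
        for k in range(WIDTH):
            if (out >> k) & 1:
                state_masks[k] |= 1 << j
    for j in range(BUS):
        out = crc_serial(0, 1 << j)
        for k in range(WIDTH):
            if (out >> k) & 1:
                data_masks[k] |= 1 << j
    return state_masks, data_masks

def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1

def crc_parallel(state, data, masks):
    """One-step update: every output bit is a single wide XOR."""
    state_masks, data_masks = masks
    nxt = 0
    for k in range(WIDTH):
        bit = parity(state & state_masks[k]) ^ parity(data & data_masks[k])
        nxt |= bit << k
    return nxt
```

Calling `build_masks()` once and then `crc_parallel(state, word, masks)`
per bus word gives the same result as the serial model.  The number of
set bits in each mask is the fan-in of the corresponding XOR tree, which
is why the number of LUT levels grows with the bus width.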

Article: 158291
Subject: Re: Question about partial multiplication result in transposed FIR
From: rickman <gnuarm@gmail.com>
Date: Sun, 4 Oct 2015 02:53:39 -0400
Links: << >> << T >> << A >>

On 10/4/2015 2:01 AM, Allan Herriman wrote:
> On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:
>
>> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>>> I just built an optimized Galois Field vector multiplier, which
>>> multiplies a vector by a scalar.  As an experiment, I split it into two
>>> parts, one that was common to all elements of the vector, and then the
>>> parts that were unique to each element of the vector.  I had assumed
>>> this is something the synthesizer would do anyway, but I was surprised
>>> to find that writing it the way I did cut down the number of LUTs by a
>>> big margin.
>>
>> Why is it using LUTs instead of multipliers?  Are these numbers too
>> small and too many to use the multipliers efficiently?
>
>
> Kevin is talking about Galois Field multipliers.  The integer multiplier
> blocks are useless for that.

We are talking modulo 2 multiplies at every bit, otherwise known as AND 
gates with no carry?  I'm a bit fuzzy on this.

Now I'm confused by Kevin's description.  If the vector is multiplied by 
a scalar, what parts are common and what parts are unique?  What parts 
of this are fixed vs. variable?  The only parts a tool can optimize are 
the fixed operands.  Or I am totally missing the concept.

> I occasionally implement cryptographic primitives in FPGAs.  These often
> use a combination of linear mixing functions and (other stuff which
> provides nonlinearity, which I don't need to talk about here).  The
> linear mixing functions often contain GF multiplications (sometimes by
> constants).
>
> It's possible to express the linear mixing functions in HDL behaviourally
> (e.g. including things that are recognisably GF multipliers), and it's
> also possible to express them as a sea of XOR gates.
>
> My experience using the latest tools from Xilinx is that for a particular
> 128 bit mixing function I was getting three times as many levels of logic
> from the behavioural source as I was from the sea of XOR gates
> source, even though both described the same function.

I don't know enough of how the tools work to say what is going on.  I 
just know that when I dig into the output of the tools for poorly 
synthesized code, I find the problems are that my code doesn't specify 
the simple structure I had imagined it did.  So I fix my code.  :)

Mostly this has to do with trying to use the carry out the top of 
adders.  Sometimes I get two adders.

> BTW, one runs into similar problems when calculating CRCs of wide buses.
> The CRCs also simplify to XOR trees, but in this case we are calculating
> the remainder after a division, rather than a multiplication.  (And yes,
> the integer DSP blocks are useless for this too.)

Yep, you don't want a carry, so don't even think about using adders or 
multipliers.  They are not adders and multipliers in every type of algebra.

-- 

Rick

Article: 158292
Subject: Re: System On Chip From Microsemi
From: Thomas Stanka <usenet_nospam_valid@stanka-web.de>
Date: Mon, 5 Oct 2015 04:06:30 -0700 (PDT)
Links: << >> << T >> << A >>

Am Sonntag, 4. Oktober 2015 04:23:38 UTC+2 schrieb rickman:
> Not trying to be retarded, as I have not checked the data sheet on this, 
> but isn't the FPGA fabric SRAM based and loaded (albeit more quickly 
> than a serial config) from the internal Flash?  

The functional configuration of each cell is controlled by distributed flash. Else they would not be able to reach their SEE immunity for configuration, and would have trouble reaching their bootup times (aka instant-on).

regards,

Thomas

Article: 158293
Subject: Re: System On Chip From Microsemi
From: rickman <gnuarm@gmail.com>
Date: Mon, 5 Oct 2015 11:51:48 -0400
Links: << >> << T >> << A >>

On 10/5/2015 7:06 AM, Thomas Stanka wrote:
> Am Sonntag, 4. Oktober 2015 04:23:38 UTC+2 schrieb rickman:
>> Not trying to be retarded, as I have not checked the data sheet on
>> this, but isn't the FPGA fabric SRAM based and loaded (albeit more
>> quickly than a serial config) from the internal Flash?
>
> The functional configuration of each cell is controlled by
> distributed flash. Else they would not be able to reach their SEE
> immunity for configuration, and would have trouble reaching their
> bootup times (aka instant-on).

I guess I am just hardwired to think of the config memory as RAM.  But I 
don't see how they make the rest of the chip immune to SEE.  The logic 
units have a FF which must be immune which I don't see described.  I 
also don't see mention of the fabric memory being SEE immune.  I guess 
they bury that in some radiation related document somewhere.  I do see 
where they refer to certain parts of the chip as only "SEU Resistant".

-- 

Rick

Article: 158294
Subject: Re: System On Chip From Microsemi
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Mon, 5 Oct 2015 10:12:21 -0700 (PDT)
Links: << >> << T >> << A >>


> 
> I won't say I understand it, but I have seen some things about this.  But 
> "true random number generator"???  My understanding is this is virtually 
> impossible.  I haven't read about this.  Is it based on noise from a 
> diode or something?  I recall a researcher trying that and it was good, 
> but he could never find the source of a long term DC bias.
> 

I don't know how these guys do it, but you can make a decent true random number generator with ring oscillators.  I read one paper that described using several of these along with non-linear feedback shift registers to get a good random number.

Article: 158295
Subject: Re: Question about partial multiplication result in transposed FIR filter
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Mon, 5 Oct 2015 10:17:39 -0700 (PDT)
Links: << >> << T >> << A >>

Not only can I not use the DSP blocks, but the carry chains don't work
either.  The only way to do a big XOR is with LUTs, and at my clock
speeds, I can only do about 3 LUT levels.  At least there are plenty of
BRAMs in Virtex parts now, so it's easy to do logs, inverses, and
exponentiation.

Article: 158296
Subject: Re: Question about partial multiplication result in transposed FIR filter
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Mon, 5 Oct 2015 10:29:59 -0700 (PDT)
Links: << >> << T >> << A >>

>
> We are talking modulo 2 multiplies at every bit, otherwise known as AND
> gates with no carry?  I'm a bit fuzzy on this.
>
> Now I'm confused by Kevin's description.  If the vector is multiplied by
> a scalar, what parts are common and what parts are unique?  What parts
> of this are fixed vs. variable?  The only parts a tool can optimize are
> the fixed operands.  Or I am totally missing the concept.
>
Say you're multiplying a by a vector [b c d].  Let's say we're using the
field GF(8) so a is 3 bits.  Now a can be thought of as
( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
alpha is the primitive element of the field.  Then a*b or a*c or a*d is
just a sum of some combination of those 3 values in the parentheses,
depending upon the locations of the 1s in b, c, or d.  So you can
premultiply the three values in the parentheses (the common part) and
then take sums of subsets of those three (the individual parts).  It's
all a bunch of XORs at the end.  This is just a complicated way of
saying that by writing the HDL at a more explicit level, the synthesizer
is better able to find common factors and use a lot fewer gates.
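
The split described above can be sketched in Python under one reading
of the description: the common part is taken to be the three products
a*alpha^0, a*alpha^1 and a*alpha^2, computed once, and each vector
element then XORs together the subset selected by its own 1 bits.  The
field polynomial x^3 + x + 1 and all function names are assumptions for
illustration; the post does not say which polynomial was used.

```python
PRIM = 0b1011   # assumed primitive polynomial x^3 + x + 1 for GF(8)
M = 3           # field elements are M bits wide

def xtime(v):
    """Multiply a field element by alpha and reduce modulo PRIM."""
    v <<= 1
    if v & (1 << M):
        v ^= PRIM
    return v

def gf_mul(a, b):
    """Reference multiply: shift-and-XOR, no carries anywhere."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a
        a = xtime(a)
    return r

def scalar_times_vector(a, vec):
    """Multiply every element of vec by the scalar a in GF(8)."""
    # Common part: a*alpha^0, a*alpha^1, a*alpha^2, computed once.
    shared = [a]
    for _ in range(M - 1):
        shared.append(xtime(shared[-1]))
    # Individual parts: each element XORs the shared terms picked out
    # by the positions of the 1s in that element.
    out = []
    for b in vec:
        p = 0
        for i in range(M):
            if (b >> i) & 1:
                p ^= shared[i]
        out.append(p)
    return out
```

With these assumptions `scalar_times_vector(2, [4, 7, 1])` returns
`[3, 5, 2]`, matching `gf_mul` applied element by element.  In hardware
terms the `shared` list is logic that exists once, while the inner loop
is the per-element XOR network.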

Article: 158297
Subject: Re: System On Chip From Microsemi
From: Meshenger <zoomboom718@gmail.com>
Date: Mon, 5 Oct 2015 10:37:17 -0700 (PDT)
Links: << >> << T >> << A >>

About SEU, I think the way this works is that yes, as with any SRAM
device, the data or settings can be disrupted by an SEU and there is
nothing one can do about that.  The device could possibly recover from
that though.  But if the configuration, i.e. the logic and connection
grid, gets disrupted, the chances of irreversible damage are much
greater and the disruption is less likely to be temporary.  The
flash-based gate structure protects logic configuration against that.

The SF2 random number generator is clever and seeded by, amongst other
sources, RAM power-up conditions, shown to be "random enough".  How
random is random enough for cryptography?  This is a good article with
some good comments in it:
http://www.eetimes.com/author.asp?section_id=36&doc_id=1326572
So "SP800-90 cryptographic-grade Non-Deterministic Random Bit Generator"
might be more correct than "true RNG".  So it follows recommendations in
a NIST special publication on seeding, checking and maintaining random
bits.  You would probably have to invent something new to improve on
that.

In terms of Ethernet on the KickStart eval/development board, one can
add an Arduino shield for that.  There is Bluetooth on-board but not
Ethernet.

SR

Article: 158298
Subject: Re: System On Chip From Microsemi
From: Jon Elson <jmelson@wustl.edu>
Date: Mon, 05 Oct 2015 14:01:39 -0500
Links: << >> << T >> << A >>

rickman wrote:


> BTW, I found that Mouser doesn't seem to carry Xilinx anymore.  When I
> search on Xilinx on the Mouser site they bring up the Altera page, lol.
> 
Xilinx is only carried through Digi-Key and Avnet, for the last few years.

Jon

Article: 158299
Subject: Re: Question about partial multiplication result in transposed FIR
From: rickman <gnuarm@gmail.com>
Date: Tue, 6 Oct 2015 23:02:27 -0400
Links: << >> << T >> << A >>

On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>>
>> We are talking modulo 2 multiplies at every bit, otherwise known as
>> AND gates with no carry?  I'm a bit fuzzy on this.
>>
>> Now I'm confused by Kevin's description.  If the vector is
>> multiplied by a scalar, what parts are common and what parts are
>> unique?  What parts of this are fixed vs. variable?  The only parts
>> a tool can optimize are the fixed operands.  Or I am totally
>> missing the concept.
>>
> Say you're multiplying a by a vector [b c d].  Let's say we're using
> the field GF(8) so a is 3 bits.  Now a can be thought of as (
> a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
> alpha is the primitive element of the field.  Then a*b or a*c or a*d
> is just a sum of some combination of those 3 values in the
> parentheses, depending upon the locations of the 1s in b, c, or d.
> So you can premultiply the three values in the parentheses (the
> common part) and then take sums of subsets of those three (the
> individual parts).  It's all a bunch of XORs at the end.  This is
> just a complicated way of saying that by writing the HDL at a more
> explicit level, the synthesizer is better able to find common factors
> and use a lot fewer gates.

Ok, I'm not at all familiar with GFs.  I see now a bit of what you are 
saying.  But to be honest, I don't know that the tools would have any trouble 
with the example you give.  The tools are pretty durn good at 
optimizing... *but*... there are two things to optimize for, size and 
performance.  They are sometimes mutually exclusive, sometimes not.  If 
you ask the tool to give you the optimum size, I don't think you will do 
better if you code it differently, while describing *exactly* the same 
behavior.

If you ask the tool to optimize for speed, the tool will feel free to 
duplicate logic if it allows higher performance, for example, by 
combining terms in different ways.  Or less logic may require a longer 
chain of LUTs which will be slower.  LUT sizes in FPGAs don't always 
match the logical breakdown so that speed or size can vary a lot 
depending on the partitioning.

-- 

Rick
