Messages from 161175

Article: 161175
Subject: Re: Altera Cyclone replacement
From: already5chosen@yahoo.com
Date: Thu, 14 Feb 2019 05:38:43 -0800 (PST)

On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
> On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
> > On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > >
> > > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on.  My projects use a CPU as a controller and often have very critical real time requirements.  While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions.  That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.
> > >
> >
> > Can you quantify criticality of your real-time requirements?
>
> Eh?  You are asking my requirement or asking how important it is?

How important they are. What happens if a particular instruction takes n clocks most of the time, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?

> Not sure how to answer that question.  I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.
>
>
> > Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front?
>
> It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors.

I don't like interrupts in small systems. Neither in MCUs nor in FPGAs.
In MCUs nowadays we have bad ass DMAs. In FPGAs we can build bad ass DMA ourselves. Or throw multiple soft cores on multiple tasks. That is why I am interested in *small* soft cores in the first place.

>  Otherwise multi-cycle instructions complicate the CPU instruction decoder.

I see no connection to the decoder. Maybe you mean the microsequencer?
Generally, I disagree. At least for very fast clock rates it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases.

Or maybe you are thinking about variable-length instructions? That, again, is orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4 budget. I would start to consider VLI for something like 1200 LUT4s.

> Using a short instruction format allows minimal decode logic.  Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.
>
>
> > Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.
>
> Cache, branch predictors???  You have that with 1 kLUT CPUs???  I think we design in very different worlds.

I don't *want* data caches in the sort of tasks that I do with these small cores. Instruction caches are something else. I am not against them in "hard" MCUs.
In the small soft cores that we are discussing right now they are impractical rather than evil.
But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I didn't have it implemented in my half-dozen (in the meantime the # is growing). But it is practical, esp. for applications that spend most of the time in very short loops.

> My program storage is inside the FPGA and runs at the full speed of the CPU.  The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay,

I am starting to suspect that you have a very special definition of "not pipelined" that differs from the definition used in the literature.

> oh well) so no branch prediction needed.
>
>
> > > > >  Many stack based CPUs can be implemented in 1k LUT4s or less.  They can run fast, >100 MHz and typically are not pipelined.
> >
> > 1 cycle per instruction not pipelined means that stack can not be implemented in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.
>
> Huh?  So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle?  Instructions are things like
>
> ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt).  The interrupt pushes return address to return stack and PSW to data stack in one cycle with no latency so, like the other instructions is single cycle, again making using it like designing with registers in the HDL code.
>
>
> > Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
>
> Or both.  To get the block RAMs single cycle the read and write happen on different phases of the main clock.  I think read is on falling edge while write is on rising edge like the rest of the logic.  Instructions and data are in physically separate memory within the same address map, but no way to use either one as the other mechanically.  Why would Harvard ever be a problem for an embedded CPU?
>

Less of a problem when you are in full control of the software stack.
When you are not in full control, sometimes compilers like to place data, esp. jump tables for implementing the HLL switch/case construct, in program memory.
Still, even with full control of the code generation tools, sometimes you want an architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash.
Another, less common possible reason is saving space by placing code and data in the same memory block, esp. when blocks are relatively big and there are few of them.

>
> > And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard.
>
> Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality.
>

It is certainly easier when branching is based on arithmetic flags rather than on the content of a register, as is the case in MIPS derivatives, including Nios2 and RISC-V. But still hard. You have to wait for the instruction to arrive from memory, decode the instruction, do logical operations on flags and select between two alternatives based on the result of the logical operation, all in one cycle.
If the branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.
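
As a rough illustration of the work listed above, the next-address selection for a flag-based, PC-relative conditional branch can be modelled as follows. This is only a behavioural sketch: the function name, the 16-bit address width and the "PC + 1" sequential step are assumptions made for the example, not details of either poster's core.

```python
def next_pc(pc, offset, is_branch, condition_met, addr_bits=16):
    """Behavioural model of next-address selection for one instruction.

    All of this must settle within one clock in a single-cycle design:
    the condition check, the PC-relative addition, and the final select.
    addr_bits and the +1 sequential step are illustrative assumptions.
    """
    mask = (1 << addr_bits) - 1
    sequential = (pc + 1) & mask          # fall-through address
    target = (pc + offset) & mask         # PC-relative address addition
    taken = is_branch and condition_met   # logical operation on the flags
    return target if taken else sequential  # select between two alternatives


print(next_pc(10, -3, True, True))   # taken backward branch
print(next_pc(10, -3, True, False))  # branch not taken, falls through
```

In hardware the addition and the flag test would run in parallel and feed a multiplexer; the sequential Python statements only show what has to be computed, not how long each step takes.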

But even if it's somehow doable for PC-relative branches, I don't see how, assuming that stack is stored in block memory, it is doable for *indirect* jumps. I'd guess, you are somehow cutting corners here, most probably by requiring the address of indirect jump to be in the top-of-stack register that is not in block memory.

>
> > Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those.
>
> I only use the low grade parts.  I haven't used NIOS

Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in the early 00s Altera had a completely different architecture that was called Nios.

> and this processor won't get to 380 MHz I'm pretty sure.  Pipelining it would be counter it's design goals but might be practical, never thought about it.
>
>
> > The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
>
> Or the instruction is simple and runs fast.
>

I don't doubt that you did it, but answers like that smell of hand-waving.

>
> > Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.
>
> That's where my CPU lies, I think it was 600 LUT4s last time I checked.
>

Does it include single-cycle 32-bit shift/rotate by an arbitrary 5-bit count (5 variations: logical and arithmetic right shift, logical left shift, rotate right, rotate left)?
Does it include zero-extended and sign-extended byte and half-word loads (fetches, in your language)?
In my cores these two functions combined are the biggest block, bigger than the 32-bit ALU, and comparable in size with the result writeback mux.
Also, I assume that your cores have no multiplier, right?
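
For readers unfamiliar with the operations being asked about, here is a minimal behavioural sketch of the five 32-bit shift/rotate variations and of zero- and sign-extension as applied to byte and half-word loads. The function names are made up for this example and say nothing about how either core implements them; hardware would use a multiplexer network rather than arithmetic.

```python
MASK32 = 0xFFFFFFFF

def shl(x, n):
    """Logical left shift of a 32-bit value by n (0..31)."""
    return (x << n) & MASK32

def shr_logical(x, n):
    """Logical right shift: vacated top bits are filled with zeros."""
    return (x & MASK32) >> n

def shr_arith(x, n):
    """Arithmetic right shift: vacated top bits copy the sign bit."""
    x &= MASK32
    if x & 0x80000000:
        return ((x >> n) | (MASK32 << (32 - n))) & MASK32
    return x >> n

def rotr(x, n):
    """Rotate right: bits leaving the bottom re-enter at the top."""
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32

def rotl(x, n):
    """Rotate left, expressed as a rotate right by (32 - n) mod 32."""
    return rotr(x, (32 - n) & 31)

def zero_extend(value, bits):
    """Keep the low `bits` bits; upper bits become zero."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """Replicate bit (bits - 1) into the upper bits of a 32-bit word."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK32

print(hex(shr_arith(0x80000000, 4)))  # sign bit is replicated
print(hex(sign_extend(0x80, 8)))      # byte load, sign-extended
```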

> Rick C.

Article: 161176
Subject: Re: Altera Cyclone replacement
From: gnuarm.deletethisbit@gmail.com
Date: Thu, 14 Feb 2019 12:33:57 -0800 (PST)

On Thursday, February 14, 2019 at 8:38:47 AM UTC-5, already...@yahoo.com wrote:
> On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
> > > On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > > >
> > > > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on.  My projects use a CPU as a controller and often have very critical real time requirements.  While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions.  That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.
> > > >
> > >
> > > Can you quantify criticality of your real-time requirements?
> >
> > Eh?  You are asking my requirement or asking how important it is?
>
> How important they are. What happens if particular instruction most of the time takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?

Of course, that depends on the application.  In some cases it would simply not work correctly because it was designed into the rest of the logic not entirely unlike a FSM.  In other cases it would make the timing indeterminate which means it would make it harder to design the logic surrounding this piece.

> > Not sure how to answer that question.  I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.
> >
> >
> > > Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front?
> >
> > It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors.
>
> I don't like interrupts in small systems. Neither in MCUs nor in FPGAs.
> In MCUs nowadays we have bad ass DMAs. In FPGAs we can build bad ass DMA ourselves. Or throw multiple soft cores on multiple tasks. That why I am interested in *small* soft cores in the first place.

Yup, interrupts can be very bad.  But if your requirements are to do one thing in software that has real time requirements (such as service an ADC/DAC or fast UART) while the rest of the code is managing functions with much more relaxed real time requirements, using an interrupt can eliminate a CPU core or the design of a custom DMA with particular features that are easy in software.

There are things that are easy to do in hardware and things that are easy to do in software with some overlap.  Using a single CPU and many interrupts fits into the domain of not so easy to do.  That doesn't make simple use of interrupts a bad thing.

> >  Otherwise multi-cycle instructions complicate the CPU instruction decoder.
>
> I see no connection to decoder. May be, you mean microsequencer?

Decoder has outputs y(i) = f(x(j)) where x(j) is all the inputs and y(i) is all the outputs and f() is the function mapping inputs to outputs.  If you have multiple states for instructions the decoding function has more inputs than if you only decode instructions and whatever state flags might be used such as carry or zero or interrupt input.

In general this will result in more complex instruction decoding.
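
To put a number on the "more inputs" argument, the sketch below counts the rows of the worst-case truth table for a purely combinational decoder. The bit widths (8 opcode bits, 3 flags, a 2-bit cycle counter) are invented for illustration and are not taken from any core discussed here; real LUT usage depends on how well synthesis can simplify f().

```python
def decoder_truth_table_rows(opcode_bits, flag_bits, cycle_counter_bits=0):
    """Rows in the full truth table of a decoder y = f(x).

    Every extra input bit doubles the input space that f() has to cover.
    The bit widths passed in are hypothetical example values.
    """
    inputs = opcode_bits + flag_bits + cycle_counter_bits
    return 2 ** inputs


single_cycle = decoder_truth_table_rows(opcode_bits=8, flag_bits=3)
multi_cycle = decoder_truth_table_rows(opcode_bits=8, flag_bits=3,
                                       cycle_counter_bits=2)
print(single_cycle, multi_cycle, multi_cycle // single_cycle)
```

This is an upper bound on the input space only. A multi-cycle design whose control signals depend on few of the state bits may synthesize to much less than the ratio suggests.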

> Generally, I disagree. At least for very fast clock rates it is easier to design non-pipelined or partially pipelined core where every instruction flows through several phases.

If by "easier" you mean possible, then yes.  That's why they use pipelining, to achieve clock speeds that otherwise can't be met.  But it is seldom simple since pipelining is more than just adding registers.  Instructions interact and on branches the pipeline has to be flushed, etc.

> Or, may be, you think about variable-length instructions? That's again, orthogonal to number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for 500-700 LUT4s budget. I would start to consider VLI for something like 1200 LUT4s.

Nope, just talking about using multiple clock cycles for instructions.  Using variable number of clock cycles would be more complex in general and multiple length instructions even worse... in general.  There are always possibilities to simplify some aspect of this by complicating some aspect of that.

> > Using a short instruction format allows minimal decode logic.  Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.
> >
> >
> > > Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.
> >
> > Cache, branch predictors???  You have that with 1 kLUT CPUs???  I think we design in very different worlds.
>
> I don't *want* data caches in sort of tasks that I do with this small cores. Instruction cache is something else. I am not against them in "hard" MCUs.
> In small soft cores that we are discussing right now they are impractical rather than evil.

Or unneeded.  If the program fits in the on chip memory, no cache is needed.  What sort of programming are you doing in <1kLUT CPUs that would require slow off-chip program storage?

> But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I didn't have it implemented in my half-dozen (in the mean time the # is growing). But it is practical, esp. for applications that spend most of the time in very short loops.

If the jump instruction is one clock cycle and no pipeline, jump prediction is not possible I think.

> > My program storage is inside the FPGA and runs at the full speed of the CPU.  The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay,
>
> I am starting to suspect that you have very special definition of "not pipelined" that differs from definition used in literature.

Ok, not sure what that means.  Every instruction takes one clock cycle.  While a given instruction is being executed the next instruction is being fetched, but the *actual* next instruction, not the "possible" next instruction.  All branches happen during the branch instruction execution which fetches the correct next instruction.

This guy said I was pipelining the fetch and execute...  I see no purpose in calling that pipelining since it carries no baggage of any sort.

> > oh well) so no branch prediction needed.
> >
> >
> > > > > >  Many stack based CPUs can be implemented in 1k LUT4s or less.  They can run fast, >100 MHz and typically are not pipelined.
> > >
> > > 1 cycle per instruction not pipelined means that stack can not be implemented in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.
> >
> > Huh?  So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle?  Instructions are things like
> >
> > ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt).  The interrupt pushes return address to return stack and PSW to data stack in one cycle with no latency so, like the other instructions is single cycle, again making using it like designing with registers in the HDL code.
> >
> >
> > > Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
> >
> > Or both.  To get the block RAMs single cycle the read and write happen on different phases of the main clock.  I think read is on falling edge while write is on rising edge like the rest of the logic.  Instructions and data are in physically separate memory within the same address map, but no way to use either one as the other mechanically.  Why would Harvard ever be a problem for an embedded CPU?
> >
>
> Less of the problem when you are in full control of software stack.
> When you are not in full control, sometimes compilers like to place data, esp. jump tables for implementing HLL switch/case construct, in program memory.
> Still, even with full control of the code generation tools, sometimes you want architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash.
> Another, less common possible reason is saving space by placing code and data in the same memory block. Esp. when blocks are relatively big and there are few of them.

There is nothing to prevent loading code into program memory.  It's all one address space and can be written to by machine code.  So I guess it's not really Harvard, it's just physically separate memory. Since instructions are not a word wide, I think the program memory does not implement a full word width.. to be honest, I don't recall.  I haven't used this CPU in years.  I've been programming in Forth on PCs more recently.

Another stack processor is the J1 which is used in a number of applications and even had a TCP/IP stack implemented in about 8 kW (kB?) (kinstructions?).  You can find info on it with a google search.  It is every bit as small as mine and a lot better documented and programmed in Forth while mine is programmed in assembly which is similar to Forth.

> > > And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard.
> >
> > Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality.
> >
>
> It is certainly easier when branching is based on arithmetic flags rather than on the content of register, like a case in MIPS derivatives, including Nios2 and RISC-V. But still hard. You have to wait for instruction to arrive from memory, decode an instruction, do logical operations on flags and select between two alternatives based on result of logical operation, all in one cycle.
> If branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.

I guess this is where I disagree on the pipelining aspect of my design.  I register the current instruction so the memory fetch is in the previous cycle based on that instruction.  So my delay path starts with the instruction, not the instruction pointer.  The instruction decode for each section of the CPU is in parallel of course.  The three sections of the CPU are the instruction fetch, the data path and the address path.  The data path and address path roughly correspond to the data and return stacks in Forth.  In my CPU they can operate separately and the return stack can perform simple math like increment/decrement/test since it handles addressing memory.  In Forth everything is done on the data stack other than holding the return addresses, managing DO loop counts and user specific operations.

My CPU has both PC relative addressing and absolute addressing.  One way I optimize for speed is by careful management of the low level implementation.  For example I use an adder as a multiplexor when it's not adding.  A+0 is A, 0+B is B, A+B is well, A+B.
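
The adder-as-multiplexer trick can be sketched behaviourally as below. The enable signals and the 32-bit width are assumptions made for the example; the post does not say how the operands are gated in the actual design.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath, for illustration only

def adder_unit(a, b, enable_a, enable_b):
    """One adder doing three jobs, selected by two enable signals.

    A disabled operand is forced to zero (in hardware, an AND gate per
    bit), so the adder passes the other operand through unchanged.
    """
    gated_a = a if enable_a else 0
    gated_b = b if enable_b else 0
    return (gated_a + gated_b) & MASK


pc, offset, absolute = 0x100, 0x20, 0x400
print(hex(adder_unit(pc, offset, True, True)))      # A+B: PC-relative target
print(hex(adder_unit(pc, absolute, False, True)))   # 0+B: absolute target
print(hex(adder_unit(pc, offset, True, False)))     # A+0: pass A through
```

The point of the design choice is that no separate multiplexer stage sits after the adder, so the select logic does not add to the critical path.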

> But even if it's somehow doable for PC-relative branches, I don't see how, assuming that stack is stored in block memory, it is doable for *indirect* jumps. I'd guess, you are somehow cutting corners here, most probably by requiring the address of indirect jump to be in the top-of-stack register that is not in block memory.

Indirect addressing???  Indirect addressing requires multiple instructions, yes.  The return stack is used for address calculations typically and that stack is fed directly into the instruction fetch logic... it is the "return" stack (or address unit, your choice) after all.

> > > Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those.
> >
> > I only use the low grade parts.  I haven't used NIOS
>
> Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in early 00s Altera had completely different architecture that was called Nios.

I haven't used those processors either.

> > and this processor won't get to 380 MHz I'm pretty sure.  Pipelining it would be counter it's design goals but might be practical, never thought about it.
> >
> >
> > > The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
> >
> > Or the instruction is simple and runs fast.
> >
>
> I don't doubt that you did it, but answers like that smell hand-waving.

Ok, whatever that means.

> > > Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.
> >
> > That's where my CPU lies, I think it was 600 LUT4s last time I checked.
> >
>
> Does it include single-cycle 32-bit shift/rotate by arbitrary 5-bit count (5 variations, logical and arithmetic right shift, logical left shift, rotate right, rotate left)?

There are shift instructions.  It does not have a barrel shifter if that is what you are asking.  A barrel shifter is not really a CPU.  It is a CPU feature and is large and slow.  Why slow down the rest of the CPU with a very slow feature?  That is the sort of thing that should be external hardware.

When they design CPU chips, they have already made compromises that require larger, slower logic which require pipelining.  The barrel shifter is perfect for pipelining, so it fits right in.

> Does it include zero-extended and sign-extended byte and half-word loads (fetches, in you language) ?

I don't recall, but I'll say no.  I do recall some form of sign extension, but I may be thinking of setting the top of stack by the flags.  Forth has words that treat the word on the top of stack as a word, so the mapping is better if this is implemented.  I'm not convinced this is really better than using the flags directly in the asm, but for now I'm compromising.  I'm not really a compiler writer, so...

> In my cores these two functions combined are the biggest block, bigger than 32-bit ALU, and comparable in size with result writeback mux.

Sure, the barrel shifter is O(n^2) like a multiplier.  That's why in small CPUs it is often done in loops.  Since loops can be made efficient with the right instructions that's a good way to go.  If you really need the optimum speed for barrel shifting, then I guess a large block of logic and pipelining is the way to go.
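
A multi-bit shift done "in loops" simply repeats a single-bit shift, trading cycles for logic. The sketch below models that for a left shift; the function name and the 32-bit default width are assumptions, and the loop count stands in for however the real instruction set counts iterations.

```python
def shift_left_loop(x, count, width=32):
    """Shift left by `count` using only single-bit shifts.

    Each iteration corresponds to one single-bit shift instruction, so
    the cost is `count` iterations instead of one pass through a
    barrel shifter.
    """
    mask = (1 << width) - 1
    for _ in range(count):
        x = (x << 1) & mask  # one single-bit shift per iteration
    return x


print(hex(shift_left_loop(0x1, 5)))
```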

I needed to implement multiplications, but they are on 24 bit words that are being shifted into and out of a CODEC bit serial.  I found a software shift and add to work perfectly well, no need for special hardware.
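
For reference, a software shift-and-add multiply on 24-bit words looks roughly like the sketch below. It treats the operands as unsigned, which is an assumption; signed CODEC samples would need extra sign handling that the post does not describe.

```python
def shift_add_multiply(a, b, width=24):
    """Unsigned shift-and-add multiply of two `width`-bit words.

    One multiplier bit is examined per iteration: if it is set, the
    (shifted) multiplicand is added to the running product.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for _ in range(width):
        if b & 1:          # low bit of the multiplier decides the add
            product += a
        a <<= 1            # multiplicand moves up one bit position
        b >>= 1            # next multiplier bit moves into the low bit
    return product         # up to 2 * width bits wide


print(shift_add_multiply(3, 5))
```

Because the data arrives bit-serially anyway, one iteration per bit fits naturally with such an interface, which may be part of why no hardware multiplier was needed.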

Bowman was using his J1 for video work (don't recall the details) but the Microblaze was too slow and used too much memory.  The J1 did the same functions faster and in less code with generic instructions, nothing unique to the application if I remember correctly... not that the Microblaze is the gold standard.

> Also, I assume that you cores have no multiplier, right?

By "cores" you mean CPUs?  Core actually, remember the interrupt, one CPU, one interrupt.   Yes, no hard multiplier as yet.  The pure hardware implementation of the CODEC app used shift and add in hardware as well but new features were needed and space was running out in the small FPGA, 3 kLUTs.  The slower, simpler stuff could be ported to software easily for an overall reduction in LUT4 usage along with the new features.

I don't typically try to compete with the functionality of ARMs with my CPU designs.  To me they are FPGA logic adjuncts.  So I try to make them as simple as the other logic.

I wrote some code for a DDS in software once as a benchmark for CPU instruction set designs.  The shortest and fastest I came up with was a hybrid between a stack CPU and a register CPU where objects near the top of stack could be addressed rather than having to always move things around to put the nouns where the verbs could reach them.  I have no idea how to program that in anything other than assembly which would be ok with me.  I used an Excel spreadsheet to analyze the 50 to 90 instructions in this routine.  It would be interesting to write an assembler that would produce the same outputs.
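
For context, the core of a software DDS (direct digital synthesis) routine is a phase accumulator indexing a waveform table. The sketch below shows that general technique only; it is not the benchmark routine described above, and the 32-bit accumulator, 256-entry sine table and 16-bit amplitude are assumed values chosen for the example.

```python
import math

ACC_BITS = 32    # phase accumulator width (assumed)
TABLE_BITS = 8   # sine table has 2**TABLE_BITS entries (assumed)
SINE_TABLE = [round(32767 * math.sin(2 * math.pi * i / (1 << TABLE_BITS)))
              for i in range(1 << TABLE_BITS)]

def tuning_word(f_out, f_clk):
    """Phase increment per sample for output frequency f_out."""
    return round(f_out * (1 << ACC_BITS) / f_clk)

def dds_samples(tuning, count):
    """Generate `count` samples by accumulating phase and looking up sine."""
    phase = 0
    out = []
    for _ in range(count):
        # The top TABLE_BITS bits of the accumulator select the table entry.
        out.append(SINE_TABLE[phase >> (ACC_BITS - TABLE_BITS)])
        # The accumulator wraps modulo 2**ACC_BITS, which wraps the phase.
        phase = (phase + tuning) & ((1 << ACC_BITS) - 1)
    return out


# One quarter of the sample rate: four samples per output period.
print(dds_samples(tuning_word(1, 4), 4))
```

On a small CPU the inner loop is an add, a shift, a table fetch and an output write per sample, which is why it makes a compact benchmark for instruction set comparisons.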

Rick C.

Article: 161177
Subject: Re: Altera Cyclone replacement
From: Hul Tytus <ht@panix.com>
Date: Thu, 14 Feb 2019 21:15:55 +0000 (UTC)

32 bit RISC mcus with 32 registers... do you have any actual devices in 
mind?

Hul

already5chosen@yahoo.com wrote:
> On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
> > 
> > Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on.  My projects use a CPU as a controller and often have very critical real time requirements.  While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions.  That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language. 
> > 

> Can you quantify criticality of your real-time requirements?

> Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front?
> Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.

> > 
> > > >  Many stack based CPUs can be implemented in 1k LUT4s or less.  They can run fast, >100 MHz and typically are not pipelined.  

> 1 cycle per instruction not pipelined means that stack can not be implemented
> in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.

> Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.

> And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those. 
> The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.

> Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.

Article: 161178
Subject: Re: Altera Cyclone replacement
From: already5chosen@yahoo.com
Date: Thu, 14 Feb 2019 16:07:01 -0800 (PST)

On Thursday, February 14, 2019 at 11:15:58 PM UTC+2, Hul Tytus wrote:
> 32 bit RISC mcus with 32 registers... do you have any actual devices in 
> mind?
> 
> Hul
> 

First, I don't like answering top-posters.
Next time I won't answer.

The discussion was primarily about soft cores.
The two most popular soft cores, Nios2 and Microblaze, are 32-bit RISCs with 32 registers.

In "hard" MCUs there are MIPS-based products from Microchip.

More recently a few RISC-V MCUs have appeared. Probably more are going to follow.

In the past there were popular PPC-based MCU devices from various vendors. They are less popular today, but still exist. Freescale (now NXP) e200 core variants are designed specifically for MCU applications.
https://www.nxp.com/products/product-selector:PRODUCT-SELECTOR#/category/c731_c381_c248

So, not the whole 32-bit MCU world is ARM Cortex-M. Just most of it ;-)

Article: 161179
Subject: Re: Altera Cyclone replacement
From: "A.P.Richelieu" <aprichelieu@gmail.com>
Date: Fri, 15 Feb 2019 19:41:39 +0100

Den 2019-02-14 kl. 11:07, skrev already5chosen@yahoo.com:
> On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
>>
>> Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on.  My projects use a CPU as a controller and often have very critical real time requirements.  While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions.  That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.
>>
> 
> Can you quantify criticality of your real-time requirements?
> 
> Also, even for most critical requirements, what's wrong with multiple cycles per instructions as long as # of cycles is known up front?
> Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.
> 
>>
>>>>   Many stack based CPUs can be implemented in 1k LUT4s or less.  They can run fast, >100 MHz and typically are not pipelined.
> 
> 1 cycle per instruction not pipelined means that stack can not be implemented
> in memory block(s). Which, in combination with 1K LUT4s means that either stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of it means that you need many more instructions (relatively to 32-bit RISC with 32 or 16 registers) to complete the job.
> 
> Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.
> 
> And even with all that conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those.
> The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.
> 
> Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4 available then 1400 LUT4 are probably available too, so one can as well use OTS Nios2f which is pretty fast and validated to the level that hobbyist's cores can't even dream about.
> 

I think the best way to get exact performance is to implement a 
multithreaded architecture.
This is not the smallest CPU architecture, but the pipeline will run at
very high frequency.

The Multithreaded architecture I have used has
a classic three stage pipeline, fetch, decode, execute,
so there are three instructions active all the time.

The architecture implements ONLY 1 clock cycle in each stage.

Many CPUs implement multicycle functionality, by having statemachines
inside the decode stage.
The decode stage can either control the execute stage (the datapath)
directly by decoding the instruction in the fetch stage output,
or it can control the execute stage from one of several statemachines
implementing things like interrupt entry, interrupt exit etc.

The datapath can easily require 80-120 control signals,
so each statemachine needs to have the same number of state registers.
On top of that you need to multiplex all the statemachines together.
This is a considerable amount of logic.

I do it a little bit differently. The CPU has an instruction set
which is basically 16 bit + immediates. This gives room for 16 registers
if you want to have a decent instruction set. 8 bit instruction
and 2 x 4 bit register addresses.

The instruction decoder supports an extended 22 bit instruction set.
This gives room for a 10 bit extended instruction set, and 2 x 6 bit
register addresses.
The extended register address space is used for two purposes.
1. To address special registers like the PSR
2. To address a constant ROM, for a few useful constants.
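
The two instruction formats above can be sketched in Python as follows. This is only an illustration: the post gives the field widths (8 + 4 + 4 and 10 + 6 + 6 bits) but not the field order, so placing the opcode in the upper bits is an assumption, and the function names are hypothetical.

```python
def encode(opcode, ra, rb, op_bits, reg_bits):
    """Pack an opcode and two register addresses into one instruction word.

    Assumed layout (not stated in the post): [opcode | ra | rb].
    """
    if not 0 <= opcode < (1 << op_bits):
        raise ValueError("opcode out of range")
    for reg in (ra, rb):
        if not 0 <= reg < (1 << reg_bits):
            raise ValueError("register address out of range")
    return (opcode << (2 * reg_bits)) | (ra << reg_bits) | rb


def decode(word, op_bits, reg_bits):
    """Split an instruction word back into (opcode, ra, rb)."""
    reg_mask = (1 << reg_bits) - 1
    rb = word & reg_mask
    ra = (word >> reg_bits) & reg_mask
    opcode = (word >> (2 * reg_bits)) & ((1 << op_bits) - 1)
    return opcode, ra, rb


def encode16(opcode, ra, rb):
    """Normal format: 8 bit opcode, 2 x 4 bit register addresses."""
    return encode(opcode, ra, rb, op_bits=8, reg_bits=4)


def encode22(opcode, ra, rb):
    """Extended (ROM only) format: 10 bit opcode, 2 x 6 bit register addresses."""
    return encode(opcode, ra, rb, op_bits=10, reg_bits=6)
```

With 6 bit register addresses, values 16..63 are free to name special registers and constant ROM entries, which the 4 bit fields of the 16 bit format cannot reach.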

The fetch stage can fetch instructions from two places.
1. The instruction queue(2). The instruction queue only supports 16 bit 
instructions with 16/32 bit immediates.
2. A small ROM which provides 22 bit instructions (with 22 bit immediates)

Whenever something happens which normally would require a multicycle
instruction, the thread makes a subroutine jump (0 clock cycle jump)
into the ROM, and executes 22 bit instructions.

A typical use would be an interrupt.
To clear the interrupt flag, you want to clear one bit in the PSR.

The instruction ROM contains
     ANDC PSR, const22   ; AND constantROM[22] with PSR.
                         ; ConstantROM[22] == 0xFFFFFEFF
                         ; Clear bit 9 (I) of PSR

To implement multithreading, I need a single decoder,
but multiple register banks, one per thread.
Several special purpose registers per thread (like PSR)
are also needed.

I also need multiple instruction queues (one per thread)

To speed up the pipeline, it is important to follow a simple rule.
A thread cannot ever execute in a cycle, if the instruction
depends in any way on the result of the previous instruction.
If that rule is followed, you do not need to feedback
the result of an ALU operation to the ALU.

The simplest way to follow the rule is to never
let a thread execute during two adjacent clock cycles.
This limits the performance of a thread to max 1/2 that of
what the CPU is capable of but at the same time,
there is less logic in the critical path, so you
can increase the clock frequency.

Now you suddenly can run code with exact properties.
You can say that I want to execute 524 instructions
per millisecond, and that is what the CPU will do.
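
A minimal sketch of how such a guarantee could be checked, assuming the threads are selected from a fixed, repeating slot table. The slot table itself is an assumption (the post does not say how threads are chosen each cycle), and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def check_slot_table(table):
    """True if no thread owns two adjacent slots, including the wrap-around
    from the last slot back to the first."""
    n = len(table)
    if n < 2:
        return False
    return all(table[i] != table[(i + 1) % n] for i in range(n))


def instructions_per_ms(table, clock_hz, thread):
    """Exact instruction budget of one thread, one instruction per owned slot."""
    share = Fraction(Counter(table)[thread], len(table))
    return share * clock_hz / 1000
```

For example, with the table `[0, 1, 0, 2]` thread 0 owns every second slot, which is the 1/2 upper bound mentioned above, and its budget is fixed regardless of what threads 1 and 2 do.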

You can let all the interrupts be executed in one thread,
so you do not disturb the time critical threads.

The architecture is well suited for FPGA work since
you can use standard dual port RAMs for registers.

I use two dual port RAMs to implement the register banks (each has one 
read port and one write port)
The writes are connected together, so you have in effect a register
memory with 1 write port and 2 read ports.

If the CPU architectural model has, let's say, 16 registers x 32,
and you use 2 x (256 x 32) dual port RAMs, you have storage for
16 threads. 2 x (16 CPUs x 16 registers x 32 bits)
If you use 512 x 32 bit DPRAMs you have room for 32 threads.
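
The arithmetic and the "two RAMs, shared write" arrangement can be modelled in a few lines of Python. This is a behavioural sketch only; the class and function names are hypothetical, and the address mapping (thread number in the upper address bits) is an assumption.

```python
def threads_supported(ram_depth, regs_per_thread=16):
    """Number of per-thread register banks that fit in one RAM."""
    return ram_depth // regs_per_thread


def register_address(thread_id, reg, regs_per_thread=16):
    """Assumed mapping: thread number selects the bank, reg the word in it."""
    return thread_id * regs_per_thread + reg


class RegisterFile:
    """Two simple dual-port RAMs holding identical contents.

    Every write goes to both RAMs; each RAM serves one read port,
    giving 1 write port and 2 read ports overall.
    """

    def __init__(self, depth=256):
        self.ram_a = [0] * depth
        self.ram_b = [0] * depth

    def write(self, addr, value):
        self.ram_a[addr] = value & 0xFFFFFFFF
        self.ram_b[addr] = value & 0xFFFFFFFF

    def read2(self, addr_a, addr_b):
        return self.ram_a[addr_a], self.ram_b[addr_b]
```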

If you want to study a real example look at the MIPS multithreaded cores
https://www.mips.com/products/architectures/ase/multi-threading/

They decided to build that after I presented my research to their CTO.
They had more focus on the performance than the real time control
which is a pity.
FPGA designers do not have that limitation.

AP

Article: 161180
Subject: Cyclone V decimation
From: Piotr Wyderski <peter.pan@neverland.mil>
Date: Sat, 23 Feb 2019 08:31:59 +0100
Links: << >> << T >> << A >>

Hi,

the input signal is 14 bits signed@750ksps. I would like to decimate it 
by a modest factor of ~3000. What would be the best way of doing it on a 
Cyclone V, resource-wise? My usual approach would be a cascade of CIC
decimators followed by a FIR corrector, but since there are the DSP 
blocks, I don't feel it to be the "right" (albeit correct) approach. I'm 
new to the V family and lack the proper intuitions, so could someone 
more versed
suggest me a good direction?

In fact, there will be 12 such channels, all going in sync,
so maybe a considerable resource sharing can be achieved?

	Best regards, Piotr

Article: 161181
Subject: Re: Cyclone V decimation
From: gnuarm.deletethisbit@gmail.com
Date: Sat, 23 Feb 2019 06:18:04 -0800 (PST)
Links: << >> << T >> << A >>

On Saturday, February 23, 2019 at 2:32:04 AM UTC-5, Piotr Wyderski wrote:
> Hi,
> 
> the input signal is 14 bits signed@750ksps. I would like to decimate it 
> by a modest factor of ~3000. What would be the best way of doing it on a 
> Cyclone V, resource-wise? My usual approach would be a cascade of CIC
> decimators followed by a FIR corrector, but since there are the DSP 
> blocks, I don't feel it to be the "right" (albeit correct) approach. I'm 
> new to the V family and lack the proper intuitions, so could someone 
> more versed
> suggest me a good direction?
> 
> In fact, there will be 12 such channels, all going in sync,
> so maybe a considerable resource sharing can be achieved?
> 
> 	Best regards, Piotr

To determine the "right" approach, you need to define "right" in some engineering terms.  So what aspects of the design and implementation are important to your goals? 

Rick C.

Article: 161182
Subject: Can MIPS Leapfrog RISC-V?
From: gnuarm.deletethisbit@gmail.com
Date: Sat, 23 Feb 2019 08:15:02 -0800 (PST)
Links: << >> << T >> << A >>

I saw this article and found it interesting.  But what is behind it?  If the ISA is open source, they give up a revenue stream.  How do they replace that?  The article talked about Wave Computing having AI software and getting some sort of synergy from the MIPS architecture being open.  I don't follow.

I wonder how well the MIPS ISA can be implemented in an FPGA.  The ones listed in Jim Brakefield's lists seem to be a bit larger than other RISC designs.

Rick C.

Article: 161183
Subject: Re: Cyclone V decimation
From: Piotr Wyderski <peter.pan@neverland.mil>
Date: Sat, 23 Feb 2019 17:17:30 +0100
Links: << >> << T >> << A >>

gnuarm.deletethisbit@gmail.com wrote:

> To determine the "right" approach, you need to define "right" in some engineering terms.  So what aspects of the design and implementation are important to your goals?

Minimisation of resource usage, or in other words, a decimation 
technique that maps best onto the underlying primitives. I believe
those 200+ DSP (multiply-accumulate) blocks are good for something...

	Best regards, Piotr

Article: 161184
Subject: Re: Cyclone V decimation
From: already5chosen@yahoo.com
Date: Sat, 23 Feb 2019 12:20:35 -0800 (PST)
Links: << >> << T >> << A >>

On Saturday, February 23, 2019 at 6:17:28 PM UTC+2, Piotr Wyderski wrote:
> gnuarm.deletethisbit@gmail.com wrote:
>
> > To determine the "right" approach, you need to define "right" in some engineering terms.  So what aspects of the design and implementation are important to your goals?
>
> Minimisation of resource usage, or in other words, a decimation
> technique that maps best onto the underlying primitives. I believe
> those 200+ DSP (multiply-accumulate) blocks are good for something...
>
> 	Best regards, Piotr

If all you want is minimization of resource usage then just do CIC.
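
For reference, a bit-accurate Python model of a CIC decimator is sketched below. It assumes the usual structure (N integrators, downsampling by R, N combs with a differential delay of 1) and the usual register width of input bits + N*ceil(log2(R)); the function name and defaults are hypothetical and nothing here is specific to the Cyclone V.

```python
def cic_decimate(samples, rate, stages, in_bits=14):
    """N-stage CIC decimator using wrap-around (modular) arithmetic,
    as the hardware registers would."""
    width = in_bits + stages * (rate - 1).bit_length()
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    integrators = [0] * stages
    comb_delay = [0] * stages
    out = []
    for n, x in enumerate(samples):
        v = x & mask
        # Integrator chain runs at the input rate.
        for i in range(stages):
            integrators[i] = (integrators[i] + v) & mask
            v = integrators[i]
        # Comb chain runs once per 'rate' input samples.
        if (n + 1) % rate == 0:
            for i in range(stages):
                diff = (v - comb_delay[i]) & mask
                comb_delay[i] = v
                v = diff
            # Convert back from two's complement to a signed integer.
            out.append(v - (1 << width) if v & sign else v)
    return out
```

The DC gain is rate**stages, so the output normally needs scaling or truncation afterwards. There are no multipliers anywhere, only adders and registers.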

Something else makes sense only if you want very flat pass band and very sharp transition between pass band and stop band.

The problem with using generic FIR for decimation is not computation, which for your requirements would be minimal, but storage, both for coefficients and for delay line. Decimation by 3000 would need something like 15K coefficients for good filter shape or twice as many for very good shape. Coefficients storage could be cut in half due to filter's symmetry, but I am not aware of similar trick for delay line. So, overall you will need just 1 DSP block, but 40 to 80 M10K blocks.
Of course, you always can trade storage for simplicity, by building your decimation chain as a cascade, probably sizing each stage for delay line to fit in 1 M10K block. Then the whole chain will take 3 stages and only 6 M10K blocks and filter shape could still be excellent. Or, maybe, even 2 M10K blocks if you are ready to complicate the control machine a little more by placing all delay lines in a common M10K and doing the same for coefficients. But is it worth the increased complexity? I am not sure.
And then there is a variant in the middle - a cascade of 2 stages instead of 3. Then each delay line and each set of FIR taps will fit in M9K, but two delay lines wouldn't fit. So, with a bit of control acrobatics you could fit the whole cascade in 3 M9K blocks. Still, do it only if you care about the shape of the filter, but don't do it for resources alone.
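
A rough back-of-envelope version of the storage estimate, as a Python sketch. The word widths (14 bit samples, 18 bit coefficients) and ideal bit packing are assumptions, not figures from the post; real M10K aspect ratios waste some bits per word, so actual block counts would be higher than what this returns.

```python
import math

M10K_BITS = 10 * 1024  # capacity of one M10K block in bits


def m10k_blocks(words, bits_per_word):
    """Blocks needed for a memory, assuming ideal packing (an optimistic bound)."""
    return math.ceil(words * bits_per_word / M10K_BITS)


def fir_storage(taps, data_bits=14, coeff_bits=18, symmetric=True):
    """Return (delay line blocks, coefficient blocks) for a single-stage FIR."""
    delay_blocks = m10k_blocks(taps, data_bits)
    coeff_words = math.ceil(taps / 2) if symmetric else taps
    coeff_blocks = m10k_blocks(coeff_words, coeff_bits)
    return delay_blocks, coeff_blocks
```

Under these optimistic assumptions, `fir_storage(15000)` and `fir_storage(30000)` land somewhat below the 40 to 80 blocks quoted above, which is the expected direction for an ideal-packing bound.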

Article: 161185
Subject: Re: Cyclone V decimation
From: Piotr Wyderski <peter.pan@neverland.mil>
Date: Sat, 23 Feb 2019 22:58:20 +0100
Links: << >> << T >> << A >>

already5chosen@yahoo.com wrote:

> If all you want is minimization of resource usage then just do CIC.
>
> Something else makes sense only if you want very flat pass band and
> very sharp transition between pass band and stop band.

There is very little to no energy in the upper part of the band. The 
high ADC speed is there for other reasons. Therefore, CIC will be more
than enough, at least in the first stages of the cascade. I don't know
yet if it would be sufficient for the final stage, but this is a detail
that can be tweaked in a later phase.

So I have a licensing type of a question: can I instantiate DSP blocks
in Quartus Lite? I know the DSP builder is an extra paid tool, but I 
don't need it -- a purely Verilog instantiation would be sufficient.
This block appears to have a decent accumulator, so it could relieve the 
ALMs otherwise needed by the register-hungry CIC.

Thank you!

	Best regards, Piotr

Article: 161186
Subject: Re: Cyclone V decimation
From: Kevin Neilson <kevin.neilson@xilinx.com>
Date: Sat, 23 Feb 2019 17:59:29 -0800 (PST)
Links: << >> << T >> << A >>

First of all, since your sample rates are pretty low, I'd see if it's possible to use a DSP chip instead of an FPGA.  Everything is easier in software.

Everything depends on your specs, which you have not stated.  Namely:  what is the attenuation of the stopband, and what is the slope between the passband and the stopband?  You say there is not much in the upper frequencies, so this makes it sound like your filtering requirements are very low.  If there is nothing much at all up there, you don't even need to filter.  Just decimate.  Take every nth sample.

The point of the CIC is to reduce the need for multipliers, but you have plenty of multipliers and low sample rates.  The CIC has big sidelobes.  It might be better to do a cascade of FIRs each with low numbers of taps.

Article: 161187
Subject: Re: Cyclone V decimation
From: gnuarm.deletethisbit@gmail.com
Date: Sat, 23 Feb 2019 21:24:19 -0800 (PST)
Links: << >> << T >> << A >>

On Saturday, February 23, 2019 at 11:17:28 AM UTC-5, Piotr Wyderski wrote:
> gnuarm.deletethisbit@gmail.com wrote:
>
> > To determine the "right" approach, you need to define "right" in some engineering terms.  So what aspects of the design and implementation are important to your goals?
>
> Minimisation of resource usage, or in other words, a decimation
> technique that maps best onto the underlying primitives. I believe
> those 200+ DSP (multiply-accumulate) blocks are good for something...
>
> 	Best regards, Piotr

Is that your only criterion?  Along with the 200+ DSP blocks I would expect the chip has many thousands of LUTs and FFs.  Why focus on DSP block usage?

I don't see a problem with using the CIC decimators if they otherwise work the way you want.  A CIC filter has sharp nulls at particular points but doesn't do so much elsewhere while being very logic and energy efficient.  They are typically finished by a relatively short FIR so the aggregate delay is not so large.  Doing it all in a single filter would create a much longer delay, no?

Other than the power usage of a large decimating FIR filter, I can't think of other trade offs.

Rick C.

Article: 161188
Subject: Re: Cyclone V decimation
From: Piotr Wyderski <peter.pan@neverland.mil>
Date: Sun, 24 Feb 2019 07:23:17 +0100
Links: << >> << T >> << A >>

gnuarm.deletethisbit@gmail.com wrote:

> Is that your only criterion?

Well, basically, yes, it is the only degree of freedom. In other words:
I can design any filtering structure that satisfies my requirements from 
the signal processing point of view, but not all structures are equally
welcome by the FPGA, let alone an FPGA with DSP slices. Hence my question.

I've already done it with a multistage CIC alone, but the hardware
was much simpler and CIC approach was the only viable one.

> Along with the 200+ DSP blocks I would expect the chip has many
> thousands of LUTs and FFs.  Why focus on DSP block usage?

One reason is to learn them, the other is the ability to use a smaller chip.
A DSP block is composed of two multipliers and an accumulator. The
accumulator is what a CIC needs. There will be plenty of other functions
occupying those FFs.

	Best regards, Piotr

Article: 161189
Subject: Re: Cyclone V decimation
From: gnuarm.deletethisbit@gmail.com
Date: Sun, 24 Feb 2019 07:03:39 -0800 (PST)
Links: << >> << T >> << A >>

On Sunday, February 24, 2019 at 1:23:21 AM UTC-5, Piotr Wyderski wrote:
> gnuarm.deletethisbit@gmail.com wrote:
>
> > Is that your only criterion?
>
> Well, basically, yes, it is the only degree of freedom. In other words:
> I can design any filtering structure that satisfies my requirements from
> the signal processing point of view, but not all structures are equally
> welcome by the FPGA, let alone an FPGA with DSP slices. Hence my question.
>
> I've already done it with a multistage CIC alone, but the hardware
> was much simpler and CIC approach was the only viable one.
>
> > Along with the 200+ DSP blocks I would expect the chip has many
> > thousands of LUTs and FFs.  Why focus on DSP block usage?
>
> One reason is to learn them, the other is the ability to use a smaller chip.
> A DSP block is composed of two multipliers and an accumulator. The
> accumulator is what a CIC needs. There will be plenty of other functions
> occupying those FFs.

You haven't given us much to go on.  As some have pointed out you can do the decimation in multiple stages and use smaller FIR filters at each point, or use one ginormous FIR filter.  In both cases a polyphase organization will reduce the number of calculations needed.  Or you can use the CIC filter as a front end.  I don't know any of the details, so I have no way of calculating the resource usage.

I think it is pretty obvious what the trade offs are.  Squeeze here and this toothpaste comes out there.  Squeeze there and other toothpaste comes out somewhere else.

To know where to squeeze and how hard, the numbers are important.

Rick C.

Article: 161190
Subject: Re: Cyclone V decimation
From: Piotr Wyderski <peter.pan@neverland.mil>
Date: Mon, 25 Feb 2019 08:36:36 +0100
Links: << >> << T >> << A >>

already5chosen@yahoo.com wrote:

> If all you want is minimization of resource usage then just do CIC.

As an afterthought: given the number of channels, their relatively slow
speed and the requirement of lockstep processing, perhaps a bit-serial
CIC would be a good idea?

Other parts of the design can benefit greatly from massive application 
of this approach and it would be a powerful cerebral decalcifier. I think
it is worth doing even if just to learn it makes no sense.

Thank you all for your help!

	Best regards, Piotr

Article: 161191
Subject: Your existing VHDL testbench says ‘Hello world’
From: Espen Tallaksen <espen.tallaksen@bitvis.no>
Date: Mon, 25 Feb 2019 07:08:16 -0800 (PST)
Links: << >> << T >> << A >>

 - and that also includes preparing the use of any UVVM Utility Library command available, - for logging, checking signal values and signal stability, waiting for signal changes, values or stability, clock generation, synchronization, and lots of other very useful testbench functionality.


The *exhaustive* list of what to do:

1. Download from Github https://github.com/UVVM/UVVM
2. Compile Utility Library as follows:
   a) Inside your simulator go to ‘uvvm_util/sim’
   b) execute: ‘source ../script/compile_src.do’
3. Include the library inside your testbench by adding the following lines before your testbench entity declaration:
      library uvvm_util;
      context uvvm_util.uvvm_util_context;
4. You may now enter any utility library command inside your testbench processes (or subprograms)
   e.g. log("Hello world");

----

You find a full Quick reference for all these commands inside the download (or here: https://github.com/UVVM/UVVM/blob/master/uvvm_util/doc/util_quick_ref.pdf). This includes a command overview - followed by detailed info per command - including type overloads, description and examples.

Invest another 4 minutes and you are ready to run your first high level transaction commands on AXI4-lite, AXI4-stream, Avalon, SPI, I2C, UART and more. All you need to do is to compile and include the relevant Verification IP (e.g. bitvis_vip_uart) in the same way as for bullets 2 and 3 above, and then just execute any UART transaction command as given in the relevant quick reference (e.g. uart_transmit(x"5A", "Transmitting my first byte", clk, tx);).

UVVM is the fastest growing FPGA (and ASIC) verification methodology world-wide (acc. to Wilson Research). Invest 4 minutes and get started now. (There is plenty of documentation on Github. Check out the README file for more info and links to Powerpoints, Documentation, Webinars, etc.)

Article: 161192
Subject: Re: Cyclone V decimation
From: gnuarm.deletethisbit@gmail.com
Date: Mon, 25 Feb 2019 07:10:44 -0800 (PST)
Links: << >> << T >> << A >>

On Monday, February 25, 2019 at 2:36:33 AM UTC-5, Piotr Wyderski wrote:
> already5chosen@yahoo.com wrote:
>
> > If all you want is minimization of resource usage then just do CIC.
>
> As an afterthought: given the number of channels, their relatively slow
> speed and the requirement of lockstep processing, perhaps a bit-serial
> CIC would be a good idea?
>
> Other parts of the design can benefit greatly from massive application
> of this approach and it would be a powerful cerebral decalcifier. I think
> it is worth doing even if just to learn it makes no sense.
>
> Thank you all for your help!

When I have looked at performing bit serial calculations I've found it to not be a large savings of logic and often using more FFs.  If you use some form of RAM, either distributed or block, the FF savings can be good.  I suppose the Xilinx LUT shift registers come in handy for this.  I think they are still the only ones doing that.

I suppose once you get your head wrapped around the bit serial thing, it can be easy to do.  It can make it a bit harder to extend the precision at each stage since that means the bit count changes and so the timing.

Rick C.

Article: 161193
Subject: Re: Cyclone V decimation
From: Rob Gaddi <rgaddi@highlandtechnology.invalid>
Date: Mon, 25 Feb 2019 10:09:31 -0800
Links: << >> << T >> << A >>

On 2/22/19 11:31 PM, Piotr Wyderski wrote:
> Hi,
> 
> the input signal is 14 bits signed@750ksps. I would like to decimate it 
> by a modest factor of ~3000. What would be the best way of doing it on a 
> Cyclone V, resource-wise? My usual approach would be a cascade of CIC
> decimators followed by a FIR corrector, but since there are the DSP 
> blocks, I don't feel it to be the "right" (albeit correct) approach. I'm 
> new to the V family and lack the proper intuitions, so could someone 
> more versed
> suggest me a good direction?
> 
> In fact, there will be 12 such channels, all going in sync,
> so maybe a considerable resource sharing can be achieved?
> 
>      Best regards, Piotr

This may be a better question over at comp.dsp.

That said, and given what you've said in other responses, your best 
answer may be to use a polyphase decimating FIR filter.  In effect, 
you'd use a 12000 tap FIR filter, but only 4 taps of it at a time.
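
A small Python sketch of that idea, assuming a decimation ratio R and an L-tap filter. Each incoming sample is multiplied by only ceil(L/R) coefficients (the taps of one polyphase branch) and added into that many running output accumulators. The function names are hypothetical, and the streaming version keeps all accumulators in a list for simplicity where hardware would keep only ceil(L/R) of them.

```python
def decimate_direct(x, h, rate):
    """Reference: y[m] = sum_k h[k] * x[m*rate - k], computed the slow way."""
    out = []
    for n in range(0, len(x), rate):
        out.append(sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0))
    return out


def decimate_streaming(x, h, rate):
    """Same result, but organised per input sample.

    Returns (outputs, max multiply-accumulates used for any one input sample).
    """
    taps = len(h)
    n_out = (len(x) + rate - 1) // rate
    acc = [0] * n_out
    max_macs = 0
    for n, sample in enumerate(x):
        m_first = -(-n // rate)             # ceil(n / rate)
        m_last = (n + taps - 1) // rate     # last output this sample reaches
        macs = 0
        for m in range(m_first, min(m_last, n_out - 1) + 1):
            acc[m] += h[m * rate - n] * sample
            macs += 1
        max_macs = max(max_macs, macs)
    return acc, max_macs
```

With 12000 taps and a ratio of 3000 the inner loop runs at most 4 times per input sample, matching the "4 taps at a time" figure; the coefficient memory still has to hold all 12000 taps.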

Understanding Digital Signal Processing (Lyons, 2011) has a good enough 
treatment on the subject for a general purpose DSP book.  Multirate 
Digital Signal Processing (Crochiere and Rabiner, 1983) has an excellent 
and extremely rigorous treatment on the subject, but is out-of-print and 
a far more specialized book.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.

Article: 161194
Subject: Re: Cyclone V decimation
From: Benjamin Couillard <benjamin.couillard@gmail.com>
Date: Mon, 25 Feb 2019 13:37:58 -0800 (PST)
Links: << >> << T >> << A >>

On Saturday, February 23, 2019 at 02:32:04 UTC-5, Piotr Wyderski wrote:
> Hi,
>
> the input signal is 14 bits signed@750ksps. I would like to decimate it
> by a modest factor of ~3000. What would be the best way of doing it on a
> Cyclone V, resource-wise? My usual approach would be a cascade of CIC
> decimators followed by a FIR corrector, but since there are the DSP
> blocks, I don't feel it to be the "right" (albeit correct) approach. I'm
> new to the V family and lack the proper intuitions, so could someone
> more versed
> suggest me a good direction?
>
> In fact, there will be 12 such channels, all going in sync,
> so maybe a considerable resource sharing can be achieved?
>
> 	Best regards, Piotr

You could also use halfband FIR filters, they are really efficient. Again, I really recommend Rick Lyons' DSP book, it is a really good book, it is not too mathy. Basically a 16-tap halfband filter will only use 4 multipliers instead of 16.

Assuming you decimate by 2048, i.e. 2^11, you would need about 44 multipliers. Furthermore, you can time-multiplex and reuse the multipliers, so you could probably get by using one hardware multiplier per stage for a total of 11 multipliers.
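
The multiplier count above is simple arithmetic, sketched here in Python. The figure of 4 multiplies per output sample per stage comes from the post; the assumption that each stage computes only at its own output rate, and the function names, are mine.

```python
def halfband_stages(decimation):
    """Number of decimate-by-2 stages; decimation must be a power of two."""
    stages = decimation.bit_length() - 1
    if decimation < 2 or (1 << stages) != decimation:
        raise ValueError("decimation must be a power of two >= 2")
    return stages


def multiplier_budget(decimation, mults_per_stage=4):
    """(stages, multipliers) with no sharing of multipliers at all."""
    stages = halfband_stages(decimation)
    return stages, stages * mults_per_stage


def total_mac_rate(input_rate, decimation, mults_per_stage=4):
    """Multiplies per second summed over all stages, each stage running
    at its own output rate (input_rate / 2**k for stage k)."""
    stages = halfband_stages(decimation)
    return sum(mults_per_stage * input_rate / 2 ** k
               for k in range(1, stages + 1))
```

Because the rates form a geometric series, the total stays below `mults_per_stage * input_rate` however many stages are added, which is why time-multiplexing a small number of hardware multipliers is plausible at 750 ksps.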

Article: 161195
Subject: Re: Cyclone V decimation
From: lasselangwadtchristensen@gmail.com
Date: Mon, 25 Feb 2019 14:35:32 -0800 (PST)
Links: << >> << T >> << A >>

On Monday, February 25, 2019 at 22:38:02 UTC+1, Benjamin Couillard wrote:
> On Saturday, February 23, 2019 at 02:32:04 UTC-5, Piotr Wyderski wrote:
> > Hi,
> >
> > the input signal is 14 bits signed@750ksps. I would like to decimate it
> > by a modest factor of ~3000. What would be the best way of doing it on a
> > Cyclone V, resource-wise? My usual approach would be a cascade of CIC
> > decimators followed by a FIR corrector, but since there are the DSP
> > blocks, I don't feel it to be the "right" (albeit correct) approach. I'm
> > new to the V family and lack the proper intuitions, so could someone
> > more versed
> > suggest me a good direction?
> >
> > In fact, there will be 12 such channels, all going in sync,
> > so maybe a considerable resource sharing can be achieved?
> >
> > 	Best regards, Piotr
>
> You could also use halfband FIR filters, they are really efficient. Again, I really recommend Rick Lyons' DSP book, it is a really good book, it is not too mathy. Basically a 16-tap halfband filter will only use 4 multipliers instead of 16.
>
> Assuming you decimate by 2048, i.e. 2^11, you would need about 44 multipliers. Furthermore, you can time-multiplex and reuse the multipliers, so you could probably get by using one hardware multiplier per stage for a total of 11 multipliers.

with each stage running at half the rate of the previous it should be
possible to stagger the calculations so you need only (slightly less
than) twice the first stage's rate

1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1....
-2---2---2---2---2---2---2---2---....
---3-------3-------3-------3-----....
-------4---------------4---------....
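
The pattern in the diagram can be generated from a free-running slot counter: the stage served in slot t is one plus the number of trailing 1 bits of t. A small Python sketch follows (function names are hypothetical).

```python
def stage_for_slot(t):
    """1-based stage number served in time slot t."""
    stage = 1
    while t & 1:
        stage += 1
        t >>= 1
    return stage


def schedule(n_slots, n_stages):
    """Slot-by-slot schedule; None marks an idle slot (no such stage)."""
    result = []
    for t in range(n_slots):
        stage = stage_for_slot(t)
        result.append(stage if stage <= n_stages else None)
    return result
```

Stage 1 gets every second slot, stage 2 every fourth, and so on, so no two stages ever collide and only a few slots are left idle. In hardware this amounts to a priority encoder on the counter bits.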

Article: 161196
Subject: Re: Cyclone V decimation
From: Piotr Wyderski <peter.pan@neverland.mil>
Date: Fri, 1 Mar 2019 08:40:45 +0100
Links: << >> << T >> << A >>

gnuarm.deletethisbit@gmail.com wrote:

> When  I have looked at performing bit serial calculations I've found it to not be a large savings of logic and often using more FFs.

You are right, several initial attempts indicate that the savings are 
minor if I apply time multiplexing carefully. It was a refreshing
experience, though, so no time wasted.

The large decimation factor implies the final bandwidth is narrow, so 
even a very modest 4-stage decimating by 4 CIC filter has about 100dB
of attenuation around the +/-20kHz DC image frequencies. There will be
considerable aliasing above that, but I'm going to filter it out anyway
later, so why bother. The subsequent filters will work at a much lower 
data rate, so I can bump up their order or even change their topology to
something other than a CIC.

Lesson learned: narrow-band CIC attenuation doesn't depend on the filter 
order considerably. Obvious when you think about it, but for some reason
it wasn't.
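
Figures like these can be checked with the standard CIC magnitude response, |sin(pi*R*f/fs) / (R*sin(pi*f/fs))|^N. A Python sketch is below; it assumes a differential delay of 1 and reads "4-stage decimating by 4" as N = 4, R = 4 at fs = 750 kHz, and the function name is hypothetical.

```python
import math


def cic_gain_db(f, fs, rate, stages):
    """Gain in dB of an N-stage CIC decimator, normalised to 0 dB at DC."""
    x = math.pi * f / fs
    den = rate * math.sin(x)
    if abs(den) < 1e-300:
        return 0.0                      # limit at f = 0 (and multiples of fs)
    ratio = abs(math.sin(rate * x) / den)
    if ratio == 0.0:
        return float("-inf")            # exactly on a null
    return 20.0 * stages * math.log10(ratio)
```

Evaluating it at `fs / rate + offset` for small offsets shows how the attenuation near the first image band depends on the width of the band of interest.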

OK, I have my answer, thank you all for your contribution!

	Best regards, Piotr

Article: 161197
Subject: Anyone have files from the old Xilinx FTP?
From: Tim Regeant <timbuck2@posteo.us>
Date: Tue, 12 Mar 2019 22:45:03 -0400
Links: << >> << T >> << A >>

Hi all,

Looking for someone who has FTP files from 1997 for the XACT Foundation 
v6.0.2 update.

Web Archive has the files listed here: 
https://web.archive.org/web/19970616112705/http://www.xilinx.com/support/techsup/ftp/htm_index/sw_foundation.htm

Anyone capture these files?

Thanks.

Article: 161198
Subject: Re: Anyone have files from the old Xilinx FTP?
From: Anssi Saari <as@sci.fi>
Date: Wed, 13 Mar 2019 09:27:03 +0200
Links: << >> << T >> << A >>

Tim Regeant <timbuck2@posteo.us> writes:

> Hi all,
>
> Looking for someone who has FTP files from 1997 for the XACT
> Foundation v6.0.2 update.
>
> Web Archive has the files listed here:
> https://web.archive.org/web/19970616112705/http://www.xilinx.com/support/techsup/ftp/htm_index/sw_foundation.htm
>
> Anyone capture these files?

I don't have those files but have you asked Xilinx? I find FPGA
companies are fairly good about providing old versions of their
software. For example, last year Lattice provided ispLEVER 5.1 when I
asked, just so we could try to recreate an old CPLD which used that
specific version. It took some time but eventually they did provide that
exact version.

Article: 161199
Subject: Re: Anyone have files from the old Xilinx FTP?
From: gnuarm.deletethisbit@gmail.com
Date: Wed, 13 Mar 2019 08:50:56 -0700 (PDT)
Links: << >> << T >> << A >>

On Tuesday, March 12, 2019 at 10:45:13 PM UTC-4, Tim Regeant wrote:
> Hi all,
>
> Looking for someone who has FTP files from 1997 for the XACT Foundation
> v6.0.2 update.
>
> Web Archive has the files listed here:
> https://web.archive.org/web/19970616112705/http://www.xilinx.com/support/techsup/ftp/htm_index/sw_foundation.htm
>
> Anyone capture these files?
>
> Thanks.

I seem to recall that I had copies of XACT from around that time frame and sent them to someone.  I boxed up the several boxes of software (possibly on floppy?) and shipped them out.  If that person is still posting in this group perhaps he could help you?

Rick C.
