1. IMPORTANT:
    We launched a new online community and this space is now closed. This community will be available as a read-only resources until further notice.
    JOIN US HERE

Core question - FFT

Discussion in 'Building With Reaktor' started by joop, Oct 7, 2005.

Thread Status:
Not open for further replies.
  1. joop

    joop NI Product Owner

    Messages:
    333
    I have yet to get a solid look at what the core has to offer, but can core be used to add FFT processes (like filtering partials above a threshold or convolution) to Reaktor?

    Is anybody actively investigating this?


    Just curious. Thanks.
     
  2. kid_sputnik

    kid_sputnik NI Product Owner

    Messages:
    3,552
    um, gabriel?...
     
  3. lxint

    lxint NI Product Owner

    Messages:
    764
  4. TheBayer

    TheBayer NI Product Owner

    Messages:
    38
    Though the core cells seem efficient enough to do this, you quickly run into "Compilation error: structure is too long." problems. Unfortunately I see no way to break up the process to get past this. The data that needs to be passed would need to go outside of the core cell into another to circumvent this from what I can see. I suspect with some fine tuning I could get to 64 point FFT, but I fear that would be with no operations and no IFFT. Thus it'd be pretty useless for doing spectral manipulations. I'm thinking of submitting a list of requested features/refinements to NI.

    I think they had a great idea with Core, but I think they really dropped the ball with some of the limitations. I.e. Just before the core cell with the FFT stopped functioning it was eating 2% CPU. So performance wise it was doing great, but their limits mean that no one will be able to realize a greater part of the flexibility that Core claimed to give.
     
  5. sowari

    sowari Moderator Moderator

    Messages:
    27,759
    if you want to name these features/refinements and also list the limitations. i will happily forward it to the beta test team.

    and by all means submit them to NI as well, but a 2 pronged attack might be better ;-)

    sowari
     
  6. TheBayer

    TheBayer NI Product Owner

    Messages:
    38
    So far this is my list. Maybe some are based on my ignorance, I don't know for sure. I've organized them from what I see as the biggest or most fundamental problems throgh the more wish oriented problems/requests. Some of these I think should be quite simple to do, others will be rather nasty. Even getting 1-4 or 1-5 would make Reaktor a ton more powerful.

    Reaktor 5 feature problems/requests:

    1. There is no core cell type that allows you to take in audio and output events. You can certainly do this by hooking stuff external to the cell to try to filter the events, but this means duplicate events won't be seen. The audio core cell makes the most sense, just have a different output type. Multiple events per control cycle become the last event seen.

    2. There is SR.C and SR.R, there really should be an CR.C and CR.R (control rate clock and rate). Passing in a clock and feeding it to all the macros is a real pain when making low speed event systems.

    2a. When you load a macro into an event core cell it should convert any SR.C and SR.R to CR.C and CR.R respectively. This would allow many of the Expert and Standard macros to function correctly in Event Cells.

    3. Make a real Random Number core builtin that gets the system random. It's too hard to get something random across a polyphonic signal as they either start with the same seed or anything algrothmic you do based on voice number gives you similar patterns.

    4. All core macros should be compiled independantly, not as a complete cell. This would allow leveraging to larger cells without constant recompile overhead. Then only recompile the guts of the macro that changed (not parents except maybe trivially, and not children.)

    4a. A macro or a core cell should have an "Optimize" operation on it. This will disable audio, compile fully optimized (like it does currently) everything under it, and present a status meter and a cancel button as it compiles the core elements as a complete cell or macro.

    4b. There might need to be some sort of cache file that is the compiled macro or maybe a container for the compiled macros that aren't saved seperately.

    5. Okay, you knew this was coming... Linked Macros. The link points at a file, any changes to the file are replicated into the macros. If it is too intensive to poll the file or set a watch on it, then provide global menu item to "Update Linked Macros." By default a macro will load non-linked, it'll retain the information of where it was loaded from. The context menu will have an checkmark item "Link Macro". Any time you start building a complicated system invariably you find you made a mistake in some fundamental block. Re-duplicating and rewiring becomes a very painful chore.

    6. FFT/IFFT Core elements that take an array in and output and array of phases and magnitudes (That said we could build these if 4&5 are done, but there is doubtless a better/faster way to do it in C/C++. Spectral manipulation is one of the biggest things missing from Reaktor.

    7. Array manipulations. Rotate, initialize a range, move a block, block multiply, block add. This will be needed for spectral manipulations.

    8. Iteration... though not totally a must, this would greatly simplify many processes. I know it gives people a good way to get stuck, but it is still a rather big thing to not have. If you wanted to make this safe, there could be something that assigned a maximum iteration limit to prevent hanging.

    There was another problem brought up about core cells not being allowed in event loops; this one worries me too. However I've not yet played around where that'd happen so I can't say for sure where I'd put it in the list. Maybe as I play with things more I'll be able to get a better idea why the limitation exists.
    If anyone at NI is reading this, good work on Core Cells... they're a huge step in the right direction.
     
  7. sowari

    sowari Moderator Moderator

    Messages:
    27,759
    ok, i have passed this along to the beta team list.

    i not sure if it is good timing or not, meaning i think issues of Reaktor in Pro Tools are being looked at, at the moment.

    so, it is probable that your ideas will not be implimented in the next update...but hopefully Core issues will become the centre of focus, for future updates.

    until then, it is great that Reaktor users are pushing the envelope.

    also, if you want to email NI with your Core ideas, please do so.

    sowari
     
  8. Belial

    Belial NI Product Owner

    Messages:
    102
    could someone put in the user library a core cell with the example of FFT please? i've been interested in how it would look when inplemented for a while now. i want to try sometime but my maths aint up to scratch yet, so someone elses implementation would be very useful
     
  9. TheBayer

    TheBayer NI Product Owner

    Messages:
    38
    This is the 64 point FFT I was working on, it refuses to compile but you can see the structure. I suspect I could get this working actually, but 64 point is still way too little for what I want. It is built up of Radix-4 FFTs which cuts down on the multiply operations as compared to the more common Radix-2 FFT.

    I couldn't upload it so I had to make a link from my webserver:

    http://www.ericvonbayer.us/music/reaktor/FFT64.zip
     
  10. lxint

    lxint NI Product Owner

    Messages:
    764
    thanks

    some ideas that will probably not solve the problem anyway but will still reduce the structure :

    the log2(n) macro could be replaced with a constant for now ( and always, since you'd probably use a fixed size in the end anyway )

    maybe the bit reversal resorting could be done in a sepearate cell, and the dilution and recombination in another, which gets the reordered signal in an audio stream - maybe youll need several other cells for this that run in parallel - and the final conversion for real and img. could be moved the outside to the primary level

    I *dont* think this is they way to go btw but still seems possible to be done like that

    some other things to consider are, if its not possible to replace routers with lookuptables that store indices for intermediate buffers etc - the initial reordering could be done completely with a lookuptable instead of bitshifting
    I dont think this makes any difference speedwise since you only need to move one samle at a time ( I think ?)

    then I forgot what the problem with nested iterations had been again - do you need more iterations than samples or isnt that always less - so maybe there are some other approaches that could work

    sorry if some if this in the end doesn't makes sense, didnt dig the structure completely nor am I sure how I would go about that ( and probably would rather try a different transform... ) and I am not even sure what radix4 means in this context..
     
  11. TheBayer

    TheBayer NI Product Owner

    Messages:
    38
    Well, I'll start in reverse question wise as that makes most sense. Radix-4 means that rather than processing 2 values (Radix-2) at a time you are processing 4. The benefit of this process is the lowest level structure for the FFT can be very simple. This is because the Twiddle multiplication is by (1,-i,-1,i) for (0,90,180,270) which accounts for all the coefficients. This means rather than doing several multiplies you can just pass the inputs to the outputs with at most a negatrion in them. As the lowest order is the process done the most, this basicly removes 2*<num points> multiplies which is far from insignificant. Also as you move up from the bottom pass it reduces the amount of work lost, so even there the number of multiplies are about 3/4. So as you can see there is a very good reason fro Radix-4. Also note the number of processing blocks is lower, I.e. Radix-2 for FFT-n is (log2(n)-1)*n, Radix-4 is (log4(n)-1*4).


    log2(n) doesn't need to be there, it was just expedient for now. It could certainly come out, but I doubt that is a major part of the problem.

    Moreover there is a lot of logic in the "Cplx * Tf(n)" block that could be removed. This would be less efficient but simpler. This might just get me to a 64-FFT, but I suspect no further.

    The Bit Reversal is done on the lowest stage via a lookup table and incurs basicly no signal path length. So I don't think this would buy me anything. Many paths in this system would be evaluated only once (if I follow how Core works properly) so the overhead in most places should be minimal. This seemed to be true when I was at the FFT-16 stage.

    Most of the places I have routers it is to allow for signal path optimization. I.e. to gate out complex math where possible when the offset matches certain criteria, namely "Cplx * Tf(n)".

    To process the data you have to make log4(n) passes on 1/4th the combinations of points (4 points per combination gives you all points).

    The problem with breaking stuff up to seperate cells is that you can't pass arrays between cells, the best I could ever do is shift intermediate data out and in to the next cell. This would add n-samples of latency to every cell used. This is really not a good solution.
     
  12. lxint

    lxint NI Product Owner

    Messages:
    764
    thanks for the info

    it would be helpful if there were some official statement about the structure limit, cause you can reach it easily with a less complicated structure, but it's not at all obvious where the actual limit is, or what's causing the restriction, and rewiring
    seems change this to a tiny extend in some cases

    it's sort of strange to start building and then having to test every now and then if you've already reached the limit ( and if all the efforts had been a waste of time )

    it would also be of interest to know if this could be changed with an update or rather not
     
  13. TheBayer

    TheBayer NI Product Owner

    Messages:
    38
    I definately agree... it sucks to no know what limits are really there.

    PS - My FFT is utterly bugged. I just realized one part of the structure was really wrong. I'll post a new version when it is resolved. I'm also trying to see if I can manage to actually get it to run even under current limitations... I realized a way to remove some complexity.
     
  14. lxint

    lxint NI Product Owner

    Messages:
    764
    ok I think I've found something that could help here too

    what I am thinking about at the moment are wavelet transforms, ( or 'wavelet packet transform'- WPT-, in this case), not sure if my idea will work, but I think it will, and a similar approach could be adopted for FFT I assume

    the WPT needs N*log2(N) op's per block as well ( actually more, it's L*N*log2N, but this L can be neglected for this consideration, L is an arbitrary filter required )

    bascially the transform works like this: from two adjacent samples a single highband and a single lowband sample are created ( in the simplest case highpass = difference
    [1,-1] lowpass = median [0.5,0.5]) for a longer block than 2 samples, and a 'deeper' transform, this step is repeated over, so a block of 8 samples turns in to 2 blocks of 4 high and 4 low band samples, which in the next level will turn into 4 blocks, length 2, for a high-high, high-low, low-high and low-low band which will become 8 bands in the last step ( in the classical WT, only the low bands are
    decomposited that way, so you end up with octave spaced bands. intersting about both is that
    you don't need overlaps for a high temporal resolution - each band is subsampled on a different
    timescale, according to it's frequency witdh )

    here's a not very descriptive sketch for a size 8 transform:
    Code:
    input	depth1	depth2	depth3( deepest you can get with 8 samples )
    
    in.1	low.1	ll.1	lll.1
    in.2	hi .1	lh.1	llh.1
    in.3	low.2	hl.1	lhl.1
    in.4	hi .2	hh.1	lhh.1
    in.5	low.3	ll.2	hll.1
    in.6	hi .3	lh.2	hlh.1
    in.7	low.4	hl.2	hhl.1
    in.8	hi .4	hh.2	hhh.1
    
    the problem in reaktor here is the same as for FFT I think: how to break the algorithm into an even workload per sample
    here, in the first level, 2 adjacent samples are used per output sample, for the next level these are separated 2 samples apart, and so on ( you cant tell this from the poor sketch ), while the number of operations is the same per level ( N operations for log2 N levels.. ) - and, most or all values of each level have to be processed before the next level can be worked on

    a simple r5 adaption seems to be to write the input block into a memory, and to perform the transforms for each level on seperate clock cycles, replacing the old with the new values in the same memory ( that's probably not clear from my description, but this seems to be straight forward to me)

    for a 256 block, this would mean N=256 operations in parallel,done log2N = 8 times per block,
    - but 256 operations will most likely already hit the structure length limit, while at the same time you only need 8 clock ticks for the complete 256 sample block transform

    obviously it would be desirable to interchange this ratio, doing only 8 (log2N) operations in each of the 256
    ( N) ticks.. and this seems to be possible for the WPT - you simply have to go down in the columns, processing log2N operations in parallel per tick in the example of my ascii sketch this would mean, per sample, 1.[in1,in2,in3] 2.[in4,in5,in6]...8.[hlh1,hhl1,hhh1]

    not sure if there will be a problem when a 'fold' between differnt levels of depth occurs ( the problem could be, that an output of the deeper level requires a value from the higher level that isnt processed yet - I dont think this problem occurs, but I'd need a really large sheet of paper to figure this out for a depth like 10 or higher .. )

    not sure if the same thing is possible with FFT ( nor if this had been exactly your approach - I dont think so, cause this method should result in a rather simple structure, not nested deeply but completely 'flat' ), and I dont know if this description makes sense for anybody, anyway

    estimating the workload of a WPT size 512,depth 9,and filters length 8, this would mean 72 ops per sample,
    (no windowing or overlap required here), adding some overhead for indexing etc this should equal 10-12 "delay 4p" macros for analysis, and for FFT this could be half of that with a size of 1024 and an overlap of 50%... ...if anything like this applies for FFT at all, but I think the butterfly follows a similar pattern, anyway it should be possible to break up the nested iterations that are usually used for FFT in a similar way
     
  15. lxint

    lxint NI Product Owner

    Messages:
    764
    it can, I just got the basic structure working, a real time 1024 size FFT needs about 4% CPU on a P2.66 in reaktor ( no overlaps, no inverse FFT, just one single block by block transform )
    still have to debug a little, and try a longer FFT too, will post it later, maybe monday
     
  16. lxint

    lxint NI Product Owner

    Messages:
    764
    I just had to re-read this, actually I didnt see this when I read it the first time

    not sure to what extend this can be aplied to the layout I posted except for the first stage, which would require just one router, for later stages I dont see if routers are required and if they are, if it would be possible to use just one router that routes the clock,

    also I still havent tried where the structure limit comes into play, you can have forward and backward transform
    in one cell but I dont know where the actual limit is, could be that thats just it

    you didn't get back on this any more.. this has become another of one of my "talks to himself in public threads"

    no idea if you're interested to improve this further or not
    anyway, it's obvious that latency cant be zero but it would be interesting to know what the minimum is
     
  17. TheBayer

    TheBayer NI Product Owner

    Messages:
    38
    Well, I'm sort of interested in improving this as much as we can. Though I'll admit that this is a now and then project for me. I have a demanding software engineering job, and interests in Music and Photography, so it sort of rotates over the course of a month or two.

    The radix-4 wouldn't do anything to lower the latency. It would however likely improve your CPU efficiency a bit more. It isn't necessary for all layers to be the same, so we could easily convert your bottom layer to radix-4 as that is where the most gain is really anyways.

    My goal in the long run would be to have spectral manipulation blocks as well as FFT and IFFT endpoints. I haven't gotten a chance to look at your code yet (strangely the forum didn't email me that something had been posted.) It'd be fun to see where we can go from here.

    To lower latency I suspect the only way is to do more FFTs over a buffer that you are rotating data through. There might be ways to do this without completely doing a new FFT, but I suspect you have to play games about doing it only on Radix boundaries and keeping several layers of buffered data. Though I'll admit the thought of doing this in Reaktor gives me a splitting headache.
     
  18. lxint

    lxint NI Product Owner

    Messages:
    764
    a short description how the struture is layed out, for one transform :

    one block is read in, while the previous block is processed, with a fixed minimum number of simultanous operations per sample required

    all memory read / write locations in lookup tables, which makes it faster ( and much more easy set up & to debug... ) this also shortens the structure, cause you only need 2 reads where otherwise increments and conditions etc would be required
    also the structure is easy to change: to double the operations per sample you simply copy and paste it, and concatenate the new blocks in the right order, since additional busses seem to match if you copy+paste all together ( didnt verify if this is really true ) and adjust the block interval
    seems to be a general scheme to solve similar things for reaktor

    so you have 2 blocks latency, one for reading, one for processing.

    in the processing block, the latency can by decreased by increasing the number of ops per sample, until the structure limit is reached ( seems unlikely to get any further by a different structure since it's max ops with the minimum logic required )

    for the reading block, your hint on radix-4 suggests that for FFT you can process the first layer while reading the data in, which would reduce latency since you have less operations in the following processing block, for deeper stages I assume it just increase the struture length too much to make advantage of this

    a quick test I made indicates that 160 complex multiplies incl. required read/writes could be possible per sample, but I dont know what will happen if I make any further changes ( and further changes would be required then to use it)

    anyway, not sure if I will look into this further for now, I want to know if the wavelet schemes really can be solved the same way - if not it would be funny that it solved this instead
     
Thread Status:
Not open for further replies.