AU SPI question (slave mode)

Update: with the Teensy 4.1 it’s slightly different with hardware SPI 1: both directions only work reliably up to 16 MHz. So I get around 1.8 MB/s.
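
As a rough sanity check on that figure, here is a minimal Python sketch of the arithmetic. It assumes one data bit per SCK cycle and that MB means 10^6 bytes; the function name is made up for illustration.

```python
def spi_max_throughput_mb_s(sck_hz):
    """Upper bound on SPI throughput in MB/s (1 MB = 1e6 bytes, 8 SCK cycles per byte)."""
    return sck_hz / 8 / 1e6

ceiling = spi_max_throughput_mb_s(16e6)   # 16 MHz SCK
measured = 1.8                            # MB/s, the figure reported above
print(f"ceiling {ceiling:.1f} MB/s, bus utilisation {measured / ceiling:.0%}")
```

At 16 MHz the ceiling is 2 MB/s, so about 1.8 MB/s corresponds to roughly 90% of the bus time being spent on data.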

I use that same Teensy with an ILI9341 display on hardware SPI 0 at 30 MHz without problems.

The Teensy project on Github is here.

Another update: I wrote a more representative SPI echo test with the Teensy 4.1, and 17 MHz also seems rock solid.

AU SPI echo source

Teensy SPI echo source

Another update:

If I output the SPI data-out bit one clock cycle after the reading edge (in Mode 1) instead of waiting for the writing edge in spi_peripheral.luc, the maximum speed increases from 17 MHz to 21 MHz. That makes sense, I think.

With this change this goes up to 23 MHz in Mode 2, while Mode 1 and Mode 3 totally fail (also makes sense).

Final update: a flaky SDI jumper wire connection caused problems (replacing a ground wire added 2 MHz).

The current code works reliably up to 23 MHz without outputting SDO “early” in modes 1 and 2.

Replacing the 10 cm jumper wires with shorter soldered connections might push this a bit higher still.

teensy spi speed test in action


I had been meaning to ask what kind of connection quality you were working with. At the frequencies you’re trying to go at, I would blame jumper wire connections before I would blame the chips.

The connections to the ILI9341 display were identical and gave no problems at 30 MHz. But I think the timing constraints of bit banging and sampling SCK, instead of using hardware SPI, made the jumper wire connection quality more critical.
The main problem was a flaky ground wire connection that caused problems above 16 MHz with the Teensy.

Using the clock wizard and running the spi_peripheral at 200 MHz I can reliably (for hours without errors) go to 29 MHz with the Teensy (github).

As a total FPGA newbie, I may well be crossing the clock domains in a suboptimal (if not clumsy) way.

To what extent the 10 cm jumper wires between the Teensy and the Br are a limiting factor is difficult to say without a scope.

With a 300 MHz (clock wizard) SPI sample clock it works reliably up to 39 MHz.

So the jumper wires are probably influencing the signal quality enough to prevent the AU from running the SPI at the theoretical maximum of 1/4th the sampling frequency.
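
To put numbers on that gap, here is a small Python sketch using only the two results quoted above (29 MHz with a 200 MHz sample clock, 39 MHz with 300 MHz). The 1/4 rule is taken from the text; the helper name and the dictionary are just for illustration.

```python
def quarter_rate_limit(sample_hz):
    """Theoretical maximum SCK if four samples per SCK period are required."""
    return sample_hz / 4

# sample clock (Hz) -> highest SCK (Hz) reported as reliable in this thread
observed = {200e6: 29e6, 300e6: 39e6}

for sample_hz, sck_hz in observed.items():
    limit = quarter_rate_limit(sample_hz)
    print(f"{sample_hz / 1e6:.0f} MHz sample clock: "
          f"limit {limit / 1e6:.0f} MHz, observed {sck_hz / 1e6:.0f} MHz, "
          f"{sample_hz / sck_hz:.1f} samples per SCK period")
```

Both results work out to roughly seven to eight samples per SCK period rather than four, which is consistent with something other than the sampling logic (such as the wiring) setting the limit.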

I also tried 400 MHz but that did not work at any frequency.

EDIT: at 300 MHz an occasional bit error occurred, probably due to the wiring…

Make sure your design is passing timing. I was going to add a check for this in Alchitry Labs V2 but I don’t remember off the top of my head if it checks or not right now.

Just scroll up through the build logs a little and look for something along the lines of “all constraints met”

If timing fails to be met, it’ll still make a .bit file, but it may or may not work depending on how badly it failed and the temperature/chip you got.

I’ll have to wire up a demo and see if I can reproduce your results. I have a nice scope and the tools to debug it. I haven’t heavily used the spi slave module so it likely has plenty of room for improvement.

Thanks for the advice, I’ll check the log.

But don’t waste any time on this, I’m just playing with something I didn’t know anything about some weeks ago.

I’m planning to work on this today. It should be possible to get SPI working super fast (100MHz+) with the caveat that you’ll need a dummy byte between write->read transitions to give the FPGA some time to respond. This seems pretty common among flash chips and the like with “fast” SPI modes.

I’m looking forward to your findings!

Meanwhile I have discovered that I can’t use an asynchronous fifo to cross clock domains, as it takes too long to check the full/empty flags: the extra clock cycle on the fast clock (200 MHz) ruins the SPI timing.

I solved it with a synchronous fifo that uses a clock divider on the read side so that the top can correctly read the SPI data at 100 MHz.

github spi_echo.luc

    ...snip...
        // drive outputs
        data_out = out_buf.q
        data_rdy = rdy.q
        
        // make sure the first byte is ready when SPI transfer starts
        spi.data_in = tx_buf.q
        
        // sync the 200 MHz fifo with the 100 MHz output
        clock_syncer.d = clock_syncer.q + 1
        if (clock_syncer.q == 2b01) {
            if (! fifo_out.empty) {
                data_out = fifo_out.dout
                out_buf.d = fifo_out.dout
                data_rdy = 1b1
                rdy.d = 1b1
                fifo_out.rget = 1b1
                clock_syncer.d = 2b00
            }
        } else {
            rdy.d = 1b0
        }
        
        // store any received spi char in output fifo
        if (have_output.q) {
            if (!fifo_out.full) {
                fifo_out.din = tx_buf.q
                fifo_out.wput = 1b1
                have_output.d = 1b0
            }
        } 
    ...snip...
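
For readers trying to follow the pacing logic, here is a cycle-level Python model of just the clock_syncer counter and the FIFO read. It assumes the always block runs on the 200 MHz clock and that the counter is 2 bits wide (as the 2b01 literal suggests). It ignores real timing and the write side entirely, and the function and variable names are invented.

```python
from collections import deque

def simulate_reads(fifo_items, cycles):
    """Return (cycle, value) pairs for each FIFO read over `cycles` fast-clock ticks."""
    fifo = deque(fifo_items)
    syncer = 0                       # models clock_syncer.q (2-bit counter)
    reads = []
    for cycle in range(cycles):
        nxt = (syncer + 1) & 0b11    # clock_syncer.d = clock_syncer.q + 1
        if syncer == 0b01 and fifo:  # read slot, and the FIFO is not empty
            reads.append((cycle, fifo.popleft()))
            nxt = 0b00               # restart the counter after a read
        syncer = nxt
    return reads

reads = simulate_reads([0x11, 0x22, 0x33], cycles=12)
print(reads)
```

In this model, reads land on every second fast-clock cycle while data is waiting, which matches the 100 MHz rate. Once the FIFO is empty the counter free-runs, so the next check comes four cycles later.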

I think it was a good first FPGA project for learning some basic stuff.

What you are doing wouldn’t be considered reliable. You’re getting away with it because the MMCM will phase align the clocks but you don’t know if you’re changing outputs on the rising or falling edge of the 100MHz clock. It may be fine with either but you’re not designing around that.

I spent some time trying to get a better SPI module working and wasn’t successful. I’ll have to think about it a bit more. It’s tricky crossing clock domains fast especially when one clock isn’t consistent.

In the meantime, I checked out the version I posted a bit higher up. It looks fairly good at 1/4 clock speeds but the round trip delay for the first bit cuts it a bit close.


Because of this, I’d say you’re safe around 1/5th the main clock which is I think about what you were getting.

I understand that it is not reliable but the asynchronous fifo simply does not work at all here, so I see no other option. It was suggested in a Vivado forum, and it works reliably at all speeds I tested as long as the fast clock is an integral multiple of the slow clock (the clocks are in phase as this is the default in the clock wizard).

About the 1/5th: as everything I’ve read seems to agree that you need “at least” SCK x 4, I can live with that.

I actually think I’m wrong and if they’re both from the same MMCM the tools will figure it out and get the timing correct.

Why not just clock everything at 200MHz though?

I got a fast version working. It requires that sck be connected to a clock pin though. You also can’t output data during the first byte (it’s always 0).

Data is asked for during bits 1-5 of the previous byte. This also means you can’t immediately respond to the first byte. The second byte will often be a dummy byte because of this.
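
A byte-level Python sketch of what that means for a command/response exchange. This is a simplified model of the behaviour described above, not a simulation of the module: it assumes a received byte reaches the user logic in time for the next byte's request window, and `respond` is a hypothetical stand-in for that user logic.

```python
def peripheral_miso(mosi_bytes, respond):
    """Return the byte the peripheral shifts out in each slot of a transfer.

    `respond(rx)` is hypothetical user logic mapping a received byte to a reply.
    """
    miso = []
    shift_out = 0      # the first byte out is always 0
    last_rx = None     # most recent fully received byte, if any
    for rx in mosi_bytes:
        miso.append(shift_out)
        # During this slot the user logic can only react to the previous byte;
        # whatever it loads goes out in the NEXT slot, otherwise zeros follow.
        shift_out = respond(last_rx) if last_rx is not None else 0
        last_rx = rx
    return miso

echo = peripheral_miso([0xA1, 0xB2, 0xC3, 0xD4], respond=lambda b: b)
print([hex(b) for b in echo])
```

With a plain echo as the user logic, the reply to the first byte only appears in the third slot, so the controller has to clock a dummy byte in between.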

I think it should be able to run with sck close in frequency to the main clock. I only tested it to 25MHz as my setup was a bit spicy and signal integrity failed after that.

EDIT: I should’ve also mentioned that this requires Alchitry Labs 2.0.22 which isn’t out yet as I added arst to DFFs to support asynchronous resets. Changing the arst to a normal rst should still work as long as you always send 8 bits at a time.

module spi_fast_peripheral (
    input clk,             // clock
    input rst,             // reset
    input cs,              // SPI chip select
    input sdi,             // SPI data in
    output sdo,            // SPI data out
    input sck,             // SPI SCK
    output cs_sync,        // CS synced to system clock
    output need_data,      // 1 if data in buffer is empty
    input data_out_valid,  // 1 when data_out is valid
    input data_out[8],     // data to send
    output data_in_valid,  // transfer done
    output data_in[8],     // data received
) {
    .clk(clk) {
        dff cs_pipe[3]
        
        .rst(rst) {
            dff data_in_sync[3][8]
            dff data_in_valid_sync[4]
            
            dff data_out_buffer[8]
            dff data_out_full
            
            dff bit_out_ct_is_0_sync[3]
            dff bit_out_ct_is_1_to_5_sync[3]
        }
    }
    
    sig data_in_flag_rst
    .clk(sck) {
        dff data_in_flag(.arst(data_in_flag_rst))
    }
    .arst(cs) {
        .clk(sck) { // read edge
            dff data_in_buffer[8]
            dff data_in_shift[8]
            dff bit_in_ct[3]
        }
        .clk(~sck) { // write edge (opposite sck edge from the read block)
            dff data_out_shift[8]
            dff bit_out_ct[3]
        }
    }
    
    always {
        /* WRITE EDGE */
        
        data_out_shift.d = c{data_out_shift.q[6:0], 0}
        bit_out_ct.d = bit_out_ct.q + 1
        sdo = data_out_shift.q[7]
        if (bit_out_ct.q == 7 && data_out_full.q) {
            data_out_shift.d = data_out_buffer.q
        } 
        
        /* READ EDGE */
        
        data_in_shift.d = c{data_in_shift.q[6:0], sdi}
        
        bit_in_ct.d = bit_in_ct.q + 1
        if (bit_in_ct.q == 7) {
            data_in_buffer.d = c{data_in_shift.q[6:0], sdi}
            data_in_flag.d = 1
        }
        
        /* SYS CLOCK */
        
        bit_out_ct_is_0_sync.d = c{bit_out_ct_is_0_sync.q[1:0], bit_out_ct.q == 0}
        bit_out_ct_is_1_to_5_sync.d = c{bit_out_ct_is_1_to_5_sync.q[1:0], bit_out_ct.q >= 1 && bit_out_ct.q <= 5}
        
        if (bit_out_ct_is_0_sync.q[2]) {
            data_out_full.d = 0
        }
        
        sig accept_data = bit_out_ct_is_1_to_5_sync.q[2] && !data_out_full.q
        need_data = accept_data
        if (accept_data && data_out_valid) {
            data_out_buffer.d = data_out
            data_out_full.d = 1
        }
        
        data_in_sync.d = {data_in_sync.q[1], data_in_sync.q[0], data_in_buffer.q}
        data_in_valid_sync.d = c{data_in_valid_sync.q[2:0], data_in_flag.q}
        
        data_in = data_in_sync.q[2]
        data_in_valid = 0
        data_in_flag_rst = 0
        if (data_in_valid_sync.q[3:2] == 2b01) {
            data_in_valid = 1
            data_in_flag_rst = 1
        }
        
        cs_pipe.d = c{cs_pipe.q[1:0], cs}
        cs_sync = cs_pipe.q[1]
    }
}