AU SPI question (slave mode)

dheijl · December 29, 2024, 3:17pm

Update: with the Teensy 4.1 it’s slightly different with hardware SPI 1: both directions only work reliably up to 16 MHz. So I get around 1.8 MB/s.

I use that same Teensy with an ILI9341 display on hardware SPI 0 at 30 MHz without problems.

The Teensy project on GitHub is here.
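
As a rough sanity check on the throughput figure in this post: SPI moves one bit per SCK cycle in each direction, so 16 MHz gives a ceiling of 2 MB/s, and the reported 1.8 MB/s is about 90% of that. A minimal Python sketch of the arithmetic (the function name is only for illustration and is not part of the project):

```python
def spi_ceiling_bytes_per_sec(sck_hz: float, bits_per_byte: int = 8) -> float:
    """Upper bound on one-direction SPI throughput: one bit per SCK cycle."""
    return sck_hz / bits_per_byte

ceiling = spi_ceiling_bytes_per_sec(16e6)  # bytes per second at 16 MHz SCK
measured = 1.8e6                           # figure reported in the post
efficiency = measured / ceiling            # fraction of the ceiling reached

print(f"ceiling {ceiling / 1e6:.1f} MB/s, efficiency {efficiency:.0%}")
```

The remaining 10% is presumably per-transfer overhead on the Teensy side (gaps between bytes or transfers), but that is a guess and was not measured in this thread.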

dheijl · December 30, 2024, 8:06pm

Another update: I wrote a more representative SPI echo test with the Teensy 4.1, and 17 MHz also seems rock solid.

AU SPI echo source

Teensy SPI echo source

dheijl · January 3, 2025, 8:02pm

Another update:

If I output the SPI data out bit one clock cycle after the reading edge (in Mode 1) instead of waiting for the writing edge in spi_peripheral.luc, the speed increases from 17 MHz to 21 MHz. Makes sense I think.

With this change this goes up to 23 MHz in Mode 2, while Mode 1 and Mode 3 totally fail (also makes sense).

dheijl · January 4, 2025, 3:19pm

Final update: a flaky SDI jumper wire connection caused problems (replacing a ground wire added 2 MHz).

The current code works reliably up to 23 MHz without outputting SDO “early” in modes 1 and 2.

With shorter soldered connections instead of 10 cm jumper wires, it could perhaps go a bit higher still.

teensy spi speed test in action

Jflanagan · January 4, 2025, 9:49pm

I had been meaning to ask what kind of connection quality you were working with. At the frequencies you’re trying to go at, I would blame jumper wire connections before I would blame the chips.

dheijl · January 4, 2025, 11:03pm

The connections to the ILI9341 display were identical and gave no problems at 30 MHz. But I think the timing constraints of bit banging and sampling SCK, instead of using hardware SPI, made the jumper wire connection quality more critical.
The main problem was a flaky ground wire connection that caused problems above 16 MHz with the Teensy.

dheijl · January 11, 2025, 1:58pm

Using the clock wizard and running the spi_peripheral at 200 MHz I can reliably (for hours without errors) go to 29 MHz with the Teensy (github).

Being a total newbie with FPGA it’s possible that my attempts at crossing the clock domain are suboptimal (if not clumsy).

To what extent the 10 cm jumper wires between the Teensy and the Br are a limiting factor is difficult to say without a scope.

dheijl · January 12, 2025, 4:36pm

With a 300 MHz (clock wizard) SPI sample clock it works reliably up to 39 MHz.

So the jumper wires are probably influencing the signal quality enough to prevent the AU from running the SPI at the theoretical maximum of 1/4th the sampling frequency.

I also tried 400 MHz but that did not work at any frequency.

EDIT: at 300 MHz an occasional bit error occurred, probably due to the wiring…
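
To put the numbers from these two posts next to the "1/4 of the sampling frequency" rule of thumb, here is a small Python sketch. It only restates the arithmetic; the function name and the `min_ratio` parameter are illustrative assumptions, not anything from the AU project.

```python
def oversampling_report(sample_hz, sck_hz, min_ratio=4):
    """Compare a measured SCK limit with the sample_clock / min_ratio rule of thumb."""
    return {
        "rule_limit_mhz": sample_hz / min_ratio / 1e6,  # ceiling predicted by the rule
        "achieved_ratio": sample_hz / sck_hz,           # sample clocks per SCK period
    }

# (sample clock, highest reliable SCK) as reported in the posts above
for sample_hz, sck_hz in [(200e6, 29e6), (300e6, 39e6)]:
    r = oversampling_report(sample_hz, sck_hz)
    print(f"{sample_hz / 1e6:.0f} MHz sample clock: rule-of-thumb limit "
          f"{r['rule_limit_mhz']:.0f} MHz, measured {sck_hz / 1e6:.0f} MHz "
          f"({r['achieved_ratio']:.1f} sample clocks per SCK period)")
```

In both cases the measured limit works out to roughly 7 to 8 sample clocks per SCK period rather than 4, which fits the suspicion that something other than the sampling ratio (wiring, or the clock domain crossing) is the bottleneck.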

alchitry · January 12, 2025, 9:16pm

Make sure your design is passing timing. I was going to add a check for this in Alchitry Labs V2 but I don’t remember off the top of my head if it checks or not right now.

Just scroll up through the build logs a little and look for something along the lines of “all constraints met”

If timing fails to be met, it’ll still make a .bit file but it may or may not work depending on how badly it failed and the temperature/chip you got.

I’ll have to wire up a demo and see if I can reproduce your results. I have a nice scope and the tools to debug it. I haven’t heavily used the spi slave module so it likely has plenty of room for improvement.

dheijl · January 12, 2025, 10:09pm

Thanks for the advice, I’ll check the log.

But don’t waste any time on this, I’m just playing with something I didn’t know anything about some weeks ago.

alchitry · January 13, 2025, 3:18pm

I’m planning to work on this today. It should be possible to get SPI working super fast (100MHz+) with the caveat that you’ll need a dummy byte between write->read transitions to give the FPGA some time to respond. This seems pretty common among flash chips and the like with “fast” SPI modes.

dheijl · January 13, 2025, 6:40pm

I’m looking forward to your findings!

Meanwhile I have discovered that I can’t use an asynchronous fifo to cross clock domains, as it takes too long to check the full/empty flags: the extra clock cycle on the fast clock (200 MHz) ruins the SPI timing.

I solved it with a synchronous fifo that uses a clock divider on the read side so that the top can correctly read the SPI data at 100 MHz.

github spi_echo.luc

    ...snip...
        // drive outputs
        data_out = out_buf.q
        data_rdy = rdy.q
        
        // make sure the first byte is ready when SPI transfer starts
        spi.data_in = tx_buf.q
        
        // sync the 200 MHz fifo with the 100 MHz output
        clock_syncer.d = clock_syncer.q + 1
        if (clock_syncer.q == 2b01) {
            if (! fifo_out.empty) {
                data_out = fifo_out.dout
                out_buf.d = fifo_out.dout
                data_rdy = 1b1
                rdy.d = 1b1
                fifo_out.rget = 1b1
                clock_syncer.d = 2b00
            }
        } else {
            rdy.d = 1b0
        }
        
        // store any received spi char in output fifo
        if (have_output.q) {
            if (!fifo_out.full) {
                fifo_out.din = tx_buf.q
                fifo_out.wput = 1b1
                have_output.d = 1b0
            }
        } 
    ...snip...

I think it was a good first FPGA project for learning some basic stuff.

alchitry · January 13, 2025, 9:37pm

What you are doing wouldn’t be considered reliable. You’re getting away with it because the MMCM will phase align the clocks but you don’t know if you’re changing outputs on the rising or falling edge of the 100MHz clock. It may be fine with either but you’re not designing around that.

I spent some time trying to get a better SPI module working and wasn’t successful. I’ll have to think about it a bit more. It’s tricky crossing clock domains fast especially when one clock isn’t consistent.

In the meantime, I checked out the version I posted a bit higher up. It looks fairly good at 1/4 clock speeds but the round trip delay for the first bit cuts it a bit close.

Because of this, I’d say you’re safe around 1/5th the main clock, which I think is about what you were getting.

dheijl · January 13, 2025, 10:10pm

I understand that it is not reliable but the asynchronous fifo simply does not work at all here, so I see no other option. It was suggested in a Vivado forum, and it works reliably at all speeds I tested as long as the fast clock is an integral multiple of the slow clock (the clocks are in phase as this is the default in the clock wizard).

About the 1/5th: as everything I’ve read seems to agree that you need “at least” SCK x 4, I can live with that.

alchitry · January 13, 2025, 11:35pm

I actually think I’m wrong and if they’re both from the same MMCM the tools will figure it out and get the timing correct.

Why not just clock everything at 200MHz though?

alchitry · January 14, 2025, 12:31am

I got a fast version working. It requires that sck be connected to a clock pin though. You also can’t output data during the first byte (it’s always 0).

Data is asked for during bits 1-5 of the previous byte. This also means you can’t immediately respond to the first byte. The second byte will often be a dummy byte because of this.

I think it should be able to run with sck close in frequency to the main clock. I only tested it to 25MHz as my setup was a bit spicy and signal integrity failed after that.

EDIT: I should’ve also mentioned that this requires Alchitry Labs 2.0.22 which isn’t out yet as I added arst to DFFs to support asynchronous resets. Changing the arst to a normal rst should still work as long as you always send 8 bits at a time.

module spi_fast_peripheral (
    input clk,             // clock
    input rst,             // reset
    input cs,              // SPI chip select
    input sdi,             // SPI data in
    output sdo,            // SPI data out
    input sck,             // SPI SCK
    output cs_sync,        // CS synced to system clock
    output need_data,      // 1 if data in buffer is empty
    input data_out_valid,  // 1 when data_out is valid
    input data_out[8],     // data to send
    output data_in_valid,  // transfer done
    output data_in[8],     // data received
) {
    .clk(clk) {
        dff cs_pipe[3]
        
        .rst(rst) {
            dff data_in_sync[3][8]
            dff data_in_valid_sync[4]
            
            dff data_out_buffer[8]
            dff data_out_full
            
            dff bit_out_ct_is_0_sync[3]
            dff bit_out_ct_is_1_to_5_sync[3]
        }
    }
    
    sig data_in_flag_rst
    .clk(sck) {
        dff data_in_flag(.arst(data_in_flag_rst))
    }
    .arst(cs) {
        .clk(sck) { // read edge
            dff data_in_buffer[8]
            dff data_in_shift[8]
            dff bit_in_ct[3]
        }
        .clk(sck) { // write edge
            dff data_out_shift[8]
            dff bit_out_ct[3]
        }
    }
    
    always {
        /* WRITE EDGE */
        
        data_out_shift.d = c{data_out_shift.q[6:0], 0}
        bit_out_ct.d = bit_out_ct.q + 1
        sdo = data_out_shift.q[7]
        if (bit_out_ct.q == 7 && data_out_full.q) {
            data_out_shift.d = data_out_buffer.q
        } 
        
        /* READ EDGE */
        
        data_in_shift.d = c{data_in_shift.q[6:0], sdi}
        
        bit_in_ct.d = bit_in_ct.q + 1
        if (bit_in_ct.q == 7) {
            data_in_buffer.d = c{data_in_shift.q[6:0], sdi}
            data_in_flag.d = 1
        }
        
        /* SYS CLOCK */
        
        bit_out_ct_is_0_sync.d = c{bit_out_ct_is_0_sync.q[1:0], bit_out_ct.q == 0}
        bit_out_ct_is_1_to_5_sync.d = c{bit_out_ct_is_1_to_5_sync.q[1:0], bit_out_ct.q >= 1 && bit_out_ct.q <= 5}
        
        if (bit_out_ct_is_0_sync.q[2]) {
            data_out_full.d = 0
        }
        
        sig accept_data = bit_out_ct_is_1_to_5_sync.q[2] && !data_out_full.q
        need_data = accept_data
        if (accept_data && data_out_valid) {
            data_out_buffer.d = data_out
            data_out_full.d = 1
        }
        
        data_in_sync.d = {data_in_sync.q[1], data_in_sync.q[0], data_in_buffer.q}
        data_in_valid_sync.d = c{data_in_valid_sync.q[2:0], data_in_flag.q}
        
        data_in = data_in_sync.q[2]
        data_in_valid = 0
        data_in_flag_rst = 0
        if (data_in_valid_sync.q[3:2] == 2b01) {
            data_in_valid = 1
            data_in_flag_rst = 1
        }
        
        cs_pipe.d = c{cs_pipe.q[1:0], cs}
        cs_sync = cs_pipe.q[1]
    }
}
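
The `data_in_valid_sync` handling in the module above is a shift-register synchronizer followed by a rising-edge detect (`q[3:2] == 2b01`). A minimal Python model of just that piece may make the latency easier to see. It is a behavioural sketch under simplifying assumptions: one flag sample per system clock, no metastability, and the flag is simply held high instead of being cleared through `data_in_flag_rst` as the real module does.

```python
def sync_edge_pulses(flag_samples):
    """Return one bool per system-clock tick: True when stages [3:2] read 0b01,
    i.e. the older sample is 0 and the newer sample is 1 (a rising edge)."""
    q = 0  # 4-bit shift register, bit 0 holds the newest sample
    pulses = []
    for flag in flag_samples:
        # outputs are computed from the register contents before this tick's update
        pulses.append(((q >> 2) & 0b11) == 0b01)
        # on the clock edge the register shifts left and takes in the new sample
        q = ((q << 1) | (1 if flag else 0)) & 0b1111
    return pulses

# flag goes high at tick 1 and stays high for a few ticks
pulses = sync_edge_pulses([0, 1, 1, 1, 1, 1, 0, 0])
print(pulses.index(True), sum(pulses))
```

In this model the single pulse appears three system clocks after the flag is first sampled high, which is the price paid for the extra synchronizer stages.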

dheijl · January 14, 2025, 11:45am

No real reason, except perhaps me looking for a clock domain to cross or a clock domain wanting to be crossed.

dheijl · January 14, 2025, 11:54am

Back in December I also tried this Verilog implementation but it refused to build because the SCK pin I used was not suitable to be used as a clock pin (I forget the exact wording of the diagnostic) as it was on the “wrong side” of the fabric or something like that. But I had already ordered the Teensy so I decided to try the Teensy first before I tried to understand that.

I suppose that your version will also need an appropriate choice for the SCK pin on the AU as it is used as a clock?

alchitry · January 14, 2025, 7:16pm

The old schematic is kind of a pain to check which pins are clocks but if you look at the FPGA pins you’ll see MRCC and SRCC. Those pins that are also the P variant can be used as single ended clock inputs. This is easier to see on this pinout doc.

This module clocks so few DFFs off sck it is probably safe in most cases to not use a dedicated clock pin and to route it through a BUFR.

You can do this by creating this Verilog module.

module bufr (
    input in,
    output out
);
    BUFR buffer(.I(in), .O(out)); // Xilinx built-in primitive
endmodule

Then routing sck through it.

bufr sck_buffer(.in(sck))
// use sck_buffer.out instead of sck

I updated the module some more and tested it out at 50MHz after improving my test setup.

You can see it starts by replying with 0, then 0xFF (hard coded in). After that it is echoing the data received which happens two bytes later.

50MHz seems to be approaching the limits of what my crude setup through the Br Wide can do.

However, I synthesized a 50MHz signal to run the module off so this shows it working at a 1:1 clock ratio.

This module will be included in the components library of 2.0.22.
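
The reply pattern described in this post (0 first, then the hard-coded 0xFF, then each received byte echoed two bytes later) can be written down as a tiny byte-level model, for example to generate expected values for a host-side test. This Python sketch only models the behaviour as described here; the function name and the `filler` parameter are hypothetical and it is not derived from the module source.

```python
def expected_echo_reply(sent, filler=0xFF):
    """Byte-level model: reply[0] = 0, reply[1] = filler, reply[n] = sent[n - 2]."""
    reply = []
    for n in range(len(sent)):
        if n == 0:
            reply.append(0x00)         # nothing can be output during the first byte
        elif n == 1:
            reply.append(filler)       # hard-coded second byte in the demo
        else:
            reply.append(sent[n - 2])  # echo arrives two bytes after it was received
    return reply

sent = [0x11, 0x22, 0x33, 0x44, 0x55]
print([hex(b) for b in expected_echo_reply(sent)])
```

A host that wants every byte echoed back would therefore need to clock out two extra bytes at the end of the transfer.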

dheijl · January 14, 2025, 7:41pm

These pins seem to correspond with the SparkFun doc where they are labeled IO N GCLK and IO P GCLK.

Do I need something special to define one of these P pins as a clock input in the .acf file? I’d like to test that Verilog implementation again, even though it explicitly states that it needs 4 times the SCK as sample clock.

Looking forward to the next release of Labs V2 too!

PS that wide Br looks very convenient to stack something else (IO or FT) on top!