FPGA Memory Types

Designing with FPGAs involves many types of memory, some familiar from other devices and some specific to FPGAs. This how-to gives a quick overview of the different flavours, together with their strengths and weaknesses, and some sample designs. The guide includes external memory types, such as SRAM and HBM, that are also used with CPUs and GPUs, so much of what is said here is generally applicable, but the focus is on FPGAs. You might also be interested in Initialize Memory in Verilog.

To give a sense of the memory capability of small FPGAs, I’ve included figures for the Lattice iCE40 UP5K and Xilinx Spartan XC7S25. I use Kib for 1024 bits and KiB for 1024 bytes.

You can find memory designs in the Project F Verilog Library.

This is a draft post. More content to follow.

Memory Terminology

  • Address Width - how many bits are needed to address all the elements in the memory array
  • Bandwidth - how much data can be transferred by a memory interface each second
  • (Data) Width - how many bits there are in each element
  • Depth - how many elements there are in the memory array
  • Organisation - depth x width (see below)
  • Latency - how long it takes for a memory interface to start returning data

ProTip: In SystemVerilog you can use $clog2 to calculate the address width from the depth.
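
For example, sizing an address for a memory that’s 512 elements deep (names are illustrative):

    localparam DEPTH = 512;            // number of elements
    localparam ADDRW = $clog2(DEPTH);  // 9 bits to address 512 elements
    logic [ADDRW-1:0] addr;            // address signal sized to match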

Memory Organisation

Memory is described by its organisation: depth x width

For example, a 4 Mib (megabit) SRAM could be organised as 512K x 8, which means it has 524288 (512 x 1024) locations, each of which holds 8 bits.

512K x 8 has 2^19 locations, so the address bus is 19 bits with an 8-bit data bus.

The same memory capacity could also be organised as 256K x 16, which means it has 262144 (256 x 1024) locations, each of which holds 16 bits.

256K x 16 has 2^18 locations, so the address bus is 18 bits with a 16-bit data bus.

The 16-bit organisation can transfer twice as much data per clock cycle but requires 7 more signals. In general, wider memories increase bandwidth but require more signals.
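
From the FPGA side, the two organisations above look like this (signal names are illustrative):

    // 512K x 8: 19 address + 8 data = 27 signals (plus control)
    logic [18:0] addr_512kx8;
    logic  [7:0] data_512kx8;

    // 256K x 16: 18 address + 16 data = 34 signals (plus control)
    logic [17:0] addr_256kx16;
    logic [15:0] data_256kx16;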

Flip-Flops

Flip-flops are the state keepers of FPGAs. Each flip-flop holds one bit.

When you create a simple counter, you’re using flip-flops:

    reg [15:0] cnt = 0;  // 16 flip-flops
    always @(posedge clk) begin
        cnt <= cnt + 1;  // add one to the counter on every positive clock edge
    end

Flip-flops let you break complex logic into multiple steps (pipelining) to run at a higher clock speed. Without flip-flops, signals would have to traverse your whole design in a single clock cycle; one complex piece of logic would slow everything down.
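
For example, a hypothetical two-stage pipeline registers the product before adding it to a third value, so each clock cycle only has to cover one multiply or one add:

    reg [15:0] a, b, c;
    reg [31:0] prod;
    reg [32:0] result;
    always @(posedge clk) begin
        prod   <= a * b;     // stage 1: multiply
        result <= prod + c;  // stage 2: add using the registered product
    end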

Other uses for flip-flops include state machines, delaying signals, and CPU registers.
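
For example, delaying a signal by two clock cycles takes two flip-flops in series (names are illustrative):

    reg sig_d1, sig_d2;
    always @(posedge clk) begin
        sig_d1 <= sig;     // sig delayed by one cycle
        sig_d2 <= sig_d1;  // sig delayed by two cycles
    end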

Flip-flops are great for saving state, but they’re only suited to the smallest memories: their numbers are limited, they’re spread throughout the FPGA (making larger memories hard to route), and they don’t support multiple ports.

The Lattice iCE40 UP5K has 5280 flip-flops, while the Xilinx Spartan 7S25 has 29200.

Distributed RAM

Distributed ram is built with LUTs. LUTs are usually used to create the logic of your design, but can also support memory in some FPGAs. Distributed ram is, as its name suggests, distributed throughout the FPGA. A single 6-input LUT can store 64 bits, while a 4-input LUT can store 16 bits.

Distributed ram is read asynchronously, but written to synchronously (writes require a clock). Writes are limited to a single port, but you can read from up to four ports in some FPGAs. Distributed ram is flexible in the data width it supports: for example, if you’re dealing with 32-level data, you can use a width of 5 bits.

Given the asynchronous nature of reads, distributed ram is ideal for fast buffers: you can read a value immediately, rather than waiting for the next clock tick. You can also use distributed ram to create small ROMs. However, distributed ram is not suited to large memories: above about 128 bits (based on Xilinx 7 Series), you’ll get better performance and lower power consumption using block ram (see next section).

In Xilinx 7 Series FPGAs, only LUTs in SLICEM blocks may be used as memory. A Spartan 7S25 FPGA has 14600 6-input LUTs, of which 5000 are SLICEM, so you have a maximum of: 5000 x 64 bits = 320000 bits.

For more on Xilinx distributed ram see UG474: 7 Series FPGAs Configurable Logic Block.

The following SystemVerilog example shows a ROM module using distributed ram:

module rom_async #(
    parameter WIDTH=8, 
    parameter DEPTH=256, 
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic [ADDRW-1:0] addr,
    output     logic [WIDTH-1:0] data
    );

    logic [WIDTH-1:0] memory [DEPTH];

    initial begin
        if (INIT_F != 0) begin
            $display("Creating rom_async from init file '%s'.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    always_comb data = memory[addr];
endmodule

To learn more about loading memory with $readmemh and $readmemb see Initialize Memory in Verilog.
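
Distributed ram can also be written at run time. The following sketch shows a simple dual-port distributed ram with a synchronous write and an asynchronous read; the module name and ports are illustrative, and the synthesis tool is assumed to infer LUT ram for a memory this small:

module ram_async_sketch #(
    parameter WIDTH=8,
    parameter DEPTH=64,
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,
    input wire logic we,                        // write enable
    input wire logic [ADDRW-1:0] addr_write,    // write address
    input wire logic [ADDRW-1:0] addr_read,     // read address
    input wire logic [WIDTH-1:0] data_in,
    output     logic [WIDTH-1:0] data_out
    );

    logic [WIDTH-1:0] memory [DEPTH];

    // synchronous write
    always_ff @(posedge clk) begin
        if (we) memory[addr_write] <= data_in;
    end

    // asynchronous read: no need to wait for a clock edge
    always_comb data_out = memory[addr_read];
endmodule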

Block RAM

Block ram (BRAM) is implemented using dedicated ram circuitry within the FPGA and is ideal for memories from a few hundred bits up to hundreds of kilobits. But BRAM is not just better than distributed ram for larger memories: I’d go so far as to say that block RAM is one of the great things about developing hardware with FPGAs.

Being written…

Xilinx BRAM

Xilinx 7 Series FPGAs have 36 Kib BRAM blocks with two ports and up to 72-bit data width. Each port has an independent clock, so you can share data across clock domains. For example, you could have a RISC-V CPU accessing a block ram at 180 MHz, while your custom hardware accesses the same block at 133 MHz.

BRAMs are quite flexible in organisation; with two read/write ports (true dual-port BRAM) you can have data widths of 1, 2, 4, 9, 18, and 36 bits. The latter widths are multiples of 9 because of ECC (error correction) support. For example, if you have 4-bit data then the organisation would be 8K x 4 (only 32 of the 36 Kib are used in this case). If you restrict yourself to one read and one write port (simple dual-port BRAM), then you can have 72-bit wide data.

Each 36 Kib block may also be split into two independent 18 Kib BRAMs, though the maximum data width of each is reduced.

  • Inference vs Primitives
  • Larger memories with multiple blocks…
  • Byte-level writes…
  • Block ram columns…
  • Collisions occur when both ports access the same address…
  • Output registers for higher clock speeds…

For more details on Xilinx BRAM see UG473: 7 Series FPGAs Memory Resources.

BRAM Data Sheet Capacity
Be careful when interpreting the headline BRAM capacity on FPGA data sheets. For example, if you look at Xilinx’s 7 Series Overview, you’ll see that the Spartan 7S25 has 1620 Kib of block ram. However, you shouldn’t think of this as having ~200 KiB of ram like you would on a microcontroller. This figure includes the 9th bit usually used for error correction; for 32-bit data the capacity is 1440 Kib (1620 x 8/9). More importantly, BRAM is composed of many small blocks spread across the FPGA: you can’t expect to combine them all into one big memory and get good performance.

iCE40 BRAM

The iCE40 UP5K has 4 Kib BRAMs…

Being written…

For more details on Lattice BRAM see Memory Usage Guide for iCE40 Devices.

FIFOs

First in, first out…
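
A minimal synchronous FIFO sketch (the module name, interface, and full/empty logic are illustrative, not from the Project F library; DEPTH is assumed to be a power of two):

module fifo_sync_sketch #(
    parameter WIDTH=8,
    parameter DEPTH=16,
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,
    input wire logic rst,
    input wire logic push,                   // write data_in when high (ignored if full)
    input wire logic pop,                    // read to data_out when high (ignored if empty)
    input wire logic [WIDTH-1:0] data_in,
    output     logic [WIDTH-1:0] data_out,
    output     logic empty,
    output     logic full
    );

    logic [WIDTH-1:0] memory [DEPTH];
    logic [ADDRW:0] w_ptr, r_ptr;            // extra bit distinguishes full from empty

    always_comb begin
        empty = (w_ptr == r_ptr);
        full  = (w_ptr[ADDRW] != r_ptr[ADDRW]) && (w_ptr[ADDRW-1:0] == r_ptr[ADDRW-1:0]);
    end

    always_ff @(posedge clk) begin
        if (push && !full) begin
            memory[w_ptr[ADDRW-1:0]] <= data_in;
            w_ptr <= w_ptr + 1;
        end
        if (pop && !empty) begin
            data_out <= memory[r_ptr[ADDRW-1:0]];  // data appears the cycle after pop
            r_ptr <= r_ptr + 1;
        end
        if (rst) begin                             // sync reset clears the pointers
            w_ptr <= 0;
            r_ptr <= 0;
        end
    end
endmodule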

Example Modules

The following example shows a dual-port block ram module for iCE40 (a port can only be read or write on this FPGA):

New example to be added

The following example shows a simple Xilinx block ram module using a single read and a single write port (ports may be both read and write):

module bram_basic_xc7 #(
    parameter WIDTH=8, 
    parameter DEPTH=256, 
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,                       // clock (port a & b)
    input wire logic we,                        // write enable (port a)
    input wire logic [ADDRW-1:0] addr_write,    // write address (port a)
    input wire logic [ADDRW-1:0] addr_read,     // read address (port b)
    input wire logic [WIDTH-1:0] data_in,       // data in (port a)
    output     logic [WIDTH-1:0] data_out       // data out (port b)
    );

    /* verilator lint_off MULTIDRIVEN */
    logic [WIDTH-1:0] memory [DEPTH];
    /* verilator lint_on MULTIDRIVEN */

    initial begin
        if (INIT_F != 0) begin
            $display("Loading memory init file '%s' into bram_basic.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    // Port A: Sync Write
    always_ff @(posedge clk) begin
        if (we) begin
            memory[addr_write] <= data_in;
        end
    end

    // Port B: Sync Read
    always_ff @(posedge clk) begin
        data_out <= memory[addr_read];
    end
endmodule
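
To use the independent port clocks mentioned earlier, give each port of a Xilinx BRAM its own clock. A minimal variation on the module above, where clk_write and clk_read are illustrative names replacing the shared clk:

    // Port A: sync write in the write clock domain
    always_ff @(posedge clk_write) begin
        if (we) memory[addr_write] <= data_in;
    end

    // Port B: sync read in the read clock domain
    always_ff @(posedge clk_read) begin
        data_out <= memory[addr_read];
    end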

The following example shows a synchronous ROM module that should work with both Xilinx and Lattice BRAM:

module rom_sync #(
    parameter WIDTH=8, 
    parameter DEPTH=512, 
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,
    input wire logic [ADDRW-1:0] addr,
    output     logic [WIDTH-1:0] data
    );

    logic [WIDTH-1:0] memory [DEPTH];

    initial begin
        if (INIT_F != 0) begin
            $display("Creating rom_sync from init file '%s'.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    always_ff @(posedge clk) begin
        data <= memory[addr];
    end
endmodule

UltraRAM

UltraRAM is a type of memory available in Xilinx UltraScale and UltraScale+ FPGAs. UltraRAM is like block ram on steroids: bigger but less agile. The blocks are 288 Kib in size (eight times larger than regular BRAM): combining all the blocks in a column gives you up to 36 Mib of memory to play with. However, while UltraRAM is dual-ported, it does not support independent clocks on each port, and is a fixed 72 bits wide.
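
Vivado can infer UltraRAM from suitable HDL, and the ram_style attribute nudges it in the right direction. A minimal sketch, assuming an UltraScale+ part and illustrative signal names:

    (* ram_style = "ultra" *)         // ask Vivado to use UltraRAM
    logic [71:0] memory [32768];      // 32K x 72 = 2304 Kib (eight UltraRAM blocks)

    always_ff @(posedge clk) begin    // one clock shared by both ports
        if (we) memory[addr_write] <= data_in;
        data_out <= memory[addr_read];
    end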

High Bandwidth Memory

Being written…

Better known as HBM - Bandwidth, bandwidth and more bandwidth…

Found only on high-end FPGAs and graphics cards…

Static RAM

Being written…

If you were to imagine a memory chip, you’d probably imagine something like static ram (SRAM). You provide the address of the element you want on the address bus, wait some nanoseconds, and the data is returned on the data bus.

SRAM’s big issue is cost, which also limits the capacities it’s available in.

Few recent FPGA dev boards include SRAM, which is a shame, as this ram is ideal for beginners and low-latency hardware designs. One that does is the Digilent Cmod A7, which features 4 Mib (512K x 8) of 8ns SRAM.

Static RAM is available in both synchronous and asynchronous types…

Asynchronous SRAM

Async SRAM doesn’t use a clock… speed in nanoseconds

Async SRAM is most commonly ~3V, which works well with current FPGA designs.

4 Mib asynchronous SRAM ICs are cheap (for SRAM) and widely available at speeds of 10 ns with an 8 or 16-bit data bus. A 4 Mib (512 KiB) 10 ns SRAM costs around $2 in small quantities. A 4 Mib (512K x 8) SRAM has 8 data pins and 19 address pins, plus typically three control pins: write enable, output enable, and chip select.
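
On the FPGA side, the trickiest part of an async SRAM interface is the bidirectional data bus. A minimal sketch of the pins and tristate control (names are illustrative; the control signals are active low on typical parts):

    logic [18:0] sram_addr;                        // 19-bit address (512K locations)
    wire  [7:0]  sram_data;                        // bidirectional data bus
    logic [7:0]  write_data;
    logic        drive_bus;                        // high while writing
    logic        sram_we_n, sram_oe_n, sram_ce_n;  // write enable, output enable, chip select

    // drive the bus for writes, release it (high impedance) for reads
    assign sram_data = drive_bus ? write_data : 8'bz;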

Synchronous SRAM

Faster SRAM is clocked… speed in megahertz… not always full random I/O… pipelined vs flow-through…

The iCE40 UP FPGAs include synchronous SRAM; see SPRAM on iCE40 FPGA.
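
Each SPRAM block is 256 Kib (16K x 16) and can be instantiated directly. A minimal sketch, assuming the SB_SPRAM256KA primitive from the Lattice library (the surrounding signals are illustrative):

    logic [13:0] addr;
    logic [15:0] data_in, data_out;
    logic we;

    SB_SPRAM256KA spram_inst (
        .ADDRESS(addr),      // 14-bit address (16K locations)
        .DATAIN(data_in),    // 16-bit write data
        .MASKWREN(4'b1111),  // write all four nibbles
        .WREN(we),           // write enable
        .CHIPSELECT(1'b1),
        .CLOCK(clk),
        .STANDBY(1'b0),
        .SLEEP(1'b0),
        .POWEROFF(1'b1),     // active low: 1 = powered up
        .DATAOUT(data_out)   // 16-bit read data
    );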

Dynamic RAM

Being written…

DRAM is the everyday memory that forms the main memory in your PC. Dynamic ram needs to be refreshed periodically, otherwise the data is lost. It’s cheap and offers plenty of bandwidth, but DRAM is complex to interface with: you’re almost certainly going to have to use vendor IP blocks and a cache to make DRAM usable. If you’re accessing data sequentially, or interfacing with a CPU via a cache, then DRAM works well, but for random I/O the high latency is a significant issue.

Pseudo SRAM

Being written…

SRAM is expensive, DRAM is complex, and both use a significant number of I/O pins. By using a self-refreshing circuit and a simplified interface, similar to SPI flash, a new type of ram can be created. This type of ram is usually referred to as Pseudo SRAM.

The two interfaces you’ll come across are HyperRAM, created by Cypress, and xSPI (OctalSPI), standardised by JEDEC in early 2020. In both cases you only need 11 or 12 FPGA pins to interface with the ram. As of summer 2020 this ram is available in capacities up to 256 Mib (32 MiB) in 1.8V and 3V. 64 Mib (8 MiB) parts cost around $3 in small quantities.

We haven’t seen this ram on many FPGA dev boards, but Kevin Hubbard’s open source HyperRAM Pmod has proved popular, and is available pre-assembled from 1BitSquared. I hope we’ll see an updated version with faster xSPI ram in due course.

What’s Next?

Check out my FPGA demos or the FPGA graphics tutorials.

Have a question or suggestion? Contact @WillFlux or join me on Project F Discussions or 1BitSquared Discord. If you like what I do, consider sponsoring me on GitHub. Thank you.

The wonderful image of the Micron MT4C1024 DRAM used in the social media card for this post comes from Zeptobars and is licensed under a Creative Commons licence.