Fixed Point Numbers in Verilog

Welcome to the FPGA Cookbook.

This is part of a series of handy recipes to solve common FPGA development problems. Look out for more FPGA cookbook posts soon.

You want to use fixed point numbers in Verilog.

Sometimes you need more precision than integers can provide, but floating point is hard (try reading IEEE 754). You could use a library or IP block, but simple fixed point maths can often get the job done with little effort. Furthermore, most FPGAs have dedicated DSP blocks that make multiplication and addition of integers really fast; we can take advantage of that with a fixed-point approach.

Feedback to @WillFlux is most welcome. Last updated August 2018.

What is a Fixed Point Number?

In a regular binary integer each bit represents a power of two, with the least significant bit having a weight of 1 (2^0).

For example, decimal 13 is 1101 in binary: 8 + 4 + 1 = 13.

With decimal numbers we're used to the idea of using a decimal separator, a point or comma, to separate integer and fractional parts. 1001 is one thousand and one, whereas 10.01 is ten and one hundredth.

We can do the same thing in binary and use the bits to represent any powers of two we like.

For example we can think of 4.75 as being 4 + 1/2 + 1/4.

In binary terms this can be visualized as:

8   4   2   1  1/2 1/4 1/8 1/16
-------------------------------
0   1   0   0   1   1   0   0

Or, with a handy point to mark the fractional part: 0100.1100.

We're choosing to interpret the value as being fixed point, but from an FPGA logic point of view it's just an 8-bit integer. This doesn't matter: provided we are consistent in the position of the fixed point we'll get the expected result from mathematical operations. See the end of this post if you have existing integers you need to convert to fixed point.

Q Notation

To express the number of integer and fractional bits we use Q number format: Qi.f where i is the number of integer bits and f is the number of fractional bits. 0100.1100 has four integer and four fractional bits, so is Q4.4. We'll mostly use Q4.4 in this post to keep the examples manageable.
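As a rough sketch of how Qi.f maps onto a Verilog declaration: the vector is i + f bits wide and the scaling factor is 2^-f. The module and parameter names here (qformat_demo, I_BITS, F_BITS) are made up for illustration, and $itor is only used to turn the raw integer into a real number for display.

```verilog
module qformat_demo();
    localparam I_BITS = 4;  // integer bits (the i in Qi.f)
    localparam F_BITS = 4;  // fractional bits (the f in Qi.f)

    reg signed [I_BITS+F_BITS-1:0] x;  // 8 bits wide for Q4.4

    initial begin
        x = 8'b0100_1100;  // 0100.1100
        // The raw integer is 76; scaled by 2^-4 it reads as 4.75
        $display("raw = %0d, value = %f", x, $itor(x) * 2.0**(-F_BITS));
    end
endmodule
```

Changing F_BITS moves the point without touching the stored bits, which is all a Q format really is.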

Maths (or Math) Just Works!

All the usual binary maths works when used with fixed point numbers. Verilog can generally synthesize addition, subtraction, and multiplication on an FPGA. Division is harder to synthesize (see below), but we can multiply by fractional numbers, e.g. multiply by 0.1 instead of dividing by 10.

Addition works in exactly the same way as for integers:

. 0011.1010        3.6250
+ 0100.0001      + 4.0625
= 0111.1011      = 7.6875

We can also do subtraction using two's complement to express the negative number:

. 0011.1010        3.6250
+ 1110.1000      - 1.5000
= 0010.0010      = 2.1250

How did we get the two's complement for -1.5?

1.5 = 1 + 1/2 = 0001.1000

Start:  0001.1000 (1.5)
Invert: 1110.0111
Add 1:  1110.1000 (-1.5)
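The same steps can be sketched in Verilog. This is a minimal illustration with made-up names (negate_demo, by_hand, by_operator). Note that the 1 we add is one least significant bit, not 1.0 in Q4.4.

```verilog
module negate_demo();
    reg signed [7:0] x;
    reg signed [7:0] by_hand;
    reg signed [7:0] by_operator;

    initial begin
        x = 8'b0001_1000;      // 0001.1000 = 1.5
        by_hand = ~x + 8'd1;   // invert, then add 1: 1110.1000
        by_operator = -x;      // unary minus produces the same bits
        $display("%b %b", by_hand, by_operator);
    end
endmodule
```

In practice you'd just write -x and let the tools do the work; the by-hand version is only there to show the two agree.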

Multiplication

Multiplication works as expected too, but the product contains twice the number of bits. Multiplying two Q4.4 numbers results in a Q8.8 product:

. 0011.0100        3.2500
x 0010.0001        2.0625

You can do the long multiplication to produce:

00000110.10110100 = 6.703125

We can truncate this to our original Q4.4 format by taking the eight middle bits:

0000[0110.1011]0100 = 0110.1011 = 6.6875

You can see we lost some precision there. This often happens when you convert to the original precision after multiplication. If our result had been 8 or higher it would have overflowed our Q4.4 format because the largest integer we can represent with four bits in two's complement is 7. Range and precision are discussed in more detail below.
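Taking the middle bits simply drops the low four bits of the product. If you'd prefer round-to-nearest, one common approach (not part of the example above) is to add half of the smallest Q4.4 step to the Q8.8 product before slicing. The sketch below uses different operands from the example above, chosen so the two methods give different answers; the names round_demo and ab_half are hypothetical.

```verilog
module round_demo();
    reg signed [7:0] a;
    reg signed [7:0] b;
    reg signed [7:0] truncated;
    reg signed [7:0] rounded;
    reg signed [15:0] ab;
    reg signed [15:0] ab_half;  // product plus half a Q4.4 step

    localparam sf = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

    initial begin
        a = 8'b0001_1001;         // 1.5625
        b = 8'b0001_0111;         // 1.4375
        ab = a * b;               // 00000010.00111111 = 2.24609375
        truncated = ab[11:4];     // 0010.0011 = 2.1875
        ab_half = ab + 16'sd8;    // bit 3 of Q8.8 is half a Q4.4 step
        rounded = ab_half[11:4];  // 0010.0100 = 2.2500
        $display("truncated %f, rounded %f",
                 $itor(truncated)*sf, $itor(rounded)*sf);
    end
endmodule
```

Bear in mind the extra addition can itself push a value near the top of the range into overflow, so this isn't a free upgrade.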

Division

Division is hard. Depending on your tools, the / operator may not synthesize at all, or may produce large, slow logic, unless you're dividing by a power of two, which is just a right shift. A right shift truncates the result, so we lose the fractional part:

0111    7
>>1    right-shift 1 bit
0011    3

However, with fractional numbers we can do accurate division by a constant using multiplication.

For example, to divide by 2 we multiply by 0.5:

. 0111.1000             7.5000
x 0000.1000             0.5000
= 00000011.11000000     3.7500

Back in Q4.4: 0011.1100. In this case our answer is exact.
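The answer is only exact when the reciprocal fits the format. As a hedged illustration, here is division by 5 using the nearest Q4.4 value to 0.2; the module name recip_demo and the register recip are invented for this sketch.

```verilog
module recip_demo();
    reg signed [7:0] a;
    reg signed [7:0] recip;
    reg signed [7:0] c;
    reg signed [15:0] ab;

    localparam sf = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

    initial begin
        a = 8'b0111_1000;      // 7.5000
        recip = 8'b0000_0011;  // 0.1875, nearest Q4.4 value to 1/5 = 0.2
        ab = a * recip;        // 00000001.01101000 = 1.40625
        c = ab[11:4];          // 0001.0110 = 1.3750 (exact answer is 1.5)
        $display("%f / 5 is roughly %f", $itor(a)*sf, $itor(c)*sf);
    end
endmodule
```

With only four fractional bits the error is obvious; giving the reciprocal more fractional bits shrinks it.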

Alas, if your denominator is not a constant this is no help. I am working on a separate post on general division, but for now let's see how the calculations so far look in Verilog.

In Verilog

You can't use a point '.' in Verilog binary literals, but you can use an underscore (which is safely ignored). The following module demonstrates all the calculations we performed above:

module fixedtest();
    reg signed [7:0] a;
    reg signed [7:0] b;
    reg signed [7:0] c;
    reg signed [15:0] ab;  // large enough for product

    localparam sf = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

    initial begin
        $display("Fixed Point Examples by TimeToExplore.net.");

        a = 8'b0011_1010;  // 3.6250
        b = 8'b0100_0001;  // 4.0625
        c = a + b;         // 0111.1011 = 7.6875
        $display("%f + %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);

        a = 8'b0011_1010;  // 3.6250
        b = 8'b1110_1000;  // -1.5000
        c = a + b;         // 0010.0010 = 2.1250
        $display("%f + %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);

        a = 8'b0011_0100;  // 3.2500
        b = 8'b0010_0001;  // 2.0625
        ab = a * b;        // 00000110.10110100 = 6.703125
        c = ab[11:4];      // take middle 8 bits: 0110.1011 = 6.6875
        $display("%f x %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);

        a = 8'b0111_1000;  // 7.5000
        b = 8'b0000_1000;  // 0.5000
        ab = a * b;        // 00000011.11000000 = 3.7500
        c = ab[11:4];      // take middle 8 bits: 0011.1100 = 3.7500
        $display("%f x %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);
    end
endmodule

The $itor function converts an integer into a real number we can display. We multiply our results by the scaling factor 2^-4 (that is, divide by 2^4) to account for the lower four bits being fractional. For example, 00101000 would normally be interpreted as 40 (32+8), but we want it to be 2.5 (2 + 1/2). If we divide 40 by 2^4 = 16 we get the desired 2.5.

Large Numbers, Small Numbers

If the numbers you're using are very small or large you can improve your accuracy by scaling them so they fit comfortably into the range of your fixed point numbers.

Let's say we want to perform the following: 3/256 x 7.

The smallest fraction we can represent in Q4.4 is 1/16, so 3/256 can't be expressed. Rather than use a larger precision we can scale the smaller number up. If we multiply 3/256 by 2^6 we get 0.75, which comfortably fits within our precision. Because we've multiplied one of our numbers by 2^6 we need to account for this in the scaling factor, so we now divide by 2^10 (that's 4 bits for our Q4.4 format and 6 bits for the adjusted scaling):

a = 8'b0000_1100;  // 0.7500  (0.75 = 3/256 x 2^6)
b = 8'b0111_0000;  // 7.0000
ab = a * b;        // 00000101.01000000
c = ab[11:4];      // take middle 8 bits: 0101.0100
$display("%f", $itor(c)*2.0**-10.0);  // divide result by 2^10

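We get the result 0.082031, which is the exact value of 0.08203125 printed to the six decimal places that %f shows. This example happened to lose nothing; when a result does lose low bits, improving the accuracy further would require using more bits.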

Range & Precision

You can get a larger range by using more integer bits and better precision by using more fractional bits. However, using more bits will increase the amount of FPGA logic required. Consider what your problem requires and the capabilities of your FPGA.

Q4.4

Using two's complement a 4-bit value ranges from -8 (1000) to +7 (0111). A 4-bit fraction can represent numbers as small as 1/16 (0001) and as large as 15/16 (1111).

Range:      -8 to 7.9375 (7 + 15/16)
Precision:  0.0625 (1/16)
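You can confirm these limits by displaying the extreme bit patterns. This is a small sketch with an invented module name (range_demo), using the same $itor scaling trick as elsewhere in this post.

```verilog
module range_demo();
    reg signed [7:0] q;

    localparam sf = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

    initial begin
        q = 8'b1000_0000;  // most negative value: -8.0000
        $display("min  = %f", $itor(q)*sf);
        q = 8'b0111_1111;  // most positive value: 7.9375
        $display("max  = %f", $itor(q)*sf);
        q = 8'b0000_0001;  // smallest step: 0.0625
        $display("step = %f", $itor(q)*sf);
    end
endmodule
```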

Q16.16

Using two's complement a 16-bit value ranges from -32,768 to +32,767. A 16-bit fraction can represent numbers as small as 1/65,536 and as large as 65,535/65,536.

Range:      -32,768 to 32,767.9999847...
Precision:  0.0000152...

According to Wikipedia, Doom used a Q16.16 representation:

"...for all of its non-integer computations, including map system, geometry, rendering, player movement etc. This was done in order for the game to be playable on 386 and 486SX CPUs without an FPU. For compatibility reasons, this representation is still used in modern Doom source ports."

Overflow

All results so far have fitted in our Q4.4 representation, but this doesn't always happen. Let's consider what happens when we multiply 6.5 by 4:

. 0110.1000        6.5000
x 0100.0000        4.0000
00011010.00000000 26.0000

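If we take the eight middle bits we get 1010.0000, which is -6. Our Q4.4 representation can't hold 26. For a signed product, the result fits only if the top five bits (the four we discard plus the sign bit of the part we keep) are all the same: all 0 for a positive result, all 1 for a negative one. Here they are 00011, so we've overflowed. This is easy to check for if required. You can also check for a loss of precision by looking for any 1s in the four least significant bits.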
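Here's one way the checks might look in Verilog. To cope with negative products as well as positive ones, this sketch tests five bits: the four discarded high bits plus the sign bit of the kept result, which must all match for the value to fit. The module and flag names (mul_check_demo, overflow, inexact) are hypothetical.

```verilog
module mul_check_demo();
    reg signed [7:0] a;
    reg signed [7:0] b;
    reg signed [7:0] c;
    reg signed [15:0] ab;
    reg overflow;
    reg inexact;

    initial begin
        a = 8'b0110_1000;  // 6.5
        b = 8'b0100_0000;  // 4.0
        ab = a * b;        // 00011010.00000000 = 26.0
        c = ab[11:4];      // 1010.0000 = -6.0 (wrong)

        // The product fits Q4.4 only if ab[15:11] are all 0s or all 1s,
        // i.e. every discarded high bit is a copy of the kept sign bit
        overflow = (ab[15:11] != 5'b00000) && (ab[15:11] != 5'b11111);
        // Any 1 in the discarded low bits means precision was lost
        inexact = (ab[3:0] != 4'b0000);
        $display("overflow=%b inexact=%b", overflow, inexact);
    end
endmodule
```

For this example the top five bits are 00011, so overflow is set, while the low four bits are all zero, so nothing was lost to truncation.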

Overflow can also happen with addition:

. 0110.1010        6.6250
+ 0100.0001     +  4.0625
= 1010.1011     = -5.3125 ?!

And if our result is less than -8 it will overflow in the negative direction:

. 1010.0000       -6.0000
+ 1010.0000     + -6.0000
= 0100.0000     =  4.0000 ?!

Thankfully it's not too hard to detect this. If both our operands are positive and the result is negative then overflow must have occurred. Similarly if both our operands are negative and the result is positive then it has overflowed too.
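The sign comparison described above can be sketched as follows, run against both of the overflowing additions. The module name add_check_demo and the overflow flag are made up for illustration.

```verilog
module add_check_demo();
    reg signed [7:0] a;
    reg signed [7:0] b;
    reg signed [7:0] c;
    reg overflow;

    initial begin
        a = 8'b0110_1010;  // 6.6250
        b = 8'b0100_0001;  // 4.0625
        c = a + b;         // 1010.1011 = -5.3125 (wrapped)
        // Overflow: operands share a sign but the result's sign differs
        overflow = (a[7] == b[7]) && (c[7] != a[7]);
        $display("overflow=%b", overflow);

        a = 8'b1010_0000;  // -6.0
        b = 8'b1010_0000;  // -6.0
        c = a + b;         // 0100.0000 = 4.0 (wrapped)
        overflow = (a[7] == b[7]) && (c[7] != a[7]);
        $display("overflow=%b", overflow);
    end
endmodule
```

Both cases set the flag. Adding a positive and a negative number can never overflow, which is why the check starts by comparing the operand signs.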

Converting to and from Integers

Conversion to and from regular binary integers simply requires using the appropriate left or right shift. For example, if we're using Q16.16 format we need to left-shift an integer 16 positions to create the Q16.16 fixed point number:

101010  // decimal 42
<< 16   // left shift 16 positions
1010100000000000000000
0000000000101010.0000000000000000  // with all bits shown in Q16.16 notation

To make long binary numbers easier to read it's common practice to include spaces. For example we could represent decimal 42 in Q16.16 as:

0000 0000 0010 1010.0000 0000 0000 0000

We can then perform all the usual arithmetic operations on it with other Q16.16 numbers. For example, let's add decimal 2.875:

0000 0000 0010 1010.0000 0000 0000 0000  //  42
0000 0000 0000 0010.1110 0000 0000 0000  // + 2.875
0000 0000 0010 1100.1110 0000 0000 0000  //  44.875

If you need to convert a Q16.16 number back to a regular integer you right shift 16 positions.

0000 0000 0010 1100.1110 0000 0000 0000  // 44.875
>> 16  // right shift 16 positions
0000 0000 0010 1100 // 44

This removes (truncates) the fractional part, so the answer ends up being 101100 (decimal 44).
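The round trip above might look like this in Verilog. It's a minimal sketch with invented names (convert_demo, n, q, back), using the signed shift operators <<< and >>> from Verilog-2001 so that negative values keep their sign.

```verilog
module convert_demo();
    reg signed [15:0] n;     // ordinary 16-bit integer
    reg signed [31:0] q;     // Q16.16 value
    reg signed [15:0] back;  // integer recovered from q

    initial begin
        n = 16'sd42;
        q = n <<< 16;            // 42.0 in Q16.16
        q = q + 32'sh0002_E000;  // add 2.875, giving 44.875
        back = q >>> 16;         // arithmetic shift drops the fraction: 44
        $display("%f -> %0d", $itor(q) * 2.0**-16.0, back);
    end
endmodule
```

One thing to watch: an arithmetic right shift rounds towards negative infinity, so -1.5 in Q16.16 becomes -2 rather than -1.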