Performing Fixed-Point Arithmetic

Fixed-Point Arithmetic

Addition and subtraction

Whenever you add two fixed-point numbers, you may need a carry bit to correctly represent the result. For this reason, when adding two B-bit numbers (with the same scaling), the resulting value has an extra bit compared to the two operands used.

a = fi(0.234375,0,4,6);
c = a+a

c = 

    0.4688

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Unsigned
            WordLength: 5
        FractionLength: 6

a.bin

ans =

1111

c.bin

ans =

11110

If you add or subtract two numbers with different precision, the radix points must first be aligned before the operation can be performed. As a result, the word length of the result can exceed the word lengths of the operands by more than one bit.

a = fi(pi,1,16,13);
b = fi(0.1,1,12,14);
c = a + b

c = 

    3.2416

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 18
        FractionLength: 14
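
The word length of 18 follows from the alignment step. The sketch below, which assumes the default FullPrecision SumMode, recomputes the fraction length and word length of the sum by hand and compares them with the result of a + b. The names fracLen, intLen, and wordLen are illustrative only.

```matlab
a = fi(pi,1,16,13);    % 16 - 13 = 3 bits left of the binary point (sign included)
b = fi(0.1,1,12,14);   % 12 - 14 = -2 bits left of the binary point

% Align the binary points: keep the longer fraction and the longer
% integer part, then add one bit for a possible carry.
fracLen = max(a.FractionLength, b.FractionLength);        % 14
intLen  = max(a.WordLength - a.FractionLength, ...
              b.WordLength - b.FractionLength) + 1;       % 3 + 1 = 4
wordLen = intLen + fracLen;                               % 18

c = a + b;
fprintf('by hand: %d,%d   fi: %d,%d\n', ...
    wordLen, fracLen, c.WordLength, c.FractionLength);
```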

Multiplication

In general, a full-precision product requires a word length equal to the sum of the word lengths of the operands. In the following example, the word length of the product c is equal to the word length of a plus the word length of b. The fraction length of c is likewise equal to the fraction length of a plus the fraction length of b.

a = fi(pi,1,20), b = fi(exp(1),1,16)

a = 

    3.1416

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 20
        FractionLength: 17

b = 

    2.7183

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 13

c = a*b

c = 

    8.5397

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 36
        FractionLength: 30

Math with other built-in data types

Note that in C, the result of an operation between an integer data type and a double data type promotes to a double. However, in MATLAB®, the result of an operation between a built-in integer data type and a double data type is an integer. In this respect, the fi object behaves like the built-in integer data types in MATLAB.

When doing addition between fi and double, the double is cast to a fi with the same numerictype as the fi input. The result of the operation is a fi. When doing multiplication between fi and double, the double is cast to a fi with the same word length and signedness as the fi, and best-precision fraction length. The result of the operation is a fi.

a = fi(pi)

a = 

    3.1416

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 13

b = 0.5 * a

b = 

    1.5708

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 32
        FractionLength: 28
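
The addition rule can be checked in the same way. In this sketch, the explicit cast used for c2 is an assumption about what the implicit conversion does, based on the description above. The names c1 and c2 are illustrative.

```matlab
a = fi(pi);                        % default: signed, 16 bits, fraction length 13

% Implicit: the double 0.5 is cast to the numerictype of a before the add.
c1 = a + 0.5;

% Explicit: perform the same cast by hand, then add.
c2 = a + fi(0.5, numerictype(a));

% Both sums should have the same value and the same data type.
sameValue = isequal(c1, c2)
sameType  = isequal(numerictype(c1), numerictype(c2))
```

With full-precision sums, both results are expected to be one bit wider than a, with the fraction length unchanged.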

When doing arithmetic between a fi and one of the built-in integer data types, [u]int[8, 16, 32], the word length and signedness of the integer are preserved. The result of the operation is a fi.

a = fi(pi);
b = int8(2) * a

b = 

    6.2832

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 24
        FractionLength: 13

When doing arithmetic between a fi and a logical data type, the logical is treated as an unsigned fi object with a value of 0 or 1, and word length 1. The result of the operation is a fi object.

a = fi(pi);
b = logical(1);
c = a*b

c = 

    3.1416

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 17
        FractionLength: 13

The fimath Object

fimath properties define the rules for performing arithmetic operations on fi objects, including math, rounding, and overflow properties. A fi object can have a local fimath object, or it can use the default fimath properties. You can attach a fimath object to a fi object by using setfimath. Alternatively, you can specify fimath properties in the fi constructor at creation. When a fi object has a local fimath, rather than using the default properties, the display of the fi object shows the fimath properties. In this example, a has the ProductMode property specified in the constructor.

a = fi(5,1,16,4,'ProductMode','KeepMSB')

a = 

     5

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 16
        FractionLength: 4

        RoundingMethod: Nearest
        OverflowAction: Saturate
           ProductMode: KeepMSB
     ProductWordLength: 32
               SumMode: FullPrecision

The ProductMode property of a is set to KeepMSB while the remaining fimath properties use the default values.

Note

For more information on the fimath object, its properties, and their default values, see fimath Object Properties.

Bit Growth

The following table shows the bit growth of fi objects, A and B, when their SumMode and ProductMode properties use the default fimath value, FullPrecision.

	A	B	Sum = A+B	Prod = A*B
Format	`fi(v_A,s₁,w₁,f₁)`	`fi(v_B,s₂,w₂,f₂)`	—	—
Sign	`s₁`	`s₂`	`S_sum` = (`s₁`||`s₂`)	`S_product` = (`s₁`||`s₂`)
Integer bits	`I₁= w₁-f₁-s₁`	`I₂= w₂-f₂-s₂`	`I_sum = max(w₁-f₁, w₂-f₂) + 1 - S_sum`	`I_product = (w₁ + w₂) - (f₁ + f₂) - S_product`
Fraction bits	`f₁`	`f₂`	`F_sum = max(f₁, f₂)`	`F_product = f₁ + f₂`
Total bits	`w₁`	`w₂`	`S_sum + I_sum + F_sum`	`w₁ + w₂`
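
The Sum and Prod columns can be written as two small helpers. This is a sketch only: sumType and prodType are hypothetical names, not toolbox functions, and the helpers assume that both operands have the same signedness and use FullPrecision modes. Here w is the word length and f is the fraction length.

```matlab
% Total bits and fraction bits of a full-precision sum:
% keep the longer integer part, add one carry bit, keep the longer fraction.
sumType = @(w1,f1,w2,f2) struct( ...
    'WordLength',     max(w1 - f1, w2 - f2) + 1 + max(f1, f2), ...
    'FractionLength', max(f1, f2));

% Total bits and fraction bits of a full-precision product.
prodType = @(w1,f1,w2,f2) struct( ...
    'WordLength',     w1 + w2, ...
    'FractionLength', f1 + f2);

% Operand formats taken from the addition and multiplication examples.
s = sumType(16,13,12,14)     % WordLength 18, FractionLength 14
p = prodType(20,17,16,13)    % WordLength 36, FractionLength 30
```

The two results match the word lengths and fraction lengths displayed for c in the earlier addition and multiplication examples.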

This example shows how bit growth can occur in a for-loop.

T.acc = fi([],1,32,0);
T.x = fi([],1,16,0);

x = cast(1:3,'like',T.x);
acc = zeros(1,1,'like',T.acc);

for n = 1:length(x)
    acc = acc + x(n)
end

acc = 

     1
      s33,0

acc = 

     3
      s34,0

acc = 

     6
      s35,0

The word length of acc increases with each iteration of the loop. This increase causes two problems. First, code generation does not allow a variable to change data type inside a loop. Second, if the loop is long enough, you run out of memory in MATLAB. See Controlling Bit Growth for some strategies to avoid this problem.

Controlling Bit Growth

Using fimath

By specifying the fimath properties of a fi object, you can control the bit growth as operations are performed on the object.

F = fimath('SumMode', 'SpecifyPrecision', 'SumWordLength', 8,...
 'SumFractionLength', 0);
a = fi(8,1,8,0, F);
b = fi(3, 1, 8, 0);
c = a+b

c = 

    11

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 8
        FractionLength: 0

        RoundingMethod: Nearest
        OverflowAction: Saturate
           ProductMode: FullPrecision
               SumMode: SpecifyPrecision
         SumWordLength: 8
     SumFractionLength: 0
         CastBeforeSum: true

The fi object a has a local fimath object F. F specifies the word length and fraction length of the sum. Under the default fimath settings, the output, c, would have word length 9 and fraction length 0. However, because a has a local fimath object, the resulting fi object has word length 8 and fraction length 0.

You can also use fimath properties to control bit growth in a for-loop.

F = fimath('SumMode', 'SpecifyPrecision','SumWordLength',32,...
'SumFractionLength',0);
T.acc = fi([],1,32,0,F);
T.x = fi([],1,16,0);

x = cast(1:3,'like',T.x);
acc = zeros(1,1,'like',T.acc);

for n = 1:length(x)
    acc = acc + x(n)
end

acc = 

     1
      s32,0

acc = 

     3
      s32,0

acc = 

     6
      s32,0

Unlike when T.acc was using the default fimath properties, the bit growth of acc is now restricted. Thus, the word length of acc stays at 32.

Subscripted Assignment

Another way to control bit growth is by using subscripted assignment. a(I) = b assigns the values of b into the elements of a specified by the subscript vector, I, while retaining the numerictype of a.

T.acc = fi([],1,32,0);
T.x = fi([],1,16,0);

x = cast(1:3,'like',T.x);
acc = zeros(1,1,'like',T.acc);

% Assign into acc without changing its type
for n = 1:length(x)
    acc(:) = acc + x(n)
end

acc(:) = acc + x(n) dictates that the values at subscript vector (:) change. However, the numerictype of the output acc remains the same. Because acc is a scalar, you also receive the same output if you use (1) as the subscript vector.

for n = 1:numel(x)
    acc(1) = acc + x(n)
end

acc = 

     1
      s32,0

acc = 

     3
      s32,0

acc = 

     6
      s32,0

The numerictype of acc remains the same at each iteration of the for-loop.

Subscripted assignment can also help you control bit growth in a function. In the function, cumulative_sum, the numerictype of y does not change, but the values in the elements specified by n do.

function y = cumulative_sum(x)
% CUMULATIVE_SUM Cumulative sum of elements 
% of a vector.
%
%   For vectors, Y = cumulative_sum(X) is a 
%   vector containing the cumulative sum of 
%   the elements of X.  The type of Y is the type of X.
    y = zeros(size(x),'like',x);
    y(1) = x(1);
    for n = 2:length(x)
        y(n) = y(n-1) + x(n);
    end
end

y = cumulative_sum(fi([1:10],1,8,0))

y = 

     1     3     6    10    15    21    28    36    45    55

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 8
        FractionLength: 0

Note

For more information on subscripted assignment, see the subsasgn function.

`accumpos` and `accumneg`

Another way you can control bit growth is by using the accumpos and accumneg functions to perform addition and subtraction operations. Similar to subscripted assignment, accumpos and accumneg preserve the data type of one of their input fi objects, while allowing you to specify a rounding method and an overflow action for the operation.

For more information on how to implement accumpos and accumneg, see Avoid Multiword Operations in Generated Code.
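
The accumulator loop from the earlier examples can be sketched with accumpos as follows. This assumes the four-argument form accumpos(a,b,RoundingMethod,OverflowAction); the 'Floor' and 'Wrap' choices are illustrative, not a recommendation.

```matlab
acc = fi(0,1,32,0);        % accumulator: signed, 32 bits, fraction length 0
x   = fi(1:3,1,16,0);      % inputs: signed, 16 bits, fraction length 0

for n = 1:numel(x)
    % x(n) is cast to the data type of acc; the result keeps that type
    acc = accumpos(acc, x(n), 'Floor', 'Wrap');
end

total = acc                % 1 + 2 + 3 = 6, word length still 32

% accumneg subtracts instead of adding, again keeping the type of acc
less = accumneg(acc, x(1), 'Floor', 'Wrap')   % 6 - 1 = 5
```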

Overflows and Rounding

When performing fixed-point arithmetic, consider the possibility and consequences of overflow. The fimath object specifies the overflow and rounding modes used when performing arithmetic operations.

Overflows

Overflows can occur when the result of an operation exceeds the maximum or minimum representable value. The fimath object has an OverflowAction property, which offers two ways of dealing with overflows: saturation and wrapping. If you set OverflowAction to Saturate, overflows saturate to the maximum or minimum value in the range. If you set OverflowAction to Wrap, overflows wrap using modulo arithmetic if unsigned, or two’s complement wrap if signed.

For more information on how to detect overflow, see Underflow and Overflow Logging Using fipref.
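
The difference between the two overflow actions can be seen on a single addition. In this sketch, the sum is restricted to 8 bits so that 100 + 100 cannot be represented; the variable names are illustrative.

```matlab
% Signed 8-bit range with fraction length 0 is -128 to 127,
% so the true sum 200 overflows.
sumSpec = {'SumMode','SpecifyPrecision','SumWordLength',8,'SumFractionLength',0};
Fsat  = fimath('OverflowAction','Saturate', sumSpec{:});
Fwrap = fimath('OverflowAction','Wrap',     sumSpec{:});

aSat  = fi(100,1,8,0,Fsat);
aWrap = fi(100,1,8,0,Fwrap);

ySat  = aSat  + aSat      % saturates to the maximum, 127
yWrap = aWrap + aWrap     % wraps: 200 - 256 = -56
```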

Rounding

There are several factors to consider when choosing a rounding method, including cost, bias, and whether or not there is a possibility of overflow. Fixed-Point Designer™ software offers several different rounding functions to meet the requirements of your design.

Rounding Method	Description	Cost	Bias	Possibility of Overflow
`ceil`	Rounds to the closest representable number in the direction of positive infinity.	Low	Large positive	Yes
`convergent`	Rounds to the closest representable number. In the case of a tie, `convergent` rounds to the nearest even number. This approach is the least-biased rounding method provided by the toolbox.	High	Unbiased	Yes
`floor`	Rounds to the closest representable number in the direction of negative infinity, equivalent to two’s complement truncation.	Low	Large negative	No
`nearest`	Rounds to the closest representable number. In the case of a tie, `nearest` rounds to the closest representable number in the direction of positive infinity. This rounding method is the default for `fi` object creation and `fi` arithmetic.	Moderate	Small positive	Yes
`round`	Rounds to the closest representable number. In the case of a tie, the `round` method rounds positive numbers to the closest representable number in the direction of positive infinity, and negative numbers to the closest representable number in the direction of negative infinity.	High	Small negative for negative samples; unbiased for samples with evenly distributed positive and negative values; small positive for positive samples	Yes
`fix`	Rounds to the closest representable number in the direction of zero.	Low	Large positive for negative samples; unbiased for samples with evenly distributed positive and negative values; large negative for positive samples	No
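
The tie-breaking behavior described in the table can be compared on two values that lie exactly halfway between integers. This sketch applies each rounding function to a fi vector; the variable names are illustrative, and the values in the comments follow from the descriptions in the table.

```matlab
x = fi([2.5 -2.5], 1, 8, 1);   % both values are exactly representable ties

rCeil       = double(ceil(x))         % [ 3  -2]  toward positive infinity
rConvergent = double(convergent(x))   % [ 2  -2]  ties to nearest even
rFloor      = double(floor(x))        % [ 2  -3]  toward negative infinity
rNearest    = double(nearest(x))      % [ 3  -2]  ties toward positive infinity
rRound      = double(round(x))        % [ 3  -3]  ties away from zero
rFix        = double(fix(x))          % [ 2  -2]  toward zero
```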