Precision lost when combining int32 integers with single-precision numbers

I have a column of data A composed of int32 numbers, and another column of data B composed of single-precision numbers. When I try to put them into one array C, my single-precision numbers are butchered into integers.
C = [A, B];
Why is MATLAB set up this way? Due to the loss of precision, my final calculated values are way off. It took me quite some time to figure out that this was the reason.

 Accepted Answer

Why is MATLAB set up this way? Because you can't please all of the people, all of the time. Suppose a numeric vector could have elements that are all of different numeric classes. Something like:
X = [pi, single(2.1), uint16(3)]
X = 1×3
3 2 3
whos X
Name      Size            Bytes  Class     Attributes
X         1x3                 6  uint16
etc. X will be a UINT16 by default here. But if the elements could retain their class information (not in the form of a cell array, which DOES retain the class information for each element), then any computation would become IMMENSELY SLOW.
A huge benefit of MATLAB is that it runs blazingly fast when doing double-precision computation, especially on large arrays. But if the code needed to check each element and deal with the class of that number, then it would not be fast at all. And then almost everyone would be unhappy. As such, MATLAB is designed to store all numeric vectors using one class. Concatenation operators decide which class to use, based on some simple rules.
So what happened to you?
In your case, you were combining int32 numbers with singles, by way of concatenation (horzcat)
Y1 = [int32(2), single(3.2)]
Y1 = 1×2
2 3
whos Y1
Name      Size            Bytes  Class    Attributes
Y1        1x2                 8  int32
The rule is that when you concatenate integers and singles together, you get an integer result.
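If the goal is simply to keep the fractional values, converting before you concatenate avoids the truncation entirely. A quick sketch (the variable names are just placeholders for the arrays in the question):

```matlab
A = int32([1 2 3]);          % integer column data
B = single([1.5 2.5 3.5]);   % single-precision column data
% Convert the integers to single (or double) BEFORE concatenating,
% so the [] operator never has to demote B to an integer class.
C = [single(A), B];          % class(C) is 'single'; fractions preserved
```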
If you really want to retain the information about each element, then you needed to use a cell array.
Z = {int32(2), single(3.2)}
Z = 1×2 cell array
{[2]} {[3.2000]}
As you can see, MATLAB now retains all the information you want for each element. The problem is, you can't do numerical operations using cell arrays.
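That said, you can still compute with a cell array's contents by converting element-wise, for example with cellfun. A minimal sketch:

```matlab
Z = {int32(2), single(3.2)};
% Apply double() to each cell so the mixed classes can be
% combined into one double-precision numeric result.
S = sum(cellfun(@double, Z));   % S is 5.2000, a double scalar
```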

More Answers (3)

The general rule is that when you combine numbers of two different types, the result is the type considered more restrictive. Integer is considered more restrictive than float.
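A quick way to see that rule in action (a sketch; the classes shown in the comments are what current MATLAB releases produce):

```matlab
x = [single(1.5), 2.5];        % single combined with a double literal
class(x)                       % 'single' -- single is more restrictive than double
y = [int32(7), single(1.5)];   % integer combined with single
class(y)                       % 'int32'  -- integer is more restrictive than float
```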

I wasn't there when the decision was made, but I suspect it was because in MATLAB, numeric literals are always double floats. That goes back to the origins of MATLAB, before integer types like int32 were even introduced.
Once you're locked in with literals that are always double, it becomes inconvenient to prioritize precision. If the rule were to promote the lower-precision operand to the higher precision, then you would get an automatic explosion in RAM usage every time you did the simplest operations between large integer arrays and literal scalars, e.g.,
A = int16(5000); % integer
C = A + 1;
You could avoid this by remembering to convert your literal scalars, as in
C = A + int16(1)
but not only is that incredibly cumbersome, it would also have forced people to rewrite their old code from before the days when integer types were introduced.
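The memory cost of promoting integers to doubles is easy to see with whos. A quick illustration:

```matlab
A = ones(1e6, 1, 'int8');   % 1,000,000 elements at 1 byte each
B = double(A);              % same values at 8 bytes each
whos A B                    % A is ~1 MB, B is ~8 MB
```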

Hi @Leon,

I read your comments and hope I have interpreted them correctly. In MATLAB, when you create an array that combines different data types, it attempts to promote all elements to a common type that can accommodate all values without loss of information. Since both Int32 and single types occupy 4 bytes, MATLAB defaults to converting the entire array to the type that is capable of representing all elements. In this case, it promotes to Int32, causing your single-precision floating-point numbers to lose their fractional parts and be represented as integers.

Here’s a deeper look at how this works:

1. Data Types in MATLAB
Int32: A 32-bit signed integer that can represent whole numbers from -2,147,483,648 to 2,147,483,647.

Single: A 32-bit floating-point number that can represent a much wider range of values, including fractional values.

2. Array Concatenation Behavior
When concatenating arrays like [A, B], MATLAB checks the types of both arrays. Since A is Int32 and B is single, it opts for Int32 for the entire resulting array C. The conversion effectively truncates any decimal portion of the single-precision numbers in B, leading to inaccuracies in further calculations.

Solutions and Recommendations

To address this issue effectively, consider the following approaches:

1. Explicit Type Conversion: Before concatenating your arrays, convert both arrays to a common type that preserves precision. For example:

C = [int32(A), single(B)];

This ensures that both arrays are treated as single-precision floating points in the resulting array.

2. Using Cell Arrays: If maintaining different data types is essential for your application, consider using cell arrays:

C = {A, B};

This allows you to keep the data types separate but still access them together.

3. Review Data Types Before Operations: Always check the data types using the class() function before performing operations that combine different types. This can prevent unexpected behavior during calculations.
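For instance, a defensive check before concatenating might look like this (a sketch, assuming A and B are the arrays from the question):

```matlab
if ~strcmp(class(A), class(B))
    warning('Mixing %s and %s; converting both to double first.', ...
        class(A), class(B));
    C = [double(A), double(B)];   % promote both so no values are truncated
else
    C = [A, B];                   % same class; concatenation is safe as-is
end
```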

Hope this helps.

9 Comments

it attempts to promote all elements to a common type that can accommodate all values without loss of information
Not true.
A = uint8(11);
B = uint64(1234);
C = [A,B]
C = 1×2
11 255
class(C)
ans = 'uint8'
If the algorithm were "without loss of information" then C would have come out as uint64, since uint64 can accommodate all uint8 values.
types = {'single', 'double', 'uint8', 'int8', 'uint16', 'int16', 'uint32', 'int32', 'uint64', 'int64'};
nt = length(types);
for J = 1 : nt - 1
    tJ = types{J};
    A = cast(pi, tJ);
    for K = J+1 : nt
        tK = types{K};
        B = cast(-123, tK);
        C = [A,B];
        fprintf('%-6s + %-6s = %-6s\n', tJ, tK, class(C));
    end
end
single + double = single
single + uint8  = uint8
single + int8   = int8
single + uint16 = uint16
single + int16  = int16
single + uint32 = uint32
single + int32  = int32
single + uint64 = uint64
single + int64  = int64
double + uint8  = uint8
double + int8   = int8
double + uint16 = uint16
double + int16  = int16
double + uint32 = uint32
double + int32  = int32
double + uint64 = uint64
double + int64  = int64
uint8  + int8   = uint8
uint8  + uint16 = uint8
uint8  + int16  = uint8
uint8  + uint32 = uint8
uint8  + int32  = uint8
uint8  + uint64 = uint8
uint8  + int64  = uint8
int8   + uint16 = int8
int8   + int16  = int8
int8   + uint32 = int8
int8   + int32  = int8
int8   + uint64 = int8
int8   + int64  = int8
uint16 + int16  = uint16
uint16 + uint32 = uint16
uint16 + int32  = uint16
uint16 + uint64 = uint16
uint16 + int64  = uint16
int16  + uint32 = int16
int16  + int32  = int16
int16  + uint64 = int16
int16  + int64  = int16
uint32 + int32  = uint32
uint32 + uint64 = uint32
uint32 + int64  = uint32
int32  + uint64 = int32
int32  + int64  = int32
uint64 + int64  = uint64
So the algorithm is:
  • integer type wins over float type
  • if two integers are combined, the one with the fewest bits wins
  • if two integers with the same bits are combined, uint wins over int
As others have pointed out, this doesn't work; it's what the OP is already doing.
C = [int32(A), single(B)]; % the result will be int32, not single
I should point out, though, that the obvious fix is not without its own problems:
A = intmax('int32') - 64 % a large integer
A = int32 2147483583
B = single(sqrt(2)) % a fractional number
B = single 1.4142
% by default, the output class is the integer class
C = [A B]; % B is rounded
mat2str(C)
ans = '[2147483583 1]'
% you can force the output to be single instead
% but single can't represent all integers across the full range of int32
C = [single(A), single(B)]; % A is rounded
mat2str(C)
ans = '[2147483520 1.41421353816986]'
% but double can
C = [double(A), double(B)]; % A and B should be unchanged
mat2str(C)
ans = '[2147483583 1.41421353816986]'
C = [int32(A), single(B)];
This ensures that both arrays are treated as single-precision floating points in the resulting array.
No it does not. That explicitly converts A to int32 and explicitly converts B to single precision. Then after that the [] operator implicitly converts the single(B) to int32, same as int32(A).
To treat both as floating point you would need
[single(A), single(B)]
to be clear, or just
[single(A), B]
if you want to rely on the fact that B is single precision.
if two integers are combined, the one with the fewest bits wins
if two integers with the same bits are combined, uint wins over int
No. If you concatenate two integer arrays together, the resulting array will be of the class of the left-most integer array.
integerTypes = reshape(["", "u"] + "int" + [8; 16; 32; 64], 1, 8);
results = array2table(repmat("", 8, 8), ...
    VariableNames = integerTypes, ...
    RowNames = integerTypes);
for type1 = integerTypes
    A = ones(1, type1);
    for type2 = integerTypes
        B = ones(1, type2);
        C = [A, B];
        results{type1, type2} = string(class(C));
    end
end
The rows of the results table represent the type of A (the first array being concatenated together) and the variables represent the type of B. You can see that each row's values are always equal to the row name of the table (the type of A.)
results
results = 8×8 table
              int8       int16      int32      int64      uint8      uint16     uint32     uint64
              ________   ________   ________   ________   ________   ________   ________   ________
    int8      "int8"     "int8"     "int8"     "int8"     "int8"     "int8"     "int8"     "int8"
    int16     "int16"    "int16"    "int16"    "int16"    "int16"    "int16"    "int16"    "int16"
    int32     "int32"    "int32"    "int32"    "int32"    "int32"    "int32"    "int32"    "int32"
    int64     "int64"    "int64"    "int64"    "int64"    "int64"    "int64"    "int64"    "int64"
    uint8     "uint8"    "uint8"    "uint8"    "uint8"    "uint8"    "uint8"    "uint8"    "uint8"
    uint16    "uint16"   "uint16"   "uint16"   "uint16"   "uint16"   "uint16"   "uint16"   "uint16"
    uint32    "uint32"   "uint32"   "uint32"   "uint32"   "uint32"   "uint32"   "uint32"   "uint32"
    uint64    "uint64"   "uint64"   "uint64"   "uint64"   "uint64"   "uint64"   "uint64"   "uint64"
Many thanks for all the super helpful comments. Now I know why.
If my priority is processing speed, should I convert everything to single or double? Thanks.
For large enough arrays, single precision is faster. For arrays of only about 1000 x 1000, the difference in speed is quite small.
A = rand(10000,1000);
B = rand(10000,1000);
sA = single(A);
sB = single(B);
tic; C = A + B; t1 = toc
t1 = 0.0074
tic; sC = sA + sB; t2 = toc
t2 = 0.0046
Note: if you happen to be using gpuArray(), then on all NVIDIA systems, single precision is notably faster than double precision. The exact relative speed depends on the details of the GPU model; the best case is 4:1 single to double (only a small number of systems!), and the worst case is 32:1 (the ratio on most systems). A couple of generations ago, 24:1 was the common ratio.
Glad to know single is faster than double. Thanks.


Release: R2024b
Asked: 21 Jun 2025
Commented: 24 Jun 2025
