191 views (last 30 days)

This is a rather lengthy question, previously asked on http://stackoverflow.com/questions/13382155/is-indexing-vectors-in-matlab-inefficient. Since I did not get any definitive answer, I decided to ask it here, where it is more likely that MATLAB engineers will look.

Background

My question is motivated by simple observations, which somewhat undermine the beliefs/assumptions often held/made by experienced MATLAB users:

- MATLAB is very well optimized when it comes to the built-in functions and the fundamental language features, such as indexing vectors and matrices.
- Loops in MATLAB are slow (despite the JIT) and should generally be avoided if the algorithm can be expressed in a native, 'vectorized' manner.

The bottom line: core MATLAB functionality is efficient and trying to outperform it using MATLAB code is hard, if not impossible.

Investigating performance of vector indexing

The example codes shown below are as fundamental as it gets: I assign a scalar value to all vector entries. First, I allocate an empty vector x:

tic; x = zeros(1e8,1); toc

Elapsed time is 0.260525 seconds.

Having x I would like to set all its entries to the same value. In practice you would do it differently, e.g., x = value*ones(1e8,1), but the point here is to investigate the performance of vector indexing. The simplest way is to write:

tic; x(:) = 1; toc

Elapsed time is 0.094316 seconds.

Let's call it method 1 (from the value assigned to x). It seems to be very fast (faster at least than memory allocation). Because the only thing I do here is operate on memory, I can estimate the efficiency of this code by calculating the obtained effective memory bandwidth and comparing it to the hardware memory bandwidth of my computer:

eff_bandwidth = numel(x) * 8 bytes per double * 2 / time

In the above, I multiply by 2 because unless SSE streaming is used, setting values in memory requires that the vector is both read from and written to the memory. In the above example:

eff_bandwidth(1) = 1e8*8*2/0.094316 = 17 Gb/s

STREAM-benchmarked memory bandwidth ( https://www.cs.virginia.edu/stream/ ) of my computer is around 17.9 Gb/s, so indeed - MATLAB delivers close to peak performance in this case! So far, so good.

Method 1 is suitable if you want to set all vector elements to some value. But if you want to access elements every step entries, you need to substitute the : with e.g., 1:step:end. Below is a direct speed comparison with method 1:

tic; x(1:end) = 2; toc

Elapsed time is 0.496476 seconds.

While you would not expect it to perform any different, method 2 is clearly big trouble: factor 5 slowdown for no reason. My suspicion is that in this case MATLAB explicitly allocates the index vector (1:end). This is somewhat confirmed by using explicit vector size instead of end:

tic; x(1:1e8) = 3; toc

Elapsed time is 0.482083 seconds.

Methods 2 and 3 perform equally bad.

Another possibility is to explicitly create an index vector id and use it to index x. This gives you the most flexible indexing capabilities. In our case:

tic;

id = 1:1e8; % colon(1,1e8);

x(id) = 4;

toc

Elapsed time is 1.208419 seconds.

Now that is really something - 12 times slowdown compared to method 1! I understand it should perform worse than method 1 because of the additional memory used for id, but why is it so much worse than methods 2 and 3?

Let's give the loops a try - as hopeless as it may sound.

tic;

for i=1:numel(x)

x(i) = 5;

end

toc

Elapsed time is 0.788944 seconds.

A big surprise - a loop beats a `vectorized` method 4, but is still slower than methods 1, 2 and 3. It turns out that in this particular case you can do it better:

tic;

for i=1:1e8

x(i) = 6;

end

toc

Elapsed time is 0.321246 seconds.

And that is the probably the most bizarre outcome of this study - a MATLAB-written loop significantly outperforms native vector indexing. That should certainly not be so. Note that the JIT'ed loop is still 3 times slower than the theoretical peak almost obtained by method 1. So there is still plenty of room for improvement. It is just surprising (a stronger word would be more suitable) that usual 'vectorized' indexing (`1:end`) is even slower.

My attempt to get more insights into the problem

I do not have an answer to all the problems, but I do have some refined speculations on methods 2, 3 and 4.

Regarding methods 2 and 3. It does indeed seem that MATLAB allocates memory for the vector indices and fills it with values from 1 to 1e8. To understand it, lets see what is going on. By default, MATLAB uses double as its data type. Allocating the index array takes the same time as allocating x

tic; x = zeros(1e8,1); toc

Elapsed time is 0.260525 seconds.

For now, the index array contains only zeros. Assigning values to the x vector in an optimal way, as in method 1, takes 0.094316 seconds. Now, the index vector has to be read from the memory so that it can be used in indexing. That is additional 0.094316/2 seconds. Recall that in x(:)=1 vector x has to be both read from and written to the memory. So only reading it takes half the time. Assuming this is all that is done in x(1:end)=value, the total time of methods 2 and 3 should be

t = 0.260525+0.094316+0.094316/2 = 0.402

It is almost correct, but not quite. I can only speculate, but filling the index vector with values is probably done as an additional step and takes additional 0.094316 seconds. Hence, t=0.4963, which more or less fits with the time of methods 2 and 3.

These are only speculations, but they do seem to confirm that MATLAB explicitly creates index vectors when doing native vector indexing. Personally, I consider this to be a performance bug. MATLABs JIT compiler should be smart enough to understand this trivial construct and convert it to a call to a correct internal function. As it is now, on the todays memory bandwidth bounded architectures indexing performs at around 20% theoretical peak.

Regarding method 4, one can try to analyze the performance of the individual steps as follows:

tic;

id = 1:1e8; % colon(1,1e8);

toc

tic

x(id) = 4;

toc

Elapsed time is 0.475243 seconds.

Elapsed time is 0.763450 seconds.

The first step, allocation and filling the values of the index vector takes the same time as methods 2 and 3 alone. It seems that it is way too much - it should take at most the time needed to allocate the memory and to set the values ( 0.260525s+0.094316s = 0.3548s ), so there is an additional overhead of 0.12 seconds somewhere, which I can not understand. The second part ( x(id) = 4 ) looks also very inefficient: it should take the time needed to set the values of x, and to read the id vector ( 0.094316s+0.094316/2s = 0.1415s ) plus some error checks on the id values. Programed in C, the two steps take:

id = 1:n 0.214259

x(id) = 4 0.219768

The code used checks that a double index in fact represents an integer, and that it fits the size of x:

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

static struct timeval tb, te;

void tic()

{

gettimeofday(&tb, NULL);

}

void toc(const char *idtxt)

{

long s,u;

gettimeofday(&te, NULL);

s=te.tv_sec-tb.tv_sec;

u=te.tv_usec-tb.tv_usec;

printf("%-30s%10li.%.6li\n", idtxt,

(s*1000000+u)/1000000, (s*1000000+u)%1000000);

}

int main(int argc, char *argv[])

{

double *x = NULL;

double *id = NULL;

int i, n;

// vector size is a command line parameter

n = atoi(argv[1]);

printf("x size %i\n", n);

// not included in timing in MATLAB

x = calloc(sizeof(double),n);

memset(x, 0, sizeof(double)*n);

// create index vector

tic();

id = malloc(sizeof(double)*n);

for(i=0; i<n; i++) id[i] = i;

toc("id = 1:n");

// use id to index x and set all entries to 4

tic();

for(i=0; i<n; i++) {

long iid = (long)id[i];

if(iid>=0 && iid<n && (double)iid==id[i]){

x[iid] = 4;

} else break;

}

toc("x(id) = 4");

}

The second step takes longer than the expected 0.1415s - that is due to the necessity of error checks on id values. The overhead of the checks seems too large to me - most likely my code is NOT optimal and it could be written better. Even so, the time required is 0.4340s , not 1.208419s. What MATLAB does under the hood - I have no idea. Maybe it is necessary to do it, I just don't see it.

Questions

- is simple indexing in MATLAB very inefficient (methods 2, 3, and 4 are slower than method 1), or did I miss something?
- why is method 4 (so much) slower than methods 2 and 3?
- why does using 1e8 instead of numel(x) as a loop bound speed up the code by factor 2?

As I see it now, we can not do anything to improve the situation working within the MATLAB framework. Can we hope the issues will be fixed in some new versions of MATLAB? Or are the described performance problems not considered issues?

Test code

function test

tic; x = zeros(1,1e8); toc

tic; x(:) = 1; toc

tic; x(1:end) = 2; toc

tic; x(1:1e8) = 3; toc

tic;

id = 1:1e8; % colon(1,1e8);

x(id) = 4;

toc

tic;

for i=1:numel(x)

x(i) = 5;

end

toc

tic;

for i=1:1e8

x(i) = 6;

end

toc

end

Edit The following tests (modifications to methods 1, 2, and 3 using subsasgn) have been suggested by Daniel

S = struct('type', '()', 'subs', {{':'}});

tic; x = subsasgn(x, S, 1); toc

S = struct('type', '()', 'subs', {{'1:end'}});

tic; x = subsasgn(x, S, 2); toc

S = struct('type', '()', 'subs', {{'1:1e8'}});

tic; x = subsasgn(x, S, 3); toc

Elapsed time is 0.451859 seconds.

Elapsed time is 0.363623 seconds.

Elapsed time is 0.359140 seconds.

Now I am even more lost. Definitely, x(:) works bad now. But interestingly, methods 2 and 3 work faster than the direct implementation, which suggests that subasgn is not used directly when running x(1:end) and x(1:1e8). Since my results differ from what Daniel observed, I should mention that I use Matlab 2011b on Linux/Ubuntu.

Jan
on 23 Nov 2012

Edited: Jan
on 23 Nov 2012

Loops can be faster than vectorized solutions, when the allocation of temporary arrays is more expensive thant the calculations.

Another difference between id=1:1e8; x(id)=4 and x(1:1e8)=4 is the range check: While in the in the first case Matlab checks 1e8 times if the index is inside the boundaries, for the 2nd case only two tests are performed.

Unfortunately this concerns logical indexing also, at least it looks like this. A naive C-Mex function, which checks the length of the logical index ones, is by a fixed noticable factor faster than doing this in Matlab directly. I'm going to post a version in the FEX.

[EDITED]

Version 1:

id = 1:1e8;

x(id) = 4;

This creates 1:1e8 explicitly. In the assignment the elements pf x and the index vector are stored in the processor caches. In addition a boundary check is performed for each index (and this requires much more time with my Win64/MSVC2008 or MSVC2010 or OWC1.8 setup than in your measurements). ==> This is slow.

Version 2:

for i = 1:numel(x)

x(i) = 5;

end

Matlab's JIT accelerates this and the loops avoid the explicit creation of the index vector. In addition the (not existing) index vector does not pollute the caches. It is documented, that "numel(x)" is evaluated once only when the loop starts.

Version 3:

for i = 1:1e8

x(i) = 5;

end

Obviously the JIT can use the fixed width to improve the speed compared to the version 2. Fine. Perhaps the boundary checks are omitted.

Version 4:

x(:) = 5;

Now Matlab can show its power: Because this cannot be affected by any side effects, the JIT calls an efficient method.

For an interpreted language it is not surprising, that the other methods do not have the theoretical memory bandwidth. The power of Matlab is not the performance, but the fast prototyping. This is useful, when the time for programming and debugging exceeds the run-time.

Therefore I do not see a performance bug. The JIT is not as powerful as it could theoretically be. But Matlab is a very conservative programming environment. Improving the JIT mean modifying it, and this can cause bugs due to the complexity of this task. This concerns other optimizers also, e.g. in C++ compilers: The optimization flags influence the results of large numerical programs. As long as the effects are "small", this is not seen as a bug. But Matlab's JIT tries not to change the results. And therefore it has less power.

Btw.:

tic;

id = 1:1e6;

x(id) = 4;

toc % Elapsed time is 0.042795 seconds.

tic;

id = uint32(1):uint32(1e6);

x(id) = 4;

toc % Elapsed time is 0.033133 seconds.

tic;

id = true(1,1e6); % Logical indexing

x(id) = 4;

toc & Elapsed time is 0.025935 seconds.

Matt J
on 26 Nov 2012

@Jan: I haven't seen angainor join you and Daniel in that claim. In fact, we mentioned tests (in Comments to Daniel under my Answer) that seem to refute it. I definitely see x(1:1e8)=2 producing a memory allocation spike in the Task Manager, even when x is already pre-allocated.

And I'm still not seeing the connection between the undocumentedmatlab material and this issue. Some operations are not done inplace, but is there a theory that subsasgn operations are not?

Jan
on 26 Nov 2012

@Matt J: I've copied the wrong link. Actually I meant http://undocumentedmatlab.com/blog/matrix-processing-performance/.

I see that there are arguments for both theories about the (avoided) explicit allocation in "x(1:1e8)". All I've heard about the JIT let me think, that it depends on the specific code. E.g. it can matter if the code runs in the command line or in a function. Therefore I still think, that this discussion is theoretical only, not to say fruitless. We cannot influence the JIT directly, the JIT differs from release to release, TMW has to consider thousnads of different aspects when modifying it. And finally filling the memory with a scalar is most likely not a bottleneck in any real-world program, because this redundant information could be handles much more efficiently.

@angainor: Please lead me back to the actual point: How can we help you or what is the core of your message?

Matt J
on 22 Nov 2012

Edited: Matt J
on 22 Nov 2012

Never mind any of that. Try explaining this:

x = zeros(1e8,1);

id = 1:1e8;

tic;

for i=id

x(i) = 4;

end

toc

%Elapsed time is 101.974125 seconds.

I think the bottom line is this. Vectorized methods are generally going to be slow because they always require you to generate and allocate memory for a dedicated index array. Some vectorized situations are accelerated more than others, however, like

x(1:1e8)=2

because MATLAB recognizes that 1:1e8 is only temporarily required and does a better job of caching it.

For-loops can in some cases do better than vectorized methods because the JIT can recognize certain patterns and situations where the loop is jumping in regular increments and doesn't need to allocate memory for a dedicated index array.

As for why numel(x) slows a for-loop down, I imagine the situation is similar to C/C++. At every iteration of the loop, i must be compared to numel(x), which is stored in some temporary variable, to check if the loop is supposed to terminate. Variables are not cached as well as constants like 1e8.

Matt J
on 24 Nov 2012

I just executed the statements while looking at MATLABs memory consumption and see that the memory usage clearly goes up by almost 800Mb.

Confirmed! (in R2012a 64-bit)

Well I guess the plot thickens.

performance is optimal, hence it is possible and there is no inherent problem with MATLAB - it can do things well.

Yes, it's possible and maybe it's on their TODO list, but the question is how much development work it would take TMW to make the optimization you're talking about. Basically what you seem to be proving is that all indexing operations, even on native matrices/vectors, generate index vectors and pass those vectors to SUBSASGN, except of course in the particular case where it passes only the string ':'. The reason it is possible to optimize x(:) is because when the ':' string gets passed to SUBSASGN, it can be easily recognized as a special case. What you're now asking is that the JIT recognize colon expressions separately and pass a new/different kind of symbol to subsasgn to encode that. Not sure how much development effort that would require or how much extra burden that would put on the JIT.

There are also many other sub-optimalities you could point out to TMW. What about the following performance difference, for example? Does it make sense that something so slightly different than x(:) should be about a factor of 2 slower?

tic;

x(:)=2;

toc;

%Elapsed time is 0.054764 seconds.

tic;

x(:,1)=3;

toc;

%Elapsed time is 0.098193 seconds.

Daniel Shub
on 23 Nov 2012

MATLAB is not as simple as you suggest. You were looking for the extra read and write overhead you observe with x(1:end) and x(1:1e8). I would suggest that this overhead is from subsasgn and I would add two additional tests to your list. According to the documentation x(:) should be equivalent to

S = struct('type', '()', 'subs', {{':'}});

tic; x = subsasgn(x, S, 4); toc

and the other three examples should be equivalent to

S = struct('type', '()', 'subs', {{1:1e8}});

tic; x = subsasgn(x, S, 5); toc

On my machine

S = struct('type', '()', 'subs', {{':'}});

tic; x = subsasgn(x, S, 4); toc

is slower than

x(:) = 1;

and

S = struct('type', '()', 'subs', {{1:1e8}});

tic; x = subsasgn(x, S, 5); toc

is slower than

x(1:end) = 2;

and

x(1:1e8) = 3;

but the same speed as

id = 1:1e8; % colon(1,1e8);

x(id) = 4;

This suggests that MATLAB speeds up x(:), x(1:end) and x(1:1e8), possibly by avoiding the overhead of subsasgn. I think this is probably the case of x(:). For x(1:end) and x(1:1e8), however, I think subsasgn is being used since

S = struct('type', '()', 'subs', {{':'}});

tic; x = subsasgn(x, S, 4); toc

takes about the same time as both

x(1:end) = 2;

and

x(1:1e8) = 3;

suggesting some sort of minimum overhead with the subsasgn route. This to me suggests that MATLAB is not calling subsasgn like the documentation says. This is not surprising since subsasgn is a tricky beast in that it cannot be overloaded and the JIT helps where it can.

As for the loops who knows what the JIT is doing. Some things are optimized and some things are not. Some things that were optimized in one version become un-optimized in another version.

Matt J
on 23 Nov 2012

If x(1:1e8) is not creating a temporary index array 1:1e8, as you argued in your Comment to me, then I don't see how it could be calling subsasgn. If it's calling subsasgn, what 'subs' data is being passed to it? You mean it's translating x(1:1e8) into x(:) and then calling SUBSASGN with subs=':'? In my timing results it's taking a really long time to do that translation, on the order of .01 sec.

Incidentally, I'm not clear what you meant by "subsasgn is a tricky beast in that it cannot be overloaded", since people overload it all the time when writing classdefs.

Daniel Shub
on 23 Nov 2012

Yes, subsasgn can be (and often is) overload for custom classes, but it cannot be overloaded for builtin classes. According to the documentation: "Therefore, if A is an array of class double, and there is an @double/subsasgn method on your MATLAB path, the statement A(I) = B calls the MATLAB built-in subsasgn function."

As for x(1:1e8) creating a temporary variable, I am looking into that now. I don't know what the input to subsasgn is, but it doesn't seem to require much memory and it appears to be able to handle so transformations (e.g., 1:M:N) without creating a temporary variable ...

Jan
on 13 Apr 2015

Some tests revealed, that even logical indexing can be implemented more efficiently: Fex: CopyMask . This C-Mex function is about 2 to 3 times faster than Matlab's built-in method:

function test

X = rand(8000,8000);

tic

for k = 1:10

Y = CopyMask(X, X > 0.2);

end

toc

tic

for k = 1:10

Y = X(X > 0.2);

end

toc

Daniel Shub
on 26 Nov 2012

The fact that different means of indexing take different amount of time is interesting, but probably not that useful to understand. The timings appear to depend on the MATLAB version and possibly OS. In my opinion optimizing MATLAB for

x(1:n) = a;

would be a big mistake. I would much rather have MATLAB optimized for something like

x(randi(n, n/100, 1)) = randn(n, 1);

which I don't think its the same problem.

Opportunities for recent engineering grads.

Apply TodayFind the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!
## 3 Comments

## Direct link to this comment

https://uk.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient#comment_112929

⋮## Direct link to this comment

https://uk.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient#comment_112929

## Direct link to this comment

https://uk.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient#comment_112944

⋮## Direct link to this comment

https://uk.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient#comment_112944

## Direct link to this comment

https://uk.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient#comment_113016

⋮## Direct link to this comment

https://uk.mathworks.com/matlabcentral/answers/54522-why-is-indexing-vectors-matrices-in-matlab-very-inefficient#comment_113016

Sign in to comment.