Today we are going to test the Arduino compiler, by doing a series of simple tests to see how smart it is at making optimizations in our code.
Often in the chat I tell you not to obsess over code speed, and to always prioritize cleanliness over speed. Well, today we are going to back up that advice a bit.
Actually, when talking about the “Arduino compiler”, we are really using a generic C++ compiler, usually GCC. So, how smart is a C++ compiler like GCC?
Well, at this point in the movie (or as they say, in the “state of the art”) very, very, very smart. Much more than you, than me, and even more than you and me together.
It’s not something to get depressed about. Simply, modern compilers are very advanced programs, developed over many years, with the contributions and work of great geniuses. So it is logical that they do an excellent job optimizing code.
But just in case we don’t believe it, let’s do a few basic examples. For the tests we are going to use an ESP32, but it would be the same with another type of processor. Also, the specific times we are going to get don’t really matter to us. What we want to see are the trends, and certain optimizations that we will see very clearly.
Loop with Increment
Let’s start with a base example, executed on an ESP32.
```cpp
void printResults(unsigned long long counter, unsigned long long rst, unsigned long long microseconds)
{
  Serial.print("benchmark ");
  Serial.print(counter);
  Serial.print(": ");
  Serial.print(rst);
  Serial.print(" ");
  Serial.print(microseconds);
  Serial.println("us");
}

void benchmark()
{
  auto start = micros();
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum++;
  }
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}

void setup()
{
  Serial.begin(115200);
}

void loop()
{
  delay(5000);
  benchmark();
}
```
void setup()
{
Serial.begin(115200);
}
void loop()
{
delay(5000);
benchmark();
}
Here we basically have,
- A ‘printResults()’ function that displays the data on the screen
- Another benchmark function, which performs the calculations and tests we want to do
- Finally, a very simple setup function that initializes the port, and a very simple loop that launches the benchmark
In this first base case, we are simply entering a loop 1,000,000 (1E6) times and incrementing a variable, to see how fast it executes. If we run this we will see something similar to the following result,
benchmark 1000000: 1000000 1us
And we might think, hmmm… how fast is our processor that it performs a loop of one million elements in one microsecond. So let’s vary the number of times we run the loop. The results are as follows,
| 10^N | Result | us |
|---|---|---|
| 3 | 1000 | 1 |
| 4 | 10000 | 1 |
| 5 | 100000 | 1 |
| 6 | 1000000 | 1 |
| 7 | 10000000 | 1 |
| 8 | 100000000 | 1 |
| 9 | 1000000000 | 1 |
Well, here we already see that something strange is happening. The test is always executing in a time of one microsecond regardless of the number of times we enter the loop.
On the other hand, we are talking about a 240 MHz processor. Even considering that each operation costs more than one clock cycle, it is logical to expect that between 1E8 and 1E9 we are in the range of one second. And we are in a range of microseconds.
What is happening? That the compiler is smart enough to know how to solve the operation, without actually having to perform the loop. That is, when compiling the code, the generated binary does not have the loop part. The compiler has replaced it with the final result of the operation.
Loop with Literals and Constants
Let’s see what happens if instead of simply incrementing a variable, we add a literal.
```cpp
unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += 3;
}
```
If we run the test, we get,
benchmark 1000000: 3000000 1us
benchmark 1000000000: 3000000 1us
That is, the compiler is also smart enough to know how to solve the loop without having to perform it.
What if we change it to a constant variable?
```cpp
const int value = 5;
unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += value;
}
```
It also knows how to solve it without executing the loop,
benchmark 1000000: 5000000 1us
What if it is a normal variable, not marked as constant?
```cpp
int value = 10;
unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += value;
}
```
Indeed, the compiler is smart enough to know that this variable never changes in the code, even though we haven’t marked it with ‘const’,
benchmark 1000000: 10000000 1us
Wow, how smart is the compiler. What if the value is returned from a function?
```cpp
int givemeNumber()
{
  return 3;
}

void benchmark()
{
  auto start = micros();
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += givemeNumber();
  }
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}
```
Indeed, the compiler has not only been able to calculate the correct result without calling the function, but in the binary code the loop and the function will disappear completely!
benchmark 1000000: 3000000 1us
Loop with Variable
Well, it’s clear that this compiler is very smart. Let’s try to make life more difficult for it. Instead of adding a number that doesn’t change, let’s add a number that does change. For example, let’s add the loop’s own ‘i’,
```cpp
void benchmark()
{
  auto start = micros();
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += i;
  }
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}
```
We run our base case for 1E6 elements and get,
benchmark 1000000: 1783293664 8393us
Okay, now it has entered the loop. We can easily see this because the time is no longer 1 microsecond, but 8393 us. Again, we are not going to evaluate the speed of the ESP32, it is not important. What is important is that this time the compiler has not been able to calculate the result without entering the loop.
A bit disappointing. If the compiler were a bit smarter, it would know that the result is simply N*(N-1)/2 (the sum of 0 to N-1). But I guess that’s too much to expect from a poor little program (spoiler: we’ll see later).
Let’s continue. Just like before, let’s change the number of times we enter the loop. The results we get are as follows.
| 10^N | Result | us |
|---|---|---|
| 3 | 499500 | 1 |
| 4 | 49995000 | 1 |
| 5 | 704982704 | 835 |
| 6 | 1783293664 | 8393 |
| 7 | -2014260032 | 83946 |
| 8 | 887459712 | 839515 |
| 9 | 3051657984 | 8395178 |
Here we see that in the case of 1E5 the time is 835 us, and that it increases almost perfectly linearly with the number of loop iterations. That is, the program is indeed entering the loop in these cases.
But, for the 1E3 and 1E4 cases the time is back to 1us. That is, the compiler has been able to calculate the result without needing to perform the loop. Is it possible that the compiler does know the formula N*(N+1)/2? And in that case, why does it know how to apply it at 1E3 and 1E4 but not from 1E5 onwards?
Well, for that we have to notice that the calculated result has overflowed. That is, it has exceeded the maximum value of the ‘sum’ variable, which is an unsigned long; on the ESP32, that is 4 bytes.
In the following table I show the results of the calculation being performed, indicating when it would overflow the different types of variables.
| 10^N | Result | int / long (max 2.15E+09) | uint / ulong (max 4.29E+09) | long long (max 9.22E+18) | ulong long (max 1.84E+19) |
|---|---|---|---|---|---|
| 1 | 4.50000000000E+01 | ok | ok | ok | ok |
| 2 | 4.95000000000E+03 | ok | ok | ok | ok |
| 3 | 4.99500000000E+05 | ok | ok | ok | ok |
| 4 | 4.99950000000E+07 | ok | ok | ok | ok |
| 5 | 4.99995000000E+09 | overflow | overflow | ok | ok |
| 6 | 4.99999500000E+11 | overflow | overflow | ok | ok |
| 7 | 4.99999950000E+13 | overflow | overflow | ok | ok |
| 8 | 4.99999995000E+15 | overflow | overflow | ok | ok |
| 9 | 4.99999999500E+17 | overflow | overflow | ok | ok |
| 10 | 4.99999999950E+19 | overflow | overflow | overflow | overflow |
We see that, precisely, for 1E4 the result has not yet overflowed. But, for 1E5 onwards, it does overflow. That is, the compiler knows how to calculate the operation without entering the loop as long as the result variable does not overflow.
If the variable overflows, the compiler has no choice but to perform the loop to execute the calculation. And rightly so, because this calculation is no longer as trivial as the equation we cited. This is why we see this jump between 1E4 and 1E5 iterations.
What if we switch to an 8-byte unsigned long long variable? Here are the results,
| 10^N | Result | us |
|---|---|---|
| 3 | 499500 | 1 |
| 4 | 49995000 | 1 |
| 5 | 4999950000 | 7554 |
| 6 | 499999500000 | 75554 |
| 7 | 49999995000000 | 755657 |
| 8 | 4999999950000000 | 7565389 |
| 9 | 499999999500000000 | 76533470 |
We observe that the times are much longer. Logical, because 64-bit operations will cost the ESP32 much more effort, as it is a 32-bit processor. But we have already said that we don’t care much about the specific value.
What is important is that the results are now always correct, and in no case does the result variable overflow. And we see that the times increase linearly… starting from 1E5. We still have the same jump at 1E3 and 1E4, where the compiler is able to eliminate the loop.
Why does it still enter the loop from 1E5 onwards, if the variable no longer overflows? Well, I understand it is because the processor is 32-bit: even though we are now using 64-bit variables, each 64-bit operation has to be split into several 32-bit ones, and the compiler no longer knows how to eliminate the loop in that form.
Loop with Random Sum
Let’s try to make life more difficult for the compiler. How can we make it so that the compiler is not able to “get rid of” the loop in any case? That it always enters, regardless of whether it is executed 1, 10 times or 1,000,000 times?
Well, for example, we can add a random variable. With this we guarantee that “by force” the program will have to enter the loop to calculate it, it will not be able to do any prior optimization.
```cpp
unsigned long counter = 1000000;
unsigned long sum = 0;
for (size_t i = 0; i < counter; i++)
{
  sum += random(10);
}
```
Actually, it wouldn’t have to be a random number. Any non-deterministic variable, such as a digital input, or a value received via communication, would work equally well.
We run it and get,
| 10^N | Result | us |
|---|---|---|
| 1 | random | 8 |
| 2 | random | 68 |
| 3 | random | 678 |
| 4 | random | 6809 |
| 5 | random | 68143 |
| 6 | random | 681428 |
| 7 | random | 6814308 |
| 8 | random | 68143113 |
Logically, the times are much higher due to the overhead of generating the random number. But we don’t care about that, what interests us is that it hasn’t been able to simplify anything. Thus, we see that from 1E1 to 1E8 the time increases linearly, and is approximately 10 times higher than the previous one at each level.
Perfect! We have just made a compiler’s life miserable! (you must be proud…) And, in the process, we have demonstrated that everything we assumed and explained earlier was indeed true.
We Don’t Use the Result
Let’s look at another example. Let’s go back to the case where we sum ‘i’ 1E6 times. But this time, we are not going to pass the result to the ‘printResults()’ function. That is, in this case we use the ‘sum’ variable inside the loop, but not afterwards.
```cpp
void benchmark()
{
  auto start = micros();
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += i;
  }
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, 0, microseconds); // we are not using sum
}
```
If we run it, we see that the time drops from about 8400 us to 0,
benchmark 1000000: 0 0us
That is, the compiler has been able to detect that you are performing a calculation that you are not using anywhere. Consequently, it completely removes all associated code. Indeed, it has gotten rid of the loop and all the variables involved in it.
What if instead of summing ‘i’, we go back to summing random()?,
```cpp
void benchmark()
{
  auto start = micros();
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = 0; i < counter; i++)
  {
    sum += random(10);
  }
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, 0, microseconds);
}
```
Here, no: it has not been able to eliminate the loop, even though the result is not used anywhere after the calculation.
benchmark 1000000: 0 681200us
The reason is that the random() function is complex enough that the compiler cannot determine what it does. We know that it basically returns a number, but the compiler cannot prove that calling it has no other effects. So it plays it safe, and doesn’t remove it.
Decremental Loop
While we’re at it, let’s try another case. It is a common belief among more veteran programmers that decremental loops (i--) are faster than incremental loops (i++).
This is based on the theory that a comparison against zero (!= 0) is faster than a < or <= comparison, so we should be able to scrape off some valuable microseconds. Since we’re here, let’s test it.
```cpp
void benchmark()
{
  auto start = micros();
  unsigned long counter = 1000000;
  unsigned long sum = 0;
  for (size_t i = counter - 1; i != 0; i--)
  {
    sum += i;
  }
  auto finish = micros();
  auto microseconds = finish - start;
  printResults(counter, sum, microseconds);
}
```
And, as expected, the result is exactly the same as when executed incrementally.
benchmark 1000000: 1783293664 8393us
Probably, this was true 50 years ago, with very old processors and compilers. Moreover, if there really were any performance difference, the compiler is smart enough to make that substitution for you.
Furthermore, as I tell you many times, today it is advisable to avoid this type of “trickery”. By trying to optimize the code, you might prevent the compiler from being able to do some of its optimizations, and even worsen performance.

