Optimizing division and modulo operations on a microcontroller
In this article I’ll show how I optimized a costly division operation in a performance-constrained embedded environment, what effect the optimization had, and why I ultimately decided against it.
Getting rid of the division

I recently ran some benchmarks on my JLed library (an embedded
library for microcontrollers to control LEDs with various effects) and compared the performance of
the fade effect function across several microcontrollers. The function uses a look-up table
and calculates the target value by linear interpolation, mostly avoiding floating point
operations and divisions. In the original version, one division operation remains in the calculation of
tnorm. Since not every microcontroller has hardware division built in (e.g. the 8-bit AVRs), or uses it
by default (see the RP2040 pico divider), it is
a costly operation, especially when called in the “hot path”, possibly thousands of times per
second:
uint8_t original_fade_func(uint32_t t, uint16_t period) {
    // normalized look-up-table at x=0,16,32,...
    static constexpr uint8_t lut8[] = {
        0, 0, 3, 7, 13, 22, 33, 49, 68, 91, 118, 148, 179, 208, 232, 248, 255
    };
    if (t + 1 >= period) return 255;
    // scale t according to period to 0..255
    const auto tnorm = (t << 8) / period;  // <- expensive division here
    const auto i = tnorm >> 4;             // -> i will be in range 0..15
    const auto y0 = lut8[i];
    const auto y1 = lut8[i + 1];
    const auto x0 = i << 4;                // *16
    // y(t) = m*t + b, with m = dy/dx = (y1-y0)/16 = (y1-y0) >> 4
    return (((tnorm - x0) * (y1 - y0)) >> 4) + y0;
}
How can we get rid of the costly division? In the book Hacker’s Delight I found the idea of replacing the division with a multiplication by the reciprocal: instead of computing \( a = \lfloor \frac i j \rfloor \), we compute the approximation \( b = \lfloor i \frac{M}{2^{16}} \rfloor \) with \( M = \lceil \frac{2^{16}}{j} \rceil \). With \( \lceil \frac x d \rceil = \lfloor \frac{x + d - 1}{d} \rfloor \) we have \( M = \lfloor \frac{2^{16}+j-1}{j} \rfloor \).
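To make this concrete, here is a small worked example (numbers chosen purely for illustration): for \( j = 100 \) we get \( M = \lceil \frac{2^{16}}{100} \rceil = 656 \), and for \( i = 1234 \) the approximation yields \( b = \lfloor \frac{1234 \cdot 656}{2^{16}} \rfloor = \lfloor 12.35 \rfloor = 12 \), which matches \( \lfloor \frac{1234}{100} \rfloor = 12 \) exactly.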
// calculate the "inverse" of d as ceil(2^16/d). With d > 1 the result
// fits into an uint16_t:
//   inv16(1) = 2^16 = 65536                  -> will not fit into uint16_t
//   inv16(2) = (2^16 + 1) / 2 = 2^15 = 32768 -> fits into uint16_t
// Since we round up, inv16 can have a small error that must be adjusted
// during the division later in div16.
uint16_t inv16(uint16_t d) {
    return ((1UL << 16) + (uint32_t)d - 1) / (uint32_t)d;
}
// calculate floor(a/b) with multiply-and-shift. inv16(d) = ceil(2^16/d)
// introduces a rounding error ε = inv16(d) - 2^16/d, where 0 <= ε < 1.
// Expanding the multiplication:
// a * inv16(d) / 2^16 = a/d + a*ε/2^16
// The term a*ε/2^16 is the excess. Since ε < 1 and a < 2^16, the excess is
// always < 1, so the floor can overshoot by at most 1.
// Example: a=32768, d=3 -> inv16(3)=21846, ε=21846-65536/3=0.667,
// excess = 32768*0.667/65536 = 0.333. The fractional part of a/d is 2/3,
// and 2/3 + 1/3 >= 1, so floor(32768*21846/2^16) = 10923 instead of
// floor(32768/3) = 10922. The correction step detects and fixes this.
// (storing the inverted period as uint32_t would make the correction step obsolete, but at
// the cost of storing 4 instead of 2 bytes)
uint16_t div16(uint16_t a, uint16_t b, uint16_t inv_b) {
    const uint16_t d = ((uint32_t)a * (uint32_t)inv_b) >> 16;
    if (d * b > a)
        return d - 1;  // correction
    return d;
}
// test for all combinations of (i, j) whether the multiply-and-shift approach
// delivers the expected result
int main(void) {
    uint32_t nok = 0;
    uint32_t ok = 0;
    for (uint32_t j = 2; j <= 65535; j++) {
        const uint16_t inv16_j = inv16((uint16_t)j);
        for (uint32_t i = 1; i <= 65535; i++) {
            const uint16_t expected = (uint16_t)i / (uint16_t)j;
            const uint16_t b = div16(i, j, inv16_j);
            if (expected != b) {
                printf("i=%-5u j=%-5u: expected=%-5u div16=%-5u\n", i, j, expected, b);
                nok++;
            } else {
                ok++;
            }
        }
        printf("j=%u, ok=%u, nok=%u\n", j, ok, nok);
    }
    return 0;
}
Coming back to our fade-function, we now pass the pre-computed inverted period as inv_period to
the function and use the above multiply-and-shift approach. Since we don’t need 100% accuracy (we
are just dealing with LED brightness calculations), we omit the described adjustment and accept the
resulting error of at most 1. To be efficient, inv_period must be calculated once at the call
site and then cached. This results in the optimized_fade_func:
uint8_t optimized_fade_func(uint32_t t, uint16_t period, uint16_t inv_period) {
    // pre-calculated fade-on function at x={0,16,...,256}
    static constexpr uint8_t lut8[] = {
        0, 0, 3, 7, 13, 22, 33, 49, 68, 91, 118, 148, 179, 208, 232, 248, 255
    };
    if (t + 1 >= period) return 255;
    // tnorm = ((t << 8) * inv_period) >> 16 = (t << 8) / period, with the
    // correction step omitted; the mask clamps the result to 0..255
    const auto tnorm = ((t * (uint32_t)inv_period) >> 8) & 0xff;
    const auto i = tnorm >> 4;  // segment index (0..15)
    const auto y0 = lut8[i];
    const auto y1 = lut8[i + 1];
    const auto x0 = i << 4;     // segment start in normalized space
    return (((tnorm - x0) * (y1 - y0)) >> 4) + y0;
}
// ...
// calculate once and store
const uint16_t inv_period = ((1UL << 16) + (uint32_t)period - 1) / (uint32_t)period;
// ...
// then use it again and again
const auto val = optimized_fade_func(t, period, inv_period);
// ...
For completeness, the same trick also applies to the costly modulo operation in
the JLed Update method, which is likewise called on the hot path. Instead of using the modulo operator
%, which relies on an expensive division operation, we use div16 and calculate
uint16_t mod16(uint16_t a, uint16_t b, uint16_t inv_b) {
    const uint16_t d = div16(a, b, inv_b);
    return a - d * b;
}
Does it pay off?
As is often the case with optimizations, you end up trading one property for another. In this
instance, we gain speed at the cost of increased memory usage and code complexity. Because we now
pass an inv_period to the fade function, it must be pre-computed and stored (cached) somewhere in
advance.
Memory
The breathe effect of JLed composes two fade-functions with different periods, resulting in 4
additional bytes to store the pre-computed inverted periods (which can be different) plus one
pre-computed inverted period in the JLed object to avoid the modulo operation mentioned. In total
this sums up to 6 additional bytes per JLed object.
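As a sketch of where those bytes go (the struct and field names are hypothetical, not JLed’s actual members), the cached per-object state could look like this:

```cpp
#include <cstdint>

// hypothetical layout; JLed's real classes are organized differently
struct InvPeriodCache {
    uint16_t inv_fade_on_period;   // breathe effect, fade-on half
    uint16_t inv_fade_off_period;  // breathe effect, fade-off half
    uint16_t inv_period;           // avoids the modulo in Update()
};

static_assert(sizeof(InvPeriodCache) == 6, "6 additional bytes per object");
```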
Performance
Let’s compare the performance before and after the optimization for various MCUs:
| MCU | Unoptimized (ns/op) | Optimized (ns/op) |
|---|---|---|
| Arduino Nano | 47900 | 17479 |
| RP2040 | 529 | 362 |
| RP2350 | 282 | 256 |
| RP2350 (RISC V Core) | 337 | 210 |
| STM Nucleo F401RE | 550 | 476 |
| ESP8266 | 1697 | 687 |
| ESP32-S3 | 203 | 193 |
| ESP32-C3 | 345 | 213 |
The 8-bit AVR (Arduino Nano) benefits the most since it is relatively slow and has no hardware divider, so every division is a slow software routine. On MCUs with hardware dividers, the optimized and unoptimized numbers are close, making the optimization nearly irrelevant there.
According to the table, a single calculation takes about 47900 ns on an Arduino Nano with an ATmega328P using the unoptimized version. Ignoring all other overhead, this allows at most \( \frac{1}{47900 \times 10^{-9}} \approx 20877 \) updates per second. If we update the LEDs at 100 Hz, we could theoretically update about 208 LEDs simultaneously.
But since the ATmega328P has only 6 PWM-capable GPIOs, let’s calculate how much CPU time updating 6 LEDs at 100 Hz would consume, now assuming it takes \( t_{led} = 100000\,\text{ns} = 100\,\mu s \) per Update per LED to account for additional overhead. The time spent updating 6 LEDs during one second is \( 6 \times t_{led} \times 100 = 60\,\text{ms} \). That means that in every second, 60 ms or 6% of CPU time is spent updating the 6 LEDs, leaving 94% of CPU time for other tasks.
Summary
The described method is a good optimization if every CPU cycle counts and the MCU lacks a hardware divider, like the ATmega328P. In the end I decided against the optimization, because where it would benefit most (on 8-bit AVRs), the additional memory consumption also hurts most: an Arduino Nano has only 2 KB of SRAM. On the other hand, the ATmega328P can easily drive 6 JLed objects simultaneously at only 6% CPU consumption. On modern MCUs the effect is negligible, and fast MCUs like the RP2040, RP2350 or the ESP32 family are widely adopted today.
Bonus - how to check how the compiler implements the division
If you are using PlatformIO, you can easily check whether the emitted code uses hardware div instructions
or calls a software routine. To do so, add -save-temps=obj -Og to your build_flags and compile the project. If the target platform is e.g. ESP8266, then
afterwards you will find the corresponding assembly in .pio/build/esp8266/src/bench.ino.cpp.s (if
your sketch is called bench.ino). My test function is called divtest, so let’s look for that
function in the assembly.
uint16_t divtest(uint16_t a, uint16_t b) {
    return a / b;
}
This is what the compiler generates for the ESP8266:
_Z7divtesttt:
    addi    sp, sp, -16
    extui   a3, a3, 0, 16
    extui   a2, a2, 0, 16
    s32i.n  a0, sp, 12
    call0   __udivsi3       ; software udiv called here
    l32i.n  a0, sp, 12
    extui   a2, a2, 0, 16
    addi    sp, sp, 16
    ret.n
We see that on the ESP8266 platform a software implementation is called. Next, have a look
at the code generated for the RP2350, where the compiler emits a udiv instruction:
_Z7divtesttt:
    udiv    r0, r0, r1
    bx      lr
And finally, let’s compare this with the code generated for the RP2040, where again no
hardware div instruction is used:
_Z7divtesttt:
    push    {r4, lr}
    bl      __aeabi_uidiv   ; calls software divider
    uxth    r0, r0
    pop     {r4, pc}