Optimizing division and modulo operations on a microcontroller
In this article I’ll show how I optimized a costly division operation in a performance-constrained embedded environment, what effect the optimization had, and why I ultimately decided against it.
Getting rid of the division

I recently ran some benchmarks on my JLed library (an embedded
library for microcontrollers to control LEDs with various effects) and compared the performance of
the fade effect function across several microcontrollers. The function uses a look-up table
and calculates the target value by linear interpolation, mostly avoiding floating point
operations and divisions. In the original version, one division operation remains in the calculation of
tnorm. Since not every microcontroller has hardware division built in (e.g. the 8-bit AVRs), or uses it
by default (see the RP2040 pico divider), it is
a costly operation, especially when called in the “hot path”, possibly thousands of times per
second:
uint8_t original_fade_func(uint32_t t, uint16_t period) {
    // normalized look-up-table at x=0,16,32,...
    static constexpr uint8_t lut8[] = {
        0, 0, 3, 7, 13, 22, 33, 49, 68, 91, 118, 148, 179, 208, 232, 248, 255
    };
    if (t + 1 >= period) return 255;
    // scale t according to period to 0..255
    const auto tnorm = (t << 8) / period;  // <- expensive division here
    const auto i = tnorm >> 4;             // -> i will be in range 0..15
    const auto y0 = lut8[i];
    const auto y1 = lut8[i + 1];
    const auto x0 = i << 4;                // *16
    // y(t) = m*t + b, with m = dy/dx = (y1-y0)/16 = (y1-y0) >> 4
    return (((tnorm - x0) * (y1 - y0)) >> 4) + y0;
}
How can we get rid of the costly division? In the book Hacker’s Delight I found the idea of replacing the division with a multiplication by the reciprocal: instead of computing \( a = \lfloor \frac i j \rfloor \), we compute the approximation \( b = \lfloor i \frac{M}{2^{16}} \rfloor \) with \( M = \lceil \frac{2^{16}}{j} \rceil \). With \( \lceil \frac x d \rceil = \lfloor \frac{x + d - 1}{d} \rfloor \) we have \( M = \lfloor \frac{2^{16}+j-1}{j} \rfloor \).
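To make this concrete, here is a small worked example (numbers chosen purely for illustration): for \( j = 100 \) we get \( M = \lceil \frac{2^{16}}{100} \rceil = 656 \), and for \( i = 1234 \) the approximation yields \( b = \lfloor \frac{1234 \cdot 656}{2^{16}} \rfloor = \lfloor 12.35 \rfloor = 12 \), which matches \( \lfloor \frac{1234}{100} \rfloor = 12 \) exactly.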
// calculate the "inverse" of d as ceil(2^16/d). With d > 1 the result
// fits into an uint16_t:
//   inv16(1) = 2^16 = 65536                  -> will not fit into uint16_t
//   inv16(2) = (2^16 + 1) / 2 = 2^15 = 32768 -> fits into uint16_t
// Since we round up, inv16 can have a small error that must be adjusted
// during the division later in div16.
uint16_t inv16(uint16_t d) {
    return ((1UL << 16) + (uint32_t)d - 1) / (uint32_t)d;
}
// calculate floor(a/b) with multiply-and-shift. inv16(d) = ceil(2^16/d)
// introduces a rounding error ε = inv16(d) - 2^16/d, where 0 <= ε < 1.
// Expanding the multiplication:
// a * inv16(d) / 2^16 = a/d + a*ε/2^16
// The term a*ε/2^16 is the excess. Since ε < 1 and a < 2^16, the excess is
// always < 1, so the floor can overshoot by at most 1.
// Example: a=32768, d=3 -> inv16(3)=21846, ε=21846-65536/3=0.667,
// excess = 32768*0.667/65536 = 0.333. The fractional part of a/d is 2/3,
// and 2/3 + 1/3 >= 1, so floor(32768*21846/2^16) = 10923 instead of
// floor(32768/3) = 10922. The correction step detects and fixes this.
// (storing the inverted period as uint32_t would make the correction step obsolete, but at
// the cost of storing 4 instead of 2 bytes)
uint16_t div16(uint16_t a, uint16_t b, uint16_t inv_b) {
    const uint16_t d = ((uint32_t)a * (uint32_t)inv_b) >> 16;
    if (d * b > a)
        return d - 1;  // correction
    return d;
}
// test for all combinations of (i, j) whether the multiply-and-shift approach
// delivers the expected result
int main(void) {
    uint32_t nok = 0;
    uint32_t ok = 0;
    for (uint32_t j = 2; j <= 65535; j++) {
        const uint16_t inv16_j = inv16((uint16_t)j);
        for (uint32_t i = 1; i <= 65535; i++) {
            const uint16_t expected = (uint16_t)i / (uint16_t)j;
            const uint16_t b = div16(i, j, inv16_j);
            if (expected != b) {
                printf("i=%-5u j=%-5u: expected=%-5u div16=%-5u\n", i, j, expected, b);
                nok++;
            } else {
                ok++;
            }
        }
        printf("j=%u, ok=%u, nok=%u\n", j, ok, nok);
    }
    return 0;
}
Coming back to our fade-function, we now pass the pre-computed inverted period as inv_period to
the function and use the above multiply-and-shift approach. Since we don’t need 100% accuracy (we
are just dealing with LED brightness calculations), we omit the described adjustment and accept the
resulting error of at most 1. To be efficient, inv_period must be calculated once at the call
site and then cached. This results in the optimized_fade_func:
uint8_t optimized_fade_func(uint32_t t, uint16_t period, uint16_t inv_period) {
    // pre-calculated fade-on function at x={0,16,...,256}
    static constexpr uint8_t lut8[] = {
        0, 0, 3, 7, 13, 22, 33, 49, 68, 91, 118, 148, 179, 208, 232, 248, 255
    };
    if (t + 1 >= period) return 255;
    // tnorm = ((t << 8) * inv_period) >> 16 = (t << 8) / period, with the
    // correction step omitted; the mask clamps the result to 0..255
    const auto tnorm = ((t * (uint32_t)inv_period) >> 8) & 0xff;
    const auto i = tnorm >> 4;  // segment index (0..15)
    const auto y0 = lut8[i];
    const auto y1 = lut8[i + 1];
    const auto x0 = i << 4;     // segment start in normalized space
    return (((tnorm - x0) * (y1 - y0)) >> 4) + y0;
}
// ...
// calculate once and store
const uint16_t inv_period = ((1UL << 16) + (uint32_t)period - 1) / (uint32_t)period;
// ...
// then use it again and again
const auto val = optimized_fade_func(t, period, inv_period);
// ...
For completeness, the same trick also applies to the costly modulo operation in
the JLed Update method, which is likewise called on the hot path. Instead of using the modulo operator
%, which relies on an expensive division operation, we use div16 and calculate
uint16_t mod16(uint16_t a, uint16_t b, uint16_t inv_b) {
    const uint16_t d = div16(a, b, inv_b);
    return a - d * b;
}
Does it pay off?
As is often the case with optimizations, you end up trading one property for another. In this
instance, we gain speed at the cost of increased memory usage and code complexity. Because we now
pass an inv_period to the fade function, it must be pre-computed and stored (cached) somewhere in
advance.
Memory
The breathe effect of JLed composes two fade-functions with different periods, resulting in 4
additional bytes to store the pre-computed inverted periods (which can be different) plus one
pre-computed inverted period in the JLed object to avoid the modulo operation mentioned. In total
this sums up to 6 additional bytes per JLed object.
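As a sketch of where those bytes go (the struct and field names are hypothetical, not JLed’s actual members), the cached per-object state could look like this:

```cpp
#include <cstdint>

// hypothetical layout; JLed's real classes are organized differently
struct InvPeriodCache {
    uint16_t inv_fade_on_period;   // breathe effect, fade-on half
    uint16_t inv_fade_off_period;  // breathe effect, fade-off half
    uint16_t inv_period;           // avoids the modulo in Update()
};

static_assert(sizeof(InvPeriodCache) == 6, "6 additional bytes per object");
```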
Performance
Let’s compare the performance before and after the optimization for various MCUs:
| MCU | Unoptimized (ns/op) | Optimized (ns/op) |
|---|---|---|
| Arduino Nano | 47900 | 17479 |
| RP2040 | 529 | 362 |
| RP2350 | 282 | 256 |
| RP2350 (RISC V Core) | 337 | 210 |
| STM Nucleo F401RE | 550 | 476 |
| ESP8266 | 1697 | 687 |
| ESP32-S3 | 203 | 193 |
| ESP32-C3 | 345 | 213 |
The 8-bit AVR (Arduino Nano) benefits the most since it is relatively slow and has no hardware divider, so every division is a slow software routine. On MCUs with hardware dividers, the optimized and unoptimized numbers are close, making the optimization nearly irrelevant there.
According to the table, a single calculation takes about 47900 ns on an Arduino Nano with an ATmega328P using the unoptimized version. Ignoring all other overhead, this allows at most \( \frac{1}{47900 \times 10^{-9}} \approx 20877 \) updates per second. If we update the LEDs at 100 Hz, we could theoretically update about 208 LEDs simultaneously.
But since the ATmega328P has only 6 PWM-capable GPIOs, let’s calculate how much CPU time updating 6 LEDs at 100 Hz would consume, now assuming it takes \( t_{led} = 100000\,\text{ns} = 100\,\mu s \) per Update per LED to account for additional overhead. The time spent updating 6 LEDs during one second is \( 6 \times t_{led} \times 100 = 60\,\text{ms} \). That means that in every second, 60 ms or 6% of CPU time is spent updating the 6 LEDs, leaving 94% of CPU time for other tasks.
Summary
The described method is a good optimization if every CPU cycle counts and the MCU lacks a hardware divider, like the ATmega328P. In the end I decided against the optimization, because where it would benefit most (on 8-bit AVRs), the additional memory consumption also hurts most: an Arduino Nano has only 2 KB of SRAM. On the other hand, the ATmega328P can easily drive 6 JLed objects simultaneously at only 6% CPU consumption. On modern MCUs the effect is negligible, and fast MCUs like the RP2040, RP2350 or the ESP32 family are widely adopted today.
Bonus - how to check how the compiler implements the division
If you are using PlatformIO, you can easily check whether the emitted code uses hardware div instructions
or calls a software routine. To do so, add -save-temps=obj -Og to your build_flags and compile the project. If the target platform is e.g. ESP8266, then
afterwards you will find the corresponding assembly in .pio/build/esp8266/src/bench.ino.cpp.s (if
your sketch is called bench.ino). My test function is called divtest, so let’s look for that
function in the assembly.
uint16_t divtest(uint16_t a, uint16_t b) {
    return a / b;
}
This is what the compiler generates for the ESP8266:
_Z7divtesttt:
    addi    sp, sp, -16
    extui   a3, a3, 0, 16
    extui   a2, a2, 0, 16
    s32i.n  a0, sp, 12
    call0   __udivsi3       ; software udiv called here
    l32i.n  a0, sp, 12
    extui   a2, a2, 0, 16
    addi    sp, sp, 16
    ret.n
We see that on the ESP8266 platform a software implementation is called. Next, have a look
at the code generated for the RP2350, where the compiler emits a udiv instruction:
_Z7divtesttt:
    udiv    r0, r0, r1
    bx      lr
And finally, let’s compare this with the code generated for the RP2040, where again no
hardware div instruction is used:
_Z7divtesttt:
    push    {r4, lr}
    bl      __aeabi_uidiv   ; calls software divider
    uxth    r0, r0
    pop     {r4, pc}