Hacksies

Dabbling in x86 assembly

Introduction

I have some free time now, so I want to do something that’s been in the back of my mind for a long time: learning assembly. I don’t really have a strong preference for which variant, but I’m doing x86 assembly just because I have an x86 machine, so it’s a little easier to get started with. I have no idea what I’m talking about, so you definitely shouldn’t take this as an instructional post. It’s more about recording my learning process for posterity and writing down the things that I find interesting along the way.

After playing around with Matt Godbolt’s amazing Compiler Explorer, watching some old Low Level videos, and learning about the call stack and how it’s conventionally handled in x86 (this is actually really fascinating, and I can’t believe I didn’t know about it earlier), I feel ready to start trying some things.

The challenge

As a starting point, I want to see if I can implement my own assembly version of a function that calculates rolling sums of an array of numbers. Something like this:

void calculate_rolling_sums_c(const float *x,
                              float *rolling_sums,
                              unsigned n,
                              unsigned kernel_size) {
    float prev_rolling_sum = 0.0f;

    for (unsigned i = 0; i < n; i++) {
        // Start the current rolling sum as the previous rolling sum
        rolling_sums[i] = prev_rolling_sum;

        // Add the current element
        rolling_sums[i] += x[i];

        // Subtract the newly out of bounds element
        if (i >= kernel_size) {
            rolling_sums[i] -= x[i - kernel_size];
        }

        prev_rolling_sum = rolling_sums[i];
    }
}

My goals are:

  • Write a working assembly version of this function
  • Make my assembly version run at least as fast as the C version when compiled using GCC1 with -O2 optimization2
  • Learn something interesting

This task might seem a bit random, but it’s actually quite relevant for me because I’m interested in AutoTune, which (at least in the original patent) is based on quickly calculating autocorrelations in a similar, rolling fashion. This is like a baby version of that. Maybe I’ll return to this topic in a later post.

First attempt

I start trudging my way through by writing small snippets of C into the Compiler Explorer to answer basic questions like “which registers do my arguments get put into?”, “how do I set a floating point register to zero?”, or “how do I jump if some register is less than another register?”, combined with Googling and ChatGPTing the corresponding output assembly to try to understand what’s happening. This gives me a first draft that looks like this:

.global calculate_rolling_sums_asm1
.intel_syntax noprefix

calculate_rolling_sums_asm1:
        // push rbp
        // mov rbp, rsp

        // rdi: const float *x
        // rsi: float *rolling_sums
        // edx: unsigned n
        // ecx: unsigned kernel_size

        // r8 = 4 * kernel_size;
        mov r8d, ecx
        imul r8, 4

        // float prev_rolling_sum = 0.0f;
        pxor xmm0, xmm0

        // unsigned i = 0;
        mov eax, 0

        jmp .comparison

        .loop_body:

        // float rolling_sum = prev_rolling_sum;
        movss xmm1, xmm0

        // Add the current element
        addss xmm1, DWORD PTR [rdi]

        // Jump to .skip_subtract_newly_out_of_bounds_element
        // if i < kernel_size
        cmp eax, ecx
        jb .skip_subtract_newly_out_of_bounds_element

        // Subtract the newly out of bounds element
        mov r9, rdi
        sub r9, r8
        subss xmm1, DWORD PTR [r9]
        .skip_subtract_newly_out_of_bounds_element:

        // rolling_sums[i] = rolling_sum;
        movss DWORD PTR [rsi], xmm1

        // prev_rolling_sum = rolling_sum;
        movss xmm0, xmm1
        
        // x++; rolling_sums++; i++;
        add rdi, 4
        add rsi, 4
        add eax, 1

        // Jump back to .loop_body if i < n
        .comparison:
        cmp eax, edx
        jb .loop_body

        // pop rbp
        ret

Ok, so maybe not that pretty, but it does the job and is quite analogous to the C code from before. I heard that using CPU registers is much faster than using main memory, so I try to be clever by avoiding memory accesses altogether and using only registers to perform the calculations. And because I’m not touching the stack at all, I omit the usual

push rbp
mov rbp, rsp

and

pop rbp

function prologue and epilogue that would usually prepare the stack frame for my function. At this point, I already learned a few things:

  • .global <name> exposes <name> to the linker.
  • x86 registers have nested 32 and 64-bit versions, with the 32-bit version living in the lower half of the full 64-bit version. There’s also a similar principle for smaller (16-bit, 8-bit) register sizes.
  • mov dst, src zero-fills the upper half of the full 64-bit version of dst if dst and src are 32-bit. But not so for smaller register sizes.
  • pxor xmm0, xmm0 is a common way to set a floating point register to zero.
  • cmp op1, op2 followed by jb .my_label jumps to .my_label if and only if op1 < op2 (unsigned integer comparison).
  • mov can accept, for each of its operands, either a register or a memory address. This is common for CISC, but not for RISC. For the latter, there are usually different instructions depending on whether the source and destination are registers or memory addresses.

I also learned that actually assembling this code and calling it from C is pleasantly straightforward:

as calculate_rolling_sums_asm1.s -o calculate_rolling_sums_asm1.o
gcc -o hello_world -O2 -Wall \
    calculate_rolling_sums_c.c \
    calculate_rolling_sums_asm1.o \
    hello_world.c

To summarize the files at this point:

  • calculate_rolling_sums_c.c contains the C version from above,
  • calculate_rolling_sums_asm1.s contains my homemade assembly version, and
  • hello_world.c contains some skeleton code that calls both, times them, and ensures their outputs are the same within a small tolerance.

Doing a quick-and-dirty speed test3 like the one here, I get the following results:

C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50

ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000

Ok, so apparently I’m not so clever after all. My version is noticeably slower than the GCC version. But not bad for a first try? Let’s have a look at how GCC compiles the C code to hopefully get a hint about what we can improve. This is also very convenient to do with GCC:

gcc -S -O2 -masm=intel calculate_rolling_sums_c.c

This generates an output file, calculate_rolling_sums_c.s, with the following contents:

        .file   "calculate_rolling_sums_c.c"
        .intel_syntax noprefix
        .text
        .p2align 4
        .globl  calculate_rolling_sums_c
        .type   calculate_rolling_sums_c, @function
calculate_rolling_sums_c:
.LFB0:
        .cfi_startproc
        mov     r8d, ecx
        test    edx, edx
        je      .L1
        neg     ecx
        mov     edx, edx
        pxor    xmm0, xmm0
        xor     eax, eax
        .p2align 6
        .p2align 4
        .p2align 3
.L4:
        movss   DWORD PTR [rsi+rax*4], xmm0
        addss   xmm0, DWORD PTR [rdi+rax*4]
        movss   DWORD PTR [rsi+rax*4], xmm0
        cmp     eax, r8d
        jb      .L3
        mov     r9d, ecx
        subss   xmm0, DWORD PTR [rdi+r9*4]
        movss   DWORD PTR [rsi+rax*4], xmm0
.L3:
        add     rax, 1
        add     ecx, 1
        cmp     rax, rdx
        jne     .L4
.L1:
        ret
        .cfi_endproc
.LFE0:
        .size   calculate_rolling_sums_c, .-calculate_rolling_sums_c
        .ident  "GCC: (GNU) 15.2.1 20260123 (Red Hat 15.2.1-7)"
        .section        .note.GNU-stack,"",@progbits

This is quite busy, so let’s first work out what we can ignore. The lines starting with “.” are assembler directives, and most of them aren’t that important. Some of them contain metadata about the source file, what version of GCC was used, and so on. .cfi_startproc and .cfi_endproc are used (I think) by certain debugging tools to keep track of the call stack. .p2align is used to add empty instructions to the assembled code to align the following instruction with a power-of-two number of bytes, which I believe can make jumping to that instruction faster in some cases. None of these are relevant for our purposes, though. So after removing these distracting assembler directives, adding some comments to document the arguments, and giving the labels more suggestive names, we get something like this:

.global calculate_rolling_sums_c
.intel_syntax noprefix

calculate_rolling_sums_c:
        // rdi: const float *x
        // rsi: float *rolling_sums
        // edx: unsigned n
        // ecx: unsigned kernel_size

        mov r8d, ecx
        test edx, edx
        je .quit
        neg ecx
        mov edx, edx
        pxor xmm0, xmm0
        xor eax, eax

.loop_body:
        movss DWORD PTR [rsi+rax*4], xmm0
        addss xmm0, DWORD PTR [rdi+rax*4]
        movss DWORD PTR [rsi+rax*4], xmm0
        cmp eax, r8d
        jb .comparison
        mov r9d, ecx
        subss xmm0, DWORD PTR [rdi+r9*4]
        movss DWORD PTR [rsi+rax*4], xmm0

.comparison:
        add rax, 1
        add ecx, 1
        cmp rax, rdx
        jne .loop_body

.quit:
        ret

Wow. That’s very compact. At first glance, without even understanding anything, I’m very surprised to see things like movss DWORD PTR [rsi+rax*4], xmm0. Apparently you can have simple math expressions inside of operands of certain instructions! I had no idea you could do that, which makes my code needlessly long. Now I’m starting to see why x86 assembly is sometimes called an interpreted language.

Next, let’s see if we can interpret the code’s meaning a little bit by adding some comments:

.global calculate_rolling_sums_c
.intel_syntax noprefix

calculate_rolling_sums_c:
        // rdi: const float *x
        // rsi: float *rolling_sums
        // edx: unsigned n
        // ecx: unsigned kernel_size

        // r8d = kernel_size;
        mov r8d, ecx

        // Return if n == 0
        test edx, edx
        je .quit

        // ecx = -kernel_size;
        neg ecx

        // edx = edx; (zero out the upper 32 bits of rdx)
        mov edx, edx

        // float rolling_sum = 0.0f;
        pxor xmm0, xmm0

        // unsigned i = 0;
        xor eax, eax

.loop_body:
        // rolling_sums[i] = rolling_sum;
        movss DWORD PTR [rsi+rax*4], xmm0

        // rolling_sum += x[i];
        addss xmm0, DWORD PTR [rdi+rax*4]

        // rolling_sums[i] = rolling_sum;
        movss DWORD PTR [rsi+rax*4], xmm0

        // Jump to .comparison if i < kernel_size
        cmp eax, r8d
        jb .comparison

        // r9d = -kernel_size + i;
        mov r9d, ecx

        // rolling_sum -= x[-kernel_size + i];
        subss xmm0, DWORD PTR [rdi+r9*4]

        // rolling_sums[i] = rolling_sum;
        movss DWORD PTR [rsi+rax*4], xmm0

.comparison:
        // i++;
        add rax, 1

        // ecx = -kernel_size + i;
        add ecx, 1

        // Jump to .loop_body if i != n
        cmp rax, rdx
        jne .loop_body

.quit:
        ret

So in my own words, the GCC-compiled solution works like this:

  • Set r8d to kernel_size for later use
  • Return if n == 0
  • Set ecx to -kernel_size for later use
  • Zero out the upper 32 bits of rdx (n is stored in edx) so that it can later be safely used as a 64-bit unsigned integer
  • Initialize the rolling_sum variable, stored in the floating point register xmm0, to zero
  • Initialize the counting variable i, stored in eax, to zero
  • Enter the loop body:
    • Write xmm0 to rolling_sums[i] (superfluous)
    • Add the ith element of x to xmm0
    • Write xmm0 to rolling_sums[i]
    • If i >= kernel_size (r8d is used here):
      • Subtract x[-kernel_size + i] from xmm0 (ecx is used here)
      • Write xmm0 to rolling_sums[i]
    • Increment i by 1
    • Increment ecx by 1 (so that it holds -kernel_size + i)
    • Quit if i == n

So it’s actually quite straightforward. It surprises me that there seems to be a superfluous write to rolling_sums[i] at the start of the loop body. It also surprises me that ecx is copied to r9d before being used in the subsequent subss instruction. It seems to me that ecx could be used directly, and r9d could be omitted. But I’ve heard that relying on intuition to evaluate assembly code, especially for modern processors, doesn’t really work, so I’ll reserve my judgement for now.

This solution seems quite similar to mine, with a few differences:

  • Using xor eax, eax to initialize the counting variable instead of mov eax, 0
  • Using expressions (e.g. [rsi+rax*4]) to avoid using extra instructions and registers
  • Using just one floating point register (xmm0)

As for the first item, apparently this is just a better way of setting a register to zero that gets encoded slightly more compactly than the mov equivalent, so I’ll do it that way from now on. For the latter two, I’ll introduce them one at a time.

Second attempt: use expressions for memory addressing

For the second attempt, let’s use xor eax, eax to initialize the counting variable i to zero, as well as use expressions for memory addressing. That gives us something like this:

.global calculate_rolling_sums_asm2
.intel_syntax noprefix

calculate_rolling_sums_asm2:
        // push rbp
        // mov rbp, rsp

        // rdi: const float *x
        // rsi: float *rolling_sums
        // edx: unsigned n
        // ecx: unsigned kernel_size

        // r8 = -kernel_size;
        mov r8d, ecx
        neg r8

        // float prev_rolling_sum = 0.0f;
        pxor xmm0, xmm0

        // unsigned i = 0;
        xor eax, eax

        jmp .comparison

        .loop_body:

        // float rolling_sum = prev_rolling_sum;
        movss xmm1, xmm0

        // Add the current element
        addss xmm1, DWORD PTR [rdi]

        // Jump to .skip_subtract_newly_out_of_bounds_element
        // if i < kernel_size
        cmp eax, ecx
        jb .skip_subtract_newly_out_of_bounds_element

        // Subtract the newly out of bounds element
        subss xmm1, DWORD PTR [rdi+r8*4]
        .skip_subtract_newly_out_of_bounds_element:

        // rolling_sums[i] = rolling_sum;
        movss DWORD PTR [rsi], xmm1

        // prev_rolling_sum = rolling_sum;
        movss xmm0, xmm1
        
        // x++; rolling_sums++; i++;
        add rdi, 4
        add rsi, 4
        add eax, 1

        // Jump back to .loop_body if i < n
        .comparison:
        cmp eax, edx
        jb .loop_body

        // pop rbp
        ret

This lets us avoid r9 altogether and drops the two instructions in the loop body that were previously used to prepare it. Taking a look at the speed comparison, we get something like this:

C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50

ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000

ASM2 (using expressions):
Total cycles: 5,838,649,440
Cycles per call: 579,000 +/- 4,000

Even though these two changes made our code smaller and more readable, they seem not to have helped the execution time. Next, let’s try to use just one floating point register like in the GCC solution.

Third attempt: use only one floating point register

Here we rework the code to not use the second floating point register, xmm1, similarly to the GCC-compiled code. This second register isn’t actually necessary, and I only wrote it that way because I was translating quite directly from my C code. Now we get something like this:

.global calculate_rolling_sums_asm3
.intel_syntax noprefix

calculate_rolling_sums_asm3:
        // push rbp
        // mov rbp, rsp

        // rdi: const float *x
        // rsi: float *rolling_sums
        // edx: unsigned n
        // ecx: unsigned kernel_size

        // r8 = -kernel_size;
        mov r8d, ecx
        neg r8

        // float rolling_sum = 0.0f;
        pxor xmm0, xmm0

        // unsigned i = 0;
        xor eax, eax

        jmp .comparison

        .loop_body:

        // Add the current element
        addss xmm0, DWORD PTR [rdi]

        // Jump to .skip_subtract_newly_out_of_bounds_element
        // if i < kernel_size
        cmp eax, ecx
        jb .skip_remove_newly_out_of_bounds_element

        // Subtract the newly out of bounds element
        subss xmm0, DWORD PTR [rdi+r8*4]
        .skip_remove_newly_out_of_bounds_element:

        // rolling_sums[i] = rolling_sum;
        movss DWORD PTR [rsi], xmm0
        
        // x++; rolling_sums++; i++;
        add rdi, 4
        add rsi, 4
        add eax, 1

        // Jump back to .loop_body if i < n
        .comparison:
        cmp eax, edx
        jb .loop_body

        // pop rbp
        ret

And the speed comparison:

C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50

ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000

ASM2 (using expressions):
Total cycles: 5,838,649,440
Cycles per call: 579,000 +/- 4,000

ASM3 (using expressions, using only one floating point register):
Total cycles: 4,659,686,450
Cycles per call: 463,130 +/- 50

Ok! So the goal is achieved, at least within the precision that this method of measuring allows. At this point, I don’t want to spend any more time on this. But I will try one more thing that I’ve been curious about: can ChatGPT produce a solution that’s faster than GCC’s?

Using ChatGPT as a C compiler

Plugging the prompt

Compile this C code to x86-64 assembly. Make it run as fast as possible.

<C code>

into GPT-5.2 on March 19, 2026 produced the following assembly:4

.global calculate_rolling_sums_asm4
.intel_syntax noprefix

calculate_rolling_sums_asm4:
    // rdi: const float *x
    // rsi: float *rolling_sums
    // edx: unsigned n
    // ecx: unsigned kernel_size

    test    edx, edx                # n == 0 ?
    je      .done

    pxor    xmm0, xmm0              # prev_rolling_sum = 0.0

    xor     r8d, r8d                # i = 0

    # -------- First phase: i < kernel_size --------
.loop1:
    cmp     r8d, ecx
    jae     .loop2_entry

    cmp     r8d, edx
    jae     .done

    movss   xmm1, DWORD PTR [rdi + r8*4]   # load x[i]
    addss   xmm0, xmm1                     # sum += x[i]

    movss   DWORD PTR [rsi + r8*4], xmm0   # store result

    inc     r8d
    jmp     .loop1

# -------- Second phase: i >= kernel_size --------
.loop2_entry:
    cmp     r8d, edx
    jae     .done

.loop2:
    movss   xmm1, DWORD PTR [rdi + r8*4]           # x[i]
    addss   xmm0, xmm1                             # sum += x[i]

    mov     r9d, r8d
    sub     r9d, ecx                               # i - kernel_size
    movss   xmm2, DWORD PTR [rdi + r9*4]           # x[i - k]
    subss   xmm0, xmm2                             # sum -= old

    movss   DWORD PTR [rsi + r8*4], xmm0           # store

    inc     r8d
    cmp     r8d, edx
    jb      .loop2

.done:
    ret

To be honest, I’m not sure what to think of this. It uses not one but three floating point registers. The structure is different in that it separates the main loop into two loops: one where i < kernel_size, and one where i >= kernel_size. In the first loop body, the i < n check is at the beginning, whereas in the second loop body, the i < n check is at the end. I suppose this saves a few jumps. It also uses the inc instruction to increment the counter variable i, which I didn’t know existed. Anyway, here’s how it fares in the speed comparison:

C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50

ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000

ASM2 (using expressions):
Total cycles: 5,838,649,440
Cycles per call: 579,000 +/- 4,000

ASM3 (using expressions, using only one floating point register):
Total cycles: 4,659,686,450
Cycles per call: 463,130 +/- 50

ASM4 (ChatGPT):
Total cycles: 4,662,425,150
Cycles per call: 463,120 +/- 40

So definitely not bad. It’s on par with GCC’s solution and my best solution, but I don’t know why. In my solution, removing a floating point register seemed to make the function run faster. But ChatGPT’s solution uses three floating point registers while being about as fast as mine. ChatGPT’s solution has the same number of memory accesses per loop iteration as mine, but it tends to explicitly copy array elements to registers before using them. This pattern is avoided in the GCC solution.

Conclusion

I tried my best to capture the main differences between the five solutions in the table below. I don’t know enough about assembly yet to really know which metrics are important in this context, but I think it’s a nice overview nonetheless.

Solution | Additional integer registers used5 | Floating point registers used | Number of memory accesses per loop6 | Cycles per call
GCC with -O2 | rax, r8d, r9 | xmm0 | 3 to 5 | 463,120 +/- 50
ASM1 (first attempt) | eax, r8, r9 | xmm0, xmm1 | 2 to 3 | 579,000 +/- 5,000
ASM2 (using expressions) | eax, r8 | xmm0, xmm1 | 2 to 3 | 579,000 +/- 4,000
ASM3 (using expressions, using only one floating point register) | eax, r8 | xmm0 | 2 to 3 | 463,130 +/- 50
ASM4 (ChatGPT) | r8, r9 | xmm0, xmm1, xmm2 | 2 to 3 | 463,120 +/- 40

Anyway, that was an interesting little diversion. Writing some homemade assembly (with a lot of help from Compiler Explorer) was very insightful and much more approachable than I expected. I look forward to doing more of it soon.

Appendix


  1. GCC version 15.2.1.

  2. I choose -O2 optimization because it’s supposed to be a good “middle ground” optimization, balancing execution time and executable size, while still being basically guaranteed to run correctly.

  3. I call the function 10,000 times in a for loop on an array of 100,000 random numbers sampled from $\text{Uniform}(0,1)$. The total number of CPU cycles is reported, along with the sample median and interquartile range of the number of cycles for each repeat.

  4. The assembly output from ChatGPT is copied verbatim with the following exceptions: I changed the name of the function, and I added four comments at the top to describe the arguments.

  5. By “additional integer registers”, I mean integer registers besides those used for passing the function’s arguments (rdi, rsi, edx, and ecx).

  6. The first number is for i < kernel_size, and the second is for i >= kernel_size.