Dabbling in x86 assembly
Introduction
I have some free time now, so I want to do something that’s been in the back of my mind for a long time: learning assembly. I don’t really have a strong preference for which variant, but I’m doing x86 assembly just because I have an x86 machine, so it’s a little easier to get started with. I have no idea what I’m talking about, so you definitely shouldn’t take this as an instructional post. It’s more about recording my learning process for posterity and writing down the things that I find interesting along the way.
After playing around with Matt Godbolt’s amazing Compiler Explorer, watching some old Low Level videos, and learning about the call stack and how it’s conventionally handled in x86 (this is actually really fascinating, and I can’t believe I didn’t know about it earlier), I feel ready to start trying some things.
The challenge
As a starting point, I want to see if I can implement my own assembly version of a function that calculates rolling sums of an array of numbers. Something like this:
void calculate_rolling_sums_c(const float *x,
float *rolling_sums,
unsigned n,
unsigned kernel_size) {
float prev_rolling_sum = 0.0f;
for (unsigned i = 0; i < n; i++) {
// Start the current rolling sum as the previous rolling sum
rolling_sums[i] = prev_rolling_sum;
// Add the current element
rolling_sums[i] += x[i];
// Subtract the newly out of bounds element
if (i >= kernel_size) {
rolling_sums[i] -= x[i - kernel_size];
}
prev_rolling_sum = rolling_sums[i];
}
}
My goals are:
- Write a working assembly version of this function
- Make my assembly version run at least as fast as the C version when compiled using GCC1 with -O2 optimization2
- Learn something interesting
This task might seem a bit random, but it’s actually quite relevant for me because I’m interested in AutoTune, which (at least in the original patent) is based on quickly calculating autocorrelations in a similar, rolling fashion. This is like a baby version of that. Maybe I’ll return to this topic in a later post.
First attempt
I start trudging my way through by writing small snippets of C into the Compiler Explorer to answer basic questions like “which registers do my arguments get put into?”, “how do I set a floating point register to zero?”, or “how do I jump if some register is less than another register?”, combined with Googling and ChatGPTing the corresponding output assembly to try to understand what’s happening. This gives me a first draft that looks like this:
.global calculate_rolling_sums_asm1
.intel_syntax noprefix
calculate_rolling_sums_asm1:
// push rbp
// mov rbp, rsp
// rdi: const float *x
// rsi: float *rolling_sums
// edx: unsigned n
// ecx: unsigned kernel_size
// r8 = 4 * kernel_size;
mov r8d, ecx
imul r8, 4
// float prev_rolling_sum = 0.0f;
pxor xmm0, xmm0
// unsigned i = 0;
mov eax, 0
jmp .comparison
.loop_body:
// float rolling_sum = prev_rolling_sum;
movss xmm1, xmm0
// Add the current element
addss xmm1, DWORD PTR [rdi]
// Jump to .skip_subtract_newly_out_of_bounds_element
// if i < kernel_size
cmp eax, ecx
jb .skip_subtract_newly_out_of_bounds_element
// Subtract the newly out of bounds element
mov r9, rdi
sub r9, r8
subss xmm1, DWORD PTR [r9]
.skip_subtract_newly_out_of_bounds_element:
// rolling_sums[i] = rolling_sum;
movss DWORD PTR [rsi], xmm1
// prev_rolling_sum = rolling_sum;
movss xmm0, xmm1
// x++; rolling_sums++; i++;
add rdi, 4
add rsi, 4
add eax, 1
// Jump to .comparison if i < n
.comparison:
cmp eax, edx
jb .loop_body
// pop rbp
ret
Ok, so maybe not that pretty, but it does the job and is quite analogous to the C code from before. I heard that using CPU registers is much faster than using main memory, so I try to be clever by avoiding memory accesses altogether and using only registers to perform the calculations. And because I’m not touching the stack at all, I omit the usual
push rbp
mov rbp, rsp
and
pop rbp
function prologue and epilogue that would usually prepare the stack frame for my function. At this point, I already learned a few things:
- .global <name> exposes <name> to the linker.
- x86 registers have nested 32- and 64-bit versions, with the 32-bit version living in the lower half of the full 64-bit version. There's also a similar principle for smaller (16-bit, 8-bit) register sizes.
- mov dst, src zero-fills the upper half of the full 64-bit version of dst if dst and src are 32-bit. But not so for smaller register sizes.
- pxor xmm0, xmm0 is a common way to set a floating point register to zero.
- cmp op1, op2 followed by jb .my_label jumps to .my_label if and only if op1 < op2 (unsigned integer comparison).
- mov can accept, for each of its operands, either a register or a memory address. This is common for CISC, but not for RISC. For the latter, there are usually different instructions depending on whether the source and destination are registers or memory addresses.
I also learned that actually assembling this code and calling it from C is pleasantly straightforward:
as calculate_rolling_sums_asm1.s -o calculate_rolling_sums_asm1.o
gcc -o hello_world -O2 -Wall \
calculate_rolling_sums_c.c \
calculate_rolling_sums_asm1.o \
hello_world.c
To summarize the files at this point:
- calculate_rolling_sums_c.c contains the C version from above,
- calculate_rolling_sums_asm1.s contains my homemade assembly version, and
- hello_world.c contains some skeleton code that calls both, times them, and ensures their outputs are the same within a small tolerance.
Doing a quick-and-dirty speed test3 like the one here, I get the following results:
C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50
ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000
Ok, so apparently I'm not so clever after all. My version is noticeably slower than the GCC version. But not bad for a first try? Let's have a look at how GCC compiles the C code to hopefully get a hint about what we can improve. This is also very convenient to do with GCC:
gcc -S -O2 -masm=intel calculate_rolling_sums_c.c
This generates an output file, calculate_rolling_sums_c.s, with the following contents:
.file "calculate_rolling_sums_c.c"
.intel_syntax noprefix
.text
.p2align 4
.globl calculate_rolling_sums_c
.type calculate_rolling_sums_c, @function
calculate_rolling_sums_c:
.LFB0:
.cfi_startproc
mov r8d, ecx
test edx, edx
je .L1
neg ecx
mov edx, edx
pxor xmm0, xmm0
xor eax, eax
.p2align 6
.p2align 4
.p2align 3
.L4:
movss DWORD PTR [rsi+rax*4], xmm0
addss xmm0, DWORD PTR [rdi+rax*4]
movss DWORD PTR [rsi+rax*4], xmm0
cmp eax, r8d
jb .L3
mov r9d, ecx
subss xmm0, DWORD PTR [rdi+r9*4]
movss DWORD PTR [rsi+rax*4], xmm0
.L3:
add rax, 1
add ecx, 1
cmp rax, rdx
jne .L4
.L1:
ret
.cfi_endproc
.LFE0:
.size calculate_rolling_sums_c, .-calculate_rolling_sums_c
.ident "GCC: (GNU) 15.2.1 20260123 (Red Hat 15.2.1-7)"
.section .note.GNU-stack,"",@progbits
This is quite busy, so let’s first work out what we can ignore. The lines starting with “.” are assembler directives, and most of them aren’t that important. Some of them contain metadata about the source file, what version of GCC was used, and so on. .cfi_startproc and .cfi_endproc are used (I think) by certain debugging tools to keep track of the call stack. .p2align is used to add empty instructions to the assembled code to align the following instruction with a power-of-two number of bytes, which I believe can make jumping to that instruction faster in some cases. None of these are relevant for our purposes, though. So after removing these distracting assembler directives, adding some comments to document the arguments, and giving the labels more suggestive names, we get something like this:
.global calculate_rolling_sums_c
.intel_syntax noprefix
calculate_rolling_sums_c:
// rdi: const float *x
// rsi: float *rolling_sums
// edx: unsigned n
// ecx: unsigned kernel_size
mov r8d, ecx
test edx, edx
je .quit
neg ecx
mov edx, edx
pxor xmm0, xmm0
xor eax, eax
.loop_body:
movss DWORD PTR [rsi+rax*4], xmm0
addss xmm0, DWORD PTR [rdi+rax*4]
movss DWORD PTR [rsi+rax*4], xmm0
cmp eax, r8d
jb .comparison
mov r9d, ecx
subss xmm0, DWORD PTR [rdi+r9*4]
movss DWORD PTR [rsi+rax*4], xmm0
.comparison:
add rax, 1
add ecx, 1
cmp rax, rdx
jne .loop_body
.quit:
ret
Wow. That’s very compact. At first glance, without even understanding anything, I’m very surprised to see things like movss DWORD PTR [rsi+rax*4], xmm0. Apparently you can have simple math expressions inside of operands of certain instructions! I had no idea you could do that, which makes my code needlessly long. Now I’m starting to see why x86 assembly is sometimes called an interpreted language.
Next, let’s see if we can interpret the code’s meaning a little bit by adding some comments:
.global calculate_rolling_sums_c
.intel_syntax noprefix
calculate_rolling_sums_c:
// rdi: const float *x
// rsi: float *rolling_sums
// edx: unsigned n
// ecx: unsigned kernel_size
// r8d = kernel_size;
mov r8d, ecx
// Return if n == 0
test edx, edx
je .quit
// ecx = -kernel_size;
neg ecx
// edx = edx; (zero out the upper 32 bits of rdx)
mov edx, edx
// float rolling_sum = 0.0f;
pxor xmm0, xmm0
// unsigned i = 0;
xor eax, eax
.loop_body:
// rolling_sums[i] = rolling_sum;
movss DWORD PTR [rsi+rax*4], xmm0
// rolling_sum += x[i];
addss xmm0, DWORD PTR [rdi+rax*4]
// rolling_sums[i] = rolling_sum;
movss DWORD PTR [rsi+rax*4], xmm0
// Jump to .comparison if i < kernel_size
cmp eax, r8d
jb .comparison
// r9d = -kernel_size + i;
mov r9d, ecx
// rolling_sum -= x[-kernel_size + i];
subss xmm0, DWORD PTR [rdi+r9*4]
// rolling_sums[i] = rolling_sum;
movss DWORD PTR [rsi+rax*4], xmm0
.comparison:
// i++;
add rax, 1
// ecx = -kernel_size + i;
add ecx, 1
// Jump to .loop_body if i != n
cmp rax, rdx
jne .loop_body
.quit:
ret
So in my own words, the GCC-compiled solution works like this:
- Set r8d to kernel_size for later use
- Return if n == 0
- Set ecx to -kernel_size for later use
- Zero out the upper 32 bits of rdx (n is stored in edx) so that it can later be safely used as a 64-bit unsigned integer
- Initialize the rolling_sum variable, stored in the floating point register xmm0, to zero
- Initialize the counting variable i, stored in eax, to zero
- Enter the loop body:
  - Write xmm0 to rolling_sums[i] (superfluous)
  - Add the ith element of x to xmm0
  - Write xmm0 to rolling_sums[i]
  - If i >= kernel_size (r8d is used here):
    - Subtract x[-kernel_size + i] from xmm0 (ecx is used here)
    - Write xmm0 to rolling_sums[i]
  - Increment i by 1
  - Increment ecx by 1 (so that it holds -kernel_size + i)
  - Quit if i == n
So it’s actually quite straightforward. It surprises me that there seems to be a superfluous write to rolling_sums[i] at the start of the loop body. It also surprises me that ecx is copied to r9d before being used in the subsequent subss instruction. It seems to me that ecx could be used directly, and r9d could be omitted. But I’ve heard that relying on intuition to evaluate assembly code, especially for modern processors, doesn’t really work, so I’ll reserve my judgement for now.
This solution seems quite similar to mine, with a few differences:
- Using xor eax, eax to initialize the counting variable instead of mov eax, 0
- Using expressions (e.g. [rsi+rax*4]) to avoid using extra instructions and registers
- Using just one floating point register (xmm0)
As for the first item, apparently this is just a better way of setting a register to zero that gets encoded slightly more compactly than the mov equivalent, so I’ll do it that way from now on. For the latter two, I’ll introduce them one at a time.
Second attempt: use expressions for memory addressing
For the second attempt, let’s use xor eax, eax to initialize the counting variable i to zero, as well as use expressions for memory addressing. That gives us something like this:
.global calculate_rolling_sums_asm2
.intel_syntax noprefix
calculate_rolling_sums_asm2:
// push rbp
// mov rbp, rsp
// rdi: const float *x
// rsi: float *rolling_sums
// edx: unsigned n
// ecx: unsigned kernel_size
// r8 = -kernel_size;
mov r8d, ecx
neg r8
// float prev_rolling_sum = 0.0f;
pxor xmm0, xmm0
// unsigned i = 0;
xor eax, eax
jmp .comparison
.loop_body:
// float rolling_sum = prev_rolling_sum;
movss xmm1, xmm0
// Add the current element
addss xmm1, DWORD PTR [rdi]
// Jump to .skip_subtract_newly_out_of_bounds_element
// if i < kernel_size
cmp eax, ecx
jb .skip_subtract_newly_out_of_bounds_element
// Subtract the newly out of bounds element
subss xmm1, DWORD PTR [rdi+r8*4]
.skip_subtract_newly_out_of_bounds_element:
// rolling_sums[i] = rolling_sum;
movss DWORD PTR [rsi], xmm1
// prev_rolling_sum = rolling_sum;
movss xmm0, xmm1
// x++; rolling_sums++; i++;
add rdi, 4
add rsi, 4
add eax, 1
// Jump to .comparison if i < n
.comparison:
cmp eax, edx
jb .loop_body
// pop rbp
ret
This lets us avoid using r9 and saves the two instructions in the loop body that were previously used to prepare it. Taking a look at the speed comparison, we get something like this:
C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50
ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000
ASM2 (using expressions):
Total cycles: 5,838,649,440
Cycles per call: 579,000 +/- 4,000
Even though these two changes made our code smaller and more readable, they seem not to have helped the execution time. Next, let’s try to use just one floating point register like in the GCC solution.
Third attempt: use only one floating point register
Here we rework the code to not use the second floating point register, xmm1, similarly to the GCC-compiled code. This second register isn’t actually necessary, and I only wrote it that way because I was translating quite directly from my C code. Now we get something like this:
.global calculate_rolling_sums_asm3
.intel_syntax noprefix
calculate_rolling_sums_asm3:
// push rbp
// mov rbp, rsp
// rdi: const float *x
// rsi: float *rolling_sums
// edx: unsigned n
// ecx: unsigned kernel_size
// r8 = -kernel_size;
mov r8d, ecx
neg r8
// float rolling_sum = 0.0f;
pxor xmm0, xmm0
// unsigned i = 0;
xor eax, eax
jmp .comparison
.loop_body:
// Add the current element
addss xmm0, DWORD PTR [rdi]
// Jump to .skip_remove_newly_out_of_bounds_element
// if i < kernel_size
cmp eax, ecx
jb .skip_remove_newly_out_of_bounds_element
// Subtract the newly out of bounds element
subss xmm0, DWORD PTR [rdi+r8*4]
.skip_remove_newly_out_of_bounds_element:
// rolling_sums[i] = rolling_sum;
movss DWORD PTR [rsi], xmm0
// x++; rolling_sums++; i++;
add rdi, 4
add rsi, 4
add eax, 1
// Jump to .comparison if i < n
.comparison:
cmp eax, edx
jb .loop_body
// pop rbp
ret
And the speed comparison:
C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50
ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000
ASM2 (using expressions):
Total cycles: 5,838,649,440
Cycles per call: 579,000 +/- 4,000
ASM3 (using expressions, using only one floating point register):
Total cycles: 4,659,686,450
Cycles per call: 463,130 +/- 50
Ok! So the goal is achieved, at least within the precision that this method of measuring allows. At this point, I don’t want to spend any more time on this. But I will try one more thing that I’ve been curious about: can ChatGPT produce a solution that’s faster than GCC’s?
Using ChatGPT as a C compiler
Plugging the prompt
Compile this C code to x86-64 assembly. Make it run as fast as possible.
<C code>
into GPT-5.2 on March 19, 2026 produced the following assembly:4
.global calculate_rolling_sums_asm4
.intel_syntax noprefix
calculate_rolling_sums_asm4:
// rdi: const float *x
// rsi: float *rolling_sums
// edx: unsigned n
// ecx: unsigned kernel_size
test edx, edx # n == 0 ?
je .done
pxor xmm0, xmm0 # prev_rolling_sum = 0.0
xor r8d, r8d # i = 0
# -------- First phase: i < kernel_size --------
.loop1:
cmp r8d, ecx
jae .loop2_entry
cmp r8d, edx
jae .done
movss xmm1, DWORD PTR [rdi + r8*4] # load x[i]
addss xmm0, xmm1 # sum += x[i]
movss DWORD PTR [rsi + r8*4], xmm0 # store result
inc r8d
jmp .loop1
# -------- Second phase: i >= kernel_size --------
.loop2_entry:
cmp r8d, edx
jae .done
.loop2:
movss xmm1, DWORD PTR [rdi + r8*4] # x[i]
addss xmm0, xmm1 # sum += x[i]
mov r9d, r8d
sub r9d, ecx # i - kernel_size
movss xmm2, DWORD PTR [rdi + r9*4] # x[i - k]
subss xmm0, xmm2 # sum -= old
movss DWORD PTR [rsi + r8*4], xmm0 # store
inc r8d
cmp r8d, edx
jb .loop2
.done:
ret
To be honest, I’m not sure what to think of this. It uses not one but three floating point registers. The structure is different in that it separates the main loop into two loops: one where i < kernel_size, and one where i >= kernel_size. In the first loop body, the i < n check is at the beginning, whereas in the second loop body, the i < n check is at the end. I suppose this saves a few jumps. It also uses the inc instruction to increment the counter variable i, which I didn’t know existed. Anyway, here’s how it fares in the speed comparison:
C (GCC with -O2):
Total cycles: 4,664,343,418
Cycles per call: 463,120 +/- 50
ASM1 (first attempt):
Total cycles: 5,836,401,446
Cycles per call: 579,000 +/- 5,000
ASM2 (using expressions):
Total cycles: 5,838,649,440
Cycles per call: 579,000 +/- 4,000
ASM3 (using expressions, using only one floating point register):
Total cycles: 4,659,686,450
Cycles per call: 463,130 +/- 50
ASM4 (ChatGPT):
Total cycles: 4,662,425,150
Cycles per call: 463,120 +/- 40
So definitely not bad. It's on par with GCC's solution and my best solution, but I don't know why. In my solution, removing a floating point register seemed to make the function run faster, yet ChatGPT's solution uses three floating point registers while being about as fast as mine. ChatGPT's solution has the same number of memory accesses per loop iteration as mine, but it tends to explicitly copy array elements to registers before using them. This pattern is avoided in the GCC solution.
Conclusion
I tried my best to capture the main differences between the five solutions in the table below. I don’t know enough about assembly yet to really know which metrics are important in this context, but I think it’s a nice overview nonetheless.
| Solution | Additional integer registers used5 | Floating point registers used | Number of memory accesses per loop6 | Cycles per call |
|---|---|---|---|---|
| GCC with -O2 | rax, r8d, r9 | xmm0 | 3 to 5 | 463,120 +/- 50 |
| ASM1 (first attempt) | eax, r8, r9 | xmm0, xmm1 | 2 to 3 | 579,000 +/- 5,000 |
| ASM2 (using expressions) | eax, r8 | xmm0, xmm1 | 2 to 3 | 579,000 +/- 4,000 |
| ASM3 (using expressions, using only one floating point register) | eax, r8 | xmm0 | 2 to 3 | 463,130 +/- 50 |
| ASM4 (ChatGPT) | r8, r9 | xmm0, xmm1, xmm2 | 2 to 3 | 463,120 +/- 40 |
Anyway, that was an interesting little diversion. Writing some homemade assembly (with a lot of help from Compiler Explorer) was very insightful and much more approachable than I expected. I look forward to doing more of it soon.
Appendix
- Everything was compiled and run on an Intel Core i5-10500H Processor.
- GitHub repository
1. GCC version 15.2.1. ↩︎
2. I chose -O2 optimization because it's supposed to be a good "middle ground" optimization, balancing execution time and executable size, while still being basically guaranteed to run correctly. ↩︎
3. I call the function 10,000 times in a for loop on an array of 100,000 random numbers sampled from $\text{Uniform}(0,1)$. The total number of CPU cycles is reported, along with the sample median and interquartile range of the number of cycles for each repeat. ↩︎
4. The assembly output from ChatGPT is copied verbatim with the following exceptions: I changed the name of the function, and I added four comments at the top to describe the arguments. ↩︎
5. By "additional integer registers", I mean integer registers besides those used for passing the function's arguments (rdi, rsi, edx, and ecx). ↩︎
6. The first number is for i < kernel_size, and the second is for i >= kernel_size. ↩︎