Assembly interview questions
Top frequently asked Assembly interview questions
The gcc -S option generates assembly code in AT&T syntax. Is there a way to generate files in Intel syntax instead? Or is there a way to convert between the two?
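For reference, GCC's x86 backend can emit either dialect directly via its -masm option (noted here as an aside, not part of the original question; foo.c is a placeholder file name):
gcc -S -masm=intel foo.c   # write foo.s using Intel syntax
gcc -S -masm=att foo.c     # write foo.s using AT&T syntax (the default on most targets)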
Source: (StackOverflow)
I'm reading a book where the author says that if (a < 901) is faster than if (a <= 900).
Not exactly in this simple example, but there are supposedly slight performance differences in more complex loop code. I suppose this has something to do with the generated machine code, in case it's even true.
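One simple way to check this on your own compiler (my own sketch, not from the original question; the file and function names are made up):

/* cmp_test.c - compile with e.g. "gcc -O2 -S cmp_test.c" and compare the output */
int below_901(int a)   { return a < 901;  }   /* one comparison against a constant */
int at_most_900(int a) { return a <= 900; }   /* one comparison against a constant */

On x86 both forms can compile to a single compare against a constant followed by one conditional set or branch, so inspecting the generated assembly is the most direct way to see whether your compiler treats them differently.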
Source: (StackOverflow)
I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.
The Benchmark
#include <iostream>
#include <chrono>
#include <cstdint>      // uint64_t
#include <cstdlib>      // atol, rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // was free(charbuffer): the buffer comes from new[], so delete[] is the matching release
}
As you can see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the total time. In the upper case, the inner loop variable is unsigned; in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.
The (absolutely crazy) results
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
g++ -O3 -march=native -std=c++11 test.cpp -o test
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB of random data):
- unsigned 41959360000 0.401554 sec 26.113 GB/s
- uint64_t 41959360000 0.759822 sec 13.8003 GB/s
As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):
clang++ -O3 -march=native -std=c++11 test.cpp -o test
Result: test 1
- unsigned 41959360000 0.398293 sec 26.3267 GB/s
- uint64_t 41959360000 0.680954 sec 15.3986 GB/s
So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:
uint64_t size = atol(argv[1]) << 20;
to
uint64_t size = 1 << 20;
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:
- unsigned 41959360000 0.509156 sec 20.5944 GB/s
- uint64_t 41959360000 0.508673 sec 20.6139 GB/s
Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:
- unsigned 41959360000 0.677009 sec 15.4884 GB/s
- uint64_t 41959360000 0.676909 sec 15.4906 GB/s
Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even led to slow code in both cases for Clang!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell-specific. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
More madness, please!
Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:
static uint64_t size = atol(argv[1]) << 20;
Here are my results in g++:
- unsigned 41959360000 0.396728 sec 26.4306 GB/s
- uint64_t 41959360000 0.509484 sec 20.5811 GB/s
Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s up to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.
My question
Can you explain these results? Especially:
- How can there be such a difference between u32 and u64?
- How can replacing a non-constant by a constant buffer size trigger less optimal code?
- How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!
I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, or that small factors like a constant buffer size could again mix up the results totally. Of course, I always want the version that can popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the most performant code?
The Disassembly
Here is the disassembly for the various results:
26 GB/s version from g++ / u32 / non-const bufsize:
0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8
13 GB/s version from g++ / u64 / non-const bufsize:
0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00
15 GB/s version from clang++ / u64 / non-const bufsize:
0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50
20 GB/s version from g++ / u32&u64 / const bufsize:
0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68
15 GB/s version from clang++ / u32&u64 / const bufsize:
0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and tidy. Can anyone explain this?
Lessons learned
No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage type of a buffer size can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.
The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.
Source: (StackOverflow)
How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU?
As far as I understand it, it takes three cycles for an SSE add and five cycles for a mul to complete on most modern Intel CPUs (see, for example, Agner Fog's 'Instruction Tables'). Due to pipelining, one can get a throughput of one add per cycle if the algorithm has at least three independent summations. Since that is true for the packed addpd as well as the scalar addsd version, and SSE registers can contain two doubles, the throughput can be as much as two flops per cycle.
Furthermore, it seems (although I've not seen any proper documentation on this) that adds and muls can be executed in parallel, giving a theoretical max throughput of four flops per cycle.
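To spell out the arithmetic behind that figure (this is just a restatement of the reasoning above, not new information from the original post): one packed addpd per cycle contributes 2 double-precision additions, and if a packed mulpd can indeed issue in parallel it contributes another 2 double-precision multiplications, for a peak of 2 + 2 = 4 flops per cycle.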
However, I've not been able to replicate that performance with a simple C/C++ programme. My best attempt resulted in about 2.7 flops/cycle. If anyone can contribute a simple C/C++ or assembler programme which demonstrates peak performance, that'd be greatly appreciated.
My attempt:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
double stoptime(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops){
    // Need to initialise differently otherwise compiler might optimise away
    double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
    double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
    int loops=ops/10; // We have 10 floating point operations inside the loop
    double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
                    + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);
    for (int i=0; i<loops; i++) {
        mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
        sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
    }
    return sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
    if (argc != 2) {
        printf("usage: %s <num>\n", argv[0]);
        printf("number of operations: <num> millions\n");
        exit(EXIT_FAILURE);
    }
    int n = atoi(argv[1]) * 1000000;
    if (n<=0)
        n=1000;

    double x = M_PI;
    double y = 1.0 + 1e-8;
    double t = stoptime();
    x = addmul(x, y, n);
    t = stoptime() - t;
    printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x);
    return EXIT_SUCCESS;
}
Compiled with
g++ -O2 -march=native addmul.cpp ; ./a.out 1000
produces the following output on an Intel Core i5-750, 2.66 GHz.
addmul: 0.270 s, 3.707 Gflops, res=1.326463
That is, just about 1.4 flops per cycle. Looking at the assembler code with
g++ -S -O2 -march=native -masm=intel addmul.cpp
the main loop seems kind of optimal to me:
.L4:
inc eax
mulsd xmm8, xmm3
mulsd xmm7, xmm3
mulsd xmm6, xmm3
mulsd xmm5, xmm3
mulsd xmm1, xmm3
addsd xmm13, xmm2
addsd xmm12, xmm2
addsd xmm11, xmm2
addsd xmm10, xmm2
addsd xmm9, xmm2
cmp eax, ebx
jne .L4
Replacing the scalar versions with packed versions (addpd and mulpd) would double the flop count without changing the execution time, and so I'd get just short of 2.8 flops per cycle. Is there a simple example which achieves four flops per cycle?
Nice little programme by Mysticial; here are my results (run just for a few seconds though):
- gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
- cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)
It all seems a bit complex, but my conclusions so far:
- gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd's if possible. The same applies to gcc-4.6.2 -O2 -march=core2.
- gcc -O2 -march=nocona seems to keep the order of floating point operations as defined in the C++ source.
- cl /O2, the 64-bit compiler from the SDK for Windows 7, does loop-unrolling automatically and seems to try to arrange operations so that groups of three addpd's alternate with three mulpd's (well, at least on my system and for my simple programme).
- My Core i5 750 (Nehalem architecture) doesn't like alternating adds and muls and seems unable to run both operations in parallel. However, if grouped in 3's, it suddenly works like magic.
- Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.
Although it is difficult to admit, on my system cl /O2 does a much better job of low-level optimisation and achieves close to peak performance for the little C++ example above. I measured between 1.85 and 2.01 flops/cycle (I used clock() on Windows, which is not that precise; I guess I need to use a better timer - thanks Mackie Messer).
The best I managed with gcc was to manually unroll the loop and arrange the additions and multiplications in groups of three. With
g++ -O2 -march=nocona addmul_unroll.cpp
I get at best 0.207s, 4.825 Gflops, which corresponds to 1.8 flops/cycle, which I'm quite happy with now.
In the C++ code I've replaced the for loop with
for (int i=0; i<loops/3; i++) {
    mul1*=mul; mul2*=mul; mul3*=mul;
    sum1+=add; sum2+=add; sum3+=add;
    mul4*=mul; mul5*=mul; mul1*=mul;
    sum4+=add; sum5+=add; sum1+=add;
    mul2*=mul; mul3*=mul; mul4*=mul;
    sum2+=add; sum3+=add; sum4+=add;
    mul5*=mul; mul1*=mul; mul2*=mul;
    sum5+=add; sum1+=add; sum2+=add;
    mul3*=mul; mul4*=mul; mul5*=mul;
    sum3+=add; sum4+=add; sum5+=add;
}
And the assembly now looks like
.L4:
mulsd xmm8, xmm3
mulsd xmm7, xmm3
mulsd xmm6, xmm3
addsd xmm13, xmm2
addsd xmm12, xmm2
addsd xmm11, xmm2
mulsd xmm5, xmm3
mulsd xmm1, xmm3
mulsd xmm8, xmm3
addsd xmm10, xmm2
addsd xmm9, xmm2
addsd xmm13, xmm2
...
Source: (StackOverflow)
I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)
What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
while if I write (a*a*a)*(a*a*a), it will produce
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm13, %xmm13
which reduces the number of multiply instructions to 3. icc has similar behavior.
Why do compilers not recognize this optimization trick?
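For anyone who wants to reproduce the comparison, here is a minimal sketch of my own (the file and function names are made up, not from the original post); compile it with something like gcc -O3 -S pow6.c and count the mulsd instructions emitted for each function:

/* pow6.c - hypothetical test file */
#include <math.h>
double via_pow(double a)     { return pow(a, 6); }        /* may remain a call to the library pow */
double via_naive(double a)   { return a*a*a*a*a*a; }      /* written as five multiplications      */
double via_grouped(double a) { return (a*a*a)*(a*a*a); }  /* written as three multiplications     */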
Source: (StackOverflow)
One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembler.
This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembler besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.
Can anyone provide some specific examples of cases where assembler will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.
Source: (StackOverflow)
I first noticed in 2009 that gcc (at least on my projects and on my machines) has a tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.
I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.
const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}
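For reference, the kind of invocation used to get the timings below (my own sketch, not the original author's exact commands; example.cpp and the argument letters are placeholders - note that main() dereferences argv[1] and argv[2], so two arguments are required):

g++ -Os example.cpp -o example_os && time ./example_os a b
g++ -O2 example.cpp -o example_o2 && time ./example_o2 a b
g++ -O3 example.cpp -o example_o3 && time ./example_o3 a b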
If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
(Update: I have moved all assembly code to GitHub: they made the post bloated and apparently added very little value to the question, as the -fno-align-* flags have the same effect.)
The generated assembly with -Os and -O2.
Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os, except the .p2align lines (result here). This code still runs in 0.38 s, and the only difference is the .p2align stuff.
If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.
Is it the padding that is the culprit in this case? Why and how?
The noise this padding introduces pretty much makes timing micro-optimizations impossible.
How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?
UPDATE:
Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:
-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:
-falign-functions -falign-jumps -falign-loops
-falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays
So, it pretty much seems like a (mis)alignment issue.
I am still skeptical about -march=native
as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)
UPDATE 2:
We can take -Os out of the picture. The following times are obtained by compiling with:
- -O2 -fno-omit-frame-pointer: 0.37 s
- -O2 -fno-align-functions -fno-align-loops: 0.37 s
- -S -O2, then manually moving the assembly of add() after work(): 0.37 s
- -O2: 0.44 s
It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:
For -O2:
602,312,864 stalled-cycles-frontend # 0.00% frontend cycles idle
3,318 cache-misses
0.432703993 seconds time elapsed
[...]
81.23% a.out a.out [.] work(int, int)
18.50% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ return x + y;
100.00 ¦ lea (%rdi,%rsi,1),%eax
¦ }
¦ ? retq
[...]
¦ int z = add(x, y);
1.93 ¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
79.79 ¦ add %eax,%ebx
For -fno-align-*:
604,072,552 stalled-cycles-frontend # 0.00% frontend cycles idle
9,508 cache-misses
0.375681928 seconds time elapsed
[...]
82.58% a.out a.out [.] work(int, int)
16.83% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ return x + y;
51.59 ¦ lea (%rdi,%rsi,1),%eax
¦ }
[...]
¦ __attribute__((noinline))
¦ static int work(int xval, int yval) {
¦ int sum(0);
¦ for (int i=0; i<LOOP_BOUND; ++i) {
¦ int x(xval+sum);
8.20 ¦ lea 0x0(%r13,%rbx,1),%edi
¦ int y(yval+sum);
¦ int z = add(x, y);
35.34 ¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
39.48 ¦ add %eax,%ebx
¦ }
For -fno-omit-frame-pointer:
404,625,639 stalled-cycles-frontend # 0.00% frontend cycles idle
10,514 cache-misses
0.375445137 seconds time elapsed
[...]
75.35% a.out a.out [.] add(int const&, int const&) [clone .isra.0] ¦
24.46% a.out a.out [.] work(int, int)
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
18.67 ¦ push %rbp
¦ return x + y;
18.49 ¦ lea (%rdi,%rsi,1),%eax
¦ const int LOOP_BOUND = 200000000;
¦
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ mov %rsp,%rbp
¦ return x + y;
¦ }
12.71 ¦ pop %rbp
¦ ? retq
[...]
¦ int z = add(x, y);
¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
29.83 ¦ add %eax,%ebx
It looks like we are stalling on the call to add() in the slow case.
I have examined everything that perf -e can spit out on my machine, not just the stats that are given above.
For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)
I included the cache misses as they came up in the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very, very noisy and show little to no correlation with the execution times.
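For reference, a perf invocation of the kind that produces numbers like those above (my own sketch, not the exact commands from the original post; the binary name and arguments are placeholders, and the event names are the standard perf aliases that appear in the output above):

perf stat -e stalled-cycles-frontend,cache-misses -r 10 ./a.out a b
perf record ./a.out a b && perf report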
Source: (StackOverflow)
I am working on some Java code which needs to be highly optimized, as it will run in hot functions that are invoked at many points in my main program logic. Part of this code involves multiplying double variables by 10 raised to arbitrary non-negative int exponents. One fast way (edit: but not the fastest possible, see Update 2 below) to get the multiplied value is to switch on the exponent:
double multiplyByPowerOfTen(final double d, final int exponent) {
    switch (exponent) {
        case 0:
            return d;
        case 1:
            return d*10;
        case 2:
            return d*100;
        // ... same pattern
        case 9:
            return d*1000000000;
        case 10:
            return d*10000000000L;
        // ... same pattern with long literals
        case 18:
            return d*1000000000000000000L;
        default:
            throw new ParseException("Unhandled power of ten " + exponent, 0);
    }
}
The commented ellipses above indicate that the case int constants continue incrementing by 1, so there are really 19 cases in the above code snippet. Since I wasn't sure whether I would actually need all the powers of 10 in case statements 10 thru 18, I ran some microbenchmarks comparing the time to complete 10 million operations with this switch statement versus a switch with only cases 0 thru 9 (with the exponent limited to 9 or less to avoid breaking the pared-down switch). I got the rather surprising (to me, at least!) result that the longer switch with more case statements actually ran faster.
On a lark, I tried adding even more cases which just returned dummy values, and found that I could get the switch to run even faster with around 22-27 declared cases (even though those dummy cases are never actually hit while the code is running). (Again, cases were added in a contiguous fashion by incrementing the prior case constant by 1.) These execution time differences are not very significant: for a random exponent between 0 and 10, the dummy padded switch statement finishes 10 million executions in 1.49 secs versus 1.54 secs for the unpadded version, for a grand total savings of 5ns per execution. So, not the kind of thing that makes obsessing over padding out a switch statement worth the effort from an optimization standpoint. But I still just find it curious and counter-intuitive that a switch doesn't become slower (or perhaps at best maintain constant O(1) time) to execute as more cases are added to it.
These are the results I obtained from running with various limits on the randomly-generated exponent values. I didn't include the results all the way down to 1 for the exponent limit, but the general shape of the curve remains the same, with a ridge around the 12-17 case mark and a valley between 18-28. All tests were run in JUnitBenchmarks using shared containers for the random values to ensure identical testing inputs. I also ran the tests both in order from the longest switch statement to the shortest, and vice-versa, to try to eliminate the possibility of ordering-related test problems. I've put my testing code up on a github repo if anyone wants to try to reproduce these results.
So, what's going on here? Some vagaries of my architecture or micro-benchmark construction? Or is the Java switch really a little faster to execute in the 18 to 28 case range than it is from 11 up to 17?
github test repo "switch-experiment"
UPDATE: I cleaned up the benchmarking library quite a bit and added a text file in /results with some output across a wider range of possible exponent values. I also added an option in the testing code not to throw an Exception from default, but this doesn't appear to affect the results.
UPDATE 2: Found some pretty good discussion of this issue from back in 2009 on the xkcd forum here: http://forums.xkcd.com/viewtopic.php?f=11&t=33524. The OP's discussion of using Array.binarySearch() gave me the idea for a simple array-based implementation of the exponentiation pattern above. There's no need for the binary search since I know what the entries in the array are. It appears to run about 3 times faster than using switch, obviously at the expense of some of the control flow that switch affords. That code has been added to the github repo also.
Source: (StackOverflow)
Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the unnecessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20 == 1048576 times. (The surrounding program just calculates SHA-256 hashes.)
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
- Why would uselessly copying a register's contents to RAM ever increase performance?
- Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
- Is this behavior something that could be exploited predictably by a compiler?
Source: (StackOverflow)
How does one do this?
If I want to analyze how something is getting compiled, how would I get the emitted assembly code?
Source: (StackOverflow)
Is a switch statement actually faster than an if statement?
I ran the code below on Visual Studio 2010's x64 C++ compiler with the /Ox flag:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...\n");
    printf("Switch statement: %u ms\n", (unsigned)testSwitch());  /* cast added: %u expects unsigned, not size_t */
    printf("If statement: %u ms\n", (unsigned)testIf());
}
and got these results:
Switch statement: 5261 ms
If statement: 5196 ms
From what I've learned, switch statements apparently use jump tables to optimize the branching.
Questions:
- What would a basic jump table look like, in x86 or x64? (A rough analogue is sketched right after this list.)
- Is this code using a jump table?
- Why is there no performance difference in this example? Is there any situation in which there is a significant performance difference?
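For the first question, here is a rough software analogue of a jump table, written as an explicit C array of code addresses (my own illustration, not something produced for this program; a compiler-generated jump table is the same idea, except the entries are branch targets inside one function and the dispatch is a single indexed indirect jmp):

/* dispatch.c - hypothetical sketch of table-based dispatch */
#include <stddef.h>

static size_t counter = 0;

static void do_case1(void) { counter += 4; }
static void do_case2(void) { counter += 3; }
static void do_case3(void) { counter += 2; }
static void do_case4(void) { counter += 1; }

/* The "table": code addresses indexed by the switch value, no compare chain. */
static void (*const jump_table[4])(void) = { do_case1, do_case2, do_case3, do_case4 };

void step(void)
{
    jump_table[counter % 4]();
}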
Disassembly of the code:
testIf:
13FE81B10 sub rsp,48h
13FE81B14 call qword ptr [__imp_clock (13FE81128h)]
13FE81B1A mov dword ptr [start],eax
13FE81B1E mov qword ptr [i],0
13FE81B27 jmp testIf+26h (13FE81B36h)
13FE81B29 mov rax,qword ptr [i]
13FE81B2E inc rax
13FE81B31 mov qword ptr [i],rax
13FE81B36 cmp qword ptr [i],20000000h
13FE81B3F jae testIf+0C3h (13FE81BD3h)
13FE81B45 xor edx,edx
13FE81B47 mov rax,qword ptr [counter (13FE835D0h)]
13FE81B4E mov ecx,4
13FE81B53 div rax,rcx
13FE81B56 mov rax,rdx
13FE81B59 inc rax
13FE81B5C mov qword ptr [c],rax
13FE81B61 cmp qword ptr [c],1
13FE81B67 jne testIf+6Dh (13FE81B7Dh)
13FE81B69 mov rax,qword ptr [counter (13FE835D0h)]
13FE81B70 add rax,4
13FE81B74 mov qword ptr [counter (13FE835D0h)],rax
13FE81B7B jmp testIf+0BEh (13FE81BCEh)
13FE81B7D cmp qword ptr [c],2
13FE81B83 jne testIf+89h (13FE81B99h)
13FE81B85 mov rax,qword ptr [counter (13FE835D0h)]
13FE81B8C add rax,3
13FE81B90 mov qword ptr [counter (13FE835D0h)],rax
13FE81B97 jmp testIf+0BEh (13FE81BCEh)
13FE81B99 cmp qword ptr [c],3
13FE81B9F jne testIf+0A5h (13FE81BB5h)
13FE81BA1 mov rax,qword ptr [counter (13FE835D0h)]
13FE81BA8 add rax,2
13FE81BAC mov qword ptr [counter (13FE835D0h)],rax
13FE81BB3 jmp testIf+0BEh (13FE81BCEh)
13FE81BB5 cmp qword ptr [c],4
13FE81BBB jne testIf+0BEh (13FE81BCEh)
13FE81BBD mov rax,qword ptr [counter (13FE835D0h)]
13FE81BC4 inc rax
13FE81BC7 mov qword ptr [counter (13FE835D0h)],rax
13FE81BCE jmp testIf+19h (13FE81B29h)
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)]
13FE81BD9 sub eax,dword ptr [start]
13FE81BDD imul eax,eax,3E8h
13FE81BE3 cdq
13FE81BE4 mov ecx,3E8h
13FE81BE9 idiv eax,ecx
13FE81BEB cdqe
13FE81BED add rsp,48h
13FE81BF1 ret
testSwitch:
13FE81C00 sub rsp,48h
13FE81C04 call qword ptr [__imp_clock (13FE81128h)]
13FE81C0A mov dword ptr [start],eax
13FE81C0E mov qword ptr [i],0
13FE81C17 jmp testSwitch+26h (13FE81C26h)
13FE81C19 mov rax,qword ptr [i]
13FE81C1E inc rax
13FE81C21 mov qword ptr [i],rax
13FE81C26 cmp qword ptr [i],20000000h
13FE81C2F jae testSwitch+0C5h (13FE81CC5h)
13FE81C35 xor edx,edx
13FE81C37 mov rax,qword ptr [counter (13FE835D0h)]
13FE81C3E mov ecx,4
13FE81C43 div rax,rcx
13FE81C46 mov rax,rdx
13FE81C49 inc rax
13FE81C4C mov qword ptr [rsp+30h],rax
13FE81C51 cmp qword ptr [rsp+30h],1
13FE81C57 je testSwitch+73h (13FE81C73h)
13FE81C59 cmp qword ptr [rsp+30h],2
13FE81C5F je testSwitch+87h (13FE81C87h)
13FE81C61 cmp qword ptr [rsp+30h],3
13FE81C67 je testSwitch+9Bh (13FE81C9Bh)
13FE81C69 cmp qword ptr [rsp+30h],4
13FE81C6F je testSwitch+0AFh (13FE81CAFh)
13FE81C71 jmp testSwitch+0C0h (13FE81CC0h)
13FE81C73 mov rax,qword ptr [counter (13FE835D0h)]
13FE81C7A add rax,4
13FE81C7E mov qword ptr [counter (13FE835D0h)],rax
13FE81C85 jmp testSwitch+0C0h (13FE81CC0h)
13FE81C87 mov rax,qword ptr [counter (13FE835D0h)]
13FE81C8E add rax,3
13FE81C92 mov qword ptr [counter (13FE835D0h)],rax
13FE81C99 jmp testSwitch+0C0h (13FE81CC0h)
13FE81C9B mov rax,qword ptr [counter (13FE835D0h)]
13FE81CA2 add rax,2
13FE81CA6 mov qword ptr [counter (13FE835D0h)],rax
13FE81CAD jmp testSwitch+0C0h (13FE81CC0h)
13FE81CAF mov rax,qword ptr [counter (13FE835D0h)]
13FE81CB6 inc rax
13FE81CB9 mov qword ptr [counter (13FE835D0h)],rax
13FE81CC0 jmp testSwitch+19h (13FE81C19h)
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)]
13FE81CCB sub eax,dword ptr [start]
13FE81CCF imul eax,eax,3E8h
13FE81CD5 cdq
13FE81CD6 mov ecx,3E8h
13FE81CDB idiv eax,ecx
13FE81CDD cdqe
13FE81CDF add rsp,48h
13FE81CE3 ret
Update:
Interesting results here and here. Not sure why one is faster and one is slower, though.
Source: (StackOverflow)
I was wondering how to use GCC on my C source file to dump a mnemonic version of the machine code so I could see what my code was being compiled into. You can do this with Java but I haven't been able to find a way with GCC.
I am trying to re-write a C method in assembly and seeing how GCC does it would be a big help.
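For reference, two standard ways to do this with GCC and binutils (not from the original question; file names are placeholders):

gcc -S file.c                              # write file.s, the assembly GCC generates
gcc -g -c file.c && objdump -dS file.o     # disassemble the object file, interleaving the C source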
Source: (StackOverflow)
A friend of mine downloaded some malware from Facebook, and I'm curious to see what it does without infecting myself. I know that you can't really decompile an .exe, but can I at least view it in Assembly or attach a debugger?
Edit: it is not a .NET executable; there is no CLI header.
Source: (StackOverflow)
What is the difference between object code, machine code and assembly code?
Can you give a visual example of their difference?
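As a small concrete illustration (my own example, not from the original question; x86-64):

/* C source */
int answer(void) { return 42; }

/* Assembly code: human-readable mnemonics for the machine instructions */
answer:
    movl    $42, %eax
    ret

/* Machine code: the raw bytes the CPU executes, shown in hex */
b8 2a 00 00 00        movl $42, %eax
c3                    ret

Object code is essentially that machine code packaged in an object file (answer.o, say) along with bookkeeping the linker needs, such as symbol and relocation information.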
Source: (StackOverflow)