x86 interview questions
Top x86 frequently asked interview questions
The gcc -S option will generate assembly code in AT&T syntax. Is there a way to generate files in Intel syntax? Or is there a way to convert between the two?
Source: (StackOverflow)
I have been hearing much about the ARM and x86 Architectures. Is the x86 Architecture specially designed to work with a keyboard while ARM expects to be mobile? What are the key differences between the two?
Source: (StackOverflow)
I've got an arbitrary list of .NET assemblies.
I need to programmatically check if each DLL was built for x86 (as opposed to x64 or Any CPU). Is this possible?
Source: (StackOverflow)
I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots.
It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time.
I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler."
Here are some concrete ideas I'm considering:
- Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
- Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
- I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
- How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
- Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
- One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving function calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
- On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
Lastly, to nip certain kinds of answers in the bud:
- I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs.
- I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.
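The strlen() case from the list above can be illustrated with a small sketch. The function names are made up for illustration, and whether a given compiler hoists the call on its own depends on what it can prove about the string; here the loop writes to the buffer, which makes automatic hoisting harder.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Calls strlen() in the loop condition, so the length may be recomputed
   on every iteration (the loop body writes to s, so the compiler cannot
   easily prove the length is unchanged). */
void upcase_naive(char *s)
{
    for (size_t i = 0; i < strlen(s); ++i)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* Same result, with the length computed once before the loop. */
void upcase_hoisted(char *s)
{
    const size_t n = strlen(s);
    for (size_t i = 0; i < n; ++i)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

Both functions produce the same output; only the number of strlen() calls differs, which is the kind of difference a profiler would confirm or rule out.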
Source: (StackOverflow)
Starting with the Pentium Pro (P6 microarchitecture), Intel redesigned its microprocessors and used an internal RISC core under the old CISC instructions. Since the Pentium Pro, all CISC instructions are divided into smaller parts (uops) and then executed by the RISC core.
At the beginning it was clear to me that Intel decided to hide the new internal architecture and force programmers to use the "CISC shell". Thanks to this decision Intel could fully redesign the microprocessor architecture without breaking compatibility, which is reasonable.
However, I don't understand one thing: why does Intel still keep the internal RISC instruction set hidden after so many years? Why wouldn't they let programmers use RISC instructions like they use the old x86 CISC instruction set?
If Intel keeps backward compatibility for so long (we still have virtual 8086 mode next to 64-bit mode), why don't they allow us to compile programs so that they bypass the CISC instructions and use the RISC core directly? This would open a natural way to slowly abandon the x86 instruction set, which is deprecated nowadays (this is the main reason why Intel decided to use a RISC core inside, right?).
Looking at the new Intel 'Core i' series, I see that they only extend the CISC instruction set by adding AVX, SSE4 and others.
Source: (StackOverflow)
I'm interested in writing an x86 disassembler as an educational project.
The only real resource I have found is Spiral Space's, "How to write a disassembler". While this gives a nice high level description of the various components of a disassembler, I'm interested in some more detailed resources. I've also taken a quick look at NASM's source code but this is somewhat of a heavyweight to learn from.
I realize one of the major challenges of this project is the rather large x86 instruction set I'm going to have to handle. I'm also interested in basic structure, basic disassembler links, etc.
Can anyone point me to any detailed resources on writing an x86 disassembler?
Source: (StackOverflow)
The compare function is a function that takes two arguments a and b and returns an integer describing their order. If a is smaller than b, the result is some negative integer. If a is bigger than b, the result is some positive integer. Otherwise, a and b are equal, and the result is zero.
This function is often used to parametrize sorting and searching algorithms from standard libraries.
Implementing the compare function for characters is quite easy; you simply subtract the arguments:
    int compare_char(char a, char b)
    {
        return a - b;
    }
This works because the difference between two characters is generally assumed to fit into an integer. (Note that this assumption does not hold for systems where sizeof(char) == sizeof(int).)
This trick cannot work to compare integers, because the difference between two integers generally does not fit into an integer. For example, INT_MAX - (-1) = INT_MIN suggests that INT_MAX is smaller than -1 (technically, the overflow leads to undefined behavior, but let's assume modulo arithmetic).
So how can we implement the compare function efficiently for integers? Here is my first attempt:
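    int compare_int(int a, int b)
    {
        int temp;
        int result;
        __asm__ __volatile__ (
            "cmp %3, %2 \n\t"      /* set flags from a - b */
            "mov $0, %1 \n\t"      /* result = 0 (mov leaves flags intact) */
            "mov $1, %0 \n\t"      /* temp = 1 */
            "cmovg %0, %1 \n\t"    /* if a > b: result = 1 */
            "mov $-1, %0 \n\t"     /* temp = -1 */
            "cmovl %0, %1 \n\t"    /* if a < b: result = -1 */
            : "=r"(temp), "=r"(result)
            : "r"(a), "r"(b)
            : "cc");
        return result;
    }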
Can it be done in fewer than 6 instructions? Is there a less straightforward way that is more efficient?
Source: (StackOverflow)
I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What's the fastest way to calculate A % B - that is the (64-bit) remainder from dividing A by B?
I'm looking to do this in either C or assembly language, but I need to target the 32-bit x86 platform. This unfortunately means that I cannot take advantage of compiler support for 128-bit integers, nor of the x64 architecture's ability to perform the required operation in a single instruction.
Edit:
Thank you for the answers so far. However, it appears to me that the suggested algorithms would be quite slow - wouldn't the fastest way to perform a 128-bit by 64-bit division be to leverage the processor's native support for 64-bit by 32-bit division? Does anyone know if there is a way to perform the larger division in terms of a few smaller divisions?
Re: How often does B change?
Primarily I'm interested in a general solution - what calculation would you perform if A and B are likely to be different every time?
However, a second possible situation is that B does not vary as often as A - there may be as many as 200 As to divide by each B. How would your answer differ in this case?
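As a correctness baseline for the problem above, here is a shift-and-subtract sketch in portable C. It assumes A is passed as two 64-bit halves (hi and lo, names chosen for illustration) and processes one bit per step, so it is likely much slower than a solution built on the processor's native divide; it is mainly useful for checking a faster implementation.

```c
#include <stdint.h>

/* Returns ((hi << 64) | lo) % b using binary long division.
   b must be nonzero. */
uint64_t mod128_by_64(uint64_t hi, uint64_t lo, uint64_t b)
{
    uint64_t rem = 0;
    for (int i = 127; i >= 0; --i) {
        /* Next bit of the 128-bit dividend, most significant first. */
        uint64_t bit = (i >= 64) ? (hi >> (i - 64)) & 1u : (lo >> i) & 1u;
        /* Bit that the shift below pushes out of the 64-bit remainder. */
        uint64_t carry = rem >> 63;
        rem = (rem << 1) | bit;
        /* rem was < b before the shift, so at most one subtraction is
           needed; unsigned wraparound gives the right value when carry
           is set. */
        if (carry || rem >= b)
            rem -= b;
    }
    return rem;
}
```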
Source: (StackOverflow)
I write empty programs to annoy the hell out of stackoverflow coders, NOT. I am just exploring the gnu toolchain.
Now the following might be too deep for me, but to continue the empty program saga I have started to examine the output of the C compiler, the stuff GNU as consumes.
gcc version 4.4.0 (TDM-1 mingw32)
test.c:
    int main()
    {
        return 0;
    }
gcc -S test.c
        .file "test.c"
        .def ___main; .scl 2; .type 32; .endef
        .text
    .globl _main
        .def _main; .scl 2; .type 32; .endef
    _main:
        pushl %ebp
        movl %esp, %ebp
        andl $-16, %esp
        call ___main
        movl $0, %eax
        leave
        ret
Can you explain what happens here? Here is my effort to understand it. I have used the as manual and my minimal x86 ASM knowledge:
- .file "test.c" is the directive for the logical filename.
- .def: according to the docs "Begin defining debugging information for a symbol name". What is a symbol (a function name/variable?) and what kind of debugging information?
- .scl: docs say "Storage class may flag whether a symbol is static or external". Is this the same static and external I know from C? And what is that '2'?
- .type: stores the parameter "as the type attribute of a symbol table entry", I have no clue.
- .endef: no problem.
- .text: Now this is problematic, it seems to be something called section and I have read that it's the place for code, but the docs didn't tell me too much.
- .globl "makes the symbol visible to ld.", the manual is quite clear on this.
- _main: This might be the starting address (?) for my main function
- pushl: A long (32bit) push, which places EBP on the stack
- movl: 32-bit move. Pseudo-C: EBP = ESP;
- andl: Logical AND. Pseudo-C: ESP = -16 & ESP, I don't really see what the point of this is.
- call: Pushes the IP to the stack (so the called procedure can find its way back) and continues where __main is. (what is __main?)
- movl: this zero must be the constant I return at the end of my code. The MOV places this zero into EAX.
- leave: restores stack after an ENTER instruction (?). Why?
- ret: goes back to the instruction address that is saved on the stack
Thank you for your help!
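The arithmetic behind the andl $-16, %esp line in the listing above can be reproduced in C. This is a small sketch with a made-up function name: ANDing with -16 clears the low four bits, which rounds the value down to a multiple of 16.

```c
#include <stdint.h>

/* Rounds addr down to the nearest multiple of 16.
   ~15 has the same bit pattern as -16 in two's complement,
   so this mirrors "andl $-16, %esp". */
uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}
```

For example, align_down_16(0x1237) gives 0x1230, and a value that is already a multiple of 16 is returned unchanged.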
Source: (StackOverflow)
- What does rep; nop mean?
- Is it the same as the pause instruction?
- Is it the same as rep nop (without the semi-colon)?
- What's the difference to the simple nop instruction?
- Does it behave differently on AMD and Intel processors?
- (bonus) Where is the official documentation for these instructions?
Motivation for this question
After some discussion in the comments of another question, I realized that I don't know what rep; nop; means in x86 (or x86-64) assembly. And also I couldn't find a good explanation on the web.
I know that rep is a prefix that means "repeat the next instruction cx times" (or at least it was, in old 16-bit x86 assembly). According to this summary table at Wikipedia, it seems rep can only be used with movs, stos, cmps, lods, scas (but maybe this limitation was removed on newer processors). Thus, I would think rep nop (without semi-colon) would repeat a nop operation cx times.
However, after further searching, I got even more confused. It seems that rep; nop and pause map to the exactly same opcode, and pause has a bit different behavior than just nop. Some old mail from 2005 said different things:
- "try not to burn too much power"
- "it is equivalent to 'nop' just with 2 byte encoding."
- "it is magic on intel. Its like 'nop but let the other HT sibling run'"
- "it is pause on intel and fast padding on Athlon"
With these different opinions, I couldn't understand the correct meaning.
It's being used in Linux kernel (on both i386 and x86_64), together with this comment: /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
It is also being used in BeRTOS, with the same comment.
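The busy-wait usage described in the kernel comment above can be sketched as follows. This assumes GCC or Clang inline assembly; the helper names are made up for illustration and are not taken from the Linux or BeRTOS sources. On non-x86 targets the hint falls back to a compiler barrier only.

```c
#include <stddef.h>

/* One spin-loop hint. On x86 this emits "rep; nop", the sequence the
   question says shares its opcode with pause. */
static inline void spin_hint(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */
#endif
}

/* Spins until *flag becomes nonzero or max_spins iterations have passed.
   Returns the number of iterations spent waiting. */
size_t spin_until_set(const volatile int *flag, size_t max_spins)
{
    size_t spins = 0;
    while (*flag == 0 && spins < max_spins) {
        spin_hint();
        ++spins;
    }
    return spins;
}
```

In real use the flag would be set by another thread or an interrupt handler; the max_spins bound is only there so the sketch cannot hang.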
Source: (StackOverflow)
Just to make it clear, I'm not going for any sort of portability here, so any solutions that will tie me to a certain box are fine.
Basically, I have an if statement that will evaluate to true 99% of the time, and I am trying to eke out every last clock of performance. Can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?
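One GCC facility that fits the situation above is __builtin_expect, which tells the compiler which outcome of a condition is expected so it can lay out the code accordingly; it does not program the hardware branch predictor directly. A minimal sketch follows, in which the macro and function names are chosen for illustration.

```c
#include <stddef.h>

/* Hint macros built on GCC's __builtin_expect. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sums the non-negative values and counts the negative ones,
   treating negatives as the rare case. */
long sum_mostly_nonnegative(const int *values, size_t count, size_t *rare_hits)
{
    long total = 0;
    *rare_hits = 0;
    for (size_t i = 0; i < count; ++i) {
        if (LIKELY(values[i] >= 0))
            total += values[i];
        else
            ++*rare_hits;
    }
    return total;
}
```

Whether the hint helps is something to measure; the program's result is the same with or without it.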
Source: (StackOverflow)
I'm outside gdb's target executable and I don't even have a stack that corresponds to that target. I want to single-step anyway, so that I can verify what's going on in my assembly code, because I'm not an expert at x86 assembly. Unfortunately, gdb refuses to do this simple assembly-level debugging. It allows me to set and stop on an appropriate breakpoint, but as soon as I try to single-step onwards, gdb reports the error "Cannot find bounds of current function" and the EIP doesn't change.
Additional details:
The machine code was generated by gcc asm statements and I copied it to the kernel memory location where it's executing, from the output of objdump -d. I wouldn't mind a simple way to use a loader to load my object code to a relocated address, but bear in mind the loading has to be done in a kernel module.
I suppose another alternative would be to produce a fake kernel module or debug info file to give to gdb, to cause it to believe this area is within the program code. gdb works fine on the kernel executable itself.
(For those who really want to know, I'm inserting code at runtime into Linux kernel data space inside a VMware VM and debugging it from gdb remote debugging the kernel via VMware Workstation's built-in gdb stub. Note I'm not writing kernel exploits; I'm a security graduate student writing a prototype.)
(I can set a breakpoint on each instruction inside my assembly. This works but would get quite laborious after a while, since the size of x86 assembly instructions varies and the location of the assembly will change every time I reboot.)
Source: (StackOverflow)
I am trying to install a Windows service using InstallUtil.exe and am getting the error message
System.BadImageFormatException: Could not load file or assembly '{xxx.exe}' or one of its dependencies. An attempt was made to load a program with an incorrect format.
What gives?
EDIT: (Not by OP) Full message extracted from dup getting way more hits [for googleability]:
C:\Windows\Microsoft.NET\Framework64\v4.0.30319>InstallUtil.exe C:\xxx.exe
Microsoft (R) .NET Framework Installation utility Version 4.0.30319.1
Copyright (c) Microsoft Corporation. All rights reserved.
Exception occurred while initializing the installation:
System.BadImageFormatException: Could not load file or assembly 'file:///C:\xxx.exe' or one of its dependencies. An attempt was made to load a program with an incorrect format..
Source: (StackOverflow)
I'd like to disassemble the MBR (first 512 bytes) of a bootable x86 disk that I have. I have copied the MBR to a file using
dd if=/dev/my-device of=mbr bs=512 count=1
Any suggestions for a Linux utility that can disassemble the file mbr?
Source: (StackOverflow)
I am doing some performance critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.
Now, I remember reading about how floating point calculations were so slow around the 386 days, when I believe (IIRC) there was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculation? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?
I know the correct answer is to benchmark on the target hardware. What would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?
The programs I used are as follows; they are not identical by any means:
Program 1:
    #include <iostream>
    #include <cmath>
    #include <cstdlib>
    #include <time.h>
    int main( int argc, char** argv )
    {
        int accum = 0;
        srand( time( NULL ) );
        for( unsigned int i = 0; i < 100000000; ++i )
        {
            accum += rand( ) % 365;
        }
        std::cout << accum << std::endl;
        return 0;
    }
Program 2:
    #include <iostream>
    #include <cmath>
    #include <cstdlib>
    #include <time.h>
    int main( int argc, char** argv )
    {
        float accum = 0;
        srand( time( NULL ) );
        for( unsigned int i = 0; i < 100000000; ++i )
        {
            accum += (float)( rand( ) % 365 );
        }
        std::cout << accum << std::endl;
        return 0;
    }
Thanks in advance!
Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.
Edit 2(pasted from a comment below): We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.
Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).
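One way to reduce the run-to-run variance described above is to time each workload several times inside the process and keep the smallest CPU time. The sketch below uses only standard C clock(). The workloads, their names, and the small generator that stands in for rand() are assumptions made for illustration, and like the two programs above they still share an integer modulo, so they are not a clean integer-versus-float comparison.

```c
#include <time.h>

/* Sinks keep the compiler from discarding the loops. */
static volatile unsigned int int_sink;
static volatile float float_sink;

/* Hypothetical integer workload. */
void int_work(void)
{
    unsigned int acc = 0, x = 12345u;
    for (unsigned int i = 0; i < 1000000u; ++i) {
        x = x * 1664525u + 1013904223u;   /* cheap generator instead of rand() */
        acc += (x >> 16) % 365u;
    }
    int_sink = acc;
}

/* Hypothetical float workload with the same loop shape. */
void float_work(void)
{
    float acc = 0.0f;
    unsigned int x = 12345u;
    for (unsigned int i = 0; i < 1000000u; ++i) {
        x = x * 1664525u + 1013904223u;
        acc += (float)((x >> 16) % 365u);
    }
    float_sink = acc;
}

/* Runs work() `runs` times and returns the smallest CPU time in seconds,
   or -1.0 if runs is not positive. */
double best_cpu_seconds(void (*work)(void), int runs)
{
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        clock_t start = clock();
        work();
        clock_t stop = clock();
        double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || seconds < best)
            best = seconds;
    }
    return best;
}
```

Calling best_cpu_seconds(int_work, 10) and best_cpu_seconds(float_work, 10) and comparing the two numbers gives a rough relative figure; taking the minimum rather than the mean discards runs that were disturbed by other activity on the machine.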
Source: (StackOverflow)