gcc interview questions
Top gcc frequently asked interview questions
I was wondering how to use GCC on my C source file to dump a mnemonic version of the machine code so I could see what my code was being compiled into. You can do this with Java but I haven't been able to find a way with GCC.
I am trying to re-write a C method in assembly and seeing how GCC does it would be a big help.
Source: (StackOverflow)
Why does the C preprocessor in GCC interpret the word linux (small letters) as the constant 1?
test.c:
#include <stdio.h>
int main(void)
{
    int linux = 5;
    return 0;
}
Result of $ gcc -E test.c (stop after the preprocessing stage):
....
int main(void)
{
    int 1 = 5;
    return 0;
}
Which, of course, yields an error.
(BTW: There is no #define linux in the stdio.h file.)
Source: (StackOverflow)
I've noticed that the Linux kernel code uses bool, but I thought that bool was a C++ type. Is bool a standard C extension (e.g., ISO C90) or a GCC extension?
Source: (StackOverflow)
In a gcc-compiled project, how do I specify debug vs. release C/C++ flags using CMake, how do I run cmake for each target type, and how do I express that the main app will be compiled with g++ and one nested library with gcc?
Source: (StackOverflow)
How does one do this?
If I want to analyze how something is getting compiled, how would I get the emitted assembly code?
Source: (StackOverflow)
I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down performance. (In contrast, the Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)
What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
while if I write (a*a*a)*(a*a*a), it will produce
movapd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm14, %xmm13
mulsd %xmm13, %xmm13
which reduces the number of multiply instructions to 3. icc has similar behavior.
Why do compilers not recognize this optimization trick?
Source: (StackOverflow)
I have installed Mountain Lion (Mac OS X 10.8) and now gcc doesn't seem to be available anymore. I've also installed Xcode 4.4, so there is no more /Developer directory.
I need gcc both for MacPorts and for Ruby gems (that have native extensions).
Does Xcode 4.4 include gcc or is there a way to install gcc?
Source: (StackOverflow)
I first noticed in 2009 that gcc (at least on my projects and on my machines) has a tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.
I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.
const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}
If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
(Update: I have moved all assembly code to GitHub: it made the post bloated and apparently added very little value to the question, as the fno-align-* flags have the same effect.)
The generated assembly with -Os and -O2.
Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines, result here. This code still runs in 0.38s and the only difference is the .p2align stuff.
If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.
Is it the padding that is the culprit in this case? Why and how?
The noise it makes pretty much makes timing micro-optimizations impossible.
How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?
UPDATE:
Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:
-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:
-falign-functions -falign-jumps -falign-loops
-falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays
So, it pretty much seems like a (mis)alignment issue.
I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)
UPDATE 2:
We can take -Os out of the picture. The following times are obtained by compiling with
-O2 -fno-omit-frame-pointer: 0.37s
-O2 -fno-align-functions -fno-align-loops: 0.37s
-S -O2, then manually moving the assembly of add() after work(): 0.37s
-O2: 0.44s
It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:
-O2:
602,312,864 stalled-cycles-frontend # 0.00% frontend cycles idle
3,318 cache-misses
0.432703993 seconds time elapsed
[...]
81.23% a.out a.out [.] work(int, int)
18.50% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ return x + y;
100.00 ¦ lea (%rdi,%rsi,1),%eax
¦ }
¦ ? retq
[...]
¦ int z = add(x, y);
1.93 ¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
79.79 ¦ add %eax,%ebx
For -fno-align-*:
604,072,552 stalled-cycles-frontend # 0.00% frontend cycles idle
9,508 cache-misses
0.375681928 seconds time elapsed
[...]
82.58% a.out a.out [.] work(int, int)
16.83% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ return x + y;
51.59 ¦ lea (%rdi,%rsi,1),%eax
¦ }
[...]
¦ __attribute__((noinline))
¦ static int work(int xval, int yval) {
¦ int sum(0);
¦ for (int i=0; i<LOOP_BOUND; ++i) {
¦ int x(xval+sum);
8.20 ¦ lea 0x0(%r13,%rbx,1),%edi
¦ int y(yval+sum);
¦ int z = add(x, y);
35.34 ¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
39.48 ¦ add %eax,%ebx
¦ }
For -fno-omit-frame-pointer:
404,625,639 stalled-cycles-frontend # 0.00% frontend cycles idle
10,514 cache-misses
0.375445137 seconds time elapsed
[...]
75.35% a.out a.out [.] add(int const&, int const&) [clone .isra.0]
24.46% a.out a.out [.] work(int, int)
[...]
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
18.67 ¦ push %rbp
¦ return x + y;
18.49 ¦ lea (%rdi,%rsi,1),%eax
¦ const int LOOP_BOUND = 200000000;
¦
¦ __attribute__((noinline))
¦ static int add(const int& x, const int& y) {
¦ mov %rsp,%rbp
¦ return x + y;
¦ }
12.71 ¦ pop %rbp
¦ ? retq
[...]
¦ int z = add(x, y);
¦ ? callq add(int const&, int const&) [clone .isra.0]
¦ sum += z;
29.83 ¦ add %eax,%ebx
It looks like we are stalling on the call to add() in the slow case.
I have examined everything that perf -e can spit out on my machine; not just the stats that are given above.
For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)
I included the cache misses as it came up as the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very, very noisy and show little to no correlation with the execution times.
Source: (StackOverflow)
How do I list the symbols being exported from a .so file? If possible, I'd also like to know their source (e.g. if they are pulled in from a static library).
I'm using gcc 4.0.2, if that makes a difference.
Source: (StackOverflow)
What would be the best way to write Objective-C on the Windows platform?
Cygwin and gcc? Is there a way I can somehow integrate this into Visual Studio?
Along those lines, are there any suggestions as to how to link in and use the Windows SDK for something like this? It's a different beast, but I know I can write assembly and link in the Windows DLLs, giving me access to those calls. I just don't know how to do this without googling and getting piecemeal directions.
Is anyone aware of a good online or book resource to do or explain these kinds of things?
Source: (StackOverflow)
I have found an interesting performance regression in a small C++ snippet, when I enable C++11:
#include <vector>

struct Item
{
    int a;
    int b;
};

int main()
{
    const std::size_t num_items = 10000000;
    std::vector<Item> container;
    container.reserve(num_items);
    for (std::size_t i = 0; i < num_items; ++i) {
        container.push_back(Item());
    }
    return 0;
}
With g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:
milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
35.206824 task-clock # 0.988 CPUs utilized ( +- 1.23% )
4 context-switches # 0.116 K/sec ( +- 4.38% )
0 cpu-migrations # 0.006 K/sec ( +- 66.67% )
849 page-faults # 0.024 M/sec ( +- 6.02% )
95,693,808 cycles # 2.718 GHz ( +- 1.14% ) [49.72%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
95,282,359 instructions # 1.00 insns per cycle ( +- 0.65% ) [75.27%]
30,104,021 branches # 855.062 M/sec ( +- 0.87% ) [77.46%]
6,038 branch-misses # 0.02% of all branches ( +- 25.73% ) [75.53%]
0.035648729 seconds time elapsed ( +- 1.22% )
With C++11 enabled on the other hand, the performance degrades significantly:
milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
86.485313 task-clock # 0.994 CPUs utilized ( +- 0.50% )
9 context-switches # 0.104 K/sec ( +- 1.66% )
2 cpu-migrations # 0.017 K/sec ( +- 26.76% )
798 page-faults # 0.009 M/sec ( +- 8.54% )
237,982,690 cycles # 2.752 GHz ( +- 0.41% ) [51.32%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
135,730,319 instructions # 0.57 insns per cycle ( +- 0.32% ) [75.77%]
30,880,156 branches # 357.057 M/sec ( +- 0.25% ) [75.76%]
4,188 branch-misses # 0.01% of all branches ( +- 7.59% ) [74.08%]
0.087016724 seconds time elapsed ( +- 0.50% )
Can someone explain this? So far my experience was that the STL gets faster by enabling C++11, esp. thanks to move semantics.
EDIT: As suggested, using container.emplace_back(); instead brings the performance on par with the C++03 version. How can the C++03 version achieve the same for push_back?
milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out
Performance counter stats for './a.out' (10 runs):
36.229348 task-clock # 0.988 CPUs utilized ( +- 0.81% )
4 context-switches # 0.116 K/sec ( +- 3.17% )
1 cpu-migrations # 0.017 K/sec ( +- 36.85% )
798 page-faults # 0.022 M/sec ( +- 8.54% )
94,488,818 cycles # 2.608 GHz ( +- 1.11% ) [50.44%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
94,851,411 instructions # 1.00 insns per cycle ( +- 0.98% ) [75.22%]
30,468,562 branches # 840.991 M/sec ( +- 1.07% ) [76.71%]
2,723 branch-misses # 0.01% of all branches ( +- 9.84% ) [74.81%]
0.036678068 seconds time elapsed ( +- 0.80% )
Source: (StackOverflow)
I have read the link about GCC's Options for Code Generation Conventions, but could not understand what "Generate position-independent code (PIC)" means. Please give an example to explain what it means.
Source: (StackOverflow)
So I'm working on an exceedingly large codebase, and recently upgraded to gcc 4.3, which now triggers this warning:
warning: deprecated conversion from string constant to ‘char*’
Obviously, the correct way to fix this is to find every declaration like
char *s = "constant string";
or function call like:
void foo(char *s);
foo("constant string");
and make them const char pointers. However, that would mean touching 564 files, minimum, which is not a task I wish to perform at this point in time. The problem right now is that I'm running with -Werror, so I need some way to stifle these warnings. How can I do that?
Source: (StackOverflow)