Intel interview questions
Top frequently asked Intel interview questions
I have Windows 8.1 Pro with an AMD processor. I installed the Android SDK and Eclipse. It works, but the problem is that when I create an AVD and launch it, it shows this error:
emulator: ERROR: x86 emulation currently requires hardware acceleration!
Please ensure Intel HAXM is properly installed and usable.
CPU acceleration status: HAX kernel module is not installed!
I have already installed the Intel Hardware Accelerated Execution Manager (HAXM) and I have enabled virtualization from the boot menu, but it's still not working.
Source: (StackOverflow)
I have a multiply-add kernel inside my application and I want to increase its performance.
I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:
for(int i=0; i<iterations; i+=4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            y1 = _mm_add_ss(y1, _mm_mul_ss(w, x1));
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
            y4 = _mm_add_ss(y4, _mm_mul_ss(w, x4));
        }
    }

    _mm_store_ss(&output[i],   y1);
    _mm_store_ss(&output[i+1], y2);
    _mm_store_ss(&output[i+2], y3);
    _mm_store_ss(&output[i+3], y4);
}
I know I can use packed FP vectors to increase the performance, and I already did so successfully, but I want to know why the single scalar code isn't able to reach the processor's peak performance.
The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).
If I'm correct from studying the generated assembly code, the ideal schedule would look as follows, where the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for dependent instructions takes 2 cycles, the FP multiply takes 4 cycles and the FP add takes 3 cycles. (Note that the dependence from the multiply to the add doesn't incur any switch latency because the operations belong to the same domain.)
According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.
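For concreteness, here is the same gap expressed in cycles (a back-of-the-envelope check using only the numbers above; each inner-loop iteration in the assembly further down issues 4 mulss + 4 addss = 8 scalar FP operations):
8 FP ops / 2.0 ops per cycle = 4 cycles per iteration (ideal)
8 FP ops / 1.6 ops per cycle = 5 cycles per iteration (measured)
so roughly one cycle per iteration is lost relative to the ideal schedule.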
I am trying to either:
- get rid of this overhead, or
- explain where it comes from
Of course there are the problems of cache misses and data misalignment, which can increase the latency of the move instructions, but are there any other factors that could play a role here, like register read stalls?
I hope my problem is clear, thanks in advance for your responses!
Update: The assembly of the inner-loop looks as follows:
...
Block 21:
    movssl  (%rsi,%rdi,4), %xmm4
    movssl  (%rcx,%rdi,4), %xmm0
    movssl  0x4(%rcx,%rdi,4), %xmm1
    movssl  0x8(%rcx,%rdi,4), %xmm2
    movssl  0xc(%rcx,%rdi,4), %xmm3
    inc     %rdi
    mulss   %xmm4, %xmm0
    cmp     $0x32, %rdi
    mulss   %xmm4, %xmm1
    mulss   %xmm4, %xmm2
    mulss   %xmm3, %xmm4
    addss   %xmm0, %xmm5
    addss   %xmm1, %xmm6
    addss   %xmm2, %xmm7
    addss   %xmm4, %xmm8
    jl      0x401b52 <Block 21>
...
Source: (StackOverflow)
I can understand how one can write a program that uses multiple processes or threads: fork() a new process and use IPC, or create multiple threads and use those sorts of communication mechanisms.
I also understand context switching. That is, with only one CPU, the operating system schedules time for each process (and there are tons of scheduling algorithms out there), and thereby we give the appearance of running multiple processes simultaneously.
And now that we have multi-core processors (or multi-processor computers), we could have two processes running simultaneously on two separate cores.
My question is about the last scenario: how does the kernel control which core a process runs on? Which system calls (in Linux, or even Windows) schedule a process on a specific core?
The reason I'm asking: I'm working on a project for school where we are to explore a recent topic in computing - and I chose multi-core architectures. There seems to be a lot of material on how to program in that kind of environment (how to watch for deadlock or race conditions) but not much on controlling the individual cores themselves. I would love to be able to write a few demonstration programs and present some assembly instructions or C code to the effect of "See, I am running an infinite loop on the 2nd core, look at the spike in CPU utilization for that specific core".
Any code examples? Or tutorials?
edit: For clarification - many people have said that this is the purpose of the OS, and that one should let the OS take care of this. I completely agree! But then what I'm asking (or trying to get a feel for) is what the operating system actually does to do this. Not the scheduling algorithm, but more "once a core is chosen, what instructions must be executed to have that core start fetching instructions?"
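For the demo-program part of the question, here is roughly what the user-space side looks like on Linux (a sketch only: sched_setaffinity asks the scheduler to keep the calling process on a given core, which is the interface on top of the kernel-internal mechanism the edit above asks about):
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1            // exposes CPU_ZERO/CPU_SET and sched_setaffinity
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);            // core index 1, i.e. the "2nd core"

    // pid 0 means "the calling process"
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }

    // Busy loop: in top, press '1' to see per-core load and watch core 1 peg.
    volatile unsigned long spins = 0;
    for (;;) ++spins;
}
Windows offers an analogous interface via SetProcessAffinityMask / SetThreadAffinityMask.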
Source: (StackOverflow)
I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?
[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.
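For what it's worth, on Linux the kernel exposes the cache hierarchy it detected through sysfs, which is one quick way to confirm these numbers without relying on marketing pages. A minimal sketch (the paths are Linux-specific; cpu0 and the open-ended index loop simply follow the conventional layout):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Each index<N> directory describes one cache visible to cpu0
    // (typically L1 data, L1 instruction, L2 and, if present, L3).
    for (int idx = 0; ; ++idx) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index"
                           + std::to_string(idx) + "/";
        std::ifstream level(base + "level");
        std::ifstream type(base + "type");
        std::ifstream size(base + "size");

        std::string lvl, t, sz;
        if (!(level >> lvl) || !(type >> t) || !(size >> sz))
            break;                        // no more cache entries exposed

        std::cout << "L" << lvl << " " << t << ": " << sz << "\n";
    }
    return 0;
}
On a Core 2 Quad like the Q9300 this should report a 32K L1 data cache and a 32K L1 instruction cache per core, consistent with the test results above.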
Source: (StackOverflow)
This question continues on my question here (on the advice of Mystical):
C code loop performance
Continuing from that question: when I use packed instructions instead of scalar instructions, the code using intrinsics looks very similar:
for(int i=0; i<size; i+=16) {
    y1 = _mm_load_ps(&output[i]);
    …
    y4 = _mm_load_ps(&output[i+12]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ps1(weight[i+k+l]);

            x1 = _mm_load_ps(&input[i+k+l]);
            y1 = _mm_add_ps(y1, _mm_mul_ps(w, x1));
            …
            x4 = _mm_load_ps(&input[i+k+l+12]);
            y4 = _mm_add_ps(y4, _mm_mul_ps(w, x4));
        }
    }

    _mm_store_ps(&output[i], y1);
    …
    _mm_store_ps(&output[i+12], y4);
}
The measured performance of this kernel is about 5.6 FP operations per cycle, although I would expect it to be exactly 4x the performance of the scalar version, i.e. 4 × 1.6 = 6.4 FP ops per cycle.
Taking the move of the weight factor into account (thanks for pointing that out), the schedule looks like:
It looks like the schedule doesn't change, although there is an extra instruction after the movss operation that moves the scalar weight value to the XMM register and then uses shufps to copy this scalar value to the entire vector. It seems like the weight vector is ready to be used by the mulps in time, taking the switching latency from the load domain to the floating-point domain into account, so this shouldn't incur any extra latency.
The movaps (aligned, packed move), addps and mulps instructions that are used in this kernel (checked with the assembly code) have the same latency and throughput as their scalar versions, so this shouldn't incur any extra latency either.
Does anybody have an idea where this extra cycle per 8 cycles is spent, assuming the maximum performance this kernel can get is 6.4 FP ops per cycle and it is running at 5.6 FP ops per cycle?
By the way, here is what the actual assembly looks like:
…
Block x:
    movapsx (%rax,%rcx,4), %xmm0
    movapsx 0x10(%rax,%rcx,4), %xmm1
    movapsx 0x20(%rax,%rcx,4), %xmm2
    movapsx 0x30(%rax,%rcx,4), %xmm3
    movssl  (%rdx,%rcx,4), %xmm4
    inc     %rcx
    shufps  $0x0, %xmm4, %xmm4 {fill weight vector}
    cmp     $0x32, %rcx
    mulps   %xmm4, %xmm0
    mulps   %xmm4, %xmm1
    mulps   %xmm4, %xmm2
    mulps   %xmm3, %xmm4
    addps   %xmm0, %xmm5
    addps   %xmm1, %xmm6
    addps   %xmm2, %xmm7
    addps   %xmm4, %xmm8
    jl      0x401ad6 <Block x>
…
Source: (StackOverflow)
I tried to run my hello world Android Studio application on my computer but got the following output:
Emulator: ERROR: x86 emulation currently requires hardware acceleration!
Please ensure Intel HAXM is properly installed and usable.
CPU acceleration status: HAX kernel module is not installed!
Can you tell me what I can do about this error?
Source: (StackOverflow)
I am trying to install the Intel MPI Benchmarks on my computer and I receive this error:
fork: retry: Resource temporarily unavailable
Then I receive this error again when I run the ls and top commands.
What is causing this error?
My machine configuration:
Dell precision T7500
Scientific Linux release 6.2 (Carbon)
Source: (StackOverflow)
Starting with the Pentium Pro (P6 microarchitecture), Intel redesigned its microprocessors and used an internal RISC core beneath the old CISC instructions. Since the Pentium Pro, all CISC instructions are divided into smaller parts (uops) and then executed by the RISC core.
At the beginning it was clear to me that Intel decided to hide the new internal architecture and force programmers to use the "CISC shell". Thanks to this decision Intel could fully redesign the microprocessor architecture without breaking compatibility, which is reasonable.
However, I don't understand one thing: why has Intel kept the internal RISC instruction set hidden for so many years? Why don't they let programmers use RISC instructions the way they use the old x86 CISC instruction set?
If Intel keeps backward compatibility for so long (we still have virtual 8086 mode next to 64-bit mode), why don't they allow us to compile programs so that they bypass the CISC instructions and use the RISC core directly? This would open a natural way to slowly abandon the x86 instruction set, which is deprecated nowadays (this is the main reason why Intel decided to use a RISC core inside, right?).
Looking at the new Intel 'Core i' series, I see that they only extend the CISC instruction set, adding AVX, SSE4 and others.
Source: (StackOverflow)
For the Intel architectures, is there a way to instruct the GCC compiler to generate code that always forces branch prediction a particular way in my code? Does the Intel hardware even support this? What about other compilers or hardware?
I would use this in C++ code where I know the case I wish to run fast, and I do not care about the slowdown when the other branch needs to be taken, even when it has recently taken that branch.
for (;;) {
    if (normal) { // How to tell compiler to always branch predict true value?
        doSomethingNormal();
    } else {
        exceptionalCase();
    }
}
As a follow-on question for Evdzhan Mustafa: can the hint just apply the first time the processor encounters the instruction, with all subsequent branch prediction functioning normally?
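For reference (this concerns the compiler side of the question, not the hardware hint prefixes the follow-up asks about): with GCC the usual mechanism is __builtin_expect, often wrapped in likely()/unlikely() macros, and C++20 adds the [[likely]]/[[unlikely]] attributes. It steers block layout and the compiler's static prediction rather than programming the branch predictor itself. A minimal sketch (the run() wrapper and the stub bodies are illustrative additions):
#include <cstdio>

// likely()/unlikely() in the style popularized by the Linux kernel.
// __builtin_expect is a GCC/Clang builtin; it guides code layout and the
// compiler's static prediction, it does not pin the hardware predictor.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static void doSomethingNormal() { /* hot path */ }
static void exceptionalCase()   { std::puts("rare case"); }

void run(bool normal, long iterations) {
    for (long i = 0; i < iterations; ++i) {
        if (likely(normal)) {        // "normal" is expected to be true
            doSomethingNormal();
        } else {
            exceptionalCase();
        }
    }
}

int main() {
    run(true, 1000000);              // the hinted, fast case
    return 0;
}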
Source: (StackOverflow)
I have an issue with my HAXM installation. Here is the thing: I got this error every single time I tried to install HAXM on my computer:
The problem is that my computer supports Virtualization Technology (see the picture below). Any idea how to fix this issue?
Source: (StackOverflow)
The gcc -S option will generate assembly code in AT&T syntax; is there a way to generate it in Intel syntax instead? Or is there a way to convert between the two?
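For what it's worth, GCC accepts an assembler-dialect switch, and objdump can disassemble existing binaries in Intel syntax as well; typical invocations look like this (exact flag support depends on the toolchain build):
gcc -S -masm=intel prog.c      # emits prog.s using Intel syntax
objdump -d -M intel prog.o     # disassembles an existing object/binary in Intel syntax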
Source: (StackOverflow)
I'm trying to use the Intel HAX x86 emulator for Windows (8, if that matters). I installed everything and created an AVD for the Android version, and everything appears correct, but when I run it, I get this output:
Starting emulator for AVD 'x86_QVGA_Level10'
emulator: device fd:1044
HAX is working and emulator runs in fast virt mode
emulator: Failed to sync vcpu reg
emulator: Failed to sync HAX vcpu context
and the emulator won't run. The significant part of this error (Failed to sync vcpu reg) is not mentioned anywhere online, except in the source code for HAX itself, and I can't figure out how to make any sense of it.
Can anyone advise about how to get past this error? I really need to get this working, because debugging on device and in the default emulators is painfully slow.
Source: (StackOverflow)
I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <cstdlib>   // Needed for rand() and RAND_MAX
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
    double x = 0.0;
    double y = 0.0;
    double euclid_sq = 0.0;

    // Continue generating two uniform random variables
    // until the square of their "euclidean distance"
    // is less than unity
    do {
        x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
        y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
        euclid_sq = x*x + y*y;
    } while (euclid_sq >= 1.0);

    return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
    double S_adjust = S * exp(T*(r-0.5*v*v));
    double S_cur = 0.0;
    double payoff_sum = 0.0;

    for (int i=0; i<num_sims; i++) {
        double gauss_bm = gaussian_box_muller();
        S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
        payoff_sum += std::max(S_cur - K, 0.0);
    }

    return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
    double S_adjust = S * exp(T*(r-0.5*v*v));
    double S_cur = 0.0;
    double payoff_sum = 0.0;

    for (int i=0; i<num_sims; i++) {
        double gauss_bm = gaussian_box_muller();
        S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
        payoff_sum += std::max(K - S_cur, 0.0);
    }

    return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
    // First we create the parameter list
    int num_sims = 10000000; // Number of simulated asset paths
    double S = 100.0;        // Option price
    double K = 100.0;        // Strike price
    double r = 0.05;         // Risk-free rate (5%)
    double v = 0.2;          // Volatility of the underlying (20%)
    double T = 1.0;          // One year until expiry

    // Then we calculate the call/put values via Monte Carlo
    double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
    double put = monte_carlo_put_price(num_sims, S, K, r, v, T);

    // Finally we output the parameters and prices
    std::cout << "Number of Paths: " << num_sims << std::endl;
    std::cout << "Underlying: " << S << std::endl;
    std::cout << "Strike: " << K << std::endl;
    std::cout << "Risk-Free Rate: " << r << std::endl;
    std::cout << "Volatility: " << v << std::endl;
    std::cout << "Maturity: " << T << std::endl;
    std::cout << "Call Price: " << call << std::endl;
    std::cout << "Put Price: " << put << std::endl;

    return 0;
}
The changes I have made seem to have increased the running time by a second, but I'm not entirely sure what I can change to stall the pipeline without adding code. A pointer in the right direction would be awesome; I appreciate any responses.
The highlights are:
- It's a second semester architecture class at a community college (using the Hennessy and Patterson textbook).
- The lab computers have Haswell CPUs.
- The students have been exposed to the CPUID instruction and how to determine cache size, as well as to intrinsics and the CLFLUSH instruction.
- Any compiler options are allowed, and so is inline asm.
- Writing your own square-root algorithm was announced as being outside the pale.
Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this and assumed -O0, and that a 17% increase in run-time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.
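As one small, hedged illustration of the dependency-chain idea (a toy sketch, not a suggested solution and not taken from the assignment): forcing an accumulator through a volatile memory location makes every addition wait for the store and reload of the previous value, which is the kind of RAW serialization the prompt hints at.
#include <iostream>

int main() {
    const long n = 50000000;

    // Register-friendly reduction: the loop is limited mainly by FP-add latency.
    double fast = 0.0;
    for (long i = 0; i < n; ++i)
        fast += i * 0.5;

    // volatile forces a store + reload on every iteration, so each add also
    // waits on store-forwarding: a RAW dependency chained through memory.
    volatile double slow = 0.0;
    for (long i = 0; i < n; ++i)
        slow = slow + i * 0.5;

    std::cout << fast << " " << slow << "\n";  // print so neither loop is optimized away
    return 0;
}
Compiled with optimization, the second loop typically runs noticeably slower even though it performs the same arithmetic.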
Source: (StackOverflow)
Google and Intel have recently published a new way to run the emulator, which should work much better than the previous version (which emulated an ARM CPU). Here are some links about it: this and this.
However, after installing the new components and creating a new emulator configuration as instructed, I get an error and I also can't see any improvements. I've tried both API 10 and API 15, and with GPU enabled and disabled. None of those configurations helped. I've also tried it on two different computers and didn't get any boost (and got the same errors).
It seems that in the posts I've read about it, nobody had any problems with it, and everyone reports a much faster emulator.
The error it shows is:
emulator: Failed to open the HAX device!
HAX is not working and emulator runs in emulation mode
emulator: Open HAX device failed
Why is this happening, and is there a way to fix it? Has anyone else run into these errors?
By the way, I have an Intel CPU, if that could be a problem.
EDIT: here's what I see in the BIOS, so it should be available:
Source: (StackOverflow)
I just read this fascinating article about the 20x-200x slowdowns you can get on Intel CPUs with denormalized floats (floating-point numbers very close to 0).
There is an option with SSE to round these off to 0, restoring performance when such floating point values are encountered.
How do C# apps handle this? Is there an option to enable/disable _MM_FLUSH_ZERO?
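For reference, on the native side the MXCSR bits the article describes are set like this with SSE intrinsics in C/C++ (a sketch; the C# question is really about whether the runtime exposes an equivalent, which this snippet does not answer):
#include <cstdio>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (DAZ)

int main() {
    // Flush-to-zero: denormal results become 0; denormals-are-zero: denormal
    // inputs are read as 0. Both avoid the slow microcode assists, assuming
    // the compiler uses SSE for float math (the default on x86-64).
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    volatile float tiny = 1e-38f;        // already in the denormal range
    for (int i = 0; i < 10; ++i)
        tiny = tiny * 0.5f;              // with FTZ/DAZ on, this collapses to 0
    std::printf("%g\n", static_cast<double>(tiny));
    return 0;
}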
Source: (StackOverflow)