Intel interview questions
Top frequently asked Intel interview questions
I have Windows 8.1 Pro with an AMD processor. I installed the Android SDK and Eclipse. It works, but the problem is that when I create an AVD and launch it, it shows this error:
emulator: ERROR: x86 emulation currently requires hardware acceleration!
Please ensure Intel HAXM is properly installed and usable.
CPU acceleration status: HAX kernel module is not installed!
I have already installed the Intel Hardware Accelerated Execution Manager (HAXM) and I have enabled virtualization from the boot menu, but it's still not working.
Source: (StackOverflow)
I have a multiply-add kernel inside my application and I want to increase its performance.
I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:
for(int i=0; i<iterations; i+=4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            y1 = _mm_add_ss(y1, _mm_mul_ss(w, x1));
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
            y4 = _mm_add_ss(y4, _mm_mul_ss(w, x4));
        }
    }

    _mm_store_ss(&output[i],   y1);
    _mm_store_ss(&output[i+1], y2);
    _mm_store_ss(&output[i+2], y3);
    _mm_store_ss(&output[i+3], y4);
}
I know I can use packed FP vectors to increase the performance, and I already did so successfully, but I want to know why the single scalar code isn't able to reach the processor's peak performance.
The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).
If I'm correct from studying the generated assembly code, the ideal schedule would look as follows, where the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for dependent instructions takes 2 cycles, the FP multiply takes 4 cycles and the FP add takes 3 cycles. (Note that the dependence from the multiply to the add doesn't incur any switch latency because the operations belong to the same domain.)
According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.
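For concreteness, here is the same gap expressed in cycles (a back-of-the-envelope check using only the numbers above; each inner-loop iteration in the assembly further down issues 4 mulss + 4 addss = 8 scalar FP operations):
8 FP ops / 2.0 ops per cycle = 4 cycles per iteration (ideal)
8 FP ops / 1.6 ops per cycle = 5 cycles per iteration (measured)
so roughly one cycle per iteration is lost relative to the ideal schedule.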
I am trying to either:
- get rid of this overhead, or
- explain where it comes from
Of course there are the problems of cache misses and data misalignment, which can increase the latency of the move instructions, but are there any other factors that could play a role here, like register read stalls?
I hope my problem is clear, thanks in advance for your responses!
Update: The assembly of the inner-loop looks as follows:
...
Block 21:
    movssl  (%rsi,%rdi,4), %xmm4
    movssl  (%rcx,%rdi,4), %xmm0
    movssl  0x4(%rcx,%rdi,4), %xmm1
    movssl  0x8(%rcx,%rdi,4), %xmm2
    movssl  0xc(%rcx,%rdi,4), %xmm3
    inc     %rdi
    mulss   %xmm4, %xmm0
    cmp     $0x32, %rdi
    mulss   %xmm4, %xmm1
    mulss   %xmm4, %xmm2
    mulss   %xmm3, %xmm4
    addss   %xmm0, %xmm5
    addss   %xmm1, %xmm6
    addss   %xmm2, %xmm7
    addss   %xmm4, %xmm8
    jl      0x401b52 <Block 21>
...
Source: (StackOverflow)
I can understand how one can write a program that uses multiple processes or threads: fork() a new process and use IPC, or create multiple threads and use those sorts of communication mechanisms.
I also understand context switching. That is, with only one CPU, the operating system schedules time for each process (and there are tons of scheduling algorithms out there), and thereby we give the appearance of running multiple processes simultaneously.
And now that we have multi-core processors (or multi-processor computers), we could have two processes running simultaneously on two separate cores.
My question is about the last scenario: how does the kernel control which core a process runs on? Which system calls (in Linux, or even Windows) schedule a process on a specific core?
The reason I'm asking: I'm working on a project for school where we are to explore a recent topic in computing - and I chose multi-core architectures. There seems to be a lot of material on how to program in that kind of environment (how to watch for deadlock or race conditions) but not much on controlling the individual cores themselves. I would love to be able to write a few demonstration programs and present some assembly instructions or C code to the effect of "See, I am running an infinite loop on the 2nd core, look at the spike in CPU utilization for that specific core".
Any code examples? Or tutorials?
edit: For clarification - many people have said that this is the purpose of the OS, and that one should let the OS take care of this. I completely agree! But then what I'm asking (or trying to get a feel for) is what the operating system actually does to do this. Not the scheduling algorithm, but more "once a core is chosen, what instructions must be executed to have that core start fetching instructions?"
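For the demo-program part of the question, here is roughly what the user-space side looks like on Linux (a sketch only: sched_setaffinity asks the scheduler to keep the calling process on a given core, which is the interface on top of the kernel-internal mechanism the edit above asks about):
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1            // exposes CPU_ZERO/CPU_SET and sched_setaffinity
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);            // core index 1, i.e. the "2nd core"

    // pid 0 means "the calling process"
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }

    // Busy loop: in top, press '1' to see per-core load and watch core 1 peg.
    volatile unsigned long spins = 0;
    for (;;) ++spins;
}
Windows offers an analogous interface via SetProcessAffinityMask / SetThreadAffinityMask.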
Source: (StackOverflow)
I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?
[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.
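For what it's worth, on Linux the kernel exposes the cache hierarchy it detected through sysfs, which is one quick way to confirm these numbers without relying on marketing pages. A minimal sketch (the paths are Linux-specific; cpu0 and the open-ended index loop simply follow the conventional layout):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Each index<N> directory describes one cache visible to cpu0
    // (typically L1 data, L1 instruction, L2 and, if present, L3).
    for (int idx = 0; ; ++idx) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index"
                           + std::to_string(idx) + "/";
        std::ifstream level(base + "level");
        std::ifstream type(base + "type");
        std::ifstream size(base + "size");

        std::string lvl, t, sz;
        if (!(level >> lvl) || !(type >> t) || !(size >> sz))
            break;                        // no more cache entries exposed

        std::cout << "L" << lvl << " " << t << ": " << sz << "\n";
    }
    return 0;
}
On a Core 2 Quad like the Q9300 this should report a 32K L1 data cache and a 32K L1 instruction cache per core, consistent with the test results above.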
Source: (StackOverflow)
This question continues on my question here (on the advice of Mystical):
C code loop performance
Continuing from that question: when I use packed instructions instead of scalar instructions, the code using intrinsics looks very similar:
for(int i=0; i<size; i+=16) {
    y1 = _mm_load_ps(&output[i]);
    …
    y4 = _mm_load_ps(&output[i+12]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ps1(weight[i+k+l]);

            x1 = _mm_load_ps(&input[i+k+l]);
            y1 = _mm_add_ps(y1, _mm_mul_ps(w, x1));
            …
            x4 = _mm_load_ps(&input[i+k+l+12]);
            y4 = _mm_add_ps(y4, _mm_mul_ps(w, x4));
        }
    }

    _mm_store_ps(&output[i], y1);
    …
    _mm_store_ps(&output[i+12], y4);
}
The measured performance of this kernel is about 5.6 FP operations per cycle, although I would expect it to be exactly 4x the performance of the scalar version, i.e. 4 × 1.6 = 6.4 FP ops per cycle.
Taking the move of the weight factor into account (thanks for pointing that out), the schedule looks like:
It looks like the schedule doesn't change, although there is an extra instruction after the movss operation that moves the scalar weight value to the XMM register and then uses shufps to copy this scalar value to the entire vector. It seems like the weight vector is ready to be used by the mulps in time, taking the switching latency from the load domain to the floating-point domain into account, so this shouldn't incur any extra latency.
The movaps (aligned, packed move), addps and mulps instructions that are used in this kernel (checked with the assembly code) have the same latency and throughput as their scalar versions, so this shouldn't incur any extra latency either.
Does anybody have an idea where this extra cycle per 8 cycles is spent, assuming the maximum performance this kernel can get is 6.4 FP ops per cycle and it is running at 5.6 FP ops per cycle?
By the way, here is what the actual assembly looks like:
…
Block x:
    movapsx (%rax,%rcx,4), %xmm0
    movapsx 0x10(%rax,%rcx,4), %xmm1
    movapsx 0x20(%rax,%rcx,4), %xmm2
    movapsx 0x30(%rax,%rcx,4), %xmm3
    movssl  (%rdx,%rcx,4), %xmm4
    inc     %rcx
    shufps  $0x0, %xmm4, %xmm4 {fill weight vector}
    cmp     $0x32, %rcx
    mulps   %xmm4, %xmm0
    mulps   %xmm4, %xmm1
    mulps   %xmm4, %xmm2
    mulps   %xmm3, %xmm4
    addps   %xmm0, %xmm5
    addps   %xmm1, %xmm6
    addps   %xmm2, %xmm7
    addps   %xmm4, %xmm8
    jl      0x401ad6 <Block x>
…
Source: (StackOverflow)
I tried to run my hello world Android Studio application on my computer but got the following output:
Emulator: ERROR: x86 emulation currently requires hardware acceleration!
Please ensure Intel HAXM is properly installed and usable.
CPU acceleration status: HAX kernel module is not installed!
Can you tell me what I can do about this error?
Source: (StackOverflow)
I am trying to install the Intel MPI Benchmarks on my computer and I receive this error:
fork: retry: Resource temporarily unavailable
Then I receive this error again when I run the ls and top commands.
What is causing this error?
My machine configuration:
Dell precision T7500
Scientific Linux release 6.2 (Carbon)
Source: (StackOverflow)
Starting with the Pentium Pro (P6 microarchitecture), Intel redesigned its microprocessors and used an internal RISC core beneath the old CISC instructions. Since the Pentium Pro, all CISC instructions are divided into smaller parts (uops) and then executed by the RISC core.
At the beginning it was clear to me that Intel decided to hide the new internal architecture and force programmers to use the "CISC shell". Thanks to this decision Intel could fully redesign the microprocessor architecture without breaking compatibility, which is reasonable.
However, I don't understand one thing: why has Intel kept the internal RISC instruction set hidden for so many years? Why don't they let programmers use RISC instructions the way they use the old x86 CISC instruction set?
If Intel keeps backward compatibility for so long (we still have virtual 8086 mode next to 64-bit mode), why don't they allow us to compile programs so that they bypass the CISC instructions and use the RISC core directly? This would open a natural way to slowly abandon the x86 instruction set, which is deprecated nowadays (this is the main reason why Intel decided to use a RISC core inside, right?).
Looking at the new Intel 'Core i' series, I see that they only extend the CISC instruction set, adding AVX, SSE4 and others.
Source: (StackOverflow)
For the Intel architectures, is there a way to instruct the GCC compiler to generate code that always forces branch prediction a particular way in my code? Does the Intel hardware even support this? What about other compilers or hardware?
I would use this in C++ code where I know the case I wish to run fast, and I do not care about the slowdown when the other branch needs to be taken, even when it has recently taken that branch.
for (;;) {
    if (normal) { // How to tell compiler to always branch predict true value?
        doSomethingNormal();
    } else {
        exceptionalCase();
    }
}
As a follow-on question for Evdzhan Mustafa: can the hint just apply the first time the processor encounters the instruction, with all subsequent branch prediction functioning normally?
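For reference (this concerns the compiler side of the question, not the hardware hint prefixes the follow-up asks about): with GCC the usual mechanism is __builtin_expect, often wrapped in likely()/unlikely() macros, and C++20 adds the [[likely]]/[[unlikely]] attributes. It steers block layout and the compiler's static prediction rather than programming the branch predictor itself. A minimal sketch (the run() wrapper and the stub bodies are illustrative additions):
#include <cstdio>

// likely()/unlikely() in the style popularized by the Linux kernel.
// __builtin_expect is a GCC/Clang builtin; it guides code layout and the
// compiler's static prediction, it does not pin the hardware predictor.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static void doSomethingNormal() { /* hot path */ }
static void exceptionalCase()   { std::puts("rare case"); }

void run(bool normal, long iterations) {
    for (long i = 0; i < iterations; ++i) {
        if (likely(normal)) {        // "normal" is expected to be true
            doSomethingNormal();
        } else {
            exceptionalCase();
        }
    }
}

int main() {
    run(true, 1000000);              // the hinted, fast case
    return 0;
}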
Source: (StackOverflow)
I have an issue with my HAXM installation. Here is the thing: I got this error every single time I tried to install HAXM on my computer:
The problem is that my computer supports Virtualization Technology (see the picture below). Any idea how to fix this issue?
Source: (StackOverflow)
The gcc -S option will generate assembly code in AT&T syntax; is there a way to generate it in Intel syntax instead? Or is there a way to convert between the two?
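For what it's worth, GCC accepts an assembler-dialect switch, and objdump can disassemble existing binaries in Intel syntax as well; typical invocations look like this (exact flag support depends on the toolchain build):
gcc -S -masm=intel prog.c      # emits prog.s using Intel syntax
objdump -d -M intel prog.o     # disassembles an existing object/binary in Intel syntax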
Source: (StackOverflow)
I'm trying to use the Intel HAX x86 emulator for Windows (8, if that matters). I installed everything and created an AVD for the Android version, and everything appears correct, but when I run it, I get this output:
Starting emulator for AVD 'x86_QVGA_Level10'
emulator: device fd:1044
HAX is working and emulator runs in fast virt mode
emulator: Failed to sync vcpu reg
emulator: Failed to sync HAX vcpu context
and the emulator won't run. The significant part of this error (Failed to sync vcpu reg) is not mentioned anywhere online, except in the source code for HAX itself, and I can't figure out how to make any sense of it.
Can anyone advise about how to get past this error? I really need to get this working, because debugging on device and in the default emulators is painfully slow.
Source: (StackOverflow)
I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To deoptimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <cstdlib>   // Needed for rand() and RAND_MAX
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
    double x = 0.0;
    double y = 0.0;
    double euclid_sq = 0.0;

    // Continue generating two uniform random variables
    // until the square of their "euclidean distance"
    // is less than unity
    do {
        x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
        y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
        euclid_sq = x*x + y*y;
    } while (euclid_sq >= 1.0);

    return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
    double S_adjust = S * exp(T*(r-0.5*v*v));
    double S_cur = 0.0;
    double payoff_sum = 0.0;

    for (int i=0; i<num_sims; i++) {
        double gauss_bm = gaussian_box_muller();
        S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
        payoff_sum += std::max(S_cur - K, 0.0);
    }

    return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
    double S_adjust = S * exp(T*(r-0.5*v*v));
    double S_cur = 0.0;
    double payoff_sum = 0.0;

    for (int i=0; i<num_sims; i++) {
        double gauss_bm = gaussian_box_muller();
        S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
        payoff_sum += std::max(K - S_cur, 0.0);
    }

    return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
    // First we create the parameter list
    int num_sims = 10000000; // Number of simulated asset paths
    double S = 100.0;        // Option price
    double K = 100.0;        // Strike price
    double r = 0.05;         // Risk-free rate (5%)
    double v = 0.2;          // Volatility of the underlying (20%)
    double T = 1.0;          // One year until expiry

    // Then we calculate the call/put values via Monte Carlo
    double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
    double put = monte_carlo_put_price(num_sims, S, K, r, v, T);

    // Finally we output the parameters and prices
    std::cout << "Number of Paths: " << num_sims << std::endl;
    std::cout << "Underlying: " << S << std::endl;
    std::cout << "Strike: " << K << std::endl;
    std::cout << "Risk-Free Rate: " << r << std::endl;
    std::cout << "Volatility: " << v << std::endl;
    std::cout << "Maturity: " << T << std::endl;
    std::cout << "Call Price: " << call << std::endl;
    std::cout << "Put Price: " << put << std::endl;

    return 0;
}
The changes I have made seem to have increased the running time by a second, but I'm not entirely sure what I can change to stall the pipeline without adding code. A pointer in the right direction would be awesome; I appreciate any responses.
The highlights are:
- It's a second semester architecture class at a community college (using the Hennessy and Patterson textbook).
- The lab computers have Haswell CPUs.
- The students have been exposed to the CPUID instruction and how to determine cache size, as well as to intrinsics and the CLFLUSH instruction.
- Any compiler options are allowed, and so is inline asm.
- Writing your own square-root algorithm was announced as being outside the pale.
Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this and assumed -O0, and that a 17% increase in run-time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.
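As one small, hedged illustration of the dependency-chain idea (a toy sketch, not a suggested solution and not taken from the assignment): forcing an accumulator through a volatile memory location makes every addition wait for the store and reload of the previous value, which is the kind of RAW serialization the prompt hints at.
#include <iostream>

int main() {
    const long n = 50000000;

    // Register-friendly reduction: the loop is limited mainly by FP-add latency.
    double fast = 0.0;
    for (long i = 0; i < n; ++i)
        fast += i * 0.5;

    // volatile forces a store + reload on every iteration, so each add also
    // waits on store-forwarding: a RAW dependency chained through memory.
    volatile double slow = 0.0;
    for (long i = 0; i < n; ++i)
        slow = slow + i * 0.5;

    std::cout << fast << " " << slow << "\n";  // print so neither loop is optimized away
    return 0;
}
Compiled with optimization, the second loop typically runs noticeably slower even though it performs the same arithmetic.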
Source: (StackOverflow)
Google and Intel have recently published a new way to run the emulator, which should work much better than the previous version (which emulated an ARM CPU). Here are some links about it: this and this.
However, after installing the new components and creating a new emulator configuration as instructed, I get an error and I also can't see any improvements. I've tried both API 10 and API 15, and with GPU enabled and disabled. None of those configurations helped. I've also tried it on two different computers and didn't get any boost (and got the same errors).
It seems that in the posts I've read about it, nobody had any problems with it, and everyone reports a much faster emulator.
The error it shows is:
emulator: Failed to open the HAX device!
HAX is not working and emulator runs in emulation mode
emulator: Open HAX device failed
Why is this happening, and is there a way to fix it? Has anyone else run into these errors?
By the way, I have an Intel CPU, if that could be a problem.
EDIT: here's what I see in the BIOS, so it should be available:
Source: (StackOverflow)
I just read this fascinating article about the 20x-200x slowdowns you can get on Intel CPUs with denormalized floats (floating-point numbers very close to 0).
There is an option with SSE to round these off to 0, restoring performance when such floating point values are encountered.
How do C# apps handle this? Is there an option to enable/disable _MM_FLUSH_ZERO?
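For reference, on the native side the MXCSR bits the article describes are set like this with SSE intrinsics in C/C++ (a sketch; the C# question is really about whether the runtime exposes an equivalent, which this snippet does not answer):
#include <cstdio>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (DAZ)

int main() {
    // Flush-to-zero: denormal results become 0; denormals-are-zero: denormal
    // inputs are read as 0. Both avoid the slow microcode assists, assuming
    // the compiler uses SSE for float math (the default on x86-64).
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    volatile float tiny = 1e-38f;        // already in the denormal range
    for (int i = 0; i < 10; ++i)
        tiny = tiny * 0.5f;              // with FTZ/DAZ on, this collapses to 0
    std::printf("%g\n", static_cast<double>(tiny));
    return 0;
}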
Source: (StackOverflow)