EzDevInfo.com

Instructions

Create walkthroughs and guided tours (using coach marks) in a simple way, using Swift.

Should a developer be a designer?

I have been developing websites for quite some time and I am not so good in designing websites? My Boss is refering me to take some lessons on it.

But I really want to stick to development rather than designing?


Source: (StackOverflow)

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.

In an attempt to understand this topic bit more I want to know if there is ANY way to (get the trace) see the actual dynamic reordering done for a given program?

I want to give an input program and see the "out of order instruction execution trace" of my program.

I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.


Source: (StackOverflow)

Advertisements

Logging the Behavior of a Binary?

I want to figure out which instructions are executed differently when a command-line flag is passed to a program on Windows, of which I have the compiled (and optimized) binary, with no debug symbols or anything of the sort. I know the difference will not be more than a handful of instructions.

How would I go about figuring this out? Are there any techniques for logging exactly which instructions a program executes over a certain period of time?
(Note that this does not involve any system calls, just a flag being set in a loop because of the command-line flag.)


Source: (StackOverflow)

difference between conditional instructions (cmov) and jump instructions

I am having confusion where to use cmov instructions and where to use jump instructions in assembly?From performance point of view,what is the difference in both of them?which one is better? If possible,please explain their difference with an example. thanks in advance.


Source: (StackOverflow)

Does ret instruction cause esp register added by 4?

Does "ret" instruction cause "esp" register added by 4?


Source: (StackOverflow)

What is the 0x10 in the "leal 0x10(%ebx), %eax" x86 assembly instruction?

What the function is of the 0x10 in regards to this LEAL instruction? Is it a multiply or addition or is something else?

leal 0x10(%ebx), %eax

Can someone please clarify? This is x86 assembler on a Linux box.


Source: (StackOverflow)

Android instructions when open the application at first time?

Do you know this enter image description here

Well i want create something like this screen. When i open for the first time the application i want open this screen and display a context.. How is possible? I don't know what search for this type of thing..


Source: (StackOverflow)

IL Instructions not exposed by C#

What IL instructions are not exposed by C#?

I'm referring to instructions like sizeof and cpblk - there's no class or command that executes these instructions (sizeof in C# is computed at compile time, not at runtime AFAIK).

Others?

EDIT: The reason I'm asking this (and hopefully this will make my question a little more valid) is because I'm working on a small library which will provide the functionality of these instructions. sizeof and cpblk are already implemented - I wanted to know what others I may have missed before moving on.

EDIT2: Using Eric's answer, I've compiled a list of instructions:

  • Break
  • Jmp
  • Calli
  • Cpobj
  • Ckfinite
  • Prefix[1-7]
  • Prefixref
  • Endfilter
  • Unaligned
  • Tailcall
  • Cpblk
  • Initblk

There were a number of other instructions which were not included in the list, which I'm separating because they're basically shortcuts for other instructions (compressed to save time and space):

  • Ldarg[0-3]
  • Ldloc[0-3]
  • Stloc[0-3]
  • Ldc_[I4_[M1/S/0-8]/I8/R4/R8]
  • Ldind_[I1/U1/I2/U2/I4/U4/I8/R4/R8]
  • Stind_[I1/I2/I4/I8/R4/R8]
  • Conv_[I1/I2/I4/I8/R4/R8/U4/U8/U2/U1]
  • Conv_Ovf_[I1/I2/I4/I8/U1/U2/U4/U8]
  • Conv_Ovf_[I1/I2/I4/I8/U1/U2/U4/U8]_Un
  • Ldelem_[I1/I2/I4/I8/U1/U2/U4/R4/R8]
  • Stelem_[I1/I2/I4/I8/R4/R8]

Source: (StackOverflow)

C code loop performance [continued]

This question continues on my question here (on the advice of Mystical):

C code loop performance


Continuing on my question, when i use packed instructions instead of scalar instructions the code using intrinsics would look very similar:

for(int i=0; i<size; i+=16) {
    y1 = _mm_load_ps(output[i]);
    …
    y4 = _mm_load_ps(output[i+12]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ps1(weight[i+k+l]);

            x1 = _mm_load_ps(input[i+k+l]);
            y1 = _mm_add_ps(y1,_mm_mul_ps(w,x1));
            …
            x4 = _mm_load_ps(input[i+k+l+12]);
            y4 = _mm_add_ps(y4,_mm_mul_ps(w,x4));
        }
    }
    _mm_store_ps(&output[i],y1);
    …
    _mm_store_ps(&output[i+12],y4);
    }

The measured performance of this kernel is about 5.6 FP operations per cycle, although i would expect it to be exactly 4x the performance of the scalar version, i.e. 4.1,6=6,4 FP ops per cycle.

Taking the move of the weight factor into account (thanks for pointing that out), the schedule looks like:

schedule

It looks like the schedule doesn't change, although there is an extra instruction after the movss operation that moves the scalar weight value to the XMM register and then uses shufps to copy this scalar value in the entire vector. It seems like the weight vector is ready to be used for the mulps in time taking the switching latency from load to the floating point domain into account, so this shouldn't incur any extra latency.

The movaps (aligned, packed move),addps & mulps instructions that are used in this kernel (checked with assembly code) have the same latency & throughput as their scalar versions, so this shouldn't incur any extra latency either.

Does anybody have an idea where this extra cycle per 8 cycles is spent on, assuming the maximum performance this kernel can get is 6.4 FP ops per cycle and it is running at 5.6 FP ops per cycle?


By the way here is what the actual assembly looks like:

…
Block x: 
  movapsx  (%rax,%rcx,4), %xmm0
  movapsx  0x10(%rax,%rcx,4), %xmm1
  movapsx  0x20(%rax,%rcx,4), %xmm2
  movapsx  0x30(%rax,%rcx,4), %xmm3
  movssl  (%rdx,%rcx,4), %xmm4
  inc %rcx
  shufps $0x0, %xmm4, %xmm4               {fill weight vector}
  cmp $0x32, %rcx 
  mulps %xmm4, %xmm0 
  mulps %xmm4, %xmm1
  mulps %xmm4, %xmm2 
  mulps %xmm3, %xmm4
  addps %xmm0, %xmm5 
  addps %xmm1, %xmm6 
  addps %xmm2, %xmm7 
  addps %xmm4, %xmm8 
  jl 0x401ad6 <Block x> 
…

Source: (StackOverflow)

C code loop performance

I have a multiply-add kernel inside my application and I want to increase its performance.

I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:

 for(int i=0; i<iterations; i+=4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            y1 = _mm_add_ss(y1,_mm_mul_ss(w,x1));
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
            y4 = _mm_add_ss(y4,_mm_mul_ss(w,x4));
        }
    }
    _mm_store_ss(&output[i],y1);
    _mm_store_ss(&output[i+1],y2);
    _mm_store_ss(&output[i+2],y3);
    _mm_store_ss(&output[i+3],y4);
 }

I know I can use packed fp vectors to increase the performance and I already did so succesfully, but I want to know why the single scalar code isn't able to meet the processor's peak performance.

The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).

If I'm correct from studying the generated assembly code, the ideal schedule would look like follows, where the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for the dependent instructions takes 2 cycles, the FP multiply takes 4 cycles and the FP add takes 3 cycles. (Note that the dependence from the multiply -> add doesn't incur any switch latency because the operations belong to the same domain).

schedule

According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.

I am trying to either:

  • get rid of this overhead, or
  • explain where it comes from

Of course there is the problem with cache misses & data misalignment which can increase the latency of the move instructions, but are there any other factors that could play a role here? Like register read stalls or something?

I hope my problem is clear, thanks in advance for your responses!


Update: The assembly of the inner-loop looks as follows:

...
Block 21: 
  movssl  (%rsi,%rdi,4), %xmm4 
  movssl  (%rcx,%rdi,4), %xmm0 
  movssl  0x4(%rcx,%rdi,4), %xmm1 
  movssl  0x8(%rcx,%rdi,4), %xmm2 
  movssl  0xc(%rcx,%rdi,4), %xmm3 
  inc %rdi 
  mulss %xmm4, %xmm0 
  cmp $0x32, %rdi 
  mulss %xmm4, %xmm1 
  mulss %xmm4, %xmm2 
  mulss %xmm3, %xmm4 
  addss %xmm0, %xmm5 
  addss %xmm1, %xmm6 
  addss %xmm2, %xmm7 
  addss %xmm4, %xmm8 
  jl 0x401b52 <Block 21> 
...

Source: (StackOverflow)

How is x86 instruction cache synchronized?

I like examples, so I wrote a bit of self-modifying code in c...

#include <stdio.h>
#include <sys/mman.h> // linux

int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|
                            MAP_ANONYMOUS, -1, 0); // get executable memory
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register rax (000) which holds the return value
                       // according to linux x86_64 calling convention 
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) are already set to 0 by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
    }
    putchar('\n');
    return 0;
}

...which works, apparently:

>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

But honestly, I didn't expect it to work at all. I expected the instruction containing c[2] = 0 to be cached upon the first call to c, after which all consecutive calls to c would ignore the repeated changes made to c (unless I somehow explicitedly invalidated the cache). Luckily, my cpu appears to be smarter than that.

I guess the cpu compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know if this behavior can be considered predictable (barring any differences of hardware and os), and relied on?

(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)


Source: (StackOverflow)

JVM instruction ALOAD_0 in the 'main' method points to 'args' instead of 'this'?

I am trying to implement a subset of Java for an academic study. Well, I'm in the last stages (code generation) and I wrote a rather simple program to see how method arguments are handled:

class Main {
    public static void main(String[] args) {
        System.out.println(args.length);
    }
}

Then I built it, and ran 'Main.class' through an online disassembler I found at: http://www.cs.cornell.edu/People/egs/kimera/disassembler.html

I get the following implementation for the 'main' method: (the disassembled output is in Jasmin)

.method public static main([Ljava/lang/String;)V
    .limit locals 1
    .limit stack 2

    getstatic   java/lang/System/out Ljava/io/PrintStream;
    aload_0
    arraylength
    invokevirtual   java/io/PrintStream.println(I)V
    return
.end method

My problem with this is:
1. aload_0 is supposed to push 'this' on to the stack (thats what the JVM spec seems to say)
2. arraylength is supposed to return the length of the array whose reference is on the top-of-stack

So according to me the combination of 1 & 2 should not even work.

How/why is it working? Or is the disassembler buggy and the actual bytecode is something else?


Source: (StackOverflow)

What does the `test` instruction do?

I'm looking at some small assembler codes and I'm having trouble understanding the TEST instruction and its use. I'm looking at the following code at the end of a loop:

8048531:    84 c0                   test   al,al
8048533:    75 dc                   jne    8048511 <function+0x2d>

The way i understand TEST is that it works a bit like the AND operator and it sets some flags. I guess I don't really understand how the flags work. test al,al to me looks like it checks the same lower bits and will always get the same results.

Can someone explain?


Source: (StackOverflow)

Is there a list of deprecated x86 instructions?

I'm taking an x86 assembly language programming class and know that certain instructions shouldn't be used anymore -- because they're slow on modern processors; for example, the loop instruction.

I haven't been able to find any list of instructions that are considered deprecated and should be avoided; any guidance would be appreciated.


Source: (StackOverflow)

How exactly do executables work?

I know that executables contain instructions, but what exactly are these instructions? If I want to call the MessageBox API function for example, what does the instruction look like?

Thanks.


Source: (StackOverflow)