Instructions
Create walkthroughs and guided tours (using coach marks) in a simple way, using Swift.
I have been developing websites for quite some time, but I am not very good at designing them. My boss is referring me to take some lessons on design, but I really want to stick to development rather than design.
Source: (StackOverflow)
I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to see (i.e. get a trace of) the actual dynamic reordering done for a given program.
I want to give an input program and see the out-of-order instruction execution trace of my program.
I have access to an IBM POWER7 machine and an Intel Core 2 Duo laptop. Please also tell me if there is an easier alternative.
Source: (StackOverflow)
I want to figure out which instructions are executed differently when a command-line flag is passed to a program on Windows, of which I have the compiled (and optimized) binary, with no debug symbols or anything of the sort. I know the difference will not be more than a handful of instructions.
How would I go about figuring this out? Are there any techniques for logging exactly which instructions a program executes over a certain period of time?
(Note that this does not involve any system calls, just a flag being set in a loop because of the command-line flag.)
Source: (StackOverflow)
I am confused about where to use cmov instructions and where to use jump instructions in assembly. From a performance point of view, what is the difference between them? Which one is better? If possible, please explain the difference with an example.
Thanks in advance.
Source: (StackOverflow)
What is the function of the 0x10 in this LEAL instruction? Is it a multiply, an addition, or something else?
leal 0x10(%ebx), %eax
Can someone please clarify? This is x86 assembler on a Linux box.
Source: (StackOverflow)
Do you know this?
Well, I want to create something like this screen. When the application is opened for the first time, I want to show this screen and display some context. How is this possible? I don't know what to search for to find this type of thing.
Source: (StackOverflow)
What IL instructions are not exposed by C#?
I'm referring to instructions like sizeof and cpblk; there's no class or language construct that executes these instructions (sizeof in C# is computed at compile time, not at runtime, AFAIK).
Others?
EDIT: The reason I'm asking this (and hopefully this will make my question a little more valid) is because I'm working on a small library which will provide the functionality of these instructions. sizeof and cpblk are already implemented - I wanted to know what others I may have missed before moving on.
EDIT2: Using Eric's answer, I've compiled a list of instructions:
- Break
- Jmp
- Calli
- Cpobj
- Ckfinite
- Prefix[1-7]
- Prefixref
- Endfilter
- Unaligned
- Tailcall
- Cpblk
- Initblk
There were a number of other instructions which were not included in the list, which I'm separating because they're basically shortcuts for other instructions (compressed to save time and space):
- Ldarg[0-3]
- Ldloc[0-3]
- Stloc[0-3]
- Ldc_[I4_[M1/S/0-8]/I8/R4/R8]
- Ldind_[I1/U1/I2/U2/I4/U4/I8/R4/R8]
- Stind_[I1/I2/I4/I8/R4/R8]
- Conv_[I1/I2/I4/I8/R4/R8/U4/U8/U2/U1]
- Conv_Ovf_[I1/I2/I4/I8/U1/U2/U4/U8]
- Conv_Ovf_[I1/I2/I4/I8/U1/U2/U4/U8]_Un
- Ldelem_[I1/I2/I4/I8/U1/U2/U4/R4/R8]
- Stelem_[I1/I2/I4/I8/R4/R8]
Source: (StackOverflow)
This question continues on my question here (on the advice of Mystical):
C code loop performance
Continuing on my question, when I use packed instructions instead of scalar instructions, the code using intrinsics looks very similar:
for(int i = 0; i < size; i += 16) {
    y1 = _mm_load_ps(&output[i]);
    …
    y4 = _mm_load_ps(&output[i+12]);

    for(k = 0; k < ksize; k++){
        for(l = 0; l < ksize; l++){
            w = _mm_set_ps1(weight[i+k+l]);

            x1 = _mm_load_ps(&input[i+k+l]);
            y1 = _mm_add_ps(y1, _mm_mul_ps(w, x1));
            …
            x4 = _mm_load_ps(&input[i+k+l+12]);
            y4 = _mm_add_ps(y4, _mm_mul_ps(w, x4));
        }
    }
    _mm_store_ps(&output[i], y1);
    …
    _mm_store_ps(&output[i+12], y4);
}
The measured performance of this kernel is about 5.6 FP operations per cycle, although I would expect it to be exactly 4x the performance of the scalar version, i.e. 4 × 1.6 = 6.4 FP ops per cycle.
Taking the move of the weight factor into account (thanks for pointing that out), the schedule looks like:
It looks like the schedule doesn't change, although there is an extra instruction after the movss operation that moves the scalar weight value to the XMM register and then uses shufps to copy this scalar value to the entire vector. It seems like the weight vector is ready to be used by the mulps in time, taking the switching latency from load to the floating-point domain into account, so this shouldn't incur any extra latency.
The movaps (aligned, packed move), addps and mulps instructions that are used in this kernel (checked with the assembly code) have the same latency and throughput as their scalar versions, so this shouldn't incur any extra latency either.
Does anybody have an idea where this extra cycle per 8 cycles is spent, assuming the maximum performance this kernel can get is 6.4 FP ops per cycle while it is running at 5.6 FP ops per cycle?
By the way here is what the actual assembly looks like:
…
Block x:
movapsx (%rax,%rcx,4), %xmm0
movapsx 0x10(%rax,%rcx,4), %xmm1
movapsx 0x20(%rax,%rcx,4), %xmm2
movapsx 0x30(%rax,%rcx,4), %xmm3
movssl (%rdx,%rcx,4), %xmm4
inc %rcx
shufps $0x0, %xmm4, %xmm4 {fill weight vector}
cmp $0x32, %rcx
mulps %xmm4, %xmm0
mulps %xmm4, %xmm1
mulps %xmm4, %xmm2
mulps %xmm3, %xmm4
addps %xmm0, %xmm5
addps %xmm1, %xmm6
addps %xmm2, %xmm7
addps %xmm4, %xmm8
jl 0x401ad6 <Block x>
…
Source: (StackOverflow)
I have a multiply-add kernel inside my application and I want to increase its performance.
I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:
for(int i = 0; i < iterations; i += 4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k = 0; k < ksize; k++){
        for(l = 0; l < ksize; l++){
            w = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            y1 = _mm_add_ss(y1, _mm_mul_ss(w, x1));
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
            y4 = _mm_add_ss(y4, _mm_mul_ss(w, x4));
        }
    }
    _mm_store_ss(&output[i], y1);
    _mm_store_ss(&output[i+1], y2);
    _mm_store_ss(&output[i+2], y3);
    _mm_store_ss(&output[i+3], y4);
}
I know I can use packed FP vectors to increase the performance, and I have already done so successfully, but I want to know why the scalar code isn't able to reach the processor's peak performance.
The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).
If I'm correct from studying the generated assembly code, the ideal schedule would look as follows, where the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for dependent instructions takes 2 cycles, the FP multiply takes 4 cycles and the FP add takes 3 cycles. (Note that the dependence from multiply to add doesn't incur any switch latency because the operations belong to the same domain.)
According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.
I am trying to either:
- get rid of this overhead, or
- explain where it comes from
Of course, cache misses and data misalignment can increase the latency of the move instructions, but are there any other factors that could play a role here, like register read stalls?
I hope my problem is clear, thanks in advance for your responses!
Update: The assembly of the inner-loop looks as follows:
...
Block 21:
movssl (%rsi,%rdi,4), %xmm4
movssl (%rcx,%rdi,4), %xmm0
movssl 0x4(%rcx,%rdi,4), %xmm1
movssl 0x8(%rcx,%rdi,4), %xmm2
movssl 0xc(%rcx,%rdi,4), %xmm3
inc %rdi
mulss %xmm4, %xmm0
cmp $0x32, %rdi
mulss %xmm4, %xmm1
mulss %xmm4, %xmm2
mulss %xmm3, %xmm4
addss %xmm0, %xmm5
addss %xmm1, %xmm6
addss %xmm2, %xmm7
addss %xmm4, %xmm8
jl 0x401b52 <Block 21>
...
Source: (StackOverflow)
I like examples, so I wrote a bit of self-modifying code in C...
#include <stdio.h>
#include <sys/mman.h> // linux

int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC,
                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // get executable memory
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register rax (000) which holds the return value
                       // according to linux x86_64 calling convention
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) are already set to 0 by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
    }
    putchar('\n');
    return 0;
}
...which works, apparently:
>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
But honestly, I didn't expect it to work at all. I expected the instruction containing c[2] = 0 to be cached upon the first call to c, after which all consecutive calls to c would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my CPU appears to be smarter than that.
I guess the CPU compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know if this behavior can be considered predictable (barring any differences in hardware and OS) and relied on?
(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)
Source: (StackOverflow)
I am trying to implement a subset of Java for an academic study. Well, I'm in the last stages (code generation) and I wrote a rather simple program to see how method arguments are handled:
class Main {
public static void main(String[] args) {
System.out.println(args.length);
}
}
Then I built it, and ran 'Main.class' through an online disassembler I found at:
http://www.cs.cornell.edu/People/egs/kimera/disassembler.html
I get the following implementation for the 'main' method:
(the disassembled output is in Jasmin)
.method public static main([Ljava/lang/String;)V
.limit locals 1
.limit stack 2
getstatic java/lang/System/out Ljava/io/PrintStream;
aload_0
arraylength
invokevirtual java/io/PrintStream.println(I)V
return
.end method
My problem with this is:
1. aload_0 is supposed to push 'this' onto the stack (that's what the JVM spec seems to say)
2. arraylength is supposed to return the length of the array whose reference is on the top of the stack
So according to me, the combination of 1 and 2 should not even work.
How/why is it working? Or is the disassembler buggy and the actual bytecode is something else?
Source: (StackOverflow)
I'm looking at some small assembler code and I'm having trouble understanding the TEST instruction and its use. I'm looking at the following code at the end of a loop:
8048531: 84 c0 test al,al
8048533: 75 dc jne 8048511 <function+0x2d>
The way I understand TEST is that it works a bit like the AND operator and sets some flags. I guess I don't really understand how the flags work. To me, test al,al looks like it checks the same lower bits and will always get the same result.
Can someone explain?
Source: (StackOverflow)
I'm taking an x86 assembly language programming class and know that certain instructions shouldn't be used anymore -- because they're slow on modern processors; for example, the loop instruction.
I haven't been able to find any list of instructions that are considered deprecated and should be avoided; any guidance would be appreciated.
Source: (StackOverflow)
I know that executables contain instructions, but what exactly are these instructions? If I want to call the MessageBox API function, for example, what does the instruction look like?
Thanks.
Source: (StackOverflow)