jit

Why are operators so much slower than method calls? (structs are slower only on older JITs)

Intro: I write high-performance code in C#. Yes, I know C++ would give me better optimization, but I still choose to use C#. I do not wish to debate that choice. Rather, I'd like to hear from those who, like me, are trying to write high-performance code on the .NET Framework.

Questions:

  • Why is the operator in the code below slower than the equivalent method call?
  • Why is the method passing two doubles in the code below faster than the equivalent method passing a struct that has two doubles inside? (A: older JITs optimize structs poorly)
  • Is there a way to get the .NET JIT Compiler to treat simple structs as efficiently as the members of the struct? (A: get newer JIT)

What I think I know: The original .NET JIT compiler would not inline anything that involved a struct. Bizarre, given that structs should only be used where you need small value types that should be optimized like built-ins, but true. Fortunately, in .NET 3.5 SP1 and .NET 2.0 SP2, they made some improvements to the JIT optimizer, including improvements to inlining, particularly for structs. (I am guessing they did that because otherwise the new Complex struct that they were introducing would have performed horribly... so the Complex team was probably pounding on the JIT optimizer team.) So, any documentation prior to .NET 3.5 SP1 is probably not too relevant to this issue.

What my testing shows: I have verified that I do have the newer JIT optimizer by checking that the C:\Windows\Microsoft.NET\Framework\v2.0.50727\mscorwks.dll file has version >= 3053, and so should have those improvements. However, even with that, my timings and my look at the disassembly both show:

The JIT-produced code for passing a struct with two doubles is far less efficient than code that directly passes the two doubles.

The JIT-produced code for a struct method passes in 'this' far more efficiently than if you passed a struct as an argument.

The JIT still inlines better if you pass two doubles rather than a struct containing two doubles, even with the multiplier that comes from the call clearly sitting inside a loop.

The Timings: Actually, looking at the disassembly I realize that most of the time in the loops is just spent accessing the test data out of the List. The four ways of making the same call differ dramatically once you factor out the overhead of the loop and of accessing the data. I get anywhere from 5x to 20x speedups for doing PlusEqual(double, double) instead of PlusEqual(Element). And 10x to 40x for doing PlusEqual(double, double) instead of operator +=. Wow. Sad.

Here's one set of timings:

Populating List<Element> took 320ms.
The PlusEqual() method took 105ms.
The 'same' += operator took 131ms.
The 'same' -= operator took 139ms.
The PlusEqual(double, double) method took 68ms.
The do nothing loop took 66ms.
The ratio of operator with constructor to method is 124%.
The ratio of operator without constructor to method is 132%.
The ratio of PlusEqual(double,double) to PlusEqual(Element) is 64%.
If we remove the overhead time for the loop accessing the elements from the List...
The ratio of operator with constructor to method is 166%.
The ratio of operator without constructor to method is 187%.
The ratio of PlusEqual(double,double) to PlusEqual(Element) is 5%.
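
(For clarity, the last three ratios are derived by subtracting the 66ms do-nothing loop from each timing: 131-66 = 65ms, 139-66 = 73ms, 105-66 = 39ms and 68-66 = 2ms of "real" work, giving 100*65/39 ≈ 166%, 100*73/39 ≈ 187% and 100*2/39 ≈ 5%.)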

The Code:

namespace OperatorVsMethod
{
  public struct Element
  {
    public double Left;
    public double Right;

    public Element(double left, double right)
    {
      this.Left = left;
      this.Right = right;
    }

    public static Element operator +(Element x, Element y)
    {
      return new Element(x.Left + y.Left, x.Right + y.Right);
    }

    public static Element operator -(Element x, Element y)
    {
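      // Note: despite its name, this operator intentionally performs addition;
      // the benchmark below uses it as "+= without calling the constructor".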
      x.Left += y.Left;
      x.Right += y.Right;
      return x;
    }    

    /// <summary>
    /// Like the += operator; but faster.
    /// </summary>
    public void PlusEqual(Element that)
    {
      this.Left += that.Left;
      this.Right += that.Right;
    }    

    /// <summary>
    /// Like the += operator; but faster.
    /// </summary>
    public void PlusEqual(double thatLeft, double thatRight)
    {
      this.Left += thatLeft;
      this.Right += thatRight;
    }    
  }    

  [TestClass]
  public class UnitTest1
  {
    [TestMethod]
    public void TestMethod1()
    {
      Stopwatch stopwatch = new Stopwatch();

      // Populate a List of Elements to multiply together
      int seedSize = 4;
      List<double> doubles = new List<double>(seedSize);
      doubles.Add(2.5d);
      doubles.Add(100000d);
      doubles.Add(-0.5d);
      doubles.Add(-100002d);

      int size = 2500000 * seedSize;
      List<Element> elts = new List<Element>(size);

      stopwatch.Reset();
      stopwatch.Start();
      for (int ii = 0; ii < size; ++ii)
      {
        int di = ii % seedSize;
        double d = doubles[di];
        elts.Add(new Element(d, d));
      }
      stopwatch.Stop();
      long populateMS = stopwatch.ElapsedMilliseconds;

      // Measure speed of += operator (calls ctor)
      Element operatorCtorResult = new Element(1d, 1d);
      stopwatch.Reset();
      stopwatch.Start();
      for (int ii = 0; ii < size; ++ii)
      {
        operatorCtorResult += elts[ii];
      }
      stopwatch.Stop();
      long operatorCtorMS = stopwatch.ElapsedMilliseconds;

      // Measure speed of -= operator (+= without ctor)
      Element operatorNoCtorResult = new Element(1d, 1d);
      stopwatch.Reset();
      stopwatch.Start();
      for (int ii = 0; ii < size; ++ii)
      {
        operatorNoCtorResult -= elts[ii];
      }
      stopwatch.Stop();
      long operatorNoCtorMS = stopwatch.ElapsedMilliseconds;

      // Measure speed of PlusEqual(Element) method
      Element plusEqualResult = new Element(1d, 1d);
      stopwatch.Reset();
      stopwatch.Start();
      for (int ii = 0; ii < size; ++ii)
      {
        plusEqualResult.PlusEqual(elts[ii]);
      }
      stopwatch.Stop();
      long plusEqualMS = stopwatch.ElapsedMilliseconds;

      // Measure speed of PlusEqual(double, double) method
      Element plusEqualDDResult = new Element(1d, 1d);
      stopwatch.Reset();
      stopwatch.Start();
      for (int ii = 0; ii < size; ++ii)
      {
        Element elt = elts[ii];
        plusEqualDDResult.PlusEqual(elt.Left, elt.Right);
      }
      stopwatch.Stop();
      long plusEqualDDMS = stopwatch.ElapsedMilliseconds;

      // Measure speed of doing nothing but accessing the Element
      Element doNothingResult = new Element(1d, 1d);
      stopwatch.Reset();
      stopwatch.Start();
      for (int ii = 0; ii < size; ++ii)
      {
        Element elt = elts[ii];
        double left = elt.Left;
        double right = elt.Right;
      }
      stopwatch.Stop();
      long doNothingMS = stopwatch.ElapsedMilliseconds;

      // Report results
      Assert.AreEqual(1d, operatorCtorResult.Left, "The operator += did not compute the right result!");
      Assert.AreEqual(1d, operatorNoCtorResult.Left, "The operator += did not compute the right result!");
      Assert.AreEqual(1d, plusEqualResult.Left, "The operator += did not compute the right result!");
      Assert.AreEqual(1d, plusEqualDDResult.Left, "The operator += did not compute the right result!");
      Assert.AreEqual(1d, doNothingResult.Left, "The operator += did not compute the right result!");

      // Report speeds
      Console.WriteLine("Populating List<Element> took {0}ms.", populateMS);
      Console.WriteLine("The PlusEqual() method took {0}ms.", plusEqualMS);
      Console.WriteLine("The 'same' += operator took {0}ms.", operatorCtorMS);
      Console.WriteLine("The 'same' -= operator took {0}ms.", operatorNoCtorMS);
      Console.WriteLine("The PlusEqual(double, double) method took {0}ms.", plusEqualDDMS);
      Console.WriteLine("The do nothing loop took {0}ms.", doNothingMS);

      // Compare speeds
      long percentageRatio = 100L * operatorCtorMS / plusEqualMS;
      Console.WriteLine("The ratio of operator with constructor to method is {0}%.", percentageRatio);
      percentageRatio = 100L * operatorNoCtorMS / plusEqualMS;
      Console.WriteLine("The ratio of operator without constructor to method is {0}%.", percentageRatio);
      percentageRatio = 100L * plusEqualDDMS / plusEqualMS;
      Console.WriteLine("The ratio of PlusEqual(double,double) to PlusEqual(Element) is {0}%.", percentageRatio);

      operatorCtorMS -= doNothingMS;
      operatorNoCtorMS -= doNothingMS;
      plusEqualMS -= doNothingMS;
      plusEqualDDMS -= doNothingMS;
      Console.WriteLine("If we remove the overhead time for the loop accessing the elements from the List...");
      percentageRatio = 100L * operatorCtorMS / plusEqualMS;
      Console.WriteLine("The ratio of operator with constructor to method is {0}%.", percentageRatio);
      percentageRatio = 100L * operatorNoCtorMS / plusEqualMS;
      Console.WriteLine("The ratio of operator without constructor to method is {0}%.", percentageRatio);
      percentageRatio = 100L * plusEqualDDMS / plusEqualMS;
      Console.WriteLine("The ratio of PlusEqual(double,double) to PlusEqual(Element) is {0}%.", percentageRatio);
    }
  }
}

The disassembly: (i.e., what the JIT turns some of the above into)

public void PlusEqual(Element that)
    {
00000000 push    ebp 
00000001 mov     ebp,esp 
00000003 push    edi 
00000004 push    esi 
00000005 push    ebx 
00000006 sub     esp,30h 
00000009 xor     eax,eax 
0000000b mov     dword ptr [ebp-10h],eax 
0000000e xor     eax,eax 
00000010 mov     dword ptr [ebp-1Ch],eax 
00000013 mov     dword ptr [ebp-3Ch],ecx 
00000016 cmp     dword ptr ds:[04C87B7Ch],0 
0000001d je     00000024 
0000001f call    753081B1 
00000024 nop       
      this.Left += that.Left;
00000025 mov     eax,dword ptr [ebp-3Ch] 
00000028 fld     qword ptr [ebp+8] 
0000002b fadd    qword ptr [eax] 
0000002d fstp    qword ptr [eax] 
      this.Right += that.Right;
0000002f mov     eax,dword ptr [ebp-3Ch] 
00000032 fld     qword ptr [ebp+10h] 
00000035 fadd    qword ptr [eax+8] 
00000038 fstp    qword ptr [eax+8] 
    }
0000003b nop       
0000003c lea     esp,[ebp-0Ch] 
0000003f pop     ebx 
00000040 pop     esi 
00000041 pop     edi 
00000042 pop     ebp 
00000043 ret     10h 
 public void PlusEqual(double thatLeft, double thatRight)
    {
00000000 push    ebp 
00000001 mov     ebp,esp 
00000003 push    edi 
00000004 push    esi 
00000005 push    ebx 
00000006 sub     esp,30h 
00000009 xor     eax,eax 
0000000b mov     dword ptr [ebp-10h],eax 
0000000e xor     eax,eax 
00000010 mov     dword ptr [ebp-1Ch],eax 
00000013 mov     dword ptr [ebp-3Ch],ecx 
00000016 cmp     dword ptr ds:[04C87B7Ch],0 
0000001d je     00000024 
0000001f call    75308159 
00000024 nop       
      this.Left += thatLeft;
00000025 mov     eax,dword ptr [ebp-3Ch] 
00000028 fld     qword ptr [ebp+10h] 
0000002b fadd    qword ptr [eax] 
0000002d fstp    qword ptr [eax] 
      this.Right += thatRight;
0000002f mov     eax,dword ptr [ebp-3Ch] 
00000032 fld     qword ptr [ebp+8] 
00000035 fadd    qword ptr [eax+8] 
00000038 fstp    qword ptr [eax+8] 
    }
0000003b nop       
0000003c lea     esp,[ebp-0Ch] 
0000003f pop     ebx 
00000040 pop     esi 
00000041 pop     edi 
00000042 pop     ebp 
00000043 ret     10h 

Source: (StackOverflow)

Weird performance increase in simple benchmark

Yesterday I found an article by Christoph Nahr titled ".NET Struct Performance" which benchmarked several languages (C++, C#, Java, JavaScript) for a method which adds two point structs (double tuples).

As it turned out, the C++ version takes about 1000ms to execute (1e9 iterations), while C# cannot get under ~3000ms on the same machine (and performs even worse on x64).

To test it myself, I took the C# code (simplified slightly to call only the method where parameters are passed by value) and ran it on an i7-3610QM machine (3.1 GHz single-core boost), 8 GB RAM, Win8.1, using .NET 4.5.2, RELEASE build, 32-bit (x86 WoW64, since my OS is 64-bit). This is the simplified version:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}

With Point defined as simply:

public struct Point 
{
    private readonly double _x, _y;

    public Point(double x, double y) { _x = x; _y = y; }

    public double X { get { return _x; } }

    public double Y { get { return _y; } }
}

Running it produces results similar to those in the article:

Result: x=1000000001 y=1000000001, Time elapsed: 3159 ms

First strange observation

Since the method should be inlined, I wondered how the code would perform if I removed the structs altogether and simply inlined the whole computation:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    public static void Main()
    {
        // not using structs at all here
        double ax = 1, ay = 1, bx = 1, by = 1;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
        {
            ax = ax + by;
            ay = ay + bx;
        }
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
            ax, ay, sw.ElapsedMilliseconds);
    }
}

And I got practically the same result (actually 1% slower after several retries), meaning that the JITter seems to be doing a good job of optimizing away all the function calls:

Result: x=1000000001 y=1000000001, Time elapsed: 3200 ms

It also means that the benchmark doesn't seem to measure struct performance at all; it actually only seems to measure basic double arithmetic (after everything else gets optimized away).

The weird stuff

Now comes the weird part. If I merely add another stopwatch outside the loop (yes, I narrowed it down to this crazy step after several retries), the code runs three times faster:

public static void Main()
{
    var outerSw = Stopwatch.StartNew();     // <-- added

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    outerSw.Stop();                         // <-- added
}

Result: x=1000000001 y=1000000001, Time elapsed: 961 ms

That's ridiculous! And it's not like Stopwatch is giving me wrong results because I can clearly see that it ends after a single second.

Can anyone tell me what might be happening here?

(Update)

I also presumed it has something to do with JITting, but repeating the test several times (without the outer stopwatch) is slow in all cases:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Test();
        Test();
        Test();
    }

    private static void Test()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
           a.X, a.Y, sw.ElapsedMilliseconds);
    }
}

Output:

Result: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Result: x=1000000001 y=1000000001, Time elapsed: 3249 ms
Result: x=1000000001 y=1000000001, Time elapsed: 3250 ms

Again, with the outer stopwatch added, we get the magical boost:

    public static void Main()
    {
        Test();
        Test();
        Test();
    }

    private static void Test()
    {
        var outerSw = Stopwatch.StartNew();

        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", a.X, a.Y, sw.ElapsedMilliseconds);

        outerSw.Stop();
    }

Result: x=1000000001 y=1000000001, Time elapsed: 979 ms
Result: x=1000000001 y=1000000001, Time elapsed: 982 ms
Result: x=1000000001 y=1000000001, Time elapsed: 978 ms

(Update 2)

Here are two methods in the same program, which shows that the reason is not JITting:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Test1();
        Test2();

        Console.WriteLine();

        Test1();
        Test2();
    }

    private static void Test1()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    private static void Test2()
    {
        var swOuter = Stopwatch.StartNew();

        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);

        swOuter.Stop();
    }
}

Output:

Test1: x=1000000001 y=1000000001, Time elapsed: 3242 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 974 ms

Test1: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 972 ms

(Update 3)

Here is a pastebin. You need to run it as a 32-bit release on .NET 4.x (there are a couple of checks in the code to ensure this).
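
The guard checks meant here might look roughly like the following (a hypothetical sketch, not necessarily the pastebin's actual code):

// Assumes: using System; using System.Diagnostics;
// Abort unless this is a 32-bit, JIT-optimized (Release) build running on .NET 4.x.
if (IntPtr.Size != 4)
    throw new InvalidOperationException("Run this as a 32-bit (x86) process.");

if (Environment.Version.Major < 4)
    throw new InvalidOperationException("Run this on .NET 4.x.");

var dbg = (DebuggableAttribute)Attribute.GetCustomAttribute(
    typeof(CSharpTest).Assembly, typeof(DebuggableAttribute));
if (dbg != null && dbg.IsJITOptimizerDisabled)
    throw new InvalidOperationException("Run an optimized (Release) build.");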

(Solution, perhaps?) (read @HansPassant's answer for details)

According to @Hans, the slowdown is happening due to 64-bit variables not always being aligned in 32-bit apps. If I add a 32-bit variable to one of the test methods, it will perform as fast as the C++ version:

private static void Test3()
{
    var magical_speed_booster = "whatever";

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    GC.KeepAlive(magical_speed_booster);
}

(Update 4)

Following @usr's comments on @Hans' answer, I checked the optimized disassembly for both methods, and they are rather different:

(Screenshot: disassembly of Test1 on the left, Test2 on the right.)

This seems to show that the difference might be due to the compiler acting funny in the first case, rather than double field alignment?

Also, if I add two variables (a total offset of 8 bytes), I still get the same speed boost, so it no longer seems related to field alignment:

// this is still fast?
private static void Test3()
{
    var magical_speed_booster_1 = "whatever";
    var magical_speed_booster_2 = "whatever";

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    GC.KeepAlive(magical_speed_booster_1);
    GC.KeepAlive(magical_speed_booster_2);
}

Source: (StackOverflow)

Why doesn't the JVM cache JIT compiled code?

The canonical JVM implementation from Sun applies some pretty sophisticated optimization to bytecode to obtain near-native execution speeds after the code has been run a few times. The question is: why isn't this compiled code cached to disk for use during subsequent uses of the same function/class? As it stands, every time a program is executed, the JIT compiler kicks in afresh, rather than using a pre-compiled version of the code. Wouldn't adding this feature add a significant boost to the initial run time of the program, when the bytecode is essentially being interpreted?


Source: (StackOverflow)

Why does a recursive call cause StackOverflow at different stack depths?

I was trying to figure out hands-on how tail calls are handled by the C# compiler.

(Answer: They're not. But the 64bit JIT(s) WILL do TCE (tail call elimination). Restrictions apply.)

So I wrote a small test using a recursive call that prints how many times it gets called before the StackOverflowException kills the process.

using System;

class Program
{
    static void Main(string[] args)
    {
        Rec();
    }

    static int sz = 0;
    static Random r = new Random();
    static void Rec()
    {
        sz++;

        //uncomment for faster, more imprecise runs
        //if (sz % 100 == 0)
        {
            //some code to keep this method from being inlined
            var zz = r.Next();  
            Console.Write("{0} Random: {1}\r", sz, zz);
        }

        //uncommenting this stops TCE from happening
        //else
        //{
        //    Console.Write("{0}\r", sz);
        //}

        Rec();
    }
}

Right on cue, the program ends with a StackOverflowException under any of:

  • 'Optimize build' OFF (either Debug or Release)
  • Target: x86
  • Target: AnyCPU + "Prefer 32 bit" (this is new in VS 2012 and the first time I saw it. More here.)
  • Some seemingly innocuous branch in the code (see commented 'else' branch).

Conversely, with 'Optimize build' ON and Target = x64 (or AnyCPU with 'Prefer 32bit' OFF) on a 64-bit CPU, TCE happens and the counter keeps spinning up forever (OK, it arguably spins down each time its value overflows).

But I noticed a behaviour I can't explain in the StackOverflowException case: it never (?) happens at exactly the same stack depth. Here are the outputs of a few 32-bit runs, Release build:

51600 Random: 1778264579
Process is terminated due to StackOverflowException.

51599 Random: 1515673450
Process is terminated due to StackOverflowException.

51602 Random: 1567871768
Process is terminated due to StackOverflowException.

51535 Random: 2760045665
Process is terminated due to StackOverflowException.

And Debug build:

28641 Random: 4435795885
Process is terminated due to StackOverflowException.

28641 Random: 4873901326  //never say never
Process is terminated due to StackOverflowException.

28623 Random: 7255802746
Process is terminated due to StackOverflowException.

28669 Random: 1613806023
Process is terminated due to StackOverflowException.

The stack size is constant (defaults to 1 MB). The stack frames' sizes are constant.

So then, what can account for the (sometimes non-trivial) variation of stack depth when the StackOverflowException hits?

UPDATE

Hans Passant raises the issue of Console.WriteLine touching P/Invoke, interop and possibly non-deterministic locking.

So I simplified the code to this:

class Program
{
    static void Main(string[] args)
    {
        Rec();
    }
    static int sz = 0;
    static void Rec()
    {
        sz++;
        Rec();
    }
}

I ran it in Release/32bit/Optimization ON without a debugger. When the program crashes, I attach the debugger and check the value of the counter.

And it still isn't the same on several runs. (Or my test is flawed.)

UPDATE: Closure

As suggested by fejesjoco, I looked into ASLR (Address space layout randomization).

It's a security technique that makes it hard for buffer overflow attacks to find the precise location of (e.g.) specific system calls, by randomizing various things in the process address space, including the stack position and, apparently, its size.
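
As an aside, a minimal way to observe the stack randomization directly is to print the address of a stack local and compare several runs (a sketch; it has to be compiled with /unsafe):

// With ASLR enabled, the printed address changes from run to run;
// with ASLR disabled system-wide it stays the same.
unsafe static void PrintStackAddress()
{
    int local = 0;
    Console.WriteLine("Stack local at 0x{0:X}", (ulong)&local);
}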

The theory sounds good. Let's put it into practice!

In order to test this, I used a Microsoft tool made specifically for the task: EMET, the Enhanced Mitigation Experience Toolkit. It allows setting the ASLR flag (and a lot more) at the system or process level.
(There is also a system-wide, registry hacking alternative that I didn't try)

(Screenshot: the EMET GUI.)

While verifying the effectiveness of the tool, I also discovered that Process Explorer duly reports the status of the ASLR flag in the 'Properties' page of the process. Never saw that until today :)

(Screenshot: Process Explorer showing the process's ASLR flag.)

Theoretically, EMET can (re)set the ASLR flag for a single process. In practice, it didn't seem to change anything (see above image).

However, I disabled ASLR for the entire system and (one reboot later) I could finally verify that indeed, the SO exception now always happens at the same stack depth.

BONUS

ASLR-related, in older news: How Chrome got pwned


Source: (StackOverflow)

What does a just-in-time (JIT) compiler do?

What does a JIT compiler specifically do, as opposed to a non-JIT compiler? Can someone give a succinct and easy-to-understand description?


Source: (StackOverflow)

Why is Java faster when using a JIT vs. compiling to machine code?

I have heard that Java must use a JIT to be fast. This makes perfect sense when comparing to interpretation, but why can't someone make an ahead-of-time compiler that generates fast Java code? I know about gcj, but I don't think its output is typically faster than HotSpot, for example.

Are there things about the language that make this difficult? I think it comes down to just these things:

  • Reflection
  • Classloading

What am I missing? If I avoid these features, would it be possible to compile Java code once to native machine code and be done?
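
For context, the pattern those two features enable is loading and invoking code that is only chosen at run time, which a compile-once-ahead-of-time compiler cannot see. Here is a C# sketch of the equivalent .NET pattern (the Java Class.forName/reflection case is analogous; the assembly, type and method names are purely illustrative):

// Load an assembly picked at run time and call a method on it by name.
var asm = System.Reflection.Assembly.LoadFrom("plugin.dll");
var type = asm.GetType("Plugin.EntryPoint");
var instance = Activator.CreateInstance(type);
type.GetMethod("Run").Invoke(instance, null);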


Source: (StackOverflow)

Is there a runtime benefit to using const local variables?

Outside of ensuring that they cannot be changed (to the tune of a compiler error), does the JIT make any optimisations for const locals?

Eg.

public static void Main(string[] args)
{
    const int timesToLoop = 50;

    for (int i=0; i<timesToLoop; i++)
    {
        // ...
    }
}

Source: (StackOverflow)

Why shouldn't I use PyPy over CPython if PyPy is 6.3 times faster?

I've been hearing a lot about the PyPy project. On their site, they claim it is 6.3 times faster than the CPython interpreter.

Whenever we talk about dynamic languages like Python, speed is one of the top issues. To solve this, they say PyPy is 6.3 times faster.

The second issue is parallelism, the infamous Global Interpreter Lock (GIL). For this, PyPy says it can give GIL-less Python.

If PyPy can solve these great challenges, what are its weaknesses that are preventing wider adoption? That is to say, what's preventing someone like me, a typical Python developer, from switching to PyPy right now?


Source: (StackOverflow)

.NET JIT potential error?

The following code gives different output when the release build is run inside Visual Studio and when it is run outside Visual Studio. I'm using Visual Studio 2008 and targeting .NET 3.5. I've also tried .NET 3.5 SP1.

When running outside Visual Studio, the JIT should kick in. Either (a) there's something subtle going on with C# that I'm missing or (b) the JIT is actually in error. I'm doubtful that the JIT can go wrong, but I'm running out of other possibilities...

Output when running inside Visual Studio:

    0 0,
    0 1,
    1 0,
    1 1,

Output when running release outside of Visual Studio:

    0 2,
    0 2,
    1 2,
    1 2,

What is the reason?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Test
{
    struct IntVec
    {
        public int x;
        public int y;
    }

    interface IDoSomething
    {
        void Do(IntVec o);
    }

    class DoSomething : IDoSomething
    {
        public void Do(IntVec o)
        {
            Console.WriteLine(o.x.ToString() + " " + o.y.ToString()+",");
        }
    }

    class Program
    {
        static void Test(IDoSomething oDoesSomething)
        {
            IntVec oVec = new IntVec();
            for (oVec.x = 0; oVec.x < 2; oVec.x++)
            {
                for (oVec.y = 0; oVec.y < 2; oVec.y++)
                {
                    oDoesSomething.Do(oVec);
                }
            }
        }

        static void Main(string[] args)
        {
            Test(new DoSomething());
            Console.ReadLine();
        }
    }
}

Source: (StackOverflow)

What is the loop inversion technique?

I was going through a document which talks about just-in-time compiler (JIT) optimization techniques for Java. One of them was "loop inversion". And the document says:

You replace a regular while loop with a do-while loop. And the do-while loop is set within an if clause. This replacement leads to two fewer jumps.
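
For illustration, the transformation being described looks like this (shown in C# syntax; the shape is identical in Java, and the method and variable names are only placeholders):

// Regular while loop: every pass tests the condition at the top and then
// jumps back to it, i.e. one conditional jump plus one unconditional jump.
static int SumWhile(int[] data)
{
    int sum = 0;
    int i = 0;
    while (i < data.Length)
    {
        sum += data[i];
        i++;
    }
    return sum;
}

// Inverted form: one up-front test, then a do-while whose single conditional
// jump at the bottom is all each further iteration needs.
static int SumInverted(int[] data)
{
    int sum = 0;
    int i = 0;
    if (i < data.Length)
    {
        do
        {
            sum += data[i];
            i++;
        } while (i < data.Length);
    }
    return sum;
}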

How does loop inversion work and how does it optimize our code path?

N.B.: It would be great if somebody could explain with an example of Java code how the JIT optimizes it to native code, and why it is optimal on modern processors.


Source: (StackOverflow)

Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Let's say the bottleneck of my Java program really is some tight loops to compute a bunch of vector dot products. Yes I've profiled, yes it's the bottleneck, yes it's significant, yes that's just how the algorithm is, yes I've run Proguard to optimize the byte code, etc.

The work is, essentially, dot products. As in, I have two float[50] arrays and I need to compute the sum of pairwise products. I know processor instruction sets exist to perform these kinds of operations quickly and in bulk, like SSE or MMX.
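
For reference, the loop in question is essentially the following (written in C# syntax here; the Java version with float[] arrays reads line for line the same):

// Scalar dot product of two equal-length vectors: the loop shape a
// vectorizing JIT would have to turn into packed SSE instructions.
static float Dot(float[] a, float[] b)
{
    float sum = 0f;
    for (int i = 0; i < a.Length; i++)
        sum += a[i] * b[i];
    return sum;
}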

Yes I can probably access these by writing some native code in JNI. The JNI call turns out to be pretty expensive.

I know you can't guarantee what a JIT will or will not compile. Has anyone ever heard of a JIT generating code that uses these instructions? And if so, is there anything about the Java code that helps make it compilable this way?

Probably a "no"; worth asking.


Source: (StackOverflow)

JIT not optimizing loop that involves Integer.MAX_VALUE

While writing an answer to another question, I noticed a strange border case for JIT optimization.

The following program is not a "Microbenchmark" and not intended to reliably measure an execution time (as pointed out in the answers to the other question). It is solely intended as an MCVE to reproduce the issue:

class MissedLoopOptimization
{
    public static void main(String args[])
    {
        for (int j=0; j<3; j++)
        {
            for (int i=0; i<5; i++)
            {
                long before = System.nanoTime();
                runWithMaxValue();
                long after = System.nanoTime();
                System.out.println("With MAX_VALUE   : "+(after-before)/1e6);
            }
            for (int i=0; i<5; i++)
            {
                long before = System.nanoTime();
                runWithMaxValueMinusOne();
                long after = System.nanoTime();
                System.out.println("With MAX_VALUE-1 : "+(after-before)/1e6);
            }
        }
    }

    private static void runWithMaxValue()
    {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (i++ < n) {}
    }

    private static void runWithMaxValueMinusOne()
    {
        final int n = Integer.MAX_VALUE-1;
        int i = 0;
        while (i++ < n) {}
    }
}

It basically runs the same loop, while (i++ < n){}, where the limit n is once set to Integer.MAX_VALUE, and once to Integer.MAX_VALUE-1.

When executing this on Win7/64 with JDK 1.7.0_21 and

java -server MissedLoopOptimization

the timing results are as follows:

...
With MAX_VALUE   : 1285.227081
With MAX_VALUE   : 1274.36311
With MAX_VALUE   : 1282.992203
With MAX_VALUE   : 1292.88246
With MAX_VALUE   : 1280.788994
With MAX_VALUE-1 : 6.96E-4
With MAX_VALUE-1 : 3.48E-4
With MAX_VALUE-1 : 0.0
With MAX_VALUE-1 : 0.0
With MAX_VALUE-1 : 3.48E-4

Obviously, for the case of MAX_VALUE-1, the JIT does what one could expect: It detects that the loop is useless, and completely eliminates it. However, it does not remove the loop when it is running up to MAX_VALUE.

This observation is confirmed by a look at the JIT assembly output when starting with

java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly MissedLoopOptimization

The log contains the following assembly for the method that runs up to MAX_VALUE:

Decoding compiled method 0x000000000254fa10:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} 'runWithMaxValue' '()V' in 'MissedLoopOptimization'
  #           [sp+0x20]  (sp of caller)
  0x000000000254fb40: sub    $0x18,%rsp
  0x000000000254fb47: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - MissedLoopOptimization::runWithMaxValue@-1 (line 29)
  0x000000000254fb4c: mov    $0x1,%r11d
  0x000000000254fb52: jmp    0x000000000254fb63
  0x000000000254fb54: nopl   0x0(%rax,%rax,1)
  0x000000000254fb5c: data32 data32 xchg %ax,%ax
  0x000000000254fb60: inc    %r11d              ; OopMap{off=35}
                                                ;*goto
                                                ; - MissedLoopOptimization::runWithMaxValue@11 (line 30)
  0x000000000254fb63: test   %eax,-0x241fb69(%rip)        # 0x0000000000130000
                                                ;*goto
                                                ; - MissedLoopOptimization::runWithMaxValue@11 (line 30)
                                                ;   {poll}
  0x000000000254fb69: cmp    $0x7fffffff,%r11d
  0x000000000254fb70: jl     0x000000000254fb60  ;*if_icmpge
                                                ; - MissedLoopOptimization::runWithMaxValue@8 (line 30)
  0x000000000254fb72: add    $0x10,%rsp
  0x000000000254fb76: pop    %rbp
  0x000000000254fb77: test   %eax,-0x241fb7d(%rip)        # 0x0000000000130000
                                                ;   {poll_return}
  0x000000000254fb7d: retq   
  0x000000000254fb7e: hlt    
  0x000000000254fb7f: hlt    
[Exception Handler]
[Stub Code]
  0x000000000254fb80: jmpq   0x000000000254e820  ;   {no_reloc}
[Deopt Handler Code]
  0x000000000254fb85: callq  0x000000000254fb8a
  0x000000000254fb8a: subq   $0x5,(%rsp)
  0x000000000254fb8f: jmpq   0x0000000002528d00  ;   {runtime_call}
  0x000000000254fb94: hlt    
  0x000000000254fb95: hlt    
  0x000000000254fb96: hlt    
  0x000000000254fb97: hlt    

One can clearly see the loop, with the comparison to 0x7fffffff and the jump back to inc. In contrast, here is the assembly for the case where it runs up to MAX_VALUE-1:

Decoding compiled method 0x000000000254f650:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} 'runWithMaxValueMinusOne' '()V' in 'MissedLoopOptimization'
  #           [sp+0x20]  (sp of caller)
  0x000000000254f780: sub    $0x18,%rsp
  0x000000000254f787: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - MissedLoopOptimization::runWithMaxValueMinusOne@-1 (line 36)
  0x000000000254f78c: add    $0x10,%rsp
  0x000000000254f790: pop    %rbp
  0x000000000254f791: test   %eax,-0x241f797(%rip)        # 0x0000000000130000
                                                ;   {poll_return}
  0x000000000254f797: retq   
  0x000000000254f798: hlt    
  0x000000000254f799: hlt    
  0x000000000254f79a: hlt    
  0x000000000254f79b: hlt    
  0x000000000254f79c: hlt    
  0x000000000254f79d: hlt    
  0x000000000254f79e: hlt    
  0x000000000254f79f: hlt    
[Exception Handler]
[Stub Code]
  0x000000000254f7a0: jmpq   0x000000000254e820  ;   {no_reloc}
[Deopt Handler Code]
  0x000000000254f7a5: callq  0x000000000254f7aa
  0x000000000254f7aa: subq   $0x5,(%rsp)
  0x000000000254f7af: jmpq   0x0000000002528d00  ;   {runtime_call}
  0x000000000254f7b4: hlt    
  0x000000000254f7b5: hlt    
  0x000000000254f7b6: hlt    
  0x000000000254f7b7: hlt    

So my question is: What is so special about Integer.MAX_VALUE that prevents the JIT from optimizing it in the same way as it does for Integer.MAX_VALUE-1? My guess would be that it has to do with the cmp instruction, which is intended for signed arithmetic, but that alone is not really a convincing reason. Can anybody explain this, and maybe even give a pointer to the OpenJDK HotSpot code where this case is treated?

(An aside: I hope that the answer will also explain the different behavior between i++ and ++i that was asked for in the other question, assuming that the reason for the missing optimization is (obviously) actually caused by the Integer.MAX_VALUE loop limit)


Source: (StackOverflow)

Do redundant casts get optimized?

I am updating some old code, and have found several instances where the same object is being cast repeatedly each time one of its properties or methods needs to be called. Example:

if (recDate != null && recDate > ((System.Windows.Forms.DateTimePicker)ctrl).MinDate)
{
    ((System.Windows.Forms.DateTimePicker)ctrl).CustomFormat = "MM/dd/yyyy";
    ((System.Windows.Forms.DateTimePicker)ctrl).Value = recDate;
}
else
{
    ((System.Windows.Forms.DateTimePicker)ctrl).CustomFormat = " ";
}
((System.Windows.Forms.DateTimePicker)ctrl).Format = DateTimePickerFormat.Custom;

My inclination is to fix this monstrosity, but given my limited time I don't want to bother with anything that's not affecting functionality or performance.
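
(For reference, the kind of fix meant here is simply to cast once into a local and reuse it; a minimal sketch, reusing the names from the snippet above:)

var picker = (System.Windows.Forms.DateTimePicker)ctrl;  // cast once
if (recDate != null && recDate > picker.MinDate)
{
    picker.CustomFormat = "MM/dd/yyyy";
    picker.Value = recDate;
}
else
{
    picker.CustomFormat = " ";
}
picker.Format = DateTimePickerFormat.Custom;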

So what I'm wondering is, are these redundant casts getting optimized away by the compiler? I tried figuring it out myself by using ildasm on a simplified example, but not being familiar with IL I only ended up more confused.

UPDATE

So far, the consensus seems to be that a) no, the casts are not optimized, but b) while there may possibly be some small performance hit as a result, it is not likely significant, and c) I should consider fixing them anyway. I have come down on the side of resolving to fix these someday, if I have time. Meanwhile, I won't worry about them.

Thanks everyone!


Source: (StackOverflow)

Does the .NET CLR Really Optimize for the Current Processor

When I read about the performance of JITted languages like C# or Java, authors usually say that they should/could theoretically outperform many native-compiled applications. The theory being that native applications are usually just compiled for a processor family (like x86), so the compiler cannot make certain optimizations as they may not truly be optimizations on all processors. On the other hand, the CLR can make processor-specific optimizations during the JIT process.

Does anyone know if Microsoft's (or Mono's) CLR actually performs processor-specific optimizations during the JIT process? If so, what kind of optimizations?


Source: (StackOverflow)

Why does adding local variables make .NET code slower

Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?

int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
    var isMultipleOf16 = i % 16 == 0;
    count += isMultipleOf16 ? 1 : 0;
    //count += i % 16 == 0 ? 1 : 0;
}

Behind the timing is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: Repro project, useful for verifying project settings.]

Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980's era Dragon Book. How did the optimization not get applied when generating the CIL or JITing the x64 machine code?

Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.

Update: Speed differences also occur for x86, but depend on the order in which methods are just-in-time compiled. See Why does JIT order affect performance?

Assembly code (as requested):

    var isMultipleOf16 = i % 16 == 0;
00000037  mov         eax,edx 
00000039  and         eax,0Fh 
0000003c  xor         ecx,ecx 
0000003e  test        eax,eax 
00000040  sete        cl 
    count += isMultipleOf16 ? 1 : 0;
00000043  movzx       eax,cl 
00000046  test        eax,eax 
00000048  jne         0000000000000050 
0000004a  xor         eax,eax 
0000004c  jmp         0000000000000055 
0000004e  xchg        ax,ax 
00000050  mov         eax,1 
00000055  lea         r8d,[rbx+rax] 
    count += i % 16 == 0 ? 1 : 0;
00000037  mov         eax,ecx 
00000039  and         eax,0Fh 
0000003c  je          0000000000000042 
0000003e  xor         eax,eax 
00000040  jmp         0000000000000047 
00000042  mov         eax,1 
00000047  lea         edx,[rbx+rax] 

Source: (StackOverflow)