
performance interview questions

Top performance frequently asked interview questions

Href attribute for JavaScript links: "#" or "javascript:void(0)"?

The following are two methods of building a link that has the sole purpose of running JavaScript code. Which is better, in terms of functionality, page load speed, validation purposes, etc.?

function myJsFunc() {
    alert("myJsFunc");
}
<a href="#" onclick="myJsFunc();">Run JavaScript Code</a>

or

function myJsFunc() {
    alert("myJsFunc");
}
<a href="javascript:void(0)" onclick="myJsFunc();">Run JavaScript Code</a>


Source: (StackOverflow)

Why is processing a sorted array faster than an unsorted array?

Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was somewhat similar, but less extreme.


My first thought was that sorting brings the data into the cache, but my next thought was how silly that is, because the array was just generated.

  • What is going on?
  • Why is a sorted array faster than an unsorted array?
  • The code is summing up some independent terms, and the order should not matter.

Source: (StackOverflow)


Which is faster: while(1) or while(2)?

This was an interview question asked by a senior manager.

Which is faster?

while(1) {
    // Some code
}

or

while(2) {
    //Some code
}

I said that both have the same execution speed, since the expression inside the while just has to evaluate to true or false. In this case both evaluate to true, and there are no extra conditional instructions inside the while condition, so both will execute at the same speed; I would prefer while (1).

But the interviewer said confidently: "Check your basics. while(1) is faster than while(2)." (He was not testing my confidence)

Is this true?

See also: Is "for(;;)" faster than "while (TRUE)"? If not, why do people use it?


Source: (StackOverflow)

Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

I have taken Problem #12 from Project Euler as a programming exercise and as a way to compare my (surely not optimal) implementations in C, Python, Erlang, and Haskell. To get somewhat higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

The result is the following:

C:

lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys 0m0.000s

python:

lorenzo@enzo:~/erlang$ time ./euler12.py 
842161320

real    1m16.632s
user    1m16.370s
sys 0m0.250s

python with pypy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
842161320

real    0m13.082s
user    0m13.050s
sys 0m0.020s

erlang:

lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s

haskell:

lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 
842161320

real    2m37.326s
user    2m37.240s
sys 0m0.080s

Summary:

  • C: 100%
  • python: 692% (118% with pypy)
  • erlang: 436% (135% thanks to RichardC)
  • haskell: 1421%

I suppose that C has a big advantage, as it uses long for the calculations rather than arbitrary-length integers like the other three. It also doesn't need to load a runtime first (or do the others?).

Question 1: Do Erlang, Python, and Haskell lose speed because they use arbitrary-length integers, or don't they, as long as the values stay below MAXINT?

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes, or is it my implementation? (The latter is quite probable, as Haskell is a closed book to me.)

Question 3: Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

EDIT:

Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a. tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.


Source code used (C, Python, Erlang, and Haskell, in that order):

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)

-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)  
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

Source: (StackOverflow)

StringBuilder vs String concatenation in toString() in Java

Given the 2 toString() implementations below, which one is preferred:

public String toString(){
    return "{a:"+ a + ", b:" + b + ", c: " + c +"}";
}

or

public String toString(){
    StringBuilder sb = new StringBuilder(100);
    return sb.append("{a:").append(a)
          .append(", b:").append(b)
          .append(", c:").append(c)
          .append("}")
          .toString();
}

?

More importantly, given that we have only 3 properties it might not make a difference, but at what point would you switch from + concatenation to StringBuilder?


Source: (StackOverflow)

When should I use Cross Apply over Inner Join?

What is the main purpose of using CROSS APPLY?

I have read (vaguely, through posts on the Internet) that cross apply can be more efficient when selecting over large data sets if you are partitioning. (Paging comes to mind)

I also know that CROSS APPLY doesn't require a UDF as the right-table.

I could rewrite most of my INNER JOIN queries (one-to-many relationships) to use CROSS APPLY, but they always give me equivalent execution plans.

Can anyone give me a good example of when CROSS APPLY makes a difference in those cases where INNER JOIN will work as well?


Edit:

Here's a trivial example, where the execution plans are exactly the same. (Show me one where they differ and where cross apply is faster/more efficient)

create table Company (
    companyId int identity(1,1)
,   companyName varchar(100)
,   zipcode varchar(10) 
,   constraint PK_Company primary key (companyId)
)
GO

create table Person (
    personId int identity(1,1)
,   personName varchar(100)
,   companyId int
,   constraint FK_Person_CompanyId foreign key (companyId) references dbo.Company(companyId)
,   constraint PK_Person primary key (personId)
)
GO

insert Company
select 'ABC Company', '19808' union
select 'XYZ Company', '08534' union
select '123 Company', '10016'


insert Person
select 'Alan', 1 union
select 'Bobby', 1 union
select 'Chris', 1 union
select 'Xavier', 2 union
select 'Yoshi', 2 union
select 'Zambrano', 2 union
select 'Player 1', 3 union
select 'Player 2', 3 union
select 'Player 3', 3 


/* using CROSS APPLY */
select *
from Person p
cross apply (
    select *
    from Company c
    where p.companyid = c.companyId
) Czip

/* the equivalent query using INNER JOIN */
select *
from Person p
inner join Company c on p.companyid = c.companyId

Source: (StackOverflow)

Big O, how do you calculate/approximate it?

Most people with a degree in CS will certainly know what Big O stands for. It helps us measure how (in)efficient an algorithm really is, and if you know what complexity category the problem you are trying to solve lies in, you can figure out whether it is still possible to squeeze out that little extra performance. [1]

But I'm curious, how do you calculate or approximate the complexity of your algorithms?

[1] But as they say, don't overdo it: premature optimization is the root of all evil, and optimization without a justified cause deserves that name as well.
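
One practical complement to reasoning about the code by hand is an empirical doubling test: time the same routine at input sizes n and 2n and watch how the running time grows. The sketch below is only an illustration of that idea in Python, with a deliberately quadratic toy function; the function name and sizes are made up for the example.

import timeit

def quadratic_work(n):
    # Deliberately O(n^2): touch every pair of indices.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

# For O(n^2) the time should roughly quadruple each time n doubles;
# for O(n) it should roughly double, and for O(n log n) a bit more than double.
for n in (200, 400, 800):
    t = timeit.timeit(lambda: quadratic_work(n), number=10)
    print("n = %4d  time = %.4f s" % (n, t))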


Source: (StackOverflow)

Why is my program slow when looping over exactly 8192 elements?

Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE and is initialized with:

img[j][i] = 2 * j + i

Then a matrix res[][] is built, where each element is the average of the 3×3 block of elements centered on the corresponding position in img. The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }

That's all there is to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048. For example, here are the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Knowing how to fix this would also be nice, but if someone could explain these execution times, I'd already be happy enough.

I already know of malloc/free, but the problem is not the amount of memory used, only the execution time, so I don't see how that would help.


Source: (StackOverflow)

Swift performance: sorting arrays

I was implementing an algorithm in Swift and noticed that the performance was very poor. After digging deeper I realised that one of the bottlenecks was something as simple as sorting arrays. The relevant part is here:

let n = 1000000
let x = Int[](count: n, repeatedValue: 0)
for i in 0..n {
    x[i] = random()
}
// start clock here
let y = sort(x)
// stop clock here

In C++, a similar operation takes 0.06 s on my computer.

In Python it takes 0.6 s (no tricks, just y = sorted(x) for a list of integers).

In Swift it takes 6 s if I compile it with the following command:

xcrun swift -O3 -sdk `xcrun --show-sdk-path --sdk macosx`

And it takes as much as 88 s if I compile it with the following command:

xcrun swift -O0 -sdk `xcrun --show-sdk-path --sdk macosx`

Timings in Xcode with "Release" vs. "Debug" builds are similar.

What is wrong here? I could understand some performance loss in comparison with C++, but not a 10-fold slowdown in comparison with pure Python.


Edit: mweathers noticed that changing -O3 to -Ofast makes this code run almost as fast as the C++ version! However, -Ofast changes the semantics of the language a lot — in my testing, it disabled the checks for integer overflows and array indexing overflows. For example, with -Ofast the following Swift code runs silently without crashing (and prints out some garbage):

let n = 10000000
println(n*n*n*n*n)
let x = Int[](count: n, repeatedValue: 10)
println(x[n])

So -Ofast is not what we want; the whole point of Swift is that we have the safety nets in place. Of course the safety nets have some impact on the performance, but they should not make the programs 100 times slower. Remember that Java already checks for array bounds, and in typical cases the slowdown is by a factor much less than 2. And in Clang and GCC we have got -ftrapv for checking (signed) integer overflows, and it is not that slow, either.

Hence the question: how can we get a reasonable performance in Swift without losing the safety nets?


Edit 2: I did some more benchmarking, with very simple loops along the lines of

for i in 0..n {
    x[i] = x[i] ^ 12345678
}

(Here the xor operation is there just so that I can more easily find the relevant loop in the assembly code. I tried to pick an operation that is easy to spot but also "harmless" in the sense that it should not require any checks related to integer overflows.)

Again, there was a huge difference in the performance between -O3 and -Ofast. So I had a look at the assembly code:

  • With -Ofast I get pretty much what I would expect. The relevant part is a loop with 5 machine language instructions.

  • With -O3 I get something that was beyond my wildest imagination. The inner loop spans 88 lines of assembly code. I did not try to understand all of it, but the most suspicious parts are 13 invocations of "callq _swift_retain" and another 13 invocations of "callq _swift_release". That is, 26 subroutine calls in the inner loop!


Edit 3: In comments, Ferruccio asked for benchmarks that are fair in the sense that they do not rely on built-in functions (e.g. sort). I think the following program is a fairly good example:

let n = 10000
let x = Int[](count: n, repeatedValue: 1)
for i in 0..n {
    for j in 0..n {
        x[i] = x[j]
    }
}

There is no arithmetic, so we do not need to worry about integer overflows. The only thing that we do is just lots of array references. And the results are here; Swift -O3 loses by a factor of almost 500 in comparison with -Ofast:

  • C++ -O3: 0.05 s
  • C++ -O0: 0.4 s
  • Java: 0.2 s
  • Python with PyPy: 0.5 s
  • Python: 12 s
  • Swift -Ofast: 0.05 s
  • Swift -O3: 23 s
  • Swift -O0: 443 s

(If you are concerned that the compiler might optimise out the pointless loops entirely, you can change it to e.g. x[i] ^= x[j], and add a print statement that outputs x[0]. This does not change anything; the timings will be very similar.)

And yes, here the Python implementation was a stupid pure Python implementation with a list of ints and nested for loops. It should be much slower than unoptimised Swift. Something seems to be seriously broken with Swift and array indexing.


Edit 4: These issues (as well as some other performance issues) seem to have been fixed in Xcode 6 beta 5.

For sorting, I now have the following timings:

  • clang++ -O3: 0.06 s
  • swiftc -Ofast: 0.1 s
  • swiftc -O: 0.1 s
  • swiftc: 4 s

For nested loops:

  • clang++ -O3: 0.06 s
  • swiftc -Ofast: 0.3 s
  • swiftc -O: 0.4 s
  • swiftc: 540 s

It seems that there is no reason anymore to use the unsafe -Ofast (a.k.a. -Ounchecked); plain -O produces equally good code.


Source: (StackOverflow)

Why does Python code run faster in a function?

def main():
    for i in xrange(10**8):
        pass
main()

This piece of code in Python runs in

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn't placed within a function,

for i in xrange(10**8):
    pass

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?

Note: The timing is done with the time function in BASH in Linux.
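
One way to start investigating this on your own machine is to compare the bytecode CPython generates for the two versions; the standard dis module makes that easy. This is only a sketch of how to look (it uses Python 3's range in place of xrange); the key thing to spot is that the loop variable inside a function is stored in a fast local slot, while at module level it is an ordinary name lookup.

import dis

def main():
    for i in range(10**8):
        pass

# Inside a function: the loop variable i is stored with fast local-slot opcodes.
dis.dis(main)

# The same loop compiled as module-level code: i becomes a dictionary-backed
# name lookup instead of a local slot.
module_code = compile("for i in range(10**8):\n    pass\n", "<module>", "exec")
dis.dis(module_code)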


Source: (StackOverflow)

How can you profile a Python script?

Project Euler and other coding contests often have a maximum time to run, or people boast of how fast their particular solution runs. With Python, the approaches are sometimes somewhat kludgey, e.g., adding timing code to __main__.

What is a good way to profile how long a Python program takes to run?
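
For what it's worth, the standard library already covers the common cases: cProfile (or running python -m cProfile your_script.py from the shell) for a whole program, and timeit for a small snippet. Below is a minimal sketch; solve() is just a placeholder for whatever you actually want to measure.

import cProfile
import pstats
import timeit

def solve():
    # Placeholder workload standing in for the real program.
    return sum(i * i for i in range(10**6))

# Profile one call and print the ten most expensive functions by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
solve()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)

# Time five calls of the same function and print the total number of seconds.
print(timeit.timeit(solve, number=5))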


Source: (StackOverflow)

Why is 1000000000000000 in range(1000000000000001) so fast in Python 3?

It is my understanding that the range() function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.

This being the case, I would have expected the following line to take an inordinate amount of time, because in order to determine whether 1 quadrillion is in the range, a quadrillion values would have to be generated:

1000000000000000 in range(1000000000000001)

Furthermore: it seems that no matter how many zeroes I add on, the calculation more or less takes the same amount of time (basically instantaneous).

I have also tried things like this, but the calculation is still almost instant:

1000000000000000000000 in range(0,1000000000000000000001,10) # count by tens

If I try to implement my own range function, the result is not nearly so nice:

def my_crappy_range(N):
    i = 0
    while i < N:
        yield i
        i += 1
    return

What is the range() object doing under the hood that makes it so fast?
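
The short version is that range() in Python 3 is a full sequence object, and for integer arguments it can answer membership tests with a little arithmetic instead of iterating. The following is only a rough sketch of that idea (assuming a positive step), not the actual CPython implementation.

def fast_range_contains(value, start, stop, step=1):
    # Constant-time equivalent of `value in range(start, stop, step)` for ints
    # with step > 0; the real object falls back to ordinary iteration for
    # non-integer arguments.
    if not isinstance(value, int):
        raise TypeError("this sketch only handles ints")
    if not (start <= value < stop):
        return False
    return (value - start) % step == 0

print(fast_range_contains(1000000000000000, 0, 1000000000000001))   # True
print(fast_range_contains(10**21, 0, 10**21 + 1, 10))               # True

The real object also handles negative steps and a few edge cases that the sketch skips, but the membership test stays pure arithmetic either way.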


EDIT: This has turned out to be a much more nuanced topic than I anticipated - there seems to be a bit of history behind the optimization of range().

Martijn Pieters' answer was chosen for its completeness, but also see abarnert's first answer for a good discussion of what it means for range to be a full-fledged sequence in Python 3, and some information/warning regarding potential inconsistency for __contains__ function optimization across Python implementations. abarnert's other answer goes into some more detail and provides links for those interested in the history behind the optimization in Python 3 (and lack of optimization of xrange in Python 2). Answers by poke and by wim provide the relevant C source code and explanations for those who are interested.


Source: (StackOverflow)

MyISAM versus InnoDB

I'm working on a project that involves a lot of database writes, roughly 70% inserts and 30% reads. This ratio also includes updates, which I consider to be one read and one write. The reads can be dirty (i.e. I don't need 100% accurate information at the time of the read). The task in question will be doing over 1 million database transactions an hour.

I've read a bunch of stuff on the web about the differences between MyISAM and InnoDB, and MyISAM seems like the obvious choice to me for the particular database/tables that I'll be using for this task. From what I've been reading, InnoDB is good if transactions are needed, since row-level locking is supported.

Does anybody have any experience with this type of load (or higher)? Is MyISAM the way to go?


Source: (StackOverflow)

Python string formatting: % vs. .format

Python 2.6 introduced the str.format() method with a slightly different syntax from the existing % operator. Which is better and for what situations?

  1. The following uses each method and has the same outcome, so what is the difference?

    #!/usr/bin/python
    sub1 = "python string!"
    sub2 = "an arg"
    
    a = "i am a %s" % sub1
    b = "i am a {0}".format(sub1)
    
    c = "with %(kwarg)s!" % {'kwarg':sub2}
    d = "with {kwarg}!".format(kwarg=sub2)
    
    print a    # "i am a python string!"
    print b    # "i am a python string!"
    print c    # "with an arg!"
    print d    # "with an arg!"
    
  2. Furthermore, when does string formatting occur in Python? For example, if my logging level is set to HIGH, will I still take a hit for performing the following % operation? And if so, is there a way to avoid it? (See the sketch just after this item.)

    log.debug("some debug info: %s" % some_info)
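
On the logging point: if you hand the arguments to the logger instead of pre-formatting the string yourself, the formatting is deferred until the logger knows the record will actually be emitted, so a disabled debug call stays cheap. A small sketch follows; WARNING is used as an arbitrary stand-in for a level above DEBUG.

import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger(__name__)
some_info = {"answer": 42}

# Eager: the % formatting runs even though DEBUG records are discarded.
log.debug("some debug info: %s" % some_info)

# Lazy: logging performs the formatting only if the record is actually handled.
log.debug("some debug info: %s", some_info)

# For values that are expensive to compute, guard the call explicitly.
if log.isEnabledFor(logging.DEBUG):
    log.debug("some debug info: %s", some_info)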
    

Source: (StackOverflow)

\d is less efficient than [0-9]

I made a comment yesterday on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.

I decided to test that out today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much. Here is my test output over 10000 random strings of 1000 random characters, with 5077 actually containing a digit:

Regular expression \d           took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

It's a surprise to me for two reasons:

  1. I would have thought the range would be implemented much more efficiently than the set.
  2. I can't understand why \d is worse than [0-9]. Is there more to \d than simply being shorthand for [0-9]? (See the sketch just after this list.)
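
On point 2: in many regex engines \d is not just shorthand for [0-9]; by default it matches any Unicode decimal digit, which gives the engine a larger character class to handle (as far as I know, .NET's \d also matches Unicode digits unless RegexOptions.ECMAScript is used). As a quick illustration in a different engine, here is a Python 3 sketch using the re module, chosen only because it is easy to demo, not because it is the C# engine from the question:

import re

text = "abc\u0663xyz"   # contains ARABIC-INDIC DIGIT THREE (U+0663)

print(re.findall(r"\d", text))             # ['٣']  \d matches Unicode decimal digits
print(re.findall(r"[0-9]", text))          # []     the explicit range does not
print(re.findall(r"\d", text, re.ASCII))   # []     re.ASCII restricts \d to [0-9]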

Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //Generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //Add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //In roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //Replace one character with a digit, 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}

Source: (StackOverflow)