Performance interview questions
Top performance frequently asked interview questions
The following are two methods of building a link that has the sole purpose of running JavaScript code. Which is better, in terms of functionality, page load speed, validation purposes, etc.?
function myJsFunc() {
    alert("myJsFunc");
}
<a href="#" onclick="myJsFunc();">Run JavaScript Code</a>
or
function myJsFunc() {
    alert("myJsFunc");
}
<a href="javascript:void(0)" onclick="myJsFunc();">Run JavaScript Code</a>
Source: (StackOverflow)
Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.
#include <algorithm>
#include <cstdlib>   // for std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
- Without std::sort(data, data + arraySize), the code runs in 11.54 seconds.
- With the sorted data, the code runs in 1.93 seconds.
Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
The result was somewhat similar, but less extreme.
My first thought was that sorting brings the data into the cache, but my next thought was how silly that is, because the array was just generated.
- What is going on?
- Why is a sorted array faster than an unsorted array?
- The code is summing up some independent terms, and the order should not matter.
Source: (StackOverflow)
This was an interview question asked by a senior manager.
Which is faster?
while(1) {
    // Some code
}
or
while(2) {
    // Some code
}
I said that both have the same execution speed, as the expression inside while must eventually evaluate to true or false. In this case, both evaluate to true and there are no extra conditional instructions inside the while condition. So, both will have the same speed of execution, and I prefer while (1).
But the interviewer said confidently:
"Check your basics. while(1)
is faster than while(2)
."
(He was not testing my confidence)
Is this true?
Source: (StackOverflow)
I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang, and Haskell. In order to get longer execution times, I search for the first triangle number with more than 1000 divisors instead of the 500 stated in the original problem.
The result is the following:
C:
lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320
real 0m11.074s
user 0m11.070s
sys 0m0.000s
python:
lorenzo@enzo:~/erlang$ time ./euler12.py
842161320
real 1m16.632s
user 1m16.370s
sys 0m0.250s
python with pypy:
lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py
842161320
real 0m13.082s
user 0m13.050s
sys 0m0.020s
erlang:
lorenzo@enzo:~/erlang$ erlc euler12.erl
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.7.4 (abort with ^G)
1> 842161320
real 0m48.259s
user 0m48.070s
sys 0m0.020s
haskell:
lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx
842161320
real 2m37.326s
user 2m37.240s
sys 0m0.080s
Summary:
- C: 100%
- python: 692% (118% with pypy)
- erlang: 436% (135% thanks to RichardC)
- haskell: 1421%
I suppose that C has a big advantage, as it uses long for the calculations and not arbitrary-length integers like the other three. It also doesn't need to load a runtime first (do the others?).
Question 1:
Do Erlang, Python, and Haskell lose speed due to using arbitrary-length integers, or don't they as long as the values are less than MAXINT?
Question 2:
Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)
Question 3:
Can you offer me some hints on how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.
EDIT:
Question 4:
Do my functional implementations permit LCO (last call optimization, a.k.a. tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?
I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.
Source code used:
#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}
#! /usr/bin/env python3.2
import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index
print (triangle)
-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;
factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).
factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1
Source: (StackOverflow)
Given the two toString() implementations below, which one is preferred?
public String toString(){
    return "{a:"+ a + ", b:" + b + ", c: " + c +"}";
}
or
public String toString(){
    StringBuilder sb = new StringBuilder(100);
    return sb.append("{a:").append(a)
        .append(", b:").append(b)
        .append(", c:").append(c)
        .append("}")
        .toString();
}
More importantly: given we have only 3 properties it might not make a difference, but at what point would you switch from + concatenation to StringBuilder?
Source: (StackOverflow)
What is the main purpose of using CROSS APPLY?
I have read (vaguely, through posts on the Internet) that cross apply can be more efficient when selecting over large data sets if you are partitioning. (Paging comes to mind.)
I also know that CROSS APPLY doesn't require a UDF as the right-table.
In most INNER JOIN queries (one-to-many relationships), I could rewrite them to use CROSS APPLY, but they always give me equivalent execution plans.
Can anyone give me a good example of when CROSS APPLY makes a difference in those cases where INNER JOIN will work as well?
Edit:
Here's a trivial example, where the execution plans are exactly the same. (Show me one where they differ and where cross apply is faster/more efficient.)
create table Company (
companyId int identity(1,1)
, companyName varchar(100)
, zipcode varchar(10)
, constraint PK_Company primary key (companyId)
)
GO
create table Person (
personId int identity(1,1)
, personName varchar(100)
, companyId int
, constraint FK_Person_CompanyId foreign key (companyId) references dbo.Company(companyId)
, constraint PK_Person primary key (personId)
)
GO
insert Company
select 'ABC Company', '19808' union
select 'XYZ Company', '08534' union
select '123 Company', '10016'
insert Person
select 'Alan', 1 union
select 'Bobby', 1 union
select 'Chris', 1 union
select 'Xavier', 2 union
select 'Yoshi', 2 union
select 'Zambrano', 2 union
select 'Player 1', 3 union
select 'Player 2', 3 union
select 'Player 3', 3
/* using CROSS APPLY */
select *
from Person p
cross apply (
select *
from Company c
where p.companyid = c.companyId
) Czip
/* the equivalent query using INNER JOIN */
select *
from Person p
inner join Company c on p.companyid = c.companyId
Source: (StackOverflow)
Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE and is initialized as:
img[j][i] = 2 * j + i
Then you make a matrix res[][], and each field in it is set to the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.
for(i=1; i<SIZE-1; i++)
    for(j=1; j<SIZE-1; j++) {
        res[j][i] = 0;
        for(k=-1; k<2; k++)
            for(l=-1; l<2; l++)
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }
That's all there is to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.
#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; // result of mean filter
int i, j, k, l;

for(i=0; i<SIZE; i++)
    for(j=0; j<SIZE; j++)
        img[j][i] = (2*j+i)%8196;
Basically, this program is slow when SIZE is a multiple of 2048; e.g., here are the execution times:
SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs
The compiler is GCC.
From what I know, this is because of memory management, but I don't really know much about that subject, which is why I'm asking here.
It would also be nice to know how to fix this, but if someone could explain these execution times I'd already be happy enough.
I already know about malloc/free, but the problem is not the amount of memory used; it's merely execution time, so I don't know how that would help.
Source: (StackOverflow)
I was implementing an algorithm in Swift and noticed that the performance was very poor. After digging deeper I realised that one of the bottlenecks was something as simple as sorting arrays. The relevant part is here:
let n = 1000000
let x = Int[](count: n, repeatedValue: 0)
for i in 0..n {
    x[i] = random()
}
// start clock here
let y = sort(x)
// stop clock here
In C++, a similar operation takes 0.06 s on my computer.
In Python it takes 0.6 s (no tricks, just y = sorted(x) for a list of integers).
In Swift it takes 6 s if I compile it with the following command:
xcrun swift -O3 -sdk `xcrun --show-sdk-path --sdk macosx`
And it takes as much as 88 s if I compile it with the following command:
xcrun swift -O0 -sdk `xcrun --show-sdk-path --sdk macosx`
Timings in Xcode with "Release" vs. "Debug" builds are similar.
What is wrong here? I could understand some performance loss in comparison with C++, but not a 10-fold slowdown in comparison with pure Python.
Edit: mweathers noticed that changing -O3 to -Ofast makes this code run almost as fast as the C++ version! However, -Ofast changes the semantics of the language a lot; in my testing, it disabled the checks for integer overflows and array indexing overflows. For example, with -Ofast the following Swift code runs silently without crashing (and prints out some garbage):
let n = 10000000
println(n*n*n*n*n)
let x = Int[](count: n, repeatedValue: 10)
println(x[n])
So -Ofast is not what we want; the whole point of Swift is that we have the safety nets in place. Of course the safety nets have some impact on the performance, but they should not make the programs 100 times slower. Remember that Java already checks for array bounds, and in typical cases the slowdown is by a factor much less than 2. And in Clang and GCC we have got -ftrapv for checking (signed) integer overflows, and it is not that slow, either.
Hence the question: how can we get a reasonable performance in Swift without losing the safety nets?
Edit 2: I did some more benchmarking, with very simple loops along the lines of
for i in 0..n {
    x[i] = x[i] ^ 12345678
}
(Here the xor operation is there just so that I can more easily find the relevant loop in the assembly code. I tried to pick an operation that is easy to spot but also "harmless" in the sense that it should not require any checks related to integer overflows.)
Again, there was a huge difference in the performance between -O3 and -Ofast. So I had a look at the assembly code:
With -Ofast I get pretty much what I would expect. The relevant part is a loop with 5 machine language instructions.
With -O3 I get something that was beyond my wildest imagination. The inner loop spans 88 lines of assembly code. I did not try to understand all of it, but the most suspicious parts are 13 invocations of "callq _swift_retain" and another 13 invocations of "callq _swift_release". That is, 26 subroutine calls in the inner loop!
Edit 3: In comments, Ferruccio asked for benchmarks that are fair in the sense that they do not rely on built-in functions (e.g. sort). I think the following program is a fairly good example:
let n = 10000
let x = Int[](count: n, repeatedValue: 1)
for i in 0..n {
    for j in 0..n {
        x[i] = x[j]
    }
}
There is no arithmetic, so we do not need to worry about integer overflows. The only thing that we do is just lots of array references. And the results are here: Swift -O3 loses by a factor of almost 500 in comparison with -Ofast:
- C++ -O3: 0.05 s
- C++ -O0: 0.4 s
- Java: 0.2 s
- Python with PyPy: 0.5 s
- Python: 12 s
- Swift -Ofast: 0.05 s
- Swift -O3: 23 s
- Swift -O0: 443 s
(If you are concerned that the compiler might optimise out the pointless loops entirely, you can change it to e.g. x[i] ^= x[j], and add a print statement that outputs x[0]. This does not change anything; the timings will be very similar.)
And yes, here the Python implementation was a stupid pure Python implementation with a list of ints and nested for loops. It should be much slower than unoptimised Swift. Something seems to be seriously broken with Swift and array indexing.
Edit 4: These issues (as well as some other performance issues) seem to have been fixed in Xcode 6 beta 5.
For sorting, I now have the following timings:
- clang++ -O3: 0.06 s
- swiftc -Ofast: 0.1 s
- swiftc -O: 0.1 s
- swiftc: 4 s
For nested loops:
- clang++ -O3: 0.06 s
- swiftc -Ofast: 0.3 s
- swiftc -O: 0.4 s
- swiftc: 540 s
It seems that there is no reason anymore to use the unsafe -Ofast (a.k.a. -Ounchecked); plain -O produces equally good code.
Source: (StackOverflow)
def main():
    for i in xrange(10**8):
        pass

main()
This piece of code in Python runs in
real 0m1.841s
user 0m1.828s
sys 0m0.012s
However, if the for loop isn't placed within a function,
for i in xrange(10**8):
    pass
then it runs for a much longer time:
real 0m4.543s
user 0m4.524s
sys 0m0.012s
Why is this?
Note: the timing is done with the time command in bash on Linux.
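One way to poke at this yourself (a sketch assuming CPython; the xrange call implies Python 2) is to compare the bytecode the two versions compile to, using the standard dis module. The thing to look for is that the loop variable is written with STORE_FAST inside a function and with STORE_NAME at module level.
import dis

def main():
    for i in xrange(10**8):
        pass

# Bytecode of the function body: the loop variable i is written with
# STORE_FAST, an indexed write into a fixed-size array of locals.
dis.dis(main)

# Bytecode of the same loop compiled as module-level code: i is written
# with STORE_NAME, a dictionary store into the module namespace.
dis.dis(compile("for i in xrange(10**8): pass", "<string>", "exec"))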
Source: (StackOverflow)
Project Euler and other coding contests often have a maximum time to run, or people boast of how fast their particular solution runs. With Python, sometimes the approaches are somewhat kludgey, i.e. adding timing code to __main__.
What is a good way to profile how long a Python program takes to run?
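For illustration, here is a minimal sketch using the standard-library cProfile and timeit modules; the fib function is just a stand-in workload and not part of the original question.
import cProfile
import timeit

def fib(n):
    # Deliberately slow recursive Fibonacci, used only as a workload.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Per-function breakdown: call counts and cumulative time per function.
cProfile.run("fib(25)")

# Wall-clock timing of a snippet, repeated for more stable numbers.
print(timeit.timeit("fib(20)", setup="from __main__ import fib", number=100))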
Source: (StackOverflow)
It is my understanding that the range() function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.
This being the case, I would have expected the following line to take an inordinate amount of time, because in order to determine whether 1 quadrillion is in the range, a quadrillion values would have to be generated:
1000000000000000 in range(1000000000000001)
Furthermore: it seems that no matter how many zeroes I add on, the calculation more or less takes the same amount of time (basically instantaneous).
I have also tried things like this, but the calculation is still almost instant:
1000000000000000000000 in range(0,1000000000000000000001,10) # count by tens
If I try to implement my own range function, the result is not so nice!!
def my_crappy_range(N):
    i = 0
    while i < N:
        yield i
        i += 1
    return
What is the range() object doing under the hood that makes it so fast?
EDIT: This has turned out to be a much more nuanced topic than I anticipated; there seems to be a bit of history behind the optimization of range().
Martijn Pieters' answer was chosen for its completeness, but also see abarnert's first answer for a good discussion of what it means for range to be a full-fledged sequence in Python 3, and some information/warning regarding potential inconsistency of the __contains__ optimization across Python implementations. abarnert's other answer goes into more detail and provides links for those interested in the history behind the optimization in Python 3 (and the lack of optimization of xrange in Python 2). Answers by poke and by wim provide the relevant C source code and explanations for those who are interested.
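For contrast with my_crappy_range, here is a rough Python sketch of the kind of constant-time arithmetic membership test a range-like sequence can use instead of iterating. It is a simplification (it assumes a positive step); the real range() logic lives in C inside CPython.
def fast_range_contains(value, start, stop, step=1):
    # Constant-time containment check: no iteration over the sequence.
    if not (start <= value < stop):
        return False
    # The value must also land exactly on the step grid.
    return (value - start) % step == 0

# Same answers as the `in range(...)` expressions above, instantly.
print(fast_range_contains(1000000000000000, 0, 1000000000000001))
print(fast_range_contains(1000000000000000000000, 0, 1000000000000000000001, 10))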
Source: (StackOverflow)
I'm working on a project that involves a lot of database writes; I'd say 70% inserts and 30% reads. This ratio also includes updates, which I consider to be one read and one write. The reads can be dirty (e.g. I don't need 100% accurate information at the time of read).
The task in question will be doing over 1 million database transactions an hour.
I've read a bunch of stuff on the web about the differences between MyISAM and InnoDB, and MyISAM seems like the obvious choice for the particular database/tables that I'll be using for this task. From what I seem to be reading, InnoDB is good if transactions are needed, since row-level locking is supported.
Does anybody have any experience with this type of load (or higher)? Is MyISAM the way to go?
Source: (StackOverflow)
Python 2.6 introduced the str.format() method with a slightly different syntax from the existing % operator. Which is better, and for what situations?
The following uses each method and has the same outcome, so what is the difference?
#!/usr/bin/python
sub1 = "python string!"
sub2 = "an arg"
a = "i am a %s" % sub1
b = "i am a {0}".format(sub1)
c = "with %(kwarg)s!" % {'kwarg':sub2}
d = "with {kwarg}!".format(kwarg=sub2)
print a # "i am a python string!"
print b # "i am a python string!"
print c # "with an arg!"
print d # "with an arg!"
Furthermore, when does string formatting occur in Python? For example, if my logging level is set to HIGH, will I still take a hit for performing the following % operation? And if so, is there a way to avoid this?
log.debug("some debug info: %s" % some_info)
Source: (StackOverflow)
I made a comment yesterday on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.
I decided to test that out today and found out, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much. Here is my test output over 10000 random strings of 1000 random characters, with 5077 actually containing a digit:
Regular expression \d took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9] took 00:00:00.1357972 result: 5077/10000 63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000 64.87 % of first
It's a surprise to me for two reasons:
- I would have thought the range would be implemented much more efficiently than the set.
- I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?
Here is the test code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();

            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //Generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //Add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //In roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //Replace one character with a digit, 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();
            int successes = 0;
            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}
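For anyone who wants to repeat roughly the same comparison outside the .NET engine, here is a rough sketch using Python's re and timeit modules. It only loosely mirrors the C# test above, and the numbers it produces are specific to Python's regex engine, not the ones quoted in the question.
import random
import re
import timeit

random.seed(1234)

def make_strings(count=10000, length=1000):
    # Random lowercase strings; roughly half of them get one digit inserted.
    strings = []
    for _ in range(count):
        chars = [chr(ord('a') + random.randrange(26)) for _ in range(length)]
        if random.randrange(2) == 0:
            chars[random.randrange(length)] = chr(ord('0') + random.randrange(10))
        strings.append("".join(chars))
    return strings

strings = make_strings()

for pattern in (r"\d", "[0-9]", "[0123456789]"):
    rex = re.compile(pattern)
    # Count matching strings in one timed pass, as the C# loop does.
    elapsed = timeit.timeit(lambda: sum(1 for s in strings if rex.search(s)), number=1)
    print("%-14s %.4f s" % (pattern, elapsed))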
Source: (StackOverflow)