Latency interview questions
Top frequently asked latency interview questions
I need to simulate a low-bandwidth, high-latency connection to a server in order to emulate the conditions of a VPN at a remote site. The bandwidth and latency should be tweakable so I can discover the best combination for running our software package.
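Linux tc/netem (and similar OS-level shapers such as dummynet) is the usual answer, but as a rough illustration of what "tweakable" latency and bandwidth can look like in software, here is a minimal Python sketch of a delay-and-throttle TCP relay. All hosts, ports, and numbers are placeholder assumptions, and the per-chunk sleeps are only a crude model of propagation and serialization delay:
import asyncio

DELAY_S = 0.150          # added one-way latency (assumed 150 ms)
BYTES_PER_S = 64 * 1024  # bandwidth cap (assumed 64 KiB/s)
CHUNK = 4096             # forwarding chunk size in bytes

async def pump(reader, writer):
    # Forward data in chunks, sleeping to model propagation and serialization delay.
    while True:
        data = await reader.read(CHUNK)
        if not data:
            break
        await asyncio.sleep(DELAY_S)                  # crude propagation delay, re-applied per chunk
        await asyncio.sleep(len(data) / BYTES_PER_S)  # time to "send" this chunk at the capped rate
        writer.write(data)
        await writer.drain()
    writer.close()

async def handle(client_reader, client_writer):
    # For each client, open a connection to the real server and relay both directions.
    server_reader, server_writer = await asyncio.open_connection("127.0.0.1", 8080)
    await asyncio.gather(
        pump(client_reader, server_writer),
        pump(server_reader, client_writer),
    )

async def main():
    # Point the software package at 127.0.0.1:9090 instead of the real server.
    server = await asyncio.start_server(handle, "127.0.0.1", 9090)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())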
Source: (StackOverflow)
We are developing a program which receives and forwards "messages", while keeping a temporary history of those messages, so that it can tell you the message history if requested. Messages are identified numerically, are typically around 1 kilobyte in size, and we need to keep hundreds of thousands of these messages.
We wish to optimize this program for latency: the time between sending and receiving a message must be below 10 milliseconds.
The program is written in Haskell and compiled with GHC. However, we have found that garbage collection pauses are far too long for our latency requirements: over 100 milliseconds in our real-world program.
The following program is a simplified version of our application. It uses a Data.Map.Strict to store messages. Messages are ByteStrings identified by an Int. 1,000,000 messages are inserted in increasing numeric order, and the oldest messages are continually removed to keep the history at a maximum of 200,000 messages.
module Main (main) where

import qualified Control.Exception as Exception
import qualified Control.Monad as Monad
import qualified Data.ByteString as ByteString
import qualified Data.Map.Strict as Map

-- A message: an Int identifier plus a strict ByteString payload.
data Msg = Msg !Int !ByteString.ByteString

-- The message history, keyed by message id.
type Chan = Map.Map Int ByteString.ByteString

-- Build a 1 kilobyte message whose bytes are derived from its id.
message :: Int -> Msg
message n = Msg n (ByteString.replicate 1024 (fromIntegral n))

-- Insert a message and, once the history exceeds 200,000 entries, evict the oldest one.
pushMsg :: Chan -> Msg -> IO Chan
pushMsg chan (Msg msgId msgContent) =
  Exception.evaluate $
    let
      inserted = Map.insert msgId msgContent chan
    in
      if 200000 < Map.size inserted
      then Map.deleteMin inserted
      else inserted

main :: IO ()
main = Monad.foldM_ pushMsg Map.empty (map message [1..1000000])
We compiled and ran this program using:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ ghc -O2 -optc-O3 Main.hs
$ ./Main +RTS -s
3,116,460,096 bytes allocated in the heap
385,101,600 bytes copied during GC
235,234,800 bytes maximum residency (14 sample(s))
124,137,808 bytes maximum slop
600 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6558 colls, 0 par 0.238s 0.280s 0.0000s 0.0012s
Gen 1 14 colls, 0 par 0.179s 0.250s 0.0179s 0.0515s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.652s ( 0.745s elapsed)
GC time 0.417s ( 0.530s elapsed)
EXIT time 0.010s ( 0.052s elapsed)
Total time 1.079s ( 1.326s elapsed)
%GC time 38.6% (40.0% elapsed)
Alloc rate 4,780,213,353 bytes per MUT second
Productivity 61.4% of total user, 49.9% of total elapsed
The important metric here is the "max pause" of 0.0515s, or 51 milliseconds. We wish to reduce this by at least an order of magnitude.
Experimentation shows that the length of a GC pause is determined by the number of messages in the history. The relationship is roughly linear, or perhaps super-linear. The following table shows this relationship. (You can see our benchmarking tests here, and some charts here.)
msgs history length max GC pause (ms)
=================== =================
12500 3
25000 6
50000 13
100000 30
200000 56
400000 104
800000 199
1600000 487
3200000 1957
6400000 5378
We have experimented with several other variables to find whether they can reduce this latency, none of which make a big difference. Among these unimportant variables are: optimization (-O, -O2); RTS GC options (-G, -H, -A, -c); number of cores (-N); different data structures (Data.Sequence); the size of messages; and the amount of generated short-lived garbage. The overwhelming determining factor is the number of messages in the history.
Our working theory is that the pauses are linear in the number of messages because each GC cycle has to walk over all of the live (accessible) working memory and copy it, both of which are clearly linear in the amount of that memory. (As a sanity check, 200,000 messages of roughly 1 kB each is about 200 MB of live data, which roughly matches the ~235 MB maximum residency reported above.)
Questions:
- Is this linear-time theory correct? Can the length of GC pauses be expressed in this simple way, or is the reality more complex?
- If GC pause is linear in the working memory, is there any way to reduce the constant factors involved?
- Are there any options for incremental GC, or anything like it? We can find only research papers. We are very willing to trade throughput for lower latency.
- Are there any ways to "partition" memory for smaller GC cycles, other than splitting into multiple processes?
Source: (StackOverflow)
In a Swing application, I sometimes need to support read-only access to large, line-oriented text files that are slow to load: logs, dumps, traces, etc. For small amounts of data, a suitable Document and JTextComponent are fine, as shown here. I understand the human limitations of browsing large amounts of data, but the problematic stuff always seems to be in the biggest file. Is there any practical alternative for larger amounts of text in the 10-100 megabyte, million-line range?
Source: (StackOverflow)
I have a Ruby on Rails application. I am investigating an Apdex decline in my New Relic portal and I'm seeing that, on average, 250-320 ms of time is being spent on GC execution. This is a highly disturbing number. I've included a screenshot below.
My Ruby version is:
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
Any suggestions for tuning this would be ideal. This number should be substantially lower.
Source: (StackOverflow)
I'm trying to trigger different actions inside only my own app using the buttons of a plugged-in headset (something similar to what Pressy does). I noticed, however, that no matter whether I use MPRemoteCommandCenter or the remoteControlReceivedWithEvent delegate method, I receive events with a noticeable lag. What makes matters worse is that if I double-press the button quickly, I only get one UIEventTypeRemoteControl event.
Does anyone experience a similar issue, know the reason for this, or (even better) know of a workaround? Tested under iOS 8 and iOS 9.
Source: (StackOverflow)
I'm running CentOS 6.3 inside VirtualBox 4.2.1 on my OS X 10.8.2 machine, and I have encountered a latency issue I do not understand. Basically, every HTTP request to anywhere has an additional 5000 ms delay. Ping has no additional delay.
[vagrant@localhost ~]$ curl -o /dev/null https://google.com -w ""
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
110 220 110 220 0 0 42 0 0:00:05 0:00:05 --:--:-- 7586
[vagrant@localhost ~]$ curl -o /dev/null https://google.com -w ""
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
110 220 110 220 0 0 42 0 0:00:05 0:00:05 --:--:-- 7586
[vagrant@localhost ~]$ curl -o /dev/null https://google.com -w ""
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
110 220 110 220 0 0 42 0 0:00:05 0:00:05 --:--:-- 7857
[vagrant@localhost ~]$ ping google.com
PING google.com (74.125.224.66) 56(84) bytes of data.
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=1 ttl=63 time=7.64 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=2 ttl=63 time=6.26 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=3 ttl=63 time=8.69 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=4 ttl=63 time=7.89 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=5 ttl=63 time=5.99 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=6 ttl=63 time=8.64 ms
For the record, I have an Ubuntu 12.04.1 virtual machine in the same VirtualBox/OS X environment that does not experience the issue.
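One way to narrow down where the extra ~5 seconds goes (purely a diagnostic sketch; the hostname is just an example) is to time name resolution and the TCP connect separately from inside the guest, since the per-packet ping times above do not include name resolution:
import socket
import time

host = "google.com"   # example target

t0 = time.time()
socket.getaddrinfo(host, 443)        # DNS resolution (A/AAAA lookups)
t1 = time.time()
print("DNS lookup:  %.3f s" % (t1 - t0))

sock = socket.create_connection((host, 443), timeout=30)   # TCP three-way handshake
t2 = time.time()
print("TCP connect: %.3f s" % (t2 - t1))
sock.close()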
Source: (StackOverflow)
I just came across the register keyword in C++, and I wondered: since this seems like a good idea (keeping certain variables in a register), surely the compiler does this by default?
So I wondered, is this keyword still used?
Source: (StackOverflow)
Through Appstats, I can see that my datastore queries are taking about 125 ms (API and CPU combined), but often there are long latencies (e.g. up to 12,000 ms) before the queries are executed.
I can see that the latency from the datastore is not related to my query (e.g. the same query/data has vastly different latencies), so I'm assuming that it's a scheduling issue with App Engine.
Are other people seeing this same problem?
Is there some way to reduce the latency (e.g. an admin console setting)?
Here's a screenshot from Appstats. This servlet has very little CPU processing. It does a getObjectByID and then does a datastore query. The query has an OR operator, so it's being converted into 3 queries by App Engine.
As you can see, it takes 6000 ms before the first getObjectByID is even executed. There is no processing before the get operation (other than getting pm). I thought this 6000 ms latency might be due to an instance warm-up, so I increased my idle instances to 2 to prevent any warm-ups.
Then there's a second latency of around 1000 ms between the getObjectByID and the query. There are zero lines of code between the get and the query. The code simply takes the result of the getObjectByID and uses the data as part of the query.
The grand total is 8097 ms, yet my datastore operations (and 99.99% of the servlet) are only 514 ms (45 ms API), though the numbers change every time I run the servlet. Here is another Appstats screenshot that was run on the same servlet against the same data.
Here are the basics of my Java code. I had to remove some of the details for security purposes.
user = pm.getObjectById(User.class, userKey);
//build queryBuilder.append(...
final Query query = pm.newQuery(UserAccount.class,queryBuilder.toString());
query.setOrdering("rating descending");
query.executeWithArray(args);
Edited:
Using Pingdom, I can see that GAE latency varies from 450 ms to 7,399 ms, or a 1,644% difference! This is with two idle instances and no users on the site.
Source: (StackOverflow)
I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses the data (e.g. by adding a context window). In other words, one thread does data preprocessing and the other does training. Is this possible in TensorFlow?
Update: I have a working example based on @mrry's example.
import numpy as np
import tensorflow as tf
import threading

BATCH_SIZE = 5
TRAINING_ITERS = 4100

# Placeholders used by the loader thread to feed one example at a time.
feature_input = tf.placeholder(tf.float32, shape=[128])
label_input = tf.placeholder(tf.float32, shape=[128])

# A queue holding up to 200 (label, feature) pairs, filled by the loader thread.
q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
enqueue_op = q.enqueue([label_input, feature_input])
label_batch, feature_batch = q.dequeue_many(BATCH_SIZE)

# A stand-in for the real training op.
c = tf.reshape(feature_batch, [BATCH_SIZE, 128]) + tf.reshape(label_batch, [BATCH_SIZE, 128])

sess = tf.Session()

def load_and_enqueue(sess, enqueue_op, coord):
    # Runs in a separate thread: read examples from disk and push them into the queue.
    with open('dummy_data/features.bin', 'rb') as feature_file, \
         open('dummy_data/labels.bin', 'rb') as label_file:
        while not coord.should_stop():
            feature_array = np.fromfile(feature_file, np.float32, 128)
            if feature_array.shape[0] == 0:
                print('reached end of file, resetting using seek(0,0)')
                feature_file.seek(0, 0)
                label_file.seek(0, 0)
                continue
            label_value = np.fromfile(label_file, np.float32, 128)
            sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                            label_input: label_value})

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
t.start()

for i in range(TRAINING_ITERS):
    sum = sess.run(c)
    print('train_iter=' + str(i))
    print(sum)

coord.request_stop()
coord.join([t])
Source: (StackOverflow)
I've been looking into ways to reduce the latency caused by transferring data back and forth between the CPU and GPU. When I first started using CUDA, I did notice that data transfer between the CPU and GPU took a few seconds, but I didn't really care, because this isn't really a concern for the small programs I've been writing. In fact, the latency probably isn't much of a problem for the vast majority of programs that utilize GPUs, video games included, because they're still a lot faster than if they had run on the CPU.
However, I'm a bit of an HPC enthusiast, and I became concerned with the direction of my studies when I saw the massive discrepancy between the Tianhe-I theoretical peak FLOPS and the actual LINPACK measured performance. This has me concerned about whether I'm taking the right career path.
Use of pinned (page-locked) memory through the cudaHostAlloc() function is one quite effective method of reducing latency, but are there any other techniques I'm not aware of? And to be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA's and AMD's job).
Just as a side question, I'm aware that Dell and HP sell Tesla servers. I'm curious as to how well a GPU is leveraged in a database application, where you would need constant reads from the hard drive (HDD or SSD), an operation only the CPU can perform.
Source: (StackOverflow)
I'm doing some performance tuning and capacity planning for a low-latency application and have the following question:
What is the theoretical minimum round-trip time for a packet sent between a host in London and one in New York connected via optical fiber?
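A rough back-of-the-envelope bound, assuming a great-circle distance of about 5,570 km and light travelling through silica fiber at roughly c / 1.47 ≈ 204,000 km/s:

    RTT_min ≈ 2 × 5,570 km / 204,000 km/s ≈ 0.0546 s ≈ 55 ms

Any real cable route is longer than the great circle and adds equipment delay, so measured round trips sit somewhat above this figure.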
Source: (StackOverflow)
What's the latency to GET an object from S3 from an EC2 instance?
For example, how many milliseconds pass before the actual data stream for that object starts streaming back when requesting an object by its full path?
- If the object exists on S3
- If the object does not exist and S3 has to send a 404
Latency, not Throughput.
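A minimal way to measure this yourself from an EC2 instance (a hedged sketch using boto3; the bucket and key names are placeholders) is to time the GET call up to the first byte of the body, for both an existing and a missing key:
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def time_get(bucket, key):
    # Time from issuing the GET to receiving the first byte (or the error response).
    start = time.perf_counter()
    try:
        resp = s3.get_object(Bucket=bucket, Key=key)
        resp["Body"].read(1)                                  # wait for the first byte of the stream
        outcome = "first byte"
    except ClientError as err:
        outcome = "error: " + err.response["Error"]["Code"]   # e.g. NoSuchKey for a missing object
    return outcome, (time.perf_counter() - start) * 1000.0

print(time_get("my-bucket", "existing-object"))   # placeholder names
print(time_get("my-bucket", "missing-object"))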
Source: (StackOverflow)
I run a couple of game tunnelling servers and would like to have a page where the client can run a ping against all the servers and find out which is the most responsive. As far as I can see, there seems to be no proper way to do this in JavaScript, but I was thinking: does anybody know of a way to do this in Flash or some other client-side browser technology?
Source: (StackOverflow)
I am reading a single data item from a UDP port. It's essential that this read be the lowest latency possible. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card and the callback method being invoked in my user code?
Boost is a very good library, but quite generic; is there a lower-latency alternative?
All opinions on writing low-latency UDP network programs are very welcome.
EDIT: Another question: is there a relatively feasible way to estimate the latency I'm experiencing between the NIC and user mode?
Source: (StackOverflow)
I saw this post on SO, which contains C code to get the latest CPU cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (Windows and Linux solutions welcome)? Although it is written in C (and C is largely a subset of C++), I am not too certain whether this code would work in a C++ project and, if not, how to translate it.
I am using x86-64.
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for Windows...?)
static inline uint64_t get_cycles()
{
    uint64_t t;
    __asm volatile ("rdtsc" : "=A"(t));
    return t;
}
EDIT3:
From the above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"
Could someone please help?
Source: (StackOverflow)