Latency interview questions
Top frequently asked latency interview questions
I need to simulate a low-bandwidth, high-latency connection to a server in order to emulate the conditions of a VPN at a remote site. The bandwidth and latency should be tweakable so I can discover the best combination for running our software package.
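Linux tc/netem (and similar OS-level shapers such as dummynet) is the usual answer, but as a rough illustration of what "tweakable" latency and bandwidth can look like in software, here is a minimal Python sketch of a delay-and-throttle TCP relay. All hosts, ports, and numbers are placeholder assumptions, and the per-chunk sleeps are only a crude model of propagation and serialization delay:
import asyncio

DELAY_S = 0.150          # added one-way latency (assumed 150 ms)
BYTES_PER_S = 64 * 1024  # bandwidth cap (assumed 64 KiB/s)
CHUNK = 4096             # forwarding chunk size in bytes

async def pump(reader, writer):
    # Forward data in chunks, sleeping to model propagation and serialization delay.
    while True:
        data = await reader.read(CHUNK)
        if not data:
            break
        await asyncio.sleep(DELAY_S)                  # crude propagation delay, re-applied per chunk
        await asyncio.sleep(len(data) / BYTES_PER_S)  # time to "send" this chunk at the capped rate
        writer.write(data)
        await writer.drain()
    writer.close()

async def handle(client_reader, client_writer):
    # For each client, open a connection to the real server and relay both directions.
    server_reader, server_writer = await asyncio.open_connection("127.0.0.1", 8080)
    await asyncio.gather(
        pump(client_reader, server_writer),
        pump(server_reader, client_writer),
    )

async def main():
    # Point the software package at 127.0.0.1:9090 instead of the real server.
    server = await asyncio.start_server(handle, "127.0.0.1", 9090)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())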
Source: (StackOverflow)
We are developing a program which receives and forwards "messages", while keeping a temporary history of those messages, so that it can tell you the message history if requested. Messages are identified numerically, are typically around 1 kilobyte in size, and we need to keep hundreds of thousands of these messages.
We wish to optimize this program for latency: the time between sending and receiving a message must be below 10 milliseconds.
The program is written in Haskell and compiled with GHC. However, we have found that garbage collection pauses are far too long for our latency requirements: over 100 milliseconds in our real-world program.
The following program is a simplified version of our application. It uses a Data.Map.Strict to store messages. Messages are ByteStrings identified by an Int. 1,000,000 messages are inserted in increasing numeric order, and the oldest messages are continually removed to keep the history at a maximum of 200,000 messages.
module Main (main) where

import qualified Control.Exception as Exception
import qualified Control.Monad as Monad
import qualified Data.ByteString as ByteString
import qualified Data.Map.Strict as Map

-- A message: an Int identifier plus a strict ByteString payload.
data Msg = Msg !Int !ByteString.ByteString

-- The message history, keyed by message id.
type Chan = Map.Map Int ByteString.ByteString

-- Build a 1 kilobyte message whose bytes are derived from its id.
message :: Int -> Msg
message n = Msg n (ByteString.replicate 1024 (fromIntegral n))

-- Insert a message and, once the history exceeds 200,000 entries, evict the oldest one.
pushMsg :: Chan -> Msg -> IO Chan
pushMsg chan (Msg msgId msgContent) =
  Exception.evaluate $
    let
      inserted = Map.insert msgId msgContent chan
    in
      if 200000 < Map.size inserted
      then Map.deleteMin inserted
      else inserted

main :: IO ()
main = Monad.foldM_ pushMsg Map.empty (map message [1..1000000])
We compiled and ran this program using:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ ghc -O2 -optc-O3 Main.hs
$ ./Main +RTS -s
3,116,460,096 bytes allocated in the heap
385,101,600 bytes copied during GC
235,234,800 bytes maximum residency (14 sample(s))
124,137,808 bytes maximum slop
600 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6558 colls, 0 par 0.238s 0.280s 0.0000s 0.0012s
Gen 1 14 colls, 0 par 0.179s 0.250s 0.0179s 0.0515s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.652s ( 0.745s elapsed)
GC time 0.417s ( 0.530s elapsed)
EXIT time 0.010s ( 0.052s elapsed)
Total time 1.079s ( 1.326s elapsed)
%GC time 38.6% (40.0% elapsed)
Alloc rate 4,780,213,353 bytes per MUT second
Productivity 61.4% of total user, 49.9% of total elapsed
The important metric here is the "max pause" of 0.0515s, or 51 milliseconds. We wish to reduce this by at least an order of magnitude.
Experimentation shows that the length of a GC pause is determined by the number of messages in the history. The relationship is roughly linear, or perhaps super-linear. The following table shows this relationship. (You can see our benchmarking tests here, and some charts here.)
msgs history length max GC pause (ms)
=================== =================
12500 3
25000 6
50000 13
100000 30
200000 56
400000 104
800000 199
1600000 487
3200000 1957
6400000 5378
We have experimented with several other variables to find whether they can reduce this latency, none of which make a big difference. Among these unimportant variables are: optimization (-O, -O2); RTS GC options (-G, -H, -A, -c); number of cores (-N); different data structures (Data.Sequence); the size of messages; and the amount of generated short-lived garbage. The overwhelming determining factor is the number of messages in the history.
Our working theory is that the pauses are linear in the number of messages because each GC cycle has to walk over all of the live (accessible) working memory and copy it, both of which are clearly linear in the amount of that memory. (As a sanity check, 200,000 messages of roughly 1 kB each is about 200 MB of live data, which roughly matches the ~235 MB maximum residency reported above.)
Questions:
- Is this linear-time theory correct? Can the length of GC pauses be expressed in this simple way, or is the reality more complex?
- If GC pause is linear in the working memory, is there any way to reduce the constant factors involved?
- Are there any options for incremental GC, or anything like it? We can find only research papers. We are very willing to trade throughput for lower latency.
- Are there any ways to "partition" memory for smaller GC cycles, other than splitting into multiple processes?
Source: (StackOverflow)
In a Swing application, I sometimes need to support read-only access to large, line-oriented text files that are slow to load: logs, dumps, traces, etc. For small amounts of data, a suitable Document and JTextComponent are fine, as shown here. I understand the human limitations of browsing large amounts of data, but the problematic stuff always seems to be in the biggest file. Is there any practical alternative for larger amounts of text in the 10-100 megabyte, million-line range?
Source: (StackOverflow)
I have a Ruby on Rails application. I am investigating an Apdex decline in my New Relic portal and I'm seeing that, on average, 250-320 ms of time is being spent on GC execution. This is a highly disturbing number. I've included a screenshot below.
My Ruby version is:
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
Any suggestions for tuning this would be ideal. This number should be substantially lower.
Source: (StackOverflow)
I'm trying to trigger different actions inside only my own app using the buttons of a plugged-in headset (something similar to what Pressy does). I noticed, however, that no matter whether I use MPRemoteCommandCenter or the remoteControlReceivedWithEvent delegate method, I receive events with a noticeable lag. What makes matters worse is that if I double-press the button quickly, I only get one UIEventTypeRemoteControl event.
Does anyone experience a similar issue, know the reason for this, or (even better) know of a workaround? Tested under iOS 8 and iOS 9.
Source: (StackOverflow)
I'm running CentOS 6.3 inside VirtualBox 4.2.1 on my OS X 10.8.2 machine, and I have encountered a latency issue I do not understand. Basically, every HTTP request to anywhere has an additional 5000 ms delay. Ping has no additional delay.
[vagrant@localhost ~]$ curl -o /dev/null https://google.com -w ""
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
110 220 110 220 0 0 42 0 0:00:05 0:00:05 --:--:-- 7586
[vagrant@localhost ~]$ curl -o /dev/null https://google.com -w ""
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
110 220 110 220 0 0 42 0 0:00:05 0:00:05 --:--:-- 7586
[vagrant@localhost ~]$ curl -o /dev/null https://google.com -w ""
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
110 220 110 220 0 0 42 0 0:00:05 0:00:05 --:--:-- 7857
[vagrant@localhost ~]$ ping google.com
PING google.com (74.125.224.66) 56(84) bytes of data.
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=1 ttl=63 time=7.64 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=2 ttl=63 time=6.26 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=3 ttl=63 time=8.69 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=4 ttl=63 time=7.89 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=5 ttl=63 time=5.99 ms
64 bytes from nuq04s07-in-f2.1e100.net (74.125.224.66): icmp_seq=6 ttl=63 time=8.64 ms
For the record, I have an Ubuntu 12.04.1 virtual machine in the same VirtualBox/OS X environment that does not experience the issue.
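One way to narrow down where the extra ~5 seconds goes (purely a diagnostic sketch; the hostname is just an example) is to time name resolution and the TCP connect separately from inside the guest, since the per-packet ping times above do not include name resolution:
import socket
import time

host = "google.com"   # example target

t0 = time.time()
socket.getaddrinfo(host, 443)        # DNS resolution (A/AAAA lookups)
t1 = time.time()
print("DNS lookup:  %.3f s" % (t1 - t0))

sock = socket.create_connection((host, 443), timeout=30)   # TCP three-way handshake
t2 = time.time()
print("TCP connect: %.3f s" % (t2 - t1))
sock.close()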
Source: (StackOverflow)
I just came across the register keyword in C++, and I wondered: since this seems like a good idea (keeping certain variables in a register), surely the compiler does this by default?
So I wondered, is this keyword still used?
Source: (StackOverflow)
Through Appstats, I can see that my datastore queries are taking about 125 ms (API and CPU combined), but often there are long latencies (e.g. up to 12,000 ms) before the queries are executed.
I can see that the latency from the datastore is not related to my query (e.g. the same query/data has vastly different latencies), so I'm assuming that it's a scheduling issue with App Engine.
Are other people seeing this same problem?
Is there some way to reduce the latency (e.g. an admin console setting)?
Here's a screenshot from Appstats. This servlet has very little CPU processing. It does a getObjectByID and then does a datastore query. The query has an OR operator, so it's being converted into 3 queries by App Engine.
As you can see, it takes 6000 ms before the first getObjectByID is even executed. There is no processing before the get operation (other than getting pm). I thought this 6000 ms latency might be due to an instance warm-up, so I increased my idle instances to 2 to prevent any warm-ups.
Then there's a second latency of around 1000 ms between the getObjectByID and the query. There are zero lines of code between the get and the query. The code simply takes the result of the getObjectByID and uses the data as part of the query.
The grand total is 8097 ms, yet my datastore operations (and 99.99% of the servlet) are only 514 ms (45 ms API), though the numbers change every time I run the servlet. Here is another Appstats screenshot that was run on the same servlet against the same data.
Here are the basics of my Java code. I had to remove some of the details for security purposes.
user = pm.getObjectById(User.class, userKey);
//build queryBuilder.append(...
final Query query = pm.newQuery(UserAccount.class,queryBuilder.toString());
query.setOrdering("rating descending");
query.executeWithArray(args);
Edited:
Using Pingdom, I can see that GAE latency varies from 450 ms to 7,399 ms, or a 1,644% difference! This is with two idle instances and no users on the site.
Source: (StackOverflow)
I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses the data (e.g. by adding a context window). In other words, one thread does data preprocessing and the other does training. Is this possible in TensorFlow?
Update: I have a working example based on @mrry's example.
import numpy as np
import tensorflow as tf
import threading

BATCH_SIZE = 5
TRAINING_ITERS = 4100

# Placeholders used by the loader thread to feed one example at a time.
feature_input = tf.placeholder(tf.float32, shape=[128])
label_input = tf.placeholder(tf.float32, shape=[128])

# A queue holding up to 200 (label, feature) pairs, filled by the loader thread.
q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
enqueue_op = q.enqueue([label_input, feature_input])
label_batch, feature_batch = q.dequeue_many(BATCH_SIZE)

# A stand-in for the real training op.
c = tf.reshape(feature_batch, [BATCH_SIZE, 128]) + tf.reshape(label_batch, [BATCH_SIZE, 128])

sess = tf.Session()

def load_and_enqueue(sess, enqueue_op, coord):
    # Runs in a separate thread: read examples from disk and push them into the queue.
    with open('dummy_data/features.bin', 'rb') as feature_file, \
         open('dummy_data/labels.bin', 'rb') as label_file:
        while not coord.should_stop():
            feature_array = np.fromfile(feature_file, np.float32, 128)
            if feature_array.shape[0] == 0:
                print('reached end of file, resetting using seek(0,0)')
                feature_file.seek(0, 0)
                label_file.seek(0, 0)
                continue
            label_value = np.fromfile(label_file, np.float32, 128)
            sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                            label_input: label_value})

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
t.start()

for i in range(TRAINING_ITERS):
    sum = sess.run(c)
    print('train_iter=' + str(i))
    print(sum)

coord.request_stop()
coord.join([t])
Source: (StackOverflow)
I've been looking into ways to reduce the latency caused by transferring data back and forth between the CPU and GPU. When I first started using CUDA, I did notice that data transfer between the CPU and GPU took a few seconds, but I didn't really care, because this isn't really a concern for the small programs I've been writing. In fact, the latency probably isn't much of a problem for the vast majority of programs that utilize GPUs, video games included, because they're still a lot faster than if they had run on the CPU.
However, I'm a bit of an HPC enthusiast, and I became concerned with the direction of my studies when I saw the massive discrepancy between the Tianhe-I theoretical peak FLOPS and the actual LINPACK measured performance. This has me concerned about whether I'm taking the right career path.
Use of pinned (page-locked) memory through the cudaHostAlloc() function is one quite effective method of reducing latency, but are there any other techniques I'm not aware of? And to be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA's and AMD's job).
Just as a side question, I'm aware that Dell and HP sell Tesla servers. I'm curious as to how well a GPU is leveraged in a database application, where you would need constant reads from the hard drive (HDD or SSD), an operation only the CPU can perform.
Source: (StackOverflow)
I'm doing some performance tuning and capacity planning for a low-latency application and have the following question:
What is the theoretical minimum round-trip time for a packet sent between a host in London and one in New York connected via optical fiber?
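A rough back-of-the-envelope bound, assuming a great-circle distance of about 5,570 km and light travelling through silica fiber at roughly c / 1.47 ≈ 204,000 km/s:

    RTT_min ≈ 2 × 5,570 km / 204,000 km/s ≈ 0.0546 s ≈ 55 ms

Any real cable route is longer than the great circle and adds equipment delay, so measured round trips sit somewhat above this figure.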
Source: (StackOverflow)
What's the latency to GET an object from S3 from an EC2 instance?
For example, how many milliseconds pass before the actual data stream for that object starts streaming back when requesting an object by its full path?
- If the object exists on S3
- If the object does not exist and S3 has to send a 404
Latency, not Throughput.
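A minimal way to measure this yourself from an EC2 instance (a hedged sketch using boto3; the bucket and key names are placeholders) is to time the GET call up to the first byte of the body, for both an existing and a missing key:
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def time_get(bucket, key):
    # Time from issuing the GET to receiving the first byte (or the error response).
    start = time.perf_counter()
    try:
        resp = s3.get_object(Bucket=bucket, Key=key)
        resp["Body"].read(1)                                  # wait for the first byte of the stream
        outcome = "first byte"
    except ClientError as err:
        outcome = "error: " + err.response["Error"]["Code"]   # e.g. NoSuchKey for a missing object
    return outcome, (time.perf_counter() - start) * 1000.0

print(time_get("my-bucket", "existing-object"))   # placeholder names
print(time_get("my-bucket", "missing-object"))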
Source: (StackOverflow)
I run a couple of game tunnelling servers and would like to have a page where the client can run a ping against all the servers and find out which is the most responsive. As far as I can see, there seems to be no proper way to do this in JavaScript, but I was thinking: does anybody know of a way to do this in Flash or some other client-side browser technology?
Source: (StackOverflow)
I am reading a single data item from a UDP port. It's essential that this read be the lowest latency possible. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card and the callback method being invoked in my user code?
Boost is a very good library, but quite generic; is there a lower-latency alternative?
All opinions on writing low-latency UDP network programs are very welcome.
EDIT: Another question: is there a relatively feasible way to estimate the latency I'm experiencing between the NIC and user mode?
Source: (StackOverflow)
I saw this post on SO, which contains C code to get the latest CPU cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (Windows and Linux solutions welcome)? Although it is written in C (and C is largely a subset of C++), I am not too certain whether this code would work in a C++ project and, if not, how to translate it.
I am using x86-64.
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for Windows...?)
static inline uint64_t get_cycles()
{
    uint64_t t;
    __asm volatile ("rdtsc" : "=A"(t));
    return t;
}
EDIT3:
From the above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"
Could someone please help?
Source: (StackOverflow)