EzDevInfo.com

high-availability interview questions

Top high-availability frequently asked interview questions

Failover & Disaster Recovery [closed]

What's the difference between failover and disaster recovery?


Source: (StackOverflow)

Web App: High Availability / How to prevent a single point of failure?

Can someone explain to me how high availability ("HA") works for a web application ... because I assume HA means that there exists no single point of failure.

However, even if a load balancer is used, isn't that itself a single point of failure?
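The usual answers are a redundant pair of load balancers sharing a floating IP (e.g. via VRRP/keepalived), or DNS pointing at several balancer addresses so clients can fail over themselves. As a rough illustration of the client-side variant, here is a minimal Python sketch that tries a list of endpoints in order; the endpoint names and `request_fn` callback are hypothetical:

```python
def fetch_with_failover(endpoints, request_fn, retries_per_endpoint=1):
    """Try each endpoint in order; move to the next when one is unreachable.

    `endpoints` is an ordered list of host names (e.g. redundant load
    balancers); `request_fn` performs the actual request against one host.
    """
    last_error = None
    for host in endpoints:
        for _ in range(retries_per_endpoint):
            try:
                return request_fn(host)
            except OSError as e:  # covers ConnectionError, TimeoutError, socket errors
                last_error = e
    raise ConnectionError(f"all endpoints failed: {last_error}")
```

A caller might pass `["lb1.example.com", "lb2.example.com"]` so that rebooting one balancer only costs a retry, not an outage.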


Source: (StackOverflow)


Application upgrade in a high availability environment

I am writing a NoSQL database engine and I want to provide features that help developers upgrade their applications to a new version without stopping operation of the website, i.e. 0% downtime during an upgrade. So my question is: what are the methods, or the general design, for a web application that runs 24/7 and changes its database structure very often? Any examples or success stories would be greatly appreciated.
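One common pattern for this is "expand/contract": add the new structure alongside the old, backfill, dual-write during rollout, and only drop the old structure once no running code uses it. A minimal sketch of the idea using SQLite (the table and column names are made up for illustration):

```python
import sqlite3

# Expand/contract sketch: "rename" a column with zero downtime by adding the
# new column first, backfilling, and dual-writing until every app instance
# runs the new code. Only then is the old column dropped, in a later release.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Step 1 (expand): add the new column; old code keeps working unchanged.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2 (backfill): copy existing data across, typically in the background.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 3 (dual-write): new code writes both columns during the rollout window.
conn.execute(
    "INSERT INTO users (fullname, display_name) VALUES ('Grace Hopper', 'Grace Hopper')"
)

# Step 4 (contract): once no reader touches `fullname`, drop it in a later release.
rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace',), ('Grace Hopper',)]
```

Each step is individually safe to deploy, which is what makes the whole sequence compatible with 24/7 operation.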


Source: (StackOverflow)

ZooKeeper alternatives? (cluster coordination service) [closed]

ZooKeeper is a highly available coordination service for data centers. It originated in the Hadoop project. One can implement locking, fail over, leader election, group membership and other coordination issues on top of it. Are there any alternatives to ZooKeeper? (free software of course)


Source: (StackOverflow)

Which part of the CAP theorem does Cassandra sacrifice and why?

There is a great talk here about simulating partition issues in Cassandra with Kyle Kingsbury's Jepsen library.

My question is: with Cassandra, are you mainly concerned with the partition-tolerance part of the CAP theorem, or is consistency a factor you need to manage as well?
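Worth noting when thinking about this: Cassandra makes consistency tunable per request. With replication factor N, a read quorum of R replicas and a write quorum of W, reads are guaranteed to see the latest write when R + W > N, because the two quorums must overlap in at least one replica. The arithmetic (this is just the rule, not Cassandra client code):

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees that read and write quorums overlap
    in at least one replica, so a read sees the latest write."""
    return r + w > n

# With a typical replication factor of 3:
print(is_strongly_consistent(3, 2, 2))  # QUORUM reads + QUORUM writes -> True
print(is_strongly_consistent(3, 1, 1))  # ONE reads + ONE writes -> False
```

So consistency is something you manage explicitly: at lower consistency levels Cassandra stays available under partition but may serve stale reads.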


Source: (StackOverflow)

Name node vs. secondary name node

Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem.

Hadoop is not available because all the nodes are dependent on the name node. If the name node fails, the cluster goes down.

But considering that an HDFS cluster has a secondary name node, why can't we call Hadoop available? If the name node is down, couldn't the secondary name node be used for writes?

What is the major difference between the name node and the secondary name node that makes Hadoop unavailable?

Thanks in advance.


Source: (StackOverflow)

How do you update a live, busy web site in the politest way possible?

When you roll out changes to a live web site, how do you go about checking that the live system is working correctly? Which tools do you use? Who does it? Do you block access to the site for the testing period? What amount of downtime is acceptable?


Source: (StackOverflow)

How to Guarantee Message delivery with Celery?

I have a python application where I want to start doing more work in the background so that it will scale better as it gets busier. In the past I have used Celery for doing normal background tasks, and this has worked well.

The only difference between this application and the others I have done in the past is that I need to guarantee that these messages are processed, they can't be lost.

For this application I'm not too concerned about the speed of my message queue; I need reliability and durability first and foremost. To be safe I want to have two queue servers, in different data centers in case something goes wrong, one a backup of the other.

Looking at Celery, it appears to support a bunch of different backends, some with more features than others. The two most popular look like Redis and RabbitMQ, so I took some time to examine them further.

RabbitMQ: supports durable queues and clustering, but the problem with the way clustering works today is that if you lose a node in the cluster, all messages on that node are unavailable until you bring it back online. It doesn't replicate the messages between the different nodes in the cluster; it only replicates the metadata about the message and then goes back to the originating node to fetch it. If that node isn't running, you are out of luck. Not ideal.

The way they recommend to get around this is to set up a second server, replicate the file system using DRBD, and then run something like Pacemaker to switch clients to the backup server when needed. This seems pretty complicated; I'm not sure if there is a better way. Anyone know of one?

Redis: supports a read slave, which would give me a backup in case of emergencies, but it doesn't support a master-master setup, and I'm not sure whether it handles active failover between master and slave. It doesn't have the same features as RabbitMQ, but it looks much easier to set up and maintain.

Questions:

  1. What is the best way to set up Celery so that it will guarantee message processing?

  2. Has anyone done this before? If so, would you mind sharing what you did?
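For reference, the Celery settings usually involved in at-least-once delivery look roughly like the fragment below. This is a sketch, not a complete configuration: the broker URL is hypothetical, and the setting names are the old-style (pre-4.0) uppercase ones.

```python
# celeryconfig.py -- a sketch of the knobs relevant to at-least-once delivery.

# Hypothetical RabbitMQ broker; queues are durable by default in Celery.
BROKER_URL = "amqp://guest@rabbit-primary//"

# Acknowledge the message only *after* the task finishes, so a worker crash
# mid-task puts the message back on the queue instead of losing it.
CELERY_ACKS_LATE = True

# Don't let a worker prefetch a batch of messages it hasn't started yet;
# a crashed worker then holds at most one unacknowledged message per process.
CELERYD_PREFETCH_MULTIPLIER = 1
```

Note that at-least-once delivery implies possible duplicates after a crash, so tasks configured this way need to be idempotent.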


Source: (StackOverflow)

Scala + Akka: How to develop a Multi-Machine Highly Available Cluster

We're developing a server system in Scala + Akka for a game that will serve clients on Android, iPhone, and Second Life. There are parts of this server that need to be highly available, running on multiple machines. If one of those servers dies (due to, say, hardware failure), the system needs to keep running. I think I want the clients to have a list of machines they will try to connect with, similar to how Cassandra works.

The multi-node examples I've seen so far with Akka seem to be centered on scalability rather than high availability (at least with regard to hardware). The multi-node examples seem to always have a single point of failure. For example, there are load balancers, but if I need to reboot one of the machines running a load balancer, my system will suffer some downtime.

Are there any examples that show this type of hardware fault tolerance for Akka? Or, do you have any thoughts on good ways to make this happen?

So far, the best answer I've been able to come up with is to study the Erlang OTP docs, meditate on them, and try to figure out how to put my system together using the building blocks available in Akka.

But if there are resources, examples, or ideas on how to share state between multiple machines in a way that if one of them goes down things keep running, I'd sure appreciate them, because I'm concerned I might be re-inventing the wheel here. Maybe there is a multi-node STM container that automatically keeps the shared state in sync across multiple nodes? Or maybe this is so easy to make that the documentation doesn't bother showing examples of how to do it, or perhaps I haven't been thorough enough in my research and experimentation yet. Any thoughts or ideas will be appreciated.


Source: (StackOverflow)

Design Patterns (or techniques) for Scalability

What design patterns or techniques have you used that are specifically geared toward scalability?

Patterns such as the Flyweight pattern seem to me to be a specialized version of the Factory Pattern, to promote high scalability or when working within memory or storage constraints.

What others have you used? (Denormalization of Databases, etc.) Do you find that the rules change when high availability or scalability is your primary goal?

Possible situations are:

  • Mobile devices with more limited memory, processing power, and connectivity than a Desktop or Laptop
  • High # of users on limited hardware (caching strategies, etc)
  • Optimization of database schema for efficiency in lieu of a normalized design (e.g. SharePoint column wrapping for storage)

Source: (StackOverflow)

redis: Handling failover?

Redis really seems like a great product, with built-in replication and amazing speed. After testing it, it definitely feels like the 2010 replacement for memcached.

However, when memcached is normally used, consistent hashing spreads the data evenly across the servers in a pool. If one of the servers in the pool goes down and stops being accessible, the loss is handled transparently, and only the keys that were on the lost server are recreated and evenly distributed across the remaining available servers in the pool.
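The consistent-hashing behaviour described above fits in a few lines of Python. This is an illustrative sketch only (real client libraries add weighting, better hash distribution, and so on); the key property is that removing a server only remaps the keys that lived on it:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: losing a server remaps only its own keys."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas      # virtual nodes per server, for even spread
        self.ring = []                # sorted list of (hash, server) points
        for s in servers:
            self.add(s)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{server}:{i}"), server))

    def remove(self, server: str) -> None:
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[i % len(self.ring)][1]
```

With `HashRing(["s1", "s2", "s3"])`, calling `remove("s2")` leaves every key that mapped to `s1` or `s3` exactly where it was.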

Redis, on the other hand, also has built-in sharding, plus another really interesting feature: automatic replication. Thanks to that, availability of the data can be greatly increased by keeping slave servers to fall back on when things go badly wrong.

However, I have not yet found any good solution for promoting a Redis slave to become the new master automatically, or for otherwise handling failover with Redis automatically.

How could this be done? What would be an appropriate approach to this?


Source: (StackOverflow)

How to design and verify distributed systems?

I've been working on a project, which is a combination of an application server and an object database, and is currently running on a single machine only. Some time ago I read a paper which describes a distributed relational database, and got some ideas on how to apply the ideas in that paper to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.

My problem is that I don't have experience designing distributed systems and their protocols; I did not take the advanced CS courses on distributed systems at university. So I'm worried about being able to design a protocol that does not suffer from deadlock, starvation, split brain, and other problems.

Question: where can I find good material about designing distributed systems? What methods are there for verifying that a distributed protocol works correctly? Recommendations of books, academic articles, and other resources are welcome.


Source: (StackOverflow)

ElasticSearch 1.6 seems to lose documents during high availability test

As part of an investigation into using ElasticSearch as a reliable document store from a Java application, I'm running a basic HA test as follows:

I set up a minimal cluster using a readily available Docker image of ElasticSearch 1.6 (https://registry.hub.docker.com/_/elasticsearch), with:

  • 2 master/data nodes
  • 1 client node (so as to always have something to connect to)

Then I run a small loader app that inserts 500,000 documents of ~1KB each.

This takes approximately 1 minute and a half on my machine. During this time, I restart the current master node (docker restart).

At the end of the run, the Java API has responded OK to 100% of my queries, but when I check the document count with a curl request, a few documents are missing (somewhere between 2 and 10, depending on the run).

Even with an explicit "_refresh" request on the index, my document count is the same.

My main concern, of course, is not that some documents cannot be stored during a crash, but rather the positive result returned by the API (especially since I'm testing with WriteConsistencyLevel.ALL).

I'm aware of this ticket, but unsure whether it applies to my basic scenario.

My inserts are done as follows:

client.prepareUpdate("test", "test", id)
      .setDoc(doc).setUpsert(doc)
      .setConsistencyLevel(WriteConsistencyLevel.ALL)
      .execute.get.isCreated == true

The rest of the code can be found here: https://github.com/joune/nosql/blob/master/src/main/scala/ap.test.nosql/Loader.scala

Please advise if you think I'm doing something obviously wrong.

(I know some will reply that considering ElasticSearch as a reliable document store is plain wrong, but that's the goal of the study and not the kind of answer I expect)


Update: additional logs, as requested by Andrei Stefan

> grep discovery.zen.minimum_master_nodes elasticsearch.yml
discovery.zen.minimum_master_nodes: 2

> curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient":{"logger._root":"DEBUG"}}'
{"acknowledged":true,"persistent":{},"transient":{"logger":{"_root":"DEBUG"}}}
> curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient": {"logger.index.translog":"TRACE"}}'
{"acknowledged":true,"persistent":{},"transient":{"logger":{"index":{"translog":"TRACE"}}}}

Run test with 200,000 entries:

0 KO | 200000 OK
> curl -XGET 'localhost:9200/test/test/_count?preference=_primary'
{"count":199991,"_shards":{"total":5,"successful":5,"failed":0}}

I've put the logs here: https://gist.github.com/ab1ed844f2038c30e63b


Source: (StackOverflow)

Mobile detection on a high-traffic site

I have a high-traffic website (1+ million visitors per day) and I need to detect their user agents. I have a list of over 1000 mobile devices.

I run memcached to output dynamic content based on which page they access and the params they pass, e.g.:

/document/page/1?textsize=large

I don't have static pages, nor can I use sub-domains.

I found different scripts that check the user agent:

http://www.mobile-phone-specs.com/user-agent-browser/0/

http://detectmobilebrowsers.mobi/

http://detectmobilebrowsers.com/

My question is: performing these checks every time a page loads will make my site slow with the traffic I get.

Edit: I need to know in my PHP code whether or not it's a mobile browser.

How can I make this check run faster?
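The standard answer is to compile the detection pattern once and cache the verdict per distinct user-agent string, since the same few thousand UA strings repeat across millions of requests. A sketch of the idea in Python (in PHP the equivalent would be a static regex plus an APCu or memcached lookup); the pattern here is a tiny stand-in for the 1000-device list:

```python
import re
from functools import lru_cache

# Tiny illustrative pattern -- a real list would cover the ~1000 devices mentioned.
_MOBILE_RE = re.compile(r"(iphone|android|blackberry|windows phone|opera mini)", re.I)

@lru_cache(maxsize=10000)
def is_mobile(user_agent: str) -> bool:
    """The regex runs once per distinct user-agent string; repeats hit the cache."""
    return bool(_MOBILE_RE.search(user_agent))
```

Since distinct UA strings number in the thousands while requests number in the millions, almost every page load becomes a single hash lookup instead of a 1000-device scan.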


Source: (StackOverflow)

Looking for a scalable "at" implementation

I'm looking for a scalable "at" replacement with high availability. It must support adding and removing jobs at runtime.

Some background: I have an application where I trigger millions of events, each event occurs just once. I don't need cron like mechanisms (first Sunday of the month, etc), simply date, time and context.

Currently I'm using the Quartz scheduler, and while it is a very good project, it has difficulty handling the number of events we throw at it, even after a lot of tweaking (sharding, increasing the polling interval, etc.), due to the basic locking it performs on the underlying database. Also, it is a bit of an overkill for us, as basically we have millions of one-time triggers and a relatively small number of jobs.

I'd appreciate any suggestions.


Source: (StackOverflow)