failover interview questions
Top frequently asked failover interview questions
I'm interested in hearing people's thoughts about the pros and cons of database mirroring vs. log shipping in this scenario: we need to set up a database backup situation in which there is exactly one secondary server that need not automatically take over when the primary fails. Recovering and starting with the secondary should not take too long, though.
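To make the recovery-time question concrete, here is a rough T-SQL sketch (my own illustration; the database name and backup path are placeholders, not from the question): a manual mirroring failover is a single command on the principal, while a log-shipped secondary has to be restored and recovered by hand.
-- Database mirroring: manual failover, run on the current principal
-- (assumes mirroring is already configured and synchronized for [AppDb]).
ALTER DATABASE [AppDb] SET PARTNER FAILOVER;
-- Log shipping: bring the secondary online after the primary is lost.
-- Apply any remaining transaction-log backups, then recover the database.
RESTORE LOG [AppDb] FROM DISK = N'\\backupshare\AppDb_tail.trn' WITH NORECOVERY;
RESTORE DATABASE [AppDb] WITH RECOVERY;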
Source: (StackOverflow)
I am trying to create a Redis message bus failover scenario with a SignalR app.
At first, we tried a simple hardware load-balancer failover that simply monitored two Redis servers. The SignalR application pointed to the single HLB endpoint. I then failed one server, but was unable to get any messages through on the second Redis server without recycling the SignalR app pool. Presumably this is because the app needs to issue its setup commands to the new Redis message bus.
As of SignalR RC1, Microsoft.AspNet.SignalR.Redis.RedisMessageBus uses Booksleeve's RedisConnection() to connect to a single Redis for pub/sub. I created a new class, RedisMessageBusCluster(), that uses Booksleeve's ConnectionUtils.Connect() to connect to one in a cluster of Redis servers.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using BookSleeve;
using Microsoft.AspNet.SignalR.Infrastructure;

namespace Microsoft.AspNet.SignalR.Redis
{
    /// <summary>
    /// WIP: Getting scaleout for Redis working
    /// </summary>
    public class RedisMessageBusCluster : ScaleoutMessageBus
    {
        private readonly int _db;
        private readonly string[] _keys;
        private RedisConnection _connection;
        private RedisSubscriberConnection _channel;
        private Task _connectTask;
        private readonly TaskQueue _publishQueue = new TaskQueue();

        public RedisMessageBusCluster(string serverList, int db, IEnumerable<string> keys, IDependencyResolver resolver)
            : base(resolver)
        {
            _db = db;
            _keys = keys.ToArray();

            // uses a list of connections
            _connection = ConnectionUtils.Connect(serverList);
            //_connection = new RedisConnection(host: server, port: port, password: password);

            _connection.Closed += OnConnectionClosed;
            _connection.Error += OnConnectionError;

            // Start the connection - TODO: this Open could be removed since the connection
            // is already opened, but _connectTask is used later on
            _connectTask = _connection.Open().Then(() =>
            {
                // Create a subscription channel in redis
                _channel = _connection.GetOpenSubscriberChannel();

                // Subscribe to the registered connections
                _channel.Subscribe(_keys, OnMessage);

                // Dirty hack but it seems like subscribe returns before the actual
                // subscription is properly setup in some cases
                while (_channel.SubscriptionCount == 0)
                {
                    Thread.Sleep(500);
                }
            });
        }
        protected override Task Send(Message[] messages)
        {
            return _connectTask.Then(msgs =>
            {
                var taskCompletionSource = new TaskCompletionSource<object>();

                // Group messages by source (connection id)
                var messagesBySource = msgs.GroupBy(m => m.Source);

                SendImpl(messagesBySource.GetEnumerator(), taskCompletionSource);

                return taskCompletionSource.Task;
            },
            messages);
        }

        private void SendImpl(IEnumerator<IGrouping<string, Message>> enumerator, TaskCompletionSource<object> taskCompletionSource)
        {
            if (!enumerator.MoveNext())
            {
                taskCompletionSource.TrySetResult(null);
            }
            else
            {
                IGrouping<string, Message> group = enumerator.Current;

                // Get the channel index we're going to use for this message
                int index = Math.Abs(group.Key.GetHashCode()) % _keys.Length;

                string key = _keys[index];

                // Increment the channel number
                _connection.Strings.Increment(_db, key)
                    .Then((id, k) =>
                    {
                        var message = new RedisMessage(id, group.ToArray());
                        return _connection.Publish(k, message.GetBytes());
                    }, key)
                    .Then((enumer, tcs) => SendImpl(enumer, tcs), enumerator, taskCompletionSource)
                    .ContinueWithNotComplete(taskCompletionSource);
            }
        }

        private void OnConnectionClosed(object sender, EventArgs e)
        {
            // Should we auto reconnect?
            if (true)
            {
                ;
            }
        }

        private void OnConnectionError(object sender, BookSleeve.ErrorEventArgs e)
        {
            // How do we bubble errors?
            if (true)
            {
                ;
            }
        }

        private void OnMessage(string key, byte[] data)
        {
            // The key is the stream id (channel)
            var message = RedisMessage.Deserialize(data);
            _publishQueue.Enqueue(() => OnReceived(key, (ulong)message.Id, message.Messages));
        }

        protected override void Dispose(bool disposing)
        {
            if (disposing)
            {
                if (_channel != null)
                {
                    _channel.Unsubscribe(_keys);
                    _channel.Close(abort: true);
                }

                if (_connection != null)
                {
                    _connection.Close(abort: true);
                }
            }

            base.Dispose(disposing);
        }
    }
}
Booksleeve has its own mechanism for determining a master and will automatically fail over to another server, and I am now testing this with SignalR.Chat.
In web.config, I set the list of available servers:
<add key="redis.serverList" value="dbcache1.local:6379,dbcache2.local:6379"/>
Then in Application_Start():
// Redis cluster server list
string redisServerlist = ConfigurationManager.AppSettings["redis.serverList"];
List<string> eventKeys = new List<string>();
eventKeys.Add("SignalR.Redis.FailoverTest");
GlobalHost.DependencyResolver.UseRedisCluster(redisServerlist, eventKeys);
I added two additional methods to Microsoft.AspNet.SignalR.Redis.DependencyResolverExtensions:
public static IDependencyResolver UseRedisCluster(this IDependencyResolver resolver, string serverList, IEnumerable<string> eventKeys)
{
    return UseRedisCluster(resolver, serverList, db: 0, eventKeys: eventKeys);
}

public static IDependencyResolver UseRedisCluster(this IDependencyResolver resolver, string serverList, int db, IEnumerable<string> eventKeys)
{
    var bus = new Lazy<RedisMessageBusCluster>(() => new RedisMessageBusCluster(serverList, db, eventKeys, resolver));
    resolver.Register(typeof(IMessageBus), () => bus.Value);
    return resolver;
}
Now the problem is this: if I keep several breakpoints enabled until after a user name has been added, and then disable all breakpoints, the application works as expected. However, with the breakpoints disabled from the beginning, there seems to be some race condition that fails during the connection process.
Thus, in RedisMessageBusCluster():
// Start the connection
_connectTask = _connection.Open().Then(() =>
{
    // Create a subscription channel in redis
    _channel = _connection.GetOpenSubscriberChannel();

    // Subscribe to the registered connections
    _channel.Subscribe(_keys, OnMessage);

    // Dirty hack but it seems like subscribe returns before the actual
    // subscription is properly setup in some cases
    while (_channel.SubscriptionCount == 0)
    {
        Thread.Sleep(500);
    }
});
I tried adding both a Task.Wait and even an additional Sleep() (not shown above) - they did wait, but I am still getting errors.
The recurring error seems to be in Booksleeve.MessageQueue.cs ~ln 71:
A first chance exception of type 'System.InvalidOperationException' occurred in BookSleeve.dll
iisexpress.exe Error: 0 : SignalR exception thrown by Task: System.AggregateException: One or more errors occurred. ---> System.InvalidOperationException: The queue is closed
at BookSleeve.MessageQueue.Enqueue(RedisMessage item, Boolean highPri) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\MessageQueue.cs:line 71
at BookSleeve.RedisConnectionBase.EnqueueMessage(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 910
at BookSleeve.RedisConnectionBase.ExecuteInt64(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 826
at BookSleeve.RedisConnection.IncrementImpl(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 277
at BookSleeve.RedisConnection.BookSleeve.IStringCommands.Increment(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 270
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.SendImpl(IEnumerator`1 enumerator, TaskCompletionSource`1 taskCompletionSource) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 90
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.<Send>b__2(Message[] msgs) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 67
at Microsoft.AspNet.SignalR.TaskAsyncHelper.GenericDelegates`4.<>c__DisplayClass57.<ThenWithArgs>b__56() in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 893
at Microsoft.AspNet.SignalR.TaskAsyncHelper.TaskRunners`2.<>c__DisplayClass42.<RunTask>b__41(Task t) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 821
--- End of inner exception stack trace ---
---> (Inner Exception #0) System.InvalidOperationException: The queue is closed
at BookSleeve.MessageQueue.Enqueue(RedisMessage item, Boolean highPri) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\MessageQueue.cs:line 71
at BookSleeve.RedisConnectionBase.EnqueueMessage(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 910
at BookSleeve.RedisConnectionBase.ExecuteInt64(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 826
at BookSleeve.RedisConnection.IncrementImpl(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 277
at BookSleeve.RedisConnection.BookSleeve.IStringCommands.Increment(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 270
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.SendImpl(IEnumerator`1 enumerator, TaskCompletionSource`1 taskCompletionSource) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 90
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.<Send>b__2(Message[] msgs) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 67
at Microsoft.AspNet.SignalR.TaskAsyncHelper.GenericDelegates`4.<>c__DisplayClass57.<ThenWithArgs>b__56() in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 893
at Microsoft.AspNet.SignalR.TaskAsyncHelper.TaskRunners`2.<>c__DisplayClass42.<RunTask>b__41(Task t) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 821<---
public void Enqueue(RedisMessage item, bool highPri)
{
    lock (stdPriority)
    {
        if (closed)
        {
            throw new InvalidOperationException("The queue is closed");
        }
Where a closed queue exception is being thrown.
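One direction I can sketch, purely as an assumption and not a verified fix: have OnConnectionClosed rebuild the connection and the subscription, so that later Send calls don't land on a queue BookSleeve has already closed. The _serverList field below is hypothetical and would have to be captured in the constructor; everything else reuses the calls already shown above.
// Hypothetical reconnect handler for RedisMessageBusCluster (untested sketch).
private void OnConnectionClosed(object sender, EventArgs e)
{
    // Reconnect using the same server list and redo the subscription setup.
    // _serverList is a hypothetical field stored by the constructor.
    _connection = ConnectionUtils.Connect(_serverList);
    _connection.Closed += OnConnectionClosed;
    _connection.Error += OnConnectionError;

    _connectTask = _connection.Open().Then(() =>
    {
        _channel = _connection.GetOpenSubscriberChannel();
        _channel.Subscribe(_keys, OnMessage);
    });
}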
I foresee another issue: since the Redis connection is made in Application_Start(), there may be some issues in "reconnecting" to another server. I think that concern applies when using the singular RedisConnection(), where there is only one connection to choose from. With the introduction of ConnectionUtils.Connect(), however, I'd like to hear from @dfowler or the other SignalR guys on how this scenario is handled in SignalR.
Source: (StackOverflow)
Can someone provide a really simple (but not simpler than possible) explanation of a transaction as applied to computing (even if copied from Wikipedia)?
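As a concrete illustration (my own example; the table and column names are made up), a transaction groups several statements so that either all of them take effect or none of them do:
-- Classic bank-transfer example: the two UPDATEs are applied atomically --
-- either both are committed together or, on rollback, neither is.
BEGIN TRANSACTION;
    UPDATE Accounts SET Balance = Balance - 100 WHERE Id = 1;
    UPDATE Accounts SET Balance = Balance + 100 WHERE Id = 2;
COMMIT TRANSACTION;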
Source: (StackOverflow)
Does anyone know how to determine the active node of a SQL Active-Passive Failover Cluster programmatically from T-SQL?
@@SERVERNAME only returns the virtual server name, which is identical on both nodes.
I don't plan to make any decisions based on the data - I trust the failover to do its thing - but I would like to include the information in an event log so I can tell which node in the cluster was active when the event occurred, or help determine if exceptions come up as a result of a failover.
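For what it's worth (not from the question, and worth verifying on the SQL Server version in use), SERVERPROPERTY can report the physical node rather than the virtual name:
-- Returns the Windows computer name of the node currently hosting the instance,
-- unlike @@SERVERNAME, which returns the virtual server name.
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS ActiveNode;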
Source: (StackOverflow)
Here's my scenario (designed by my predecessor):
Two Apache servers serving reverse proxy duty for a number of mixed backend web servers (Apache, IIS, Tomcat, etc.). There are some sites for which we have multiple backend web servers, and in those cases, we do something like:
<Proxy balancer://www.example.com>
    BalancerMember http://192.168.1.40:80
    BalancerMember http://192.168.1.41:80
</Proxy>

<VirtualHost *:80>
    ServerName www.example.com:80
    CustomLog /var/log/apache2/www.example.com.log combined

    <Location />
        Order allow,deny
        Allow from all
        ProxyPass balancer://www.example.com/
        ProxyPassReverse balancer://www.example.com/
    </Location>
</VirtualHost>
So in this example, I've got one site (www.example.com) in the proxy servers' configs, and that site is proxied to one or the other of the two backend servers, 192.168.1.40 and .41.
I'm evaluating this to make sure that we are fault tolerant on all of our web services (I've already put the two reverse proxy servers into a shared IP cluster for this reason), and I want to make sure that the load-balanced backend servers are fault tolerant as well. But I'm having trouble figuring out if backend failure detection (and the logic to avoid the failed backend server) is built into the mod_proxy_balancer module...
So if 192.168.1.40 goes down, will Apache detect this (I'll understand if it takes a failed request first) and automatically route all requests to the other backend, 192.168.1.41? Or will it continue to balance requests between the failed backend and the operational backend?
I've found some clues in the Apache documentation for mod_proxy and mod_proxy_balancer that seem to indicate that failure can be detected ("maxattempts = Maximum number of failover attempts before giving up.", "failonstatus = A single or comma-separated list of HTTP status codes. If set this will force the worker into error state when the backend returns any status code in the list."), but after a few days of searching, I've found nothing conclusive saying for sure that it will (or at least "should") detect backend failure and recovery.
I will say that most of the search results reference using the AJP protocol to pass the traffic to the backend servers, and this apparently does support failure detection-- but my backends are a mixture of Apache, IIS, Tomcat and others, and I am fairly sure that many of them don't support AJP. They are also a mixture of Windows 2k3/2k8 and Linux (mostly Ubuntu Lucid) boxes running various different applications with various different requirements, so add-on modules like Backhand and LVS aren't an option for me.
I've also tried to empirically test this feature, by creating a new test site like this:
<Proxy balancer://test.example.com>
    BalancerMember http://192.168.1.40:80
    BalancerMember http://192.168.1.200:80
</Proxy>

<VirtualHost *:80>
    ServerName test.example.com:80
    CustomLog /var/log/apache2/test.example.com.log combined
    LogLevel debug

    <Location />
        Order allow,deny
        Allow from all
        ProxyPass balancer://test.example.com/
        ProxyPassReverse balancer://test.example.com/
    </Location>
</VirtualHost>
Where 192.168.1.200 is a bogus address that isn't running any web server, to simulate a backend failure. The test site was served up without a problem for a bunch of different client machines, but even with the LogLevel set to debug, I didn't see anything logged to indicate that it detected that one of the backend servers was down... And I'd like to make 100% sure that I can take our load-balanced backends down for maintenance (one at a time, of course) without affecting production sites.
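For reference, and as an assumption to verify against the mod_proxy documentation for the Apache version in use rather than a confirmed answer: a member that fails at the connection level should be put into error state and skipped for retry seconds, and making those worker parameters explicit would look something like this:
<Proxy balancer://test.example.com>
    # retry: seconds a member stays in error state before being probed again.
    # timeout: seconds to wait for the backend before the request is treated as failed.
    BalancerMember http://192.168.1.40:80 retry=30 timeout=5
    BalancerMember http://192.168.1.200:80 retry=30 timeout=5
</Proxy>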
Source: (StackOverflow)
I am a beginner in Node.js. I am trying to use Node.js in production and want to achieve failover. Since I am running a chat app, if a Node server fails the chat should not break: clients should automatically be connected to a different Node server, and the same socket id should be used for further chatting so that chat messages aren't lost. Can this be achieved? Any samples?
I should not use Nginx/HAProxy. Also, let me know how the node servers should be set up: Active-Active or Active-Passive.
Source: (StackOverflow)
I have to setup a database that can handle failover (if one crashes the other takes over). For that, I decided to use mongodb:
I set up a replica set with two instances. Each instance is running on a separate VM. I have several questions:
- It is recommended to use at least 3 instances in a replica set. Is it OK to use only two?
- I have two instances, and therefore two IP addresses. Which IP should I give to my application that will need to read/write in the database? When a database is down, how will requests be redirected to the instance that is still up?
Some help to get started would be great!
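A sketch of the usual direction (my assumptions: the host names, port, and replica set name below are placeholders): add a small third member as an arbiter so elections can still reach a majority when one data node is down, and give the application the whole seed list rather than a single IP so the driver follows whichever member is primary.
// From the mongo shell on an existing member: add a data-less arbiter.
rs.addArb("arbiter.local:27017")

// Application connection string: list both data-bearing members; the driver
// discovers the primary and redirects reads/writes when a failover happens.
// mongodb://vm1.local:27017,vm2.local:27017/mydb?replicaSet=rs0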
Source: (StackOverflow)
I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS.
I am trying to configure the JDBC connection pool in order to minimize downtime in case of database failover.
My current configuration isn't working correctly during a Multi-AZ failover, as the standby database instance appears to be available in a couple of minutes (according to the AWS console) while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.
The connection pool is configured like this:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool
If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish will invalidate the broken connections, throw some exceptions and then reconnect as soon the database is available. In this setup I get less than 1 minute of downtime.
If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exception at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:
Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 940,715 milliseconds ago. The last packet sent successfully to the server was 935,598 milliseconds ago.
It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception and so there's no chance to perform connection validation.
Downtime in the Multi-AZ case is always between 15-16 minutes, so it looks like a timeout of some sort but I was unable to change it.
Things I have tried without success:
- connection leak timeout/reclaim
- statement leak timeout/reclaim
- statement timeout
- using a different validation method
- using MysqlDataSource instead of MysqlConnectionPoolDataSource
How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced?
Or how can I let GlassFish detect a database failover?
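One thing that may be worth trying (my assumption; connectTimeout and socketTimeout are MySQL Connector/J properties in milliseconds, and the values are only examples): give the driver its own network timeouts so a thread stuck on a socket to the failed primary gets an exception instead of blocking until the OS gives up.
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
  --datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
  --isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
  --property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT:connectTimeout=10000:socketTimeout=30000 \
  MyPool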
Source: (StackOverflow)
I am using mod_proxy_balancer to manage failover of backend servers. Backend servers may return an error code instead of timing out when some other backend service fails such as NFS and we want such servers also to be marked as failed nodes. Hence we are using failonstatus directive.
<Proxy balancer://avatar>
    ProxySet failonstatus=503
    BalancerMember http://active/ retry=30
    # the hot standby
    BalancerMember http://standby/ status=+H retry=0
</Proxy>
Currently the failover works perfectly, with one glitch: when the active node fails, the user gets a 503 error, and only from the next request onward does the Standby server take over.
I don't want even a single request to fail, though. Can't mod_proxy fail over without ever returning an error to the client? If the active node fails, I want mod_proxy to try the Standby for the same request, not just for subsequent requests!
Source: (StackOverflow)
I've been searching (with little success) for a free/open-source session clustering and replication solution for ASP.NET. I've run across the usual suspects (Indexus SharedCache, memcached); however, each has some limitations.
- Indexus - Very immature, stubbed session interface implementation. It's otherwise a great caching solution, though.
- Memcached - Little replication/failover support without going to a db backend.
- Several SF.net projects - All aborted in the early stages... nothing that appears to have any traction, and one which seems to have gone all commercial.
- Microsoft Velocity - Not OSS, but seems nice. Unfortunately, I didn't see where CTP1 supported failover, and there is no clear roadmap for this one. I fear that this one could fall off into the ether like many other MS dev projects.
I am fairly used to the Java world where it is kind of taken for granted that many solutions to problems such as this will be available from the FOSS world.
Are there any suitable alternatives available in the .NET world?
Source: (StackOverflow)
I tried to run a simple app in a distributed manner to test failover-takeover features but failed.
What I want:
The application is myapp_api, which exposes a REST API; it has the myapp application as a dependency. I want to start myapp_api on 3 nodes, and I want the whole app (myapp_api + myapp) to be running on only one node at a time.
What is wrong:
The main app (myapp_api) works as expected: it runs on only one node, with failover and takeover. But for some reason the dependent myapp always starts on every node. I want it to run on only one node at a time.
What I do:
My config for the first node as an example.
[
 {kernel,
  [{distributed, [{myapp_api,
                   1000,
                   ['n1@myhost', {'n2@myhost', 'n3@myhost'}]}]},
   {sync_nodes_optional, ['n2@myhost', 'n3@myhost']},
   {sync_nodes_timeout, 5000}
  ]}
].
I call
erl -sname nI -config nI.config -pa apps/*/ebin deps/*/ebin -s myapp_api
at every node.
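In case it turns out to be the missing piece (I have not verified that it changes the takeover behaviour here), the syntax for placing more than one application under the distributed kernel parameter, so that dist_ac controls where each of them runs, looks like this:
{kernel,
 [{distributed, [{myapp_api, 1000, ['n1@myhost', {'n2@myhost', 'n3@myhost'}]},
                 {myapp,     1000, ['n1@myhost', {'n2@myhost', 'n3@myhost'}]}]},
  {sync_nodes_optional, ['n2@myhost', 'n3@myhost']},
  {sync_nodes_timeout, 5000}
 ]}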
Source: (StackOverflow)
What would be the best way to replicate individual DB tables from a master PostgreSQL server to a slave machine? It can be done with cron+rsync, with whatever PostgreSQL might have built in, or with some sort of OSS tool, but so far the Postgres docs don't seem to cover how to do table replication. I'm not able to do full DB replication because some tables have license->IP stuff connected to them, and I can't replicate those on the slave machine. I don't need instant replication; hourly would be acceptable as well.
If I need to just rsync, can someone help identify which files within the /var/lib/pgsql directory would need to be synced, or how I would know which tables they correspond to.
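If hourly really is acceptable, a per-table dump piped to the slave is a much simpler fallback than trying to rsync files under /var/lib/pgsql (the data files there are not usable on a per-table basis). A sketch, where the table, database, and host names are placeholders:
# Cron entry: copy one table from the master to the slave once an hour.
# --clean drops and recreates the table on the slave before reloading it.
0 * * * * pg_dump -h master.example.com -t public.orders --clean mydb | psql -h slave.example.com mydb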
Source: (StackOverflow)
In David Fowler's blog, SQL Server has been added to the list of scale out providers for service bus.
I am in the process of implementing Redis on our Windows servers. Based on what I know about Redis, I'm guessing it will be significantly faster than using SQL Server - is that a fair assumption?
If so, how does the Windows version of Redis implement fail-over?
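For reference, and assuming the Windows port keeps the standard Redis replication directives of that era (host and port below are placeholders): replication is configured with slaveof, and promotion of the standby is manual unless something like Redis Sentinel is layered on top.
# redis.conf on the standby server: replicate from the current master.
slaveof redis-master.local 6379

# After the master fails, promote the standby by hand (run against the standby):
# redis-cli slaveof no one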
Source: (StackOverflow)