failover interview questions
Top frequently asked failover interview questions
I'm interested in hearing people's thoughts about the pros and cons of database mirroring vs. log shipping in this scenario: we need to set up a database backup situation in which there is exactly one secondary server that need not automatically take over when the primary fails. Recovering and starting with the secondary should not take too long, though.
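To make the recovery-time question concrete, here is a rough T-SQL sketch (my own illustration; the database name and backup path are placeholders, not from the question): a manual mirroring failover is a single command on the principal, while a log-shipped secondary has to be restored and recovered by hand.
-- Database mirroring: manual failover, run on the current principal
-- (assumes mirroring is already configured and synchronized for [AppDb]).
ALTER DATABASE [AppDb] SET PARTNER FAILOVER;
-- Log shipping: bring the secondary online after the primary is lost.
-- Apply any remaining transaction-log backups, then recover the database.
RESTORE LOG [AppDb] FROM DISK = N'\\backupshare\AppDb_tail.trn' WITH NORECOVERY;
RESTORE DATABASE [AppDb] WITH RECOVERY;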
Source: (StackOverflow)
I am trying to create a Redis message bus failover scenario with a SignalR app.
At first, we tried a simple hardware load-balancer failover that simply monitored two Redis servers. The SignalR application pointed to the single HLB endpoint. I then failed one server, but was unable to get any messages through on the second Redis server without recycling the SignalR app pool. Presumably this is because the app needs to issue its setup commands to the new Redis message bus.
As of SignalR RC1, Microsoft.AspNet.SignalR.Redis.RedisMessageBus uses Booksleeve's RedisConnection() to connect to a single Redis for pub/sub. I created a new class, RedisMessageBusCluster(), that uses Booksleeve's ConnectionUtils.Connect() to connect to one in a cluster of Redis servers.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using BookSleeve;
using Microsoft.AspNet.SignalR.Infrastructure;

namespace Microsoft.AspNet.SignalR.Redis
{
    /// <summary>
    /// WIP: Getting scaleout for Redis working
    /// </summary>
    public class RedisMessageBusCluster : ScaleoutMessageBus
    {
        private readonly int _db;
        private readonly string[] _keys;
        private RedisConnection _connection;
        private RedisSubscriberConnection _channel;
        private Task _connectTask;
        private readonly TaskQueue _publishQueue = new TaskQueue();

        public RedisMessageBusCluster(string serverList, int db, IEnumerable<string> keys, IDependencyResolver resolver)
            : base(resolver)
        {
            _db = db;
            _keys = keys.ToArray();

            // uses a list of connections
            _connection = ConnectionUtils.Connect(serverList);
            //_connection = new RedisConnection(host: server, port: port, password: password);

            _connection.Closed += OnConnectionClosed;
            _connection.Error += OnConnectionError;

            // Start the connection - TODO: this Open could be removed since the connection
            // is already opened, but _connectTask is used later on
            _connectTask = _connection.Open().Then(() =>
            {
                // Create a subscription channel in redis
                _channel = _connection.GetOpenSubscriberChannel();

                // Subscribe to the registered connections
                _channel.Subscribe(_keys, OnMessage);

                // Dirty hack but it seems like subscribe returns before the actual
                // subscription is properly setup in some cases
                while (_channel.SubscriptionCount == 0)
                {
                    Thread.Sleep(500);
                }
            });
        }
        protected override Task Send(Message[] messages)
        {
            return _connectTask.Then(msgs =>
            {
                var taskCompletionSource = new TaskCompletionSource<object>();

                // Group messages by source (connection id)
                var messagesBySource = msgs.GroupBy(m => m.Source);

                SendImpl(messagesBySource.GetEnumerator(), taskCompletionSource);

                return taskCompletionSource.Task;
            },
            messages);
        }

        private void SendImpl(IEnumerator<IGrouping<string, Message>> enumerator, TaskCompletionSource<object> taskCompletionSource)
        {
            if (!enumerator.MoveNext())
            {
                taskCompletionSource.TrySetResult(null);
            }
            else
            {
                IGrouping<string, Message> group = enumerator.Current;

                // Get the channel index we're going to use for this message
                int index = Math.Abs(group.Key.GetHashCode()) % _keys.Length;

                string key = _keys[index];

                // Increment the channel number
                _connection.Strings.Increment(_db, key)
                    .Then((id, k) =>
                    {
                        var message = new RedisMessage(id, group.ToArray());
                        return _connection.Publish(k, message.GetBytes());
                    }, key)
                    .Then((enumer, tcs) => SendImpl(enumer, tcs), enumerator, taskCompletionSource)
                    .ContinueWithNotComplete(taskCompletionSource);
            }
        }

        private void OnConnectionClosed(object sender, EventArgs e)
        {
            // Should we auto reconnect?
            if (true)
            {
                ;
            }
        }

        private void OnConnectionError(object sender, BookSleeve.ErrorEventArgs e)
        {
            // How do we bubble errors?
            if (true)
            {
                ;
            }
        }

        private void OnMessage(string key, byte[] data)
        {
            // The key is the stream id (channel)
            var message = RedisMessage.Deserialize(data);
            _publishQueue.Enqueue(() => OnReceived(key, (ulong)message.Id, message.Messages));
        }

        protected override void Dispose(bool disposing)
        {
            if (disposing)
            {
                if (_channel != null)
                {
                    _channel.Unsubscribe(_keys);
                    _channel.Close(abort: true);
                }

                if (_connection != null)
                {
                    _connection.Close(abort: true);
                }
            }

            base.Dispose(disposing);
        }
    }
}
Booksleeve has its own mechanism for determining a master and will automatically fail over to another server, and I am now testing this with SignalR.Chat.
In web.config, I set the list of available servers:
<add key="redis.serverList" value="dbcache1.local:6379,dbcache2.local:6379"/>
Then in Application_Start():
// Redis cluster server list
string redisServerlist = ConfigurationManager.AppSettings["redis.serverList"];
List<string> eventKeys = new List<string>();
eventKeys.Add("SignalR.Redis.FailoverTest");
GlobalHost.DependencyResolver.UseRedisCluster(redisServerlist, eventKeys);
I added two additional methods to Microsoft.AspNet.SignalR.Redis.DependencyResolverExtensions:
public static IDependencyResolver UseRedisCluster(this IDependencyResolver resolver, string serverList, IEnumerable<string> eventKeys)
{
    return UseRedisCluster(resolver, serverList, db: 0, eventKeys: eventKeys);
}

public static IDependencyResolver UseRedisCluster(this IDependencyResolver resolver, string serverList, int db, IEnumerable<string> eventKeys)
{
    var bus = new Lazy<RedisMessageBusCluster>(() => new RedisMessageBusCluster(serverList, db, eventKeys, resolver));
    resolver.Register(typeof(IMessageBus), () => bus.Value);
    return resolver;
}
Now the problem is this: if I keep several breakpoints enabled until after a user name has been added, and then disable all breakpoints, the application works as expected. However, with the breakpoints disabled from the beginning, there seems to be some race condition that fails during the connection process.
Thus, in RedisMessageBusCluster():
// Start the connection
_connectTask = _connection.Open().Then(() =>
{
    // Create a subscription channel in redis
    _channel = _connection.GetOpenSubscriberChannel();

    // Subscribe to the registered connections
    _channel.Subscribe(_keys, OnMessage);

    // Dirty hack but it seems like subscribe returns before the actual
    // subscription is properly setup in some cases
    while (_channel.SubscriptionCount == 0)
    {
        Thread.Sleep(500);
    }
});
I tried adding both a Task.Wait and even an additional Sleep() (not shown above) - they did wait, but I am still getting errors.
The recurring error seems to be in Booksleeve.MessageQueue.cs ~ln 71:
A first chance exception of type 'System.InvalidOperationException' occurred in BookSleeve.dll
iisexpress.exe Error: 0 : SignalR exception thrown by Task: System.AggregateException: One or more errors occurred. ---> System.InvalidOperationException: The queue is closed
at BookSleeve.MessageQueue.Enqueue(RedisMessage item, Boolean highPri) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\MessageQueue.cs:line 71
at BookSleeve.RedisConnectionBase.EnqueueMessage(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 910
at BookSleeve.RedisConnectionBase.ExecuteInt64(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 826
at BookSleeve.RedisConnection.IncrementImpl(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 277
at BookSleeve.RedisConnection.BookSleeve.IStringCommands.Increment(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 270
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.SendImpl(IEnumerator`1 enumerator, TaskCompletionSource`1 taskCompletionSource) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 90
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.<Send>b__2(Message[] msgs) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 67
at Microsoft.AspNet.SignalR.TaskAsyncHelper.GenericDelegates`4.<>c__DisplayClass57.<ThenWithArgs>b__56() in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 893
at Microsoft.AspNet.SignalR.TaskAsyncHelper.TaskRunners`2.<>c__DisplayClass42.<RunTask>b__41(Task t) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 821
--- End of inner exception stack trace ---
---> (Inner Exception #0) System.InvalidOperationException: The queue is closed
at BookSleeve.MessageQueue.Enqueue(RedisMessage item, Boolean highPri) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\MessageQueue.cs:line 71
at BookSleeve.RedisConnectionBase.EnqueueMessage(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 910
at BookSleeve.RedisConnectionBase.ExecuteInt64(RedisMessage message, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\RedisConnectionBase.cs:line 826
at BookSleeve.RedisConnection.IncrementImpl(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 277
at BookSleeve.RedisConnection.BookSleeve.IStringCommands.Increment(Int32 db, String key, Int64 value, Boolean queueJump) in c:\Projects\Frameworks\BookSleeve-1.2.0.5\BookSleeve\IStringCommands.cs:line 270
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.SendImpl(IEnumerator`1 enumerator, TaskCompletionSource`1 taskCompletionSource) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 90
at Microsoft.AspNet.SignalR.Redis.RedisMessageBusCluster.<Send>b__2(Message[] msgs) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Redis\RedisMessageBusCluster.cs:line 67
at Microsoft.AspNet.SignalR.TaskAsyncHelper.GenericDelegates`4.<>c__DisplayClass57.<ThenWithArgs>b__56() in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 893
at Microsoft.AspNet.SignalR.TaskAsyncHelper.TaskRunners`2.<>c__DisplayClass42.<RunTask>b__41(Task t) in c:\Projects\Frameworks\SignalR\SignalR.1.0RC1\SignalR\src\Microsoft.AspNet.SignalR.Core\TaskAsyncHelper.cs:line 821<---
public void Enqueue(RedisMessage item, bool highPri)
{
    lock (stdPriority)
    {
        if (closed)
        {
            throw new InvalidOperationException("The queue is closed");
        }
Where a closed queue exception is being thrown.
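One direction I can sketch, purely as an assumption and not a verified fix: have OnConnectionClosed rebuild the connection and the subscription, so that later Send calls don't land on a queue BookSleeve has already closed. The _serverList field below is hypothetical and would have to be captured in the constructor; everything else reuses the calls already shown above.
// Hypothetical reconnect handler for RedisMessageBusCluster (untested sketch).
private void OnConnectionClosed(object sender, EventArgs e)
{
    // Reconnect using the same server list and redo the subscription setup.
    // _serverList is a hypothetical field stored by the constructor.
    _connection = ConnectionUtils.Connect(_serverList);
    _connection.Closed += OnConnectionClosed;
    _connection.Error += OnConnectionError;

    _connectTask = _connection.Open().Then(() =>
    {
        _channel = _connection.GetOpenSubscriberChannel();
        _channel.Subscribe(_keys, OnMessage);
    });
}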
I foresee another issue: since the Redis connection is made in Application_Start(), there may be some issues in "reconnecting" to another server. I think that concern applies when using the singular RedisConnection(), where there is only one connection to choose from. With the introduction of ConnectionUtils.Connect(), however, I'd like to hear from @dfowler or the other SignalR guys on how this scenario is handled in SignalR.
Source: (StackOverflow)
Can someone provide a really simple (but not simpler than possible) explanation of a transaction as applied to computing (even if copied from Wikipedia)?
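As a concrete illustration (my own example; the table and column names are made up), a transaction groups several statements so that either all of them take effect or none of them do:
-- Classic bank-transfer example: the two UPDATEs are applied atomically --
-- either both are committed together or, on rollback, neither is.
BEGIN TRANSACTION;
    UPDATE Accounts SET Balance = Balance - 100 WHERE Id = 1;
    UPDATE Accounts SET Balance = Balance + 100 WHERE Id = 2;
COMMIT TRANSACTION;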
Source: (StackOverflow)
Does anyone know how to determine the active node of a SQL Active-Passive Failover Cluster programmatically from T-SQL?
@@SERVERNAME only returns the virtual server name, which is identical on both nodes.
I don't plan to make any decisions based on the data - I trust the failover to do its thing - but I would like to include the information in an event log so I can tell which node in the cluster was active when the event occurred, or help determine if exceptions come up as a result of a failover.
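For what it's worth (not from the question, and worth verifying on the SQL Server version in use), SERVERPROPERTY can report the physical node rather than the virtual name:
-- Returns the Windows computer name of the node currently hosting the instance,
-- unlike @@SERVERNAME, which returns the virtual server name.
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS ActiveNode;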
Source: (StackOverflow)
Here's my scenario (designed by my predecessor):
Two Apache servers serving reverse proxy duty for a number of mixed backend web servers (Apache, IIS, Tomcat, etc.). There are some sites for which we have multiple backend web servers, and in those cases, we do something like:
<Proxy balancer://www.example.com>
    BalancerMember http://192.168.1.40:80
    BalancerMember http://192.168.1.41:80
</Proxy>

<VirtualHost *:80>
    ServerName www.example.com:80
    CustomLog /var/log/apache2/www.example.com.log combined

    <Location />
        Order allow,deny
        Allow from all
        ProxyPass balancer://www.example.com/
        ProxyPassReverse balancer://www.example.com/
    </Location>
</VirtualHost>
So in this example, I've got one site (www.example.com) in the proxy servers' configs, and that site is proxied to one or the other of the two backend servers, 192.168.1.40 and .41.
I'm evaluating this to make sure that we are fault tolerant on all of our web services (I've already put the two reverse proxy servers into a shared IP cluster for this reason), and I want to make sure that the load-balanced backend servers are fault tolerant as well. But I'm having trouble figuring out if backend failure detection (and the logic to avoid the failed backend server) is built into the mod_proxy_balancer module...
So if 192.168.1.40 goes down, will Apache detect this (I'll understand if it takes a failed request first) and automatically route all requests to the other backend, 192.168.1.41? Or will it continue to balance requests between the failed backend and the operational backend?
I've found some clues in the Apache documentation for mod_proxy and mod_proxy_balancer that seem to indicate that failure can be detected ("maxattempts = Maximum number of failover attempts before giving up.", "failonstatus = A single or comma-separated list of HTTP status codes. If set this will force the worker into error state when the backend returns any status code in the list."), but after a few days of searching, I've found nothing conclusive saying for sure that it will (or at least "should") detect backend failure and recovery.
I will say that most of the search results reference using the AJP protocol to pass the traffic to the backend servers, and this apparently does support failure detection-- but my backends are a mixture of Apache, IIS, Tomcat and others, and I am fairly sure that many of them don't support AJP. They are also a mixture of Windows 2k3/2k8 and Linux (mostly Ubuntu Lucid) boxes running various different applications with various different requirements, so add-on modules like Backhand and LVS aren't an option for me.
I've also tried to empirically test this feature, by creating a new test site like this:
<Proxy balancer://test.example.com>
    BalancerMember http://192.168.1.40:80
    BalancerMember http://192.168.1.200:80
</Proxy>

<VirtualHost *:80>
    ServerName test.example.com:80
    CustomLog /var/log/apache2/test.example.com.log combined
    LogLevel debug

    <Location />
        Order allow,deny
        Allow from all
        ProxyPass balancer://test.example.com/
        ProxyPassReverse balancer://test.example.com/
    </Location>
</VirtualHost>
Where 192.168.1.200 is a bogus address that isn't running any web server, to simulate a backend failure. The test site was served up without a problem for a bunch of different client machines, but even with the LogLevel set to debug, I didn't see anything logged to indicate that it detected that one of the backend servers was down... And I'd like to make 100% sure that I can take our load-balanced backends down for maintenance (one at a time, of course) without affecting production sites.
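For reference, and as an assumption to verify against the mod_proxy documentation for the Apache version in use rather than a confirmed answer: a member that fails at the connection level should be put into error state and skipped for retry seconds, and making those worker parameters explicit would look something like this:
<Proxy balancer://test.example.com>
    # retry: seconds a member stays in error state before being probed again.
    # timeout: seconds to wait for the backend before the request is treated as failed.
    BalancerMember http://192.168.1.40:80 retry=30 timeout=5
    BalancerMember http://192.168.1.200:80 retry=30 timeout=5
</Proxy>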
Source: (StackOverflow)
I am a beginner in Node.js. I am trying to use Node.js in production and want to achieve failover. Since I am running a chat app, if a Node server fails the chat should not break: clients should automatically be connected to a different Node server, and the same socket id should be used for further chatting so that chat messages aren't lost. Can this be achieved? Any samples?
I should not use Nginx/HAProxy. Also, let me know how the node servers should be set up: Active-Active or Active-Passive.
Source: (StackOverflow)
I have to setup a database that can handle failover (if one crashes the other takes over). For that, I decided to use mongodb:
I set up a replica set with two instances. Each instance is running on a separate VM. I have several questions:
- It is recommended to use at least 3 instances in a replica set. Is it OK to use only two?
- I have two instances, and therefore two IP addresses. Which IP should I give to my application that will need to read/write in the database? When a database is down, how will requests be redirected to the instance that is still up?
Some help to get started would be great!
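A sketch of the usual direction (my assumptions: the host names, port, and replica set name below are placeholders): add a small third member as an arbiter so elections can still reach a majority when one data node is down, and give the application the whole seed list rather than a single IP so the driver follows whichever member is primary.
// From the mongo shell on an existing member: add a data-less arbiter.
rs.addArb("arbiter.local:27017")

// Application connection string: list both data-bearing members; the driver
// discovers the primary and redirects reads/writes when a failover happens.
// mongodb://vm1.local:27017,vm2.local:27017/mydb?replicaSet=rs0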
Source: (StackOverflow)
I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS.
I am trying to configure the JDBC connection pool in order to minimize downtime in case of database failover.
My current configuration isn't working correctly during a Multi-AZ failover, as the standby database instance appears to be available in a couple of minutes (according to the AWS console) while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.
The connection pool is configured like this:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool
If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish will invalidate the broken connections, throw some exceptions and then reconnect as soon the database is available. In this setup I get less than 1 minute of downtime.
If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exception at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:
Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 940,715 milliseconds ago. The last packet sent successfully to the server was 935,598 milliseconds ago.
It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception and so there's no chance to perform connection validation.
Downtime in the Multi-AZ case is always between 15-16 minutes, so it looks like a timeout of some sort but I was unable to change it.
Things I have tried without success:
- connection leak timeout/reclaim
- statement leak timeout/reclaim
- statement timeout
- using a different validation method
- using MysqlDataSource instead of MysqlConnectionPoolDataSource
How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced?
Or how can I let GlassFish detect a database failover?
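One thing that may be worth trying (my assumption; connectTimeout and socketTimeout are MySQL Connector/J properties in milliseconds, and the values are only examples): give the driver its own network timeouts so a thread stuck on a socket to the failed primary gets an exception instead of blocking until the OS gives up.
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
  --datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
  --isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
  --property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT:connectTimeout=10000:socketTimeout=30000 \
  MyPool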
Source: (StackOverflow)
I am using mod_proxy_balancer to manage failover of backend servers. Backend servers may return an error code instead of timing out when some other backend service fails such as NFS and we want such servers also to be marked as failed nodes. Hence we are using failonstatus directive.
<Proxy balancer://avatar>
    ProxySet failonstatus=503
    BalancerMember http://active/ retry=30
    # the hot standby
    BalancerMember http://standby/ status=+H retry=0
</Proxy>
Currently the failover works perfectly, with one glitch: when the active node fails, the user gets a 503 error, and only from the next request onward does the Standby server take over.
I don't want even a single request to fail, though. Can't mod_proxy fail over without ever returning an error to the client? If the active node fails, I want mod_proxy to try the Standby for the same request, not just for subsequent requests!
Source: (StackOverflow)
I've been searching (with little success) for a free/open-source session clustering and replication solution for ASP.NET. I've run across the usual suspects (Indexus SharedCache, memcached); however, each has some limitations.
- Indexus - Very immature, stubbed session interface implementation. It's otherwise a great caching solution, though.
- Memcached - Little replication/failover support without going to a db backend.
- Several SF.net projects - All aborted in the early stages... nothing that appears to have any traction, and one which seems to have gone all commercial.
- Microsoft Velocity - Not OSS, but seems nice. Unfortunately, I didn't see where CTP1 supported failover, and there is no clear roadmap for this one. I fear that this one could fall off into the ether like many other MS dev projects.
I am fairly used to the Java world where it is kind of taken for granted that many solutions to problems such as this will be available from the FOSS world.
Are there any suitable alternatives available in the .NET world?
Source: (StackOverflow)
I tried to run a simple app in a distributed manner to test failover-takeover features but failed.
What I want:
The application is myapp_api, which exposes a REST API; it has the myapp application as a dependency. I want to start myapp_api on 3 nodes, and I want the whole app (myapp_api + myapp) to be running on only one node at a time.
What is wrong:
The main app (myapp_api) works as expected: it runs on only one node, with failover and takeover. But for some reason the dependent myapp always starts on every node. I want it to run on only one node at a time.
What I do:
My config for the first node as an example.
[
 {kernel,
  [{distributed, [{myapp_api,
                   1000,
                   ['n1@myhost', {'n2@myhost', 'n3@myhost'}]}]},
   {sync_nodes_optional, ['n2@myhost', 'n3@myhost']},
   {sync_nodes_timeout, 5000}
  ]}
].
I call
erl -sname nI -config nI.config -pa apps/*/ebin deps/*/ebin -s myapp_api
at every node.
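In case it turns out to be the missing piece (I have not verified that it changes the takeover behaviour here), the syntax for placing more than one application under the distributed kernel parameter, so that dist_ac controls where each of them runs, looks like this:
{kernel,
 [{distributed, [{myapp_api, 1000, ['n1@myhost', {'n2@myhost', 'n3@myhost'}]},
                 {myapp,     1000, ['n1@myhost', {'n2@myhost', 'n3@myhost'}]}]},
  {sync_nodes_optional, ['n2@myhost', 'n3@myhost']},
  {sync_nodes_timeout, 5000}
 ]}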
Source: (StackOverflow)
What would be the best way to replicate individual DB tables from a master PostgreSQL server to a slave machine? It can be done with cron+rsync, with whatever PostgreSQL might have built in, or with some sort of OSS tool, but so far the Postgres docs don't seem to cover how to do table replication. I'm not able to do full DB replication because some tables have license->IP stuff connected to them, and I can't replicate those on the slave machine. I don't need instant replication; hourly would be acceptable as well.
If I need to just rsync, can someone help identify which files within the /var/lib/pgsql directory would need to be synced, or how I would know which tables they correspond to.
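If hourly really is acceptable, a per-table dump piped to the slave is a much simpler fallback than trying to rsync files under /var/lib/pgsql (the data files there are not usable on a per-table basis). A sketch, where the table, database, and host names are placeholders:
# Cron entry: copy one table from the master to the slave once an hour.
# --clean drops and recreates the table on the slave before reloading it.
0 * * * * pg_dump -h master.example.com -t public.orders --clean mydb | psql -h slave.example.com mydb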
Source: (StackOverflow)
In David Fowler's blog, SQL Server has been added to the list of scale out providers for service bus.
I am in the process of implementing Redis on our Windows servers. Based on what I know about Redis, I'm guessing it will be significantly faster than using SQL Server - is that a fair assumption?
If so, how does the Windows version of Redis implement fail-over?
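For reference, and assuming the Windows port keeps the standard Redis replication directives of that era (host and port below are placeholders): replication is configured with slaveof, and promotion of the standby is manual unless something like Redis Sentinel is layered on top.
# redis.conf on the standby server: replicate from the current master.
slaveof redis-master.local 6379

# After the master fails, promote the standby by hand (run against the standby):
# redis-cli slaveof no one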
Source: (StackOverflow)