EzDevInfo.com

disaster-recovery interview questions

Top disaster-recovery frequently asked interview questions

How to create a redundant instance of SOLR?

I need to build out a fully redundant and resilient SOLR search platform where search data is replicated between data centers.

Does a SOLR cluster provide any mechanism for this inter-datacenter replication? If it isn't built into the engine, would it work to create a snapshot of the data, ship it across the wire, and have a second SOLR instance on the remote side sitting idle, waiting for the primary to die?

I'm looking for recommendations, as I cannot support a production implementation of SOLR without DR capability.
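
To illustrate the snapshot-and-ship idea, roughly what I have in mind (untested sketch; host names, core name, and paths are placeholders, and it assumes the standard ReplicationHandler is enabled on the core):

# Untested sketch: trigger a snapshot on the primary via the ReplicationHandler,
# then rsync it to the idle secondary site. All names and paths are placeholders.
import subprocess
import requests

PRIMARY = "http://solr-primary.example.com:8983/solr/mycore"
SNAPSHOT_DIR = "/var/solr/backups"            # passed to Solr as location=
REMOTE = "solr-dr.example.com:/var/solr/restore"

# 1. Ask the primary to write a snapshot of the current index.
#    Note: the backup command is asynchronous; in a real script you would poll
#    /replication?command=details until the snapshot is complete.
resp = requests.get(f"{PRIMARY}/replication",
                    params={"command": "backup",
                            "location": SNAPSHOT_DIR,
                            "name": "dr-snapshot"})
resp.raise_for_status()

# 2. Ship the snapshot directory to the DR data center.
subprocess.run(["rsync", "-az", "--delete",
                f"{SNAPSHOT_DIR}/snapshot.dr-snapshot/",
                f"{REMOTE}/snapshot.dr-snapshot/"], check=True)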


Source: (StackOverflow)

Undo SVN delete ./* --force

I didn't realize svn delete would delete my local copy; I just wanted it out of the repository. Now all my files are gone, and they aren't in the trash bin either. Is there any way I can recover them?


I should clarify: these files never made it into the repository. I was trying to get rid of some old junk in the repository so that I could check these files in.

I'm running Ubuntu on an ext3 filesystem. It's okay, though: I managed to redo what I deleted in about 2 hours.


Source: (StackOverflow)

Restore deleted file directly from Eclipse local history

Some git mistakes happened and I lost a lot of changes to one file. I was using Eclipse as my IDE, but the git mishap included deleting the project and re-cloning the directory, so I can't do the restore from within Eclipse. I believe I have found the local history file that contains the code I want to restore, but I'm not sure how to cat this file. It looks a bit like JSON.

Does anyone know how to restore or read the files under .metadata\.plugins\org.eclipse.core.resources\.history?
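
For what it's worth, the crude approach I've been trying in order to locate the right entry is to search the history store for a string I know was in the lost code (rough sketch; the workspace path is a placeholder, and it assumes the history entries hold raw file contents):

# Rough sketch: walk the Eclipse local-history store and look for a string
# known to be in the lost code. The workspace path is a placeholder.
import os

HISTORY = os.path.expanduser(
    "~/workspace/.metadata/.plugins/org.eclipse.core.resources/.history")
NEEDLE = b"someUniqueMethodName"   # placeholder: a snippet from the lost file

for root, _dirs, files in os.walk(HISTORY):
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:
            if NEEDLE in f.read():
                # Print candidates with their modification time to pick the latest.
                print(path, os.path.getmtime(path))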


Source: (StackOverflow)

T-SQL to copy logins, users, roles, permissions, etc.

We have implemented log shipping as a database disaster-recovery solution. Is there a way to use T-SQL to script all the logins, users, roles, permissions, etc. to the master database on the secondary server, so that the script can be scheduled to run as a SQL Agent job?

My aim is that in the event of a D/R situation we can simply restore the transaction logs for each database to the secondary server and not have to worry about orphaned users etc.
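
To give an idea of the direction I'm thinking in: something along the lines of the sp_help_revlogin approach, generating CREATE LOGIN statements that preserve each login's SID and password hash so the restored databases' users are not orphaned (untested sketch; connection details are placeholders, and it only covers SQL logins, not Windows logins, server roles, or explicit permissions):

# Untested sketch: read SQL logins from the primary and emit a script that
# recreates them (with original SID and password hash) on the secondary.
# Connection details are placeholders; Windows logins, server roles and
# explicit permissions would need extra handling.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=PRIMARYSERVER;DATABASE=master;"
                      "Trusted_Connection=yes")
rows = conn.cursor().execute("""
    SELECT name,
           CONVERT(varchar(514), sid, 1)                                 AS sid_hex,
           CONVERT(varchar(514), LOGINPROPERTY(name, 'PasswordHash'), 1) AS pwd_hash,
           default_database_name
    FROM sys.server_principals
    WHERE type = 'S' AND name NOT LIKE '##%' AND name <> 'sa'
""").fetchall()

with open("recreate_logins.sql", "w") as out:
    for name, sid_hex, pwd_hash, default_db in rows:
        if not pwd_hash:
            continue   # skip logins without a password hash
        out.write(
            f"IF SUSER_ID('{name}') IS NULL\n"
            f"  CREATE LOGIN [{name}] WITH PASSWORD = {pwd_hash} HASHED,\n"
            f"    SID = {sid_hex}, DEFAULT_DATABASE = [{default_db}],\n"
            f"    CHECK_POLICY = OFF;\n"
        )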

Thanks for your help!


Source: (StackOverflow)

CloudBees, Availability Zones and Disaster Recovery

What is the difference between what CloudBees calls a region-specific deployment and what they (and Amazon) call an availability zone?

From what I can tell, CloudBees allows you to deploy in 1 of 2 regions/zones: USA and Europe. Are those my only options (for both region-specific deployments and availability zones)?

Is it a solid Disaster Recovery plan to have a pool of idle instances on standby deployed to the Europe "zone" in the event of a total failure of their USA data center(s)? How is DR usually handled by CloudBees clients?


Source: (StackOverflow)

How do I re-create a MySQL InnoDB table from an .ibd file?

Assume that the following MySQL files have been restored from a backup tape:

  • tablename.frm
  • tablename.ibd

Furthermore, assume that the MySQL installation was running with innodb_file_per_table and that the database was cleanly shut down with mysqladmin shutdown.

Given a fresh install of the same MySQL version that the restored MySQL files were taken from, how do I import the data from tablename.ibd/tablename.frm into this new install?
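
For reference, a rough sketch of the discard/import tablespace route, scripted (paths, schema name, and the example column list are placeholders; the real CREATE TABLE must match the original definition exactly, and whether the import is reliable depends on the MySQL version):

# Untested sketch of the DISCARD/IMPORT TABLESPACE route. Paths, schema name
# and the column list are placeholders; the CREATE TABLE must match the
# original definition exactly.
import shutil
import subprocess

DATADIR = "/var/lib/mysql/mydb"           # assumption: target schema directory
RESTORED_IBD = "/restore/tablename.ibd"   # file recovered from tape

def sql(stmt):
    subprocess.run(["mysql", "mydb", "-e", stmt], check=True)

# 1. Recreate the table with the original definition (placeholder shown).
sql("CREATE TABLE tablename (id INT PRIMARY KEY) ENGINE=InnoDB;")
# 2. Throw away the empty tablespace the server just created.
sql("ALTER TABLE tablename DISCARD TABLESPACE;")
# 3. Copy the restored .ibd into place (must end up owned by mysql:mysql).
shutil.copy2(RESTORED_IBD, f"{DATADIR}/tablename.ibd")
# 4. Attach the restored file to the table definition.
sql("ALTER TABLE tablename IMPORT TABLESPACE;")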


Source: (StackOverflow)

Failover & Disaster Recovery [closed]

What's the difference between failover and disaster recovery?


Source: (StackOverflow)

Cannot access users' workspaces on the new server after shifting the TFS server to another PC

We are using TFS 2012. All the PCs are connected via a workgroup (we are not using a domain). The TFS machine's name is PCTFS. We have about 15 developers who use TFS. Because we use a workgroup, we created Windows accounts for all 15 developers on the TFS server PC, so the TFS server now has users like PCTFS\user1, PCTFS\user2, PCTFS\user3, and so on.

All these users connect to the TFS server from their own PCs. Because matching user names exist on the TFS server, authentication works by matching Windows user names, so developers can connect to the TFS server and work on it. Each user has their own permissions, such as creating workspaces, shelvesets, check-ins, etc.

We are now trying to shift the TFS server to another PC, so the name of the machine hosting TFS will change from PCTFS to PCNEWTFS. After restoring the SQL databases of all the collections, we did the following.

First we remapped the databases using:

TFSConfig RemapDBs /DatabaseName:PCNEWTFS\SQLEXPRESS;TFS_Configuration /SQLInstances:PCNEWTFS\SQLEXPRESS

Then we reset the owner using:

TFSConfig Accounts /ResetOwner /SQLInstance:PCNEWTFS\SQLEXPRESS /DatabaseName:TFS_Configuration

Then we added the account using:

TfsConfig Accounts /add /AccountType:ApplicationTier /account:PCNEWTFS\user1 /SQLInstance:PCNEWTFS\SQLEXPRESS /DatabaseName:Tfs_Configuration

Finally, we changed the identities using:

TFSConfig Identities /change /fromdomain:PCTFS /todomain:PCNEWTFS

All the operations were successful, and all the newly created users are accessible after changing the identities.

For example:

   Account name                          Exists                  Matches 

   PCNEWTFS\user1                        True                    False 

Please help me with the following questions.

We referred to the following article.

When trying to access a user's workspace on the new server, all the workspaces still refer to the old server; they have not been shifted to the new server. How do we make them refer to the new server?

Edit: We shifted the TFS server from PC110 to PC78 using the steps mentioned above. In the screenshot below you can see that all the workspaces still refer to the old server (PC110) instead of PC78 (the new server).

What are the necessary steps to make the workspaces refer to the new server during disaster recovery? (We could manually transfer each user's workspace using a command or by altering the database tables, but that is not practical when there are many users.)

Are we missing any steps in the disaster recovery plan that leave the workspaces referring to the old server?

[Screenshot: workspace list still showing the old server PC110]


Source: (StackOverflow)

mysql can't drop corrupt innodb table

I'm dealing with a MySQL server which ran out of disk space and has mostly InnoDB tables, which of course got corrupted. I'm trying to drop and recreate the damaged tables, but MySQL won't let me do anything with them, including repair. As you can see below, this is no end of fun. It should be noted that only this one table seems to cause any of these errors.

mysql> drop table myschema.mytable;
ERROR 2013 (HY000): Lost connection to MySQL server during query

mysql> repair table myschema.mytable;
#results in the following
| myschema.mytable | repair | Error    | Out of memory; restart server and try again (needed 2 bytes)               |
| myschema.mytable | repair | Error    | Incorrect information in file: './myschema/mytable.frm' |
| myschema.mytable | repair | error    | Corrupt 

mysql> describe myschema.mytable; 
ERROR 1037 (HY001): Out of memory; restart server and try again (needed 2 bytes) 

If I stop the server, move the table's .frm and .ibd files out of the way, and then restart, I can't recreate the table, because the server says it already exists (even though it can't be seen in INFORMATION_SCHEMA). In this state I can't drop it either, because the server says it isn't there.

I've looked high and low for answers, but I'm no DBA, so at this point I'm lost. I can't figure out how to fix this table, and I can't figure out how to get rid of it.
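
The only other idea I have left is to bring the server up with InnoDB forced recovery and drop the table in that state, roughly like this (untested; the config path, service name, and recovery level are guesses on my part, and I don't know whether DROP will survive on a tablespace this damaged):

# Untested sketch: start mysqld with innodb_force_recovery, drop the corrupt
# table, then remove the option again. Paths and service name are assumptions.
import os
import subprocess

RECOVERY_CNF = "/etc/mysql/conf.d/force_recovery.cnf"

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Enable the least invasive forced-recovery level and restart.
with open(RECOVERY_CNF, "w") as f:
    f.write("[mysqld]\ninnodb_force_recovery = 1\n")
run("service", "mysql", "restart")

# 2. Try to get rid of the corrupt table while InnoDB is in recovery mode.
run("mysql", "-e", "DROP TABLE myschema.mytable;")

# 3. Return to normal operation.
os.remove(RECOVERY_CNF)
run("service", "mysql", "restart")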

Any suggestions?


Source: (StackOverflow)

Microsoft Azure Backups not reducing available recovery points or destination usage after retention period has been reduced

I had the retention period set to 30 days, with around 6.8 TB of backups. Over a week ago I changed the retention period down to 7 days, and it took a couple of days for the total number of recovery points to go down to 7. The destination usage was still going up.

I have come in today (Monday) and the total number of recovery points is now 10, and the destination usage is now 7.57 TB.

I only have the one server backing up to the Azure Backup service.

My questions are:
1. How long does it take for Azure to delete backups that fall outside the retention period?*
2. Why are there more recovery points than the retention period allows?
3. Is there a way to purge just the backups outside the retention period?

*The backup agent does say that "space allocation data is updated on a daily basis".

It would be a great help if anyone had some information on this as the storage costs are getting very high.

Thanks for the help in advance

Z


Source: (StackOverflow)

chain of events analysis and reasoning

My boss said the logs in their current state are not acceptable for the customer. If there is a fault, a dozen different modules of the device report their own errors, and they all land in the logs. The original cause of the fault may be buried somewhere in the middle of the list, may not appear on the list at all (if the module is too damaged to report), or may appear late, after everything else has finished reporting problems that result from the original fault. Anyway, few people outside the system developers can properly interpret the logs and work out what actually happened.

My current task is to write a module that does customer-friendly fault reporting. That is, it should gather all the events reported over the last ~3 seconds (which is about the maximum interval between the original fault occurring and the last resulting after-effect), do some magic processing of this data, and come up with one clear, friendly line stating what is broken and needs to be fixed.

The problem is the magic part: how, given a number of fault reports, do I come up with the original source of the fault? There is no simple cause-and-effect list, just commonly occurring chains of events displaying certain regularities.

Examples:

  • a short circuit is detected, resulting in limited operation mode; the limited operation does not remove the fault, so the emergency state is escalated and total output power is disconnected.
  • the safety line got engaged. No module reported engaging it within 3 s of it being engaged, so "unknown source or interference" is attributed as the reason for the system halt.
  • most output modules report no output voltage. About 1 s later the power supply monitoring module reports that power is out, which is the original cause.
  • an output module reports no output voltage on all of its output lines. No report from the power supply module. The cause is a power line disconnected from the module.
  • an output module reports no output voltage on one of its output lines. No other faults are reported. The cause is a blown fuse.
  • an output module did not report back after applying a received state. Shortly after, the control module reports an illegal state of the output lines (resulting from the output module not actually updating the state in a timely manner). The cause is the output module (which introduced the fault), not the control module (which halted the system when the fault was detected).
  • a fault in an input module switches the device to backup-failsafe mode. An output module not used so far, which was itself faulty, gets engaged in this mode, and the fault mode gets escalated to critical. The original cause is not the input module, which is allowed to report false positives concerning faults, but the broken backup output module, which aborted the operation.
  • there has been no activity of any kind from an output module for the last 2 seconds. This means it is broken and a fault mode must be entered.

There is no comprehensive list of rules about what causes what. Rules will be added as new kinds of faults occur "in the wild" and are diagnosed and fixed. Some of them are heuristics: if this error is accompanied by these errors, then the fault is most likely this. Some faults will not be resolved; a plain list of module reports will have to suffice. Some answers will be ambiguous: one set of symptoms may suggest two different faults. This is more of a "best effort" problem than a "guaranteed solution" one.

Now for the (overly general and vague) question: how do I solve this? Are there specific algorithms, methods, or generalized solutions to this kind of problem? How do I write the generalized rule sets and match against them? How do I do the soft matching (say, when an input module breaks right in the middle of an emergency halt, a completely unrelated event that should be ignored)? Help, please.
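
To make the question more concrete, the kind of structure I have in mind is a small rule engine over the events in the ~3-second window, where the first matching rule wins (simplified Python sketch; the event codes and rules are made up for illustration):

# Simplified sketch of a windowed, rule-based fault correlator.
# Event codes and rules are illustrative only.
from dataclasses import dataclass

@dataclass
class Event:
    time: float      # seconds
    module: str
    code: str

@dataclass
class Rule:
    name: str        # customer-facing explanation
    required: set    # codes that must all be present in the window
    forbidden: set   # codes that must be absent (helps disambiguate)

RULES = [
    Rule("Power supply failure",
         required={"NO_OUTPUT_VOLTAGE", "PSU_POWER_LOST"}, forbidden=set()),
    Rule("Power line disconnected from output module",
         required={"NO_OUTPUT_VOLTAGE"}, forbidden={"PSU_POWER_LOST"}),
    Rule("Blown fuse on a single output line",
         required={"NO_OUTPUT_VOLTAGE_SINGLE_LINE"}, forbidden=set()),
]

def diagnose(events, window=3.0):
    """Return the best explanation for the events in the last `window` seconds."""
    if not events:
        return "No faults reported"
    latest = max(e.time for e in events)
    codes = {e.code for e in events if latest - e.time <= window}
    for rule in RULES:                      # first match wins; order = priority
        if rule.required <= codes and not (rule.forbidden & codes):
            return rule.name
    return "Unrecognised fault: " + ", ".join(sorted(codes))

# Example: PSU dies, an output module reports loss of voltage ~1 s earlier.
print(diagnose([Event(0.0, "out1", "NO_OUTPUT_VOLTAGE"),
                Event(1.0, "psu", "PSU_POWER_LOST")]))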


Source: (StackOverflow)

DRP for postgres-xl

I have installed and set up a 2-node cluster of postgres-xl 9.2, where the coordinator and GTM are running on node1 and the datanode is set up on node2.

Now, before I use it in production, I have to deliver a DRP solution. Does anyone have a DR plan for the postgres-xl 9.2 architecture?

Best Regards, Aviel B.


Source: (StackOverflow)

AWS Account-to-Account Backup Solution for EC2 and RDS?

Desired Result: After hearing many horror stories of malicious users gaining access to AWS accounts and wiping out resources, I'm interested in creating a system that can copy RDS Snapshots and EC2 AMIs/Volumes to a completely separate AWS account for use as a 'time-capsule' or 'ice-cold-recovery' site.

Security Basics: I use IAM with MFA for all existing accounts, and I restrict who-can-do-what based on need-to-access. Most users have read-only access to everything, and a select few are power users. We never use the root account.

Initial discoveries: Since there isn't a native way to copy AMIs or Snapshots to another account, my current understanding is that I would need to use our current account to allow the 'vault' account access to the AMIs/Snapshots, then use the vault account to launch an instance/DB from the AMI/SS, then make another AMI/SS of the instance/DB in order to make a complete copy in another account.
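
For illustration, a slightly simpler variant of that idea: share each snapshot with the vault account, then have the vault account copy it directly, without launching anything in between (untested boto3 sketch; account IDs, snapshot identifiers, regions, and profile names are placeholders):

# Untested boto3 sketch: share an EBS snapshot and an RDS manual snapshot
# with the vault account, then copy them from within the vault account.
# All identifiers below are placeholders; AMIs would need the analogous
# modify_image_attribute / copy_image calls.
import boto3

VAULT_ACCOUNT = "111111111111"
REGION = "us-east-1"

prod = boto3.Session(profile_name="prod", region_name=REGION)
vault = boto3.Session(profile_name="vault", region_name=REGION)

# EC2/EBS snapshot: grant the vault account permission, then copy it there.
prod.client("ec2").modify_snapshot_attribute(
    SnapshotId="snap-0123456789abcdef0",
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=[VAULT_ACCOUNT])
vault.client("ec2").copy_snapshot(
    SourceRegion=REGION,
    SourceSnapshotId="snap-0123456789abcdef0",
    Description="ice-cold-recovery copy")

# RDS manual snapshot: share it, then copy it using its ARN from the vault side.
prod.client("rds").modify_db_snapshot_attribute(
    DBSnapshotIdentifier="mydb-manual-snapshot",
    AttributeName="restore",
    ValuesToAdd=[VAULT_ACCOUNT])
vault.client("rds").copy_db_snapshot(
    SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:222222222222:snapshot:mydb-manual-snapshot",
    TargetDBSnapshotIdentifier="mydb-vault-copy")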

Questions:

  1. Is this stupid?
  2. Is there a better way?
  3. Is anyone aware of a service or scripting solution that could accomplish this in a simple manner?

I'm sure with enough time I could use the SDK and make something that does this, but I'm very open to NOT coding it myself.


Source: (StackOverflow)

Does anyone have experience of using the new p4 replicate command in their Perforce backup/restore scripts?

We recently performed an upgrade of our whole Perforce system to 2009.02.

During this exercise, we noticed that the backup/restore process that was installed here by a Perforce consultant a year ago was not completely working. Basically, the verify command has never worked (scary!).

Since we are obliged to revisit our backup/restore scripts, I was toying with the idea of using the new p4 replicate command. The idea is to use it alongside an rsync of the data files, so that in case of a crash we will lose at worst an hour of work (if we execute them every hour).

Does anyone have experience with, or an example of, backup/restore scripts using the p4 replicate command in the 2009.02 version?

Thanks,

Thomas


Source: (StackOverflow)

Downloading a very large number of files from S3

I would like to set up a disaster-recovery copy of an S3 bucket with ~2 million files.

This does not have to be automated, since we trust Amazon's promise of high reliability; we have enabled versioning and set up MFA for deleting the bucket itself.

So I just want to periodically download (manually) the contents of the bucket to keep an offline copy.

I have tried a few S3 clients, but most of them hang when dealing with such a large folder.

Is there any tool that is right for the job, or do we have to resort to Amazon's data export service (where we have to send them a USB drive every time we need an offline backup)?
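
For context, the fallback I'm considering is a plain paginated download script rather than a GUI client (untested boto3 sketch; the bucket name and destination are placeholders, with no parallelism or retries):

# Untested sketch: page through the bucket listing and download every key.
# Bucket name and destination directory are placeholders; no parallelism,
# retries, or bandwidth limiting.
import os
import boto3

BUCKET = "my-dr-bucket"
DEST = "/backups/s3-offline-copy"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue   # skip zero-byte "folder" markers
        target = os.path.join(DEST, key)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(BUCKET, key, target)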

Thanks in advance for your advice!


Source: (StackOverflow)