Disaster recovery - the complete guide


Part-1

Disaster recovery is one of the most important topics in today’s world. It is like insurance for a company’s data and infrastructure. In normal day-to-day work, disaster recovery seems unimportant. But when a disaster occurs, data and infrastructure are fully or partially destroyed, and the company’s reputation and business are at high risk. Only then do we realize its importance.
So, hello, everyone. This is Utpal. In this article, I discuss disaster recovery in on-premises and cloud architectures.
As this is a huge topic, I have written it as a series of articles covering its importance, case studies, DR steps, the various types, a DR demo, and third-party DR services.

Utpal Bhattacharjee

For more questions, connect with me on LinkedIn.

What is disaster recovery (DR)? In simple words, it is the process of recovering and resuming normal operations after a disaster event. With DR, organizations regain access to their data and infrastructure and restore their functionality. There are various types of disasters-

  1. Natural disasters- like earthquakes, floods, tornadoes, typhoons or other severe weather that fully or partially damage IT infrastructure.
  2. Technical disasters- like a power outage, short circuit, internet outage, network loops, misconfiguration, hardware or software failure, misbehaviour or destruction, vendor problems etc.
  3. Man-made disasters- there are two types, unintentional and intentional.
    1. Unintentional means an IT admin deletes critical data by mistake, or misconfigures hardware or software through lack of knowledge and experience, or overlooks system logs, alarms, security warnings etc. through negligence, and a disaster follows.
    2. Intentional means an internal or external person deliberately takes harmful actions. A dissatisfied employee can harm the IT infrastructure. Various types of hacking, cyber-attacks and social engineering can compromise data and infrastructure. In recent times, ransomware, DDoS, BGP hijacking etc. have been in the news. I will discuss some of them later.

So, it is important to back up data to a remote location or the cloud.

However, many organizations neglect backups because they are expensive (initial and then maintenance costs), need a separate workload, have loosely defined RPO & RTO, and may affect performance. Some older databases also don’t support backup technology.
In 2019, International Data Corporation (IDC) estimated that more than half of applications are not covered by any DR plan. So, if disaster strikes, up to half of all organizations will find it difficult to survive.
If we look into the past, we can see that many disaster events harmed organizations badly and even destroyed some of them. I have discussed them in detail later in the article. A few of them are-

  • In March 2018, the SamSam ransomware attacked the city of Atlanta, the capital of Georgia.
  • In 2013, lightning struck an office building in South Carolina and burned the network and hardware infrastructure of Cantey Technology.
  • In November 2016, a virus infected the Northern Lincolnshire and Goole NHS Foundation Trust in the UK. The virus crippled their systems and halted operations at three separate hospitals for five days.
  • In August 2017, many small businesses in Southeast Texas were destroyed by Hurricane Harvey.
  • On 23rd July 2020, a massive ransomware attack hit the corporate network of the sport and fitness technology brand Garmin, reportedly costing 10 million dollars.
  • In June 2020, a three-hour global network outage happened on IBM Cloud due to incorrect routing by a third party. This was a BGP hijacking.
  • In January 2020, a 7.5-hour outage happened on the VPC subsystem of the Sydney region (ap-southeast-2) of AWS.

When disaster strikes, organizations stop worrying about the cost; they want their data back as early as possible. They want to go back to the last working day’s snapshot.

There are several steps of disaster recovery-

Risk assessment- Firstly, an organization has to identify the threats and dangers most likely to harm its business, considering its whole IT infrastructure.
The risk equation is- Risk = Threat frequency x Vulnerability x Asset (Threat frequency means how often a disaster event can occur.)
The risk assessment can be done with the following steps-

  1. Identify the assets and prioritize them by importance and value. Assets can be data, software, hardware, users, personnel, IT security policy and architecture, network topology, information flow etc. They can be classified as critical, major, or minor.
  2. The organization should identify the possible threats (I have already discussed them).
  3. Now the organization should identify the weaknesses that can let a threat harm it; these are called vulnerabilities. To do this, it should conduct analyses and audits, review the NIST vulnerability database, and run penetration tests, information security tests etc.
  4. The organization should determine the likelihood of a disaster event.
  5. The organization should assess the impact of disaster events on its assets.
  6. The organization should determine the level of risk to their system for each disaster event. The risk-level matrix can be used.
  7. Finally, they should develop a risk assessment report which will help to make proper decisions.
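
The risk equation and the steps above can be sketched as a small scoring script. This is only an illustration, not a standard method: the 1 to 5 scales, the example assets and the thresholds for the risk levels are all assumptions I have chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RiskItem:
    asset: str
    threat: str
    threat_frequency: int  # assumed scale: 1 (rare) to 5 (frequent)
    vulnerability: int     # assumed scale: 1 (well protected) to 5 (exposed)
    asset_value: int       # assumed scale: 1 (minor) to 5 (critical)

    def score(self) -> int:
        # Risk = Threat frequency x Vulnerability x Asset
        return self.threat_frequency * self.vulnerability * self.asset_value


def risk_level(score: int) -> str:
    # Hypothetical thresholds standing in for a risk-level matrix.
    if score >= 60:
        return "high"
    if score >= 20:
        return "medium"
    return "low"


def risk_report(items):
    # Highest risk first, so the report supports prioritisation.
    ranked = sorted(items, key=lambda item: item.score(), reverse=True)
    return [(i.asset, i.threat, i.score(), risk_level(i.score())) for i in ranked]


items = [
    RiskItem("customer database", "ransomware", 3, 4, 5),
    RiskItem("office file server", "power outage", 4, 2, 2),
    RiskItem("public website", "DDoS", 2, 3, 4),
]
report = risk_report(items)
for row in report:
    print(row)
```

A real assessment would take the scales and thresholds from the organization's own risk-level matrix.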

Business impact analysis- To know the limitations of business operations and understand the cost associated with downtime, a business impact analysis has to be performed. Organizations should find their most critical data. Here two recovery objectives have to be determined- RTO & RPO.

  • RTO or Recovery Time Objective is the maximum acceptable amount of time that an application can stay offline without affecting business operations. Backup resources have to be brought up within this time, so by determining the RTO an organization can determine how much to invest in backup resources in the DR plan.
  • RPO or Recovery Point Objective is the maximum acceptable amount of data loss, measured as time. It says how old the most recent recoverable data may be. If an application depends on frequently modified data, the RPO is small; if it can work with data that is modified infrequently, the RPO can be larger.

The smaller the values of RTO & RPO, the higher the cost of protecting the application against disaster, and vice versa. The disaster recovery plan has to be made according to the RTO & RPO.
Also, organizations should determine the most critical data, which needs a higher priority.
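
To make the two objectives concrete, here is a minimal sketch that checks a backup schedule against them. The function names and the numbers are my own assumptions for illustration. The worst case for data loss is a disaster that strikes just before the next backup runs, so the backup interval must not exceed the RPO.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # If a disaster hits just before the next backup runs, everything
    # written since the previous backup is lost.
    return backup_interval


def meets_objectives(backup_interval, estimated_recovery_time, rpo, rto):
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": estimated_recovery_time <= rto,
    }


# Hypothetical application: daily backups, 3 hours to restore,
# but the business tolerates only 4 hours of lost data.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_recovery_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)
```

In this example the RTO is met but the RPO is not, so the organization would need more frequent backups or replication.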

Create a backup site based on inventory, RTO & RPO, and DR approach- For a non-cloud DR strategy, a backup or redundant site has to be created in another geographical location, based on the DR plan and with security compliance. If the primary data center is affected by any disaster, the backup site takes over, i.e. fail-over happens. The backup site is used only for emergency purposes, and when the primary data center is restored the organization should fail back to it. Thus, resiliency is improved.
The investment in the backup site will be based on the values of RTO and RPO and on the DR strategy. For example, the pilot light approach needs less investment than the warm or hot standby approach.
For this, the organization should conduct a complete inventory of hardware, software and other accessories. This inventory shows how much equipment is required to restore the entire infrastructure to its original state and how much is required to support the mission-critical workloads.

Backup to the cloud- This is a cheaper alternative to creating a backup site, i.e. the traditional backup. “The key to an effective recovery plan is to rely on the public cloud, not your own datacenter or hardware.” -by Poojan Kumar, CEO and Co-Founder of Clumio.

Cloud backup has many advantages over traditional backup-
In traditional backup, the organization has to build and maintain the buildings, servers, storage etc., maintain electricity and internet, employ skilled manpower for support, and use a third-party data bunker service. So, there is undifferentiated heavy lifting.
In cloud backup, the cloud integrates with existing backup frameworks. Also, cloud service providers have a pay-as-you-go pricing model with little or no upfront charges. There is no need for tape backups; instead, low-cost, highly scalable virtual storage is available with minimal disruption. Also, there are options for insights using AI.

Disaster response document- When a disaster event happens, recovery should be completed in the minimum possible time. So, proper documentation is required which describes the recovery and restore process in detail, with standard guidelines and flowcharts. It also includes guidelines for different types of disasters. The document should be written in easy, clear language that employees can understand.

DR team management and third-party DR service- Based on the plan, an organization can form a special team that will implement it. The team members should be properly trained on the disaster response document and assigned responsibilities.
If the organization takes DRaaS (Disaster Recovery as a Service) from a third party, it should consult the provider about its own plan and clear up all doubts and questions. Some DRaaS providers are-

  • International Business Machines (IBM) has been recognized as a Leader in The Forrester Wave™: DRaaS Providers, Q2 2019. Their core DRaaS offerings include IBM Resiliency Orchestration and IBM Resiliency Disaster Recovery as a Service. To know more, visit their website.
  • Acronis (a Singapore-based MNC) is one of the best DRaaS providers and has won 100+ awards. Their centralized cyber protection for remote workforces is very good. To know more, visit their website.
  • Veeam (HQ in Switzerland) is another leading DRaaS provider that has also won many awards. Veeam Cloud Connect provides fully integrated, affordable and efficient DR through a service provider. To know more, visit their website.
  • Axcient is another great DRaaS provider whose x360Recover is a cost-effective BCDR solution for Managed Service Providers (MSPs). To know more, visit their website.
  • Druva is a California-based software company and another great DRaaS provider, which has acquired CloudRanger. Their SaaS platform delivers all-in-one cloud backup and disaster recovery to deploy data protection across all workloads and data centres. To know more, visit their website.

There are many more. I discuss the details in the next articles in the series.
For cloud-based disaster recovery, the service providers (AWS, Azure, GCP) have their own approaches. Organizations should determine which one to take.
Also, while assessing a cloud provider, organizations should consider reliability, speed of recovery, usability, simplicity, scalability, and security.
I discuss them later in this article.

Store critical documents in a remote location- This is another important task of disaster recovery. To avoid unexpected loss of the most critical data and infrastructure documents, they should be stored at the backup site or any other reliable, trusted remote location. If this task is skipped and the critical data somehow gets lost, it is very hard or even impossible to recover. This may affect business operations and damage the organization’s reputation.
The infrastructure documents contain information about the original systems and their components. They are very important for recovering and restoring the IT infrastructure after a disaster event. If they are lost, recovery takes much longer and service downtime increases.

Test and update the DR plan- The DR plan should be tested multiple times to find any drawbacks or loopholes. With the test experience, the drawbacks are removed and the DR plan is improved.

Disaster recovery is a subset of business continuity. A Business Continuity Plan (BCP) is a long-term plan that ensures the continuity of business operations after a disaster event. It is composed of operational response, IT disaster recovery and emergency management.
A BCP and a DRP (Disaster Recovery Plan) are not the same. If an organization’s office or datacenter is affected by a disaster, the BCP will suggest that all employees work remotely, and the DRP will focus on how to get all employees back into a single office and how to replace all of the affected equipment.

Disaster Recovery Architecture in Microservices-

Traditionally, a large set of critical processes relies on a single monolithic application. Thus a one-size-fits-all approach is used in disaster recovery.
Single database systems and VMs are backed up using snapshots on a schedule and replicated to another region in quite a simple way. In case of disaster, failover is initiated to the second datacentre, which may take a full day. Here the same procedure is used for all applications.
But important data is not necessarily centralized in one fixed database; some persistent data may be stored elsewhere. This is where the problem occurs.

In the microservice architecture, a large system is broken down into many independent microservices. Each of them manages its own persistent data in different databases or storage locations to reduce dependencies; thus a distributed system is formed.

The BAC theorem for microservices states that 'When backing up an entire microservices architecture, it is not possible to have both availability and consistency.'
The traditional backup and recovery approach cannot meet the needs of this distributed system. Instead, a set of DR capabilities needs to be built and maintained for each microservice. Also, continuous resiliency tests should be conducted in this architecture. A highly visible risk management practice is important to find resilience risks and assess their impact on business.
To avoid cascading failures in a highly distributed system (i.e. to save a service from failure when other services fail), we should find out which service depends on which other services, and what to do when those services are not available. To do this, we should keep a service dependency map between processes and services and update it regularly. Thus resiliency can be improved.
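
A service dependency map can be as simple as a dictionary. The sketch below uses made-up service names (they are assumptions, not from any real system) and walks the map backwards to list every service that is directly or indirectly affected when one service fails.

```python
from collections import deque

# Hypothetical map: service -> services it calls directly.
DEPENDS_ON = {
    "web": ["orders", "auth"],
    "orders": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": [],
    "auth": [],
}


def impacted_by(failed, depends_on):
    # Invert the map so we can walk from a failed service to its callers.
    callers = {name: set() for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    # Breadth-first walk over callers, then callers of callers, and so on.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(sorted(impacted_by("auth", DEPENDS_ON)))
print(sorted(impacted_by("inventory", DEPENDS_ON)))
```

Services with a large impact set (here, auth) are the ones that most need a fallback plan for when they are unavailable.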
In a distributed architecture, not all services need the same level of resiliency. So we should document the resiliency capabilities of each service and test them continuously.
A practical example of DR in a distributed architecture will be given in the next articles of the series.

Now let's look at the various DR approaches of top cloud service providers- AWS, Azure, and GCP.

AMAZON WEB SERVICES

Amazon Web Services has several backup options. At the time of writing, there are 24 regions, each having multiple Availability Zones (AZs). AWS has several backup and recovery partners- CloudEndure, Druva CloudRanger, and N2WS are some of them.

Also, AWS has four types of storage partners-

  • Primary storage- Avere, NetApp, Panzura etc.
  • Backup and Recovery- IBM, DellEMC, Commvault, Cloudberry, Veritas etc.
  • Archive- Sonian, NetApp, Commvault etc.
  • BCDR- Zerto, CloudEndure, Druva, NetApp etc.

AWS has four disaster recovery strategies-

  1. Backup and Restore
  2. Pilot Light for Simple Recovery into AWS
  3. Warm Standby Solution
  4. Hot Standby with Multi-site Solution

The cheapest and simplest option is Backup and Restore. It has the longest recovery time among the four strategies. Here data from on-premises infrastructure is backed up to an Amazon S3 bucket (which gives 11 9’s durability) via AWS Direct Connect, AWS VPN or AWS Storage Gateway.
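
As a rough sketch of the Backup and Restore idea, the helper below copies local backup files to an S3 bucket under a dated prefix. The key layout, file path and bucket name are my own assumptions. The function only relies on the `upload_file(Filename, Bucket, Key)` method that the boto3 S3 client provides, and the client is passed in so the logic can be tried without an AWS account.

```python
from datetime import date
from pathlib import Path


def backup_key(path: Path, day: date, prefix: str = "backups") -> str:
    # Example: backups/2021-01-31/db.dump (dated prefixes keep daily copies apart).
    return f"{prefix}/{day.isoformat()}/{path.name}"


def backup_files(s3_client, paths, bucket: str, day: date):
    """Upload each file and return the object keys that were written.

    s3_client is expected to behave like boto3.client("s3"); only its
    upload_file(Filename, Bucket, Key) method is used.
    """
    keys = []
    for path in map(Path, paths):
        key = backup_key(path, day)
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

With boto3 installed and credentials configured, a call could look like `backup_files(boto3.client("s3"), ["/var/backups/db.dump"], "my-dr-bucket", date.today())`, where the path and the bucket name are placeholders.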

In Pilot Light, the most critical elements of the on-premises infrastructure are run on the AWS cloud. On the cloud, systems are in a standby or powered-off state and data is replicated to the cloud. They need to be preconfigured and kept up to date with the primary data centre.
When a disaster occurs, on the cloud we need to start the application servers and web servers, provision the load balancer, mount the data volumes, and reconfigure Route 53 to direct traffic to the AWS cloud as the failover site.

Warm standby maintains a scaled-down version of the primary data centre environment on the cloud in a running state.

When a disaster occurs, we need to add capacity on the cloud, i.e. scale out with EC2 Auto Scaling. We can also upgrade the EC2 instance types to meet higher capacity. Then we reconfigure Route 53 to direct traffic to the AWS cloud as the failover site. This has a shorter recovery time than Pilot Light.

Hot standby maintains an active-active configuration across on-premises and the AWS cloud, where data is constantly transferred synchronously in both directions. This provides a very low RPO & RTO but has the highest cost among the four strategies.

When a disaster occurs, traffic is automatically redirected to the AWS cloud as per the failover routing policy of Route 53. Optionally, we can scale up the capacity.
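
To give an idea of what a failover routing policy looks like, here is a sketch that builds the change batch for a primary and a secondary record. The domain name, IP addresses and health check ID are placeholders I made up. The dictionary follows the `ChangeBatch` structure accepted by the Route 53 `change_resource_record_sets` API.

```python
def failover_change_batch(record_name, primary_ip, secondary_ip, health_check_id, ttl=60):
    def record(role, ip, extra=None):
        record_set = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-site",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        record_set.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "DR failover pair",
        "Changes": [
            # Route 53 answers with the primary while its health check passes,
            # and with the secondary once the check fails.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip),
        ],
    }


# All values below are placeholders.
batch = failover_change_batch(
    "app.example.com.",
    "203.0.113.10",
    "198.51.100.20",
    health_check_id="11111111-2222-3333-4444-555555555555",
)
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

With boto3, the batch could be applied with `boto3.client("route53").change_resource_record_sets(HostedZoneId=zone_id, ChangeBatch=batch)`, where `zone_id` is the ID of your own hosted zone.
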
Amazon S3 has 99.999999999% (11 9’s) durability, with same-region, cross-region and cross-account replication options. S3 Glacier and S3 Glacier Deep Archive are low-cost storage classes for archival data.

There are lots of S3 tutorials on the internet, so here I show the S3 Glacier vault instead.

From the AWS management console open S3 Glacier (under Storage). Then create a vault.

In step 1, select the region and enter the vault name.

In step 2, create or choose an SNS topic for event notification, or don’t enable notification.

In step 3, for a new SNS topic, enter the topic name and display name. To trigger the notification, you have to select the job type.

In step 4, review the information and submit it.

After a few moments, you can view the vault and by clicking on it, view the details.

In the Vault Lock tab, you can set a vault lock policy, written in JSON, to enforce compliance controls and automate actions in the Glacier vault.
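
As an illustration, the sketch below generates a vault lock policy that denies deleting archives younger than a retention period. It is modelled on the kind of example AWS gives for Vault Lock; the account ID, region and vault name in the ARN are placeholders, and the 365-day period is my assumption.

```python
import json


def vault_lock_policy(vault_arn: str, min_age_days: int) -> str:
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "deny-delete-before-retention",
                "Principal": "*",
                "Effect": "Deny",
                "Action": "glacier:DeleteArchive",
                "Resource": [vault_arn],
                # Archives younger than min_age_days cannot be deleted.
                "Condition": {
                    "NumericLessThan": {"glacier:ArchiveAgeInDays": str(min_age_days)}
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Placeholder ARN, not a real vault.
policy_json = vault_lock_policy(
    "arn:aws:glacier:us-east-1:123456789012:vaults/example-vault", 365
)
print(policy_json)
```

The printed JSON is what would be pasted into the Vault Lock tab. Keep in mind that once a vault lock is completed, the policy can no longer be changed.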

In the upper left corner, click on Settings. Here you can set the data retrieval policy to manage the retrieval cost. You can also set the provisioned capacity, where at the time of writing each capacity unit costs $100 per month.

AWS Storage Gateway is a hybrid cloud storage service that connects on-premises appliances with cloud storage. It supports NFS, SMB, and iSCSI. There are 3 variants-

  1. File gateway can be used to store and access objects in S3 via NFS or SMB mount points from on-premises or EC2. It also provides a local cache of up to 32 TB.
  2. Volume gateway can be used to replicate block storage volumes asynchronously to S3 as point-in-time snapshots, using the iSCSI interface. Alternatively, the primary data can be stored in S3 while the frequently accessed data is cached on-premises.
  3. Tape gateway can be used as a virtual tape library or tape shelf with existing backup software. It consists of a virtual tape drive and a virtual media changer, and uses iSCSI.

From the AWS management console open AWS Storage Gateway and click on Get Started.

In the first step, we have to select the gateway type. I have selected the File gateway.


In the next step, we have to select the host platform and download the image. I selected VMware ESXi and downloaded the image in OVA format (aws-storage-gateway-latest.ova). Then I imported it into my local VMware Workstation and added an extra hard disk. All the data that we want to sync with the AWS file gateway has to be stored here.

I turned the imported appliance on and logged in with the username (admin) and password (password). The IP address of the gateway is shown here; we need it later, so I copied it into Notepad. Several configuration options are also shown. After doing the required configuration, exit the session.

Next, back on the AWS Storage Gateway console, click on Next.
Here we have to select the service endpoint. I selected a public endpoint as I am not using any VPC.

In the next step, we have to enter the IP address of the gateway, which was shown on the imported appliance in VMware. Then I clicked on Connect to gateway.

After a moment it connected and asked for activation. I selected GMT+5:30 as the gateway time zone, entered the gateway name, entered two tags and activated the gateway.

After some time, it was activated. It identified the drive that was set up on the appliance (OVA file) as a local hard disk. We can allocate this as a cache or keep it unallocated.

In the next step, we can create or select the CloudWatch log group for this gateway. I selected Disable logging and clicked on Exit.

The gateway details are then shown.

Here we can create a file share to S3, create a volume or create a tape. After clicking on Create file share, I entered my S3 bucket name, entered a file share name, selected NFS, and selected the gateway. We can optionally enable automated cache refresh.

Then in the next step, we have to select the storage class, object metadata, an IAM role for S3 bucket access and the encryption type.

In the next step, I reviewed the settings and created the file share.

The file share was then created. By clicking on it, we can see the details, including the commands to mount this file share on Linux, Windows or macOS.

I copied the Windows command and ran it in the command prompt with Z as the drive letter. The drive then appeared. We can copy files to it, and after some time they will automatically sync with the S3 bucket.

AWS Backup is a service that simplifies backup scheduling and lifecycle management across AWS services, centrally manages backup activities, and consistently archives data to meet compliance requirements.
From the AWS management console open AWS Backup. Then from the left menu click on Dashboard.
Here we can view the overview and the number of backup, restore and copy jobs in the last 24 hours.

From the left menu, click Backup Vaults. Initially, only a default vault exists. By clicking on Create we can create another one.
In the general section, enter a name (case sensitive) for the vault and select the encryption key. Then optionally enter tags and click on Create backup vault.

We can view the details.

From the left menu, click Backup Plans and click on Create.
Here we can create the backup plan from a template, build a new plan from scratch, or define the plan using JSON. I selected the second option and entered a name.
In the backup rule configuration section, I entered a rule name and selected daily backup frequency. I customized the backup window and entered 11:00 PM (UTC) as the start time (you can enter whatever you want). Then I selected start within 8 hours and complete within 7 days.
In the lifecycle, I scheduled the data to transition to cold storage 31 days after creation and to expire 365 days after creation.
Then I selected the backup vault I had created before. Here we can optionally copy the backup to another region by selecting the region.
We can optionally add tags to the recovery points and the backup plan.
Since September 2020, AWS Backup has supported application-consistent backups of Microsoft workloads (running on EC2) via the Microsoft Volume Shadow Copy Service. We can optionally enable Windows VSS to create, manage and restore consistent backups of Microsoft server instances and applications.
Finally, click on Create Plan.
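
The same plan can be defined as JSON instead of through the console. The sketch below builds a plan document with the values used above (daily at 11:00 PM UTC, start within 8 hours, complete within 7 days, cold storage after 31 days, expiry after 365 days). The plan, rule and vault names are placeholders. The keys follow the `BackupPlan` structure of the AWS Backup `create_backup_plan` API, and the 90-day check reflects the AWS Backup rule that a backup must stay in cold storage for at least 90 days.

```python
def daily_backup_plan(plan_name, rule_name, vault_name,
                      cold_after_days=31, delete_after_days=365):
    # AWS Backup keeps backups in cold storage for a minimum of 90 days,
    # so expiry must be at least 90 days after the cold-storage transition.
    if delete_after_days < cold_after_days + 90:
        raise ValueError("delete_after_days must be >= cold_after_days + 90")
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": rule_name,
                "TargetBackupVaultName": vault_name,
                "ScheduleExpression": "cron(0 23 * * ? *)",  # daily, 11:00 PM UTC
                "StartWindowMinutes": 8 * 60,                # start within 8 hours
                "CompletionWindowMinutes": 7 * 24 * 60,      # complete within 7 days
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": cold_after_days,
                    "DeleteAfterDays": delete_after_days,
                },
            }
        ],
    }


# Placeholder names, replace them with your own.
plan = daily_backup_plan("demo-plan", "daily-rule", "demo-vault")
print(plan["Rules"][0]["ScheduleExpression"])
```

With boto3, the plan could be submitted with `boto3.client("backup").create_backup_plan(BackupPlan=plan)`.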

We can view the details of the plan. First, we need to assign resources to the plan by clicking on Assign resources.

Enter a name and select the IAM role. Then assign the resources by tags or resource ID. Click on Assign resources.

Now it will be shown.

We can create an on-demand backup for certain resources. From the left menu click on Protected resources and then Create. Then select the resource to back up, and select ‘Create backup now’ or ‘Customize backup window’ to start the backup within the specified hours. We can specify when the backup will expire.
Next, select the backup vault and the IAM role that AWS Backup will assume when creating and managing backups. Optionally add tags and click on Create on-demand backup.

Then the backup job will be shown.

When it is completed, we can view the protected resources. By clicking on a resource, we can view its details and also restore it.

While planning disaster recovery on AWS cloud, organizations should keep in mind these best practices-

  • Follow the principle of least privilege in AWS IAM. Also, don’t use the root account for daily use, and keep its credentials in the safest place possible. Enable MFA for the root user and other users.
  • Use AWS CloudTrail to monitor API calls and all the activities of IAM users. Overlooking this can be very costly if a man-made disaster happens. Also, use Amazon CloudWatch to monitor resources and applications. You can also use AWS Config to automatically resolve some security issues, like closing unnecessarily or mistakenly opened ports in a security group.
  • Copy EBS volumes (via snapshots) to another AZ or another region. Then, in case of a disaster in one AZ, the data in the EBS volume will not be lost.
  • While launching Amazon RDS, consider Multi-AZ.
  • In the EC2 Auto Scaling group, select multiple AZs to launch instances and configure the ELB to send traffic to multiple AZs. This protects the application from AZ-level outages.
  • Amazon S3 automatically replicates data to multiple locations within the same region. But on February 28, 2017, a problem occurred with Amazon S3 in the Northern Virginia region, and customer companies were unable to access their data for approximately four hours. So, consider cross-region replication for S3 buckets to recover data if a region-wide outage happens.
  • Consider global tables in Amazon DynamoDB. Similar to S3, DynamoDB automatically replicates data to multiple locations within the same region. With a DynamoDB global table, a multi-region, multi-master database is deployed and any change to a table is propagated to the tables in the other regions. Also, DynamoDB data can be backed up to an S3 bucket.
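
For the S3 cross-region replication point above, here is a sketch of a replication configuration. The role ARN and bucket names are placeholders. The structure follows the `ReplicationConfiguration` parameter of the S3 `put_bucket_replication` API, and it assumes that versioning is already enabled on both the source and the destination bucket, which replication requires.

```python
def replication_config(role_arn, destination_bucket_arn, rule_id="dr-cross-region"):
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy the objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: replicate every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Placeholder ARNs, not real resources.
config = replication_config(
    "arn:aws:iam::123456789012:role/example-replication-role",
    "arn:aws:s3:::example-dr-bucket-in-another-region",
)
print(config["Rules"][0]["Destination"]["Bucket"])
```

With boto3, it could be applied with `boto3.client("s3").put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)`, where `source_bucket` is the name of your own source bucket.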
 

 

So, this is the end of the first part of the series. In the next articles, disaster recovery options in Microsoft Azure, GCP, IBM and many others will be explained.

If you have any feedback, please leave a comment.

Also, you can connect with me on LinkedIn.