- An Information System Contingency Plan (ISCP) is a pre-established plan for restoration of the services of a given information system after a disruption.
At a given point in time, a disaster occurs and systems need to be recovered. At this point the Recovery Point Objective (RPO) determines the maximum acceptable amount of data loss, measured in time. For example, the maximum tolerable data loss is 15 minutes.
Stage 3: Recovery
At this stage the systems are recovered and back online but not yet ready for production. The Recovery Time Objective (RTO) determines the maximum tolerable amount of time needed to bring all critical systems back online. This covers, for example, restoring data from backup or fixing a failure. In most cases this part is carried out by a system administrator, network administrator, storage administrator, etc.
Stage 4: Resume Production
At this stage all systems are recovered, the integrity of the system or data is verified, and all critical systems can resume normal operations. The Work Recovery Time (WRT) determines the maximum tolerable amount of time needed to verify the system and/or data integrity. This could be, for example, checking the databases and logs and making sure the applications or services are running and available. In most cases those tasks are performed by an application administrator, database administrator, etc. When all systems affected by the disaster are verified and/or recovered, the environment is ready to resume production.
The sum of RTO and WRT is defined as the Maximum Tolerable Downtime (MTD), which is the total amount of time that a business process can be disrupted without causing any unacceptable consequences. This value should be defined by the business management team or by someone like the CTO, CIO or IT manager.
This is of course a simple example of a Business Continuity/Disaster Recovery plan and should be included in your Business Impact Analysis (BIA).
https://defaultreasoning.com/2013/12/10/rpo-rto-wrt-mtdwth/
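The relationship between RTO, WRT and MTD described above can be sketched in a few lines of Python. This is only an illustration: the function names and the hour values are assumptions chosen for the example, not figures from the source article.

```python
from datetime import timedelta


def max_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD is defined in the notes above as the sum of RTO and WRT."""
    return rto + wrt


def within_mtd(recovery: timedelta, verification: timedelta, mtd: timedelta) -> bool:
    """True if the time actually spent recovering plus verifying fits inside the MTD."""
    return recovery + verification <= mtd


# Illustrative objectives only (assumed values).
rto = timedelta(hours=4)   # time allowed to bring critical systems back online
wrt = timedelta(hours=2)   # time allowed to verify system and data integrity
mtd = max_tolerable_downtime(rto, wrt)

print("MTD:", mtd)
print("3h recovery + 2h verification fits:", within_mtd(timedelta(hours=3), timedelta(hours=2), mtd))
```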
- Recovery Point Objective (RPO)
Maximum Tolerable Downtime (MTD)
Recovery Time Objective (RTO)
Recovery Point Objective (RPO)
The RPO is defined by business and departmental managers, and any designated data owners. It is described as a period of time. For example, the recovery point may be "one hour", "end of previous business day", or "one week".
It is derived based on:
How often data is updated;
How much expense and/or effort would be required of your users to reconstruct data created or updated since the last backup, if possible;
If reconstruction wouldn't be possible, how much recent data your company can tolerate losing permanently, considering the likelihood of a catastrophic data loss event.
Maximum Tolerable Downtime (MTD)
MTD is the maximum amount of time an application or data can be unavailable to users, as specified by business management. This is based on the impact on business functions, and analysis of anticipated lost revenue and other costs that are incurred for every hour, day, or week a given application or database might be unavailable.
https://www.jdfoxexec.com/resource-center/articles/mtd-rto-rpo/
https://en.wikipedia.org/wiki/Information_System_Contingency_Plan
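As a rough illustration of how the MTD can follow from the cost analysis described above, the sketch below assumes that losses accumulate linearly per hour of outage and that the business has named a total loss it is willing to absorb. Both the linear model and the figures are assumptions for the example; a real analysis would likely use a non-linear cost curve.

```python
def hours_until_loss_exceeds(hourly_cost: float, tolerable_loss: float) -> float:
    """Hours of outage after which cumulative loss reaches the tolerable limit.

    Assumes (for illustration only) a constant cost per hour of downtime.
    """
    if hourly_cost <= 0:
        raise ValueError("hourly_cost must be positive")
    return tolerable_loss / hourly_cost


# Hypothetical figures: 5,000 per hour of lost revenue and other costs,
# and a total loss of 40,000 that management considers tolerable.
candidate_mtd_hours = hours_until_loss_exceeds(hourly_cost=5_000, tolerable_loss=40_000)
print("Candidate MTD (hours):", candidate_mtd_hours)
```

The result is only a starting point for the discussion with business management, who specify the MTD.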
- 5 WAYS TO TEST IT DISASTER RECOVERY PLANS
The five types of disaster recovery tests:
Paper test: Individuals read and annotate recovery plans.
Walkthrough test: Groups walk through plans to identify issues and changes.
Simulation: Groups go through a simulated disaster to identify whether emergency response plans are adequate.
Parallel test: Recovery systems are built/set up and tested to see if they can perform actual business transactions to support key processes. Primary systems still carry the full production workload.
Cutover test: Recovery systems are built/set up to assume the full production workload. You disconnect primary systems.
https://www.dummies.com/programming/networking/5-ways-to-test-it-disaster-recovery-plans/
- TESTING DISASTER RECOVERY PLANS
Structured Walk-Through Testing
During a structured walk-through test, disaster recovery team members meet to verbally walk through the specific steps of each component of the disaster recovery process as documented in the disaster recovery plan. The purpose of the structured walk-through test is to confirm the effectiveness of the plan and to identify gaps, bottlenecks or other weaknesses in the plan.
Checklist Testing
A checklist test determines if sufficient supplies are stored at the backup site, telephone number listings are current, quantities of forms are adequate, and a copy of the recovery plan and necessary operational manuals are available.
Simulation Testing
During this test, the organization simulates a disaster so normal operations will not be interrupted. A disaster scenario should take into consideration the purpose of the test, objectives, type of test, timing, scheduling, duration, test participants, assignments, constraints, assumptions, and test steps.
Parallel Testing
A parallel test can be performed in conjunction with the checklist test or simulation test. Under this scenario, historical transactions, such as yesterday’s transactions, are processed against the preceding day’s backup files at the contingency processing site or hot-site. All reports produced at the alternate site for the current business date should agree with those reports produced at the existing processing site.
Full-interruption Testing
A full-interruption test activates the total disaster recovery plan. This test is costly and could disrupt normal operations.
https://www.drj.com/drj-world-archives/dr-plan-testing/testing-disaster-recovery-plans.html
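The parallel test described above ends with a comparison: reports from the alternate site should agree with those from the existing processing site. A minimal sketch of that comparison follows, assuming each report has been reduced to a mapping from a line item to an integer total (for example, amounts in cents). The report structure and names are hypothetical.

```python
def compare_reports(primary: dict[str, int], alternate: dict[str, int]) -> dict[str, tuple]:
    """Return the line items where the two sites disagree.

    Each mismatch maps the item name to (primary_value, alternate_value);
    a value of None means the item is missing from that site's report.
    """
    mismatches = {}
    for item in sorted(primary.keys() | alternate.keys()):
        p, a = primary.get(item), alternate.get(item)
        if p != a:
            mismatches[item] = (p, a)
    return mismatches


# Hypothetical daily totals, in cents, from both processing sites.
primary_report = {"orders": 125_000, "refunds": 4_200, "fees": 1_310}
alternate_report = {"orders": 125_000, "refunds": 4_100}

differences = compare_reports(primary_report, alternate_report)
print("Parallel test passed" if not differences else f"Mismatches: {differences}")
```

An empty result means the alternate site reproduced the primary site's reports for the items compared.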
- Disaster Recovery as a Service (DRaaS) is the replication and hosting of physical or virtual servers by a third-party to provide failover in the event of a man-made or natural catastrophe.
Typically, DRaaS requirements and expectations are documented in a service-level agreement (SLA), and the third-party vendor provides failover to a cloud computing environment on either a contract or pay-per-use basis.
http://whatis.techtarget.com/definition/disaster-recovery-as-a-service-DRaaS
- Veeam® enables Disaster Recovery-as-a-Service (DRaaS) as part of a comprehensive availability strategy, embracing investments made in your datacenter and extending them through the hybrid cloud.
https://www.veeam.com/disaster-recovery-as-a-service-draas.html
High availability is a feature that provides redundancy and fault tolerance.
What is Redundancy
Redundancy is basically extra hardware or software that can be used as a backup if the main hardware or software fails. Redundancy can be achieved via load clustering, failover, RAID, load balancing, and high availability in an automated fashion.
A higher layer of redundancy is achieved when the backup device is completely separate from the primary device. For example, a backup internet line is provided by another ISP, giving a completely separate physical link and connection from the primary internet connection, or a redundant piece of hardware resides in another building.
http://www.internet-computer-security.com/Firewall/Failover.html
A High Availability system is one that is designed to be available 99.999% of the time, or as close to it as possible. Usually this means configuring a failover system that can handle the same workloads as the primary system.
FAULT TOLERANCE
A Fault Tolerant system is extremely similar to HA, but goes one step further by guaranteeing zero downtime. HA still comes with a small portion of downtime, hence the ideal of a perfect HA strategy reaching “five nines” rather than 100% uptime. The time it takes for the intermediary layer, like the load balancer or hypervisor, to detect a problem and restart the VM can add up to minutes or even hours over the course of yearly runtime.
https://www.greenhousedata.com/blog/high-availability-vs-fault-tolerance-vs-disaster-recovery
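To put the "five nines" figure above in concrete terms, the following sketch converts an availability percentage into the downtime it permits per year. It assumes a 365-day year; the function name is chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% availability allows about "
          f"{allowed_downtime_minutes(target):.2f} minutes of downtime per year")
```

At 99.999% the budget is a little over five minutes per year, which is why detection and restart delays in the load balancer or hypervisor matter.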
- Disaster Recovery as a Service (DRaaS) from Node4 is ideal for companies who need continuous protection of the data and applications that are essential for the operation of their critical business functions. DRaaS is delivered using award-winning software from Zerto to replicate your virtual machines and maintain standby copies on N4Compute, our highly resilient Cloud virtualisation platform.
http://www.node4.co.uk/cloud/draas/
- Fujitsu Backup as a Service (BaaS) provides a resilient, cloud-based backup and recovery service. Fujitsu Backup as a Service supports full system recovery, providing much more than folder and file backup and recovery. Delivered from FUJITSU Cloud, it offers the levels of speed, convenience and reliability demanded by organizations today.
http://www.fujitsu.com/global/services/infrastructure/iaas/baas/
- Backup as a Service (BaaS) provides backup and recovery operations from the cloud. The cloud-based BaaS provider maintains necessary backup equipment, applications, process and management in their data center. The customer will have some on-site installation – an appliance and backup agents are common – but there is no need to buy backup servers and software, run upgrades and patches, or purchase dedupe appliances.
- DRaaS/RaaS. Disaster Recovery as a Service, or more simply Recovery as a Service, offers more recovery options than the backup recovery of BaaS. BaaS will recover your backed up files, and RaaS recovers your files and applications within contracted RTO and/or RPO periods. It is more costly than BaaS but can be a good option if you do not want to perform your own storage infrastructure recovery in case of disaster.
http://www.datamation.com/cloud-computing/backup-as-a-service-to-baas-or-not-to-baas-1.html
- Backup as a service (BaaS) is an approach to backing up data that involves purchasing backup and recovery services from an online data backup provider. Instead of performing backup with a centralized, on-premises IT department, BaaS connects systems to a private, public or hybrid cloud managed by the outside provider. Backup as a service is easier to manage than other offsite services. Instead of worrying about rotating and managing tapes or hard disks at an offsite location, data storage administrators can offload maintenance and management to the provider.
http://searchdatabackup.techtarget.com/definition/backup-as-a-service-BaaS
- What is disaster recovery?
Disaster recovery (DR) consists of IT technologies and best practices designed to prevent or minimize data loss and business disruption resulting from catastrophic events—everything from equipment failures and localized power outages to cyberattacks, civil emergencies, criminal or military attacks, and natural disasters.
Business continuity planning
Business continuity planning creates systems and processes to ensure that all areas of your enterprise will be able to maintain essential operations or be able to resume them as quickly as possible in the event of a crisis or emergency. Disaster recovery planning is the subset of business continuity planning that focuses on recovering IT infrastructure and systems.
Disaster recovery planning
Business impact analysis
The creation of a comprehensive disaster recovery plan begins with business impact analysis. When performing this analysis, you’ll create a series of detailed disaster scenarios that can then be used to predict the size and scope of the losses you’d incur if certain business processes were disrupted.
Risk analysis
Assessing the likelihood and potential consequences of the risks your business faces is also an essential component of disaster recovery planning.
Prioritizing applications
Separate your systems and applications into three tiers, depending on how long you could stand to have them be down and how serious the consequences of data loss would be.
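One way to picture the three-tier split is sketched below. The source does not give tier boundaries, so the thresholds (one hour and one day) are assumptions, as are the application names. The sketch classifies on the stricter of tolerable downtime and tolerable data loss.

```python
from datetime import timedelta

# Hypothetical tier boundaries; the source text does not specify any.
TIER_1_MAX = timedelta(hours=1)
TIER_2_MAX = timedelta(hours=24)


def assign_tier(tolerable_downtime: timedelta, tolerable_data_loss: timedelta) -> int:
    """Return 1 (most critical) to 3, using the stricter of the two tolerances."""
    strictest = min(tolerable_downtime, tolerable_data_loss)
    if strictest <= TIER_1_MAX:
        return 1
    if strictest <= TIER_2_MAX:
        return 2
    return 3


# Hypothetical applications: (tolerable downtime, tolerable data loss).
applications = {
    "payments": (timedelta(minutes=30), timedelta(minutes=5)),
    "intranet-wiki": (timedelta(hours=8), timedelta(hours=4)),
    "archive-search": (timedelta(days=3), timedelta(days=2)),
}

tiers = {name: assign_tier(*tolerances) for name, tolerances in applications.items()}
print(tiers)
```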
Documenting dependencies
The next step in disaster recovery planning is creating a complete inventory of your hardware and software assets. It’s essential to understand critical application interdependencies at this stage.
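Once application interdependencies are documented, they can be used to work out an order in which to bring systems back. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the service names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key depends on the services in its set (hypothetical inventory).
dependencies = {
    "web-frontend": {"app-server"},
    "app-server": {"database", "auth-service"},
    "auth-service": {"database"},
    "database": set(),
}

# static_order() yields every service after all of its dependencies,
# and raises graphlib.CycleError if the dependencies are circular.
recovery_order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(recovery_order))
```

A circular dependency surfacing here is itself a useful finding, since it means neither system can be restored first without a workaround.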
Establishing recovery time objectives, recovery point objectives, and recovery consistency objectives
By considering your risk and business impact analyses, you should be able to establish objectives for how long you’d need it to take to bring systems back up, how much data you could stand to lose, and how much data corruption or deviation you could tolerate.
Your recovery time objective (RTO) is the maximum amount of time it should take to restore application or system functioning following a service disruption.
Your recovery point objective (RPO) is the maximum age of the data that must be recovered in order for your business to resume regular operations.
A recovery consistency objective (RCO) is established in the service-level agreement (SLA) for continuous data protection services. It is a metric that indicates how many inconsistent entries in business data from recovered processes or systems are tolerable in disaster recovery situations.
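After a test or a real incident, the RTO and RPO defined above can be checked against what actually happened. The following is a minimal sketch under assumed names and timestamps: downtime is measured from the start of the disruption to service restoration, and data age from the newest recoverable data to the start of the disruption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum time to restore functioning
    rpo: timedelta  # maximum age of the recovered data


def meets_objectives(obj: Objectives, disruption_start: datetime,
                     service_restored: datetime,
                     last_recoverable_data: datetime) -> dict[str, bool]:
    """Compare the observed downtime and data age with the objectives."""
    downtime = service_restored - disruption_start
    data_age = disruption_start - last_recoverable_data
    return {"rto_met": downtime <= obj.rto, "rpo_met": data_age <= obj.rpo}


# Illustrative incident timeline (assumed values).
objectives = Objectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
start = datetime(2024, 1, 1, 12, 0)
result = meets_objectives(
    objectives,
    disruption_start=start,
    service_restored=start + timedelta(hours=3),
    last_recoverable_data=start - timedelta(minutes=10),
)
print(result)
```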
Regulatory compliance issues
All disaster recovery software and solutions that your enterprise has established must satisfy any data protection and security requirements that you’re mandated to adhere to.
Choosing technologies
Backups serve as the foundation upon which any solid disaster recovery plan is built.
Choosing recovery site locations
On the one hand, a copy of your data should be stored somewhere that’s geographically distant enough from your headquarters or office locations that it won’t be affected by the same seismic events, environmental threats, or other hazards as your main site. On the other hand, backups stored offsite always take longer to restore from than those located on-premises at the primary site, and network latency can be even greater across longer distances.
Continuous testing and review
Simply put, if your disaster recovery plan has not been tested, it cannot be relied upon.
All employees with relevant responsibilities should participate in the disaster recovery test exercise, which may include maintaining operations from the failover site for a period of time.
Disaster Recovery-as-a-Service (DRaaS)
Disaster-Recovery-as-a-Service (DRaaS) is one of the most popular and fast-growing managed IT service offerings available today. Your vendor will document RTOs and RPOs in a service-level agreement (SLA) that outlines your downtime limits and application recovery expectations.
Cloud DR
Most on-premises DR solutions will incur costs for hardware, power, labor for maintenance and administration, software, and network connectivity. In addition to the upfront capital expenditures involved in the initial setup of your DR environment, you’ll need to budget for regular software upgrades. Because your DR solution must remain compatible with your primary production environment, you’ll want to ensure that your DR solution has the same software versions. Depending upon the specifics of your licensing agreement, this might effectively double your software costs.
Moving to a DRaaS subscription can reduce your hardware and software expenditures, and it can lower your labor costs by moving the burden of maintaining the failover site to the vendor.
https://www.ibm.com/cloud/learn/disaster-recovery
- Section 1. Example: Major goals of a disaster recovery plan
Here are the major goals of a disaster recovery plan.
Section 2. Example: Personnel
You can use the tables in this topic to record your data processing personnel. You can include a copy of the organization chart with your plan.
Section 3. Example: Application profile
You can use the Display Software Resources (DSPSFWRSC) command to complete the table in this topic.
Section 4. Example: Inventory profile
You can use the Work with Hardware Products (WRKHDWPRD) command to complete the table in this topic.
Section 5. Information services backup procedures
Use these procedures for information services backup.
Section 6. Disaster recovery procedures
For any disaster recovery plan, these three elements should be addressed.
Section 7. Recovery plan for mobile site
This topic provides information about how to plan your recovery task at a mobile site.
Section 8. Recovery plan for hot site
An alternate hot site plan should provide for an alternative (backup) site. The alternate site has a backup system for temporary use while the home site is being reestablished.
Section 9. Restoring the entire system
You can learn how to restore the entire system.
Section 10. Rebuilding process
The management team must assess the damage and begin the reconstruction of a new data center.
Section 11. Testing the disaster recovery plan
In successful contingency planning, it is important to test and evaluate the plan regularly.
Section 12. Disaster site rebuilding
Use this information to do disaster site rebuilding.
Section 13. Record of plan changes
Keep your plan current, and keep records of changes to your configuration, your applications, and your backup schedules and procedures.
https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rzarm/rzarmdisastr.htm