The Price of Downtime: Calculating Your RTOs & RPOs

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two important metrics for business continuity planning.

RTO refers to the maximum acceptable length of time that a business process can be disrupted before there is an unacceptable impact on the business. It is the time allowed after a disruption to restore a process or service to its agreed service level.

RPO refers to the maximum acceptable amount of data loss measured in time. It is the maximum tolerable period in which data might be lost from an IT service due to a disruption.

Setting appropriate RTOs and RPOs for critical business processes and IT services is essential for effective business continuity planning. They help quantify maximum acceptable downtime and data loss, which informs strategies for resilience such as backup frequency, redundancy, and disaster recovery.

Setting Recovery Time Objectives

The recovery time objective (RTO) is the maximum acceptable length of time that a system or application can be down after a failure or disaster occurs. The RTO defines the time frame in which systems, applications, or functions must be restored to avoid unacceptable consequences for the business.

RTO is measured from the moment of the disruption to the full recovery of the affected systems or applications. For example, if a critical sales system goes down at 2 pm on Tuesday and is fully restored by 11 am on Wednesday, the actual recovery time is 21 hours. That outage meets the objective only if the RTO for the system was set at 21 hours or more.
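The 21-hour figure in the example above can be checked by subtracting timestamps. The sketch below is illustrative only: the function names `recovery_time` and `meets_rto`, the calendar dates, and the 24-hour target are assumptions, not part of any standard tool.

```python
from datetime import datetime, timedelta


def recovery_time(outage_start: datetime, restored_at: datetime) -> timedelta:
    """Elapsed time between the start of an outage and full restoration."""
    if restored_at < outage_start:
        raise ValueError("restoration cannot precede the outage")
    return restored_at - outage_start


def meets_rto(outage_start: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """True when the measured recovery time is within the target RTO."""
    return recovery_time(outage_start, restored_at) <= rto


# Tuesday 2 pm to Wednesday 11 am (the dates themselves are made up).
down = datetime(2024, 1, 2, 14, 0)
up = datetime(2024, 1, 3, 11, 0)

print(recovery_time(down, up).total_seconds() / 3600)  # 21.0
print(meets_rto(down, up, rto=timedelta(hours=24)))    # True
print(meets_rto(down, up, rto=timedelta(hours=4)))     # False
```

The same outage passes a 24-hour objective and fails a 4-hour one, which is why the target has to be agreed before the incident rather than derived from it.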

Typical RTOs vary greatly depending on the criticality and purpose of the system:

  • Mission-critical systems – These systems are essential for core business operations. Typical RTOs are 0-4 hours. Examples may include ERP systems, trading systems, or e-commerce sites.
  • Business-critical systems – Important for business operations but not immediately essential. Typical RTOs are 4-24 hours. Examples include email, internal communications/intranet sites, and HR systems.
  • Business operational systems – Needed for optimal business operations but can sustain some downtime. Typical RTOs are 1-5 days. Examples include reporting systems, supply chain systems, and CRM systems.
  • Non-critical systems – Will not significantly impact business in the short term if unavailable. Typical RTOs are 1 week or longer. Examples include learning management systems and external marketing sites.
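The tiers above can be expressed as a simple lookup, which is handy when tagging a system inventory. This is a sketch under stated assumptions: the function name `tier_for_rto` is hypothetical, a boundary value such as exactly 4 or 24 hours is assigned to the stricter tier, and anything above 5 days is treated as non-critical even though the list leaves a gap between 5 days and 1 week.

```python
# Upper bound of each tier in hours, taken from the typical ranges listed above.
TIERS = [
    (4, "mission-critical"),
    (24, "business-critical"),
    (5 * 24, "business operational"),
]


def tier_for_rto(rto_hours: float) -> str:
    """Map a target RTO (in hours) to the tier whose typical range contains it."""
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    for upper_bound, name in TIERS:
        if rto_hours <= upper_bound:
            return name
    return "non-critical"


for hours in (2, 12, 72, 200):
    print(hours, tier_for_rto(hours))
```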

The costs of extended downtime and data loss should be analyzed when setting RTOs. Shorter RTOs require greater investments in resilience but reduce potential downtime costs. Organizations should set RTOs based on business needs, costs, and risks.

Setting Recovery Point Objectives

The recovery point objective (RPO) defines the maximum amount of data loss that is acceptable in the event of a disruption. It is a measurement of the time between the last backup of data and when the disruption occurred. For example, if you have an RPO of 24 hours, you are willing to lose up to 24 hours’ worth of data in the event of an outage.

The RPO a system can actually achieve follows from its backup frequency, and it should be set against the business's tolerance for data loss. Most organizations perform regular backups of critical systems and data, whether hourly, daily, weekly or otherwise. The more frequent the backups, the lower the achievable RPO: in the worst case, a failure just before the next backup loses one full backup interval of data. Hourly backups therefore support a lower RPO than daily or weekly backups.
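The link between backup frequency and RPO can be sketched as a worst-case check. The simplifying assumptions here are that backups complete instantly and always restore cleanly, so the most data that can be lost is one backup interval. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Most data that can be lost if a failure hits just before the next backup.

    Assumes backups are instantaneous and always restorable.
    """
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss for this schedule is within the target RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


rpo = timedelta(hours=4)  # hypothetical target
print(meets_rpo(timedelta(hours=1), rpo))   # True: hourly backups
print(meets_rpo(timedelta(hours=24), rpo))  # False: daily backups
```

In practice backups take time to run and occasionally fail, so the real exposure is somewhat larger than the interval alone.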

It’s important to analyze which systems can tolerate some data loss versus no data loss. Transactional systems like e-commerce or banking may require an RPO of near zero, with constant data replication. Other systems like file servers may be able to sustain some data loss without major impact. Setting appropriate backup frequencies for each system will help minimize potential data loss.

The costs of lowering RPO need to be weighed against the business impact of potential data loss. More frequent backups require additional investments in storage, network capacity, and backup tools. Business leaders need to define their tolerance for data loss across critical systems, which will guide the technical requirements and investments for achieving the target RPO.
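One way to weigh these costs is to add the yearly cost of a backup schedule to the data loss it is expected to leave uncovered. The sketch below is illustrative only: every figure, the `BackupOption` class, and the function names are hypothetical, and it uses the worst-case loss (one full backup interval per incident), which is deliberately conservative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupOption:
    name: str
    interval_hours: float  # worst-case hours of data lost per incident
    annual_cost: float     # storage, network capacity, and tooling


def expected_annual_cost(option: BackupOption,
                         incidents_per_year: float,
                         cost_per_hour_of_lost_data: float) -> float:
    """Yearly cost of the schedule plus the expected cost of the data it can lose."""
    expected_loss = incidents_per_year * option.interval_hours * cost_per_hour_of_lost_data
    return option.annual_cost + expected_loss


def cheapest(options, incidents_per_year, cost_per_hour_of_lost_data):
    """Option with the lowest combined cost under the given assumptions."""
    return min(
        options,
        key=lambda o: expected_annual_cost(o, incidents_per_year, cost_per_hour_of_lost_data),
    )


# Made-up inputs: one incident every two years, $500 per hour of lost data.
options = [
    BackupOption("weekly", 168, 1_000),
    BackupOption("daily", 24, 5_000),
    BackupOption("hourly", 1, 20_000),
]
for o in options:
    print(o.name, expected_annual_cost(o, 0.5, 500))
print(cheapest(options, 0.5, 500).name)  # daily
```

With these inputs the daily schedule wins. A higher cost per hour of lost data or a higher incident rate shifts the answer toward more frequent backups.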

Business Impact Analysis

A business impact analysis (BIA) is a key component of defining RTOs and RPOs. The goal of a BIA is to identify and prioritize critical systems, data, and operations within an organization. This allows leadership to make informed decisions about recovery objectives based on potential impacts.

Conducting a thorough BIA involves several steps:

  1. Inventory systems and data. Catalog all IT systems, software, data stores, and equipment. This provides a starting point to evaluate.
  2. Classify by criticality. Not all systems are created equal. Classify each item by how critical it is to sustain key business functions and operations. This helps identify priorities for recovery.
  3. Identify dependencies. Understand how systems depend on one another, as well as on external providers or vendors. An outage in one area can cripple others if dependencies are not mapped.
  4. Estimate downtime impacts. Quantify the impacts of potential outages in terms of revenue loss, productivity loss, reputational harm, regulatory compliance, and other business risks.
  5. Calculate outage costs. Translate downtime impacts into dollar amounts to understand the financial costs of disruption. This data informs RTO and RPO decisions.
  6. Define recovery priorities. With a complete understanding of downtime impacts and costs, the business can decide which systems truly need near-instant recovery, and which can sustain longer outages. This allows leadership to set realistic RTOs and RPOs.
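Dependency mapping has a direct consequence for recovery objectives: a system cannot come back faster than the things it relies on. The sketch below flags cases where a dependency has a longer RTO than the system that needs it. The system names, RTO values, and the function `rto_conflicts` are hypothetical, and only direct dependencies are checked.

```python
def rto_conflicts(rto_hours: dict, depends_on: dict) -> list:
    """Return (system, dependency) pairs where the dependency's RTO is longer
    than the RTO of the system that relies on it."""
    conflicts = []
    for system, dependencies in depends_on.items():
        for dependency in dependencies:
            if rto_hours[dependency] > rto_hours[system]:
                conflicts.append((system, dependency))
    return conflicts


# Made-up inventory: target RTO in hours for each system.
rto_hours = {"storefront": 4, "payments-db": 24, "auth": 2, "reporting": 72}
depends_on = {
    "storefront": ["payments-db", "auth"],
    "reporting": ["payments-db"],
}

print(rto_conflicts(rto_hours, depends_on))  # [('storefront', 'payments-db')]
```

Here the storefront's 4-hour target is unrealistic until the database it depends on is given a target of 4 hours or less.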

Conducting a BIA is a collaborative process between IT and business leaders. IT brings an understanding of systems dependencies and risks, while business leaders understand operations impacts. Working together brings clarity to recovery objectives.

Calculating Downtime Costs

When determining RTOs and RPOs, it’s important to calculate the potential costs of downtime for your organization. This provides justification for investments in minimizing recovery times. There are three main categories of downtime costs to consider. 

Financial Costs

  • Revenue losses from being unable to conduct business. Estimate revenue per hour and multiply by estimated downtime.
  • Penalties for breaching contracts or SLAs. Check agreements for clauses related to outages.
  • Lost productivity of employees who can’t work. Multiply employee loaded wages by downtime.
  • Costs to rent temporary infrastructure, office space, equipment etc. Research rental costs.

Reputational Costs

  • Loss of customer trust and loyalty. Estimate customer churn and acquisition costs.
  • Damage to brand reputation. Harder to quantify, but important.
  • Regulatory fines for failing compliance or regulations. Check guidelines.

Productivity Costs 

  • Time spent recovering data and rebooting systems. Estimate employee loaded wages.
  • Inefficiencies from employees having to use downtime workaround procedures. Estimate productivity loss.
  • Revenue losses from billing delays, shipment delays etc. Estimate revenue per hour.

Quantification Models

There are a few common models used to quantify downtime costs:

  • Annual revenue or budget divided by total hours per year. Gives an hourly loss estimate.
  • Average revenue or wages per employee x number of affected employees. Estimates productive losses.
  • Typical customer lifetime value x estimated customer defections. Quantifies reputational damage.
  • Use past outage costs if available. Adjust them for inflation and for growth in the business since the outage.

The goal is to determine realistic downtime cost estimates to inform your RTO and RPO decisions.
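The first three models above are straightforward arithmetic and can be combined into a rough estimate. This is a minimal sketch: all input figures are invented, the function names are hypothetical, and summing the three components assumes they do not double-count the same loss.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def hourly_revenue_loss(annual_revenue: float) -> float:
    """Annual revenue divided by total hours per year."""
    return annual_revenue / HOURS_PER_YEAR


def productivity_loss(loaded_hourly_wage: float, affected_employees: int,
                      downtime_hours: float) -> float:
    """Loaded wages paid to employees who cannot work during the outage."""
    return loaded_hourly_wage * affected_employees * downtime_hours


def churn_loss(customer_lifetime_value: float, customers_lost: int) -> float:
    """Lifetime value of the customers expected to leave after the outage."""
    return customer_lifetime_value * customers_lost


def estimate_outage_cost(annual_revenue, downtime_hours, loaded_hourly_wage,
                         affected_employees, customer_lifetime_value, customers_lost):
    """Sum of lost revenue, lost productivity, and churn for one outage."""
    return (hourly_revenue_loss(annual_revenue) * downtime_hours
            + productivity_loss(loaded_hourly_wage, affected_employees, downtime_hours)
            + churn_loss(customer_lifetime_value, customers_lost))


# Made-up inputs: $8.76M revenue, 4-hour outage, 50 staff at $60/hour, 10 customers at $2,000.
print(estimate_outage_cost(8_760_000, 4, 60, 50, 2_000, 10))  # 36000.0
```

Dividing revenue evenly across all 8,760 hours understates the loss for a business that earns most of its revenue in working hours, so a business-hours denominator may be more realistic for some organizations.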

Recovery Strategies

Organizations have several options for recovery strategies to meet their RTO and RPO targets. The key is choosing the right approach based on business needs and costs.

Backup

Backups create copies of data that can be used to restore systems and data after a disruption. Backups can be performed on-premises using local storage or in the cloud. More frequent backups allow for lower RPO.

Incremental backups capture only changes since the last backup, reducing storage needs. Full backups capture everything but take more time and resources. A common approach is weekly full backups plus daily incrementals.  
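The weekly-full-plus-daily-incremental approach affects recovery time as well as storage, because a restore has to replay the last full backup and every incremental taken after it. The sketch below works out that restore chain. The `Backup` class, the `restore_chain` function, and the dates are hypothetical, and incrementals are assumed to capture changes since the previous backup of any kind.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Backup:
    taken_on: date
    kind: str  # "full" or "incremental"


def restore_chain(backups, target: date):
    """Backups to apply, oldest first, to restore the state as of `target`."""
    usable = sorted((b for b in backups if b.taken_on <= target),
                    key=lambda b: b.taken_on)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup on or before the target date")
    # Start at the most recent full backup and keep every incremental after it.
    return usable[full_positions[-1]:]


# Made-up schedule: full backup on Sunday, incrementals Monday through Friday.
sunday = date(2024, 1, 7)
backups = [Backup(sunday, "full")] + [
    Backup(sunday + timedelta(days=n), "incremental") for n in range(1, 6)
]

chain = restore_chain(backups, target=date(2024, 1, 10))
print([(b.taken_on.isoformat(), b.kind) for b in chain])
```

Restoring to Wednesday needs four backups here, the full plus three incrementals. Longer chains mean longer restores, which is one reason a schedule that satisfies the RPO can still miss the RTO.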

Mirroring

Mirroring maintains a synchronized, continuously updated copy of data at a secondary site. This allows for very low RPO, as little data is lost when an outage occurs. Writes are replicated to the secondary site in near real-time.

Mirroring requires high bandwidth and low latency between sites. Costs are higher than simple backup, but recovery is faster. Mirrored data can also be accessed for reporting and analytics.

Redundancy 

Having redundant, fault-tolerant components avoids single points of failure. Examples include RAID disk arrays, redundant power supplies, duplicate internet links, and hot standby servers.

Redundancy comes with higher upfront costs but prevents downtime from component failures. Systems remain operational if an element fails, allowing for repairs on a non-urgent timeline.
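The benefit of redundancy can be estimated with the standard formula for components in parallel: if each copy is available a fraction a of the time and any one copy is enough, the combined availability is 1 - (1 - a)^n. The key assumption, which often does not hold fully in practice, is that the copies fail independently. The function name is hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one suffices.

    Assumes failures are independent; shared power, network, or software
    faults violate this and make the real figure lower.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1 - (1 - component_availability) ** copies


print(round(parallel_availability(0.99, 1), 6))  # 0.99
print(round(parallel_availability(0.99, 2), 6))  # 0.9999
```

Under these assumptions, doubling a 99% component takes expected unavailability from about 1% to about 0.01%.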

Testing & Auditing

Regular testing and auditing are crucial for maintaining recovery time and recovery point objectives. Organizations should conduct tests to validate that recovery plans are effective and ensure they meet stated objectives.

Disaster recovery simulations exercise an organization’s ability to restore systems and resume operations after an outage. These simulations can range from simple tests focused on specific components to full-scale exercises that replicate a real disaster scenario. Tabletop exercises that walk through response procedures without actual recovery operations are also valuable.

After completing a test, organizations should document the results and identify any gaps between the test outcomes and stated recovery objectives. Any issues that surface during testing provide an opportunity to improve plans. Testing may reveal the need for additional redundancy, revised procedures, more training, or other enhancements.
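Documenting gaps can be as simple as comparing what a test measured against the stated objectives. The sketch below is one possible shape for that comparison. The `DrillResult` class, the `find_gaps` function, and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillResult:
    system: str
    rto_hours: float
    measured_recovery_hours: float
    rpo_hours: float
    measured_data_loss_hours: float


def find_gaps(results):
    """Return (system, objective, hours_over) for every missed objective."""
    gaps = []
    for r in results:
        if r.measured_recovery_hours > r.rto_hours:
            gaps.append((r.system, "RTO", r.measured_recovery_hours - r.rto_hours))
        if r.measured_data_loss_hours > r.rpo_hours:
            gaps.append((r.system, "RPO", r.measured_data_loss_hours - r.rpo_hours))
    return gaps


# Made-up results from a recovery test.
results = [
    DrillResult("email", rto_hours=24, measured_recovery_hours=10,
                rpo_hours=4, measured_data_loss_hours=1),
    DrillResult("crm", rto_hours=4, measured_recovery_hours=6,
                rpo_hours=1, measured_data_loss_hours=1),
]

print(find_gaps(results))  # [('crm', 'RTO', 2)]
```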

Audits complement testing by providing an independent assessment of recovery capabilities. Internal or third-party auditors can examine the completeness of recovery plans, policy compliance, system configurations, and other aspects. Regular audits help ensure organizations adhere to established RTOs and RPOs over time. They also prompt reviews of objectives to confirm they remain appropriate as business needs evolve.

By continually testing and auditing their recovery preparations, organizations can verify that recovery plans match business needs. This helps avoid situations where untested plans fail to deliver the expected RTOs and RPOs during an actual disruption. A test and audit program provides ongoing assurance that recovery investments effectively mitigate downtime risks.

Cloud Considerations

The adoption of cloud services has changed how organizations think about recovery objectives. With infrastructure and applications hosted in the cloud, much of the downtime risk shifts from local hardware failures to provider outages and internet connectivity.

Cloud service providers offer service level agreements (SLAs) that commit to uptime percentages. A common SLA is 99.9% uptime, equating to roughly 8.8 hours of total downtime per year. While impressive, this still allows for disruptive outages. Organizations relying on cloud services need to understand provider SLAs and factor downtime into RTO and RPO planning.
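Converting an uptime percentage into permitted downtime is simple arithmetic: 99.9% over a 365-day year leaves 8.76 hours. The sketch below shows the calculation. The function name is hypothetical, and the 30-day month is an assumption for SLAs that are measured monthly rather than yearly.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime an SLA permits over the measurement period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime must be a percentage between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


print(round(allowed_downtime_hours(99.9), 2))                # 8.76 hours per year
print(round(allowed_downtime_hours(99.9, 30 * 24) * 60, 1))  # 43.2 minutes per 30-day month
```

Because the permitted downtime can arrive as a single outage, a system with an RTO shorter than that figure needs its own recovery plan rather than relying on the SLA alone.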

Another major risk with cloud-based infrastructure is internet connectivity. Redundant internet connections from multiple providers are a best practice for minimizing downtime from cable cuts, ISP issues, etc. Organizations should invest in diverse connections and automatic failover to meet recovery objectives.

The ability for users to work remotely also plays into cloud recovery planning. Applications remaining available during an outage is only part of the equation – users still need internet access to utilize them. Remote work options can enhance recovery capabilities if executed properly, but should be tested.

Recovery objectives fundamentally shape cloud architecture decisions. Organizations must balance the costs of high availability with the business impact of downtime across all systems. 

Emerging Trends

The world of business continuity and disaster recovery is constantly evolving as new technologies emerge and lessons are learned from major events. Here are some of the key trends impacting RTOs and RPOs:

New Technologies

  • Cloud computing: With critical systems and data stored in the cloud, downtime can potentially be reduced and recovery accelerated. Cloud providers have built-in redundancy and typically offer SLAs around uptime. However, internet connectivity issues can still cause outages.
  • Containers & microservices: Breaking monolithic apps into containerized microservices can limit the blast radius of outages and make recovery faster. With proper orchestration, failed services can be restarted without impacting others.
  • Immutable infrastructure: Treating servers and other infrastructure as immutable, disposable resources allows faulty ones to simply be destroyed and recreated quickly from a known good state. This speeds recovery.
  • Automated failover: Automating the failover process reduces human error and shortens recovery time. Machine learning models can sometimes flag likely failures before they occur.

COVID Lessons Learned

The COVID-19 pandemic provided several lessons around RTO and RPO:

  1. Remote work is possible: Many companies successfully shifted to remote work during lockdowns. This provides more flexibility for workers to continue operating during a disruption at company facilities.
  2. Supply chain matters: Reliance on just-in-time global supply chains proved fragile when travel restrictions were enacted. Companies may need to increase inventory buffers going forward.
  3. Digital transformation: Companies that had already digitally transformed had a much easier time adapting. Moving critical systems to the cloud helps minimize disruptions.
  4. Practice response plans: Disaster recovery plans were tested like never before. Periodic testing and updates are critical to keep plans effective.
  5. Communication is key: Keeping employees, customers and stakeholders informed during a crisis helps maintain confidence and trust.

The pandemic underscored the importance of planning for low probability but high impact events. RTOs and RPOs should be reevaluated accordingly.

Key Takeaways

  • Conducting a Business Impact Analysis and setting RTOs and RPOs is critical for disaster preparedness and aligning business and IT. Focus first on identifying and categorizing all critical assets and systems.
  • Calculate potential costs of downtime to make the business case for desired RTOs/RPOs. Factor in loss of revenue, productivity, reputation, regulatory fines, etc.
  • Get alignment between business leaders and IT on acceptable recovery timeframes and data loss. Formalize this in an IT disaster recovery plan.
  • Implement appropriate recovery strategies based on asset criticality. Options include cold/warm/hot sites, cloud backups, high availability, etc.
  • Regularly test failover processes and audit recovery capabilities to ensure they meet RTO/RPO targets. Adjust plans and investments as needed. 
  • With cloud adoption, focus shifts from local redundancy to internet redundancy. Assess single points of failure like ISP, firewalls, DNS, etc.
  • Enable remote work as part of a recovery strategy, but recognize potential productivity/cultural challenges.
  • Maintain open communication between IT and business leaders. Review recovery objectives periodically as a standard business risk management practice.

Copyright: © 2024. All Rights Reserved.