Measure and Improve Your Application Resilience with AWS Resilience Hub
- November 11, 2021
I am excited to announce the immediate availability of AWS Resilience Hub, a new AWS service designed to help you define, track, and manage the resilience of your applications.
You are building and managing resilient applications to serve your customers. Building distributed systems is hard; maintaining them in an operational state is even harder. The question is not if a system will fail, but when it will, and you want to be prepared for that.
Resilience targets are typically measured by two metrics: Recovery Time Objective (RTO), the time it takes to recover from a failure, and Recovery Point Objective (RPO), the maximum window of time in which data might be lost after an incident. Depending on your business and application, these can be measured in seconds, minutes, hours, or days.
AWS Resilience Hub lets you define your RTO and RPO objectives for each of your applications. Then it assesses your application’s configuration to ensure it meets your requirements. It provides actionable recommendations and a resilience score to help you track your application’s resiliency progress over time. Resilience Hub gives a customizable single dashboard experience, accessible through the AWS Management Console, to run assessments, execute prebuilt tests, and configure alarms to identify issues and alert the operators.
AWS Resilience Hub discovers applications deployed by AWS CloudFormation (this includes SAM and CDK applications), including cross Regions and cross account stacks. Resilience Hub also discovers applications from Resource Groups and tags or chooses from applications already defined in AWS Service Catalog AppRegistry.
The term “application” here refers not just to your application software or code; it refers to the entire infrastructure stack to host the application: networking, virtual machines, databases, and so on.
Resilience assessment and recommendations
AWS Resilience Hub’s resilience assessment utilizes best practices from the AWS Well-Architected Framework to analyze the components of your application and uncover potential resilience weaknesses caused by incomplete infrastructure setup, misconfigurations, or opportunities for additional configuration improvements. Resilience Hub provides actionable recommendations to improve the application’s resilience.
For example, Resilience Hub validates that the application’s Amazon Relational Database Service (RDS), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (Amazon EFS) backup schedule is sufficient to meet the application’s RPO and RTO you defined in your resilience policy. When insufficient, it recommends improvements to meet your RPO and RTO objectives.
The resilience assessment generates code snippets that help you create recovery procedures as AWS Systems Manager documents for your applications, referred to as standard operating procedures (SOPs). In addition, Resilience Hub generates a list of recommended Amazon CloudWatch monitors and alarms to help you quickly identify any change to the application’s resilience posture once deployed.
Continuous resilience validation
After the application and SOPs have been updated to incorporate recommendations from the resilience assessment, you may use Resilience Hub to test and verify that your application meets its resilience targets before it is released into production. Resilience Hub is integrated with AWS Fault Injection Simulator (FIS), a fully managed service for running fault injection experiments on AWS. FIS provides fault injection simulations of real-world failures, such as network errors or having too many open connections to a database. Resilience Hub also provides APIs for development teams to integrate their resilience assessment and testing into their CI/CD pipelines for ongoing resilience validation. Integrating resilience validation into CI/CD pipelines helps ensure that every change to the application’s underlying infrastructure does not compromise its resilience.
AWS Resilience Hub provides a comprehensive view of your overall application portfolio resilience status
through its dashboard. To help you track the resilience of applications, Resilience Hub aggregates and
organizes resilience events (for example, unavailable database or failed resilience validation), alerts, and insights from services like Amazon CloudWatch and AWS Fault Injection Simulator (FIS). Resilience Hub also generates a resilience score, a scale that indicates the level of implementation for recommended resilience tests, alarms and recovery SOPs. This score can be used to measure resilience improvements over time.
The intuitive dashboard sends alerts for issues, recommends remediation steps, and provides a single place to manage application resilience. For example, when a CloudWatch alarm triggers, Resilience Hub alerts you and recommends recovery procedures to deploy.
AWS Resilience Hub in Action
I developed a non-resilient application made of a single EC2 instance and an RDS database. I’d like Resilience Hub to assess this application. The CDK script to deploy this application on your AWS Account is available on my GitHub repository. Just install CDK v2 (
npm install -g [email protected]) and deploy the stack (
cdk bootstrap && cdk deploy --all).
There are four steps when using Resilience Hub:
To start, I open my browser and navigate to the AWS Management Console. I select AWS Resilience Hub and select Add application.
My sample app is deployed with three CloudFormation stacks: a network, a database, and an EC2 instance. I select these three stacks and select Next on the bottom of the screen:
Resilience Hub detects the resources created by these stacks that might affect the resilience of my applications and I select the ones I want to include or exclude from the assessments and click Next. In this example, I select the NAT gateway, the database instance, and the EC2 instance.
I create a resilience policy and associate it with this application. I can choose from policy templates or create a policy from scratch. A policy includes a name and the RTO and RPO values for four types of incidents: the ones affecting my application itself, like a deployment error or a bug at code level; the ones affecting my application infrastructure, like a crash of the EC2 instance; the ones affecting an availability zone; and the ones affecting an entire region. The values are expressed in seconds, minutes, hours, or days.
Finally, I review my choices and select Publish.
Once this application and its policy are published, I start the assessment by selecting Assess resiliency.
Without surprise, Resilience Hub reports my resilience policy is breached.
I select the report to get the details. The dashboard shows how Region, availability zone, infrastructure and application-level incident expected RTO/RPO compare to my policy.
I have access to Resiliency recommendations and Operational recommendations.
In Resiliency recommendations, I see if components of my application are compliant with the resilience policy. I also discover recommendations to Optimize for availability zone RTO/RPO, Optimize for cost, or Optimize for minimal changes.
In Operational recommendations, on the first tab, I see a list of proposed Alarms to create in CloudWatch.
The second tab lists recommended Standard operating procedures. These are Systems Manager documents I can run on my infrastructure, such as Restore from Backup.
The third tab (Fault injection experiment templates) proposes experiments to run on my infrastructure to test its resilience. Experiments are run with FIS. Proposed experiments are Inject memory load or Inject process kill.
When I select Set up recommendations, Resilience Hub generates CloudFormation templates to create the alarms or to execute the SOP or experiment proposed.
The follow up screens are quite self-explanatory. Once generated, templates are available to execute in the Templates tab. I apply the template and observe how it impacts the resilience score of the application.
The CDK script you used to deploy the sample applications also creates a highly available infrastructure for the same application. It has a load balancer, an auto scaling group, and a database cluster with two nodes. As an exercise, run the same assessment report on this application stack and compare the results.
Pricing and Availability
AWS Resilience Hub is available today in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Frankfurt). We will add more regions in the future.
As usual, you pay only for what you use. There are no upfront costs or minimum fees. You are charged based on the number of applications you described in Resilience Hub. You can try Resilience Hub free for 6 months, up to 3 applications. After that, Resilience Hub‘s price is $15.00 per application per month. Metering begins once you run the first resilience assessment in Resilience Hub. Remember that Resilience Hub might provision services for you, such as CloudWatch alarms, so additional charges might apply. Visit the pricing page to get the details.
Let us know your feedback and build your first resilience dashboard today.