AWS Reliability Pillar

Updated: 20-Dec-2020

Design Principles

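Automatically recover from failure – monitor KPIs and trigger automated recovery when a threshold is breached
Test recovery procedures – simulate failures to validate that recovery works
Scale horizontally to increase aggregate workload availability – replace one large resource with several smaller ones, for example with Auto Scaling, so a single failure has less impact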
Stop guessing capacity – monitor demand and utilization to trigger scaling in or out
Manage change in automation – make all infrastructure changes through automation
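
As a rough illustration of "stop guessing capacity", the sketch below computes a desired instance count from observed utilization, in the spirit of a target-tracking policy. The function name, parameters and numbers are assumptions made for this example and are not an AWS API.

```python
import math


def desired_capacity(current: int, utilization: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Return an instance count intended to bring utilization near target.

    Hypothetical helper for illustration only: utilization and target are
    percentages of the same KPI (for example average CPU).
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    # Scale the fleet in proportion to how far utilization is from target,
    # rounding up so the fleet is not left under-provisioned.
    raw = math.ceil(current * utilization / target)
    # Never scale outside the configured bounds.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 90% utilization with a 60% target: scale out.
    print(desired_capacity(4, 90.0, 60.0, min_size=2, max_size=10))
    # 4 instances at 30% utilization with a 60% target: scale in.
    print(desired_capacity(4, 30.0, 60.0, min_size=2, max_size=10))
```

In practice a managed scaling policy would normally make this decision. The point is only that capacity follows a measured KPI rather than a guess.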

Best Practices

Foundations – consider service quotas and network capacity
Workload architecture – design failure prevention and failure mitigation
Change management – monitor the workload and respond automatically to changes in demand and capacity when KPIs cross their thresholds
Failure management – detect failures and repair them automatically, back up data and test recovery, plan and test disaster recovery (DR)
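
Failure detection and automatic repair can be sketched as a counter of consecutive failed health checks. This is a minimal illustration, not how any particular AWS service implements it. The class name and the threshold of three are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Hypothetical tracker: flags a resource for repair after N straight failures."""
    unhealthy_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check; return True when repair should be triggered."""
        if check_passed:
            # Any successful check resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.unhealthy_threshold


if __name__ == "__main__":
    tracker = HealthTracker(unhealthy_threshold=3)
    for passed in [False, False, True, False, False, False]:
        if tracker.record(passed):
            # A real system would replace or restart the resource here.
            print("trigger automatic repair")
```

Requiring several consecutive failures avoids replacing a resource because of one transient error, at the cost of slightly slower detection.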

Services

AWS Auto Scaling
AWS Backup
Amazon CloudWatch

Russell Jamieson
