AWS Reliability Pillar
Last Updated: 20-Dec-2020
Design Principles
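- Automatically recover from failure - monitor KPIs and trigger automated recovery when a threshold is breached
- Test recovery procedures - simulate failures to validate that recovery actually works
- Scale horizontally to increase aggregate workload availability - spread load across multiple smaller resources and use auto scaling, so a single failure has less impact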
- Stop guessing capacity - monitor demand and utilization to trigger scaling in or out
- Manage change in automation - automate all changes to infrastructure for reliable recovery
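The "stop guessing capacity" principle can be illustrated with a small calculation. The sketch below is a minimal, hypothetical example of proportional scaling toward a target utilization; the function name `desired_capacity` and its parameters are assumptions made for illustration, and it is not the algorithm any AWS service is documented to use.

```python
import math


def desired_capacity(current: int, utilization: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Return how many instances would bring utilization back to the target.

    Hypothetical helper: scales the fleet in proportion to how far the
    observed utilization is from the target, then clamps the result to
    the allowed fleet size.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    # Proportional estimate, rounded up so the fleet is never undersized.
    raw = math.ceil(current * utilization / target)
    # Never go below the minimum or above the maximum fleet size.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 90% utilization with a 60% target: scale out.
    print(desired_capacity(4, 90.0, 60.0, min_size=2, max_size=10))
    # 4 instances at 30% utilization with a 60% target: scale in.
    print(desired_capacity(4, 30.0, 60.0, min_size=2, max_size=10))
```

The minimum and maximum bounds matter as much as the formula: the minimum keeps enough capacity for availability during quiet periods, and the maximum caps cost and keeps the fleet inside service quotas.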
Best Practices
- Foundations - consider service quotas and network capacity
- Workload architecture - design failure prevention and failure mitigation
- Change management - anticipate changes in demand and capacity; monitor KPIs and trigger a response when they change
- Failure management - failure detection and automatic repair, backup and recovery, disaster recovery (DR) planning and testing
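Failure detection and automatic repair can be sketched as a loop that counts consecutive failed health checks and replaces an instance once a threshold is reached. Everything below is a hypothetical illustration: `HealthMonitor`, `check`, `replace`, and the instance IDs are assumed names, not part of any AWS API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthMonitor:
    """Hypothetical monitor: replaces instances after repeated failed checks."""
    check: Callable[[str], bool]      # returns True if the instance is healthy
    replace: Callable[[str], str]     # returns the ID of the replacement
    unhealthy_threshold: int = 3      # consecutive failures before repair
    failures: Dict[str, int] = field(default_factory=dict)

    def sweep(self, instances: List[str]) -> List[str]:
        """Run one round of health checks and return the updated fleet."""
        fleet = []
        for inst in instances:
            if self.check(inst):
                # A healthy check resets the consecutive-failure counter.
                self.failures.pop(inst, None)
                fleet.append(inst)
                continue
            count = self.failures.get(inst, 0) + 1
            if count >= self.unhealthy_threshold:
                # Threshold reached: repair by replacing the instance.
                self.failures.pop(inst, None)
                fleet.append(self.replace(inst))
            else:
                self.failures[inst] = count
                fleet.append(inst)
        return fleet


if __name__ == "__main__":
    down = {"i-b"}  # simulated failed instance
    monitor = HealthMonitor(check=lambda i: i not in down,
                            replace=lambda i: i + "-new",
                            unhealthy_threshold=2)
    fleet = ["i-a", "i-b"]
    for _ in range(2):
        fleet = monitor.sweep(fleet)
    print(fleet)
```

Requiring several consecutive failures before replacing an instance avoids churn from a single transient error, at the cost of slightly slower detection.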
Services
- AWS Auto Scaling
- AWS Backup
- Amazon CloudWatch