0: Incident / Event Response Overview

What’s in here?

  • Troubleshooting
  • Restoring operations
  • Automating event management and alerting
  • Implementing automated healing
  • Event-driven automated actions

Which Whitepapers to Study?

  • CloudWatch: Detect issues, automate events
    • Monitoring thresholds trigger events
  • OpsWorks (stacks): Auto-heal failed instances
    • (It’s in stack layer settings)
  • Auto Scaling: Monitor metrics to scale capacity or replace unhealthy instances
  • CloudFormation: Store templates in multiple regions for redundancy
  • AWS Health Dashboard: Availability and operational status of AWS services
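
To make the CloudWatch bullet concrete ("monitoring thresholds trigger events"), here is a minimal Python sketch of an alarm that asks EC2 to recover an instance when its system status check fails. It is an illustration under assumptions, not a recommended configuration: the helper names, the alarm name format, and the period and evaluation values are hypothetical, and the second function assumes boto3 is installed and AWS credentials are configured.

```python
def build_recovery_alarm(instance_id: str, region: str) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    Hypothetical helper: the alarm name format and the timing values
    below are illustrative choices, not values taken from these notes.
    """
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # 1 when the system status check fails, 0 when it passes.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per evaluation window (assumed)
        "EvaluationPeriods": 2,  # consecutive breaching windows (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 alarm action that recovers the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recovery_alarm(instance_id: str, region: str) -> None:
    """Create the alarm in AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**build_recovery_alarm(instance_id, region))
```

Calling create_recovery_alarm("i-0123456789abcdef0", "us-east-1") would register the alarm. Keeping the parameter builder separate from the API call lets the threshold settings be reviewed or unit-tested without touching an AWS account.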