Failure Scenarios and Recovery Procedures
Added by Thomas Im, last edited by Thomas Im on Apr 30, 2009
Host Resources
Description |
Conditions / Alerts |
Recovery Strategy/Procedure |
Contact |
Priority |
Host unavailable |
Host does not respond to common availability queries, such as a ping or an SNMP host resources probe, for a set period of time. |
The machine will likely be inaccessible remotely. Determine whether the network is down (entirely, or only between the monitor and the host) and check remote logs. If no remote logs are available, detach the EBS volume and review the logs before starting a new instance. |
Tom |
Blocker |
Degraded instance |
Notice received that the host has degraded due to hardware failure |
If services are still running, check usability and data integrity. Bring down the services and detach the volumes, then start a new instance and attach the volumes to it. |
Tom |
Critical |
Disk unavailable |
Volumes are no longer mounted and/or have lost their association with the instance, resulting in application failure and an alert from the probe |
Reassociate/remount the disk. If data loss or corruption has occurred, create a new disk from the last good snapshot and mount it. |
Tom |
Blocker |
Disk full |
Application failure, slow performance, alert from host resources probe |
When the disk becomes nearly full and space cannot be cleared, stop services, take a snapshot, create a larger EBS volume from the snapshot, mount it, and resize the filesystem with a tool such as resize2fs. |
Tom |
Critical |
High CPU Load |
Extended periods of slow response, alert from host resources probe |
If this becomes a persistent problem, rebundle the AMI (if it is not up to date) and relaunch it as a "High CPU" instance, or consider moving CPU-intensive apps to a different instance. |
Tom |
Critical |
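
The "Disk full" procedure above depends on noticing that a disk is nearly full before the application fails. As a minimal sketch, assuming Python 3 is available on the host, such a check could look like the following. The function names and the 80%/90% thresholds are illustrative assumptions, not values taken from the existing host resources probe.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_disk(percent_used, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to a status string.

    The thresholds are assumed defaults for illustration only.
    """
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"/ is {percent:.1f}% full: {classify_disk(percent)}")
```

Under these assumptions, a "critical" result would be the point at which to try clearing space and, failing that, to begin the snapshot-and-resize procedure.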
Application Failure (general)
Description |
Conditions / Alerts |
Recovery Strategy/Procedure |
Contact |
Priority |
Service unavailable |
For webapps, a 5xx HTTP response indicating a server-side error |
If the cause is not a host resource failure, check the app logs, restore from backup, snapshot as necessary, and restart the application. |
Tom |
Blocker |
Database corruption |
Application failure |
Rebuild indexes if possible, or restore from the last good EBS snapshot of the database volume or from a database dump. For apps like Alfresco, an index rebuild may be needed afterwards to sync the on-disk indexes with the database. |
Tom |
Blocker |
Out of memory |
Application producing exceptions, slow performance |
For Java webapps, allocate more memory to the JVM. If no memory is available to allocate, one option is to rebundle and relaunch the AMI as a "Large" or "Extra Large" instance, which provide 7.5 GB and 15 GB of memory respectively. Another option is to move memory-intensive apps to a different instance. |
Tom |
Critical |
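
The "Service unavailable" condition above is defined by a 5xx HTTP response. A minimal sketch of a probe for that condition, using only the Python 3 standard library, might look like this. The function names, the status strings, and the 5-second timeout are assumptions for illustration and do not describe the monitoring currently in place.

```python
import urllib.error
import urllib.request


def is_server_error(status):
    """Return True for HTTP status codes in the 5xx (server-side error) range."""
    return 500 <= status <= 599


def probe(url, timeout=5.0):
    """Fetch `url` and report 'ok', 'server-error', or 'unreachable'.

    A 4xx response counts as 'ok' here because it is not a server-side error.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses; keep the code.
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout: no HTTP response at all.
        return "unreachable"
    return "server-error" if is_server_error(status) else "ok"
```

An "unreachable" result points toward the host-level checks in the Host Resources table, while "server-error" points toward the application logs.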