Failure Scenarios and Recovery Procedures
Added by Thomas Im, last edited by Thomas Im on Apr 30, 2009



Host Resources


| Description | Conditions / Alerts | Recovery Strategy/Procedure | Contact | Priority |
|---|---|---|---|---|
| Host unavailable | Host does not respond to common availability queries, such as a ping or an SNMP host resources probe, for a set period of time. | The machine will likely not be reachable remotely. Determine whether the network is down (entirely, or only between the monitor and the host) and check remote logs. If no remote logs are available, detach the EBS volume and review its logs before starting a new instance. | Tom | Blocker |
| Degraded instance | Notice received that the host has degraded due to hardware failure. | If services are still running, check usability and data integrity. Bring down services, detach the volumes, start a new instance, and attach the volumes to it. | Tom | Critical |
| Disk unavailable | Volumes are no longer mounted and/or have lost their association with the instance, resulting in application failure; alert from probe. | Reassociate and remount the disk. If data loss or corruption has occurred, create a new disk from the last good snapshot and mount that instead. | Tom | Blocker |
| Disk full | Application failure, slow performance, alert from host resources probe. | When the disk is nearly full and space cannot be cleared: stop services, take a snapshot, create a larger EBS volume from the snapshot, mount it, and resize the filesystem with a tool such as resize2fs. | Tom | Critical |
| High CPU Load | Extended periods of slow response; alert from host resources probe. | If the problem persists, rebundle the AMI (if it is not up to date) and relaunch it as a "High CPU" instance, or consider moving CPU-intensive applications to a different instance. | Tom | Critical |
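
The "Disk full" and "High CPU Load" conditions above are reported by a host resources probe. As an illustration only, here is a minimal Python sketch of what such a check could look like when run on the host itself. The thresholds and the function names (`disk_used_fraction`, `classify_host`) are hypothetical; this page does not specify any values, and the actual probe may work differently.

```python
import os
import shutil

# Hypothetical thresholds; the page does not define any.
DISK_USED_WARN = 0.90      # fraction of the volume in use
LOAD_PER_CPU_WARN = 1.5    # 1-minute load average divided by CPU count


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def classify_host(disk_fraction, load_1min, cpu_count):
    """Map raw measurements to the condition names used in the table."""
    alerts = []
    if disk_fraction >= DISK_USED_WARN:
        alerts.append("Disk full")
    if load_1min / max(cpu_count, 1) >= LOAD_PER_CPU_WARN:
        alerts.append("High CPU Load")
    return alerts


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min = os.getloadavg()[0]
    print(classify_host(disk_used_fraction("/"), load_1min, os.cpu_count() or 1))
```

Keeping the classification separate from the measurement makes the thresholds easy to adjust and to test without touching a real host. A single high load reading is not the "extended period" the table describes, so a real check would need to look at several samples over time.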

Application Failure (general)


| Description | Conditions / Alerts | Recovery Strategy/Procedure | Contact | Priority |
|---|---|---|---|---|
| Service unavailable | For web applications, a 5xx HTTP response indicating a server-side error. | If the cause is not a host resource failure, check the application logs, restore from a backup or snapshot as necessary, and restart the application. | Tom | Blocker |
| Database corruption | Application failure. | Rebuild indexes if possible; otherwise restore from the last good EBS snapshot of the database volume or from a database dump. For applications such as Alfresco, an index rebuild may be needed afterwards to sync the database with the on-disk indexes. | Tom | Blocker |
| Out of memory | Application producing exceptions, slow performance. | For Java web applications, allocate more memory to the JVM. If no memory is available to allocate, rebundle the AMI and relaunch it as a "Large" or "Extra Large" instance, which provide 7.5 GB and 15 GB of memory respectively. Another option is to move memory-intensive applications to a different instance. | Tom | Critical |
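
The "Service unavailable" condition above is defined as a 5xx HTTP response from a web application. A minimal Python sketch of that check follows, using only the standard library. The function names (`probe`, `service_state`) and the state labels are hypothetical and are not part of any monitoring tool named on this page.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still a response.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def service_state(status):
    """Classify a status code using the table's definition of unavailability."""
    if status is None:
        return "unreachable"          # possibly a host or network failure
    if 500 <= status <= 599:
        return "service unavailable"  # server-side error
    return "no server-side error"
```

For example, `service_state(probe(url))` for a web application's URL would return "service unavailable" on a 503. An "unreachable" result suggests checking the host resource scenarios first, since the table's application procedure only applies when the host itself is healthy.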


