LISA’11: Tech Sessions – GameDay

The next talk is from Jesse Robbins, co-founder of Opscode and ex-Amazon “Master of Disaster”!

“You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it happens.”

Operations is work that matters. How do we make our systems more reliable? Jesse explained a process called GameDay, based on his experience and training as a firefighter. The subject matter, Resilience Engineering, is not new, just new to us. It is based on the ability to adapt to failure.

“Resilience is a function of people and culture.”

Complexity results in lots of failure. Scaling out results in even more failure. FAILURE HAPPENS! Recommended book: “Normal Accidents” by Charles Perrow.

MTTR > MTBF: mean time to recovery matters more than mean time between failures.

It’s better to have systems that tolerate failure and can recover quickly. “It’s a normal accident”. Treating failure as a “normal” occurrence helps you find ways to deal with it. The model consists of three stages.

Preparation

  • identify and mitigate risk and impact of failure
  • reduce frequency of failure (increase MTBF)
  • reduce duration of recovery (decrease MTTR)
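
The two levers in the Preparation list can be compared with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is not from the talk; the function name and the figures (1000 hours between failures, 4 hours to recover) are made-up assumptions for illustration only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical baseline: a failure every 1000 hours, 4 hours to recover.
baseline = availability(1000, 4)

# Lever 1: fail half as often (double MTBF).
doubled_mtbf = availability(2000, 4)

# Lever 2: recover twice as fast (halve MTTR).
halved_mttr = availability(1000, 2)

print(f"baseline:     {baseline:.5f}")
print(f"doubled MTBF: {doubled_mtbf:.5f}")
print(f"halved MTTR:  {halved_mttr:.5f}")
```

Under this formula, halving MTTR buys exactly the same availability as doubling MTBF. If failure in complex systems cannot be engineered away, faster recovery is the lever that stays available.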

Participation

  • builds confidence and competence in responding to a failure under stress
  • strengthens the individual and cultural ability to anticipate, mitigate and respond to failure

Exercises

  • trigger and expose latent defects
  • choose when to discover them, instead of letting that be determined by the next real disaster
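
As a toy illustration of an exercise exposing a latent defect, the sketch below simulates powering off one server in a pool and checking whether the service still answers. Everything here is a hypothetical stand-in: the `Replica` class, `serve`, `run_drill` and the host names are made up. It is a simulation only, not a tool for touching real machines.

```python
import random

class Replica:
    """Stand-in for one server; the drill toggles 'alive'."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def serve(replicas, request):
    """Try each replica in turn; fail only if every one is down."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no replica available")

def run_drill(replicas, seed=0):
    """Power off one replica at random and check the service still answers."""
    victim = random.Random(seed).choice(replicas)
    victim.alive = False
    try:
        serve(replicas, "health-check")
        survived = True
    except RuntimeError:
        survived = False
    finally:
        victim.alive = True  # restore the replica once the exercise is over
    return victim.name, survived

# A redundant pool should survive losing any single member.
pool = [Replica("web1"), Replica("web2"), Replica("web3")]
victim, survived = run_drill(pool)
print(f"powered off {victim}: service survived = {survived}")

# A single point of failure is the latent defect the drill exposes.
lonely_victim, lonely_survived = run_drill([Replica("db1")])
print(f"powered off {lonely_victim}: service survived = {lonely_survived}")
```

The second drill fails on purpose: finding the single point of failure on a day you chose is cheaper than finding it during a real outage.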

Start small – show someone something new while exercising a real risk (e.g. power off a server).

Increase awareness – communicate that you’re going to do it.

Build confidence – this is more powerful: the organisation learns that it grows through experiencing failure.

Full-scale live-fire exercises – power off a data centre! Pick the worst survivable scenario. People will be terrified, but it will do them a lot of good. Then the week of the exercise comes. People ask, “You’re not really going to do it?” Oh yes, it has to happen. Never slip a date. You will learn so much from it. The reason you do this is to make sure that the precautions become part of the culture.

“Observe, Orient, Decide, Act” – John Boyd’s OODA loop.