Turn off the Internet: AWS is no longer responding!

A few days ago, an incident at the AWS cloud provider caused significant disruption for the many companies and services that depend on it.

I saw many reactions on social networks, often missing the point (unfortunately), and I thought it could be useful to share my analysis of the subject.

Rewind

Let’s start with a quick recap of the incident.

On Wednesday evening (French time), AWS encountered a growing number of errors in some of its services in the us-east-1 (North Virginia) region.

Although an incident confined to a single region is not supposed to have a huge impact on AWS, keep in mind that this region is the “core” region of AWS: most global services (IAM, CloudFront, Route53, etc.) depend heavily on it.

Keep in mind as well that roughly 30% of the Internet is estimated to depend directly on AWS today, which is simply colossal.

This is not the first AWS incident, and it won’t be the last. It made the news because of how long it lasted and the visibility it got. Many companies indeed absolved themselves of any responsibility by shifting the blame onto Amazon, but things are a little more complicated than that.

“Everything fails all the time.”

This sentence does not come from me, but from the CTO of Amazon, Werner Vogels.

It is based on an elementary rule: wherever you host your service, you have to accept that it will go down from time to time, like any infrastructure.

Having worked in production for more than 10 years, I’m familiar with the subject: making a service redundant only reduces the risk of an incident. There is no such thing as zero risk.

During my AWS training sessions, this point comes up often: the question to ask when designing your infrastructure is not “how do I prevent my infrastructure from going down?”, but rather “what do I do when my infrastructure goes down?”.

Those of you who, like me, work on AWS on a daily basis know that there are incidents every day; they have less impact, but they are incidents nonetheless.

The risks of hyperconvergence

Public clouds, and the GAFAMs in particular, have enabled a hyper-convergence of many services.

Indeed, given the building blocks on offer, it is very easy to decide to put everything at AWS/Azure/GCP.

A company can indeed find something to meet almost all of its needs off the shelf, very often with fully managed services and the support that goes with them.

This aspect has been an important accelerator in recent years, allowing teams to deploy, and therefore innovate, ever faster.

The flip side is that this has also made AWS an Internet colossus: many companies have become completely dependent on this single provider.

But stopping at that observation means seeing only the tip of the iceberg.

Design and risk assessment

In an IT department, the role of an architect is to design a resilient infrastructure that meets technical and functional needs.

When we talk about resilience, we often think of an application running 24/7 with a 99.99% availability rate.

In reality it’s more complicated than that.

Risk assessment vs. cost

Normally, when designing an architecture, you ask yourself what availability rate you are targeting and what unavailability would cost.

In addition, we need to see if there are mechanisms to mitigate unavailability.

For example, let’s imagine a home automation device: it sends its metrics to AWS every 10 minutes.

What happens if the service that is supposed to receive these metrics is unavailable?

2 choices here:

  • I lose my metrics: the availability of my service is therefore essential
  • I have a local buffer on the device that lets me defer sending during an outage: in this case, I can afford some unavailability, because I only lose the “real time” view (see the sketch below)
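
To make the second option concrete, here is a minimal sketch in Python of what the device-side buffer could look like. The endpoint URL, payload, and buffer size are illustrative assumptions for this post, not details of any real product.

```python
import json
import time
import urllib.request
from collections import deque

# Hypothetical values: the endpoint and payload are placeholders for the example.
INGEST_URL = "https://metrics.example.com/ingest"
SEND_INTERVAL_S = 600          # the "every 10 minutes" from the example above
buffer = deque(maxlen=10_000)  # bounded local buffer: oldest points drop first on a very long outage


def collect_metric():
    # Stand-in for a real sensor read.
    return {"ts": time.time(), "temperature_c": 21.5}


def try_send(batch):
    """Try to push a batch of metrics; return True on success, False on any network/server error."""
    req = urllib.request.Request(
        INGEST_URL,
        data=json.dumps(batch).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False  # service unreachable or 5xx: keep the data locally and retry later


while True:
    buffer.append(collect_metric())
    # After an outage, the whole backlog is sent at once: only the "real time" view was lost,
    # the data itself is replayed in catch-up mode.
    if try_send(list(buffer)):
        buffer.clear()
    time.sleep(SEND_INTERVAL_S)
```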

The choice between these two options is largely determined by how much money you are willing to put into the project.

Having local storage means additional hardware, additional assembly complexity, and additional development and maintenance costs.

It also means that my server must be able to handle data arriving in “catch-up” mode.

The other aspect is to ask what the cost of downtime is:

  • In terms of service provided
  • In terms of potential contractual penalties
  • In terms of image

We then weigh this unavailability against its probability. Very often, as a simplification, the potential availability of an infrastructure is taken to be that of its least available component.

Then, we can ask ourselves how much it will cost to cover this unavailability.

Then we make a simple comparison and, quite logically, the most cost-effective option is the one that gets chosen.
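
As a rough illustration of that comparison (all figures below are invented for the example, not taken from the incident), the arithmetic fits in a few lines. Note that chained dependencies actually multiply their availabilities, so the composite figure ends up at or below that of the weakest component:

```python
# Invented figures, for illustration only.
components = {"load balancer": 0.9999, "application": 0.999, "database": 0.995}

# Serial dependencies multiply: the result is at or below the weakest component (99.5% here).
availability = 1.0
for value in components.values():
    availability *= value

downtime_hours_per_year = (1 - availability) * 365 * 24
cost_per_hour_down = 1_000         # service not provided + contractual penalties + image
mitigation_cost_per_year = 80_000  # e.g. a multi-region setup

expected_loss = downtime_hours_per_year * cost_per_hour_down
print(f"availability: {availability:.4%}  ->  {downtime_hours_per_year:.1f} h of downtime/year")
print(f"expected loss: {expected_loss:,.0f} vs mitigation: {mitigation_cost_per_year:,.0f}")
# If the expected loss stays below the mitigation cost, accepting the downtime
# is the rational choice -- which is exactly the trade-off described above.
```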

This is why unavailability is not always a problem. It is the role of the architects to evaluate it upstream.

The chaos theory

To have confidence in your infrastructure, you have to test it. This is the basic idea of chaos engineering.

Accepting that your infrastructure will go down is one thing; making it go down voluntarily, within a framework you control, allows you to:

  • increase confidence in your high availability
  • test your automated procedures and/or remediation

Tools exist for this purpose. If you want to know more, I invite you to read the excellent posts by my colleague Akram on this subject on the WeScale blog [French links]:

Le guide de Chaos Engineering : Partie 1
Le guide de Chaos Engineering : Partie 2

Some companies go so far as to use these tools in production: Netflix, for example, runs its Simian Army in production to ensure the resilience of its infrastructure.

Breaking things in a controlled way also teaches you how to react when your infrastructure really does go down, because it has become a habit.
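
For illustration, here is a minimal sketch of what such a controlled experiment could look like with boto3. The opt-in tag, region, and health-check URL are assumptions for the example; real tooling (Chaos Monkey, AWS Fault Injection Simulator, the tools covered in Akram's posts) adds far more guardrails. Only run something like this against an environment you are allowed to break.

```python
import random
import urllib.request

import boto3  # assumes AWS credentials are already configured

REGION = "eu-west-3"                                               # assumption: pick your own region
OPT_IN_FILTER = {"Name": "tag:chaos-enabled", "Values": ["true"]}  # hypothetical opt-in tag
HEALTHCHECK_URL = "https://service.example.com/health"             # hypothetical endpoint to verify

ec2 = boto3.client("ec2", region_name=REGION)

# 1. Pick a random running instance among those explicitly opted in to the experiment.
reservations = ec2.describe_instances(
    Filters=[OPT_IN_FILTER, {"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if not instances:
    raise SystemExit("no instance opted in: nothing to break")

victim = random.choice(instances)
print(f"terminating {victim}")

# 2. Break it on purpose, within a scope you control.
ec2.terminate_instances(InstanceIds=[victim])

# 3. Verify the hypothesis: the service should keep answering without the lost instance.
with urllib.request.urlopen(HEALTHCHECK_URL, timeout=10) as resp:
    print(f"service still answers: HTTP {resp.status}")
```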

To conclude

The companies that put the blame on AWS have only themselves to blame.

Either they underestimated the impact of unavailability, or the unavailability costs less than expected.

Either way, depending on a single provider and a single region is a design choice. Blaming that provider is an easy way out and does not reflect reality.

It is possible to prepare for and simulate these outages ahead of time so as to react in the best possible way: repetition builds confidence.

The other issue is that many companies have bet everything on AWS. The incidents of the last few days may reshuffle the cards a bit, but is that really a bad thing?

Advocating for a sovereign cloud provider is not the answer either if you still put all your eggs in one basket. Once again, good infrastructure design is essential: better safe than sorry. You can learn more about this subject in Damy.R’s post [French link]:

AWS tombe, l’internet tremble · Damy.R