Amazon S3: The Cloud Service Brought Down by Human Error
On the night of Tuesday, February 28 to Wednesday, March 1, a number of web services were unavailable after some servers were shut down. On Thursday, March 2, AWS revealed the cause of the Amazon S3 failure: human error.
On the night of last Tuesday to Wednesday, an outage at an AWS data center in Virginia took down websites, applications, connected devices and services that depend on the platform. It meant a potential loss of revenue for users of the web giant's cloud service, who had to wait some three hours before their services came back.
Naturally, both customers and Amazon wanted to know what caused the failure. Amazon S3 (Simple Storage Service) paid the price for human error. While trying to resolve a problem with the billing system, the maintenance team spread the outage to other services. The cause? An unfortunate typo.
The typo that disrupted services
This line of code, which was “supposed to remove a small number of servers from one of the S3 subsystems used in the S3 billing process,” also affected two other subsystems, according to Amazon's report on the S3 incident.
The first manages “metadata and location information for all objects in the region,” and the second handles the allocation of new storage and needs the index subsystem to be working properly in order to do its job.
The requests handled by these subsystems keep the Amazon S3 APIs working, so when the subsystems fail, the services become unavailable. The fix was to restart all the affected subsystems, which impacted the US-EAST-1 region.
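To illustrate how a single mistyped argument could be caught before it reaches production, here is a minimal sketch of a pre-flight check on a server-removal request. It is not Amazon's tooling: the names `Subsystem`, `check_removal`, `UnsafeRemovalError` and the `min_servers` threshold are all hypothetical, and the sketch simply assumes that each subsystem has a known minimum number of servers below which it can no longer serve requests.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would leave a subsystem under-provisioned."""


@dataclass(frozen=True)
class Subsystem:
    # All fields are hypothetical, for illustration only.
    name: str
    servers: int      # servers currently in service
    min_servers: int  # smallest fleet that can still serve requests


def check_removal(subsystems, requested):
    """Validate a removal request before any server is taken out of service.

    `requested` maps a subsystem name to the number of servers to remove.
    Returns the remaining server count per affected subsystem, or raises
    UnsafeRemovalError without changing anything.
    """
    by_name = {s.name: s for s in subsystems}
    remaining = {}
    for name, count in requested.items():
        if name not in by_name:
            raise UnsafeRemovalError(f"unknown subsystem: {name}")
        if count < 0:
            raise UnsafeRemovalError(f"negative removal count for {name}")
        sub = by_name[name]
        left = sub.servers - count
        if left < sub.min_servers:
            raise UnsafeRemovalError(
                f"{name}: removing {count} leaves {left} servers, "
                f"below the minimum of {sub.min_servers}"
            )
        remaining[name] = left
    return remaining
```

With a check of this kind, a request that touches only a few billing servers would pass, while one that accidentally drains the index subsystem would be rejected before any server is removed.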
Amazon S3: enhanced maintenance
As a result, the repair took longer than expected and left businesses without service for almost three and a half hours. It must be said that the restart operation had not been carried out for almost a year. The Amazon S3 team is learning from its mistakes and now plans to partition the subsystems.
While AWS has apologized unreservedly for the service outage, the incident is above all a wake-up call for service providers such as Instagram, Slack and American Airlines, whose services were directly affected. Then there are the owners of connected devices, who ran into serious problems: unavailable smart TVs, connected locks left open, and so on. Being the market leader does not make you immune to problems, and some customers may well follow Instagram's path next.