TheCatAPI // Incident report 15-17.11.2018

TheCatAPI.com had some downtime overnight (GMT+11) on the 15th Nov, and then very high latency until the 17th.

It’s always good practice to write up an incident report if there’s been service interruption, even for a side project. Here’s the template I normally use.

Forensically walking through the root causes & resolution steps improves the service as a whole, as well as providing a reference for your future self. It should be an honest, unemotive & blame-free account.

It’s a natural human instinct to avoid embarrassment by not talking about it again and trying to quickly move on. Processing the experience is cathartic, covering it up is not.

— Incident Report – TheCatAPI.com – 15-17.11.2018

Summary:

  • At 21:45 (GMT+11) on 15th Nov 2018, there was a significant increase in requests to a new API route that was not properly cached. This overloaded the database, which meant all users saw 503 responses for any public request.
  • This was wrongly identified as a broken code deployment due to conflicting errors, and a rollback to a previous release had minimal effect.
  • Extra database resources were provisioned, along with larger servers. This stabilised the service but left response times in the seconds.
  • A patch put live at 12:30 on the 17th brought the response times back down to ~15ms, after which all services operated normally.

Timeline (GMT+11):

  • 2018-11-14 14:10: Release went live that would later produce confusing error reports due to new logging
  • 2018-11-15 21:20: Increase in traffic to the images/{image_id}/analysis route, which was not cached
  • 2018-11-15 21:30: Application overwhelms the database; most responses are 503s
  • 2018-11-16 08:30: I saw the errors and messages on the forum and began investigating
  • 2018-11-16 08:35: Error logs indicated that the codebase was producing errors, so a rollback was initiated. This restored the service, albeit with high response times.
  • 2018-11-16 09:45: Response times were still high, so database & server capacity was increased, bringing them down.
  • 2018-11-16 10:30: Response times climb again; however, no 503s are returned and the service responds successfully. Log investigation reveals no caching on the images/{image_id}/analysis route.
  • 2018-11-16 11:10: Code fix is created and tested locally; however, no time was available to deploy until the next day.
  • 2018-11-17 13:37: Fix deployed successfully, bringing the response time back down to ~15ms.

Root Cause:

  • An API route that queried the database on every request, without caching or rate-limiting (roughly as sketched below). This should have been caught in testing, or profiled in load-testing.
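
For illustration only, the problem pattern looked roughly like the hypothetical Express/Postgres handler below (the route exists, but the handler, client library and table name are assumptions, not the actual codebase): every incoming request runs a fresh database query, so a spike in traffic becomes a one-to-one spike in database load.

```typescript
import express from "express";
import { Pool } from "pg"; // assumed Postgres client – the real stack may differ

const app = express();
const db = new Pool();

// Anti-pattern: no caching and no rate limit, so every request on this
// route results in a database query. A traffic spike therefore hits the
// database directly, which is what produced the 503s during the incident.
app.get("/images/:image_id/analysis", async (req, res) => {
  const { rows } = await db.query(
    "SELECT * FROM image_analysis WHERE image_id = $1",
    [req.params.image_id]
  );
  res.json(rows);
});

app.listen(3000);
```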

Resolution:

  • Used the memoization pattern to cache the DB response within the API route, falling back to Redis if available, and ultimately to the read DB(s) (see the sketch after this list)
  • Revised error logging to prevent misdiagnosis in the future.
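
A minimal sketch of that caching chain, assuming a Node/TypeScript stack with ioredis and pg (the key and table names are invented for illustration): check an in-process memo first, then Redis, and only fall through to the read database on a miss.

```typescript
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis();
const readDb = new Pool();

// In-process memo: fastest layer, lives only as long as the process does.
const memo = new Map<string, { value: unknown; expires: number }>();
const TTL_MS = 60_000;

async function getAnalysis(imageId: string): Promise<unknown> {
  // 1. In-memory memoization
  const hit = memo.get(imageId);
  if (hit && hit.expires > Date.now()) return hit.value;

  // 2. Redis, if available
  try {
    const cached = await redis.get(`analysis:${imageId}`);
    if (cached) {
      const value = JSON.parse(cached);
      memo.set(imageId, { value, expires: Date.now() + TTL_MS });
      return value;
    }
  } catch {
    // Redis unavailable – fall through to the database.
  }

  // 3. Read database as the source of truth
  const { rows } = await readDb.query(
    "SELECT * FROM image_analysis WHERE image_id = $1",
    [imageId]
  );
  memo.set(imageId, { value: rows, expires: Date.now() + TTL_MS });
  await redis
    .set(`analysis:${imageId}`, JSON.stringify(rows), "EX", 60)
    .catch(() => {}); // ignore cache-write failures
  return rows;
}
```

With this in place, a traffic spike on the route is mostly absorbed by the memo and Redis layers instead of reaching the database.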

Future prevention:

  • Reduce the time between incident and investigation, by having my phone create an alarm on receiving error alerts, or by using a service like PagerDuty.
  • Research adding a load testing service like LoadImpact into the CI/CD testing process via CircleCI.
  • Integrate Siege into local automated tests.
  • Consider lowering the rate limit for new routes (a rough sketch follows this list)
  • Use the Strangler application pattern to split routes like signup & upload out into serverless microservices, to prevent them from being interrupted in the future.
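
On lowering rate limits for new routes: a hand-rolled sketch of what that could look like is below (a fixed-window limiter written for illustration; the numbers and route are assumptions, and a library or API-gateway limit would work just as well). The idea is simply that a new, unproven route ships behind a conservative limit until it has been load-tested.

```typescript
import express from "express";

// Minimal fixed-window rate limiter, applied only to a newly released route.
function rateLimit(maxPerMinute: number) {
  const counts = new Map<string, { count: number; windowStart: number }>();
  return (
    req: express.Request,
    res: express.Response,
    next: express.NextFunction
  ) => {
    const key = req.ip ?? "unknown";
    const now = Date.now();
    const entry = counts.get(key);
    if (!entry || now - entry.windowStart > 60_000) {
      counts.set(key, { count: 1, windowStart: now });
      return next();
    }
    if (entry.count >= maxPerMinute) {
      return res.status(429).json({ error: "Rate limit exceeded" });
    }
    entry.count += 1;
    next();
  };
}

const app = express();

// New route starts with a low per-IP limit; raise it once load testing
// confirms the caching layers give enough headroom.
app.get("/images/:image_id/analysis", rateLimit(30), (req, res) => {
  // ...cached lookup as in the resolution sketch above...
  res.json({ image_id: req.params.image_id });
});

app.listen(3000);
```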