Today we want to share a real story we learned something from: recently, an engineer on our team changed how client profiles are loaded from the database. The details don’t matter, but a bug was introduced and a premium client was affected, being classified as a free-tier client. The result: almost 80% of their calls were rejected by our throttling layer. The entire incident lasted less than 20 minutes and, fortunately, we returned to a normal state before the client became aware of the problem.
Why do we think we were lucky?
We must admit that we discovered the problem by accident. Right after the deployment, the number of accepted calls dropped to almost half of the normal load. But among the customers who were not affected by the issue was one that fetches data from our system in batches, so higher latency for their calls is normal. Because the number of calls dropped while some of the remaining active calls had higher latency, the high-latency alarm triggered. Fewer calls, higher latency. Funny, huh?!
Lesson learned from this incident
After the rollback, we started to think about how we could have prevented this and how we could have been alerted faster and more efficiently. The answer is very simple: our traffic pattern is quite regular. We do have alarms for traffic spikes to trigger autoscaling, but we had none for traffic dropping too much. So one of the corrective measures was to add an alarm for each API that fires if traffic drops more than 33% below the expected value.
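The check behind such an alarm can be sketched in a few lines. This is a minimal illustration, not our actual monitoring code: the function name, the way the expected value is obtained, and the per-minute granularity are all assumptions; only the 33% threshold comes from the measure described above.

```python
# Hypothetical sketch of a low-traffic alarm check.
# `expected_calls` would come from historical traffic data for the
# same API and time window; the 33% threshold matches the
# corrective measure described above.
DROP_THRESHOLD = 0.33

def traffic_drop_alarm(observed_calls: int, expected_calls: int) -> bool:
    """Return True if observed traffic dropped more than
    DROP_THRESHOLD below the expected value."""
    if expected_calls <= 0:
        return False  # no baseline to compare against
    drop = (expected_calls - observed_calls) / expected_calls
    return drop > DROP_THRESHOLD

# Example: expecting 1000 calls/minute for this API
print(traffic_drop_alarm(550, 1000))  # 45% drop -> True, alarm fires
print(traffic_drop_alarm(900, 1000))  # 10% drop -> False, normal noise
```

In practice the expected value would be a rolling baseline (e.g. same hour on previous days) rather than a fixed number, so the alarm tolerates normal daily variation.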
Our advice is to do the same and, if you have better suggestions for detecting this kind of situation, share them in a comment below!