Do you use AWS SQS? Then build this tool for critical situations

As one of the first AWS services, launched more than 10 years ago, Simple Queue Service is one of the pillars of the AWS platform. At the same time, because it is so old, it’s hard for us to write about it: most topics have already been debated. In fact, a long time ago we came up with 2 suggestions to reduce your SQS bill, and they were very well received.

But today we want to tell a real-life story that happened to one of our collaborators.

In his use case, SQS is used in a pipeline architecture. Basically, requests from clients are processed sequentially by multiple services, each of which augments the request, and in the end the final result is composed and sent back to the client.

The incident

To support a new feature, one service changed its output message format. The next service in the pipeline was updated to ingest the new format and tested in staging, but one aspect was overlooked: a few “special” clients had a slightly different schema, which wasn’t covered by the code change and, of course, wasn’t tested in staging.

The result was quite predictable: tons of exceptions in the logs, alarms triggered, angry clients, and a very difficult rollback, because a simple rollback would have led to data loss.

The solution

As expected, the first action was to revert the service that produced messages in the new format. But that fixed the problem only for new messages. The real problem was the output queue, which now contained messages in both formats. The service that read those messages worked fine until it hit a message with the new schema: that threw an exception that nobody caught, and the process was terminated. Basically, a clean queue was needed, but how do you get one without losing data? Unfortunately, SQS supports only a queue purge, not a queue dump to S3 or anywhere else.

But because miracles happen when you need them, someone remembered he had a trivial Java main class that reads messages from one queue, writes them to another queue, and acknowledges them. So the reverted service was stopped until the queue was drained, and after that everything returned to normal.
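For illustration, here is a minimal sketch of such a queue-to-queue copy tool. The original was a Java class; this version is in Python only for brevity and is not the collaborator's code. It assumes `sqs` is a boto3 SQS client (for example `boto3.client("sqs")`), and the function name `move_messages` and its parameters are hypothetical. A message is deleted from the source queue only after SQS confirms it was sent to the target queue. Message attributes are not copied here, and a single empty receive is treated as "queue drained", which is an approximation.

```python
def move_messages(sqs, source_url, target_url, wait_seconds=1):
    """Move messages from source_url to target_url; return how many were moved.

    `sqs` is assumed to be a boto3 SQS client, e.g. boto3.client("sqs").
    """
    moved = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=source_url,
            MaxNumberOfMessages=10,        # SQS returns at most 10 per call
            WaitTimeSeconds=wait_seconds,  # long polling
        )
        messages = resp.get("Messages", [])
        if not messages:
            # Assumption: one empty long-poll response means the queue is drained.
            return moved

        # Copy the bodies to the target queue in one batch call.
        entries = [
            {"Id": str(i), "MessageBody": m["Body"]}
            for i, m in enumerate(messages)
        ]
        sent = sqs.send_message_batch(QueueUrl=target_url, Entries=entries)
        ok_ids = {e["Id"] for e in sent.get("Successful", [])}

        # Acknowledge (delete) only the messages that were copied successfully.
        # The others become visible again after the visibility timeout.
        to_delete = [
            {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
            for i, m in enumerate(messages)
            if str(i) in ok_ids
        ]
        if to_delete:
            sqs.delete_message_batch(QueueUrl=source_url, Entries=to_delete)
        moved += len(to_delete)
```

With a real client, a call would look like `move_messages(boto3.client("sqs"), source_queue_url, quarantine_queue_url)`, where both URLs are placeholders for your own queues.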

In peacetime, the messages copied to the other queue were processed: those in the old format were re-inserted into the original queue, and those in the new format were reverted to the initial client request and re-sent on behalf of the client. In the end, there was no data loss, only delayed requests.
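The sorting step described above could be sketched as follows. The story does not say how the two formats were told apart, so the rule used here is purely an assumption for illustration: messages are JSON objects, and the new format carries a hypothetical `"schema_version": 2` field while the old format has no such field (or version 1). Anything that cannot be classified goes into a third bucket for manual inspection instead of being guessed at.

```python
import json


def classify(body):
    """Return "old", "new" or "unknown" for one message body.

    Hypothetical rule: the "schema_version" field distinguishes the formats.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return "unknown"          # not valid JSON
    if not isinstance(doc, dict):
        return "unknown"          # valid JSON, but not an object
    version = doc.get("schema_version", 1)
    if version == 1:
        return "old"
    if version == 2:
        return "new"
    return "unknown"


def triage(bodies):
    """Split message bodies into buckets keyed by detected format."""
    buckets = {"old": [], "new": [], "unknown": []}
    for body in bodies:
        buckets[classify(body)].append(body)
    return buckets
```

The "old" bucket can then be re-inserted into the original queue as is, while the "new" bucket needs the service-specific conversion back to the initial client request.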

Lesson learned

Besides the clichés about testing and preparing for critical situations, we learned that it’s a good idea to build operational tools. In this specific case, if you use SQS, build a tool that dumps the queue content somewhere. It’s your decision whether that is S3, another SQS queue, or any other durable storage that is easy to access.
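As a starting point, such a dump tool might look like the sketch below. It assumes `sqs` is a boto3 SQS client and `out` is any writable text file object; the function name `dump_queue` and the JSON Lines layout are our own choices, not something prescribed by SQS. Each message is written as one JSON line and deleted from the queue only after the batch has been flushed to `out`. Failed deletes are not checked here, so a message may reappear and be dumped twice on a later run; for real durability you would also sync the file to disk or upload it to S3.

```python
import json


def dump_queue(sqs, queue_url, out, wait_seconds=1):
    """Write every message of the queue to `out` as JSON Lines; return the count.

    `sqs` is assumed to be a boto3 SQS client, `out` a writable text file object.
    """
    dumped = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,        # SQS returns at most 10 per call
            WaitTimeSeconds=wait_seconds,  # long polling
        )
        messages = resp.get("Messages", [])
        if not messages:
            # Assumption: one empty long-poll response means the queue is drained.
            return dumped

        # Persist first...
        for m in messages:
            record = {"MessageId": m["MessageId"], "Body": m["Body"]}
            out.write(json.dumps(record) + "\n")
        out.flush()

        # ...and only then remove the messages from the queue.
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                for i, m in enumerate(messages)
            ],
        )
        dumped += len(messages)
```

Writing to a local file keeps the sketch free of extra dependencies; the same loop works if `out` is replaced by a buffer that is uploaded to S3 afterwards.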

Maybe this sounds too trivial to take into consideration, but we believe it’s better to learn from the mistakes made by others, because we would have no excuse if we repeated them.

Have a similar story that ends with advice others could use? Send us a message and your story will become our next topic!

Also, send us a message if you want to meet us at re:Invent!