Lessons learned from processing 25M SQS messages daily

It’s been almost 2 months since we finished migrating a component from RabbitMQ to AWS SQS. The client decided to do this not because of problems with RabbitMQ, but because they are working to deprecate self-hosted solutions in favor of managed ones.

The first thing we noticed is that SQS has higher latency than RabbitMQ. Puts in SQS take around 70-90 ms, whereas RabbitMQ is around 2-3x faster. For us this is not a problem, because the component does async work and fast response times are not important in this case, but it’s good to know.

General aspects

  • Implement your application assuming that every single call can fail. Set the retry policy according to your needs, declare a dead letter queue for each queue, and monitor it (see the setup sketch after this list).
  • Add something to each message to ease investigations. This could be a message id, a human-readable date field, etc.
  • Keep in mind that SQS has an “at least once” delivery policy. That means you can (and will) receive a message more than once. Usually the duplicate rate is very small (a few duplicates per 1M messages), but it’s important to account for it (see the idempotency sketch after this list).
  • Visibility timeout can be your biggest enemy. The worst case is setting a value lower than the time needed to process a message: the message is never acknowledged, becomes visible again, and ends up being processed over and over by other workers. The result: all workers are busy processing the same messages. Conversely, if the value is too high and processing fails, it takes a long time for the message to become available again.
  • Don’t set the visibility timeout per message; it complicates troubleshooting. It’s better to define it per queue. And if a message cannot be processed within that interval, the associated dead letter queue should have a larger timeout.
  • DLQ retention is counted from when the message first reached SQS. If a message spent 2 days in the main queue and the associated DLQ expires messages after 3 days, it will actually be deleted from the DLQ after one day. This is an important aspect to keep in mind to avoid message loss.
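
Here is a minimal sketch of the queue setup described above, using Python and boto3; the queue names, timeout, and retention values are illustrative assumptions, not the ones we use:

```python
import json

import boto3

sqs = boto3.client("sqs")  # assumes AWS credentials and region are configured

# Create the dead letter queue first, with a long retention period. Remember
# that retention is counted from when the message first reached SQS.
dlq = sqs.create_queue(
    QueueName="orders-dlq",  # hypothetical name
    Attributes={"MessageRetentionPeriod": "1209600"},  # 14 days, the maximum
)
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue: visibility timeout defined per queue (not per message), plus a
# redrive policy that moves a message to the DLQ after a few failed receives.
sqs.create_queue(
    QueueName="orders",  # hypothetical name
    Attributes={
        "VisibilityTimeout": "120",  # must exceed worst-case processing time
        "MessageRetentionPeriod": "345600",  # 4 days
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```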
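And because delivery is at least once, processing should be idempotent. A minimal sketch, assuming a shared store such as Redis is reachable by all workers (the key schema and TTL are assumptions):

```python
import redis

r = redis.Redis()  # assumed shared store, visible to all workers

def process_once(message_id: str, body: str, handler) -> None:
    # SET with nx=True succeeds only if the key does not exist yet, so a
    # duplicate delivery returns None here and is skipped. The TTL only needs
    # to cover the window in which duplicates can realistically appear.
    if not r.set(f"seen:{message_id}", 1, nx=True, ex=86400):
        return  # duplicate delivery
    handler(body)
```

Note that marking the message as seen before processing it trades a lost retry on a crash for a guarantee of no duplicate processing; swap the order if the opposite trade-off suits you better.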

Sending messages to SQS

  • When possible, send messages in batches. SQS availability has a lot of 9’s in it and, from what we logged, no more than 10 requests per day fail.
  • If possible, pack several logical messages into a single SQS message. A single message supports a payload of up to 256 KB. Even though this seems like a trivial optimization, when you deal with millions of messages, cutting the number of requests several times can be a big improvement (see the sketch after this list).
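
Below is a sketch combining both ideas with boto3 (the queue URL and pack size are illustrative assumptions): several logical messages are packed into one SQS message along with an id and a timestamp, and SQS messages are sent in batches of up to 10, the API maximum:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

def send_packed(logical_messages, pack_size=20):
    # Pack several logical messages into one SQS message (keep it under 256 KB).
    packs = [logical_messages[i:i + pack_size]
             for i in range(0, len(logical_messages), pack_size)]
    # send_message_batch accepts at most 10 entries per call.
    for i in range(0, len(packs), 10):
        entries = [
            {
                "Id": str(n),  # batch-local id, used to match failed entries
                "MessageBody": json.dumps({
                    "message_id": str(uuid.uuid4()),  # eases investigations
                    "sent_at": datetime.now(timezone.utc).isoformat(),
                    "items": pack,
                }),
            }
            for n, pack in enumerate(packs[i:i + 10])
        ]
        response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        # Assume every call can fail: retry or log the entries reported back
        # in the "Failed" list instead of silently dropping them.
        for failure in response.get("Failed", []):
            print("failed to send entry", failure["Id"], failure.get("Message"))
```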

Processing messages from SQS

  • If possible, enable long polling. Instead of returning immediately when the queue looks empty, the call blocks until at least one message is available or the wait time expires. Long polling is a mechanism to avoid empty results and, consequently, wasted money (see the consumer sketch after this list).
  • Group messages by processing time. Don’t put messages that take 1 second and messages that take 1 hour in the same queue, because you can end up in a scenario where all workers are busy processing 1-hour messages and your whole application is stuck.
  • We noticed that with long polling enabled (10 messages or a 5-second wait), almost 80% of requests manage to read 10 messages from the queue. The rest complete in less than 5 seconds but return fewer than 10 messages.
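
A minimal long-polling consumer sketch with boto3 (the queue URL and handler are assumptions): WaitTimeSeconds enables long polling, MaxNumberOfMessages asks for up to 10 messages per call, and each message is deleted only after it has been processed:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

def consume(handler):
    while True:
        # Long polling: block up to 5 seconds or until messages are available.
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # the API maximum
            WaitTimeSeconds=5,
        )
        for message in response.get("Messages", []):
            # Processing must finish within the queue's visibility timeout.
            handler(message["Body"])
            # Acknowledge only after successful processing; on failure the
            # message becomes visible again and eventually reaches the DLQ.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"],
            )
```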

Maybe SQS is not the best-fitting queue system for every workload, but as you can see, with a few small tricks it can be a great choice, considering that it has very good availability, good prices, and it’s managed.

If you have other tips to share, a comment below is more than welcome.