The story we are about to tell is a recent one. It could have had very bad consequences, which is why we want to share it with you.
Recently one of our collaborators called us, complaining that from time to time “one host is going crazy” in one of his microservices. What does this mean? The host reaches 100% CPU and logs nothing, yet it still shows as healthy in the load balancer while returning 500 to all requests.
Curious to see what was happening, we started to investigate.
What we did
- Tried to update the Elastic Beanstalk version, suspecting an incompatible dependency. No effect.
- Restarted the process. No effect.
- Rebooted the host. No effect.
- Quite by chance, we discovered that the EBS volume where the service keeps all its running details had become full. We simply ran
sudo truncate -s 0 /var/log/web-1.log
Within a few seconds the process started to behave normally.
That microservice runs on an m4.large instance, which has no local disk, only an EBS volume whose default size is pretty small: 8 GB. A recent feature came with some extra logging, and that triggered the incident.
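If you suspect the same problem on one of your hosts, a quick look at which files take the most space can confirm it. The snippet below is only a minimal sketch: `largest_files` is a hypothetical helper name, and `/var/log` is assumed as the default directory because that is where the log file in this story lived.

```shell
# Hypothetical helper: list the largest files under a directory, biggest first.
# Usage: largest_files [directory] [how-many]
largest_files() {
    dir="${1:-/var/log}"
    count="${2:-5}"
    # du -k prints the allocated size in KiB followed by the path;
    # sort -rn puts the biggest files at the top.
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}
```

Running `df -h` first shows which volume is full; `sudo largest_files /var/log 10` then points at the files responsible.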
The solution was simple:
- increase EBS size
- review which log entries are no longer necessary
- make sure log files are properly rotated and published into S3.
To increase the EBS size, open the configuration section of your Elastic Beanstalk environment, go to Instances, and set the desired size in the Root Volume section.
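The same setting can also be changed from the command line. What follows is a sketch, not the procedure we used: it assumes the AWS CLI is installed and configured, `set_root_volume_size` is a hypothetical helper name, and the environment name and size in the usage line are placeholders. It relies on the `RootVolumeSize` option in the `aws:autoscaling:launchconfiguration` namespace.

```shell
# Hypothetical helper: set the root volume size of an Elastic Beanstalk environment.
# Usage: set_root_volume_size <environment-name> <size-in-GB>
set_root_volume_size() {
    env_name="$1"
    size_gb="$2"
    # Setting AWS_CLI=echo prints the command instead of running it (dry run).
    ${AWS_CLI:-aws} elasticbeanstalk update-environment \
        --environment-name "$env_name" \
        --option-settings \
        "Namespace=aws:autoscaling:launchconfiguration,OptionName=RootVolumeSize,Value=$size_gb"
}
```

For example, `set_root_volume_size my-env 32` would request a 32 GB root volume for an environment called `my-env`. Keep in mind that applying this kind of change typically replaces the running instances.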
We also tried to add an alarm for EBS volume usage, but this turned out to be quite complicated: we had to install an agent (collectd) on our hosts. We’ll discuss that in another episode.
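Until a proper alarm is in place, a much simpler stopgap is a small check that can be run from cron. This is not the collectd setup mentioned above, only a sketch of a different, cruder technique; the function names and the default threshold of 80% are assumptions made for illustration.

```shell
# Print the usage percentage (digits only) of the filesystem holding a path.
disk_usage_pct() {
    # df -P keeps each filesystem on one line; column 5 of line 2 is the usage.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is below the threshold; otherwise warn on stderr and return 1.
# Usage: check_disk_usage [path] [threshold-percent]
check_disk_usage() {
    path="${1:-/}"
    threshold="${2:-80}"
    pct=$(disk_usage_pct "$path")
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: $path is ${pct}% full (threshold ${threshold}%)" >&2
        return 1
    fi
    return 0
}
```

A cron job could call `check_disk_usage / 80` and forward the warning to whatever notification channel you already use.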
If you have similar stories that could help others, share them with us! For any questions or feedback, send us a message!