Let’s imagine that we have an application where new entries are constantly loaded into Redshift. We recently published a blog post presenting some best practices for data imports, and another one describing a way to organize tables when your use case involves frequent data imports. But enough digressing, let’s get back to the subject we want to talk about here.
The Redshift console provides, for each cluster, a tab with details about data imports…
…but this is only useful if you want to inspect the process manually. If data loading is automated, then it is good practice to add some kind of monitoring and to trigger an alarm if certain thresholds are breached. CloudWatch is a good fit for that monitoring, but the question is: how do we get this information? The answer is quite simple: from several system tables provided by Redshift: STL_LOAD_COMMITS, STL_FILE_SCAN, STL_LOAD_ERRORS and STL_S3CLIENT.
If you are not familiar with this concept, the short version is that Redshift provides many system tables and views that hold clues about how the cluster is behaving. These tables can be queried, and the selected rows can serve as input for CloudWatch. From there, things become straightforward. Of course there are other tables that can be queried, but the four listed above cover many useful metrics, like:
- number of records loaded
- number of records rejected
- number of bytes loaded
- time to copy from S3/to load records
- number of files loaded per COPY execution
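To make this concrete, here is a minimal sketch in Python of the query-then-publish pattern. The SQL uses real STL_LOAD_COMMITS columns (`query`, `filename`, `lines_scanned`, `curtime`), but the metric names, the `Redshift/Loads` namespace, and the one-hour window are our own illustrative choices, not anything prescribed by AWS:

```python
# Sketch: turn rows from STL_LOAD_COMMITS into a CloudWatch payload.
# The query below would be run against the cluster with any Postgres
# driver (e.g. psycopg2); we keep the transformation pure so it is
# easy to test without a live connection.

LOAD_COMMITS_QUERY = """
SELECT query, trim(filename) AS filename, lines_scanned, curtime
FROM stl_load_commits
WHERE curtime > dateadd(hour, -1, getdate());
"""

def rows_to_metric_data(rows):
    """Build CloudWatch MetricData from (query_id, filename,
    lines_scanned, curtime) rows: per COPY execution, emit the number
    of files loaded and the total number of records scanned."""
    per_query = {}
    for query_id, _filename, lines, _ts in rows:
        files, total = per_query.get(query_id, (0, 0))
        per_query[query_id] = (files + 1, total + lines)
    data = []
    for query_id, (files, total) in per_query.items():
        data.append({"MetricName": "FilesLoadedPerCopy",
                     "Value": files, "Unit": "Count"})
        data.append({"MetricName": "RecordsLoaded",
                     "Value": total, "Unit": "Count"})
    return data

def publish(rows):
    # Requires AWS credentials in the environment; boto3's
    # put_metric_data accepts at most 20 datapoints per call, so a
    # real implementation would chunk the list.
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(Namespace="Redshift/Loads",
                               MetricData=rows_to_metric_data(rows))
```

Keeping the row-to-metric transformation separate from the AWS calls also makes it trivial to plug in the other tables (STL_FILE_SCAN, STL_S3CLIENT) later with the same publishing code.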
Finally, a small observation about STL_LOAD_ERRORS: this table is useful for publishing metrics, but at the same time it can be investigated manually to see why some imports are failing or why a particular entry was rejected, so you can sanitize your files and reduce the number of entries that don’t match the schema and end up being refused by AWS Redshift. If you are a big data engineer or ML scientist, you are probably aware that this step – data cleansing – is one that never ends, and this table gives you all the details about why a record was dismissed.
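As a small illustration of that manual investigation, the sketch below pulls recent rows from STL_LOAD_ERRORS and ranks the most frequent rejection reasons, so you know which schema mismatch to clean up first. The column names (`filename`, `line_number`, `colname`, `err_code`, `err_reason`, `starttime`) come from the Redshift documentation; the summarizing helper is our own illustration:

```python
# Sketch: surface the most common COPY rejection reasons from
# STL_LOAD_ERRORS. The query would be run against the cluster;
# the aggregation is pure Python and testable offline.
from collections import Counter

LOAD_ERRORS_QUERY = """
SELECT trim(filename) AS filename, line_number,
       trim(colname) AS colname, err_code, trim(err_reason) AS err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 100;
"""

def top_rejection_reasons(rows, n=5):
    """Count (err_code, err_reason) pairs across the fetched rows and
    return the n most common, most frequent first."""
    counts = Counter((code, reason)
                     for _fname, _line, _col, code, reason in rows)
    return counts.most_common(n)
```

Running this on a schedule and publishing the total error count as a CloudWatch metric gives you the alarm; keeping the per-reason breakdown gives you the diagnosis.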
In a future post, we’ll come up with a library that does this kind of thing, but until then we invite you to think about how this could be implemented.
Looking forward to hearing your comments!