Tips to improve AWS Redshift workload management

We have often noticed that people who start using AWS Redshift tend to focus on designing the best table schema in order to improve execution times. But schema design is not the only place where adjustments can be made to reduce query latency.

Since setting up a new cluster is so simple – a few clicks, several input fields to fill in, and about five minutes of waiting – it is tempting to jump straight into running data imports and queries. But if you proceed that way, you are making a very common mistake and, more importantly, you are skipping some housekeeping that can significantly improve the cluster’s workload management.

The mistake mentioned above is using the master user for every interaction with the Redshift cluster. While this could be the subject of a separate discussion, the idea is simple: don’t use the master user for day-to-day operations, because it is not secure and it makes troubleshooting hard or impossible – you cannot tell who ran which query. Our advice is to use the master user only to create other users and to cancel unresponsive queries.

But back to the subject of this post: you can (and should) control how Redshift handles workload management. Redshift processes queries using queues: first come, first served. By default, a single queue is defined and it can run up to 5 queries in parallel. The good news is that you can define up to 8 queues that share a total of 50 slots, and the official documentation advises not to exceed 15 slots per queue.
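If you are curious how your cluster’s queues are currently set up, Redshift exposes the configuration through the STV_WLM_SERVICE_CLASS_CONFIG system table. Here is a minimal sketch; we assume the lowest service class IDs are reserved for system queues and filter them out, so check the documentation for your cluster version:

```sql
-- Inspect the current WLM queue configuration.
SELECT service_class,
       name,
       num_query_tasks,     -- concurrency level (slots) of the queue
       query_working_mem,   -- working memory per slot, in MB
       max_execution_time   -- query timeout in ms (0 = no timeout)
FROM   stv_wlm_service_class_config
WHERE  service_class > 4    -- skip the system-reserved service classes
ORDER  BY service_class;
```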

As with other Big Data frameworks, you have to understand the traffic pattern of your application and separate it into different queues according to your needs and priorities. Each queue then gets an appropriate number of slots to run as many queries in parallel as possible, and a suitable percentage of memory to avoid spilling intermediate results to disk.
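A practical way to check whether a queue has enough memory is to look for query steps that were executed from disk. Here is a minimal sketch using the SVL_QUERY_SUMMARY system view, where the is_diskbased flag marks steps that spilled:

```sql
-- Recent queries with at least one step that spilled to disk:
-- a hint that their queue has too little memory per slot.
SELECT query,
       COUNT(*)     AS disk_based_steps,
       MAX(workmem) AS max_workmem_bytes  -- memory assigned to a step
FROM   svl_query_summary
WHERE  is_diskbased = 't'
GROUP  BY query
ORDER  BY disk_based_steps DESC
LIMIT  20;
```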

Let’s look at what this process involves in more detail:

  • The very first thing to do is to create distinct groups for Extract (SELECT queries), Transform (INSERT, UPDATE, DELETE) and Load (COPY).
  • After that, you can create the users that will handle your data. A user can be anything: a business analyst who runs manual queries, an application that constantly loads data or triggers report generation, and so on. Don’t forget to specify which of the groups defined above each user belongs to. If you skipped this at creation time, the ALTER GROUP command can be used to assign a user to a group later (both variants are shown in the first sketch after this list).
  • The last step involves working in the Redshift console or with the CLI. For each group defined in step 1, you have to create a queue and specify at least the concurrency level and the amount of memory. If your use case mixes long-running and short queries, it’s a good idea to keep them in distinct groups and queues and to specify the maximum time a query can run before it is timed out (see the configuration sketch after this list). We encourage you to spend a fair amount of time understanding this configuration, because it can bring important benefits.
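To make steps 1 and 2 concrete, here is a minimal sketch; the group names, user names and passwords below are ours, not something Redshift prescribes:

```sql
-- Step 1: one group per workload type.
CREATE GROUP extract_readers;    -- SELECT-heavy reporting queries
CREATE GROUP transform_writers;  -- INSERT / UPDATE / DELETE jobs
CREATE GROUP load_jobs;          -- COPY-based ingestion

-- Step 2: assign users to a group at creation time...
CREATE USER report_app PASSWORD 'Str0ngPassw0rd' IN GROUP extract_readers;
CREATE USER etl_app    PASSWORD 'Str0ngPassw0rd' IN GROUP load_jobs;

-- ...or later, if the IN GROUP clause was skipped at creation.
ALTER GROUP transform_writers ADD USER etl_app;
```

For step 3, besides the console you can set the wlm_json_configuration parameter of the cluster’s parameter group. Here is a sketch of what it could look like for the groups above – the concurrency, memory and timeout values are purely illustrative, and the final unnamed queue is the mandatory default queue:

```json
[
  {
    "user_group": ["extract_readers"],
    "query_concurrency": 10,
    "memory_percent_to_use": 30,
    "max_execution_time": 60000
  },
  {
    "user_group": ["transform_writers"],
    "query_concurrency": 5,
    "memory_percent_to_use": 30
  },
  {
    "user_group": ["load_jobs"],
    "query_concurrency": 3,
    "memory_percent_to_use": 30
  },
  {
    "query_concurrency": 2,
    "memory_percent_to_use": 10
  }
]
```

Note that the slot counts add up to 20 (well under the 50-slot total) and no queue exceeds 15 slots, in line with the limits mentioned earlier.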

This post most probably did not cover every detail of this important topic, but we hope it brought to your attention an aspect that, once properly put into practice, will bring real value to your interactions with AWS Redshift.