The official description of AWS Redshift starts with: “a fast, fully managed, petabyte-scale data warehouse”. Our experience with Redshift confirms this description, with one very important caveat: in order to really see the advantages and the incredible power of this service, we had to put into action some solutions that at first seemed unimportant.
We are starting a series of blog posts describing how we interact with Redshift to improve both performance and cost efficiency. This first post lists 4 ideas for loading data efficiently into AWS Redshift using the COPY command, which is the recommended mechanism for importing massive amounts of data.
1. Correlate the number of files to load with the number of slices in the cluster
From the service documentation: “A compute node is partitioned into slices; one slice per disk for each core …”. Thus, to maximize resource usage, list in the manifest file a number of files that is a multiple of the number of slices in your cluster. To determine how many slices your cluster has, find the number of slices for its node type and multiply by the number of nodes.
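As a quick sketch, the target file count can be computed like this (the per-node slice count in the table is an illustrative assumption; check the AWS documentation for your node type):

```python
# Slices per node for a given node type. The value below is an
# illustrative assumption; verify it in the AWS documentation.
SLICES_PER_NODE = {"dc1.large": 2}

def cluster_slices(node_type: str, node_count: int) -> int:
    """Total slices = slices per node * number of nodes."""
    return SLICES_PER_NODE[node_type] * node_count

def target_file_count(total_slices: int, min_files: int = 1) -> int:
    """Smallest multiple of total_slices that is >= min_files."""
    multiplier = max(1, -(-min_files // total_slices))  # ceiling division
    return total_slices * multiplier
```

For example, a 4-node dc1.large cluster has 8 slices, so splitting a load into 8, 16, 24, … files keeps every slice busy.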
We ran a simple experiment that confirms this idea. On a single-node cluster (dc1.large, 2 slices), we loaded 10M entries grouped initially into a single file and then split into 2 and 4 files. The results presented below are the average of 3 runs for each step.
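For reference, a COPY manifest listing the files to load looks like this (bucket and file names are placeholders):

```json
{
  "entries": [
    {"url": "s3://my-bucket/load/part-0000", "mandatory": true},
    {"url": "s3://my-bucket/load/part-0001", "mandatory": true}
  ]
}
```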
2. … but don’t create files with only a few kilobytes of data
We continued the experiment above and tried to load the same 10M entries, this time organized into 1000 files of 10K entries (around 200KB) each. This load took around 2 minutes, and the reason is simple: each file needs a time interval (usually tens to a few hundred milliseconds) to establish the connection to the remote S3 server. When there are thousands of files, this connection overhead becomes visible.
On the other hand, we loaded 100M entries representing 6GB of plain text organized in different ways: 2 files x 3GB, 4 files x 1.5GB, and 8 files x 750MB. The overall load time was similar in all cases (3 minutes – 3 min 10 sec).
The conclusion is quite clear: merge small files into larger files of at least a few MB each. Files larger than several tens of megabytes are not ideal either, if we consider upload time into S3: the longer a connection stays open, the higher the chance that a problem occurs.
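A minimal sketch of merging many small files into a handful of larger ones before upload (the file naming scheme and the 2MB threshold are illustrative choices, not a rule):

```python
def merge_small_files(paths, out_prefix, max_bytes=2 * 1024 * 1024):
    """Concatenate small input files into chunks of roughly max_bytes each.

    Returns the list of merged file paths. Inputs are assumed to be
    newline-terminated text (e.g. CSV rows), so plain concatenation is safe.
    """
    merged, current, size = [], None, 0
    for path in paths:
        if current is None or size >= max_bytes:
            if current:
                current.close()
            out_path = "%s-%04d.csv" % (out_prefix, len(merged))
            merged.append(out_path)
            current = open(out_path, "wb")
            size = 0
        with open(path, "rb") as src:
            data = src.read()
            current.write(data)
            size += len(data)
    if current:
        current.close()
    return merged
```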
3. Use compression
Redshift supports loading compressed files from S3. This saves time when uploading them and money on data transfer and storage. Our experiments revealed that the time to load compressed and uncompressed files is similar, while the compression ratio for text files is usually very good, reducing the initial file size to half or even less.
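A minimal sketch of compressing files before uploading them; the GZIP keyword in the COPY comment is real Redshift syntax, while the table name, bucket, and IAM role are placeholders:

```python
import gzip
import os

def gzip_file(src_path, dst_path):
    """Compress src_path into dst_path; return (original_size, compressed_size)."""
    with open(src_path, "rb") as src:
        data = src.read()
    with gzip.open(dst_path, "wb") as dst:
        dst.write(data)
    return len(data), os.path.getsize(dst_path)

# After uploading the .gz files to S3, add the GZIP option to COPY, e.g.:
#   COPY my_table FROM 's3://my-bucket/load/part-0000.gz'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/my-role' GZIP;
```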
4. Enhance the COPY command with proper parameters to avoid failing the whole load
If a single row is incorrectly formatted or a field exceeds its maximum length, the entire operation fails. In such cases it may be better to accept losing a small number of entries (maxerror 100 tells the load to continue if up to 100 entries are invalid) and to truncate columns that are too long with truncatecolumns. If you add these parameters, don’t forget to query the stl_load_errors system table to inspect the entries that weren’t loaded and the reason for each error.
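A small sketch assembling such a COPY statement (the table name, S3 path, and IAM role are placeholders; MAXERROR and TRUNCATECOLUMNS are real COPY options):

```python
def build_copy_sql(table, s3_path, iam_role, max_errors=100):
    """Assemble a Redshift COPY statement that tolerates up to
    max_errors invalid rows and truncates over-long column values."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"MAXERROR {max_errors} TRUNCATECOLUMNS"
    )

# After the load, the skipped rows can be examined with, e.g.:
#   SELECT filename, line_number, err_reason
#   FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;
```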
We hope you find these ideas useful! If you have any questions or comments, let us know in the comment section of the blog. Also, please share this blog post with anyone interested in this topic!