As you probably know, constraints in Redshift (uniqueness, primary key, foreign key, not null) are informational only: they are not enforced. This means that if you insert the same row twice into a table that has a primary key defined, the table will contain that row twice.
Now, let’s imagine the following scenario: your application follows the recommendations and loads data from S3 using the COPY command. But you use S3 just like a queue:
- on one side, files with records are uploaded to S3;
- on the other side, we run the COPY command and, after that, delete the files that were loaded.
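The steps above can be sketched roughly as follows. The `events` table, the bucket, and the IAM role ARN are hypothetical placeholders, and the COPY statement would be executed through whatever Redshift connection you already use (psycopg2, JDBC, etc.):

```python
# Sketch of the S3-as-queue load step. Table name, bucket, and IAM role
# below are hypothetical; in a real setup, `cursor` comes from a Redshift
# connection and `s3_client` from boto3.

def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Build the COPY statement that loads a single S3 file into Redshift."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS CSV"
    )

def load_and_delete(cursor, s3_client, bucket: str, key: str) -> None:
    """Run COPY for one file, then delete it from S3 (the fragile step)."""
    stmt = build_copy_statement(
        "events",
        f"s3://{bucket}/{key}",
        "arn:aws:iam::123456789012:role/redshift-copy",  # placeholder ARN
    )
    cursor.execute(stmt)
    # If this delete fails, the file stays in the bucket and will be
    # COPY-ed again on the next run -- producing duplicate rows.
    s3_client.delete_object(Bucket=bucket, Key=key)
```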
But what happens if the delete operation from S3 fails? The file is loaded again on every run until it is properly deleted, and we end up with duplicate entries in our Redshift tables. In that case, we have to ask ourselves how to check whether a file was already loaded by a previous COPY run.
Let’s see what options are available here:
- Store somewhere (e.g. in a DynamoDB table) the list of files that were loaded and, before sending a file to be copied again, check that the table doesn’t already contain it. Apart from the extra complexity and associated costs, this solution doesn’t have any major drawback, but overall it’s not very easy to build if we want to handle all the corner cases.
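A minimal sketch of this manifest idea, using an in-memory set to stand in for the DynamoDB table (a real implementation would use a boto3 `put_item` with a condition expression so the check and the record happen atomically; all names here are illustrative):

```python
# In-memory stand-in for the DynamoDB "loaded files" table. A real
# implementation would call boto3 put_item with
# ConditionExpression="attribute_not_exists(filename)" so the check
# and the write are a single atomic operation.

class LoadedFilesManifest:
    def __init__(self):
        self._loaded = set()

    def mark_if_new(self, filename: str) -> bool:
        """Record the file; return True only if it had not been seen before."""
        if filename in self._loaded:
            return False
        self._loaded.add(filename)
        return True

manifest = LoadedFilesManifest()

def files_to_copy(candidate_keys):
    """Filter out files that were already loaded in a previous run."""
    return [key for key in candidate_keys if manifest.mark_if_new(key)]
```

The corner cases the article alludes to live exactly in this check: if the process crashes between COPY and the manifest write, you still re-load the file, which is why the conditional write has to happen before (or transactionally with) the load.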
- Use a system table provided by Redshift. There is a system table called STL_S3CLIENT that records information about the files copied from S3. Instead of implementing option 1 and spending additional money to persist this info, you can use what Redshift provides for free.
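A sketch of that lookup, assuming the `bucket` and `key` columns documented for STL_S3CLIENT and a DB-API driver such as psycopg2 (the `%s` placeholder style is driver-specific):

```python
# Look up Redshift's STL_S3CLIENT system table to see whether a given
# S3 object was already fetched by a previous COPY. The %s placeholders
# follow the psycopg2 convention; adjust for your driver.

STL_S3CLIENT_LOOKUP = (
    "SELECT query, recordtime "
    "FROM stl_s3client "
    "WHERE bucket = %s AND key = %s "
    "ORDER BY recordtime DESC"
)

def was_already_loaded(cursor, bucket: str, key: str) -> bool:
    """Return True if STL_S3CLIENT shows a prior fetch of this object."""
    cursor.execute(STL_S3CLIENT_LOOKUP, (bucket, key))
    return cursor.fetchone() is not None
```

One caveat worth noting: STL system tables only retain a few days of log history, so this check only protects you against recent re-loads, not arbitrarily old ones.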
- Use Kinesis Data Firehose to load data into Redshift. This seems to be the recommended approach and, from our investigations, it seems to solve the non-idempotency problem Redshift has. Even more, it integrates very nicely with AWS Lambda. But there are some possible drawbacks even with this approach:
- Kinesis Firehose is not part of the Free Tier program
- Kinesis Firehose follows the “at least once” delivery paradigm, so there is a small chance of ending up with duplicate records in your Redshift table
- If your Redshift cluster is down for more than 60 minutes, manual intervention is necessary to backfill data from S3 to your Redshift cluster.
As you can see, there is no perfect solution. But before looking for a solution, you must be sure you actually have a problem. And for that, you have to understand how Redshift was designed to work and whether its limitations are really issues for your use case. Please leave a comment below with your use case or your questions and we’ll be more than happy to help you!