In a previous post, we presented several ideas to improve data loading into Redshift. Today we are going to discuss an approach for organizing data into Redshift that scales to billions of entries without affecting reading performance.
Resuming in few words the entire flow and requirements:
- Entries are copied from S3, several times per hour
- Entries represents a continuous data stream, that arrives almost sorted by their creation timestamp
- Queries hits both entries from the last days, but also from the last hours and usually are bounded in a time interval.
- For some queries, execution latency is very important.
With these in minds, we performed several tests with different sort keys and finally the best approach was to have sort key column represented by a timestamp.
But because performance is very important, we have to trigger the VACUUM command after almost each data import, affecting in that way the performance during its running. The solution in this case was to have distinct tables for a specific period of time (hour, day, week) and to unify them under a view.
This approach brings several concrete advantages:
- From all tables, only one is affected by data insertion
- Smaller tables, VACUUM operations complete faster
- Very easy to delete data by simple dropping a table and alter view
- Fetching data from the view works identically like fetching from a table that unifies all the small tables and our tests revealed that response time is identical.
The command for creating view will look like:
CREATE OR REPLACE view_name AS SELECT * from table1 UNION ALL SELECT * from table2 … UNION ALL SELECT * from table3
Additionally, this approach could be used to develop a solution to act as a sliding window over a period of time. Each time a table is a candidate for removing, it could be backed up into S3 (or even Glacier) and in case it is needed again, it could be loaded into Redshift.
Regarding limits of this solution, they are quite hard to hit: maximum number of tables per database is 9900 whereas for views there is no limit specified in the documentation.
The single disadvantage is the cost of to manage the view, but since all views in Redshift are not materialized, the operations are done instantly, since no data processing is involved.