Our recently published article on how to cut your S3 costs received very good feedback, so we decided to dig deeper into the subject and explain how the encoding format you choose affects your S3 bill.
S3 doesn’t know and doesn’t care what kind of files you store in it. It’s up to you whether you upload text, binary, compressed, or encrypted files, or anything else. Usually, the most convenient option is to store text files directly, whether that means raw text, delimited text (such as CSV or TSV files), or hierarchical files (JSON). This approach also has the advantage of being human-readable. But if we store these files to be processed later, we have to look at which tools we have and which encodings those tools accept in order to produce the transformations or visualizations we need.
On the AWS stack, AWS Athena and Redshift are probably the most common ways of handling data stored in S3. Both of these services also support binary encodings such as Avro and Parquet. So why not use them?
Below you can see the file size for two text formats (CSV and JSON) and two binary formats (Avro and Parquet) when the file contains 50k, 500k, and 2m entries.
The conclusion is quite simple: for the text formats, the file size grows in step with the number of entries, while for the binary formats there is no such direct dependency. What’s more, the bigger the file, the more sense it makes to use this kind of encoding.
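To get a rough feel for why text encodings take more space, here is a minimal sketch that uses only the Python standard library. The schema, the helper names, and the generated values are all hypothetical, and the fixed-width `struct` packing is only a stand-in for a binary encoding: Avro and Parquet add schemas, their own layouts, and optional compression, so their real file sizes will differ from these numbers.

```python
import csv
import io
import json
import random
import struct

# Hypothetical schema, used only for illustration.
FIELDS = ("id", "timestamp_ms", "latitude", "longitude")


def make_rows(n, seed=0):
    """Generate n synthetic rows with a fixed seed so runs are repeatable."""
    rng = random.Random(seed)
    return [
        (i, 1_600_000_000_000 + i, rng.uniform(-90, 90), rng.uniform(-180, 180))
        for i in range(n)
    ]


def csv_size(rows):
    """Size in bytes of the rows written as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def json_lines_size(rows):
    """Size in bytes of the rows written as one JSON object per line."""
    text = "\n".join(json.dumps(dict(zip(FIELDS, row))) for row in rows)
    return len(text.encode("utf-8"))


def packed_size(rows):
    """Size in bytes of a naive fixed-width binary packing (not Avro or Parquet)."""
    record = struct.Struct("<qqdd")  # two 64-bit ints, two 64-bit floats = 32 bytes
    return sum(len(record.pack(*row)) for row in rows)


def compare_sizes(n):
    rows = make_rows(n)
    return {
        "csv": csv_size(rows),
        "json": json_lines_size(rows),
        "packed": packed_size(rows),
    }


if __name__ == "__main__":
    for n in (50_000, 500_000):
        print(n, compare_sizes(n))
```

JSON repeats every field name on every row and CSV spells every number out as digits, while the packed version spends a fixed 32 bytes per row. To measure the real formats, you would write the same rows with an Avro or Parquet library and compare the resulting files.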
Of course, the discussion doesn’t end here. You should certainly test your queries against every data format to see which one best suits your needs, and there are other strategies to consider, such as compression. But whether we are talking about the storage cost (S3) or the scanning cost (AWS Athena), storing less data usually works in your favor.
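As a back-of-the-envelope illustration of how file size feeds into both bills, the sketch below estimates a monthly storage cost and a per-query scanning cost. The function names are hypothetical and the default prices are placeholder assumptions, not quoted rates: check the current S3 and Athena pricing pages for your region before relying on any number.

```python
def monthly_storage_cost(size_gb, price_per_gb_month=0.023):
    """Estimated monthly S3 storage cost; the default price is an assumption."""
    return size_gb * price_per_gb_month


def scan_cost(scanned_gb, price_per_tb=5.0):
    """Estimated Athena cost for scanning the given volume; the default price is an assumption."""
    return scanned_gb / 1024 * price_per_tb


if __name__ == "__main__":
    # Hypothetical dataset sizes for the same data in two encodings.
    for label, size_gb in (("text", 100), ("binary", 30)):
        storage = monthly_storage_cost(size_gb)
        per_full_scan = scan_cost(size_gb)
        print(f"{label}: storage ${storage:.2f}/month, full scan ${per_full_scan:.2f}")
```

Both estimates are linear in the number of gigabytes, so any reduction in file size carries over proportionally to the storage bill and to every query that scans the whole dataset.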
If you have anything to add on this topic, please add your comments below!