CSV: Comma-Separated Values (more generally, delimiter-separated values)
Pros: Human-readable; supported by virtually every tool.
Cons:
- I/O- and storage-inefficient (uncompressed text)
- No richer types: every value is a string
- Linear scanning for projections and predicates; if everything sits in a single file, reads cannot be split up and we are doomed
- Other issues: delimiters inside values, newlines inside values
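The last two cons can be seen with Python's standard `csv` module; the rows here are made up for illustration. A field containing both a comma and a newline survives only because the writer quotes it, and every value comes back as a string:

```python
import csv
import io

# Hypothetical rows: "age" is conceptually an int, and the bio contains
# both a delimiter (comma) and a newline -- the pitfalls noted above.
rows = [["id", "age", "bio"],
        ["1", "42", 'likes "csv", sort of\nsecond line']]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # the writer quotes the tricky field

parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(parsed[1][1])        # '42' -- still a string, not an int
print(repr(parsed[1][2]))  # embedded comma and newline preserved via quoting
```

Without the quoting layer (e.g. naive `line.split(",")`), the embedded comma and newline would corrupt the row structure.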
JSON and XML: much more verbose than CSV in terms of storage
Pros: Readable; some level of schema support
Cons:
- Schema (field names) duplicated in every record
- Horrible in terms of storage
- Not splittable*; linear lookups
- Aggregations require all data to be loaded into memory
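The schema-duplication cost is easy to measure with the standard library; the records below are invented for the comparison. JSON repeats every field name in every record, while CSV writes the header once:

```python
import csv
import io
import json

# Illustrative records: three rows sharing the same three fields.
records = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(3)]

# JSON repeats "id", "name", "score" in every single record...
as_json = json.dumps(records)

# ...while CSV writes the header line exactly once.
buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
w.writeheader()
w.writerows(records)
as_csv = buf.getvalue()

print(len(as_json), len(as_csv))  # JSON is noticeably larger
```

The gap grows linearly with the number of records, since the duplicated field names are paid per record.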
Binary Formats
Thrift (by Facebook)
- Field names do not matter on the wire; fields are identified by numeric tags
- The Thrift compiler generates entity classes from the IDL; when the schema changes, rerun it to regenerate the whole binding
- Example (Scala-style, using a generated class in package com.tweet):
  val tweet = TweetThrift(1, 123, "Saturday 8th, June", "arnuma")
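A minimal sketch of the IDL that such a generated `TweetThrift` class could come from; the field names and types are assumptions, not taken from the lecture:

```thrift
// tweet.thrift -- hypothetical IDL; rerunning the Thrift compiler
// (e.g. `thrift --gen java tweet.thrift`) regenerates the bindings.
namespace java com.tweet

struct TweetThrift {
  1: i64    id,      // fields are identified by these numeric tags,
  2: i64    userId,  // not by their names
  3: string text,
  4: string author
}
```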
Protocol Buffers (by Google)
- Smaller than Thrift
- Only 5 wire types
- The last 3 bits of each field key encode the wire type
- The example tweet encodes to 36 bytes
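The "last 3 bits" point can be sketched in a few lines of Python; this is a toy re-implementation for illustration, not the real protobuf library. Each field key is a varint whose low 3 bits hold the wire type and whose remaining bits hold the field tag:

```python
# Toy sketch of protobuf's field-key encoding (not the real library).

def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, MSB = continuation flag."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def field_key(tag: int, wire_type: int) -> bytes:
    # wire type lives in the last 3 bits; the field tag sits above them
    return encode_varint((tag << 3) | wire_type)

print(field_key(1, 0).hex())     # field tag 1, varint wire type -> 08
print(encode_varint(300).hex())  # -> ac02
```

Because only the tag number travels on the wire, renaming a field is free, but reusing a tag number corrupts old data, which is exactly the caution in the summary below.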
Thrift and ProtoBuf Summary
- Predominantly used for RPC: encoding and decoding are not expensive
- Fields are identified not by name but by number (field tags)
- Be careful never to reuse a previously used tag number
- Manual effort to add a new field (edit the IDL, regenerate the code)
- Language-specific bindings must be generated (the biggest problem with them)
Avro
- Row-oriented
- Data and schema are embedded together, so we don't need the schema on hand when we read the file
- A sync marker delimits each data block, showing where a block semantically ends
- No bindings required, unlike Thrift or Protobuf
- Supports aliases, so it is backward compatible: when moving to a new schema, data written under the old schema can still be read
- Rename a column (or its type) without an alias and we are done for: all past data under the old column name is lost; change a column's type and a type conversion (schema resolution) takes place
- Being row-oriented, it needs to load whole records into memory to process the data
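A sketch of how an alias is declared in an Avro schema; the record and field names here are assumptions for illustration. A reader schema that renamed `user_name` to `author` stays compatible with old data by listing the old name as an alias:

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "author", "type": "string", "aliases": ["user_name"]},
    {"name": "text", "type": "string"}
  ]
}
```

Without the `"aliases"` entry, records written under the old schema would no longer resolve to this field.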
Parquet
- Column-oriented: encodes data column by column
- If reading only a few columns (e.g., for an aggregation), Parquet is the format of choice
- For SELECT * (reading all columns), Avro is the best format and Parquet the worst performing
- "Tables have turned" approach: within each row group the table is rotated so values are laid out column by column
- Row groups, column chunks, and pages:
- Rows are buffered (e.g., ~1000 rows) until roughly 128 MB, which maps to 1 block = 1 row group
- Min and max are stored at the row-group level so readers can skip groups for faster reads
- Each column chunk is split into ~1 MB pages, each carrying statistics (min/max for int columns)
- Parquet writes the schema (with the rest of the footer metadata) at the end of the file
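Why row-group min/max statistics pay off can be shown with a toy sketch (this is not real Parquet; the row groups and query value are invented). A reader skips every group whose [min, max] range cannot contain the filter value:

```python
# Toy model of row-group pruning via min/max statistics.

row_groups = [   # pretend each list is one row group of an int column
    [3, 7, 5],
    [20, 25, 22],
    [40, 41, 48],
]
# In Parquet these statistics live in the footer metadata; here we
# just compute them up front.
stats = [(min(g), max(g)) for g in row_groups]

def scan(groups, stats, value):
    hits, groups_read = [], 0
    for group, (lo, hi) in zip(groups, stats):
        if not (lo <= value <= hi):
            continue  # prune: skip the whole group without reading it
        groups_read += 1
        hits.extend(v for v in group if v == value)
    return hits, groups_read

hits, groups_read = scan(row_groups, stats, 22)
print(hits, groups_read)  # found 22 after reading only 1 of 3 groups
```

The same idea applies one level down at page granularity, since each ~1 MB page carries its own statistics.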