File Formats with Spark

CSV: Comma (Delimiter) Separated Values

Pros: Human-readable; virtually every tool supports it.

Cons:

  • IO/storage inefficient (uncompressed text)
  • No rich types: every value is a string
  • Linear scanning for projections and predicates; if all the data sits in a single file, reads cannot be parallelized
  • Other issues: the delimiter appearing inside data, newlines within fields
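The last two cons can be seen with the standard library's csv module (a stand-in here; the values are invented for illustration): a field containing the delimiter or a newline only round-trips because writer and reader agree on quoting rules.

```python
import csv, io

# Hypothetical record whose fields contain the delimiter and a newline.
row = ["1", "hello, world", "line one\nline two"]

# Writing: the csv module quotes fields that contain delimiters or newlines.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Reading recovers the original fields despite the embedded comma/newline.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == row)  # True -- but only because both sides use the same quoting rules
```

A tool that naively splits on commas or newlines would break this record apart.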

JSON and XML: much more verbose than CSV in terms of storage.

Pros: Readable, some level of schema support

Cons:

  • Duplicated schema: field names repeated in every record
  • Terrible in terms of storage
  • Not splittable *, linear lookups
  • Aggregations require all the data to be loaded into memory
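The "duplicated schema" cost is easy to measure with a sketch (records invented for illustration): in newline-delimited JSON every record repeats the field names, while a CSV-style encoding states the header once.

```python
import json

# Three hypothetical records: every JSON record repeats the same field names.
records = [
    {"id": 1, "user": "alice", "score": 10},
    {"id": 2, "user": "bob",   "score": 20},
    {"id": 3, "user": "carol", "score": 30},
]
json_lines = "\n".join(json.dumps(r) for r in records)

# A CSV-style encoding states the "schema" (the header) exactly once.
csv_text = "id,user,score\n" + "\n".join(
    f'{r["id"]},{r["user"]},{r["score"]}' for r in records
)

print(len(json_lines) > len(csv_text))  # True: the duplicated keys cost real bytes
```

The gap grows linearly with the number of records, since the keys are paid for on every row.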

Binary Formats
Thrift (by Facebook)
Field names do not matter on the wire; fields are identified by numeric tags.
The Thrift compiler generates entity classes from the IDL; after any schema change, the compiler must be re-run to regenerate the bindings.

com.tweet
val tweet = TweetThrift(1, 123, "Saturday 8th, June", "arnuma")

Protocol Buffers (by Google)

  • Smaller encoding than Thrift
  • Only 5 wire types
  • The last 3 bits of each field key encode the wire type
  • The example record encodes to 36 bytes
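The "last 3 bits" point can be shown directly: in the Protocol Buffers wire format each field is prefixed with a key where `key = (field_number << 3) | wire_type`, so the low 3 bits carry the wire type (the helper name below is my own).

```python
# Decode a Protocol Buffers field key: the low 3 bits are the wire type,
# the remaining high bits are the field number (the tag from the .proto file).
def decode_key(key: int):
    return key >> 3, key & 0b111  # (field_number, wire_type)

# Key byte 0x08 = field 1, wire type 0 (varint);
# key byte 0x12 = field 2, wire type 2 (length-delimited, e.g. a string).
print(decode_key(0x08))  # (1, 0)
print(decode_key(0x12))  # (2, 2)
```

This is why renumbering fields breaks old data: the number, not the name, is what appears on the wire.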

Thrift and ProtoBuf Summary

  • Predominantly used for RPC: encoding and decoding are inexpensive
  • Fields identified not by name but by number (field tags)
  • Be careful not to reuse a previously existing tag number
  • Manual effort is needed to add a new field
  • Language-specific bindings must be generated (the biggest problem with both)
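The tag rules above can be made concrete with a hypothetical .proto sketch (the message and field names are invented for illustration):

```proto
// Hypothetical schema for the Tweet example above.
syntax = "proto3";

message Tweet {
  int64  id        = 1;  // the tag numbers, not the names, go on the wire
  int64  user_id   = 2;
  string created   = 3;
  string user_name = 4;
  // Adding a field = picking a NEW tag (5). Never reuse a retired tag,
  // or old data will be decoded as the wrong field.
  // string location = 5;
}
```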

Avro

  • Row oriented
  • Data and schema are embedded together, so the reader does not need the schema separately when reading the file
  • Sync markers delimit blocks of records, marking where each block semantically ends

No bindings required, unlike Thrift or Protobuf.
Supports aliases, making schema changes backward compatible: after moving to a new schema, data written with the old schema can still be read.
Renaming a column without an alias loses access to all past data under the old name; changing a column's type triggers a type conversion (where the types are compatible).
Being row-oriented, whole records must be loaded into memory to process the data.
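The alias mechanism is declared in the reader's schema: a renamed field lists its old name so data written under the old schema still resolves. A hypothetical sketch (record and field names invented for illustration):

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "user_id", "type": "long", "aliases": ["uid"]},
    {"name": "text", "type": "string"}
  ]
}
```

A reader using this schema can still read files whose writer schema called the second field `uid`; without the alias, that past data would no longer be matched to `user_id`.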

Parquet

  • Column oriented: data is encoded column by column
  • If reading only some columns, e.g. for an aggregation, Parquet is the right format
  • For select * (reading whole rows), Avro is the best format and Parquet the worst performing
  • "Tables have turned" approach: a hybrid layout built from row groups
  • Row groups, column chunks, and pages:
  • 1000 rows -> 128 MB -> 1 block -> 1 row group
  • Min and max statistics stored at the row-group level allow faster reads
  • Each column chunk is split into ~1 MB pages, each with its own statistics (min/max for int columns)
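The row-group min/max statistics enable predicate pushdown: a reader can prove an entire row group cannot match a filter and skip it without reading its data. A minimal sketch (the structures and function name are hypothetical, not Parquet's actual API):

```python
# Sketch of row-group skipping via min/max statistics.
# A reader evaluating `WHERE x > threshold` can skip any row group whose max <= threshold.
row_groups = [
    {"min": 1,   "max": 50,  "rows": list(range(1, 51))},
    {"min": 51,  "max": 150, "rows": list(range(51, 151))},
    {"min": 151, "max": 300, "rows": list(range(151, 301))},
]

def scan_greater_than(groups, threshold):
    out = []
    for g in groups:
        if g["max"] <= threshold:  # statistics prove no row in this group can match
            continue               # -> skip the whole row group, no IO for its data
        out.extend(v for v in g["rows"] if v > threshold)
    return out

print(len(scan_greater_than(row_groups, 100)))  # 200: group 1 is skipped outright
```

Page-level statistics work the same way, just at a finer (~1 MB) granularity within each column chunk.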

Parquet writes the schema at the end of the file, in the footer.
