File Formats with Spark

CSV: Comma (Delimiter) Separated Values

Pros: Human-readable; virtually every tool supports it.

Cons:

  • IO/storage inefficient (uncompressed text)
  • No rich types: every value is a string
  • Linear scanning for projections and predicates; if all the data sits in a single file, reads cannot be parallelized
  • Other issues: the delimiter appearing inside data, newlines within fields
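The last two cons can be seen with the standard library's csv module (a stand-in here; the values are invented for illustration): a field containing the delimiter or a newline only round-trips because writer and reader agree on quoting rules.

```python
import csv, io

# Hypothetical record whose fields contain the delimiter and a newline.
row = ["1", "hello, world", "line one\nline two"]

# Writing: the csv module quotes fields that contain delimiters or newlines.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Reading recovers the original fields despite the embedded comma/newline.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == row)  # True -- but only because both sides use the same quoting rules
```

A tool that naively splits on commas or newlines would break this record apart.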

JSON and XML: much more verbose than CSV in terms of storage.

Pros: Readable, some level of schema support

Cons:

  • Duplicated schema: field names repeated in every record
  • Terrible in terms of storage
  • Not splittable *, linear lookups
  • Aggregations require all the data to be loaded into memory
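The "duplicated schema" cost is easy to measure with a sketch (records invented for illustration): in newline-delimited JSON every record repeats the field names, while a CSV-style encoding states the header once.

```python
import json

# Three hypothetical records: every JSON record repeats the same field names.
records = [
    {"id": 1, "user": "alice", "score": 10},
    {"id": 2, "user": "bob",   "score": 20},
    {"id": 3, "user": "carol", "score": 30},
]
json_lines = "\n".join(json.dumps(r) for r in records)

# A CSV-style encoding states the "schema" (the header) exactly once.
csv_text = "id,user,score\n" + "\n".join(
    f'{r["id"]},{r["user"]},{r["score"]}' for r in records
)

print(len(json_lines) > len(csv_text))  # True: the duplicated keys cost real bytes
```

The gap grows linearly with the number of records, since the keys are paid for on every row.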

Binary Formats
Thrift (by Facebook)
Field names do not matter on the wire; fields are identified by numeric tags.
The Thrift compiler generates entity classes from the IDL; after any schema change, the compiler must be re-run to regenerate the bindings.

com.tweet
val tweet = TweetThrift(1, 123, "Saturday 8th, June", "arnuma")

Protocol Buffers (by Google)

  • Smaller encoding than Thrift
  • Only 5 wire types
  • The last 3 bits of each field key encode the wire type
  • The example record encodes to 36 bytes
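The "last 3 bits" point can be shown directly: in the Protocol Buffers wire format each field is prefixed with a key where `key = (field_number << 3) | wire_type`, so the low 3 bits carry the wire type (the helper name below is my own).

```python
# Decode a Protocol Buffers field key: the low 3 bits are the wire type,
# the remaining high bits are the field number (the tag from the .proto file).
def decode_key(key: int):
    return key >> 3, key & 0b111  # (field_number, wire_type)

# Key byte 0x08 = field 1, wire type 0 (varint);
# key byte 0x12 = field 2, wire type 2 (length-delimited, e.g. a string).
print(decode_key(0x08))  # (1, 0)
print(decode_key(0x12))  # (2, 2)
```

This is why renumbering fields breaks old data: the number, not the name, is what appears on the wire.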

Thrift and ProtoBuf Summary

  • Predominantly used for RPC: encoding and decoding are inexpensive
  • Fields identified not by name but by number (field tags)
  • Be careful not to reuse a previously existing tag number
  • Manual effort is needed to add a new field
  • Language-specific bindings must be generated (the biggest problem with both)
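The tag rules above can be made concrete with a hypothetical .proto sketch (the message and field names are invented for illustration):

```proto
// Hypothetical schema for the Tweet example above.
syntax = "proto3";

message Tweet {
  int64  id        = 1;  // the tag numbers, not the names, go on the wire
  int64  user_id   = 2;
  string created   = 3;
  string user_name = 4;
  // Adding a field = picking a NEW tag (5). Never reuse a retired tag,
  // or old data will be decoded as the wrong field.
  // string location = 5;
}
```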

Avro

  • Row oriented
  • Data and schema are embedded together, so the reader does not need the schema separately when reading the file
  • Sync markers delimit blocks of records, marking where each block semantically ends

No bindings required, unlike Thrift or Protobuf.
Supports aliases, making schema changes backward compatible: after moving to a new schema, data written with the old schema can still be read.
Renaming a column without an alias loses access to all past data under the old name; changing a column's type triggers a type conversion (where the types are compatible).
Being row-oriented, whole records must be loaded into memory to process the data.
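The alias mechanism is declared in the reader's schema: a renamed field lists its old name so data written under the old schema still resolves. A hypothetical sketch (record and field names invented for illustration):

```json
{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "user_id", "type": "long", "aliases": ["uid"]},
    {"name": "text", "type": "string"}
  ]
}
```

A reader using this schema can still read files whose writer schema called the second field `uid`; without the alias, that past data would no longer be matched to `user_id`.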

Parquet

  • Column oriented: data is encoded column by column
  • If reading only some columns, e.g. for an aggregation, Parquet is the right format
  • For select * (reading whole rows), Avro is the best format and Parquet the worst performing
  • "Tables have turned" approach: a hybrid layout built from row groups
  • Row groups, column chunks, and pages:
  • 1000 rows -> 128 MB -> 1 block -> 1 row group
  • Min and max statistics stored at the row-group level allow faster reads
  • Each column chunk is split into ~1 MB pages, each with its own statistics (min/max for int columns)
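The row-group min/max statistics enable predicate pushdown: a reader can prove an entire row group cannot match a filter and skip it without reading its data. A minimal sketch (the structures and function name are hypothetical, not Parquet's actual API):

```python
# Sketch of row-group skipping via min/max statistics.
# A reader evaluating `WHERE x > threshold` can skip any row group whose max <= threshold.
row_groups = [
    {"min": 1,   "max": 50,  "rows": list(range(1, 51))},
    {"min": 51,  "max": 150, "rows": list(range(51, 151))},
    {"min": 151, "max": 300, "rows": list(range(151, 301))},
]

def scan_greater_than(groups, threshold):
    out = []
    for g in groups:
        if g["max"] <= threshold:  # statistics prove no row in this group can match
            continue               # -> skip the whole row group, no IO for its data
        out.extend(v for v in g["rows"] if v > threshold)
    return out

print(len(scan_greater_than(row_groups, 100)))  # 200: group 1 is skipped outright
```

Page-level statistics work the same way, just at a finer (~1 MB) granularity within each column chunk.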

Parquet writes the schema at the end of the file, in the footer.
