- File Format specifies how records are encoded in files
- Record Format specifies how the stream of bytes for a given record is encoded
- The default file format is TEXTFILE – each record is a line in the file
- Hive uses control characters as delimiters in text files
- ^A (octal 001) between fields, ^B (octal 002) between collection items, ^C (octal 003) between map keys and values, \n between records
- Hive uses the term "field" when overriding the default delimiter
- FIELDS TERMINATED BY '\001'
- Supports delimited text files such as CSV and TSV
- TextFile can contain JSON or XML documents.
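
To make the default delimiters concrete, here is a minimal Python sketch that splits one TEXTFILE line the way the list above describes. The `parse_record` helper, the `SCHEMA` layout, and the sample values are hypothetical names chosen for illustration; this is not Hive's own parsing code.

```python
# Hive's default TEXTFILE delimiters, as listed above.
FIELD_DELIM = "\x01"       # ^A, octal 001: separates fields (columns)
COLLECTION_DELIM = "\x02"  # ^B, octal 002: separates array items and map entries
MAP_KEY_DELIM = "\x03"     # ^C, octal 003: separates a map key from its value
LINE_DELIM = "\n"          # separates records


def parse_record(line, schema):
    """Split one text line into a dict (hypothetical helper, not a Hive API).

    schema is a list of (column_name, kind) pairs, where kind is
    "scalar", "array", or "map".
    """
    fields = line.rstrip(LINE_DELIM).split(FIELD_DELIM)
    record = {}
    for (name, kind), raw in zip(schema, fields):
        if kind == "array":
            record[name] = raw.split(COLLECTION_DELIM) if raw else []
        elif kind == "map":
            entries = raw.split(COLLECTION_DELIM) if raw else []
            record[name] = dict(e.split(MAP_KEY_DELIM, 1) for e in entries)
        else:
            record[name] = raw
    return record


# Assumed example table: name STRING, skills ARRAY<STRING>, phones MAP<STRING,STRING>
SCHEMA = [("name", "scalar"), ("skills", "array"), ("phones", "map")]
sample = "alice\x01hive\x02pig\x01home\x03555-0100\x02work\x03555-0199\n"
parsed = parse_record(sample, SCHEMA)
print(parsed)
```

Control characters are used by default because they rarely appear inside real field values, which is why a comma or tab delimiter has to be declared explicitly with FIELDS TERMINATED BY.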
Commonly used File Formats –
- TextFile format
- Suitable for sharing data with other tools
- Can be viewed/edited manually
- SequenceFile
- Flat files that store binary key-value pairs
- SequenceFile provides Reader, Writer, and Sorter classes for reading, writing, and sorting, respectively
- Supports three formats: uncompressed, record-compressed (only values are compressed), and block-compressed (both keys and values are compressed)
- RCFile
- RCFile (Record Columnar File) stores a table in a record-columnar layout: rows are split into groups, and each group is stored column by column
- ORC
- AVRO
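
The difference between the record-compressed and block-compressed SequenceFile modes mentioned above can be sketched in a few lines of Python. This is a conceptual illustration only: it uses `zlib` and a made-up length-prefixed layout, not the real SequenceFile on-disk format or the Hadoop API, and all function names are hypothetical.

```python
import zlib


def record_compress(pairs):
    """Record compression: each value is compressed on its own; keys stay raw."""
    return [(key, zlib.compress(value)) for key, value in pairs]


def record_decompress(records):
    return [(key, zlib.decompress(blob)) for key, blob in records]


def _pack(items):
    # Length-prefix each item so the concatenated buffer can be split again.
    return b"".join(len(item).to_bytes(4, "big") + item for item in items)


def _unpack(buffer):
    items, pos = [], 0
    while pos < len(buffer):
        size = int.from_bytes(buffer[pos:pos + 4], "big")
        pos += 4
        items.append(buffer[pos:pos + size])
        pos += size
    return items


def block_compress(pairs):
    """Block compression: batch the records, then compress all keys
    together and all values together."""
    keys = _pack([key for key, _ in pairs])
    values = _pack([value for _, value in pairs])
    return zlib.compress(keys), zlib.compress(values)


def block_decompress(block):
    key_blob, value_blob = block
    keys = _unpack(zlib.decompress(key_blob))
    values = _unpack(zlib.decompress(value_blob))
    return list(zip(keys, values))


# Made-up sample data: 200 small, repetitive records.
pairs = [(b"row-%04d" % i, b"status=ok;region=eu-west") for i in range(200)]

records = record_compress(pairs)
block = block_compress(pairs)

record_size = sum(len(k) + len(v) for k, v in records)
block_size = len(block[0]) + len(block[1])
print("record-compressed bytes:", record_size)
print("block-compressed bytes:", block_size)
```

With many small, similar records, compressing a whole batch lets the compressor exploit redundancy across records, which is why block compression generally yields smaller files than compressing each value separately.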