
How to choose hive's file storage format?

2022-01-27 05:02:51 Shockang

This is day 8 of my participation in the August More Text Challenge. Check out the event details: August More Text Challenge.


Overview

  • The file storage formats commonly supported by Hive are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
  • TEXTFILE is the default format; it is used when a table is created without specifying a format. When data is loaded into such a table, the data file is copied directly to HDFS without any processing.
  • Tables stored as SEQUENCEFILE, ORC, or PARQUET cannot load data directly from local files. Instead, load the data into a TEXTFILE table first, then use insert to copy it into the SEQUENCEFILE, ORC, or PARQUET table.
  • TEXTFILE and SEQUENCEFILE are row-oriented storage formats.
  • ORC and PARQUET are column-oriented storage formats.
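The two-step load described above can be sketched in HiveQL as follows (the table names and file path here are illustrative, not from the test below):

```sql
-- 1. Stage the raw file in a TEXTFILE table; the file is copied to HDFS as-is.
create table staging_text (line string)
row format delimited fields terminated by '\t'
stored as TEXTFILE;

load data local inpath '/tmp/data.txt' into table staging_text;

-- 2. Convert by inserting into an ORC (or PARQUET) table;
--    the data is rewritten into the columnar format during the insert.
create table final_orc (line string) stored as ORC;

insert into table final_orc select * from staging_text;
```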

For more on SEQUENCEFILE, see my blog post: What is SequenceFile and how is it used?

For more on ORC, see my blog post: What is ORCFile?

For more on PARQUET, see my blog post: What is Parquet?

For more on AVRO, see my blog post: What is Apache Avro?

All file storage formats supported by Hive

STORED AS TEXTFILE: Stored as plain text files. TEXTFILE is the default file format unless the configuration parameter hive.default.fileformat is set otherwise. Use the DELIMITED clause to read delimited files. Enable escaping for the delimiter characters with the ESCAPED BY clause (for example ESCAPED BY '\'); escaping is needed if you want to work with data that can contain those delimiter characters. A custom NULL format can also be specified with the "NULL DEFINED AS" clause (the default is '\N'). (Hive 4.0) All binary columns in the table are assumed to be base64 encoded. To read the data as raw bytes: TBLPROPERTIES ("hive.serialization.decode.binary.as.base64"="false").

STORED AS SEQUENCEFILE: Stored as compressed Sequence Files.

STORED AS ORC: Stored in the ORC file format. Supports ACID transactions and the cost-based optimizer (CBO). Stores column-level metadata.

STORED AS PARQUET: Stored as Parquet format for the Parquet columnar storage format in Hive 0.13.0 and later; in Hive 0.10, 0.11, or 0.12 the syntax is ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ...

STORED AS AVRO: Stored as Avro format in Hive 0.14.0 and later.

STORED AS RCFILE: Stored in the Record Columnar File format.

STORED AS JSONFILE: Stored in the JSON file format in Hive 4.0.0 and later.

STORED BY: Stored by a non-native table format. Creates, or links to, a non-native table, for example a table backed by HBase, Druid, or Accumulo.

INPUTFORMAT and OUTPUTFORMAT: In file_format, specify the names of the corresponding InputFormat and OutputFormat classes as string literals, for example 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, the values to use are INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" and OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat".
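As a sketch of the TEXTFILE clauses described above, a delimited table with escaping and a custom NULL marker could be declared like this (the table and its columns are hypothetical, for illustration only):

```sql
-- Hypothetical TEXTFILE table: tab-delimited, backslash-escaped,
-- with the literal string 'NULL' (instead of the default '\N')
-- marking null values.
CREATE TABLE events_raw (
  event_time string,
  payload string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  NULL DEFINED AS 'NULL'
STORED AS TEXTFILE;
```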

File storage format comparison

1. Compression ratio test of the stored files

1.1 Test data

The test data file log.txt is 18.1 MB in size.

1.2 TEXTFILE

  • Create a table whose storage format is TEXTFILE:
create table log_text (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as TEXTFILE;
  • Load the data into the table:
load data local inpath '/home/hadoop/log.txt' into table log_text;
  • Check the size of the table's data:
dfs -du -h /user/hive/warehouse/log_text;

+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+

1.3 PARQUET

  • Create a table whose storage format is PARQUET:
create table log_parquet (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET;
  • Load the data into the table:
insert into table log_parquet select * from log_text;
  • Check the size of the table's data:
dfs -du -h /user/hive/warehouse/log_parquet;

+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+

1.4 ORC

  • Create a table whose storage format is ORC:
create table log_orc (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC;
  • Load the data into the table:
insert into table log_orc select * from log_text;
  • Check the size of the table's data:
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+

1.5 Storage file compression ratio summary

ORC > PARQUET > TEXTFILE

In this test ORC compresses best (2.8 MB), followed by PARQUET (13.1 MB), while TEXTFILE is uncompressed (18.1 MB).

2. Query speed test of the stored files

2.1 TEXTFILE

select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)

2.2 PARQUET

select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)

2.3 ORC

select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)

2.4 Query speed summary

ORC > TEXTFILE > PARQUET

For this simple count(*) query the differences are small: ORC 15.9 s, TEXTFILE 17.0 s, PARQUET 18.0 s.

3. Combining storage formats with compression

  • The advantage of compression is that it minimizes the disk space required and reduces disk and network I/O.

  • ORC supports three compression options: ZLIB, SNAPPY, and NONE (no compression). ==ORC defaults to ZLIB compression==.

3.1 Create an uncompressed ORC table

  • 1. Create an ORC table with compression disabled:
create table log_orc_none (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties ("orc.compress"="NONE");
  • 2. Load the data:
insert into table log_orc_none select * from log_text;
  • 3. Check the size of the table's data:
dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+

3.2 Create a snappy-compressed ORC table

  • 1. Create an ORC table with snappy compression:
create table log_orc_snappy (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties ("orc.compress"="SNAPPY");
  • 2. Load the data:
insert into table log_orc_snappy select * from log_text;
  • 3. Check the size of the table's data:
dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+

3.3 Create a ZLIB-compressed ORC table

  • If no compression format is specified, ORC defaults to ZLIB compression.
  • The log_orc table created above can serve as the reference.
  • Check the size of the table's data:
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+

Storage and compression summary

  • ORC's default ZLIB compression produces smaller files than Snappy (2.8 MB vs 3.8 MB here).
  • In real project development, the storage format of Hive tables is generally chosen as ==ORC or PARQUET==.
  • Because snappy compresses and decompresses very efficiently, ==snappy is generally chosen as the compression codec==.
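The tests above only set snappy on ORC; for a PARQUET table the codec is set through the parquet.compression table property instead. A sketch, reusing the test schema (the table name log_parquet_snappy is mine, not from the tests above):

```sql
-- Hypothetical PARQUET table compressed with snappy via the
-- parquet.compression table property.
create table log_parquet_snappy (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET tblproperties ("parquet.compression"="SNAPPY");
```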

Copyright notice
Author: Shockang. Please include the original link when reprinting. Thank you.
https://en.cdmana.com/2022/01/202201270502458617.html