
Hadoop learning 5-4: Hadoop 3.x new feature: erasure coding

2022-01-27 01:06:51 May you be treated warmly by the world

1 Basic concepts

HDFS provides support for erasure coding (EC) so that data can be stored more efficiently. Compared with the default three-replica mechanism, an EC policy can save about 50% of the storage space.

However, encoding and decoding consume CPU resources, and this cost cannot be ignored. Codec performance is critical to the use of erasure coding in HDFS, and without hardware optimization it is hard to reach the desired performance. Intel's Intelligent Storage Acceleration Library (ISA-L) provides optimized erasure encoding and decoding and greatly improves performance.

Erasure coding is a new feature of Hadoop 3.x. Earlier versions of HDFS relied on replication alone for fault tolerance: by default a file has 3 replicas, which tolerates the loss of any 2 replicas (DataNodes). This improves data availability, but it also brings a 2x (200%) redundancy overhead; for example, 3 TB of space can hold only 1 TB of actual data. Erasure coding offers the same or better availability while using less space. Take the RS-6-3-1024k policy as an example: for every 6 data units, encoding generates 3 parity units, 9 units in total. The original data can be recovered as long as any 6 of the 9 units survive, so the loss of any 3 units can be tolerated.
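The comparison above can be checked with a little arithmetic. The following is a minimal sketch, not part of Hadoop; the function names are made up for illustration, and it only models raw capacity and the number of lost units each scheme tolerates.

```python
def replication_footprint(data_tb: float, replicas: int = 3) -> dict:
    """Raw storage and fault tolerance for n-way replication."""
    return {
        "raw_tb": data_tb * replicas,
        "overhead_pct": (replicas - 1) * 100,
        "tolerated_failures": replicas - 1,
    }


def erasure_footprint(data_tb: float, data_units: int, parity_units: int) -> dict:
    """Raw storage and fault tolerance for a (data, parity) erasure code."""
    return {
        "raw_tb": data_tb * (data_units + parity_units) / data_units,
        "overhead_pct": parity_units * 100 / data_units,
        "tolerated_failures": parity_units,
    }


rep = replication_footprint(1.0)        # 1 TB of data, 3 replicas
ec = erasure_footprint(1.0, 6, 3)       # 1 TB of data, RS-6-3
saving = 1 - ec["raw_tb"] / rep["raw_tb"]

print("3 replicas:", rep)
print("RS-6-3:    ", ec)
print(f"space saved compared with 3 replicas: {saving:.0%}")
```

For 1 TB of data this gives 3 TB of raw storage with replication and 1.5 TB with RS-6-3, which is where the "about 50%" figure comes from.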

2 Erasure coding operations

2.1 Viewing erasure coding policies

hdfs ec -listPolicies


The command lists several policies. RS-6-3-1024k is explained below as an example; the others follow the same pattern.

RS-6-3-1024k: uses the RS (Reed-Solomon) code. For every 6 data units, 3 parity units are generated, 9 units in total. As long as any 6 of these 9 units survive (data units or parity units, it does not matter which, provided at least 6 remain), the original data can be recovered.

1024k is the cell size: 1024 KB is the smallest unit into which data is split. For example, an uploaded 40 MB file is cut into 1024 KB cells, and the cells are distributed over the 6 data blocks named by the 6 in the policy. Because 40 is not a multiple of 6, the blocks are not all the same size: each data block holds at most 7 cells, that is, at most 7 MB.

The data part therefore occupies at most 6 × 7 MB = 42 MB, and the 3 parity blocks add at most 3 × 7 MB = 21 MB. With the replication mechanism (3 replicas in this cluster), the same file would occupy 120 MB. So even though the parity units also take up storage, the erasure coding policy saves up to about 50% of the space compared with three replicas.
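The 40 MB example above can be worked through in a few lines. This is only a sketch under stated assumptions, not HDFS code: it assumes cells are handed to the 6 data blocks in round-robin order, that each parity block is as long as the longest data block, and that the file fits in a single block group. The name striped_layout is made up for illustration, and the 7 MB per block mentioned above is treated as an upper bound.

```python
MB = 1024 * 1024
CELL = 1024 * 1024  # 1024 KB cell, from the "1024k" in the policy name


def striped_layout(file_bytes, data_units=6, parity_units=3, cell=CELL):
    """Per-block byte counts for one file under a striped layout (sketch).

    Assumption: cells go to the data blocks in round-robin order, and every
    parity block is as long as the longest data block.
    """
    data_blocks = [0] * data_units
    offset = 0
    cell_index = 0
    while offset < file_bytes:
        chunk = min(cell, file_bytes - offset)  # the last cell may be short
        data_blocks[cell_index % data_units] += chunk
        offset += chunk
        cell_index += 1
    parity_blocks = [max(data_blocks)] * parity_units
    return data_blocks, parity_blocks


data, parity = striped_layout(40 * MB)
ec_total = sum(data) + sum(parity)
replicated_total = 3 * 40 * MB

print("data blocks (MB):  ", [b // MB for b in data])
print("parity blocks (MB):", [b // MB for b in parity])
print("EC total (MB):     ", ec_total // MB)
print("3 replicas (MB):   ", replicated_total // MB)
```

Under these assumptions the 40 cells land as four data blocks of 7 MB and two of 6 MB, plus three parity blocks of 7 MB, for roughly 61 MB in total against 120 MB with three replicas.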

State: the status of the policy. In the command output, RS-6-3-1024k is shown as enabled.

In theory, RS-6-3-1024k needs 9 DataNodes and RS-3-2-1024k needs 5 DataNodes, and so on: one DataNode for each data or parity unit.
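The rule above (one DataNode per unit) can be read straight off the policy name. The sketch below assumes the name always follows the pattern codec-data-parity-cellsize, as in the names used in this article; EcPolicy and parse_policy are hypothetical helpers, not part of any Hadoop API.

```python
import re
from typing import NamedTuple


class EcPolicy(NamedTuple):
    codec: str
    data_units: int
    parity_units: int
    cell_kb: int

    @property
    def min_datanodes(self) -> int:
        # One DataNode per data unit plus one per parity unit.
        return self.data_units + self.parity_units


# codec (letters, optionally hyphenated), data units, parity units, cell size in KB
_NAME = re.compile(r"^([A-Za-z]+(?:-[A-Za-z]+)*)-(\d+)-(\d+)-(\d+)k$", re.IGNORECASE)


def parse_policy(name: str) -> EcPolicy:
    """Split a policy name such as 'RS-6-3-1024k' into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized policy name: {name!r}")
    codec, data, parity, cell = match.groups()
    return EcPolicy(codec.upper(), int(data), int(parity), int(cell))


for policy_name in ("RS-6-3-1024k", "RS-3-2-1024k"):
    policy = parse_policy(policy_name)
    print(policy_name, "->", policy, "min DataNodes:", policy.min_datanodes)
```

The match is case-insensitive on the trailing k, since the policy name is written both as 1024k and 1024K in this article.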

2.2 Setting an erasure coding policy

An erasure coding policy is associated with a specific path. In other words, to use erasure coding you set a policy on a particular directory, and every file subsequently stored in that directory follows that policy.
By default only the RS-6-3-1024k policy is enabled. Any other policy must be enabled before it can be used.

The following example sets RS-3-2-1024k on the /input directory. Once an erasure coding policy is set on a directory, files stored there no longer use the replication policy.

1. Enable the RS-3-2-1024k policy (a policy can only be used after it has been enabled)

# Enable the policy
hdfs ec -enablePolicy -policy RS-3-2-1024k

# Disable the policy
hdfs ec -disablePolicy -policy RS-3-2-1024k


2. Create a directory in HDFS and set the erasure coding policy on it

# Create the directory
hdfs dfs -mkdir /input

# Set the policy on the /input directory
hdfs ec -setPolicy -path /input -policy RS-3-2-1024k

# Show the erasure coding policy of the directory
hdfs ec -getPolicy -path /input
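When the same steps have to be repeated for several directories, the command sequence from steps 1 and 2 can be generated by a small script. This is a sketch only: ec_setup_commands is a hypothetical helper, it uses just the hdfs subcommands and flags shown above, and as written it prints the commands instead of running them.

```python
import shlex


def ec_setup_commands(path, policy):
    """Return the CLI calls, in order, that put `path` under `policy`."""
    return [
        ["hdfs", "ec", "-enablePolicy", "-policy", policy],
        ["hdfs", "dfs", "-mkdir", path],
        ["hdfs", "ec", "-setPolicy", "-path", path, "-policy", policy],
        ["hdfs", "ec", "-getPolicy", "-path", path],
    ]


commands = ec_setup_commands("/input", "RS-3-2-1024k")
for argv in commands:
    # Dry run: only print. To execute on a machine with the Hadoop client
    # installed, pass each argv to subprocess.run(argv, check=True) instead.
    print(shlex.join(argv))
```

Keeping each command as an argument list rather than a single string avoids quoting problems if a path contains spaces.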


3. Upload a file and check how the encoded file is stored

Upload any file to HDFS and check its replica count. (The replication factor configured for this cluster is 3, and 5 DataNodes were created, since in theory RS-3-2-1024k needs 5 DataNodes.)
The replica count shown for the file is 1, which differs from the configured value. Clicking the file shows how its data is stored: there is data on all 5 machines. That data is the 3 data units and 2 parity units, with each unit on a different machine rather than all 5 units on one machine. Only one copy of each unit is kept.
The storage of the file can also be checked with the following command:

hdfs fsck /input/aaa.txt -files -blocks -locations


2.3 Testing the erasure coding policy

To exercise the fault tolerance of the erasure coding policy, shut down one of the DataNodes so that one of the 5 stored units is missing, then try to read the file as usual. Use the following command to copy the file to the local machine:

hadoop fs -get /input/aaa.txt ./ec

As expected, the command reports an error because one DataNode is unreachable, but the stored data is still intact: opening the local ec file shows that the whole file was copied.
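Why the read still succeeds with one unit missing can be illustrated with the simplest possible erasure code. The sketch below uses plain XOR parity (2 data units plus 1 parity unit), not the Reed-Solomon code that RS-3-2-1024k uses, so it tolerates only a single lost unit; the principle of rebuilding a missing unit from the surviving ones is the same. All function names are made up for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally long units."""
    if len(a) != len(b):
        raise ValueError("units must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(d0: bytes, d1: bytes) -> list:
    """Return [data0, data1, parity], where parity = data0 XOR data1."""
    return [d0, d1, xor_bytes(d0, d1)]


def recover(units: list) -> list:
    """Rebuild at most one missing unit (marked as None) from the other two."""
    missing = [i for i, unit in enumerate(units) if unit is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one missing unit")
    if not missing:
        return list(units)
    present = [unit for unit in units if unit is not None]
    repaired = list(units)
    # data0 XOR data1 XOR parity is all zeros, so any one unit
    # equals the XOR of the other two.
    repaired[missing[0]] = xor_bytes(present[0], present[1])
    return repaired


stored = encode(b"hello wo", b"rld 1234")
damaged = [stored[0], None, stored[2]]   # the node holding data1 is down
repaired = recover(damaged)
print(repaired[0] + repaired[1])
```

Reed-Solomon generalizes this idea so that, with 2 parity units as in RS-3-2-1024k, any 2 of the 5 units may be lost.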

copyright notice
Author: May you be treated warmly by the world. Please include the original link when reprinting. Thank you.
https://en.cdmana.com/2022/01/202201270106479624.html
