current position:Home>What are the differences among order by, sort by, distribution by and cluster by in Apache hive?

What are the differences among order by, sort by, distribution by and cluster by in Apache hive?

2022-01-27 05:02:57 Shockang

This is my participation 8 The fourth of the yuegengwen challenge 8 God , Check out the activity details :8 Yuegengwen challenge

Text

  1. order by It will sort all the data given globally , No matter how much data comes , Start only one reducer To deal with it .
  2. sort by yes Local sorting ,sort by One or more... Will be started according to the size of the data volume reducer Come to work , also , It will enter reduce Before for each reducer All produce a sort file .
  3. distribute by control map Distribution of results , It will... With the same field map The output is distributed to a reduce Processing on nodes .
  4. cluster by It can be understood as a special distribute by and sort by The combination of , When distribute by and sort by The following column names are the same , It's equivalent to using cluster by Keep up with the column name . But be cluster by The final sorting result of the specified column can only be descending , And you can't specify asc and desc.

Add

1. order by Global ordering

Global ordering , only one reduce

Use order by Clause ordering

  1. asc (ascend)—— Ascending ( Default )
  2. desc (descend)—— Descending

order by Clause in select End of statement

2. distribute by Partition sorting

distribute by similar MapReduce in partition,== collection hash Algorithm , stay map The end will query the results in hash Results with the same value are distributed to the corresponding reduce In file ==. Need to combine sort by Use .

Be careful : Hive requirement distribute by The statement is written in sort by The statement before .

3. cluster by

  • When distribute by and sort by Same field , have access to cluster by The way

  • except distribute by Function outside , It also sorts the fields , therefore cluster by = distribute by + sort by

-- The following two ways are equivalent 

insert overwrite local directory '/home/hadoop/hivedata/distribute_sort' 
select * from student distribute  by score sort  by score;


insert overwrite local directory '/home/hadoop/hivedata/cluster' 
select * from student cluster by score;
 Copy code 

practice

1. Check the student's grades , And in descending order of scores

select * from student s order by score desc;
 Copy code 

2. Sort by alias

  • Sort according to the average of students' scores
select s.sid,s.tname, avg(score)  as score_avg  from student s  group by  s.sid,s.tname order by score_avg  desc;
 Copy code 

3. Multi column sorting

  • Sort in ascending order of students' scores and age
select *  from student s order by score,age;
 Copy code 

4. Every MapReduce Internal sorting (Sort By) Local sorting

sort by: Every reducer Sort internally , Not sort for global result sets .

1、 Set up reduce Number

set mapreduce.job.reduces=3;
 Copy code 

2、 Check the Settings reduce Number

set mapreduce.job.reduces;
 Copy code 

3、 The query results are arranged in descending order

select * from student s sort by s.score;
 Copy code 

4、 Import the query results into a file ( In descending order of grades )

insert overwrite local directory '/home/hadoop/hivedata/sort' select * from student s sort by s.score;
 Copy code 

5. First according to the students sid partition , Then sort according to the students' grades

1、 Set up reduce The number of

set mapreduce.job.reduces=3;
 Copy code 

2、 adopt distribute by Partition data , Will be different sid Divided into corresponding reduce Go to of

insert overwrite local directory '/home/hadoop/hivedata/distribute' select * from student distribute by sid sort by score;
 Copy code 

copyright notice
author[Shockang],Please bring the original link to reprint, thank you.
https://en.cdmana.com/2022/01/202201270502509068.html

Random recommended