What are the differences among order by, sort by, distribute by and cluster by in Apache Hive?
2022-01-27 05:02:57 【Shockang】
Text
- order by sorts all of the data globally. No matter how much data there is, only one reducer is started to process it.
- sort by is a local sort. It starts one or more reducers depending on the data volume, and the data is sorted before it enters each reducer, producing one sorted file per reducer.
- distribute by controls how the map output is distributed: map output rows with the same value in the given field are sent to the same reduce node for processing.
- cluster by can be understood as a special combination of distribute by and sort by. When distribute by and sort by are followed by the same column name, the result is equivalent to cluster by followed by that column name. However, the column specified in cluster by can only be sorted in ascending order; asc and desc cannot be specified.
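The difference between the global sort of order by and the local sort of sort by can be sketched with a small Python model. This is not Hive code: the function names are hypothetical, and the round-robin spreading of rows is only an assumed stand-in for however Hive assigns rows to reducers when no distribute by is given.

```python
# Toy model of ORDER BY vs SORT BY; not Hive code.
# Assumption: rows are spread over reducers round-robin, purely for illustration.

def order_by(rows, key):
    """Global sort: every row passes through a single reducer."""
    return [sorted(rows, key=key)]

def sort_by(rows, key, num_reducers):
    """Local sort: each reducer sorts only the rows it received."""
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket, key=key) for bucket in buckets]

# Hypothetical scores.
scores = [70, 95, 60, 88, 75, 91]

global_result = order_by(scores, key=lambda s: s)
local_result = sort_by(scores, key=lambda s: s, num_reducers=3)

print(global_result)  # one reducer, one fully sorted output
print(local_result)   # three outputs, each sorted only within itself
```

Each inner list in local_result is sorted, but reading the three lists one after another does not give a globally sorted sequence.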
Additional notes
1. order by: global sorting
Global sorting uses only one reducer.
Sorting with the order by clause:
- asc (ascend): ascending (default)
- desc (descend): descending
The order by clause goes at the end of the select statement.
2. distribute by: partitioning
distribute by is similar to partition in MapReduce: using a hash algorithm, the map side sends query results whose hash values for the given field are equal to the same reducer's output file. It is normally used together with sort by.
Note: Hive requires the distribute by clause to be written before the sort by clause.
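The hash-based distribution described above can be sketched in Python as follows. This is a toy model, not Hive code: the function name is hypothetical, and using the integer key itself modulo the reducer count is an assumption standing in for Hive's real hash function.

```python
# Toy model of DISTRIBUTE BY; not Hive code.
# Assumption: reducer index = key % num_reducers, with the integer key
# standing in for its hash. Hive's actual hashing may differ.

def distribute_by(rows, key, num_reducers):
    """Send rows with the same key value to the same reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers

# Hypothetical (sid, score) rows.
students = [(1, 80), (2, 65), (3, 90), (1, 72), (2, 88), (4, 55)]

reducers = distribute_by(students, key=lambda r: r[0], num_reducers=3)
for index, rows in enumerate(reducers):
    print(index, rows)
```

Rows with the same sid always land in the same reducer, but nothing inside a reducer is ordered yet, which is why a sort by clause usually follows.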
3. cluster by
- When distribute by and sort by use the same field, cluster by can be used instead.
- Besides doing what distribute by does, it also sorts by that field, so cluster by = distribute by + sort by.
-- The following two statements are equivalent
insert overwrite local directory '/home/hadoop/hivedata/distribute_sort'
select * from student distribute by score sort by score;
insert overwrite local directory '/home/hadoop/hivedata/cluster'
select * from student cluster by score;
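The equivalence of the two statements above can be modelled in Python. This is a sketch under assumptions, not Hive code: the function names and the sample rows are hypothetical, and the reducer index is taken as the score modulo the reducer count in place of Hive's real hashing.

```python
# Toy model of CLUSTER BY = DISTRIBUTE BY + SORT BY; not Hive code.
# Assumption: reducer index = key % num_reducers stands in for Hive's hashing.

def distribute_sort(rows, dist_key, sort_key, num_reducers, descending=False):
    """Model of: distribute by <dist_key> sort by <sort_key> [desc]."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[dist_key(row) % num_reducers].append(row)
    return [sorted(r, key=sort_key, reverse=descending) for r in reducers]

def cluster_by(rows, key, num_reducers):
    """Model of: cluster by <key>. One column, ascending only."""
    return distribute_sort(rows, key, key, num_reducers)

def score(row):
    return row[1]

# Hypothetical (name, score) rows.
students = [("a", 90), ("b", 61), ("c", 75), ("d", 90), ("e", 60), ("f", 76)]

explicit = distribute_sort(students, score, score, num_reducers=2)
clustered = cluster_by(students, score, num_reducers=2)
descending = distribute_sort(students, score, score, num_reducers=2, descending=True)

print(clustered)
print(descending)
```

In this model cluster_by takes no direction argument, mirroring the rule that cluster by cannot be given asc or desc. A descending result needs the explicit distribute by ... sort by ... desc form.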
Practice
1. Query the students' scores in descending order of score
select * from student s order by score desc;
2. Sort by alias
- Sort by each student's average score
select s.sid,s.tname, avg(score) as score_avg from student s group by s.sid,s.tname order by score_avg desc;
3. Multi-column sorting
- Sort in ascending order of score, then age
select * from student s order by score,age;
4. Sorting within each reducer (sort by): local sorting
sort by: sorts within each reducer; it does not sort the global result set.
(1) Set the number of reducers
set mapreduce.job.reduces=3;
(2) Check the configured number of reducers
set mapreduce.job.reduces;
(3) Query the results sorted by score in descending order
select * from student s sort by s.score desc;
(4) Write the query results to local files (in descending order of score)
insert overwrite local directory '/home/hadoop/hivedata/sort' select * from student s sort by s.score desc;
5. Partition by student sid first, then sort by score
(1) Set the number of reducers
set mapreduce.job.reduces=3;
(2) Use distribute by to partition the data, sending different sid values to their corresponding reducers
insert overwrite local directory '/home/hadoop/hivedata/distribute' select * from student distribute by sid sort by score;
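What the last statement produces with three reducers can be sketched in Python. This is a toy model, not Hive code: the rows are hypothetical, and sid modulo the reducer count is an assumed stand-in for Hive's hash partitioning.

```python
# Toy model of: distribute by sid sort by score (3 reducers); not Hive code.
# Assumption: reducer index = sid % NUM_REDUCERS stands in for Hive's hashing.
from collections import defaultdict

NUM_REDUCERS = 3

# Hypothetical (sid, score) rows, several per student.
rows = [(1, 80), (2, 65), (4, 95), (1, 72), (2, 88), (4, 55), (3, 70)]

# distribute by sid: each sid is routed to exactly one reducer.
reducers = defaultdict(list)
for sid, score in rows:
    reducers[sid % NUM_REDUCERS].append((sid, score))

# sort by score: each reducer orders its own rows by score (ascending).
output_files = {
    index: sorted(part, key=lambda r: r[1])
    for index, part in sorted(reducers.items())
}

for index, part in output_files.items():
    print(index, part)
```

Every sid ends up in a single output, but within one output the rows are ordered by score only, so rows of different students that share a reducer are interleaved (sid 1 and sid 4 in this sample).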
Copyright notice
Author: Shockang. Please include the original link when reprinting. Thank you.
https://en.cdmana.com/2022/01/202201270502509068.html