How is data skew caused in Spark?

2022-02-04 16:31:40 Alibaba Cloud Q&A

How is data skew caused in Spark?

Answer 1:

In Spark, different partitions of the same stage can be processed in parallel, while stages that depend on one another are processed serially. Suppose a Spark job is split into two stages, Stage 0 and Stage 1, and Stage 1 depends on Stage 0. Stage 1 does not start until Stage 0 has finished completely. Stage 0 may contain N tasks, and these N tasks can run in parallel. If N-1 of them each finish in 10 seconds but the remaining task takes 1 minute, the stage takes at least 1 minute in total. In other words, the time a stage takes is determined mainly by its slowest task. Because all tasks in the same stage perform the same computation, and setting aside differences in computing power between nodes, the difference in running time between tasks is determined mainly by the amount of data each task processes. Data skew arises when the data is distributed unevenly across partitions, so that a few tasks process far more data than the rest and hold up the whole stage.
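The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration, not Spark code. It assumes all tasks of the stage run at the same time and that a task's running time is proportional to the number of records it processes. The names `task_ms` and `stage_ms`, the cost of 10 ms per record, and the record counts are all hypothetical values chosen to reproduce the 10-second and 1-minute figures.

```python
def task_ms(records, ms_per_record):
    """Running time of one task, assumed proportional to its record count."""
    return records * ms_per_record


def stage_ms(records_per_task, ms_per_record):
    """Running time of a stage whose tasks all run in parallel:
    the stage finishes only when its slowest task finishes."""
    return max(task_ms(r, ms_per_record) for r in records_per_task)


MS_PER_RECORD = 10  # hypothetical cost of processing one record

# Skewed: nine tasks get 1,000 records each (10 s), one task gets 6,000 (60 s).
skewed_partitions = [1_000] * 9 + [6_000]
skewed_ms = stage_ms(skewed_partitions, MS_PER_RECORD)

# Even: the same 15,000 records spread equally over the ten tasks.
even_partitions = [1_500] * 10
even_ms = stage_ms(even_partitions, MS_PER_RECORD)

print(skewed_ms // 1000)  # 60 (seconds)
print(even_ms // 1000)    # 15 (seconds)
```

Both cases process the same total number of records. Under these assumptions the skewed stage takes four times as long, because nine tasks sit idle while the one oversized task finishes.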

Copyright notice
Author: Alibaba Cloud Q&A. Please include the original link when reprinting. Thank you.
