When configuring the parallelism of Spark jobs, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?

First, let's look at their definitions:
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
| spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager. | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user. |
The two definitions look quite similar, but in practice they behave differently: spark.default.parallelism only takes effect for RDD operations and has no effect on Spark SQL, while spark.sql.shuffle.partitions applies specifically to Spark SQL (DataFrame/Dataset) shuffles.
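The difference can be verified with a short experiment in spark-shell. Below is a minimal sketch, assuming a local-mode session with the two settings deliberately set to different values (the app name, sample data, and column names are made up for illustration; adaptive query execution is disabled so the SQL shuffle partition count is not coalesced at runtime):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-demo")                    // illustrative name
  .master("local[4]")
  .config("spark.default.parallelism", "10")      // governs RDD shuffles
  .config("spark.sql.shuffle.partitions", "30")   // governs SQL/DataFrame shuffles
  .config("spark.sql.adaptive.enabled", "false")  // keep partition counts predictable
  .getOrCreate()

val sc = spark.sparkContext

// RDD shuffle: reduceByKey picks up spark.default.parallelism (10 partitions)
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(rdd.reduceByKey(_ + _).getNumPartitions)

// DataFrame shuffle: groupBy picks up spark.sql.shuffle.partitions (30 partitions)
import spark.implicits._
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
println(df.groupBy("k").count().rdd.getNumPartitions)
```

The first count follows spark.default.parallelism and the second follows spark.sql.shuffle.partitions, confirming that each knob controls a different execution path.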
We can set both values at job submission time via --conf, like this:
spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
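One practical difference worth noting: spark.sql.shuffle.partitions is a runtime SQL conf, so it can also be changed on an existing session, whereas spark.default.parallelism is read when the SparkContext is created and should therefore be passed at submit time as shown above. A sketch, assuming an already-created session named `spark`:

```scala
// Takes effect for subsequent DataFrame/SQL shuffles on this session
spark.conf.set("spark.sql.shuffle.partitions", "50")

// spark.default.parallelism, by contrast, is a Spark-core setting read at
// SparkContext startup; changing it on a running context has no effect,
// so set it via --conf (or SparkConf) before the job starts.
```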