When configuring the parallelism of Spark jobs, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?

First, let's look at their definitions:
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
| spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager. | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user. |
The two definitions look quite similar, but in practice they behave differently: spark.default.parallelism only takes effect for RDD operations and has no effect on Spark SQL, while spark.sql.shuffle.partitions applies specifically to Spark SQL (DataFrame/Dataset) shuffles.
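The difference can be verified with a short experiment in spark-shell. Below is a minimal sketch, assuming a local-mode session with the two settings deliberately set to different values (the app name, sample data, and column names are made up for illustration; adaptive query execution is disabled so the SQL shuffle partition count is not coalesced at runtime):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-demo")                    // illustrative name
  .master("local[4]")
  .config("spark.default.parallelism", "10")      // governs RDD shuffles
  .config("spark.sql.shuffle.partitions", "30")   // governs SQL/DataFrame shuffles
  .config("spark.sql.adaptive.enabled", "false")  // keep partition counts predictable
  .getOrCreate()

val sc = spark.sparkContext

// RDD shuffle: reduceByKey picks up spark.default.parallelism (10 partitions)
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(rdd.reduceByKey(_ + _).getNumPartitions)

// DataFrame shuffle: groupBy picks up spark.sql.shuffle.partitions (30 partitions)
import spark.implicits._
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
println(df.groupBy("k").count().rdd.getNumPartitions)
```

The first count follows spark.default.parallelism and the second follows spark.sql.shuffle.partitions, confirming that each knob controls a different execution path.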
We can set both values at job submission time via --conf, like this:
spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
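One practical difference worth noting: spark.sql.shuffle.partitions is a runtime SQL conf, so it can also be changed on an existing session, whereas spark.default.parallelism is read when the SparkContext is created and should therefore be passed at submit time as shown above. A sketch, assuming an already-created session named `spark`:

```scala
// Takes effect for subsequent DataFrame/SQL shuffles on this session
spark.conf.set("spark.sql.shuffle.partitions", "50")

// spark.default.parallelism, by contrast, is a Spark-core setting read at
// SparkContext startup; changing it on a running context has no effect,
// so set it via --conf (or SparkConf) before the job starts.
```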