Configuration.md

layout: page title: Configuration nav_order: 15

Spark base configurations for Gluten plugin

Key	Modifiability	Recommended Setting	Description
spark.plugins	⚓ Static	org.apache.gluten.GlutenPlugin	Loads Gluten's components through Spark's plug-in loader.
spark.memory.offHeap.enabled	⚓ Static	true	Gluten uses off-heap memory for certain operations.
spark.memory.offHeap.size	⚓ Static	30G	The absolute amount of memory that can be used for off-heap allocation, in bytes unless otherwise specified. Note: the Gluten plugin uses this setting to allocate memory for native usage even if off-heap is disabled. Choose the value based on your system, and increase it if you hit out-of-memory errors in the Gluten plugin.
spark.shuffle.manager	⚓ Static	org.apache.spark.shuffle.sort.ColumnarShuffleManager	To turn on Gluten Columnar Shuffle Plugin.
spark.driver.extraClassPath	⚓ Static	/path/to/gluten_jar_file	Gluten Plugin jar file to prepend to the classpath of the driver.
spark.executor.extraClassPath	⚓ Static	/path/to/gluten_jar_file	Gluten Plugin jar file to prepend to the classpath of executors.
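
As a minimal sketch, the base settings in the table above can be collected in one place and rendered as `--conf key=value` flags for `spark-submit`. The helper name `to_submit_flags` and the dictionary name `BASE_GLUTEN_CONF` are hypothetical and exist only for this illustration; the jar path is the placeholder from the table and must be replaced with a real path.

```python
# Hypothetical helper (not part of Gluten): render the base settings
# from the table above as spark-submit "--conf key=value" flags.
BASE_GLUTEN_CONF = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.memory.offHeap.enabled": "true",
    # Recommended value from the table; size it for your own system.
    "spark.memory.offHeap.size": "30G",
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    # Placeholder path from the table; replace with the real jar location.
    "spark.driver.extraClassPath": "/path/to/gluten_jar_file",
    "spark.executor.extraClassPath": "/path/to/gluten_jar_file",
}


def to_submit_flags(conf):
    """Return a flat list ['--conf', 'key=value', ...] in insertion order."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    # Print the flags on one line so they can be pasted after `spark-submit`.
    print(" ".join(to_submit_flags(BASE_GLUTEN_CONF)))
```

All six keys are marked Static in the table, so they are passed at launch time here rather than changed on a running session.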

Gluten configurations

Key	Modifiability	Default	Description
spark.gluten.costModel	🔄 Dynamic	legacy	The class name of user-defined cost model that will be used by Gluten's transition planner. If not specified, a legacy built-in cost model will be used. The legacy cost model helps RAS planner exhaustively offload computations, and helps transition planner choose columnar-to-columnar transition over others.
spark.gluten.enabled	🔄 Dynamic	true	Whether to enable Gluten. This is an experimental property; the recommended way to enable or disable Gluten is through the spark.plugins setting.
spark.gluten.execution.resource.expired.time	🔄 Dynamic	86400	Expiration time for cached execution-to-resource relations.
spark.gluten.expression.blacklist	🔄 Dynamic	<undefined>	A blacklist of expressions that should not be transformed. Separate multiple values with commas.
spark.gluten.loadLibFromJar	🔄 Dynamic	false	Whether to load shared libraries from jars.
spark.gluten.loadLibOS	🔄 Dynamic	<undefined>	The shared library loader's OS name.
spark.gluten.loadLibOSVersion	🔄 Dynamic	<undefined>	The shared library loader's OS version.
spark.gluten.memory.isolation	🔄 Dynamic	false	Enable isolated memory mode. If true, Gluten limits the maximum off-heap memory each task can use to X, where X = executor memory / max task slots. Setting this to true is recommended if Gluten serves concurrent queries within a single session, because not all memory allocated by Gluten is guaranteed to be spillable. In that case, enable the feature to avoid OOM.
spark.gluten.memory.overAcquiredMemoryRatio	🔄 Dynamic	0.3	If larger than 0, the Velox backend will try to over-acquire this ratio of the total allocated memory as a backup to avoid OOM.
spark.gluten.memory.reservationBlockSize	🔄 Dynamic	8MB	Block size the native reservation listener uses when reserving memory from Spark.
spark.gluten.numTaskSlotsPerExecutor	🔄 Dynamic	-1	A default value must be provided because non-execution operations (e.g. org.apache.spark.sql.Dataset#summary) do not propagate configurations using org.apache.spark.sql.execution.SQLExecution#withSQLConfPropagated.
spark.gluten.shuffleWriter.bufferSize	🔄 Dynamic	<undefined>
spark.gluten.soft-affinity.duplicateReading.maxCacheItems	🔄 Dynamic	10000	Maximum number of cache items used by Soft Affinity duplicate reading detection.
spark.gluten.soft-affinity.duplicateReadingDetect.enabled	🔄 Dynamic	false	If true, enable Soft Affinity duplicate reading detection.
spark.gluten.soft-affinity.enabled	🔄 Dynamic	false	Whether to enable Soft Affinity scheduling.
spark.gluten.soft-affinity.min.target-hosts	🔄 Dynamic	1	On HDFS, if target hosts already exist, prefer to schedule on the original target hosts.
spark.gluten.soft-affinity.replications.num	🔄 Dynamic	2	The number of replications per file used when scheduling to the target executors.
spark.gluten.sql.adaptive.costEvaluator.enabled	⚓ Static	true	If true, use org.apache.spark.sql.execution.adaptive.GlutenCostEvaluator as custom cost evaluator class, else follow the configuration spark.sql.adaptive.customCostEvaluatorClass.
spark.gluten.sql.ansiFallback.enabled	🔄 Dynamic	true	When true (default), Gluten will fall back to Spark when ANSI mode is enabled. When false, Gluten will attempt to execute in ANSI mode.
spark.gluten.sql.cacheWholeStageTransformerContext	🔄 Dynamic	false	When true, `WholeStageTransformer` will cache the `WholeStageTransformerContext` when executing. It is used to get substrait plan node and native plan string.
spark.gluten.sql.collapseGetJsonObject.enabled	🔄 Dynamic	false	Collapse nested get_json_object functions as one for optimization.
spark.gluten.sql.columnar.appendData	🔄 Dynamic	true	Enable or disable columnar v2 command append data.
spark.gluten.sql.columnar.arrowUdf	🔄 Dynamic	true	Enable or disable columnar arrow udf.
spark.gluten.sql.columnar.batchscan	🔄 Dynamic	true	Enable or disable columnar batchscan.
spark.gluten.sql.columnar.broadcastExchange	🔄 Dynamic	true	Enable or disable columnar broadcastExchange.
spark.gluten.sql.columnar.broadcastJoin	🔄 Dynamic	true	Enable or disable columnar broadcastJoin.
spark.gluten.sql.columnar.broadcastNestedLoopJoin.enabled	🔄 Dynamic	true	Enable or disable columnar broadcastNestedLoopJoin.
spark.gluten.sql.columnar.cartesianProduct.enabled	🔄 Dynamic	true	Enable or disable columnar cartesianProduct.
spark.gluten.sql.columnar.cast.avg	🔄 Dynamic	true
spark.gluten.sql.columnar.coalesce	🔄 Dynamic	true	Enable or disable columnar coalesce.
spark.gluten.sql.columnar.collectLimit	🔄 Dynamic	true	Enable or disable columnar collectLimit.
spark.gluten.sql.columnar.collectTail	🔄 Dynamic	true	Enable or disable columnar collectTail.
spark.gluten.sql.columnar.enableNestedColumnPruningInHiveTableScan	🔄 Dynamic	true	Enable or disable nested column pruning in hivetablescan.
spark.gluten.sql.columnar.enableVanillaVectorizedReaders	⚓ Static	true	Enable or disable vanilla vectorized scan.
spark.gluten.sql.columnar.executor.libpath	🔄 Dynamic		The gluten executor library path.
spark.gluten.sql.columnar.expand	🔄 Dynamic	true	Enable or disable columnar expand.
spark.gluten.sql.columnar.fallback.expressions.threshold	🔄 Dynamic	50	Fall back filter/project if the number of nested expressions reaches this threshold, since Spark codegen can perform better in such cases.
spark.gluten.sql.columnar.fallback.ignoreRowToColumnar	🔄 Dynamic	true	When true, the fallback policy ignores the RowToColumnar when counting fallback number.
spark.gluten.sql.columnar.fallback.preferColumnar	🔄 Dynamic	true	When true, the fallback policy prefers the Gluten plan over the vanilla Spark plan if both contain ColumnarToRow and the number of ColumnarToRow nodes in the vanilla Spark plan is not smaller than in the Gluten plan.
spark.gluten.sql.columnar.filescan	🔄 Dynamic	true	Enable or disable columnar filescan.
spark.gluten.sql.columnar.filter	🔄 Dynamic	true	Enable or disable columnar filter.
spark.gluten.sql.columnar.force.hashagg	🔄 Dynamic	true	Whether to force the use of Gluten's hash aggregate in place of vanilla Spark's sort aggregate.
spark.gluten.sql.columnar.forceShuffledHashJoin	🔄 Dynamic	true
spark.gluten.sql.columnar.generate	🔄 Dynamic	true
spark.gluten.sql.columnar.hashagg	🔄 Dynamic	true	Enable or disable columnar hashagg.
spark.gluten.sql.columnar.hivetablescan	🔄 Dynamic	true	Enable or disable columnar hivetablescan.
spark.gluten.sql.columnar.libname	🔄 Dynamic	gluten	The gluten library name.
spark.gluten.sql.columnar.libpath	🔄 Dynamic		The gluten library path.
spark.gluten.sql.columnar.limit	🔄 Dynamic	true
spark.gluten.sql.columnar.maxBatchSize	🔄 Dynamic	4096
spark.gluten.sql.columnar.overwriteByExpression	🔄 Dynamic	true	Enable or disable columnar v2 command overwrite by expression.
spark.gluten.sql.columnar.overwritePartitionsDynamic	🔄 Dynamic	true	Enable or disable columnar v2 command overwrite partitions dynamic.
spark.gluten.sql.columnar.parquet.write.blockSize	🔄 Dynamic	128MB
spark.gluten.sql.columnar.partial.generate	🔄 Dynamic	true	Evaluates the non-offload-able HiveUDTF using vanilla Spark generator
spark.gluten.sql.columnar.partial.project	🔄 Dynamic	true	Break up one project node into 2 phases when some of the expressions are non offload-able. Phase one is a regular offloaded project transformer that evaluates the offload-able expressions in native, phase two preserves the output from phase one and evaluates the remaining non-offload-able expressions using vanilla Spark projections
spark.gluten.sql.columnar.physicalJoinOptimizationLevel	🔄 Dynamic	12	Fall back to row operators if there are several consecutive joins.
spark.gluten.sql.columnar.physicalJoinOptimizationOutputSize	🔄 Dynamic	52	Fall back to row operators if there are several consecutive joins and the output size matches.
spark.gluten.sql.columnar.physicalJoinOptimizeEnable	🔄 Dynamic	false	Enable or disable columnar physicalJoinOptimize.
spark.gluten.sql.columnar.preferStreamingAggregate	🔄 Dynamic	true	The Velox backend supports `StreamingAggregate`. `StreamingAggregate` uses less memory because it does not need to hold all groups in memory, so it can avoid spilling. When true and the child output ordering satisfies the grouping key, Gluten chooses `StreamingAggregate` as the native operator.
spark.gluten.sql.columnar.project	🔄 Dynamic	true	Enable or disable columnar project.
spark.gluten.sql.columnar.project.collapse	🔄 Dynamic	true	Combines two columnar project operators into one and performs alias substitution.
spark.gluten.sql.columnar.query.fallback.threshold	🔄 Dynamic	-1	The threshold for deciding whether a query falls back, based on the number of ColumnarToRow and vanilla leaf nodes.
spark.gluten.sql.columnar.range	🔄 Dynamic	true	Enable or disable columnar range.
spark.gluten.sql.columnar.replaceData	🔄 Dynamic	true	Enable or disable columnar v2 command replace data.
spark.gluten.sql.columnar.scanOnly	🔄 Dynamic	false	When enabled, only scan and the filter after scan will be offloaded to native.
spark.gluten.sql.columnar.shuffle	🔄 Dynamic	true	Enable or disable columnar shuffle.
spark.gluten.sql.columnar.shuffle.celeborn.fallback.enabled	⚓ Static	true	If enabled, fall back to ColumnarShuffleManager when the Celeborn service is unavailable. Otherwise, throw an exception.
spark.gluten.sql.columnar.shuffle.celeborn.useRssSort	🔄 Dynamic	true	If true, use the RSS sort implementation for Celeborn sort-based shuffle. If false, use Gluten's row-based sort implementation. Only valid when `spark.celeborn.client.spark.shuffle.writer` is set to `sort`.
spark.gluten.sql.columnar.shuffle.codec	🔄 Dynamic	<undefined>	By default, the supported codecs are lz4 and zstd. When spark.gluten.sql.columnar.shuffle.codecBackend=qat, the supported codecs are gzip and zstd.
spark.gluten.sql.columnar.shuffle.codecBackend	🔄 Dynamic	<undefined>
spark.gluten.sql.columnar.shuffle.compression.threshold	🔄 Dynamic	100	If the number of rows in a batch falls below this threshold, all buffers are copied into one buffer before compression.
spark.gluten.sql.columnar.shuffle.dictionary.enabled	🔄 Dynamic	false	Enable dictionary in hash-based shuffle.
spark.gluten.sql.columnar.shuffle.merge.threshold	🔄 Dynamic	0.25
spark.gluten.sql.columnar.shuffle.partitionBufferEvictThreshold	🔄 Dynamic	-1	For Velox hash shuffle writer, evict partition buffers larger than this threshold after splitting an input batch. Use non-positive value to disable this feature.
spark.gluten.sql.columnar.shuffle.readerBufferSize	🔄 Dynamic	1MB	Buffer size in bytes for shuffle reader reading input stream from local or remote.
spark.gluten.sql.columnar.shuffle.realloc.threshold	🔄 Dynamic	0.25
spark.gluten.sql.columnar.shuffle.sort.columns.threshold	🔄 Dynamic	100000	The threshold to determine whether to use sort-based columnar shuffle. Sort-based shuffle will be used if the number of columns is greater than this threshold.
spark.gluten.sql.columnar.shuffle.sort.deserializerBufferSize	🔄 Dynamic	1MB	Buffer size in bytes for sort-based shuffle reader deserializing raw input to columnar batch.
spark.gluten.sql.columnar.shuffle.sort.partitions.threshold	🔄 Dynamic	4000	The threshold to determine whether to use sort-based columnar shuffle. Sort-based shuffle will be used if the number of partitions is greater than this threshold.
spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled	🔄 Dynamic	false	Enable type-aware compression (e.g. FFor for 64-bit integers) in shuffle. Not compatible with dictionary encoding; if both are enabled, type-aware compression is automatically disabled.
spark.gluten.sql.columnar.shuffledHashJoin	🔄 Dynamic	true	Enable or disable columnar shuffledHashJoin.
spark.gluten.sql.columnar.shuffledHashJoin.optimizeBuildSide	🔄 Dynamic	true	Whether to allow Gluten to choose an optimal build side for shuffled hash join.
spark.gluten.sql.columnar.smallFileThreshold	🔄 Dynamic	0.5	The total size threshold of small files in table scan. To avoid small files being placed into the same partition, Gluten will try to distribute small files into different partitions when the total size of small files is below this threshold.
spark.gluten.sql.columnar.sort	🔄 Dynamic	true	Enable or disable columnar sort.
spark.gluten.sql.columnar.sortMergeJoin	🔄 Dynamic	true	Enable or disable columnar sortMergeJoin. This should be set with preferSortMergeJoin=false.
spark.gluten.sql.columnar.tableCache	⚓ Static	true	Enable or disable columnar table cache.
spark.gluten.sql.columnar.tableCache.partitionStats.enabled	🔄 Dynamic	false	When true, the Velox columnar cache serializer computes per-partition min/max/null/row-count stats and embeds them in the cached payload so that the Spark optimizer can prune whole partitions on equality / range predicates. When false (default) the serializer writes the legacy raw payload with no stats, and partition pruning is disabled. Default is off until cross-workload benchmarks confirm zero regression on non-pruning queries.
spark.gluten.sql.columnar.takeOrderedAndProject	🔄 Dynamic	true
spark.gluten.sql.columnar.union	🔄 Dynamic	true	Enable or disable columnar union.
spark.gluten.sql.columnar.wholeStage.fallback.threshold	🔄 Dynamic	-1	The threshold for deciding whether a whole stage falls back when AQE is supported, based on the number of ColumnarToRow and vanilla leaf nodes.
spark.gluten.sql.columnar.window	🔄 Dynamic	true	Enable or disable columnar window.
spark.gluten.sql.columnar.window.group.limit	🔄 Dynamic	true	Enable or disable columnar window group limit.
spark.gluten.sql.columnar.writeToDataSourceV2	🔄 Dynamic	true	Enable or disable columnar v2 command write to data source v2.
spark.gluten.sql.columnarSampleEnabled	🔄 Dynamic	false	Disable or enable columnar sample.
spark.gluten.sql.columnarToRowMemoryThreshold	🔄 Dynamic	64MB
spark.gluten.sql.countDistinctWithoutExpand	🔄 Dynamic	false	Convert Count Distinct to a UDAF called count_distinct to prevent SparkPlanner from converting it to Expand+Count. WARNING: When enabled, count distinct queries will fail to fall back.
spark.gluten.sql.extendedColumnPruning.enabled	🔄 Dynamic	true	Do extended nested column pruning for cases ignored by vanilla Spark.
spark.gluten.sql.fallbackRegexpExpressions	🔄 Dynamic	false	If true, fall back all regexp expressions. There are a few incompatible cases between RE2 (used by the native engine) and java.util.regex (used by Spark). Users should enable this property if the incompatibility is intolerable.
spark.gluten.sql.fallbackUnexpectedMetadataParquet	🔄 Dynamic	false	If enabled, Gluten will not offload scan when unexpected metadata is detected.
spark.gluten.sql.fallbackUnexpectedMetadataParquet.limit	🔄 Dynamic	10	If supplied, metadata of `limit` number of Parquet files will be checked to determine whether to fall back to java scan.
spark.gluten.sql.fallbackUnexpectedMetadataParquet.samplePercentage	🔄 Dynamic	0.1	The percentage of root paths to sample for metadata validation when the number of root paths is large. Value range is (0, 1.0]. 1.0 means check all paths (no sampling). A smaller value reduces validation cost for tables with many partitions.
spark.gluten.sql.injectNativePlanStringToExplain	🔄 Dynamic	false	When true, Gluten will inject native plan tree to Spark's explain output.
spark.gluten.sql.mergeTwoPhasesAggregate.enabled	🔄 Dynamic	true	Whether to merge two phases aggregate if there are no other operators between them.
spark.gluten.sql.native.bloomFilter	🔄 Dynamic	true
spark.gluten.sql.native.hive.writer.enabled	🔄 Dynamic	true	Specifies whether to enable the native columnar writer for HiveFileFormat. Currently only HiveFileFormat with Parquet as the output file type is supported.
spark.gluten.sql.native.hyperLogLog.Aggregate	🔄 Dynamic	true
spark.gluten.sql.native.parquet.write.blockRows	🔄 Dynamic	100000000
spark.gluten.sql.native.union	🔄 Dynamic	false	Enable or disable native union where computation is completely offloaded to backend.
spark.gluten.sql.native.writeColumnMetadataExclusionList	🔄 Dynamic	comment	Native file writing does not support column metadata. Metadata keys in this list are removed so that native file writing can be used. Separate multiple values with commas.
spark.gluten.sql.native.writer.enabled	🔄 Dynamic	<undefined>	Specifies whether to enable the native columnar Parquet/ORC writer.
spark.gluten.sql.orc.charType.scan.fallback.enabled	🔄 Dynamic	true	Force fallback for orc char type scan.
spark.gluten.sql.pushAggregateThroughJoin.enabled	🔄 Dynamic	false	Enables the push-aggregate-through-join optimization in Gluten. When enabled, aggregate operators may be pushed below joins during logical optimization and corresponding physical plans may be rewritten to execute the aggregation earlier.
spark.gluten.sql.pushAggregateThroughJoin.maxDepth	🔄 Dynamic	2147483647	Maximum join traversal depth when applying the push-aggregate-through-join optimization. A value of 1 allows pushing an aggregate through a single join; larger values allow the rule to traverse and push through multiple consecutive joins.
spark.gluten.sql.removeNativeWriteFilesSortAndProject	🔄 Dynamic	true	When true, Gluten will remove the vanilla Spark V1Writes added sort and project for velox backend.
spark.gluten.sql.rewrite.dateTimestampComparison	🔄 Dynamic	true	Rewrite the comparison between date and timestamp to a timestamp comparison. For example, `from_unixtime(ts) > date` will be rewritten to `ts > to_unixtime(date)`.
spark.gluten.sql.scan.detailedMetrics.enabled	🔄 Dynamic	true	When true (default), Velox backend scan operators register all detailed SQL metrics. When false, only essential scan metrics are registered to reduce driver memory usage. Also enabled automatically when spark.gluten.sql.debug is true. Does not affect the ClickHouse backend.
spark.gluten.sql.scan.fileSchemeValidation.enabled	🔄 Dynamic	true	When true, enable file path scheme validation for scan. Validation fails if the file scheme is not supported by the registered file systems, which causes the scan operator to fall back.
spark.gluten.sql.supported.flattenNestedFunctions	🔄 Dynamic	and,or	Flatten nested functions as one for optimization.
spark.gluten.sql.text.input.empty.as.default	🔄 Dynamic	false	Treat empty fields in CSV input as default values.
spark.gluten.sql.text.input.max.block.size	🔄 Dynamic	8KB	The maximum block size for text input rows.
spark.gluten.sql.validation.printStackOnFailure	🔄 Dynamic	false
spark.gluten.storage.hdfsViewfs.enabled	⚓ Static	false	If enabled, Gluten converts the viewfs path to an hdfs path on the Scala side.
spark.gluten.supported.hive.udfs	🔄 Dynamic		Supported hive udf names.
spark.gluten.supported.python.udfs	🔄 Dynamic		Supported python udf names.
spark.gluten.supported.scala.udfs	🔄 Dynamic		Supported scala udf names.
spark.gluten.ui.enabled	⚓ Static	true	Whether to enable the Gluten web UI. If true, attach the Gluten UI page to the Spark web UI.
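
Two of the rules described in the table above are simple enough to sketch as arithmetic: the sort-based shuffle thresholds (`spark.gluten.sql.columnar.shuffle.sort.columns.threshold` and `spark.gluten.sql.columnar.shuffle.sort.partitions.threshold`) and the per-task cap of isolated memory mode (`spark.gluten.memory.isolation`). The function names below are hypothetical, and combining the two thresholds with a logical OR is an assumption read from the descriptions, not a statement about Gluten's actual implementation, which may consider other factors.

```python
# Illustrative sketch only; function names are hypothetical and the
# logic is inferred from the table descriptions, not from Gluten's code.

def uses_sort_based_shuffle(num_columns, num_partitions,
                            columns_threshold=100000,
                            partitions_threshold=4000):
    """Assumed rule: sort-based shuffle is chosen when either the column
    count or the partition count is greater than its threshold."""
    return (num_columns > columns_threshold
            or num_partitions > partitions_threshold)


def per_task_memory_cap(executor_memory_bytes, max_task_slots):
    """X = executor memory / max task slots, as stated for isolated mode.
    Integer division is used here to keep the result in whole bytes."""
    if max_task_slots <= 0:
        raise ValueError("max_task_slots must be positive")
    return executor_memory_bytes // max_task_slots


if __name__ == "__main__":
    print(uses_sort_based_shuffle(num_columns=50, num_partitions=5000))
    print(per_task_memory_cap(30 * 1024**3, max_task_slots=8))
```

With the default thresholds, a shuffle with exactly 4000 partitions does not cross the partition threshold in this sketch, because the description says "greater than".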

Gluten experimental configurations

Key	Modifiability	Default	Description
spark.gluten.auto.adjustStageResource.enabled	🔄 Dynamic	false	Experimental: If enabled, Gluten will try to set the stage resources according to the stage execution plan. Only works when AQE is also enabled.
spark.gluten.auto.adjustStageResources.fallenNode.ratio.threshold	🔄 Dynamic	0.5	Experimental: Increase executor heap memory when the ratio of fallen-back nodes to total nodes in a stage exceeds this threshold.
spark.gluten.auto.adjustStageResources.heap.ratio	🔄 Dynamic	2.0	Experimental: Increase executor heap memory when the adjust stage resource rule matches.
spark.gluten.auto.adjustStageResources.offheap.ratio	🔄 Dynamic	0.5	Experimental: Decrease executor off-heap memory when the adjust stage resource rule matches.
spark.gluten.memory.dynamic.offHeap.sizing.enabled	⚓ Static	false	Experimental: When set to true, the off-heap config (spark.memory.offHeap.size) is ignored; instead, on-heap and off-heap memory are considered in combination, both counting towards the executor memory config (spark.executor.memory). JVM APIs are used to determine how much on-heap memory is in use, alongside tracking of the off-heap allocations made by Gluten. A total memory quota is then enforced, calculated as the sum of the memory that is committed and in use in the Java heap. Because the total quota is calculated when off-heap allocations happen rather than when JVM heap memory is allocated, memory can be oversubscribed. Note that this feature is experimental and may have performance implications.
spark.gluten.memory.dynamic.offHeap.sizing.memory.fraction	⚓ Static	0.6	Experimental: Determines the memory fraction used to determine the total memory available for offheap and onheap allocations when the dynamic offheap sizing feature is enabled. The default is set to match spark.executor.memoryFraction.
spark.gluten.sql.columnar.cudf	🔄 Dynamic	false	Enable or disable cudf support. This is an experimental feature.
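
The stage resource adjustment settings in the table above can be read as a small decision rule. The sketch below assumes that the heap and off-heap ratios are multiplicative factors applied to the current sizes and that "exceeds" means strictly greater than the threshold; both are assumptions made for illustration, and the function name `adjusted_stage_memory` is hypothetical rather than part of Gluten.

```python
# Illustrative sketch only. ASSUMPTIONS: the ratios are multipliers on the
# current sizes, and the rule fires when fallen / total is strictly greater
# than the threshold. Neither is confirmed by the table itself.

def adjusted_stage_memory(heap_mb, off_heap_mb, fallen_nodes, total_nodes,
                          ratio_threshold=0.5, heap_ratio=2.0,
                          off_heap_ratio=0.5):
    """Return (heap_mb, off_heap_mb) after applying the assumed rule."""
    if total_nodes <= 0:
        # Nothing to evaluate; leave the sizes untouched.
        return heap_mb, off_heap_mb
    if fallen_nodes / total_nodes > ratio_threshold:
        # Many nodes fell back to vanilla Spark: more heap, less off-heap.
        return int(heap_mb * heap_ratio), int(off_heap_mb * off_heap_ratio)
    return heap_mb, off_heap_mb


if __name__ == "__main__":
    print(adjusted_stage_memory(4096, 8192, fallen_nodes=6, total_nodes=10))
```

The defaults in the signature mirror the default values listed for the three `spark.gluten.auto.adjustStageResources.*` keys.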
