All Questions
Tagged with spark or apache-spark
82,565 questions
-1
votes
0
answers
50
views
Spark SQL MERGE/INSERT on Iceberg Recomputes Upstream Join Instead of Reusing Cached DataFrame (MEMORY_AND_DISK)
Spark SQL + Iceberg: MERGE and INSERT appear to ignore cached DataFrame and re-scan source
I am trying to optimize an SCD2 flow in Spark SQL (Python API) using a cached intermediate DataFrame.
...
8
votes
0
answers
471
views
Upgraded to IntelliJ IDEA 2026.1, Gradle fails to sync or build
class org.jetbrains.plugins.gradle.tooling.serialization.internal.adapter.InternalIdeaModule cannot be cast to class org.gradle.tooling.model.ProjectModel (org.jetbrains.plugins.gradle.tooling....
1
vote
1
answer
57
views
Spark 4.0 MemoryStream was moved or changed?
I tried to upgrade my project from Spark 3.5 to Spark 4.0. In the process, I ran into this issue in our unit tests.
error: cannot find symbol
import org.apache.spark.sql.execution.streaming....
3
votes
1
answer
76
views
How to fix a cast invalid input error in Spark?
I'm trying to create & display a spark dataframe in Databricks. This is what my code looks like:
df = spark.sql(f'''
SELECT A.CUSTOMER_ID
FROM TABLE_1 A
INNER JOIN
...
Advice
1
vote
2
replies
39
views
What does Spark do when one record is larger than the partition size?
I have a project where I ingest large JSON files (100-300 MB each), where each file is a single record. I had issues processing them until I increased spark.sql.files.maxPartitionBytes. I received Out of ...
-2
votes
0
answers
60
views
PySpark script hangs after job completion: ThreadPoolExecutor + Py4J daemon threads never terminate
Environment
Spark: 3.3.2 (Cloudera parcel SPARK3-3.3.2.3.3.7191000.0-78-1.p0.56279928)
Python: 3.10
Py4J: 0.10.9.5
Deployment: YARN
OS: Linux
Problem
I have a PySpark script that uses concurrent....
Advice
0
votes
1
replies
65
views
How to perform asynchronous LLM inference on Kafka streams using Apache Spark, and handle high-throughput RAG ingestion?
I’m working on a streaming pipeline where data is coming from a Kafka topic, and I want to integrate LLM-based processing and RAG ingestion. I’m running into architectural challenges around latency ...
2
votes
1
answer
59
views
Spark JSON infer schema
I have a JSON file like this:
{ "id": 1, "str": "a string", "d": "1996-11-20" }
I want Spark (version 4.0.1) to infer the schema and make column d a ...
0
votes
1
answer
28
views
How to change the spark-submit command in the IntelliJ Spark plugin
In IntelliJ I am trying to set up the Spark plugin.
On my host I execute my code using
/<spark-home>/bin/spark3-submit .......
When I set up the Spark plugin in IntelliJ, the generated command looks ...
0
votes
1
answer
69
views
Java Spark mapPartitions retry causing duplicate inserts into BigQuery on task failure
I have a Java Spark job on Dataproc that processes a dataset in partitions and inserts rows in batches.
The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
0
votes
1
answer
54
views
Spark 4.1.1 on AKS + Cosmos DB Cassandra API: ClassNotFound without connector, ClosedConnectionException with spark-cassandra-connector_2.13-3.5.1
We are upgrading a Spark job running on AKS (Kubernetes) from Spark 3.5.3 to Spark 4.1.1.
Current working setup (Spark 3.5.3):
Connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3....
Advice
0
votes
2
replies
62
views
Does Spark Catalyst Optimize Across Actions?
Given the following scenario: DataFrames A, B, and C. B and C are read from storage, transformed, and joined to A to produce A*. Then a filter is applied to A* and written to one location, another filter ...
Best practices
0
votes
4
replies
114
views
Recomputation of common stages across multiple actionless branches in Spark
My team has been working on a Spark process that reads one of many tables and, after some preprocessing, performs divergent steps on subsets of that table. The common trunk looks like this:
Read & ...
Advice
2
votes
1
replies
89
views
How to decide number of partitions using repartition vs coalesce in Apache Spark for optimization?
I am trying to understand how to properly use repartition and coalesce in Apache Spark, especially for performance optimization.
From my understanding:
repartition can increase or ...
Advice
0
votes
7
replies
103
views
Java heap space error during a long dataset-processing loop
I have a long-running loop that processes a dataset. On each iteration I map the dataset rows (mapPartitions), then calculate some statistics (foreachPartition), until a certain condition is met.
It looks ...