Newest 'spark' Questions

-1 votes

0 answers

50 views

Spark SQL MERGE/INSERT on Iceberg Recomputes Upstream Join Instead of Reusing Cached DataFrame (MEMORY_AND_DISK)

Spark SQL + Iceberg: MERGE and INSERT appear to ignore cached DataFrame and re-scan source. I am trying to optimize an SCD2 flow in Spark SQL (Python API) using a cached intermediate DataFrame. ...

fabrik5k

1

asked Mar 31 at 18:43

8 votes

0 answers

471 views

Upgraded to IntelliJ IDEA 2026.1, Gradle fails to sync or build

class org.jetbrains.plugins.gradle.tooling.serialization.internal.adapter.InternalIdeaModule cannot be cast to class org.gradle.tooling.model.ProjectModel (org.jetbrains.plugins.gradle.tooling....

Musa Baloyi

713

asked Mar 26 at 22:05

1 vote

1 answer

57 views

Spark 4.0 MemoryStream was moved or changed?

I tried to upgrade my project from Spark 3.5 to Spark 4.0. In the process, I ran into this issue in our unit tests. error: cannot find symbol import org.apache.spark.sql.execution.streaming....

Tarek Eid

27

asked Mar 26 at 16:37

3 votes

1 answer

76 views

How to fix a cast invalid input error in Spark?

I'm trying to create & display a spark dataframe in Databricks. This is what my code looks like: df = spark.sql(f''' SELECT A.CUSTOMER_ID FROM TABLE_1 A INNER JOIN ...

SRJCoding

543

asked Mar 24 at 12:39

Advice

1 vote

2 replies

39 views

What is Spark doing when one record is bigger than partition size?

I have a project where I ingest large JSON files (100-300 MB per file), where each file is one record. I had issues with processing them until I increased spark.sql.files.maxPartitionBytes. I received Out of ...

sageroe42

1

asked Mar 24 at 7:06

-2 votes

0 answers

60 views

PySpark script hangs after job completion: ThreadPoolExecutor + Py4J daemon threads never terminate

Environment: Spark 3.3.2 (Cloudera parcel SPARK3-3.3.2.3.3.7191000.0-78-1.p0.56279928), Python 3.10, Py4J 0.10.9.5, deployment on YARN, OS Linux. Problem: I have a PySpark script that uses concurrent....

NoName_acc

11

asked Mar 22 at 20:15

Advice

0 votes

1 reply

65 views

How to perform asynchronous LLM inference on Kafka streams using Apache Spark, and handle high-throughput RAG ingestion?

I’m working on a streaming pipeline where data is coming from a Kafka topic, and I want to integrate LLM-based processing and RAG ingestion. I’m running into architectural challenges around latency ...

Arpan

993

asked Mar 20 at 19:41

2 votes

1 answer

59 views

Spark JSON infer schema

I have a JSON file like this: { "id": 1, "str": "a string", "d": "1996-11-20" } I want Spark (version 4.0.1) to infer the schema and make column d a ...

hage

6,243

asked Mar 20 at 8:11

0 votes

1 answer

28 views

How to change the spark-submit command in the IntelliJ Spark plugin

In IntelliJ I am trying to set up the Spark plugin. On my host I execute my code using /<spark-home>/bin/spark3-submit ....... When I set up the Spark plugin in IntelliJ, the generated command looks ...

Ravi Kumar

994

asked Mar 19 at 11:39

0 votes

1 answer

69 views

Java Spark mapPartitions retry causing duplicate inserts into BigQuery on task failure

I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches. The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...

Aakash Shrivastav

1

asked Mar 11 at 10:08

0 votes

1 answer

54 views

Spark 4.1.1 on AKS + Cosmos DB Cassandra API: ClassNotFound without connector, ClosedConnectionException with spark-cassandra-connector_2.13-3.5.1

We are upgrading a Spark job running on AKS (Kubernetes) from Spark 3.5.3 to Spark 4.1.1. Current working setup (Spark 3.5.3): Connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3....

akshay kadam

1

asked Mar 9 at 18:07

Advice

0 votes

2 replies

62 views

Does Spark Catalyst Optimize Across Actions?

Given the following scenario: DataFrames A, B, and C. B and C are retrieved from storage, transformed, and joined to A, producing A*. Then a filter is applied to A* and written to one location, another filter ...

Sergei I.

76

asked Mar 8 at 21:05

Best practices

0 votes

4 replies

114 views

Recomputation of common stages across multiple actionless branches in Spark

My team has been working on a Spark process that reads one of many tables and runs divergent steps on subsets of that table (after some preprocessing). The common trunk looks like this: Read & ...

xandor19

57

asked Mar 4 at 11:56

Advice

2 votes

1 reply

89 views

How to decide number of partitions using repartition vs coalesce in Apache Spark for optimization?

I am trying to understand how to properly use repartition and coalesce in Apache Spark, especially for performance optimization. From my understanding: repartition can increase or ...

Test User123

1

asked Mar 2 at 8:46

Advice

0 votes

7 replies

103 views

Java heap space error during a long dataset processing loop

I have a long-running dataset processing loop. On each iteration I map the dataset rows (mapPartitions), then calculate some statistics (foreachPartition), until a certain condition is met. It looks ...

user32405588

asked Feb 23 at 17:25
