
All Questions

-1 votes
0 answers
50 views

Spark SQL + Iceberg: MERGE and INSERT appear to ignore cached DataFrame and re-scan source
I am trying to optimize an SCD2 flow in Spark SQL (Python API) using a cached intermediate DataFrame. ...
fabrik5k
8 votes
0 answers
471 views

class org.jetbrains.plugins.gradle.tooling.serialization.internal.adapter.InternalIdeaModule cannot be cast to class org.gradle.tooling.model.ProjectModel (org.jetbrains.plugins.gradle.tooling....
Musa Baloyi
1 vote
1 answer
57 views

I tried to upgrade my project from Spark 3.5 to Spark 4.0. In the process, I ran into this issue in our unit tests. error: cannot find symbol import org.apache.spark.sql.execution.streaming....
Tarek Eid
3 votes
1 answer
76 views

I'm trying to create & display a Spark DataFrame in Databricks. This is what my code looks like: df = spark.sql(f''' SELECT A.CUSTOMER_ID FROM TABLE_1 A INNER JOIN ...
SRJCoding • 543
Advice
1 vote
2 replies
39 views

I have a project where I ingest large jsons (100-300 MB per file) where one json is one record. I had issues with processing them until I increased spark.sql.files.maxPartitionBytes. I received Out of ...
sageroe42
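For reference, the setting mentioned in this question is usually raised in spark-defaults.conf or via --conf on spark-submit. The value below is purely illustrative (sized for ~300 MB single-record files, not a recommendation), and files where one JSON document spans the whole file also need the reader's multiLine option (spark.read.option("multiLine", "true").json(...)):

```
# spark-defaults.conf (illustrative value, assuming ~100-300 MB single-record JSON files)
spark.sql.files.maxPartitionBytes   512m
```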
-2 votes
0 answers
60 views

Environment Spark: 3.3.2 (Cloudera parcel SPARK3-3.3.2.3.3.7191000.0-78-1.p0.56279928) Python: 3.10 Py4J: 0.10.9.5 Deployment: YARN OS: Linux Problem I have a PySpark script that uses concurrent....
NoName_acc
Advice
0 votes
1 reply
65 views

I’m working on a streaming pipeline where data is coming from a Kafka topic, and I want to integrate LLM-based processing and RAG ingestion. I’m running into architectural challenges around latency ...
Arpan • 993
2 votes
1 answer
59 views

I have a JSON file like this: { "id": 1, "str": "a string", "d": "1996-11-20" } I want Spark (version 4.0.1) to infer the schema and make column d a ...
hage • 6,243
0 votes
1 answer
28 views

In IntelliJ I am trying to set up my Spark plugin. On my host I execute my code using /<spark-home>/bin/spark3-submit ....... When I set up the Spark plugin in my IntelliJ the command generated looks ...
Ravi Kumar
0 votes
1 answer
69 views

I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches. The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
Aakash Shrivastav
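The batching pattern this question describes can be sketched in plain Python, with no Spark required. iter_batches is a hypothetical helper name; in the real job, the iterator it consumes would be the one mapPartitions hands to your function, and each yielded batch would become one multi-row INSERT:

```python
from itertools import islice

def iter_batches(rows, batch_size=100):
    """Yield lists of up to batch_size items from an iterator,
    without materializing the whole partition in memory."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Inside rdd.mapPartitions(process), the `rows` argument would be the
# partition iterator; each batch maps to one batched database insert.
```

Because islice pulls lazily from the underlying iterator, only one batch is held in memory at a time, which is the point of using mapPartitions here.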
0 votes
1 answer
54 views

We are upgrading a Spark job running on AKS (Kubernetes) from Spark 3.5.3 to Spark 4.1.1. Current working setup (Spark 3.5.3): Connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3....
akshay kadam
Advice
0 votes
2 replies
62 views

Given the following scenario: DataFrame A, B, and C. B and C are retrieved from storage and operated on and joined to A = A*. Then a filter is applied to A* and written to one location, another filter ...
Sergei I.
Best practices
0 votes
4 replies
114 views

My team has been working on a Spark process that reads one (of many) tables and performs divergent steps with subsets of that table (after some preprocessing). The common trunk looks like this: Read & ...
xandor19
Advice
2 votes
1 reply
89 views

How to decide
I am trying to understand how to properly use repartition and coalesce in Apache Spark, especially for performance optimization. From my understanding: repartition can increase or ...
Test User123
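A toy, pure-Python model of the distinction this question asks about (this is not Spark's implementation, just an illustration): coalesce only merges existing partitions and moves no data through a shuffle, so it can only reduce the partition count and may leave sizes uneven, while repartition redistributes every element through a full shuffle and can increase the count or rebalance skew:

```python
def coalesce(partitions, n):
    """Toy model of coalesce: fold existing partitions into n buckets.
    No element leaves its original partition grouping (no shuffle),
    so the result can be skewed."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    """Toy model of repartition: every element is redistributed
    individually (a full shuffle), giving evenly sized partitions
    at the cost of moving all the data."""
    out = [[] for _ in range(n)]
    flat = [x for part in partitions for x in part]
    for i, x in enumerate(flat):
        out[i % n].append(x)
    return out
```

The usual rule of thumb follows from this: use coalesce when shrinking the partition count cheaply is acceptable despite possible skew, and repartition when you need more partitions or an even redistribution.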
Advice
0 votes
7 replies
103 views

I have a long dataset-handling cycle. In this cycle, on each iteration I map dataset rows (mapPartitions), then calculate some statistics (foreachPartition), until a certain condition is met. It looks ...
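A driver-side toy model of such an iterate-until-converged loop, with made-up numbers (the transform and the convergence threshold are hypothetical). The usual Spark caveat for this shape is that every iteration extends the lineage of the dataset, so long cycles typically need a periodic checkpoint()/localCheckpoint() or a write of intermediate results to keep plans from growing unboundedly:

```python
import statistics

def step(rows):
    """One iteration's transform (stand-in for the mapPartitions pass)."""
    return [r * 0.9 for r in rows]

rows = [10.0, 20.0, 30.0]
iteration = 0
while True:
    rows = step(rows)
    iteration += 1
    # Stand-in for the foreachPartition statistics pass.
    mean = statistics.mean(rows)
    # Hypothetical stopping condition, with an iteration cap as a safety net.
    if mean < 1.0 or iteration >= 100:
        break
```

In real Spark code the per-iteration statistics would come from an action (e.g., an aggregate), and that action is also a natural point to truncate lineage.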
