[SPARK-49821][SS][PYTHON] Implement MapState and TTL support for TransformWithStateInPandas #48290
bogao007 wants to merge 25 commits into apache:master
Conversation
… interfere with each other
    iterator = self._stateful_processor_api_client._read_arrow_state()
    batch = next(iterator)
    pandas_df = batch.to_pandas()
    data_batch = None
A bit confused here: this code snippet is trying to deal with multiple batches but only keeps the data from the last batch?
The previous code would get stuck forever after we added the Arrow resource cleanup logic (I think it might be related to the previous logic not exhausting the iterator, even though that iterator only contained a single batch), hence I'm using the recommended way to consume the Arrow batches, which is:

    for batch in iterator:
        ...

The logic is the same as the previous one; we only need to consume a single batch here.
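The consumption pattern under discussion can be sketched without any Arrow dependency (the function name is hypothetical, and plain lists stand in for Arrow record batches): exhaust the iterator so that any cleanup tied to iterator exhaustion runs, but keep only the first batch.

```python
def read_single_batch_exhausting(iterator):
    """Keep only the first batch, but iterate to the end so any
    producer-side cleanup triggered by exhaustion still runs."""
    first = None
    for batch in iterator:
        if first is None:
            first = batch
    return first
```

In the PR itself the iterator is expected to contain exactly one batch, so which batch is kept makes no practical difference; exhausting the iterator is what matters for the cleanup.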
Can we have a short comment here to explain the iteration, even though we are not expecting multiple batches?
So we do iterate over (consume) all batches if there are more than one, but only take the first batch? Please leave a code comment as @jingz-db suggested, since it confuses people into thinking it might be a bug when looking at the code.
Also, when the iterator has multiple batches, how is it safe to ignore the remaining ones and take only the first?
I'd say I want to see this fix in a separate PR, with a relevant test which fails on the master branch and passes with the fix. Let's scope the PR properly; this PR is aiming to add MapState with TTL.
It's OK to have the fix in here, as the fix applies to the MapState impl as well.
sql/core/src/main/java/org/apache/spark/sql/execution/streaming/StateMessage.proto
jingz-db left a comment
One small nit and LGTM!
First pass. Mostly minor comments and nits. Probably needs some work to sort out/incorporate with #47878 once it is merged.
Overall, we'll probably need to set aside some time in the near future to revisit and refactor the code. I see a non-trivial amount of redundant code in the files in the TWS Python implementation now.
...main/scala/org/apache/spark/sql/execution/python/TransformWithStateInPandasStateServer.scala
        }
      }

    private def sendIteratorAsArrowBatches[T](
Yeah, if the timer PR is merged first, I can rebase and apply the update. Otherwise, let's just keep the current implementation; we can do the rebase in the timer PR.
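As a rough illustration of what a helper like `sendIteratorAsArrowBatches` does, here is a hedged sketch in Python rather than the server's Scala (the function name, chunk size, and sender callback are hypothetical, not the actual server API): drain a state iterator in fixed-size chunks, where each chunk models one Arrow record batch handed to the transport layer.

```python
from itertools import islice

def send_iterator_as_batches(rows, batch_size, send_batch):
    # Ensure we have a single shared iterator, then drain it in
    # fixed-size chunks; each chunk models one Arrow record batch.
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            break
        send_batch(chunk)
```

The point the reviewers return to later is that the consumer must drain this stream fully so the producer side can release its resources.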
    for s in iter:
        ttl_list_state_count += s[0]
    if self.ttl_map_state.exists():
        ttl_map_state_count = self.ttl_map_state.get_value(key)[0]
Just to double confirm: here the key does not necessarily need to be the same as the grouping key, right? It's just to simplify the test.
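To make that distinction concrete, here is a tiny dict-backed stand-in for a per-grouping-key map state (the class and its methods are purely illustrative, not the real PySpark MapState API): each grouping key owns an independent map, and the user-supplied map key need not equal the grouping key.

```python
class FakeMapState:
    """Illustrative stand-in: one independent map per grouping key;
    the map's user key is unrelated to the grouping key."""

    def __init__(self):
        self._per_group = {}

    def _map_for(self, grouping_key):
        return self._per_group.setdefault(grouping_key, {})

    def exists(self, grouping_key):
        # True once any entry has been written for this grouping key.
        return bool(self._per_group.get(grouping_key))

    def update_value(self, grouping_key, user_key, value):
        self._map_for(grouping_key)[user_key] = value

    def get_value(self, grouping_key, user_key):
        return self._map_for(grouping_key)[user_key]
```

In the test above, `key` plays the role of the arbitrary user key within whatever grouping key the processor is currently handling.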
HeartSaVioR left a comment
+1 pending CI
Let's rebase #47878, as we addressed the review comments here earlier.
https://github.com/bogao007/spark/actions/runs/11433439704/job/31805443260

Thanks! Merging to master.
[SPARK-49821][SS][PYTHON] Implement MapState and TTL support for TransformWithStateInPandas

### What changes were proposed in this pull request?
- Implement MapState and TTL support for TransformWithStateInPandas
- Fixed an issue to properly close/clean up resources after Arrow batch writes are completed in `TransformWithStateInPandasStateServer`. Since we use the same Arrow batch write logic for both listState and mapState, this fix also applies to listState.

### Why are the changes needed?
Bring parity to Scala on supported state variables.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new unit test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48290 from bogao007/map-state.

Authored-by: bogao007 <bo.gao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>