[SPARK-49513][SS] Add Support for timer in transformWithStateInPandas API #47878
jingz-db wants to merge 17 commits into apache:master
Conversation
```python
batch_timestamp = statefulProcessorApiClient.get_batch_timestamp()
watermark_timestamp = statefulProcessorApiClient.get_watermark_timestamp()
```
Can we move these 2 API calls inside the if-else clause below and only call them with a supported time mode?

We will need some values to initialize the TimerValues in handleInputRows. On the Scala side, we always pass the real timestamp into TimerValues even if no timer is defined: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala#L250
It is unlikely that users will call TimerValues if they do not have a timer registered, but my original intention was to align the behavior with the Scala side. I guess we need to decide between saving a call and aligning with the Scala side. I don't have a strong opinion on which is better - which approach do you prefer?

IIUC these 2 values are only used when the time mode is not none; I meant that for the none time mode, we don't need these 2 extra API calls since they aren't needed anyway.
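A minimal sketch of the guard being suggested - only fetch the timestamps for time modes that need them. The `StubClient` below is a stand-in for `statefulProcessorApiClient`, and the `-1` defaults are an assumption:

```python
class StubClient:
    # Stand-in for statefulProcessorApiClient with the two getters above.
    def get_batch_timestamp(self):
        return 1_000

    def get_watermark_timestamp(self):
        return 500

def fetch_timer_values(client, time_mode):
    # For the "none" time mode, skip the two extra API calls entirely
    # and fall back to sentinel values.
    if time_mode == "none":
        return (-1, -1)
    return (client.get_batch_timestamp(), client.get_watermark_timestamp())
```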
```python
if timeMode == "processingtime" and expiry_timestamp < batch_timestamp:
    result_iter_list.append(statefulProcessor.handleInputRows(
        (key_obj,), iter([]),
        TimerValues(batch_timestamp, watermark_timestamp),
```
is watermark_timestamp needed for the processingTime time mode, and vice versa?

Same as above, this is to align with the behavior on the Scala side.
```python
        ExpiredTimerInfo(True, expiry_timestamp)))

# TODO(SPARK-49603) set the handle state in the lazily initialized iterator
"""
```
If we have a TODO here, we can remove the commented code.
```python
if len(response_message[2]) == 0:
    return -1
# TODO: can we simply parse from utf8 string here?
timestamp = int(response_message[2])
```
Just curious: would this return the correct value?

Passing a row schema and using CPickleSerializer seems a bit heavyweight. Modified this to pass a byte buffer of exactly 8 bytes and read exactly 8 bytes on the Python client.
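The fixed-width encoding described here can be sketched with `struct`. Big-endian signed 64-bit matches a Java long; the exact byte order used in the PR is an assumption:

```python
import struct

def serialize_long(value: int) -> bytes:
    # Exactly 8 bytes, big-endian signed 64-bit (Java long layout).
    return struct.pack(">q", value)

def deserialize_long(buf: bytes) -> int:
    # Read exactly 8 bytes back into a Python int.
    assert len(buf) == 8, "expected exactly 8 bytes"
    return struct.unpack(">q", buf)[0]
```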
```python
    return []
elif status == 0:
    iterator = self._read_arrow_state()
    batch = next(iterator)
```
Do we expect all the timers to fit within a single arrow batch? If not, should we handle it properly here?

We don't. It now returns an iterator of lists. Does this API make sense to you?
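The iterator-of-lists shape being discussed could look roughly like this. Batches are mocked as plain lists here; the real code would read Arrow batches from the state server:

```python
def iter_timer_batches(batches):
    # Each Arrow batch becomes one list of (key, expiry_timestamp) tuples,
    # so the caller can consume an arbitrary number of batches lazily.
    for batch in batches:
        yield [tuple(row) for row in batch]

# Two mocked batches of registered timers.
mock_batches = [[("k1", 100), ("k2", 200)], [("k3", 300)]]
all_timers = [t for batch in iter_timer_batches(mock_batches) for t in batch]
```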
```scala
while (iter.hasNext) {
  val timestamp = iter.next()
  val internalRow = InternalRow(timestamp)
  arrowStreamWriter.writeRow(internalRow)
```
Same as the python side comment: here we don't limit how many arrow batches we construct for timers; if the user sets a fairly low value for arrowTransformWithStateInPandasMaxRecordsPerBatch, we would send multiple arrow batches and the client side needs to handle this properly as well.
Question: should we have a lower limit on how many records we send through a single batch (e.g. the default value 10000)? IIUC, each timer record is very small and should not consume a lot of memory. The user also doesn't care how many records each batch contains, since they always get a single list from this API.

I guess I'll rebase on your ListState PR change, and this arrowTransformWithStateInPandasMaxRecordsPerBatch will be passed as the new config you'll add here: https://github.com/apache/spark/pull/47933/files#diff-0b0aaf91850194b6980b75d47bc166148566cbdc1b17b3da16faff1f0740e0f4R107.
But your concern above still holds. Should we pass a different default value for transmitting the list[Int] here? If so, should we add a new config, or shall we just assign a fixed value for it?
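The batching concern can be illustrated with a simple chunking sketch (the config name is from the discussion above; the chunk size here is arbitrary):

```python
def chunk_timers(timestamps, max_records_per_batch=10000):
    # The server writes one Arrow batch per chunk; a small configured value
    # for arrowTransformWithStateInPandasMaxRecordsPerBatch means the client
    # must be prepared to read several batches for one timer request.
    for i in range(0, len(timestamps), max_records_per_batch):
        yield timestamps[i:i + max_records_per_batch]

batches = list(chunk_timers(list(range(25)), max_records_per_batch=10))
```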
```scala
val allocator = ArrowUtils.rootAllocator.newChildAllocator(
  s"stdout writer for transformWithStateInPandas state socket", 0, Long.MaxValue)
val root = VectorSchemaRoot.create(arrowSchema, allocator)
new BaseStreamingArrowWriter(root, new ArrowStreamWriter(root, null, outputStream),
```
Does it make sense to abstract this logic out since it's being used in multiple places?

Done. Created a util object and put all Python-related writer functions into it.
```python
batch = next(iterator)
result_list = []
key_fields = [field.name for field in self.key_schema.fields]
# TODO any better way to restore a grouping object from a batch?
```
@bogao007 Is there any common practice for deserializing data from a batch object to the Python object for the grouping key?

Maybe take a look at how load_stream is implemented in ApplyInPandasWithStateSerializer and TransformWithStateInPandasSerializer in pyspark/sql/pandas/serializers.py (and maybe some other custom serializers in the same file).

Maybe try something like below?
`df.itertuples(index=False, name=None)`
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html
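For reference, a small sketch of the `itertuples` suggestion, assuming the batch has already been converted to a pandas DataFrame of grouping-key columns:

```python
import pandas as pd

# Stand-in for one deserialized Arrow batch of grouping keys.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# index=False drops the index column; name=None yields plain tuples
# instead of namedtuples, which is what a grouping key needs.
key_tuples = [row for row in df.itertuples(index=False, name=None)]
```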
Btw, what if multiple batches are being sent from the JVM - are we handling that correctly?

Discussed with Bo offline: the JVM will return the Row type to Python and we can directly convert it into a Tuple.
```scala
outputStream.write(responseMessageBytes)
}

def serializeLongToByteString(longValue: Long): ByteString = {
```
I think it may bring some extra complexity to do serde between long and ByteString. Since this is only used in TimerValueRequests, maybe we could add a dedicated response message for it which returns a long value? That way we can just use read_long on the Python side.

Added a new type of StateResponse to transmit the Long value directly in the proto message.
bogao007 left a comment
LGTM overall, left one minor comment regarding arrow resources clean up. Thanks for making the changes!
```scala
arrowStreamWriter.writeRow(internalRow)
}
arrowStreamWriter.finalizeCurrentArrowBatch()
writer.end()
```
Minor: We might need to do something similar to what PythonArrowInput does to ensure we don't see unexpected errors
HeartSaVioR left a comment
Looks OK overall. There are some small corrections, but mostly minors and nits.
I might have lost track of how TWS (for PySpark) works, but given we get the iterator of expired timers based on the timestamp, isn't this if statement already covered by the API call? In other words, shouldn't the API cover this?
Please let me know if there is a specific reason - no need to change the code if there is one. I just wanted to understand and possibly refresh my memory.
You are correct about this. Thanks for noticing the redundant check. Removed.
nit: is this indentation correct? Looks a bit odd compared to others - params start from the same indentation as the first _.
Shall we let GitHub resolve the review comments rather than manually marking them as resolved? I don't see any new commit to resolve these style comments. I guess you've addressed them but missed pushing the commit; it's easier to track if we resolve the comment as "outdated".

(I tend to use reactions to distinguish comments I agree to address, especially style comments.)
nit: a method doc in Python is placed "after" the definition of the method.
For other functionality, we add the spying instance as a param to test this. Do we test this in e2e instead? I'm OK with that - just wanted to check.
Same as above. This is also tested in e2e suites by assertions on the output of handling expired timer rows.
nit: any specific reason to implement this here separately rather than calling _prepare_input_data?
Refactored by calling _prepare_input_data.
This is not true - the watermark for eviction is 5 but the watermark for late records is 0, hence ("a", 4) is not dropped. This is exactly the reason you still see an event for "a". Otherwise you wouldn't have ("a", 20).
You might wonder how this works differently in Scala tests - AddData() & CheckNewAnswer() will trigger a no-data batch, hence executing two batches.
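The eviction vs. late-record distinction can be sketched numerically. This is a simplified model, assuming the late-record watermark lags one batch behind the eviction watermark (matching the numbers in the comment above: eviction 5, late-record 0 in the first batch):

```python
def watermarks(batch_max_event_times, delay=5):
    # Simplified model: the watermark used to drop late records in batch N
    # is the eviction watermark computed at the end of batch N-1.
    late_record_wm = 0
    for max_ts in batch_max_event_times:
        eviction_wm = max(0, max_ts - delay)
        yield (late_record_wm, eviction_wm)
        late_record_wm = eviction_wm

# Two batches whose max event times are 10 and 20, with a 5s delay.
wms = list(watermarks([10, 20]))
```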
Thanks for leaving the comments! Reading your comments, I realized I did not quite understand the difference between the watermark for eviction and the watermark for late records before.
The test case should still be fine; I just deleted the comments. Dropping late records will be tested more thoroughly in the chaining-of-operators PR.
HeartSaVioR left a comment
Looks good to me except nits.
This seems to be missed.
Thanks. Would you like to leave getExpiredTimers() as it is? Then let's leave a code comment that requests to getExpiredTimers won't be interleaved, hence this is safe.
test failure seems to be unrelated, but lint seems to be either related or simply broken. https://github.com/jingz-db/spark/actions/runs/11526119046/job/32090806651 @HyukjinKwon Do you happen to know the reason for the failure? I guess the generated py file should be excluded from the linter, and I thought we did that, as I didn't see a linter failure in prior PRs. Was anything changed around the pyspark linter?

@jingz-db mind updating your master branch to the latest, rebasing this branch against it, and pushing?
@jingz-db
I did - I rebased on the latest master branch a few hours ago. Let me add type-checking imports and see if it passes.

Hey @HyukjinKwon, do we have any place where we could manually skip the python style check for certain files? Currently the linter check is only failing on the auto-generated file created by
Let's add `# noqa: E501` back to ignore the length check.
Could you please try modifying the mypy.ini file to ignore errors on proto-generated python files? You'll need to move the generated file to a proto directory (create a new directory) and add the exclusion (see lines 183 to 185 in 413242b).
Also please rebase to incorporate the removal of the generated code for Java.
Thanks for the pointer! Moved the proto-generated py file under the sql/streaming/proto directory and added the entry in the mypy.ini file.
Thanks! Merging to master.


What changes were proposed in this pull request?
Adds support for timers in the TransformWithStateInPandas Python API.
Why are the changes needed?
To keep parity with the Scala API, TransformWithStateInPandas should also support processing-time/event-time timers for arbitrary state.
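Based on the parameters shown in this PR's diffs, a user-defined processor handling expired timers might look roughly like the sketch below. The `TimerValues` and `ExpiredTimerInfo` classes here are simplified stand-ins written for this example, not the real PySpark implementations:

```python
class TimerValues:
    # Stand-in: carries the current batch's processing time and watermark.
    def __init__(self, current_processing_time_ms, current_watermark_ms):
        self.current_processing_time_ms = current_processing_time_ms
        self.current_watermark_ms = current_watermark_ms

class ExpiredTimerInfo:
    # Stand-in: valid only when the call is for an expired timer.
    def __init__(self, is_valid, expiry_timestamp_ms=-1):
        self._is_valid = is_valid
        self._expiry_timestamp_ms = expiry_timestamp_ms

    def is_valid(self):
        return self._is_valid

    def get_expiry_time_in_ms(self):
        return self._expiry_timestamp_ms

class CountProcessor:
    # Sketch of the two additional handleInputRows parameters this PR adds.
    def handleInputRows(self, key, rows, timer_values, expired_timer_info):
        if expired_timer_info.is_valid():
            # Timer fired: emit a record for the expired key.
            return [(key, "expired", expired_timer_info.get_expiry_time_in_ms())]
        count = sum(1 for _ in rows)
        return [(key, "count", count)]

p = CountProcessor()
out = p.handleInputRows(
    ("k",), iter([1, 2]), TimerValues(1000, 500), ExpiredTimerInfo(False))
```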
Does this PR introduce any user-facing change?
Yes. Users can now interact with timers from `handleInputRows` via two additional parameters: a newly introduced `TimerValues` to get the processing/event time for the current batch, and an `expired_timer_info` to get the timestamp for expired timers.

How was this patch tested?

Unit tests in `TransformWithStateInPandasStateServerSuite` and integration tests in `test_pandas_transform_with_state.py`.

Was this patch authored or co-authored using generative AI tooling?
No.