Description
What is the problem?
This happens 100% of the time when pipeline spilling isn't enabled, and about 50% of the time when it is enabled (we should eventually solve it with a better retry protocol). TL;DR: the following block is triggered quite often, which makes the pull wait an additional 10 seconds.
`// Create failed. The object may already exist locally. If something else went`
Object store memory: 64MB/70MB in use
Object: 8MB, split into 2 chunks
- chunk index 1 is received on thread A
- the buffer pool attempts to spill and create the object
- chunk index 0 is received on thread B
- the buffer pool attempts to spill and create the object
- the object is created by chunk index 0 (since creation is FIFO)
- but chunk index 1 gets its reply from the plasma store first
- chunk index 0 fails at `// Create failed. The object may already exist locally. If something else went`
- chunk index 1 creates the object successfully
- since chunk index 0 failed, it has to wait 10 more seconds before it is retried
The problem can be solved by either:
1. the better retry protocol we are planning to build, or
2. caching the request until the object is created and sealing the object as soon as it is created (not ideal, because it can cause large memory usage).

The issue happens less often with pipeline spilling enabled, because creating a new object in the buffer pool is usually faster (we are not blocked by spilling).
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
See the original issue.
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.