Closed
Labels
bug: Error, flaw, or fault that causes unexpected behavior
Description
Describe the problem.
I ran into this issue while running the TaPS cholesky app with extract_target=False. The app runs successfully if I set extract_target=True. My guess is that it is a race condition: multiple numpy threads resolve the same proxy concurrently, and each one updates the store cache.
I'm not sure what the best approach is. We could simply make the cache tolerate key deletion errors, but I think this points to a larger issue: the Store is not thread-safe, and that limitation is not documented.
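To illustrate the second option, here is a minimal sketch of an LRU cache whose lookups and evictions are guarded by a lock. The class name `LockedLRUCache` and its `get`/`set` methods are hypothetical and are not ProxyStore's actual cache implementation. The sketch only assumes that the failure comes from two threads choosing the same least-recently-used key and both trying to delete it.

```python
import threading
from collections import OrderedDict
from typing import Any, Hashable


class LockedLRUCache:
    """Hypothetical lock-guarded LRU cache (not ProxyStore's real class)."""

    def __init__(self, maxsize: int = 16) -> None:
        self.maxsize = maxsize
        # OrderedDict keeps keys ordered from least to most recently used.
        self._data: OrderedDict[Hashable, Any] = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: Hashable, default: Any = None) -> Any:
        with self._lock:
            if key not in self._data:
                return default
            # Mark the key as most recently used.
            self._data.move_to_end(key)
            return self._data[key]

    def set(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            # Selecting and removing the victim happens under the same
            # lock, so two threads cannot evict the same key twice.
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)
```

Holding a single lock across both the victim selection and the deletion is the point of the sketch. Catching `KeyError` around the deletion alone would hide the symptom but leave the bookkeeping open to other interleavings.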
python -m taps.run --config configs/cholesky-app.toml configs/dask-local.toml configs/proxystore-redis-local.toml
[2024-08-06 19:37:01.787] RUN (taps.run) :: CLI Arguments: --config configs/cholesky-app.toml configs/dask-local.toml configs/proxystore-redis-local.toml
[2024-08-06 19:37:01.792] RUN (taps.run) :: Environment:
host: uan-0001
os: linux (Linux-5.14.21-150400.24.55-default-x86_64-with-glibc2.31)
cpu: x86_64 (52 cores / 104 logical)
memory: 907.67 GB
python:
version: 3.11.9
build: CPython (64-bit runtime) [GCC 11.2.0]
taps: 0.2.1.dev1
[2024-08-06 19:37:01.792] RUN (taps.run) :: Starting app (name=cholesky)
[2024-08-06 19:37:01.793] RUN (taps.run) :: Configuration:
app:
name: 'cholesky'
block_size: 1000
matrix_size: 10000
engine:
executor:
name: 'dask'
daemon_workers: True
scheduler: None
use_threads: False
workers: 32
filter:
name: 'object-size'
max_size: inf
min_size: 10000
task_record_file_name: 'tasks.jsonl'
transformer:
name: 'proxystore'
cache_size: 16
connector:
kind: 'redis'
options:
hostname: 'localhost'
port: 6379
extract_target: False
populate_target: True
logging:
file_level: 'INFO'
file_name: 'log.txt'
level: 'INFO'
run:
dir_format: 'runs/{name}_{executor}_{timestamp}'
env_vars:
version: '0.2.1.dev1'
[2024-08-06 19:37:01.793] RUN (taps.run) :: Runtime directory: /lus/gila/projects/CSC249ADCD08_CNDA/jgpaul/hppss24-proxystore/experiments/runs/cholesky_dask_2024-08-06-19-37-01
[2024-08-06 19:37:02.760] INFO (proxystore.store) :: Registered a store named "proxy-transformer"
[2024-08-06 19:37:02.760] INFO (proxystore.store.base) :: Initialized Store(name=proxy-transformer, connector=RedisConnector(hostname=localhost, port=6379), serializer=default, deserializer=default, cache_size=16, metrics=False, populate_target=True, auto_register=True)
[2024-08-06 19:37:06.261] APP (taps.apps.cholesky) :: Generated input matrix: (10000, 10000)
[2024-08-06 19:37:06.261] APP (taps.apps.cholesky) :: Block size: 1000
2024-08-06 19:37:16,857 - distributed.worker - WARNING - Compute Failed
Key: gemm-558068d565ec7f9751164c2265f2b467
State: executing
Function: gemm
args: (TaskResult(result=<Proxy at 0x7fd0b045bc50 with factory <proxystore.store.factory.StoreFactory object at 0x7fd0b00d32d0>>, info=ExecutionInfo(hostname='uan-0001', execution_start_time=1722973036.5006053, execution_end_time=1722973036.5594468, task_start_time=1722973036.500615, task_end_time=1722973036.5400963, input_transform_start_time=1722973036.50061, input_transform_end_time=1722973036.5006146, result_transform_start_time=1722973036.5400975, result_transform_end_time=1722973036.5594466)), TaskResult(result=array([[ 0.15782064, 1.13895301, -0.53984381, ..., 0.22560128,
0.56136957, 0.19795206],
[-1.3285794 , 0.03382272, 0.13867708, ..., 1.04503873,
-1.03138524, 0.56715439],
[-0.89702222, 0.46101935, 0.75638062, ..., -0.99828644,
-0.38111654, -0.0060622 ],
...,
[ 1.05515101, -0.31326996, -0.24591136, ..., -0.93215602,
-0.88302518, -1.50558914],
[ 0.16922123, 0.09686408, 1.22620024, ..., 0.28477478,
kwargs: {}
Exception: "KeyError(RedisKey(redis_key='b5ee21aa-da34-4a2b-a5d8-88b517fe8595'))"
Traceback: ' File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/engine/task.py", line 176, in __call__\n result = self.function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/apps/cholesky.py", line 43, in gemm\n return a - numpy.dot(b, c)\n ~~^~~~~~~~~~~~~~~~~\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 417, in __sub__\n return self.__proxy_wrapped__ - other\n ^^^^^^^^^^^^^^^^^^^^^^\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 291, in __proxy_wrapped__\n target = factory()\n ^^^^^^^^^\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 79, in __call__\n obj = self.resolve()\n ^^^^^^^^^^^^^^\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 112, in resolve\n obj = store.get(\n ^^^^^^^^^^\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/base.py", line 537, in get\n self.cache.set(key, result)\n File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/cache.py", line 61, in set\n del self.data[lru_key]\n ~~~~~~~~~^^^^^^^^^\n'
[2024-08-06 19:37:17.276] INFO (proxystore.store) :: Unregistered a store named proxy-transformer
[2024-08-06 19:37:17.280] ERROR (taps.run) :: Caught unhandled exception
Traceback (most recent call last):
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/run/main.py", line 121, in main
run(config, run_dir)
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/run/main.py", line 35, in _decorator
return func(config, run_dir)
^^^^^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/engine/_engine.py", line 183, in _task_done_callback
execution_info = future.result().info
^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/distributed/client.py", line 405, in result
return self.client.sync(self._result, callback_timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/engine/task.py", line 176, in __call__
result = self.function(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/apps/cholesky.py", line 43, in gemm
return a - numpy.dot(b, c)
^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 417, in __sub__
return self.__proxy_wrapped__ - other
^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 291, in __proxy_wrapped__
target = factory()
^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 79, in __call__
obj = self.resolve()
^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 112, in resolve
obj = store.get(
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/base.py", line 537, in get
self.cache.set(key, result)
^^^^^^^^^^^^^^^^^
File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/cache.py", line 61, in set
del self.data[lru_key]
^^^^^^^^^^^^^^^^^
KeyError: RedisKey(redis_key='b5ee21aa-da34-4a2b-a5d8-88b517fe8595')
How did you install ProxyStore?
$ pip install proxystore
ProxyStore Version
v0.7.0
Python Version
3.11
OS and Platform
x86 Linux