KeyError in cache eviction of Store #604

@gpauloski

Describe the problem.

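I ran into this issue while running the TaPS cholesky app with extract_target=False (full configuration below). The app runs without error if I set extract_target=True. My guess is that it's a race condition: multiple numpy threads resolve the proxy concurrently and therefore update the store cache at the same time.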

I'm not sure what the best approach is. We could simply make the cache tolerate key deletion errors, but I think this points to a larger issue: the Store is not thread-safe, and that limitation is not documented.
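To make the two options concrete, here is a minimal sketch of an LRU cache that serializes access with a lock, so eviction cannot race with another thread's update. This is not ProxyStore's implementation: the class name `LockedLRUCache`, its methods, and the `hammer` helper are hypothetical and only illustrate the idea.

```python
import threading
from collections import OrderedDict


class LockedLRUCache:
    """Hypothetical LRU cache guarded by a lock (illustration only)."""

    def __init__(self, maxsize=16):
        self.maxsize = maxsize
        self._data = OrderedDict()  # least recently used entry comes first
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def set(self, key, value):
        if self.maxsize <= 0:
            return
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = value
            # Evict under the same lock that guards insertion, so the
            # key being evicted is always still present.
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)

    def __len__(self):
        with self._lock:
            return len(self._data)


def hammer(cache, n_threads=8, n_keys=1000):
    """Write to the cache from several threads; return any exceptions."""
    errors = []

    def worker(tid):
        try:
            for i in range(n_keys):
                cache.set((tid, i), i)
                cache.get((tid, i))
        except Exception as e:  # collected so the caller can inspect them
            errors.append(e)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

Keeping the ordering and the values in a single `OrderedDict` under one lock avoids having two structures that can drift apart between a lookup and a delete. Merely catching the `KeyError` during eviction would hide the symptom but leave other concurrent updates unprotected.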

python -m taps.run --config configs/cholesky-app.toml configs/dask-local.toml configs/proxystore-redis-local.toml
[2024-08-06 19:37:01.787] RUN   (taps.run) :: CLI Arguments: --config configs/cholesky-app.toml configs/dask-local.toml configs/proxystore-redis-local.toml
[2024-08-06 19:37:01.792] RUN   (taps.run) :: Environment:
host: uan-0001
  os: linux (Linux-5.14.21-150400.24.55-default-x86_64-with-glibc2.31)
  cpu: x86_64 (52 cores / 104 logical)
  memory: 907.67 GB
python:
  version: 3.11.9
  build: CPython (64-bit runtime) [GCC 11.2.0]
  taps: 0.2.1.dev1
[2024-08-06 19:37:01.792] RUN   (taps.run) :: Starting app (name=cholesky)
[2024-08-06 19:37:01.793] RUN   (taps.run) :: Configuration:
app:
  name: 'cholesky'
  block_size: 1000
  matrix_size: 10000
engine:
  executor:
    name: 'dask'
    daemon_workers: True
    scheduler: None
    use_threads: False
    workers: 32
  filter:
    name: 'object-size'
    max_size: inf
    min_size: 10000
  task_record_file_name: 'tasks.jsonl'
  transformer:
    name: 'proxystore'
    cache_size: 16
    connector:
      kind: 'redis'
      options:
        hostname: 'localhost'
        port: 6379
    extract_target: False
    populate_target: True
logging:
  file_level: 'INFO'
  file_name: 'log.txt'
  level: 'INFO'
run:
  dir_format: 'runs/{name}_{executor}_{timestamp}'
  env_vars:
version: '0.2.1.dev1'
[2024-08-06 19:37:01.793] RUN   (taps.run) :: Runtime directory: /lus/gila/projects/CSC249ADCD08_CNDA/jgpaul/hppss24-proxystore/experiments/runs/cholesky_dask_2024-08-06-19-37-01
[2024-08-06 19:37:02.760] INFO  (proxystore.store) :: Registered a store named "proxy-transformer"
[2024-08-06 19:37:02.760] INFO  (proxystore.store.base) :: Initialized Store(name=proxy-transformer, connector=RedisConnector(hostname=localhost, port=6379), serializer=default, deserializer=default, cache_size=16, metrics=False, populate_target=True, auto_register=True)
[2024-08-06 19:37:06.261] APP   (taps.apps.cholesky) :: Generated input matrix: (10000, 10000)
[2024-08-06 19:37:06.261] APP   (taps.apps.cholesky) :: Block size: 1000
2024-08-06 19:37:16,857 - distributed.worker - WARNING - Compute Failed
Key:       gemm-558068d565ec7f9751164c2265f2b467
State:     executing
Function:  gemm
args:      (TaskResult(result=<Proxy at 0x7fd0b045bc50 with factory <proxystore.store.factory.StoreFactory object at 0x7fd0b00d32d0>>, info=ExecutionInfo(hostname='uan-0001', execution_start_time=1722973036.5006053, execution_end_time=1722973036.5594468, task_start_time=1722973036.500615, task_end_time=1722973036.5400963, input_transform_start_time=1722973036.50061, input_transform_end_time=1722973036.5006146, result_transform_start_time=1722973036.5400975, result_transform_end_time=1722973036.5594466)), TaskResult(result=array([[ 0.15782064,  1.13895301, -0.53984381, ...,  0.22560128,
         0.56136957,  0.19795206],
       [-1.3285794 ,  0.03382272,  0.13867708, ...,  1.04503873,
        -1.03138524,  0.56715439],
       [-0.89702222,  0.46101935,  0.75638062, ..., -0.99828644,
        -0.38111654, -0.0060622 ],
       ...,
       [ 1.05515101, -0.31326996, -0.24591136, ..., -0.93215602,
        -0.88302518, -1.50558914],
       [ 0.16922123,  0.09686408,  1.22620024, ...,  0.28477478,

kwargs:    {}
Exception: "KeyError(RedisKey(redis_key='b5ee21aa-da34-4a2b-a5d8-88b517fe8595'))"
Traceback: '  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/engine/task.py", line 176, in __call__\n    result = self.function(*args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/apps/cholesky.py", line 43, in gemm\n    return a - numpy.dot(b, c)\n           ~~^~~~~~~~~~~~~~~~~\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 417, in __sub__\n    return self.__proxy_wrapped__ - other\n           ^^^^^^^^^^^^^^^^^^^^^^\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 291, in __proxy_wrapped__\n    target = factory()\n             ^^^^^^^^^\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 79, in __call__\n    obj = self.resolve()\n          ^^^^^^^^^^^^^^\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 112, in resolve\n    obj = store.get(\n          ^^^^^^^^^^\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/base.py", line 537, in get\n    self.cache.set(key, result)\n  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/cache.py", line 61, in set\n    del self.data[lru_key]\n        ~~~~~~~~~^^^^^^^^^\n'

[2024-08-06 19:37:17.276] INFO  (proxystore.store) :: Unregistered a store named proxy-transformer
[2024-08-06 19:37:17.280] ERROR (taps.run) :: Caught unhandled exception
Traceback (most recent call last):
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/run/main.py", line 121, in main
    run(config, run_dir)
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/run/main.py", line 35, in _decorator
    return func(config, run_dir)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/engine/_engine.py", line 183, in _task_done_callback
    execution_info = future.result().info
                     ^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/distributed/client.py", line 405, in result
    return self.client.sync(self._result, callback_timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/engine/task.py", line 176, in __call__
    result = self.function(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/taps/apps/cholesky.py", line 43, in gemm
    return a - numpy.dot(b, c)
      ^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 417, in __sub__
    return self.__proxy_wrapped__ - other
    ^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/proxy/__init__.py", line 291, in __proxy_wrapped__
    target = factory()
    ^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 79, in __call__
    obj = self.resolve()
    ^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/factory.py", line 112, in resolve
    obj = store.get(
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/base.py", line 537, in get
    self.cache.set(key, result)
    ^^^^^^^^^^^^^^^^^
  File "/home/jgpaul/.conda/envs/hppss24/lib/python3.11/site-packages/proxystore/store/cache.py", line 61, in set
    del self.data[lru_key]
    ^^^^^^^^^^^^^^^^^
KeyError: RedisKey(redis_key='b5ee21aa-da34-4a2b-a5d8-88b517fe8595')

How did you install ProxyStore?

$ pip install proxystore

ProxyStore Version

v0.7.0

Python Version

3.11

OS and Platform

x86 Linux

Labels

bug: Error, flaw, or fault that causes unexpected behavior
