-
Notifications
You must be signed in to change notification settings - Fork 4k
Cache storage abstraction #6052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4e54e3c to
7e629d2
Compare
…rages. Preparation to rename CacheStorageFactory to Manager
… and max_entries to use also None param
| _CACHED_FILE_EXTENSION = "memo" | ||
|
|
||
|
|
||
| class OpenSourceCacheStorageManager(CacheStorageManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion, maybe we call it LocalCacheStorageManager
| """ | ||
| pass | ||
|
|
||
| def get_stats(self) -> list[int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per discussion earlier, propose we remove this from the interface
# Conflicts: # lib/streamlit/runtime/caching/cache_data_api.py # lib/streamlit/runtime/caching/cache_utils.py
…rage to LocalCacheStorage
…in raw mode (using InMemoryWrappedDummyCacheStorageManager
| caching.cache_data.clear() | ||
| cache_storage_manager = get_cache_storage_manager() | ||
| cache_storage_manager.clear_all() | ||
| caching.cache_resource.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, streamlit cache clear works from another python process and doesn't have access to in_memory cache (that was true for previous implementation, and remains true for current).
It removes only persist cache (in our case only files from disk)
Related to this maybe we need to remove caching.cache_resource.clear() call from here at all, and also rework documentation to not mislead users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should not change the behavior or add anything new or complex here. If it seems like any regression lets discuss (but sounds like it's not any).
Agree if we can easily update the docs / code to be more clear about the behavior, that's a nice to have.
tconkling
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Mostly just nits, but I also have some questions about thread safety notes that I think are maybe just a misunderstanding, but want to make sure.
Also, I wonder if we need to think more thoroughly about concurrency in our local disk storage. We are probably fine (and I don't think any of these changes introduce any new local disk concurrency issues that weren't already present) but... let's verify that assumption.
| allow_widgets=self.allow_widgets, | ||
| ) | ||
|
|
||
| def validate_params(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can you add a return type annotation?
| ) | ||
|
|
||
| def validate_params(self): | ||
| """ " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Nits: stray
"here; also start of the docstring should go next to the opening""". - Can you add specificity around what it means to validate params? What happens if validation fails? Do we throw an Exception? If so, what type of Exception?
| # try to remove in optimal way; if not possible, fallback to remove all | ||
| # available storages one by one | ||
| self.get_storage_manager().clear_all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you extend this slightly to explain why we want this fallback? (Why don't we trust storage manager to do the right thing?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, reworked comment.
Basically clear_all method of CacheStorageManager is not required to implement (e.g. when there is no good way to delete all storages at once), so we should have alternative mechanism to remove at least all storages that we have in memory
| max_entries: int | None, | ||
| ttl: int | float | timedelta | None, | ||
| ) -> None: | ||
| """Validate that the cache params are valid for given storage.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if validation fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now CacheStorageManager has ability to raise CacheStorageImproperlyConfigured in check_context method, and here we catch that error, log it, and then propagate it .
| from streamlit.runtime.caching.storage.cache_storage_protocol import ( | ||
| CacheStorage as CacheStorage, | ||
| ) | ||
| from streamlit.runtime.caching.storage.cache_storage_protocol import ( | ||
| CacheStorageContext as CacheStorageContext, | ||
| ) | ||
| from streamlit.runtime.caching.storage.cache_storage_protocol import ( | ||
| CacheStorageError as CacheStorageError, | ||
| ) | ||
| from streamlit.runtime.caching.storage.cache_storage_protocol import ( | ||
| CacheStorageKeyNotFoundError as CacheStorageKeyNotFoundError, | ||
| ) | ||
| from streamlit.runtime.caching.storage.cache_storage_protocol import ( | ||
| CacheStorageManager as CacheStorageManager, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this is more verbose than it needs to be - can we combine these imports?
from streamlit.runtime.caching.storage.cache_storage_protocol import (
CacheStorage as CacheStorage,
CacheStorageContext as CacheStorageContext,
...
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it as suggetsted, but pre-commit hooks (isort?) always revert it back
| ) | ||
|
|
||
|
|
||
| def get_cache_storage_manager() -> CacheStorageManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this unconditionally creates a new CacheStorageManager, but the name makes it sound like it just gets an existing one. Can we change it to create_default_cache_storage_manager or something?
| mock_runtime.media_file_mgr = MediaFileManager( | ||
| MemoryMediaFileStorage("/mock/media") | ||
| ) | ||
| # TODO: Add a mock for CacheStorageManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended to be a TODO? Can we do it now?
| local_disk_cache_storage = LocalDiskCacheStorage(context) | ||
| dummy_cache_storage = DummyCacheStorage() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we remove these from the file scope? It looks like they're just used in test_in_memory_cache_storage_wrapper_works_with_any_storage - can they just be declared inline there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, refactored to different tests
| LocalDiskCacheStorage, | ||
| ) | ||
|
|
||
| context = CacheStorageContext( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: all-caps constant names (CONTEXT)
| self.patch_get_cache_folder_path.stop() | ||
| self.tempdir.cleanup() | ||
|
|
||
| def test_create_persist_context(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add short docstrings to this and the rest of the tests in the file explaining what's being tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* Add CacheStorageImproperlyConfigured exception
* other renaming and docstring addition based on review comments
…s not implemented by design
tconkling
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! My only real complaint is with the name "DummyCacheStorageManager", which confused me until I looked closely at the code and realized that it's just an in-memory cache with no persistence.
| Validate the params passed to @st.cache_data are compatible with cache storage | ||
| When called, this method could log warnings if cache params are invalid | ||
| for current storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can also throw a specific exception, right?
| """Raised when the key is not found in the cache storage""" | ||
|
|
||
|
|
||
| class CacheStorageImproperlyConfigured(CacheStorageError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this name to something like InvalidCacheStorageContext, to make clear that it's specifically for validating a CacheStorageContext?
| class LocalDiskCacheStorage(CacheStorage): | ||
| """ | ||
| Cache storage that persists data to disk | ||
| This is the default cache storage for @st.cache_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "This is the default cache persistence layer for @st.cache_data" (because it will always have a memory cache layer in front of it, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tconkling, @vdonato I would like to explain the original idea:
Storage should not know about any details of its use, and only provide the ability to add and remove entries.
In this case, the LocalDiskCacheStorage class purely provides the ability to work with disk storage.
It is only the responsibility of LocalDiskCacheStorage class.
In contrary LocalDiskCacheStorageManager is responsible for creating those storages, and in our case, it is wrapping those storage instances with another layer of in-memory cache via InMemoryCacheWrapper. LocalDiskCacheStorageManager was originally called InMemoryWrappedLocalDiskCacheStorageManager :) but then during a review, we renamed that to LocalDiskCacheStorage for simplicity).
So the idea is that each implementation should consist of ConcreteCacheStorage, and ConcreteCacheStorageManager , responsible for the creation (and optional decoration) of ConcreteCacheStorage.
For convenience, we provide InMemoryCacheStorageWrapper class, which could wrap any storage with an in-memory cache layer, but this is a detail of the manager, not the storage itself.
I think this is my flaw in the description of the classes, it will be necessary to work on this improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I get that - I'm just nitpicking the comment. "This is the default cache storage for @st.cache_data" sounds to me like it means "by default, we use only the local disk for cache storage."
But that's not the case: by default, we use memory for cache storage, with LocalDiskCacheStorage as the persistence layer for functions that opt-in to persistence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, get it :) comment rephrased!
| ) | ||
|
|
||
|
|
||
| class DummyCacheStorageManager(CacheStorageManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class is misnamed - "DummyCacheStorageManager" makes it sound like the cache is a no-op. But actually, this is just a memory-only cache storage with no persistence layer. Can we call it "MemoryCacheStorageManager" instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable-ish to me since (assuming I'm understanding the code correctly) this is only used to handle the case where a user calls python3 my_script.py and we want the script run to essentially be a no-op
I do think this could benefit from a class docstring mentioning what it's used for so that it isn't confused for a manager implementation that's used when running a Streamlit server, though
vdonato
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving a drive-by review even if we're already covered for this PR since this caching project interests me 👀
LGTM! I'll avoid officially +1-ing this myself since Tim already has / I'm not confident enough in my knowledge of the caching code to want to click approve
I left some comments that are mostly nits + naming suggestions to potentially make things easier on a reader that's never looked at any of this code before.
Also, I think it'd be nice to have a diagram on how caching works since it took some effort to understand what's going on here as a first-time reader of most of this code. I think an ASCII-art (or nicer if someone's willing to put in the effort) class diagram or even just a text-description of how all of the classes interact (like what we have in event_based_path_watcher.py) would help out in the ramp-up process for this part of the codebase a ton given how many layers of indirection are involved.
| def ttl_to_seconds( | ||
| ttl: float | timedelta | None, *, allow_none: bool = False | ||
| ) -> float | None: | ||
| """ | ||
| Convert a ttl value to a float representing "number of seconds". | ||
| """ | ||
| if ttl is None: | ||
| if not allow_none and ttl is None: | ||
| return math.inf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could the allow_none param be renamed to something like coerce_none_to_inf that defaults to True? (I don't particularly like the suggested name either but can't think of anything better at the moment)
allow_none makes me think that an error is raised if None is passed to the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, renamed to coerce_none_to_inf :)
|
|
||
| def validate_params(self) -> None: | ||
| """ | ||
| Validate the params passed to @st.cache_data are compatible with cache storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (here and throughout the PR):
Conventionally, the summary of a docstring is placed on the same line as the opening triple quotes
| ) | ||
|
|
||
|
|
||
| def create_default_cache_storage_manager() -> CacheStorageManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels weird to me that we have a function that just calls the LocalDiskCacheStorageManager constructor. Couldn't we just call the constructor directly where we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function adds clarity: callers are specifically saying "I want whatever the default is". (That happens to be LocalDiskCacheStorageManager right now, but calling the constructor directly wouldn't communicate its default-ness.)
| cache_storage_manager = create_default_cache_storage_manager() | ||
| cache_storage_manager.clear_all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It may be useful to add a comment here pointing out that since this is a separate CLI command, there's no runtime/streamlit server/etc. and that's why it's reasonable to create a new cache manager here.
Potentially an obvious call-out, but I was confused as to why we were creating a new instance for a second when initially reading this code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an explanatory comment added!
| persist: Literal["disk"] | None = None | ||
|
|
||
|
|
||
| class CacheStorage(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would be better named something like FunctionCacheStorage to emphasize that it's the cache for a single annotated function. It wasn't clear to me why we have so many layers of indirection in this code until I had finished reading the docstrings for each of the classes involved.
This may be because I'm reviewing this code without the deepest knowledge of our caching internals, but there are enough layers involved that I think it may help if it doesn't make things too verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely see your point, we definitely have too many layers of indirections for caching.
In cache_utils.py we have Cache, CachedFunc, and CachedFuncInfo classes, as well as DataCaches and DataCache classes in cache_data_api.py to implement all machinery required for keeping the ability to decorate functions.
That's the reason I tried to abstract the layer responsible for reading/writing results from that machinery of function decoration. This CacheStorage and CacheStorageManager was designed keeping the idea in mind that they should not know anything about what and how were caches, working purely with keys and bytes.
It wasn't 100% done, e.g. CacheStorageContext should know
about function_key (to prefix files), and function_display_name to raise exceptions with that name, but at least our storage doesn't know anything about function entity, params hashing algorithms, etc.
I agree that the motivation for why we have many CacheStorages per function instead of one should be better explained in the documentation/diagrams, but I want to keep the CacheStorage name as is since it should be not connected with the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable -- I think my response in this case would be to push a bit harder for a class diagram explaining how all these things work together so that it's more clear to the reader before they've read through all of the code/docstrings + put some work into figuring it out
| ) | ||
|
|
||
|
|
||
| class DummyCacheStorageManager(CacheStorageManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable-ish to me since (assuming I'm understanding the code correctly) this is only used to handle the case where a user calls python3 my_script.py and we want the script run to essentially be a no-op
I do think this could benefit from a class docstring mentioning what it's used for so that it isn't confused for a manager implementation that's used when running a Streamlit server, though
| ttl_seconds: float | None, | ||
| max_entries: int | None, | ||
| ) -> CacheStorageContext: | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra newline
| meaningful default behaviour is to raise NotImplementedError, so this is not | ||
| abstractmethod. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It may be a bit more clear to say that the method is optional in this docstring. I think it's mentioned above already but repeating it here would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
The method is optional to implement: cache data API will fall back to remove
all available storages one by one via storage.clear() method
if clear_all raises NotImplementedError.
paragraph
| """ | ||
| raise NotImplementedError | ||
|
|
||
| def check_context(self, context: CacheStorageContext) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: validate_context might be a nicer name to match with the validate_* methods that live in cache_data_api.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeа, I also don't like that we have inconsistencies with the validate_ and check_ naming.
in this case, I would like to save the check_context in the public protocol, because this method should not produce/return any validated data, but just log a message/raise an exception.
* more tests added
…heStorageManager) to MemoryCacheStorageManager
📚 Context
Introduce the new
CacheStorageandCacheStorageManagerabstractions for@st.cache_data. This allows to customize the way that cache values are stored within Streamlit.RuntimeConfignow takes aCacheStorageManagerinterface, andcache_data_apiconstructs theCacheStorageusing that instance.We ship with a concrete
LocalDiskCacheStorageandLocalDiskCacheStorageManagerimplementation, which just stores cache items in memory and on disk if
persist=Trueis passed to st.cache_data (as before). This is what server.py uses when it initializes its Runtime instance.We also implemented
InMemoryCacheStorageWrapperclass, that could serve as facade and wrap any CacheStorage with in_memory first layer cache.Also, for testing purposes,
DummyCacheStorageandDummyCacheStorageManagerimplemented.Because the
LocalDiskCacheStorageManagerinstance exists when a Runtime has been created, scripts that run in "raw" mode (i.e.streamlit._is_running_with_streamlit == False) no longer have access to that manager from Runtime. So inget_storage_managerincache_data_apiwe fallback toDummyCacheStorageManager.What kind of change does this PR introduce?
🧠 Description of Changes
Add bullet points summarizing your changes here
Revised:
Insert screenshot of your updated UI/code here
Current:
Insert screenshot of existing UI/code here
🧪 Testing Done
🌐 References
Does this depend on other work, documents, or tickets?
Contribution License Agreement
By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.