feat: make LMDB max_dbs and map_size configurable#1704
feat: make LMDB max_dbs and map_size configurable#1704georgeh0 merged 2 commits intococoindex-io:v1from
Conversation
Expose the previously hardcoded LMDB `max_dbs` (1024) and `map_size` (4 GiB) as configurable settings, both programmatically and via environment variables `COCOINDEX_LMDB_MAX_DBS` / `COCOINDEX_LMDB_MAP_SIZE`. When not set, the existing defaults are preserved. Fixes cocoindex-io#1703 Signed-off-by: Harry-kp <chaudharyharshit9@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR makes LMDB max_dbs and map_size configurable (previously hardcoded in the Rust core), exposing them through Python Settings(...) and environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE while preserving existing defaults when unset.
Changes:
- Add optional
lmdb_max_dbs/lmdb_map_sizeto RustEnvironmentSettingsand apply them when opening the LMDB environment (fallback to prior defaults). - Add
lmdb_max_dbs/lmdb_map_sizeto PythonSettings(public + internal mirror) and load them from env vars inSettings.from_env().
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| rust/core/src/engine/environment.rs | Adds configurable LMDB settings and uses them when opening heed::Env. |
| python/cocoindex/setting.py | Exposes LMDB settings in public Settings and loads them from env vars. |
| python/cocoindex/_internal/setting.py | Mirrors the public settings/env loading for internal API surface. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| lmdb_kwargs: dict[str, Any] = dict() | ||
| _load_field( | ||
| lmdb_kwargs, | ||
| "lmdb_max_dbs", | ||
| "COCOINDEX_LMDB_MAX_DBS", | ||
| parse=int, | ||
| ) | ||
| _load_field( | ||
| lmdb_kwargs, | ||
| "lmdb_map_size", | ||
| "COCOINDEX_LMDB_MAP_SIZE", | ||
| parse=int, | ||
| ) |
There was a problem hiding this comment.
New environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE are introduced here, but there doesn’t appear to be test coverage asserting they’re honored by Settings.from_env(). Please add a unit/integration test that sets these env vars and verifies the resulting Settings carries the parsed values.
There was a problem hiding this comment.
@georgeh0 The existing GlobalExecutionOptions.source_max_inflight_rows (the pattern we're following) doesn't have dedicated unit tests for env var loading either. Would you like me to add test coverage for the LMDB env vars, or is the Rust-side validation sufficient for now?
rust/core/src/engine/environment.rs
Outdated
| @@ -5,13 +5,16 @@ use crate::{ | |||
| use serde::{Deserialize, Serialize}; | |||
| use std::{collections::BTreeSet, path::PathBuf, u32}; | |||
There was a problem hiding this comment.
u32 is imported from std but no longer used in this module. Please remove it to avoid unused-import warnings (and potential CI failures if warnings are denied).
| use std::{collections::BTreeSet, path::PathBuf, u32}; | |
| use std::{collections::BTreeSet, path::PathBuf}; |
rust/core/src/engine/environment.rs
Outdated
| let max_dbs = settings.lmdb_max_dbs.unwrap_or(DEFAULT_MAX_DBS); | ||
| let map_size = settings.lmdb_map_size.unwrap_or(DEFAULT_LMDB_MAP_SIZE); | ||
| let db_env = unsafe { | ||
| heed::EnvOpenOptions::new() | ||
| .max_dbs(MAX_DBS) | ||
| .map_size(LMDB_MAP_SIZE) | ||
| .max_dbs(max_dbs) | ||
| .map_size(map_size) |
There was a problem hiding this comment.
lmdb_max_dbs/lmdb_map_size are now user-configurable, but invalid values can be passed through (e.g. lmdb_max_dbs=0 will later prevent creating named databases via Env::create_database(Some(name))). Consider validating these settings in Environment::new() and returning a clear client error (e.g. require max_dbs >= 1 and map_size > 0) before calling heed::EnvOpenOptions.
There was a problem hiding this comment.
Addressed — validation is now handled on the Rust side with client_bail! per @georgeh0's guidance.
| ) | ||
| global_execution_options = GlobalExecutionOptions(**exec_kwargs) | ||
|
|
||
| lmdb_kwargs: dict[str, Any] = dict() |
There was a problem hiding this comment.
from_env() parses the LMDB settings with int, but does not validate the result. Negative/zero values (or extremely large values) will fail later during engine initialization/deserialization with less actionable errors. Consider validating that lmdb_max_dbs is >= 1 and lmdb_map_size is > 0 (and optionally within platform limits) and raising a ValueError with a clear message.
| _load_field( | ||
| lmdb_kwargs, | ||
| "lmdb_max_dbs", | ||
| "COCOINDEX_LMDB_MAX_DBS", | ||
| parse=int, | ||
| ) | ||
| _load_field( | ||
| lmdb_kwargs, | ||
| "lmdb_map_size", | ||
| "COCOINDEX_LMDB_MAP_SIZE", | ||
| parse=int, | ||
| ) |
There was a problem hiding this comment.
New environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE are introduced here, but there doesn’t appear to be test coverage asserting they’re honored by Settings.from_env(). Please add a unit/integration test that sets these env vars and verifies the resulting Settings carries the parsed values (and ideally that an Environment can be constructed with them).
| "COCOINDEX_LMDB_MAP_SIZE", | ||
| parse=int, | ||
| ) | ||
|
|
There was a problem hiding this comment.
from_env() parses the LMDB settings with int, but does not validate the result. Negative/zero values will fail later during engine initialization/deserialization with less actionable errors. Consider validating that lmdb_max_dbs is >= 1 and lmdb_map_size is > 0 and raising a ValueError with a clear message.
| if "lmdb_max_dbs" in lmdb_kwargs: | |
| lmdb_max_dbs = lmdb_kwargs["lmdb_max_dbs"] | |
| if lmdb_max_dbs < 1: | |
| raise ValueError( | |
| f"COCOINDEX_LMDB_MAX_DBS must be an integer >= 1, got {lmdb_max_dbs}" | |
| ) | |
| if "lmdb_map_size" in lmdb_kwargs: | |
| lmdb_map_size = lmdb_kwargs["lmdb_map_size"] | |
| if lmdb_map_size <= 0: | |
| raise ValueError( | |
| f"COCOINDEX_LMDB_MAP_SIZE must be an integer > 0, got {lmdb_map_size}" | |
| ) |
There was a problem hiding this comment.
Validation is now handled on the Rust side with client_bail! as suggested by @georgeh0, so Python-side validation is not needed.
| lmdb_max_dbs: int | None = None | ||
| lmdb_map_size: int | None = None |
There was a problem hiding this comment.
I think we can put the default values at Python side (similar to GlobalExecutionOptions.source_max_inflight_rows). This will make the default more clear from Python API. Then these fields are always not None and we don't need to apply default by constants from Rust side.
There was a problem hiding this comment.
Done in f2308d6 — moved defaults to the Python side, matching the source_max_inflight_rows pattern. Fields are now always concrete int values, never None. Removed Option<> wrappers on the Rust side as well, with #[serde(default = "...")] for backward compatibility.
There was a problem hiding this comment.
FYI, this file is no longer used in v1 (I'll do a cleanup soon). It's OK to either keep or revert current changes.
There was a problem hiding this comment.
Thanks for the heads up! I have kept the changes for consistency with _internal/setting.py. Happy to revert if you prefer — just let me know.
rust/core/src/engine/environment.rs
Outdated
| let max_dbs = settings.lmdb_max_dbs.unwrap_or(DEFAULT_MAX_DBS); | ||
| let map_size = settings.lmdb_map_size.unwrap_or(DEFAULT_LMDB_MAP_SIZE); |
There was a problem hiding this comment.
To resolve the copilot comments on the setting.py, we can validate these values are positive here and raise a error otherwise (use client_bail!)
There was a problem hiding this comment.
Done in f2308d6 — added validation with client_bail! before calling heed::EnvOpenOptions:
if settings.lmdb_max_dbs < 1 {
client_bail!("lmdb_max_dbs must be >= 1, got {}", settings.lmdb_max_dbs);
}
if settings.lmdb_map_size == 0 {
client_bail!("lmdb_map_size must be > 0, got {}", settings.lmdb_map_size);
}This also resolves the Copilot comments about Python-side validation — invalid values are now caught at the Rust layer with clear error messages.
- Move LMDB default values to Python side (lmdb_max_dbs=1024, lmdb_map_size=4GiB) so fields are never None, matching the pattern used by GlobalExecutionOptions.source_max_inflight_rows - Make Rust EnvironmentSettings fields non-optional (u32/usize) with serde default functions for backward compatibility - Add validation with client_bail! for invalid values (max_dbs < 1, map_size == 0) - Remove unused u32 import from std
| lmdb_kwargs: dict[str, Any] = dict() | ||
| _load_field( | ||
| lmdb_kwargs, | ||
| "lmdb_max_dbs", | ||
| "COCOINDEX_LMDB_MAX_DBS", | ||
| parse=int, | ||
| ) | ||
| _load_field( | ||
| lmdb_kwargs, | ||
| "lmdb_map_size", | ||
| "COCOINDEX_LMDB_MAP_SIZE", | ||
| parse=int, | ||
| ) |
Summary
max_dbs(1024) andmap_size(4 GiB) as configurable settingsSettings(lmdb_max_dbs=..., lmdb_map_size=...)or environment variablesCOCOINDEX_LMDB_MAX_DBS/COCOINDEX_LMDB_MAP_SIZEsource_max_inflight_rowspattern), so fields are always concrete valuesclient_bail!Fixes #1703
Changes
rust/core/src/engine/environment.rs— Addedlmdb_max_dbs: u32andlmdb_map_size: usizetoEnvironmentSettingswith#[serde(default = "...")]for backward compatibility. Validatesmax_dbs >= 1andmap_size > 0viaclient_bail!. Removed unusedu32import and the// TODO: Expose these as settingscomment.python/cocoindex/_internal/setting.py— Addedlmdb_max_dbs: int = 1024andlmdb_map_size: int = 0x1_0000_0000toSettings, loaded from env vars infrom_env()python/cocoindex/setting.py— Same changes (public API mirror, noted as deprecated in v1)Usage
Or via environment variables: