Skip to content

feat: make LMDB max_dbs and map_size configurable#1704

Merged
georgeh0 merged 2 commits intococoindex-io:v1from
Harry-kp:feat/configurable-lmdb-settings
Mar 1, 2026
Merged

feat: make LMDB max_dbs and map_size configurable#1704
georgeh0 merged 2 commits intococoindex-io:v1from
Harry-kp:feat/configurable-lmdb-settings

Conversation

@Harry-kp
Copy link
Copy Markdown

@Harry-kp Harry-kp commented Feb 24, 2026

Summary

  • Exposes the hardcoded LMDB max_dbs (1024) and map_size (4 GiB) as configurable settings
  • Configurable via Python Settings(lmdb_max_dbs=..., lmdb_map_size=...) or environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE
  • Defaults are set on the Python side (matching the source_max_inflight_rows pattern), so fields are always concrete values
  • Invalid values are validated on the Rust side with client_bail!

Fixes #1703

Changes

  • rust/core/src/engine/environment.rs — Added lmdb_max_dbs: u32 and lmdb_map_size: usize to EnvironmentSettings with #[serde(default = "...")] for backward compatibility. Validates max_dbs >= 1 and map_size > 0 via client_bail!. Removed unused u32 import and the // TODO: Expose these as settings comment.
  • python/cocoindex/_internal/setting.py — Added lmdb_max_dbs: int = 1024 and lmdb_map_size: int = 0x1_0000_0000 to Settings, loaded from env vars in from_env()
  • python/cocoindex/setting.py — Same changes (public API mirror, noted as deprecated in v1)

Usage

import cocoindex

settings = cocoindex.Settings(
    db_path="/tmp/cocoindex_db",
    lmdb_max_dbs=2048,
    lmdb_map_size=0x2_0000_0000,  # 8 GiB
)

Or via environment variables:

export COCOINDEX_LMDB_MAX_DBS=2048
export COCOINDEX_LMDB_MAP_SIZE=8589934592

Expose the previously hardcoded LMDB `max_dbs` (1024) and `map_size`
(4 GiB) as configurable settings, both programmatically and via
environment variables `COCOINDEX_LMDB_MAX_DBS` / `COCOINDEX_LMDB_MAP_SIZE`.

When not set, the existing defaults are preserved.

Fixes cocoindex-io#1703

Signed-off-by: Harry-kp <chaudharyharshit9@gmail.com>
Copilot AI review requested due to automatic review settings February 24, 2026 20:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes LMDB max_dbs and map_size configurable (previously hardcoded in the Rust core), exposing them through Python Settings(...) and environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE while preserving existing defaults when unset.

Changes:

  • Add optional lmdb_max_dbs / lmdb_map_size to Rust EnvironmentSettings and apply them when opening the LMDB environment (fallback to prior defaults).
  • Add lmdb_max_dbs / lmdb_map_size to Python Settings (public + internal mirror) and load them from env vars in Settings.from_env().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
rust/core/src/engine/environment.rs Adds configurable LMDB settings and uses them when opening heed::Env.
python/cocoindex/setting.py Exposes LMDB settings in public Settings and loads them from env vars.
python/cocoindex/_internal/setting.py Mirrors the public settings/env loading for internal API surface.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +71 to +83
lmdb_kwargs: dict[str, Any] = dict()
_load_field(
lmdb_kwargs,
"lmdb_max_dbs",
"COCOINDEX_LMDB_MAX_DBS",
parse=int,
)
_load_field(
lmdb_kwargs,
"lmdb_map_size",
"COCOINDEX_LMDB_MAP_SIZE",
parse=int,
)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE are introduced here, but there doesn’t appear to be test coverage asserting they’re honored by Settings.from_env(). Please add a unit/integration test that sets these env vars and verifies the resulting Settings carries the parsed values.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgeh0 The existing GlobalExecutionOptions.source_max_inflight_rows (the pattern we're following) doesn't have dedicated unit tests for env var loading either. Would you like me to add test coverage for the LMDB env vars, or is the Rust-side validation sufficient for now?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good for now. Thanks!

@@ -5,13 +5,16 @@ use crate::{
use serde::{Deserialize, Serialize};
use std::{collections::BTreeSet, path::PathBuf, u32};
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u32 is imported from std but no longer used in this module. Please remove it to avoid unused-import warnings (and potential CI failures if warnings are denied).

Suggested change
use std::{collections::BTreeSet, path::PathBuf, u32};
use std::{collections::BTreeSet, path::PathBuf};

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f2308d6.

Comment on lines +43 to +48
let max_dbs = settings.lmdb_max_dbs.unwrap_or(DEFAULT_MAX_DBS);
let map_size = settings.lmdb_map_size.unwrap_or(DEFAULT_LMDB_MAP_SIZE);
let db_env = unsafe {
heed::EnvOpenOptions::new()
.max_dbs(MAX_DBS)
.map_size(LMDB_MAP_SIZE)
.max_dbs(max_dbs)
.map_size(map_size)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmdb_max_dbs/lmdb_map_size are now user-configurable, but invalid values can be passed through (e.g. lmdb_max_dbs=0 will later prevent creating named databases via Env::create_database(Some(name))). Consider validating these settings in Environment::new() and returning a clear client error (e.g. require max_dbs >= 1 and map_size > 0) before calling heed::EnvOpenOptions.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — validation is now handled on the Rust side with client_bail! per @georgeh0's guidance.

)
global_execution_options = GlobalExecutionOptions(**exec_kwargs)

lmdb_kwargs: dict[str, Any] = dict()
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_env() parses the LMDB settings with int, but does not validate the result. Negative/zero values (or extremely large values) will fail later during engine initialization/deserialization with less actionable errors. Consider validating that lmdb_max_dbs is >= 1 and lmdb_map_size is > 0 (and optionally within platform limits) and raising a ValueError with a clear message.

Copilot uses AI. Check for mistakes.
Comment on lines +84 to +95
_load_field(
lmdb_kwargs,
"lmdb_max_dbs",
"COCOINDEX_LMDB_MAX_DBS",
parse=int,
)
_load_field(
lmdb_kwargs,
"lmdb_map_size",
"COCOINDEX_LMDB_MAP_SIZE",
parse=int,
)
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New environment variables COCOINDEX_LMDB_MAX_DBS / COCOINDEX_LMDB_MAP_SIZE are introduced here, but there doesn’t appear to be test coverage asserting they’re honored by Settings.from_env(). Please add a unit/integration test that sets these env vars and verifies the resulting Settings carries the parsed values (and ideally that an Environment can be constructed with them).

Copilot uses AI. Check for mistakes.
"COCOINDEX_LMDB_MAP_SIZE",
parse=int,
)

Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_env() parses the LMDB settings with int, but does not validate the result. Negative/zero values will fail later during engine initialization/deserialization with less actionable errors. Consider validating that lmdb_max_dbs is >= 1 and lmdb_map_size is > 0 and raising a ValueError with a clear message.

Suggested change
if "lmdb_max_dbs" in lmdb_kwargs:
lmdb_max_dbs = lmdb_kwargs["lmdb_max_dbs"]
if lmdb_max_dbs < 1:
raise ValueError(
f"COCOINDEX_LMDB_MAX_DBS must be an integer >= 1, got {lmdb_max_dbs}"
)
if "lmdb_map_size" in lmdb_kwargs:
lmdb_map_size = lmdb_kwargs["lmdb_map_size"]
if lmdb_map_size <= 0:
raise ValueError(
f"COCOINDEX_LMDB_MAP_SIZE must be an integer > 0, got {lmdb_map_size}"
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation is now handled on the Rust side with client_bail! as suggested by @georgeh0, so Python-side validation is not needed.

@Harry-kp Harry-kp marked this pull request as draft February 24, 2026 20:55
Copy link
Copy Markdown
Member

@georgeh0 georgeh0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Comment on lines +61 to +62
lmdb_max_dbs: int | None = None
lmdb_map_size: int | None = None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can put the default values at Python side (similar to GlobalExecutionOptions.source_max_inflight_rows). This will make the default more clear from Python API. Then these fields are always not None and we don't need to apply default by constants from Rust side.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f2308d6 — moved defaults to the Python side, matching the source_max_inflight_rows pattern. Fields are now always concrete int values, never None. Removed Option<> wrappers on the Rust side as well, with #[serde(default = "...")] for backward compatibility.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this file is no longer used in v1 (I'll do a cleanup soon). It's OK to either keep or revert current changes.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up! I have kept the changes for consistency with _internal/setting.py. Happy to revert if you prefer — just let me know.

Comment on lines +43 to +44
let max_dbs = settings.lmdb_max_dbs.unwrap_or(DEFAULT_MAX_DBS);
let map_size = settings.lmdb_map_size.unwrap_or(DEFAULT_LMDB_MAP_SIZE);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To resolve the copilot comments on the setting.py, we can validate these values are positive here and raise a error otherwise (use client_bail!)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f2308d6 — added validation with client_bail! before calling heed::EnvOpenOptions:

if settings.lmdb_max_dbs < 1 {
    client_bail!("lmdb_max_dbs must be >= 1, got {}", settings.lmdb_max_dbs);
}
if settings.lmdb_map_size == 0 {
    client_bail!("lmdb_map_size must be > 0, got {}", settings.lmdb_map_size);
}

This also resolves the Copilot comments about Python-side validation — invalid values are now caught at the Rust layer with clear error messages.

- Move LMDB default values to Python side (lmdb_max_dbs=1024,
  lmdb_map_size=4GiB) so fields are never None, matching the pattern
  used by GlobalExecutionOptions.source_max_inflight_rows
- Make Rust EnvironmentSettings fields non-optional (u32/usize) with
  serde default functions for backward compatibility
- Add validation with client_bail! for invalid values (max_dbs < 1,
  map_size == 0)
- Remove unused u32 import from std
@Harry-kp Harry-kp marked this pull request as ready for review February 27, 2026 19:02
Copy link
Copy Markdown
Member

@georgeh0 georgeh0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Merging.

Comment on lines +71 to +83
lmdb_kwargs: dict[str, Any] = dict()
_load_field(
lmdb_kwargs,
"lmdb_max_dbs",
"COCOINDEX_LMDB_MAX_DBS",
parse=int,
)
_load_field(
lmdb_kwargs,
"lmdb_map_size",
"COCOINDEX_LMDB_MAP_SIZE",
parse=int,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good for now. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants