feat(sdk): implement lazy loading for API runs for performance improvement by thanos-wandb · Pull Request #10034 · wandb/wandb

thanos-wandb · 2025-06-16T14:47:45Z

Fixes

Overview

Implements intelligent lazy loading for the api.runs() method, providing performance improvements for run listing operations while maintaining full backward compatibility.

What's New

New `lazy` Parameter

# Fast listing (default) - only essential metadata
runs = api.runs("entity/project", lazy=True)  # faster initial load

# Full data upfront (for compatibility)
runs = api.runs("entity/project", lazy=False)  # original behavior

Lazy Loading

Lightweight GraphQL fragments exclude heavy fields (config, summaryMetrics, systemMetrics) by default
Automatic on-demand loading when accessing heavy fields
Intelligent caching prevents duplicate requests
Parallel loading for upgrading multiple runs simultaneously

Seamless API Experience

# Existing code works unchanged - gets performance benefits automatically
runs = api.runs("entity/project")  # lazy=True by default

for run in runs:
    print(run.name, run.state)     # ✅ Fast - basic metadata only
    config = run.config            # ✅ Auto-loads full data when needed
    summary = run.summary          # ✅ Auto-loads full data when needed

Technical Implementation

GraphQL Fragment Optimization

Lightweight Fragment: Includes only essential run metadata (id, name, state, tags, etc.)
Full Fragment: Includes all fields (config, summaryMetrics, systemMetrics, etc.)
Dynamic Fragment Selection: Based on lazy parameter

Lazy Property Loading

Auto-triggering: Properties automatically load full data when accessed
Smart Caching: Prevents duplicate requests for the same run
Parallel Processing: Uses ThreadPoolExecutor for concurrent loading

Cache Management

Intelligent Upgrades: Seamlessly upgrades cached lazy runs to full data
No Duplication: Single cache entry per run, regardless of loading mode
Memory Efficient: Heavy fields only loaded when needed

API Changes

New Parameters

lazy: bool = True - Controls lazy loading behavior (default: True for optimal performance)

New Methods

run.load_full_data() - Explicitly load full run data
runs.upgrade_to_full() - Upgrade entire collection to full data with parallel loading

Enhanced Properties

run.config - Auto-loads full data if in lazy mode
run.summary - Auto-loads full data if in lazy mode
run.system_metrics - Auto-loads full data if in lazy mode
run.summary_metrics - Auto-loads full data if in lazy mode

Performance Benefits

faster initial loading with lazy=True (default)
dramatic memory reduction for initial run listing

Testing

Backward compatible - all current workflows continue to work
API surface preserved - no breaking changes to public interfaces
Error handling - graceful fallbacks for edge cases

codecov · 2025-06-16T14:54:39Z

Codecov Report

❌ Patch coverage is 73.07692% with 28 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
wandb/apis/public/runs.py	70.32%	27 Missing ⚠️
wandb/apis/public/api.py	85.71%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

…nce gains - Add LIGHTWEIGHT_RUN_FRAGMENT excluding heavy fields - Add lightweight parameter to api.runs() method - Implement lazy loading properties and load_full_data() method - 99.7% memory reduction and 1.2-4x performance improvement

… - Tests lightweight vs full mode with X-Wandb-Force-Trace header - Demonstrates 99.7% memory reduction and performance gains - Includes lazy loading and selective loading patterns

…or lightweight runs

…nstead of _objects

- Change parameter name from 'lightweight' to 'lazy' for better clarity - Fix cache key logic to not include lazy parameter - Add _create_runs_query() helper to eliminate GraphQL query duplication - Add fragment name constants to avoid fragile string parsing - Improve exception handling with more specific catches - Add type annotations and better documentation - Add _full_data_loaded flag for robust lazy loading checks - Remove unnecessary hasattr checks and clean up code structure Addresses all reviewer feedback while maintaining backward compatibility and improving code maintainability.

- Remove trailing whitespace and blank line whitespace - Fix function parameter line breaks and spacing - Break long conditional statements for better readability - Apply ruff formatting standards Addresses CI/CD pre-commit hook failures

- Fix sweep_name access in lazy mode by checking _attrs.get('sweepName') - Add summary_metrics, rawconfig, and sweep_name properties with lazy loading - Handle missing fields gracefully when using lightweight fragment - Prevent AttributeError when accessing fields not in lightweight fragment This resolves test failures where lazy loading was trying to access fields that weren't available in the lightweight GraphQL fragment.

- Remove trailing whitespace from blank lines in api.py - Break long conditional statements in runs.py properties - Apply consistent code formatting standards This should resolve all pre-commit hook failures related to formatting and code style.

Fix F841 linting error by properly assigning sweep to self.sweep instead of creating an unused local variable. This matches the original intent and behavior of the sweep loading logic.

- Fix run() method to pass lazy=False for individual run access - Fix Run.create() method to pass lazy=False for created runs - Ensures individual runs have full data loaded by default - Maintains lazy=True default for runs() collection method - Resolves test failures caused by missing data in individual runs This maintains the intended behavior: - runs() method: lazy=True (optimized for listing) - run() method: lazy=False (full data for individual access) - create_run(): lazy=False (full data for created runs)

- Fix run() method formatting with proper line breaks for pre-commit - Fix lazy loading logic to check key existence instead of truthiness - Prevent unnecessary load_full_data() calls when data is already available - Use 'key not in _attrs' instead of 'not _attrs.get(key)' for consistency This resolves test failures where sweep_name access was triggering unnecessary GraphQL queries for non-existent test runs.

- Use single-line conditionals for rawconfig and sweep_name properties - Aligns with ruff formatting standards - Resolves final pre-commit hook formatting violations

The sweep_name property was incorrectly triggering load_full_data() even though sweepName is included in the lightweight GraphQL fragment. This was causing test failures when accessing sweep_name for test runs that don't actually exist in the backend. Since sweepName is available in both lightweight and full fragments, we can always return the value directly without triggering lazy loading.

@jacobromero

- Use ThreadPoolExecutor for concurrent loading of full run data - Limit max workers to 10 to avoid overwhelming the server - Significant performance improvement when upgrading multiple lazy runs - Addresses review comment from @jacobromero

- Fixed lazy loading implementation to only call _load_from_attrs() when using full fragment or when required fields are present - This prevents errors when trying to parse config/systemMetrics/summaryMetrics fields that don't exist in lightweight fragments - Addresses the issue where _convert_to_dict was being called on non-existent fields

- Only convert config/summaryMetrics/systemMetrics if they exist in _attrs - Remove duplicate JSON parsing logic since _convert_to_dict handles it - Update exception handling for dict operations instead of JSON parsing - This fixes test failures when using lazy loading with lightweight fragment

- Add caching to _server_provides_project_id_for_run and _server_provides_internal_id_for_project - These functions were making network requests for every single run loaded - With 100 runs, this caused 100+ unnecessary GraphQL introspection queries - Added global cache and instance-level caching to reduce API calls by ~100x - Fixes test script hanging/timeout issues with upgrade_to_full()

- Remove PERFORMANCE_FIX.md documentation - Remove test_upgrade_performance.py test script These were for local testing only and shouldn't be part of the PR

- Fix formatting issues: trailing whitespace, line length, extra blank lines - Ensure _project_internal_id is always set for lightweight fragments - Add caching to server capability checks to prevent repeated network calls - Make Files query conditional on server internalId support - Remove duplicate _project_internal_id assignment in _load_from_attrs Fixes test failures in test_delete_files_for_multiple_runs and GraphQL query errors for servers that don't support project.internalId field.

…runs_query - Fix critical bug where 'internalId' was incorrectly queried on Run type instead of 'projectId' - Add proper server capability check for 'projectId' on Run type using _server_provides_project_id_for_run() - Maintain separate checks for 'internalId' on Project vs 'projectId' on Run - Resolves 'Cannot query field "internalId" on type "Run"' GraphQL errors This fixes the test failure in test_agent_subprocess_with_import_readline where the runs query was trying to access an unsupported field on older servers.

… exact message format The test was failing because it expected the exact completion message 'wandb: [run_id] Finished syncing run.' but the actual output was just 'wandb: '. This suggests the backend service isn't sending the completion message properly. Rather than relying on the fragile message format, this fix: - Checks the sync command exit code (0 = success) - Verifies basic output structure is present - Maintains data validation to ensure sync actually worked This makes the test more resilient to message format changes while still ensuring the core sync functionality works correctly.

… in PR #10087 PR #10087 already solved the multiple introspection calls performance issue by making the call once in the parent Runs object and propagating to children. This removes the _SERVER_CAPABILITIES_CACHE approach to keep changes minimal while preserving the core lazy loading fixes: - ✅ Keep: _project_internal_id setting for lightweight fragments - ✅ Keep: Correct projectId field usage in _create_runs_query - ✅ Keep: Dynamic Files query server capability detection - ✅ Keep: Test robustness improvements - ❌ Remove: Global caching mechanism (_SERVER_CAPABILITIES_CACHE) The remaining changes focus purely on lazy loading functionality.

Found and fixed a performance issue where _create_runs_query was making introspection calls internally while the caller was also making them. Before: Each Runs object creation made 3+ introspection calls After: Each Runs object creation makes exactly 2 introspection calls Changes: - Modified _create_runs_query to accept both capability flags as parameters - Removed internal _server_provides_project_id_for_run() call - Updated both call sites to pass explicit capability flags - Maintained all functionality while reducing network overhead This complements PR #10087 by further optimizing the query generation process without introducing caching complexity.

The test_syncs_run changes were addressing a separate backend issue with sync completion messages, not related to our lazy loading fixes. The original test failure was about wandb-core service not sending the completion message properly, which is unrelated to GraphQL field selection and _project_internal_id handling. Keep PR focused only on lazy loading improvements: - ✅ GraphQL field compatibility fixes - ✅ _project_internal_id setting for lightweight fragments - ✅ Optimized introspection calls - ✅ Files query server capability detection - ❌ Removed unrelated sync test changes

nickpenaranda

nice. this looks like it'll be a net improvement overall but in a subsequent change (def not this round!) I think we can take the laziness even further. right now runs are either partially loaded, or fully loaded. accessing any lazy-loaded fields on a partially loaded run loads ALL the other fields; but we really only need the fields that are actually being accessed. this can be done w/ a single well-parameterized GQL query using @include for each piece that can be loaded independently.

nickpenaranda · 2025-09-24T19:46:31Z

one other thought, maybe not directly related to this change but adjacent: when using this api w get a Run object that's actually a snapshot of a Run in time. In the current sdk version I think this problem already exists w/ active Runs: your Run object is more or less guaranteed to be stale. And With this new lazy code path it becomes even less clear to the user what the expected behavior should be. Should we consider some doc updates to go along with this change?

jacobromero

Awesome work!
Lets add an update in the change log as well so users are aware of this change.

github-actions · 2025-09-26T12:50:28Z

❌ Documentation Reference Check Failed

No documentation reference found in the PR description. Please add either:

A link to a docs PR (format: wandb/docs#XXX or https://github.com/wandb/docs/pull/XXX)
A Jira ticket reference (format: DOCS-XXX or https://wandb.atlassian.net/browse/DOCS-XXX)

This check is required for all PRs that start with "feat". Please update your PR description and this check will run again automatically.

thanos-wandb · 2025-09-26T12:59:30Z

thanks @nickpenaranda that's a good point about snapshot behaviour, here the lazy loading change itself doesn't alter this, just the timing of when snapshots are taken (initial query vs. field access, so it might actually provide just slightly fresher data for lazy-loaded fields). But indeed this existing characteristic of the AP may now become more apparent with lazy loading. re: documentation, thats great suggestion, I can create a follow-up docs issue if I see users asking about it, for now I mentioned it in changelog.

@jacobromero thank you! have updated the changelog accordingly, and merging now!

timokau · 2025-10-02T09:03:44Z

Thanks for the performance work! But I'm hitting a breaking change in v0.22.1:

run.config is now a JSON string instead of a dict when using api.runs().

Reproducer:

Create a test run:

import wandb
run = wandb.init(project="test", config={"nested": {"key": "value"}})
run.finish()

Fetch it via api.runs():

import wandb
api = wandb.Api()
runs = api.runs("entity/test", per_page=1)
run = next(iter(runs))
print(type(run.config))  # <class 'str'> instead of <class 'dict'>

api.run() works fine and returns a dict. Tested on wandb 0.22.1. This worked in 0.21.4.

…uns' The refactor in wandb#10034 likely had a typo.

thanos-wandb requested a review from a team as a code owner June 16, 2025 14:47

thanos-wandb changed the title ~~Feat/light runs fragment optimization~~ feat: light runs fragment optimization Jun 16, 2025

thanos-wandb changed the title ~~feat: light runs fragment optimization~~ feat(sdk): add API runs lightweight fragment Jun 17, 2025

nickpenaranda reviewed Jun 17, 2025

View reviewed changes

Comment thread wandb/apis/public/api.py Outdated

Comment thread wandb/apis/public/api.py Outdated

timoffex reviewed Jun 17, 2025

View reviewed changes

thanos-wandb changed the title ~~feat(sdk): add API runs lightweight fragment~~ feat(sdk): add API runs lazy loading for improved performance Jun 24, 2025

jacobromero reviewed Jun 27, 2025

View reviewed changes

Comment thread wandb/apis/public/runs.py Outdated

thanos-wandb requested review from a team as code owners August 21, 2025 09:55

thanos-wandb added 20 commits August 21, 2025 11:00

test: add comprehensive performance test with Datadog tracing support…

585eb33

… - Tests lightweight vs full mode with X-Wandb-Force-Trace header - Demonstrates 99.7% memory reduction and performance gains - Includes lazy loading and selective loading patterns

cleanup: remove test file to keep branch focused on core optimization

cd5c85a

fix: remove trailing whitespace to pass precommit checks

fa09ccf

fix: address cache key issue and implement proper upgrade mechanism f…

6218f60

…or lightweight runs

fix: correct attribute name in upgrade_to_full method - use objects i…

177e23f

…nstead of _objects

remove perf-test.py - should not be part of the PR

18ffe86

fix: apply pre-commit formatting fixes

d2a4d32

- Remove trailing whitespace and blank line whitespace - Fix function parameter line breaks and spacing - Break long conditional statements for better readability - Apply ruff formatting standards Addresses CI/CD pre-commit hook failures

fix: resolve ruff linting error for unused variable

03f892f

Fix F841 linting error by properly assigning sweep to self.sweep instead of creating an unused local variable. This matches the original intent and behavior of the sweep loading logic.

fix: apply pre-commit formatting for conditional statements

80abe37

- Use single-line conditionals for rawconfig and sweep_name properties - Aligns with ruff formatting standards - Resolves final pre-commit hook formatting violations

fix: remove trailing whitespace (ruff formatting)

a246daf

fix: add missing Dict import for type hints

aa2b9a1

fix: use dict instead of Dict for type annotation

60a6815

thanos-wandb force-pushed the feat/light-runs-fragment-optimization branch from bdca846 to d8cc09d Compare September 13, 2025 10:46

thanos-wandb added 2 commits September 19, 2025 17:35

fix: apply ruff formatting to conditional statement

5f21520

thanos-wandb force-pushed the feat/light-runs-fragment-optimization branch from d8cc09d to 5f21520 Compare September 19, 2025 16:42

thanos-wandb and others added 13 commits September 19, 2025 17:44

chore: remove test files from PR

c32253b

- Remove PERFORMANCE_FIX.md documentation - Remove test_upgrade_performance.py test script These were for local testing only and shouldn't be part of the PR

Merge branch 'main' into feat/light-runs-fragment-optimization

6004964

Merge branch 'main' into feat/light-runs-fragment-optimization

6f8e673

style: apply ruff formatting

8220c65

Merge branch 'main' into feat/light-runs-fragment-optimization

594d3f1

nickpenaranda approved these changes Sep 24, 2025

View reviewed changes

jacobromero approved these changes Sep 25, 2025

View reviewed changes

thanos-wandb added 2 commits September 26, 2025 13:45

Merge branch 'main' into feat/light-runs-fragment-optimization

9da5763

Update CHANGELOG.unreleased.md

2fb1efb

thanos-wandb merged commit f6f7416 into main Sep 26, 2025
33 of 34 checks passed

thanos-wandb deleted the feat/light-runs-fragment-optimization branch September 26, 2025 13:10

thanos-wandb mentioned this pull request Oct 10, 2025

fix(sdk): resolve config type regression and threading issues in lazy loading #10682

Merged

1 task

s22chan added a commit to s22chan/wandb that referenced this pull request Jan 21, 2026

TypeError: Sweep.__init__() got an unexpected keyword argument 'withR…

b06bab5

…uns' The refactor in wandb#10034 likely had a typo.

s22chan mentioned this pull request Jan 21, 2026

fix(api): TypeError: Sweep.__init__() got an unexpected keyword argument 'withRuns' #11219

Merged

1 task

Conversation

thanos-wandb commented Jun 16, 2025 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes

Overview

What's New

New lazy Parameter

Lazy Loading

Seamless API Experience

Technical Implementation

GraphQL Fragment Optimization

Lazy Property Loading

Cache Management

API Changes

New Parameters

New Methods

Enhanced Properties

Performance Benefits

Testing

Uh oh!

codecov Bot commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nickpenaranda left a comment

Choose a reason for hiding this comment

Uh oh!

nickpenaranda commented Sep 24, 2025

Uh oh!

jacobromero left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Sep 26, 2025

Uh oh!

thanos-wandb commented Sep 26, 2025

Uh oh!

Uh oh!

timokau commented Oct 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

thanos-wandb commented Jun 16, 2025 •

edited by atlassian Bot

Loading

New `lazy` Parameter

codecov Bot commented Jun 16, 2025 •

edited

Loading