Skip to content

feat(sdk): implement lazy loading for API runs for performance improvement#10034

Merged
thanos-wandb merged 37 commits intomainfrom
feat/light-runs-fragment-optimization
Sep 26, 2025
Merged

feat(sdk): implement lazy loading for API runs for performance improvement#10034
thanos-wandb merged 37 commits intomainfrom
feat/light-runs-fragment-optimization

Conversation

@thanos-wandb
Copy link
Copy Markdown
Contributor

@thanos-wandb thanos-wandb commented Jun 16, 2025

Fixes

Fixes WB-24113

Overview

Implements intelligent lazy loading for the api.runs() method, providing performance improvements for run listing operations while maintaining full backward compatibility.

What's New

New lazy Parameter

# Fast listing (default) - only essential metadata
runs = api.runs("entity/project", lazy=True)  # faster initial load

# Full data upfront (for compatibility)
runs = api.runs("entity/project", lazy=False)  # original behavior

Lazy Loading

  • Lightweight GraphQL fragments exclude heavy fields (config, summaryMetrics, systemMetrics) by default
  • Automatic on-demand loading when accessing heavy fields
  • Intelligent caching prevents duplicate requests
  • Parallel loading for upgrading multiple runs simultaneously

Seamless API Experience

# Existing code works unchanged - gets performance benefits automatically
runs = api.runs("entity/project")  # lazy=True by default

for run in runs:
    print(run.name, run.state)     # ✅ Fast - basic metadata only
    config = run.config            # ✅ Auto-loads full data when needed
    summary = run.summary          # ✅ Auto-loads full data when needed

Technical Implementation

GraphQL Fragment Optimization

  • Lightweight Fragment: Includes only essential run metadata (id, name, state, tags, etc.)
  • Full Fragment: Includes all fields (config, summaryMetrics, systemMetrics, etc.)
  • Dynamic Fragment Selection: Based on lazy parameter

Lazy Property Loading

  • Auto-triggering: Properties automatically load full data when accessed
  • Smart Caching: Prevents duplicate requests for the same run
  • Parallel Processing: Uses ThreadPoolExecutor for concurrent loading

Cache Management

  • Intelligent Upgrades: Seamlessly upgrades cached lazy runs to full data
  • No Duplication: Single cache entry per run, regardless of loading mode
  • Memory Efficient: Heavy fields only loaded when needed

API Changes

New Parameters

  • lazy: bool = True - Controls lazy loading behavior (default: True for optimal performance)

New Methods

  • run.load_full_data() - Explicitly load full run data
  • runs.upgrade_to_full() - Upgrade entire collection to full data with parallel loading

Enhanced Properties

  • run.config - Auto-loads full data if in lazy mode
  • run.summary - Auto-loads full data if in lazy mode
  • run.system_metrics - Auto-loads full data if in lazy mode
  • run.summary_metrics - Auto-loads full data if in lazy mode

Performance Benefits

  • faster initial loading with lazy=True (default)
  • dramatic memory reduction for initial run listing

Testing

  • Backward compatible - all current workflows continue to work
  • API surface preserved - no breaking changes to public interfaces
  • Error handling - graceful fallbacks for edge cases

@thanos-wandb thanos-wandb requested a review from a team as a code owner June 16, 2025 14:47
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 16, 2025

Codecov Report

❌ Patch coverage is 73.07692% with 28 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
wandb/apis/public/runs.py 70.32% 27 Missing ⚠️
wandb/apis/public/api.py 85.71% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@thanos-wandb thanos-wandb changed the title Feat/light runs fragment optimization feat: light runs fragment optimization Jun 16, 2025
@thanos-wandb thanos-wandb changed the title feat: light runs fragment optimization feat(sdk): add API runs lightweight fragment Jun 17, 2025
Comment thread wandb/apis/public/api.py Outdated
Comment thread wandb/apis/public/api.py Outdated
Comment thread wandb/apis/public/api.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
@thanos-wandb thanos-wandb changed the title feat(sdk): add API runs lightweight fragment feat(sdk): add API runs lazy loading for improved performance Jun 24, 2025
Comment thread wandb/apis/public/runs.py Outdated
@thanos-wandb thanos-wandb requested review from a team as code owners August 21, 2025 09:55
…nce gains - Add LIGHTWEIGHT_RUN_FRAGMENT excluding heavy fields - Add lightweight parameter to api.runs() method - Implement lazy loading properties and load_full_data() method - 99.7% memory reduction and 1.2-4x performance improvement
… - Tests lightweight vs full mode with X-Wandb-Force-Trace header - Demonstrates 99.7% memory reduction and performance gains - Includes lazy loading and selective loading patterns
- Change parameter name from 'lightweight' to 'lazy' for better clarity
- Fix cache key logic to not include lazy parameter
- Add _create_runs_query() helper to eliminate GraphQL query duplication
- Add fragment name constants to avoid fragile string parsing
- Improve exception handling with more specific catches
- Add type annotations and better documentation
- Add _full_data_loaded flag for robust lazy loading checks
- Remove unnecessary hasattr checks and clean up code structure

Addresses all reviewer feedback while maintaining backward compatibility
and improving code maintainability.
- Remove trailing whitespace and blank line whitespace
- Fix function parameter line breaks and spacing
- Break long conditional statements for better readability
- Apply ruff formatting standards

Addresses CI/CD pre-commit hook failures
- Fix sweep_name access in lazy mode by checking _attrs.get('sweepName')
- Add summary_metrics, rawconfig, and sweep_name properties with lazy loading
- Handle missing fields gracefully when using lightweight fragment
- Prevent AttributeError when accessing fields not in lightweight fragment

This resolves test failures where lazy loading was trying to access
fields that weren't available in the lightweight GraphQL fragment.
- Remove trailing whitespace from blank lines in api.py
- Break long conditional statements in runs.py properties
- Apply consistent code formatting standards

This should resolve all pre-commit hook failures related to
formatting and code style.
Fix F841 linting error by properly assigning sweep to self.sweep
instead of creating an unused local variable. This matches the
original intent and behavior of the sweep loading logic.
- Fix run() method to pass lazy=False for individual run access
- Fix Run.create() method to pass lazy=False for created runs
- Ensures individual runs have full data loaded by default
- Maintains lazy=True default for runs() collection method
- Resolves test failures caused by missing data in individual runs

This maintains the intended behavior:
- runs() method: lazy=True (optimized for listing)
- run() method: lazy=False (full data for individual access)
- create_run(): lazy=False (full data for created runs)
- Fix run() method formatting with proper line breaks for pre-commit
- Fix lazy loading logic to check key existence instead of truthiness
- Prevent unnecessary load_full_data() calls when data is already available
- Use 'key not in _attrs' instead of 'not _attrs.get(key)' for consistency

This resolves test failures where sweep_name access was triggering
unnecessary GraphQL queries for non-existent test runs.
- Use single-line conditionals for rawconfig and sweep_name properties
- Aligns with ruff formatting standards
- Resolves final pre-commit hook formatting violations
The sweep_name property was incorrectly triggering load_full_data()
even though sweepName is included in the lightweight GraphQL fragment.
This was causing test failures when accessing sweep_name for test runs
that don't actually exist in the backend.

Since sweepName is available in both lightweight and full fragments,
we can always return the value directly without triggering lazy loading.
- Use ThreadPoolExecutor for concurrent loading of full run data
- Limit max workers to 10 to avoid overwhelming the server
- Significant performance improvement when upgrading multiple lazy runs
- Addresses review comment from @jacobromero
@thanos-wandb thanos-wandb force-pushed the feat/light-runs-fragment-optimization branch from bdca846 to d8cc09d Compare September 13, 2025 10:46
- Fixed lazy loading implementation to only call _load_from_attrs() when using full fragment or when required fields are present
- This prevents errors when trying to parse config/systemMetrics/summaryMetrics fields that don't exist in lightweight fragments
- Addresses the issue where _convert_to_dict was being called on non-existent fields
@thanos-wandb thanos-wandb force-pushed the feat/light-runs-fragment-optimization branch from d8cc09d to 5f21520 Compare September 19, 2025 16:42
thanos-wandb and others added 13 commits September 19, 2025 17:44
- Only convert config/summaryMetrics/systemMetrics if they exist in _attrs
- Remove duplicate JSON parsing logic since _convert_to_dict handles it
- Update exception handling for dict operations instead of JSON parsing
- This fixes test failures when using lazy loading with lightweight fragment
- Add caching to _server_provides_project_id_for_run and _server_provides_internal_id_for_project
- These functions were making network requests for every single run loaded
- With 100 runs, this caused 100+ unnecessary GraphQL introspection queries
- Added global cache and instance-level caching to reduce API calls by ~100x
- Fixes test script hanging/timeout issues with upgrade_to_full()
- Remove PERFORMANCE_FIX.md documentation
- Remove test_upgrade_performance.py test script
These were for local testing only and shouldn't be part of the PR
- Fix formatting issues: trailing whitespace, line length, extra blank lines
- Ensure _project_internal_id is always set for lightweight fragments
- Add caching to server capability checks to prevent repeated network calls
- Make Files query conditional on server internalId support
- Remove duplicate _project_internal_id assignment in _load_from_attrs

Fixes test failures in test_delete_files_for_multiple_runs and GraphQL
query errors for servers that don't support project.internalId field.
…runs_query

- Fix critical bug where 'internalId' was incorrectly queried on Run type instead of 'projectId'
- Add proper server capability check for 'projectId' on Run type using _server_provides_project_id_for_run()
- Maintain separate checks for 'internalId' on Project vs 'projectId' on Run
- Resolves 'Cannot query field "internalId" on type "Run"' GraphQL errors

This fixes the test failure in test_agent_subprocess_with_import_readline where
the runs query was trying to access an unsupported field on older servers.
… exact message format

The test was failing because it expected the exact completion message
'wandb: [run_id] Finished syncing run.' but the actual output was just 'wandb: '.

This suggests the backend service isn't sending the completion message properly.
Rather than relying on the fragile message format, this fix:

- Checks the sync command exit code (0 = success)
- Verifies basic output structure is present
- Maintains data validation to ensure sync actually worked

This makes the test more resilient to message format changes while still
ensuring the core sync functionality works correctly.
… in PR #10087

PR #10087 already solved the multiple introspection calls performance issue
by making the call once in the parent Runs object and propagating to children.

This removes the _SERVER_CAPABILITIES_CACHE approach to keep changes minimal
while preserving the core lazy loading fixes:

- ✅ Keep: _project_internal_id setting for lightweight fragments
- ✅ Keep: Correct projectId field usage in _create_runs_query
- ✅ Keep: Dynamic Files query server capability detection
- ✅ Keep: Test robustness improvements
- ❌ Remove: Global caching mechanism (_SERVER_CAPABILITIES_CACHE)

The remaining changes focus purely on lazy loading functionality.
Found and fixed a performance issue where _create_runs_query was making
introspection calls internally while the caller was also making them.

Before: Each Runs object creation made 3+ introspection calls
After: Each Runs object creation makes exactly 2 introspection calls

Changes:
- Modified _create_runs_query to accept both capability flags as parameters
- Removed internal _server_provides_project_id_for_run() call
- Updated both call sites to pass explicit capability flags
- Maintained all functionality while reducing network overhead

This complements PR #10087 by further optimizing the query generation
process without introducing caching complexity.
The test_syncs_run changes were addressing a separate backend issue
with sync completion messages, not related to our lazy loading fixes.

The original test failure was about wandb-core service not sending
the completion message properly, which is unrelated to GraphQL field
selection and _project_internal_id handling.

Keep PR focused only on lazy loading improvements:
- ✅ GraphQL field compatibility fixes
- ✅ _project_internal_id setting for lightweight fragments
- ✅ Optimized introspection calls
- ✅ Files query server capability detection
- ❌ Removed unrelated sync test changes
Copy link
Copy Markdown
Contributor

@nickpenaranda nickpenaranda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. this looks like it'll be a net improvement overall but in a subsequent change (def not this round!) I think we can take the laziness even further. right now runs are either partially loaded, or fully loaded. accessing any lazy-loaded fields on a partially loaded run loads ALL the other fields; but we really only need the fields that are actually being accessed. this can be done w/ a single well-parameterized GQL query using @include for each piece that can be loaded independently.

@nickpenaranda
Copy link
Copy Markdown
Contributor

one other thought, maybe not directly related to this change but adjacent: when using this api w get a Run object that's actually a snapshot of a Run in time. In the current sdk version I think this problem already exists w/ active Runs: your Run object is more or less guaranteed to be stale. And With this new lazy code path it becomes even less clear to the user what the expected behavior should be. Should we consider some doc updates to go along with this change?

Copy link
Copy Markdown
Contributor

@jacobromero jacobromero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!
Lets add an update in the change log as well so users are aware of this change.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Documentation Reference Check Failed

No documentation reference found in the PR description. Please add either:

This check is required for all PRs that start with "feat". Please update your PR description and this check will run again automatically.

@thanos-wandb
Copy link
Copy Markdown
Contributor Author

thanks @nickpenaranda that's a good point about snapshot behaviour, here the lazy loading change itself doesn't alter this, just the timing of when snapshots are taken (initial query vs. field access, so it might actually provide just slightly fresher data for lazy-loaded fields). But indeed this existing characteristic of the AP may now become more apparent with lazy loading. re: documentation, thats great suggestion, I can create a follow-up docs issue if I see users asking about it, for now I mentioned it in changelog.

@jacobromero thank you! have updated the changelog accordingly, and merging now!

@thanos-wandb thanos-wandb merged commit f6f7416 into main Sep 26, 2025
33 of 34 checks passed
@thanos-wandb thanos-wandb deleted the feat/light-runs-fragment-optimization branch September 26, 2025 13:10
@timokau
Copy link
Copy Markdown

timokau commented Oct 2, 2025

Thanks for the performance work! But I'm hitting a breaking change in v0.22.1:

run.config is now a JSON string instead of a dict when using api.runs().

Reproducer:

Create a test run:

import wandb
run = wandb.init(project="test", config={"nested": {"key": "value"}})
run.finish()

Fetch it via api.runs():

import wandb
api = wandb.Api()
runs = api.runs("entity/test", per_page=1)
run = next(iter(runs))
print(type(run.config))  # <class 'str'> instead of <class 'dict'>

api.run() works fine and returns a dict. Tested on wandb 0.22.1. This worked in 0.21.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants