fix(sdk): resolve config type regression and threading issues in lazy loading#10682

Merged
thanos-wandb merged 6 commits into main from
hotfix/fix-v0.22.1-regressions
Oct 15, 2025
thanos-wandb (Contributor) commented Oct 10, 2025

Description

Issue: config type regression

User impact: workflows failing with TypeError: string indices must be integers

Reproduction:

import wandb

api = wandb.Api()
art = api.artifact('entity/project/artifact:latest')
run = art.logged_by()
config = run.config  # Returns str instead of dict in v0.22.1
config["key"]       # TypeError: string indices must be integers

Root cause: the conversion logic in _load_from_attrs() was bypassed because of how the _is_loaded flag was handled
Fix: ensure the string→dict conversion always runs, and add defensive type checks in the properties
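
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not the wandb implementation: the `LazyRun` class, its `load()` method, and the JSON-string payload are assumptions made for illustration, and only the names `_is_loaded`, `_load_from_attrs`, and `force` come from this PR.

```python
import json


class LazyRun:
    """Toy stand-in for a lazily loaded run (hypothetical, not wandb code)."""

    def __init__(self, attrs):
        # Raw payload; "config" is assumed to arrive as a JSON string.
        self._attrs = dict(attrs)
        self._is_loaded = False

    def _load_from_attrs(self):
        # Convert the JSON-encoded config into a dict.
        raw = self._attrs.get("config", "{}")
        if isinstance(raw, str):
            self._attrs["config"] = json.loads(raw)

    def load(self, force=False):
        # If the flag is already set, the conversion is skipped unless
        # the caller forces a reload. A stale flag leaves config a str.
        if self._is_loaded and not force:
            return
        self._load_from_attrs()
        self._is_loaded = True

    @property
    def config(self):
        self.load()
        value = self._attrs["config"]
        # Defensive check: convert if a string slipped past load().
        if isinstance(value, str):
            value = json.loads(value)
            self._attrs["config"] = value
        return value
```

With a stale flag (`_is_loaded` set before conversion ran), `load()` returns early, and the defensive check in the property is what still yields a dict.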

  • I updated CHANGELOG.unreleased.md, or it's not applicable

Testing

How was this PR tested?

Regression Testing:

  • Verified the exact customer reproduction case now works correctly
  • Tested artifact.logged_by() + run.config access returns proper dict type
  • Confirmed sequential upgrade completes without hanging
  • Validated all existing usage patterns (lazy/full/mixed) continue to work

Test Cases:

import wandb

api = wandb.Api()

# Test 1: Config type fix
art = api.artifact('project/artifact:v1')
run = art.logged_by()
assert isinstance(run.config, dict)  # ✅ Now passes

# Test 2: No deadlock in upgrade
runs = api.runs("project", lazy=True)
runs.upgrade_to_full()  # ✅ Completes successfully

# Test 3: Mixed usage patterns
api.runs("project", lazy=False)  # ✅ Works
api.run("project/run_id")        # ✅ Works

URGENT: Fixes two critical issues introduced in v0.22.1 lazy loading:

1. Config Type Regression (High Priority)
   Problem: run.config returns a string instead of a dict
   Root Cause: _load_from_attrs() bypassed due to _is_loaded flag logic
   Customer Impact: TypeError in artifact.logged_by() workflows
   Fix:
   - Modified _load_with_fragment to respect the force parameter
   - Added defensive conversion in the config/summary/system_metrics properties
   - Ensured _load_from_attrs is called when loading full fragments

2. Threading Deadlock (High Priority)
   Problem: Jobs get stuck in concurrent/futures/_base.py:439
   Root Cause: ThreadPoolExecutor in upgrade_to_full() overwhelming asyncio
   Customer Impact: Training jobs hanging indefinitely
   Fix:
   - Replace parallel ThreadPoolExecutor with sequential loading
   - Add error handling to prevent cascading failures
   - Preserve functionality while eliminating deadlock risk
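
The sequential pattern with per-item error handling described in item 2 might look roughly like the following. This is a sketch under stated assumptions: `upgrade_sequentially` and `load_one` are hypothetical names, and the actual body of `upgrade_to_full()` is not shown in this PR.

```python
import logging

logger = logging.getLogger(__name__)


def upgrade_sequentially(items, load_one):
    """Load each item in turn and collect failures instead of aborting.

    `load_one` is a hypothetical callable that performs the full load
    for a single item and raises on error.
    """
    failures = []
    for item in items:
        try:
            load_one(item)
        except Exception as exc:
            # Keep going so one bad item does not block the rest.
            logger.warning("failed to load %r: %s", item, exc)
            failures.append((item, exc))
    return failures
```

Running the loads one at a time on the calling thread trades throughput for predictability: no worker pool is created, so no call blocks on `Future.result()`.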

Validation:
- Tested exact customer reproduction case (artifact.logged_by())
- Verified config returns dict type consistently
- Confirmed sequential upgrade prevents deadlock pattern
- All existing functionality preserved

Fixes reported issues:
- Pinterest workflows: TypeError: string indices must be integers
- Pinterest training jobs: stuck in Future.result() deadlock
- Affects all workflows using artifact.logged_by() + run.config access
thanos-wandb requested a review from a team as a code owner on October 10, 2025 11:27
codecov Bot commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 73.33333% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| wandb/apis/public/runs.py | 73.33% | 4 Missing ⚠️ |

thanos-wandb changed the title from "fix(sdk): hotfix regression in v0.22.1 lazy loading for config type & threading deadlock" to "fix(sdk): resolve config type regression and threading issues in lazy loading" on Oct 10, 2025
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread wandb/apis/public/runs.py Outdated
Comment thread CHANGELOG.unreleased.md Outdated
Comment thread CHANGELOG.unreleased.md Outdated
Comment thread wandb/apis/public/runs.py Outdated
thanos-wandb and others added 3 commits October 13, 2025 18:32
…cus on config regression

- Simplify defensive type checking as suggested by jacobromero
  (_convert_to_dict already handles dict inputs as noop)
- Revert threading changes since deadlock is fixed in PR #10683
- Update changelog wording to be more positive
- Focus PR solely on the config type regression fix
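
The review note about `_convert_to_dict` treating dict inputs as a no-op is what makes the extra type checks redundant. A minimal sketch of such an idempotent helper follows; `convert_to_dict` here is a hypothetical stand-in, and the real helper's signature and handling of `None` or invalid input are assumptions.

```python
import json


def convert_to_dict(value):
    """Return a dict for dict, JSON-string, or None input (hypothetical helper)."""
    if value is None:
        return {}
    if isinstance(value, dict):
        # No-op: the value was already converted.
        return value
    if isinstance(value, str):
        parsed = json.loads(value)
        if not isinstance(parsed, dict):
            raise TypeError("JSON payload is not an object")
        return parsed
    raise TypeError(f"cannot convert {type(value).__name__} to dict")
```

Because the helper returns dict inputs unchanged, callers can apply it unconditionally without first checking the type.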
thanos-wandb merged commit 76b0d80 into main on Oct 15, 2025
32 of 34 checks passed
thanos-wandb deleted the hotfix/fix-v0.22.1-regressions branch on October 15, 2025 13:40
Development

Successfully merging this pull request may close these issues.

[Bug]: accessing run summary raises exception (starting in 0.22.1)
[Bug]: run.config is a string if loaded via wandb.Api().runs(..., lazy=True)

4 participants