Route lightweight transforms to local samping strategy by Edwardvaneechoud · Pull Request #272 · Edwardvaneechoud/Flowfile

Edwardvaneechoud · 2026-01-28T22:03:07Z

Summary

This PR implements intelligent routing of execution strategies based on node types. Lightweight transforms (select, sample, union, sort, record_count) now use LOCAL_WITH_SAMPLING strategy when running remotely, enabling fast local computation with external sampling for preview data. Wide transforms are further optimized by overriding back to REMOTE when optimizing for downstream nodes.

Key Changes

Strategy Determination Logic: Modified _determine_strategy() to check if a node has has_default_settings=True and route to LOCAL_WITH_SAMPLING for remote execution, while maintaining FULL_LOCAL for local execution and REMOTE for nodes without defaults.
Wide Transform Override: Added logic in execute() to override wide transforms (sort, record_count) from LOCAL_WITH_SAMPLING back to REMOTE when optimizing for downstream nodes, ensuring full data is available for dependent operations.
Enhanced Logging: Improved execution logging to include node type, transform type, default settings status, and strategy override information for better debugging and observability.
Comprehensive Test Coverage: Added extensive test suite covering:
- Strategy determination for different node types (select, sort, manual_input)
- Execution decisions based on invalidation reasons
- Wide transform override behavior
- Verification of all nodes with defaults and their transform types

Implementation Details

Nodes with node_default.has_default_settings=True are considered lightweight transforms suitable for local execution with sampling
The override mechanism preserves the original decision reason while changing only the strategy
Local execution mode (run_location="local") always uses FULL_LOCAL, bypassing the sampling strategy
Test helpers were refactored to support creating graphs with different node types and configurations

https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

_determine_strategy never returned LOCAL_WITH_SAMPLING — all remote execution went to REMOTE regardless of node type. Nodes registered in nodes_with_defaults (select, sample, union, sort, record_count) are lightweight transforms that can compute locally with an external sampler providing preview data. Changes: - Route nodes with has_default_settings to LOCAL_WITH_SAMPLING in _determine_strategy (wide transforms are still overridden to REMOTE by the existing optimize_for_downstream logic in execute()) - Add detailed strategy logging: node_type, transform_type, has_default, and whether the strategy was overridden - Add comprehensive executor tests covering strategy determination for select, sort, and manual_input nodes, the wide-transform override logic, and parametrized tests for all nodes_with_defaults https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

…PLING routing The condition for LOCAL_WITH_SAMPLING should be based on whether the node is a narrow transform, not whether it has default settings. nodes_with_defaults is about whether a node can run without user config — it's unrelated to execution strategy. Narrow transforms (select, sample, union) are lightweight column-level operations safe to compute locally. Wide transforms (sort, record_count) reshape data and should always go REMOTE. Updated tests to reflect that sort goes directly to REMOTE from _determine_strategy, not via the override in execute(). https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

When cache_results is enabled on a node, the result must be fully materialized and stored by the remote worker. This means even narrow transforms (select, sample, union) that would normally use LOCAL_WITH_SAMPLING must go REMOTE. Strategy priority in _determine_strategy: 1. local → FULL_LOCAL 2. cache_results → REMOTE 3. narrow transform → LOCAL_WITH_SAMPLING 4. everything else → REMOTE Added TestCacheResultsStrategy with 5 tests covering cache override for narrow/wide transforms and the full decision flow. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

netlify · 2026-01-28T22:03:12Z

✅ Deploy Preview for flowfile-wasm canceled.

Name	Link
🔨 Latest commit	`821b762`
🔍 Latest deploy log	https://app.netlify.com/projects/flowfile-wasm/deploys/697bac25a8117700083796b1

Both the fast-path (_can_skip_execution_fast) and the executor (_decide_execution) used to fall through to re-execution when a node had already run but no external cache file existed on disk. This happens for LOCAL_WITH_SAMPLING nodes where results live in memory. Now: - Already ran + no cache_results → skip (results in memory) - Already ran + cache_results + cache present → skip - Already ran + cache_results + cache missing → re-run (rebuild cache) - Never ran → run with appropriate strategy - Forced refresh / output node / source changed → run Removed the PERFORMANCE_MODE and SETTINGS_CHANGED re-run paths for already-ran nodes. The fast-path on FlowNode already skipped performance mode, and SETTINGS_CHANGED was misleading (settings hadn't changed — the cache file was just missing). https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

Remove verbose debug info from log line ([node_type=..., transform_type=..., has_default=...]) and the unused initial_strategy variable. The log now shows just: "Starting to run select (NEVER_RAN -> LOCAL_WITH_SAMPLING)" Add "Execution Strategy" section to the Core Internals developer docs covering the three strategies, the decision pipeline, LOCAL_WITH_SAMPLING vs REMOTE flow diagrams, and node classification table. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

The docs described the strategy routing but didn't connect it to the execution modes visible in the frontend. Restructured to lead with what users see (Development vs Performance mode) and then explain the underlying strategies. Added a section for each mode, clarified that LOCAL_WITH_SAMPLING only applies in Development mode, and updated the node classification table with a Performance mode column. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

Sort (reordering) is a wide operation. Narrow transforms do dropping, renaming, and filtering. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

Tests the full preview data path: - TestPreviewAfterExecution: preview returns sample data after run, caps at 100 rows, returns empty before run, doesn't recompute, caches across repeated calls - TestPreviewAfterUpstreamChange: documents that node_stats.has_completed_last_run is NOT reset when settings change, causing stale preview data. Also verifies preview is correct after re-running. Adds create_graph_with_read_and_select helper using fake_data.parquet (1000 rows) for integration-level preview testing. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

When sibling nodes run in parallel and both serialize the same upstream LazyFrame, Polars' internal RefCell can raise "Already borrowed". This is a thread contention issue in the core process during LazyFrame serialization, not a logic error. After a parallel stage completes, any node that failed with "Already borrowed" is now retried sequentially — the contention is gone because no other thread is accessing the shared LazyFrame. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

When sibling nodes run in parallel and both access the same upstream LazyFrame, Polars' internal RefCell raises "Already borrowed". Fix: each node acquires its input nodes' _execution_lock before calling _function(*input_data), holding it through the LazyFrame access. Sibling nodes sharing a parent now serialize their access to the parent's data. Nodes with different parents still run fully parallel. Reverts the retry-based approach in favor of preventing the conflict entirely. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

codecov-commenter · 2026-01-29T17:34:06Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

claude added 3 commits January 28, 2026 20:32

claude and others added 10 commits January 28, 2026 22:04

docs: fix narrow transform description — filtering, not reordering

580aef5

Sort (reordering) is a wide operation. Narrow transforms do dropping, renaming, and filtering. https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

bump version and implement test

b25469e

remove breakpoints

74f79b8

Merge branch 'main' into claude/add-flowfile-cor-tests-HSZny

3d61ca0

chore: clean up verbose comments and polish code for review

821b762

https://claude.ai/code/session_01DaTkkXLAJyC2EfopwiY55d

Edwardvaneechoud merged commit 17c700c into main Jan 29, 2026
23 checks passed

Edwardvaneechoud deleted the claude/add-flowfile-cor-tests-HSZny branch January 29, 2026 19:45

Edwardvaneechoud mentioned this pull request Jan 30, 2026

Write Comprehensive Documentation for Caching and Run Modes #89

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Route lightweight transforms to local samping strategy#272

Route lightweight transforms to local samping strategy#272
Edwardvaneechoud merged 14 commits intomainfrom
claude/add-flowfile-cor-tests-HSZny

Edwardvaneechoud commented Jan 28, 2026

Uh oh!

netlify bot commented Jan 28, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jan 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Edwardvaneechoud commented Jan 28, 2026

Summary

Key Changes

Implementation Details

Uh oh!

netlify bot commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for flowfile-wasm canceled.

Uh oh!

codecov-commenter commented Jan 29, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

netlify bot commented Jan 28, 2026 •

edited

Loading