feat: add processor plugin support #299

Merged: andreatgretel merged 8 commits into main from andreatgretel/feat/processor-plugins-registry on Feb 25, 2026
@andreatgretel andreatgretel commented Feb 5, 2026

Contributor

Summary

Extends the plugin system to support third-party processor plugins via entry points, alongside existing column generator and seed reader plugins.

Changes

Added

Changed

  • PluginRegistry uses RLock instead of Lock to prevent deadlocks when plugin imports trigger re-entry
  • ProcessorConfigT moved from processors.py to processor_types.py for plugin injection
  • Import updates in config_builder.py, data_designer_config.py, validation.py
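
To illustrate the RLock change, here is a minimal sketch (not the actual PluginRegistry code) of what happens when discovery re-enters the registry on the same thread. `ToyRegistry`, `discover`, `get_plugins` and `try_discover` are hypothetical names, and the acquire timeout exists only so the demo reports the problem instead of hanging.

```python
import threading


class ToyRegistry:
    """Hypothetical stand-in for a registry whose discovery can re-enter itself."""

    def __init__(self, lock):
        self._lock = lock
        self._plugins = []

    def get_plugins(self):
        # Bounded acquire so the demo raises instead of blocking forever.
        if not self._lock.acquire(timeout=0.1):
            raise RuntimeError("re-entrant acquire would deadlock")
        try:
            return list(self._plugins)
        finally:
            self._lock.release()

    def discover(self):
        with self._lock:
            # Simulates a plugin import that calls back into the registry
            # on the same thread while discovery still holds the lock.
            already_known = self.get_plugins()
            self._plugins.append(f"plugin-{len(already_known)}")
        return list(self._plugins)


def try_discover(lock):
    """Run discovery and return either the plugin list or the error text."""
    try:
        return ToyRegistry(lock).discover()
    except RuntimeError as exc:
        return str(exc)


with_rlock = try_discover(threading.RLock())  # same thread may re-acquire
with_lock = try_discover(threading.Lock())    # second acquire times out
```

With a plain `Lock` and no timeout, the nested acquire would block indefinitely. `RLock` tracks the owning thread, so the nested call goes through.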

Docs

Attention Areas

Reviewers: Please pay special attention to the following:

Test Plan

  • All 2217 existing tests pass
  • 11 demo plugin tests pass (6 regex filter, 5 semantic dedup)
  • Plugin discovery correctly registers both processor plugins
  • Demo notebook runs end-to-end with live LLM (regex filter: 4→2 rows, semantic dedup verified)
  • CI passes

Description updated with AI

@andreatgretel andreatgretel force-pushed the andreatgretel/feat/processor-plugins branch 7 times, most recently from 403bc69 to a61848e Compare February 11, 2026 20:29
Base automatically changed from andreatgretel/feat/processor-plugins to main February 12, 2026 00:32
Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
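
For readers unfamiliar with entry-point discovery, a rough sketch of the general mechanism using only the standard library follows. The group name `example.plugins` and the helper names are assumptions for illustration, not the names this project uses. The `entry_points(group=...)` form assumes Python 3.10+.

```python
from importlib.metadata import EntryPoint, entry_points

# Hypothetical group name; the real one is defined by the project.
PLUGIN_GROUP = "example.plugins"


def load_plugins(eps):
    """Load each entry point and index the loaded object by entry-point name."""
    return {ep.name: ep.load() for ep in eps}


def discover_installed(group=PLUGIN_GROUP):
    """Discover and load entry points installed in the current environment."""
    return load_plugins(entry_points(group=group))


# Stand-in entry point that resolves to a stdlib object, so the sketch runs
# without installing any plugin package.
fake = EntryPoint(name="regex-filter", value="json:dumps", group=PLUGIN_GROUP)
loaded = load_plugins([fake])
```

A third-party package would declare its entry point in its own packaging metadata. `discover_installed()` would then find it after installation without any change to the host package.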
@andreatgretel andreatgretel force-pushed the andreatgretel/feat/processor-plugins-registry branch from 56ccb15 to 79d6a34 Compare February 19, 2026 18:40
@andreatgretel andreatgretel changed the title from "feat: add processor plugin system" to "feat: add processor plugin support" Feb 19, 2026

Contributor Author

Will remove all these demos! (promise!)

Contributor

@andreatgretel – perhaps we should convert these demos to e2e tests following the pattern in tests_e2e?

Contributor Author

done — moved regex_filter to tests_e2e as a processor plugin e2e test, dropped semantic_dedup (too heavy with sentence-transformers). deleted the whole demo/ dir.

@andreatgretel

Contributor Author

Lots of LoCs but most of them are for the demo and for the plan, both will be removed:

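| Area | Files | Added | Removed |
| --- | --- | --- | --- |
| Demo plugin package | 15 | +470 | 0 |
| Plan | 1 | +122 | 0 |
| Docs | 2 | +56 | -1 |
| Core source | 7 | +42 | -8 |
| Core tests | 1 | +30 | -4 |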

Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.
@andreatgretel andreatgretel marked this pull request as ready for review February 19, 2026 22:19
@andreatgretel andreatgretel requested a review from a team as a code owner February 19, 2026 22:19

Contributor Author

Remove this too I think?

Contributor

Maybe this stays in /plans/<pr#>/processor-plugins.md?

Contributor Author

good idea, moved it to plans/299/processor-plugins.md

@greptile-apps

greptile-apps Bot commented Feb 19, 2026

Contributor

Greptile Summary

This PR extends the plugin system to support third-party processor plugins via entry points. The implementation follows established patterns from column generator and seed reader plugins.

Key changes:

  • Added PluginType.PROCESSOR with processor_type discriminator in plugin.py:23
  • Created processor_types.py following the existing _types module pattern for plugin injection
  • Moved ProcessorConfig from processors.py to base.py for plugin access (proper architectural layering)
  • Changed PluginRegistry._lock from Lock to RLock to prevent re-entry deadlocks during plugin discovery
  • Engine ProcessorRegistry now auto-registers discovered processor plugins in registry.py:28-29
  • Added comprehensive e2e test with regex filter demo plugin
  • Updated documentation in processors.md and plugins/overview.md

Architecture:
The implementation mirrors the existing column generator and seed reader plugin patterns. Plugin processors inherit from ProcessorConfig base class (now in base.py), implement the Processor interface, and register via entry points. The ProcessorConfigT type union in processor_types.py gets plugin types injected at runtime via PluginManager.inject_into_processor_config_type_union().
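
As a rough illustration of the runtime union injection described above, here is a sketch using plain dataclasses and `typing.Union`. It is not the real `inject_into_processor_config_type_union()` implementation; all class and function names below are hypothetical, and the real configs are presumably richer model classes.

```python
from dataclasses import dataclass
from typing import Literal, Union, get_args


@dataclass
class DropColumnsConfig:
    """Hypothetical built-in processor config."""
    processor_type: Literal["drop_columns"] = "drop_columns"


@dataclass
class RenameConfig:
    """Hypothetical built-in processor config."""
    processor_type: Literal["rename"] = "rename"


@dataclass
class RegexFilterConfig:
    """Hypothetical plugin-provided processor config."""
    processor_type: Literal["regex_filter"] = "regex_filter"


# Union of built-in configs, as it exists before plugin discovery.
ProcessorConfigT = Union[DropColumnsConfig, RenameConfig]


def inject_into_union(union_type, plugin_configs):
    """Return a new union that also accepts every plugin config class."""
    for cfg in plugin_configs:
        union_type = Union[union_type, cfg]  # nested unions are flattened
    return union_type


ProcessorConfigT = inject_into_union(ProcessorConfigT, [RegexFilterConfig])
members = get_args(ProcessorConfigT)

# The discriminator value selects the concrete config class.
by_type = {cls().processor_type: cls for cls in members}
```

The point of the separate `_types` module is that the union can be rebuilt after discovery without the config classes themselves importing the plugin machinery.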

Testing:

  • All 2217 existing tests pass
  • New unit test in test_registry.py:30-44 verifies plugin registration
  • E2e test in test_e2e.py:73-103 validates end-to-end plugin functionality with regex filter demo

Confidence Score: 5/5

  • This PR is safe to merge - clean implementation following established plugin patterns with comprehensive tests
  • All changes follow existing architectural patterns (column/seed plugins), proper test coverage added (unit + e2e), critical RLock fix prevents deadlocks, refactoring properly maintains separation of concerns, and all 2217 existing tests pass
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/plugins/plugin.py | Added PluginType.PROCESSOR with processor_type discriminator - clean implementation following existing pattern |
| packages/data-designer-config/src/data_designer/plugins/registry.py | Changed Lock to RLock to prevent re-entry deadlocks during plugin discovery - necessary fix |
| packages/data-designer-config/src/data_designer/config/processor_types.py | New file following _types pattern for plugin injection into ProcessorConfigT union - matches existing architecture |
| packages/data-designer-config/src/data_designer/plugin_manager.py | Added inject_into_processor_config_type_union() method - consistent with existing column/seed plugin injection patterns |
| packages/data-designer-config/src/data_designer/config/base.py | Moved ProcessorConfig base class here from processors.py for plugin access - proper architectural layering |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/registry.py | Added plugin processor registration loop in registry creation - integrates processor plugins into engine |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Plugin Entry Point] -->|Discovered by| B[PluginRegistry]
    B -->|Stores| C[Plugin Instance]
    C -->|References| D[ProcessorConfig Class]
    C -->|References| E[Processor Implementation]
    
    F[PluginManager] -->|Calls| B
    F -->|inject_into_processor_config_type_union| G[ProcessorConfigT Union]
    G -->|Extended with| D
    
    H[create_default_processor_registry] -->|Queries| B
    H -->|Registers| I[ProcessorRegistry]
    I -->|Maps processor_type to| E
    I -->|Maps processor_type to| D
    
    J[Config Builder] -->|Uses| G
    K[Dataset Builder] -->|Uses| I
    
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#bbf,stroke:#333,stroke-width:2px
    style I fill:#bfb,stroke:#333,stroke-width:2px

Last reviewed commit: b25d308
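
The registration step in the flowchart (the default registry querying discovered plugins and mapping `processor_type` to implementation and config classes) could look roughly like the sketch below. Every name here is a hypothetical stand-in; it does not reproduce the actual `create_default_processor_registry` or `ProcessorRegistry` code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlugin:
    """Hypothetical plugin record: a type name plus impl and config classes."""
    processor_type: str
    impl_cls: type
    config_cls: type


class ToyProcessorRegistry:
    """Maps a processor_type string to (implementation class, config class)."""

    def __init__(self):
        self._by_type = {}

    def register(self, processor_type, impl_cls, config_cls):
        if processor_type in self._by_type:
            raise ValueError(f"duplicate processor_type: {processor_type}")
        self._by_type[processor_type] = (impl_cls, config_cls)

    def get(self, processor_type):
        return self._by_type[processor_type]


def create_registry(builtins, plugins):
    """Register built-ins first, then every discovered processor plugin."""
    registry = ToyProcessorRegistry()
    for processor_type, impl_cls, config_cls in builtins:
        registry.register(processor_type, impl_cls, config_cls)
    for plugin in plugins:
        registry.register(plugin.processor_type, plugin.impl_cls, plugin.config_cls)
    return registry


# Placeholder classes standing in for real processors and configs.
class DropColumns: ...
class DropColumnsConfig: ...
class RegexFilter: ...
class RegexFilterConfig: ...


registry = create_registry(
    builtins=[("drop_columns", DropColumns, DropColumnsConfig)],
    plugins=[ToyPlugin("regex_filter", RegexFilter, RegexFilterConfig)],
)
```

Rejecting duplicate type names is a design choice of this sketch; how the real registry handles a plugin that collides with a built-in is not stated in this PR.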

@greptile-apps greptile-apps Bot left a comment

Contributor

28 files reviewed, 9 comments


Comment thread tests_e2e/src/data_designer_e2e_tests/plugins/regex_filter/impl.py
Comment thread demo/data_designer_demo_processors/tests/test_regex_filter.py Outdated
Comment thread demo/data_designer_demo_processors/tests/test_semantic_dedup.py Outdated
Comment on lines 29 to 35

Contributor

I'm thinking we should move this to data_designer.config.base. That's where we have SingleColumnConfig – idea is to have no data_designer deps in that file as a guard against circular deps.

Contributor Author

done — moved to base.py alongside SingleColumnConfig. processors.py re-imports it so existing imports still work.

nabinchha
nabinchha previously approved these changes Feb 25, 2026

Comment on lines +3 to +5
Created: 2026-02-03
Updated: 2026-02-19
Status: Complete

Contributor

Oh I like this status tracker here.

Contributor Author

thanks!

Comment on lines +25 to +46
class SemanticDedupProcessor(Processor[SemanticDedupProcessorConfig]):
    """Removes near-duplicate rows based on embedding cosine similarity."""

    def _initialize(self) -> None:
        _suppress_transformers_logging()
        self._model = SentenceTransformer(self.config.model_name)

    def process_after_generation(self, data: pd.DataFrame) -> pd.DataFrame:
        texts = data[self.config.column].astype(str).tolist()
        if len(texts) <= 1:
            return data

        embeddings = self._model.encode(texts, show_progress_bar=False, normalize_embeddings=True)
        sim_matrix = np.dot(embeddings, embeddings.T)

        keep = set(range(len(texts)))
        for i in range(len(texts)):
            if i not in keep:
                continue
            for j in range(i + 1, len(texts)):
                if j in keep and sim_matrix[i, j] >= self.config.similarity_threshold:
                    keep.discard(j)

Contributor

Perhaps in this demo/example, we can assume embeddings are pre-generated using EmbeddingColumnConfig?

Contributor

oh never mind, I see the comment above that the demo notebook will be removed.

Contributor Author

yep, all removed now; following @johnnygreco's suggestion it became an e2e test instead (regex_filter processor plugin in tests_e2e/)
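
For reference, the greedy near-duplicate filter quoted earlier in this thread can be sketched without the model dependency, assuming embeddings are already available per row as the review suggested. This is a pure-Python illustration with hypothetical names (`normalize`, `greedy_dedup`), not the removed demo code.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def greedy_dedup(embeddings, threshold):
    """Return indices of rows to keep; the earliest row of each group wins."""
    unit = [normalize(v) for v in embeddings]
    kept = []
    for i, vec in enumerate(unit):
        # For unit vectors the dot product equals cosine similarity.
        # A row survives only if it is below the threshold against every
        # row kept so far.
        if all(sum(a * b for a, b in zip(vec, unit[k])) < threshold for k in kept):
            kept.append(i)
    return kept


# Rows 0 and 1 point in nearly the same direction; row 2 is orthogonal.
precomputed = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept = greedy_dedup(precomputed, threshold=0.9)
```

Like the quoted loop, this compares each row only against rows that were kept, so the result depends on row order.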

- Move ProcessorConfig from processors.py to config.base to guard
  against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/
@andreatgretel andreatgretel merged commit 982ce79 into main Feb 25, 2026
47 checks passed
@andreatgretel andreatgretel deleted the andreatgretel/feat/processor-plugins-registry branch April 14, 2026 11:56
3 participants