feat: add processor plugin support #299
Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
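For context on the RLock bullet, here is a minimal sketch of the re-entry problem. It assumes discovery runs while the registry lock is held and that a plugin import calls back into the registry on the same thread. `ToyRegistry` and its methods are illustrative stand-ins, not the actual PluginRegistry code.

```python
import threading


class ToyRegistry:
    """Illustrative stand-in for a registry that discovers plugins lazily."""

    def __init__(self, lock_factory=threading.RLock):
        self._lock = lock_factory()
        self._plugins = {}
        self._discovered = False

    def _discover(self):
        with self._lock:  # second acquisition by the same thread
            if self._discovered:
                return
            self._discovered = True  # set first so the nested call returns early
            self._plugins["regex_filter"] = object()
            # Simulates a plugin module that queries the registry while it is
            # still being imported during discovery.
            self.plugin_names()

    def plugin_names(self):
        with self._lock:  # first acquisition
            self._discover()
            return sorted(self._plugins)


registry = ToyRegistry()
names = registry.plugin_names()
```

With `threading.Lock` passed as `lock_factory`, the same call would block forever at the second acquisition, because a plain lock cannot be re-acquired by the thread that already holds it.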
Will remove all these demos! (promise!)
@andreatgretel – perhaps we should convert these demos to e2e tests following the pattern in tests_e2e?
done — moved regex_filter to tests_e2e as a processor plugin e2e test, dropped semantic_dedup (too heavy with sentence-transformers). deleted the whole demo/ dir.
Lots of LoC, but most of it is for the demo and the plan, both of which will be removed.
Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly.
Remove this too I think?
Maybe this stays in /plans/<pr#>/processor-plugins.md?
good idea, moved it to plans/299/processor-plugins.md
Greptile Summary
This PR extends the plugin system to support third-party processor plugins via entry points. The implementation follows established patterns from column generator and seed reader plugins.
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/plugins/plugin.py | Added PluginType.PROCESSOR with processor_type discriminator - clean implementation following existing pattern |
| packages/data-designer-config/src/data_designer/plugins/registry.py | Changed Lock to RLock to prevent re-entry deadlocks during plugin discovery - necessary fix |
| packages/data-designer-config/src/data_designer/config/processor_types.py | New file following _types pattern for plugin injection into ProcessorConfigT union - matches existing architecture |
| packages/data-designer-config/src/data_designer/plugin_manager.py | Added inject_into_processor_config_type_union() method - consistent with existing column/seed plugin injection patterns |
| packages/data-designer-config/src/data_designer/config/base.py | Moved ProcessorConfig base class here from processors.py for plugin access - proper architectural layering |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/registry.py | Added plugin processor registration loop in registry creation - integrates processor plugins into engine |
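The "plugin injection into the ProcessorConfigT union" idea from the table can be sketched with the standard library alone. This is an illustration under assumptions: the three config classes and the `inject_into_union` / `parse_config` helpers are hypothetical, and the real `inject_into_processor_config_type_union()` signature is not shown in this PR page.

```python
from dataclasses import dataclass
from typing import Any, Union, get_args


@dataclass
class DropColumnsConfig:  # hypothetical built-in config
    columns: list
    processor_type: str = "drop_columns"


@dataclass
class RenameConfig:  # hypothetical built-in config
    mapping: dict
    processor_type: str = "rename"


@dataclass
class RegexFilterConfig:  # hypothetical plugin-provided config
    column: str
    pattern: str
    processor_type: str = "regex_filter"


# Union of built-in configs before any plugin is injected.
ProcessorConfigT = Union[DropColumnsConfig, RenameConfig]


def inject_into_union(union: Any, plugin_configs: list) -> Any:
    """Extend a union with plugin config classes (typing flattens nested unions)."""
    for config_cls in plugin_configs:
        union = Union[union, config_cls]
    return union


def parse_config(union: Any, raw: dict) -> Any:
    """Pick the union member whose processor_type matches the raw payload."""
    for config_cls in get_args(union):
        if config_cls.processor_type == raw.get("processor_type"):
            return config_cls(**raw)
    raise ValueError(f"unknown processor_type: {raw.get('processor_type')!r}")


extended = inject_into_union(ProcessorConfigT, [RegexFilterConfig])
config = parse_config(
    extended, {"processor_type": "regex_filter", "column": "text", "pattern": "^a"}
)
```

Before injection, the same payload is rejected, which is why the union has to be extended before configs are validated.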
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Plugin Entry Point] -->|Discovered by| B[PluginRegistry]
    B -->|Stores| C[Plugin Instance]
    C -->|References| D[ProcessorConfig Class]
    C -->|References| E[Processor Implementation]
    F[PluginManager] -->|Calls| B
    F -->|inject_into_processor_config_type_union| G[ProcessorConfigT Union]
    G -->|Extended with| D
    H[create_default_processor_registry] -->|Queries| B
    H -->|Registers| I[ProcessorRegistry]
    I -->|Maps processor_type to| E
    I -->|Maps processor_type to| D
    J[Config Builder] -->|Uses| G
    K[Dataset Builder] -->|Uses| I
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#bbf,stroke:#333,stroke-width:2px
    style I fill:#bfb,stroke:#333,stroke-width:2px
```
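The registry half of the flowchart (a registry creation step that registers plugin processors and maps `processor_type` to a config class and an implementation) could look roughly like the sketch below. `ToyPlugin`, `ToyProcessorRegistry`, and `create_registry` are simplified, hypothetical names; they are not the engine's actual classes or the real `create_default_processor_registry` signature.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ToyPlugin:
    name: str         # value of the processor_type discriminator
    config_cls: type  # config class for this processor
    impl_cls: type    # processor implementation class


class ToyProcessorRegistry:
    """Maps processor_type to its (config class, implementation class) pair."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[type, type]] = {}

    def register(self, processor_type: str, config_cls: type, impl_cls: type) -> None:
        if processor_type in self._entries:
            raise ValueError(f"duplicate processor_type: {processor_type!r}")
        self._entries[processor_type] = (config_cls, impl_cls)

    def get_config_cls(self, processor_type: str) -> type:
        return self._entries[processor_type][0]

    def get_impl_cls(self, processor_type: str) -> type:
        return self._entries[processor_type][1]


def create_registry(
    builtins: Iterable[ToyPlugin], plugins: Iterable[ToyPlugin]
) -> ToyProcessorRegistry:
    """Register built-in processors first, then plugin-provided ones."""
    registry = ToyProcessorRegistry()
    for entry in [*builtins, *plugins]:
        registry.register(entry.name, entry.config_cls, entry.impl_cls)
    return registry


class DropConfig: ...
class DropProcessor: ...
class RegexConfig: ...
class RegexProcessor: ...


registry = create_registry(
    builtins=[ToyPlugin("drop_columns", DropConfig, DropProcessor)],
    plugins=[ToyPlugin("regex_filter", RegexConfig, RegexProcessor)],
)
```

Rejecting duplicate `processor_type` values is a design choice of this sketch; how the real registry handles name collisions is not stated in this PR.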
Last reviewed commit: b25d308
I'm thinking we should move this to data_designer.config.base. That's where we have SingleColumnConfig – idea is to have no data_designer deps in that file as a guard against circular deps.
done — moved to base.py alongside SingleColumnConfig. processors.py re-imports it so existing imports still work.
Created: 2026-02-03
Updated: 2026-02-19
Status: Complete
Oh I like this status tracker here.
```python
class SemanticDedupProcessor(Processor[SemanticDedupProcessorConfig]):
    """Removes near-duplicate rows based on embedding cosine similarity."""

    def _initialize(self) -> None:
        _suppress_transformers_logging()
        self._model = SentenceTransformer(self.config.model_name)

    def process_after_generation(self, data: pd.DataFrame) -> pd.DataFrame:
        texts = data[self.config.column].astype(str).tolist()
        if len(texts) <= 1:
            return data

        embeddings = self._model.encode(texts, show_progress_bar=False, normalize_embeddings=True)
        sim_matrix = np.dot(embeddings, embeddings.T)

        keep = set(range(len(texts)))
        for i in range(len(texts)):
            if i not in keep:
                continue
            for j in range(i + 1, len(texts)):
                if j in keep and sim_matrix[i, j] >= self.config.similarity_threshold:
                    keep.discard(j)
```
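The greedy keep/discard loop in this hunk can be tried without sentence-transformers by feeding it vectors directly. The sketch below is a dependency-free illustration of the same first-occurrence-wins rule; `normalize` and `greedy_dedup` are hypothetical helper names, and the toy vectors stand in for real embeddings.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def greedy_dedup(embeddings: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return row indices to keep; a row is dropped when its cosine similarity
    to any earlier kept row is at or above the threshold."""
    unit = [normalize(vec) for vec in embeddings]
    kept: List[int] = []
    for i, vec in enumerate(unit):
        is_duplicate = any(
            sum(a * b for a, b in zip(vec, unit[k])) >= threshold for k in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


# Rows 0 and 1 point in almost the same direction; row 2 is orthogonal to row 0.
kept_rows = greedy_dedup([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]], threshold=0.95)
```

Like the loop in the hunk, this compares every row against the rows kept so far, so the cost grows quadratically with the number of rows.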
Perhaps in this demo/example, we can assume embeddings are pre-generated using EmbeddingColumnConfig?
oh nevermind, I see the comment above that the demo notebook will be removed....
yep, all removed now; following @johnnygreco's suggestion it became an e2e test instead (regex_filter processor plugin in tests_e2e/)
- Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/
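For readers who have not seen the regex_filter plugin, a row filter of that kind might look like the sketch below. It assumes "filter" means keeping rows whose column value matches the pattern; the function name, the `invert` flag, and the list-of-dicts row format are illustrative choices, not the plugin's actual interface.

```python
import re
from typing import Dict, List


def regex_filter(
    rows: List[Dict[str, object]], column: str, pattern: str, invert: bool = False
) -> List[Dict[str, object]]:
    """Keep rows whose `column` value matches `pattern` (or drop them if invert=True)."""
    compiled = re.compile(pattern)
    return [
        row for row in rows
        if bool(compiled.search(str(row[column]))) != invert
    ]


rows = [{"text": "hello world"}, {"text": "spam offer"}, {"text": "hello again"}]
matching = regex_filter(rows, column="text", pattern=r"^hello")
non_matching = regex_filter(rows, column="text", pattern=r"^hello", invert=True)
```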
Summary
Extends the plugin system to support third-party processor plugins via entry points, alongside existing column generator and seed reader plugins.
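As background on "via entry points": packages can advertise objects under a named entry-point group, and a host application can look them up with `importlib.metadata`. The sketch below shows that lookup in isolation; the group name is a made-up placeholder, since the group the project actually uses is not shown here.

```python
from importlib.metadata import entry_points

# Hypothetical group name, used only for illustration.
PLUGIN_GROUP = "example_app.plugins"


def discover_plugins(group: str = PLUGIN_GROUP) -> dict:
    """Return {entry point name: loaded object} for every entry point in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        selected = eps.select(group=group)
    else:  # Python 3.8/3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


plugins = discover_plugins()
```

A third-party package would declare its plugin under the same group in its packaging metadata; installing the package is then enough for the lookup to find it.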
Changes
Added
- PluginType.PROCESSOR with processor_type discriminator
- processor_types.py: ProcessorConfigT type union with plugin injection (follows column_types.py / seed_source_types.py pattern)
- inject_into_processor_config_type_union() in PluginManager
- RegexFilterProcessor: filters rows by regex pattern (process_before_batch)
- SemanticDedupProcessor: removes near-duplicate rows via embedding similarity (process_after_generation)

Changed
- PluginRegistry uses RLock instead of Lock to prevent deadlocks when plugin imports trigger re-entry
- ProcessorConfigT moved from processors.py to processor_types.py for plugin injection
- config_builder.py, data_designer_config.py, validation.py

Docs

Attention Areas
- processor_types.py: New file following the _types module pattern
- registry.py (PluginRegistry): Lock → RLock change to prevent deadlocks during plugin discovery with nested imports

Test Plan
Description updated with AI