Add catalog table management and lineage tracking #346

Merged
Edwardvaneechoud merged 34 commits into main from claude/improve-catalog-manager-7XNB2
Mar 13, 2026
Conversation

@Edwardvaneechoud
Owner

Summary

This PR adds catalog table management to Flowfile: table registration, preview, deletion, and lineage tracking between flows and tables. It introduces new UI components for browsing and managing tables, backend services for table operations, and flow designer integration through the new CatalogReader and CatalogWriter nodes.

Key Changes

Frontend - Catalog Management

  • New Components:

    • TableDetailPanel.vue: Displays table metadata, schema, data preview, and read-by flows
    • RegisterTableModal.vue: Modal for registering new tables from data files
    • CreateNamespaceModal.vue: Extracted modal for creating catalogs/schemas
    • RegisterFlowModal.vue: Extracted modal for registering flows
  • Enhanced CatalogView:

    • Added search and filter functionality for catalog items
    • Added "Register Table" button in sidebar
    • Integrated table selection and detail view
    • Added table display in catalog tree with search/filter support
    • Displays tables produced by flows in FlowDetailPanel
  • Read Node Enhancement:

    • Added "Browse Catalog" button to select tables from catalog
    • Dialog for browsing and selecting catalog tables

Frontend - Node Types

  • CatalogReader Node: New node type for reading tables from the catalog with namespace/table selection and schema preview
  • CatalogWriter Node: New node type for writing data to catalog with configurable table name, namespace, and write mode
  • Added SVG icons for both new node types

Backend - Catalog Service

  • Table Operations:

    • get_table(), list_tables(), create_table(), update_table(), delete_table()
    • Table preview generation with configurable row limits
    • Table metadata enrichment (row count, column count, size)
  • Lineage Tracking:

    • list_tables_for_flow(): Get tables produced by a flow
    • list_readers_for_table(): Get flows that read a table
    • upsert_read_link(): Track read relationships between flows and tables
    • Bulk operations to eliminate N+1 queries
  • API Endpoints:

    • GET/POST /tables: List and register tables
    • GET /tables/{id}: Get table details
    • GET /tables/{id}/preview: Get data preview
    • DELETE /tables/{id}: Delete table
    • POST /tables/{id}/read-links: Track table reads
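
The lineage operations listed above can be illustrated with a minimal sketch of an idempotent read-link upsert. The table layout, column names, and the use of `sqlite3` are assumptions for illustration only; the actual CatalogTableReadLink model and repository code in this PR may differ.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical junction table; the real CatalogTableReadLink columns may differ.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS catalog_table_read_links (
            table_id INTEGER NOT NULL,
            registration_id INTEGER NOT NULL,
            PRIMARY KEY (table_id, registration_id)
        )
        """
    )


def upsert_read_link(conn: sqlite3.Connection, table_id: int, registration_id: int) -> None:
    # INSERT OR IGNORE turns a repeated call for the same pair into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO catalog_table_read_links (table_id, registration_id) "
        "VALUES (?, ?)",
        (table_id, registration_id),
    )


def list_readers_for_table(conn: sqlite3.Connection, table_id: int) -> list[int]:
    # Returns the registration ids of all flows that read the given table.
    rows = conn.execute(
        "SELECT registration_id FROM catalog_table_read_links "
        "WHERE table_id = ? ORDER BY registration_id",
        (table_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
upsert_read_link(conn, 1, 10)
upsert_read_link(conn, 1, 10)  # same link again: ignored
upsert_read_link(conn, 1, 11)
readers = list_readers_for_table(conn, 1)
```

The composite primary key is what makes the upsert safe to call on every save or run without first checking whether the link exists.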

Database

  • New Models:

    • CatalogTable: Stores table metadata (name, description, file path, row/column counts, size)
    • CatalogTableReadLink: Tracks which flows read which tables
  • Schema Updates:

    • Added tables field to NamespaceTree
    • Added tables_produced field to FlowRegistrationOut
    • New schemas: CatalogTableOut, CatalogTableCreate, CatalogTableUpdate, CatalogTablePreview, CatalogTableSummary

Flow Graph Integration

  • CatalogReader Node Handler: Resolves catalog table by ID or name, reads materialized Parquet file, generates schema callback
  • CatalogWriter Node Handler: Writes output data to catalog as Parquet, creates/updates table metadata, tracks lineage
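
As a rough sketch of the "by ID or name" resolution the CatalogReader handler performs, the lookup could be structured as below. `TableRef`, `resolve_table`, and the local `TableNotFoundError` are hypothetical stand-ins, not the handler's actual code, and the in-memory list replaces what would be a repository query.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TableRef:
    # Hypothetical, reduced view of a catalog table record.
    id: int
    name: str
    namespace_id: Optional[int]
    file_path: str


class TableNotFoundError(LookupError):
    """Local stand-in for the domain exception of the same name."""


def resolve_table(
    tables: Iterable[TableRef],
    table_id: Optional[int] = None,
    name: Optional[str] = None,
    namespace_id: Optional[int] = None,
) -> TableRef:
    tables = list(tables)
    # An explicit id takes precedence over name-based lookup.
    if table_id is not None:
        for table in tables:
            if table.id == table_id:
                return table
        raise TableNotFoundError(f"no table with id {table_id}")
    if name is not None:
        # Names are assumed to be unique only within a namespace.
        for table in tables:
            if table.name == name and table.namespace_id == namespace_id:
                return table
        raise TableNotFoundError(f"no table named {name!r} in namespace {namespace_id}")
    raise ValueError("either table_id or name must be provided")


catalog = [
    TableRef(1, "orders", 5, "/data/orders.parquet"),
    TableRef(2, "orders", 6, "/data/orders_v2.parquet"),
]
by_id = resolve_table(catalog, table_id=2)
by_name = resolve_table(catalog, name="orders", namespace_id=5)
```

Preferring the ID when both are given keeps a node stable if a table is later renamed, at the cost of breaking when the table is deleted and re-created.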

State Management

  • Extended catalog-store with table selection, preview loading, and all tables list
  • Added table-related actions and getters

Notable Implementation Details

  • Table data is materialized as Parquet files in catalog_tables_directory
  • Lineage is tracked bidirectionally: flows know what tables they produce, tables know what flows read them
  • Table preview is lazy-loaded and configurable
  • Search/filter in catalog tree supports both flows and tables
  • Modal components extracted for reusability and cleaner code organization
  • Bulk database queries used to prevent N+1 problems when enriching flow data
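
To make the N+1 point concrete, here is a minimal sketch of fetching the tables produced by many flows in one query instead of one query per flow. The schema is a simplified assumption, and although the function name mirrors the `bulk_get_tables_for_flows` mentioned in the test notes, this is not the repository's implementation.

```python
import sqlite3
from typing import Iterable


def bulk_get_tables_for_flows(
    conn: sqlite3.Connection, flow_ids: Iterable[int]
) -> dict[int, list[str]]:
    # De-duplicate while preserving order; every requested flow gets an entry.
    ids = list(dict.fromkeys(flow_ids))
    result: dict[int, list[str]] = {flow_id: [] for flow_id in ids}
    if not ids:
        return result
    placeholders = ",".join("?" for _ in ids)
    # One IN (...) query replaces len(ids) separate per-flow queries.
    rows = conn.execute(
        "SELECT source_registration_id, name FROM catalog_tables "
        f"WHERE source_registration_id IN ({placeholders}) ORDER BY name",
        ids,
    )
    for flow_id, name in rows:
        result[flow_id].append(name)
    return result


conn = sqlite3.connect(":memory:")
# Hypothetical, reduced catalog_tables schema.
conn.execute(
    "CREATE TABLE catalog_tables ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, source_registration_id INTEGER)"
)
conn.executemany(
    "INSERT INTO catalog_tables (name, source_registration_id) VALUES (?, ?)",
    [("orders", 1), ("customers", 1), ("invoices", 2), ("uploaded", None)],
)
tables_by_flow = bulk_get_tables_for_flows(conn, [1, 2, 3])
```

Flows that produce no tables still appear in the result with an empty list, so callers can enrich every flow without a second lookup.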

claude and others added 18 commits February 28, 2026 08:26
Implements the catalog table feature that allows users to register data
files (CSV, Parquet, Excel) as materialized Parquet tables in the catalog.
Tables appear in the catalog tree alongside flows and artifacts, with
schema metadata, row/column counts, and data preview capabilities.

Backend:
- CatalogTable SQLAlchemy model with schema_json, row_count, size_bytes
- Pydantic schemas (CatalogTableCreate/Out/Preview, ColumnSchema)
- Repository layer with full CRUD + namespace queries
- Service layer with Polars-based materialization, preview (first N rows)
- REST endpoints: GET/POST/PUT/DELETE /catalog/tables, GET preview
- catalog_tables_directory in shared storage config
- TableNotFoundError, TableExistsError domain exceptions

Frontend:
- CatalogTable TypeScript types and API client methods
- Catalog store with table state, selection, and preview loading
- TableDetailPanel component with metadata grid, schema table, data preview
- CatalogTreeNode updated with table items (green table icon, row count)
- CatalogView with Register Table modal and table detail integration
- Browse Catalog button in Read node settings for selecting catalog tables

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
The Register Table action was previously only discoverable by hovering
over schema-level tree nodes. This adds a visible table icon button
in the sidebar header and a namespace selector dropdown in the
registration modal so users can register tables without navigating
the tree first.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
- Fix: Register Table button was greyed out because file selection
  required double-click. Now single-click file selection also captures
  the file path, enabling the Register button immediately.

- Add "Publish to Catalog" checkbox to the output (write) node that
  registers the written file as a catalog table after execution.
  Includes optional table name and namespace selector fields.

- Backend: Add publish_to_catalog, catalog_table_name, and
  catalog_namespace_id fields to OutputSettings. After output writes,
  if publish_to_catalog is true, auto-register via CatalogService.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
Documents the end-to-end flow: UI settings, flow graph execution,
CatalogService materialization to Parquet, database schema, file
layout, and future Iceberg integration path.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
… modals

- Add CatalogReader node to read tables from the catalog into flows
- Add CatalogWriter node to write flow data to catalog as Parquet tables
- Revert publish_to_catalog from Output node in favor of dedicated nodes
- Add search and "show unavailable" filter to catalog sidebar
- Add file_exists field to CatalogTable for availability tracking
- Extract modals into CreateNamespaceModal, RegisterFlowModal, RegisterTableModal
- Remove outdated catalog-publish-from-output-node.md docs
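
One way the `file_exists` flag and the "show unavailable" filter could interact is sketched below. `TableEntry`, `refresh_availability`, and `visible_tables` are hypothetical names, and the check assumes each table's Parquet file lives on the local filesystem.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TableEntry:
    # Hypothetical, reduced catalog table record.
    name: str
    file_path: str
    file_exists: bool = False


def refresh_availability(entries: list[TableEntry]) -> list[TableEntry]:
    # A table counts as available only if its backing file is present.
    for entry in entries:
        entry.file_exists = Path(entry.file_path).is_file()
    return entries


def visible_tables(entries: list[TableEntry], show_unavailable: bool = False) -> list[TableEntry]:
    # Unavailable tables are hidden unless the filter is switched on.
    return [e for e in entries if show_unavailable or e.file_exists]


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "orders.parquet"
    present.write_bytes(b"")  # placeholder file; contents are irrelevant here
    entries = refresh_availability(
        [
            TableEntry("orders", str(present)),
            TableEntry("ghost", str(Path(tmp) / "ghost.parquet")),
        ]
    )
    default_names = [e.name for e in visible_tables(entries)]
    all_names = [e.name for e in visible_tables(entries, show_unavailable=True)]
```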

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
…ardvaneechoud/Flowfile into claude/improve-catalog-manager-7XNB2
- Fix icon loading: add catalog_reader.svg and catalog_writer.svg to
  BUILTIN_ICONS set so they're served from bundled assets instead of
  being looked up in user_defined_nodes/icons/
- Add source_registration_id and source_run_id columns to CatalogTable
  to track which flow produced a table
- Add CatalogTableReadLink junction table to track which flows read
  from which tables (populated when catalog_reader node resolves)
- Show "Produced by" flow link in table detail panel
- Show "Read by Flows" list in table detail panel
- Show "Tables Produced" list in flow detail panel
- Add DB migration for new columns on catalog_tables
- Add FlowSummary and CatalogTableSummary schemas for lightweight refs

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
…ardvaneechoud/Flowfile into claude/improve-catalog-manager-7XNB2
The read link was being recorded in add_catalog_reader() which runs
when the node is configured in the designer, before the flow has a
source_registration_id. Move the upsert_read_link call into _func()
so it executes at flow runtime when the registration context is set.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
Instead of recording which catalog tables a flow reads during execution
(_func closure), record the links when the flow is saved. This ensures
source_registration_id is always available and aligns with user
expectations that lineage is captured on save.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
The source_registration_id was None at save time because it was only
resolved before flow execution. Now the save_flow route looks up the
flow registration by path before calling save_flow, so that
_sync_catalog_read_links can record the read relationships.
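
The ordering described here (resolve the registration by path first, then sync read links, and skip when no registration exists) can be sketched as follows. The function signatures and the in-memory `registrations` and `links` containers are illustrative assumptions, not the route's actual code.

```python
from typing import Iterable, Optional


def sync_catalog_read_links(
    reader_table_ids: Iterable[int],
    registration_id: Optional[int],
    links: set[tuple[int, int]],
) -> int:
    """Record (table_id, registration_id) pairs and return how many were new."""
    if registration_id is None:
        # An unregistered flow has nothing to attach lineage to.
        return 0
    before = len(links)
    links.update((table_id, registration_id) for table_id in reader_table_ids)
    return len(links) - before


def save_flow(
    path: str,
    reader_table_ids: list[int],
    registrations: dict[str, int],
    links: set[tuple[int, int]],
) -> int:
    # Look up the registration by flow path before syncing, so the id is
    # known at save time rather than only at execution time.
    registration_id = registrations.get(path)
    # (Serializing the flow itself is omitted from this sketch.)
    return sync_catalog_read_links(reader_table_ids, registration_id, links)


registrations = {"flows/sales.yaml": 42}
links: set[tuple[int, int]] = set()
first_save = save_flow("flows/sales.yaml", [1, 2], registrations, links)
second_save = save_flow("flows/sales.yaml", [1, 2], registrations, links)
unregistered = save_flow("flows/scratch.yaml", [3], registrations, links)
```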

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
- Remove max-width: fit-content from catalog-detail so it fills the screen
- Change catalog reader/writer icon color from indigo (#6366F1) to deep
  green (#16a34a)
- Increase CATALOG label font-size from 12 to 16 in both SVGs

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
- Produced by: truncate long flow names with ellipsis, show full name
  on hover via title attribute
- Read by Flows: replaced inline chip list with a clickable meta card
  that opens a modal dialog listing all reading flows

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
@netlify

netlify bot commented Mar 4, 2026

Deploy Preview for flowfile-wasm canceled.

Latest commit: 16d6d88
Latest deploy log: https://app.netlify.com/projects/flowfile-wasm/deploys/69b44266834e240008e41f1b

claude and others added 5 commits March 4, 2026 17:19
Quality fixes:
- Move json, pathlib.Path, uuid to top-level imports in service.py
- Move sqlalchemy.func to top-level import in repository.py
- Remove unnecessary getattr() for source_registration_id and
  source_run_id in _table_to_out — these are proper model columns
- Add CatalogTableReadLink and CatalogTable to test cleanup

Tests added (16 new tests):
- TestReadLinks: upsert idempotency, list_readers_for_table,
  list_read_tables_for_flow, multiple readers per table
- TestTableLineage: source_registration_id storage and nullability,
  list_tables_for_flow, bulk_get_tables_for_flows
- TestServiceLineageEnrichment: source_registration_name enrichment,
  read_by_flows enrichment, tables_produced enrichment

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
Guard navigateToFlow emit with source_registration_id null check
so TypeScript can narrow the type from number | null to number.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
…ardvaneechoud/Flowfile into claude/improve-catalog-manager-7XNB2
Tests (8 new):
- TestCatalogWriter: table creation, source_registration_id lineage,
  overwrite mode replaces existing table
- TestCatalogReader: load by table ID, load by name + namespace
- TestSyncCatalogReadLinks: save_flow records read links, skips when
  no source_registration_id
- TestCatalogRoundTrip: write → read preserves data and column names

Quality fixes in add_catalog_writer:
- Remove redundant `import logging` / local logger (module-level
  logger from flowfile_core.configs already available)
- Remove redundant lazy imports of CatalogService, repository, and
  get_db_context (already imported at module top)
- Replace lazy `import uuid` with top-level `uuid4` import

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
claude and others added 5 commits March 5, 2026 06:58
Capture defineEmits return as 'emit' and use it instead of $emit() in
the template. This resolves the ESLint vue/require-explicit-emits rule
violations on Windows CI where warnings are treated as errors.

- Replace $emit('deleteTable', ...) with emit('deleteTable', ...)
- Replace $emit('navigateToFlow', ...) with emit('navigateToFlow', ...)
- Extract inline modal click handler into handleReadByClick function

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
Collapse multi-line function signatures and extract template literal
URL to a variable to avoid parser confusion with generics spanning
multiple lines. The vue-eslint-parser (used as top-level parser)
could not parse `axios.get<Type>(\`template\`)` when split across
lines on Windows CI.

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
Critical fixes:
- Fix double-materialization: add register_table_from_parquet() so the
  catalog writer node doesn't re-copy an already-written Parquet file
- Add cascade delete of CatalogTableReadLink rows when deleting a table
- Replace pandas df.to_pandas().values.tolist() with Polars df.rows()
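
For the cascade delete of read links, one possible approach is a database-level `ON DELETE CASCADE`, sketched here with `sqlite3`. This is an assumption about the mechanism: the PR may instead delete the link rows explicitly in the repository layer, and the schema below is simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE catalog_tables (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE catalog_table_read_links (
        table_id INTEGER NOT NULL REFERENCES catalog_tables(id) ON DELETE CASCADE,
        registration_id INTEGER NOT NULL,
        PRIMARY KEY (table_id, registration_id)
    )
    """
)
conn.executemany(
    "INSERT INTO catalog_tables (id, name) VALUES (?, ?)",
    [(1, "orders"), (2, "customers")],
)
conn.executemany(
    "INSERT INTO catalog_table_read_links (table_id, registration_id) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10)],
)

# Deleting the table also removes its read links, leaving no orphaned rows.
conn.execute("DELETE FROM catalog_tables WHERE id = 1")
remaining = conn.execute(
    "SELECT table_id, registration_id FROM catalog_table_read_links ORDER BY table_id"
).fetchall()
```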

Design improvements:
- Batch _sync_catalog_read_links into a single DB session instead of
  opening one session per catalog_reader node
- Move file_path from query parameter to CatalogTableCreate request body
- Add total_tables stat card to StatsPanel.vue

Minor fixes:
- Use logger.error (not warning) for re-raised exceptions in catalog writer
- Use lazy logger formatting (%s) instead of f-strings
- Remove redundant list_all_tables repository method
- Remove unnecessary (table as any) cast in CatalogTreeNode.vue
- Remove unused CatalogTable import in CatalogTreeNode.vue

https://claude.ai/code/session_01AcexA8fgAu5D4apWsGE6AV
@Edwardvaneechoud merged commit 961507b into main Mar 13, 2026
27 checks passed
@Edwardvaneechoud deleted the claude/improve-catalog-manager-7XNB2 branch March 13, 2026 17:26