Skip to content

[Feature Request - RAGFlow+OceanBase] Add OceanBase Storage Support for Table Parser #12770

@yuzhichang

Description

@yuzhichang

Overview

RAGFlow's Table Parser is a chunking method that parses Excel/CSV files and stores the structured data for retrieval in chat conversations. Currently it supports storing parsed table data in Elasticsearch and Infinity. This feature request aims to add OceanBase as a storage backend for the Table Parser, enabling users to query Excel data stored in OceanBase through natural language in chat.

Technical Details

Location

  • rag/app/table.py - Table parser implementation (lines 460-495)
  • rag/utils/ob_conn.py:1632 - OceanBase connection with TODO for sql() function

Current Implementation

The Table Parser processes Excel/CSV files and stores data differently based on the document engine:

  1. For Infinity (DOC_ENGINE_INFINITY):

    • Stores data in a JSON column chunk_data with original column names as keys
    • Flexible schema handling via JSON
    data_json[str(clmns[j])] = row[clmns[j]]
    d["chunk_data"] = data_json
  2. For Elasticsearch/OpenSearch (default):

    • Stores data as individual fields with type suffixes
    • Type suffixes: _tks (text tokens), _kwd (keyword), _long (integer), _flt (float), _dt (datetime), _bool (boolean)
    fld = clmns_map[j][0]  # e.g., "name_kwd", "age_long"
    d[fld] = row[clmns[j]] if clmn_tys[j] != "text" else rag_tokenizer.tokenize(row[clmns[j]])

Implementation Plan

  1. Add OceanBase storage logic in rag/app/table.py (after line 495)

    elif settings.DOC_ENGINE_OCEANBASE:
        # Store in OceanBase using JSON column for flexibility
        data_json[str(clmns[j])] = row[clmns[j]]
    d["chunk_data"] = data_json
  2. Implement sql() function in rag/utils/ob_conn.py (line 1632)

    def sql(self, sql_text: str, fetch_size: int =1024, format: str = "json"):
        """
        Execute SQL query on OceanBase for table data retrieval
        
        Args:
            sql_text: SQL query string (generated by LLM via text-to-SQL)
            fetch_size: Maximum number of rows to return
            format: Output format ("json" or "markdown")
        
        Returns:
            Query results in specified format
        """
        # 1. Parse and validate SQL
        # 2. Execute query against OceanBase
        # 3. Handle JSON column data (chunk_data)
        # 4. Format results (JSON or Markdown table)
        # 5. Apply fetch_size limit
  3. Key Technical Challenges:

    • Dynamic Schema: Excel files have different columns, use JSON column for flexibility
    • Data Type Mapping: Map Python types to OceanBase types (int, float, text, datetime, bool)
    • Text-to-SQL Integration: Generate SQL from natural language queries
    • Result Formatting: Support both JSON and Markdown output formats

Core Functionality

The Table Parser provides:

  • Excel/CSV Parsing: Multi-sheet support, header detection, data type inference
  • Data Type Detection: Automatically detects int, float, text, datetime, bool types
  • Tokenization: Text fields are tokenized for full-text search
  • Field Mapping: Creates pinyin-based field names for Chinese content
  • Chunk Creation: Each row becomes a searchable chunk with metadata
  • Image Extraction: Extracts and describes images within Excel cells

Estimated Effort

  • Code size: ~150 lines (table.py modifications + sql() implementation)
  • Difficulty: ⭐⭐⭐ Medium
  • Priority: 🟡 Medium

Related Files

  • rag/app/table.py - Table parser (modify storage logic)
  • rag/utils/ob_conn.py - OceanBase connection (implement sql() function)
  • deepdoc/parser/excel_parser.py - Excel parsing utilities
  • common/settings.py - DOC_ENGINE configuration

Acceptance Criteria

Provide screenshot(s) demonstrating:

  1. Upload an Excel file to a knowledge base with OceanBase as the document engine
  2. Configure the Table Parser as the chunking method
  3. Successfully parse and store the Excel data in OceanBase
  4. Query the stored table data through chat using natural language
  5. Display query results correctly (Markdown table format)

Example test scenario:

User uploads: sales_data.xlsx with columns (date, product, quantity, revenue)
Chat query: "Show me the top 5 products by revenue"
System: Generates SQL → Queries OceanBase → Returns results

Background

This is part of the RAGFlow + OceanBase Hackathon. Adding OceanBase storage support to the Table Parser will enable users to leverage OceanBase's powerful SQL capabilities for querying structured Excel data through natural language conversations in RAGFlow.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions