Overview
RAGFlow's Table Parser is a chunking method that parses Excel/CSV files and stores the structured data for retrieval in chat conversations. Currently it supports storing parsed table data in Elasticsearch and Infinity. This feature request aims to add OceanBase as a storage backend for the Table Parser, enabling users to query Excel data stored in OceanBase through natural language in chat.
Technical Details
Location
rag/app/table.py - Table parser implementation (lines 460-495)
rag/utils/ob_conn.py:1632 - OceanBase connection with TODO for sql() function
Current Implementation
The Table Parser processes Excel/CSV files and stores data differently based on the document engine:
-
For Infinity (DOC_ENGINE_INFINITY):
- Stores data in a JSON column
chunk_data with original column names as keys
- Flexible schema handling via JSON
data_json[str(clmns[j])] = row[clmns[j]]
d["chunk_data"] = data_json
-
For Elasticsearch/OpenSearch (default):
- Stores data as individual fields with type suffixes
- Type suffixes:
_tks (text tokens), _kwd (keyword), _long (integer), _flt (float), _dt (datetime), _bool (boolean)
fld = clmns_map[j][0] # e.g., "name_kwd", "age_long"
d[fld] = row[clmns[j]] if clmn_tys[j] != "text" else rag_tokenizer.tokenize(row[clmns[j]])
Implementation Plan
-
Add OceanBase storage logic in rag/app/table.py (after line 495)
elif settings.DOC_ENGINE_OCEANBASE:
# Store in OceanBase using JSON column for flexibility
data_json[str(clmns[j])] = row[clmns[j]]
d["chunk_data"] = data_json
-
Implement sql() function in rag/utils/ob_conn.py (line 1632)
def sql(self, sql_text: str, fetch_size: int =1024, format: str = "json"):
"""
Execute SQL query on OceanBase for table data retrieval
Args:
sql_text: SQL query string (generated by LLM via text-to-SQL)
fetch_size: Maximum number of rows to return
format: Output format ("json" or "markdown")
Returns:
Query results in specified format
"""
# 1. Parse and validate SQL
# 2. Execute query against OceanBase
# 3. Handle JSON column data (chunk_data)
# 4. Format results (JSON or Markdown table)
# 5. Apply fetch_size limit
-
Key Technical Challenges:
- Dynamic Schema: Excel files have different columns, use JSON column for flexibility
- Data Type Mapping: Map Python types to OceanBase types (int, float, text, datetime, bool)
- Text-to-SQL Integration: Generate SQL from natural language queries
- Result Formatting: Support both JSON and Markdown output formats
Core Functionality
The Table Parser provides:
- Excel/CSV Parsing: Multi-sheet support, header detection, data type inference
- Data Type Detection: Automatically detects int, float, text, datetime, bool types
- Tokenization: Text fields are tokenized for full-text search
- Field Mapping: Creates pinyin-based field names for Chinese content
- Chunk Creation: Each row becomes a searchable chunk with metadata
- Image Extraction: Extracts and describes images within Excel cells
Estimated Effort
- Code size: ~150 lines (table.py modifications + sql() implementation)
- Difficulty: ⭐⭐⭐ Medium
- Priority: 🟡 Medium
Related Files
rag/app/table.py - Table parser (modify storage logic)
rag/utils/ob_conn.py - OceanBase connection (implement sql() function)
deepdoc/parser/excel_parser.py - Excel parsing utilities
common/settings.py - DOC_ENGINE configuration
Acceptance Criteria
Provide screenshot(s) demonstrating:
- Upload an Excel file to a knowledge base with OceanBase as the document engine
- Configure the Table Parser as the chunking method
- Successfully parse and store the Excel data in OceanBase
- Query the stored table data through chat using natural language
- Display query results correctly (Markdown table format)
Example test scenario:
User uploads: sales_data.xlsx with columns (date, product, quantity, revenue)
Chat query: "Show me the top 5 products by revenue"
System: Generates SQL → Queries OceanBase → Returns results
Background
This is part of the RAGFlow + OceanBase Hackathon. Adding OceanBase storage support to the Table Parser will enable users to leverage OceanBase's powerful SQL capabilities for querying structured Excel data through natural language conversations in RAGFlow.
Overview
RAGFlow's Table Parser is a chunking method that parses Excel/CSV files and stores the structured data for retrieval in chat conversations. Currently it supports storing parsed table data in Elasticsearch and Infinity. This feature request aims to add OceanBase as a storage backend for the Table Parser, enabling users to query Excel data stored in OceanBase through natural language in chat.
Technical Details
Location
rag/app/table.py- Table parser implementation (lines 460-495)rag/utils/ob_conn.py:1632- OceanBase connection with TODO forsql()functionCurrent Implementation
The Table Parser processes Excel/CSV files and stores data differently based on the document engine:
For Infinity (
DOC_ENGINE_INFINITY):chunk_datawith original column names as keysFor Elasticsearch/OpenSearch (default):
_tks(text tokens),_kwd(keyword),_long(integer),_flt(float),_dt(datetime),_bool(boolean)Implementation Plan
Add OceanBase storage logic in
rag/app/table.py(after line 495)Implement
sql()function inrag/utils/ob_conn.py(line 1632)Key Technical Challenges:
Core Functionality
The Table Parser provides:
Estimated Effort
Related Files
rag/app/table.py- Table parser (modify storage logic)rag/utils/ob_conn.py- OceanBase connection (implement sql() function)deepdoc/parser/excel_parser.py- Excel parsing utilitiescommon/settings.py- DOC_ENGINE configurationAcceptance Criteria
Provide screenshot(s) demonstrating:
Example test scenario:
Background
This is part of the RAGFlow + OceanBase Hackathon. Adding OceanBase storage support to the Table Parser will enable users to leverage OceanBase's powerful SQL capabilities for querying structured Excel data through natural language conversations in RAGFlow.