
feat: Entity extraction uses JSON structured output instead of delimiter-based text #2684

Closed
MrGidea wants to merge 4 commits into HKUDS:main from MrGidea:feat/json-structured-extraction

MrGidea commented Feb 7, 2026

Summary

  • Replace delimiter-based entity extraction with JSON structured output, significantly improving extraction quality and compatibility with smaller models
  • Support native JSON mode for OpenAI-compatible APIs (response_format: json_object), Ollama (format: json), and Gemini (response_mime_type)
  • Auto-fallback: if a provider doesn't support response_format, automatically retry without it (relying on JSON prompt + json_repair)
  • Backward compatible: configurable via the ENTITY_EXTRACTION_USE_JSON env var (default: true); cache rebuild auto-detects whether a cached result is in JSON or delimiter format
  • Add EXTRACTION_MAX_TOKENS config to prevent output truncation (many APIs default to only 1024)
  • Skip relationships with empty descriptions to prevent merge errors
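
The auto-fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `complete` is a hypothetical stand-in for a provider call, `UnsupportedParameterError` is a hypothetical exception type, and the standard library's `json.loads` is used where the PR relies on json_repair.

```python
import json
from typing import Any, Callable


class UnsupportedParameterError(Exception):
    """Hypothetical error raised when a provider rejects a request field."""


def complete_with_json_fallback(
    complete: Callable[..., str], prompt: str, **kwargs: Any
) -> dict:
    """Request native JSON mode first; retry without it if it is rejected."""
    try:
        raw = complete(prompt, response_format={"type": "json_object"}, **kwargs)
    except UnsupportedParameterError:
        # Retry without response_format. The prompt itself still asks for
        # JSON, so the reply is expected to be parseable.
        raw = complete(prompt, **kwargs)
    return json.loads(raw)
```

A real provider would signal the unsupported parameter in its own way (for example an HTTP 400 response), so the exception type to catch is an assumption that has to be adapted per client library.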

Changed Files

  • lightrag/types.py - Add EntityExtractionResult Pydantic model
  • lightrag/prompt.py - Add JSON-mode prompt templates and examples
  • lightrag/operate.py - Add _process_json_extraction_result() parser, modify extraction pipeline
  • lightrag/utils.py - Pass entity_extraction flag through use_llm_func_with_cache
  • lightrag/lightrag.py - Add entity_extraction_use_json and extraction_max_tokens config
  • lightrag/llm/openai.py - response_format: json_object with auto-fallback retry
  • lightrag/llm/ollama.py - format="json" for entity extraction
  • lightrag/llm/gemini.py - response_mime_type="application/json" for entity extraction
  • lightrag/llm/*.py (others) - Pop entity_extraction kwarg for compatibility
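
As a rough illustration of the per-provider changes listed above, the sketch below maps one internal `entity_extraction` flag to the option each provider expects and drops the flag everywhere else. The function name `json_mode_kwargs` and the provider strings are hypothetical; only the option names (`response_format`, `format`, `response_mime_type`) come from this PR description.

```python
from typing import Any


def json_mode_kwargs(provider: str, kwargs: dict[str, Any]) -> dict[str, Any]:
    """Translate the internal entity_extraction flag into provider options."""
    out = dict(kwargs)  # copy so the caller's dict is not mutated
    # Always pop the flag so it is never forwarded to a provider SDK.
    wants_json = bool(out.pop("entity_extraction", False))
    if not wants_json:
        return out
    if provider == "openai":
        out["response_format"] = {"type": "json_object"}
    elif provider == "ollama":
        out["format"] = "json"
    elif provider == "gemini":
        out["response_mime_type"] = "application/json"
    # Any other provider: the flag is dropped and nothing is added.
    return out
```

Popping the flag unconditionally is what keeps providers without a JSON mode working, since an unknown keyword argument would otherwise reach their client call.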

Test Plan

  • Tested with Moonshot API (OpenAI-compatible, direct connection)
  • Tested with Google Gemini 2.0 Flash via OpenRouter
  • Tested with DeepSeek V3 via OpenRouter
  • Tested with Qwen 2.5 72B via OpenRouter
  • Tested with Meta Llama 3.1 70B via OpenRouter
  • Verified JSON parsing, gleaning, cache rebuild, and graph construction
  • Verified auto-fallback when response_format is not supported
  • Verified backward compatibility (ENTITY_EXTRACTION_USE_JSON=false)
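
For readers reproducing the backward-compatibility check, the toggle and the cache format detection might look something like the sketch below. Both helpers are hypothetical and written for illustration; the accepted truthy strings and the "parses to a JSON object" heuristic are assumptions, not the behavior of this PR.

```python
import json
import os


def use_json_extraction(default: bool = True) -> bool:
    """Read ENTITY_EXTRACTION_USE_JSON, falling back to the default (true)."""
    raw = os.environ.get("ENTITY_EXTRACTION_USE_JSON")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def looks_like_json_result(cached: str) -> bool:
    """Heuristic: treat a cached result as JSON if it parses to an object."""
    try:
        return isinstance(json.loads(cached), dict)
    except json.JSONDecodeError:
        # Anything that does not parse is assumed to be delimiter-based text.
        return False
```

With a check of this kind, a cache written before the switch can still be routed to the delimiter parser while newer entries go to the JSON parser.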

…ter-based text

- Add EntityExtractionResult Pydantic model for structured JSON output
- Add JSON-mode prompt templates for entity/relationship extraction
- Add _process_json_extraction_result() JSON parser in extraction pipeline
- Add entity_extraction_use_json config option, default True
- Add extraction_max_tokens config to prevent output truncation
- OpenAI: use response_format json_object with auto-fallback retry
- Ollama/Gemini: use native JSON mode for entity extraction
- Other providers: pop entity_extraction kwarg for compatibility
- Cache rebuild auto-detects JSON vs delimiter format
- Skip relationships with empty descriptions to prevent merge errors
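
A minimal sketch of what parsing the structured output and skipping empty-description relationships could look like. It uses standard-library dataclasses in place of the Pydantic model so that it runs without extra dependencies, and every field name (`name`, `type`, `source`, `target`, `description`) is an assumption rather than the schema defined in lightrag/types.py.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    description: str


@dataclass
class Relationship:
    source: str
    target: str
    description: str


@dataclass
class ExtractionResult:
    entities: list[Entity] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)


def parse_extraction(raw: str) -> ExtractionResult:
    """Parse a JSON extraction reply, dropping relationships with no description."""
    data = json.loads(raw)
    result = ExtractionResult()
    for e in data.get("entities", []):
        result.entities.append(
            Entity(e["name"], e.get("type", ""), e.get("description", ""))
        )
    for r in data.get("relationships", []):
        description = (r.get("description") or "").strip()
        if not description:
            continue  # skipped: nothing to merge for this relationship
        result.relationships.append(
            Relationship(r["source"], r["target"], description)
        )
    return result
```

Filtering at parse time means the later merge step never receives a relationship without a description, which is the failure the last bullet is guarding against.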
danielaskdd added the enhancement (New feature or request) and server (LightRAG Server) labels on Feb 9, 2026
danielaskdd added the tracked (Issue is tracked by project) label on Feb 23, 2026

MrGidea commented Mar 8, 2026

Superseded by a new combined PR that includes the JSON structured extraction changes together with the newer multimodal and role-based pipeline updates, while excluding the entity disambiguation experiment.
