LLM-BioDataExtractor is an automated pipeline that leverages large language models (LLMs) to extract biochemical data, including enzyme kinetics parameters (e.g., Km, kcat), enzyme activity, and ribozyme data, from the scientific literature. The pipeline consists of three main steps (see the end-to-end sketch after this list):
- PDF to Markdown (pdf_2_md): Converts PDF files to Markdown format.
- LLM Data Extraction (LLM_extract_data): Extracts key biochemical data from Markdown files using LLMs.
- Evaluate Extracted Data (evaluate_extracted_data): Compares the extracted data with ground truth to assess accuracy.
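Taken together, the three steps chain into a short script. A minimal end-to-end sketch using the functions exported by extract_pipeline, with default arguments:

```python
# End-to-end run of the pipeline with default paths
# (data/pdf -> data/md -> data/response -> data/results).
from extract_pipeline import pdf_2_md, LLM_extract_data, evaluate_extracted_data

pdf_2_md()                   # Step 1: convert PDFs to Markdown
LLM_extract_data()           # Step 2: extract kinetics data with LLMs
evaluate_extracted_data()    # Step 3: compare extractions against ground truth
```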
Fig. 1. Schematic of our LLM-based agentic workflow for enzyme kinetic data extraction.

Table 1. Overall performance of various models examined on the annotated dataset of 156 protein enzyme papers.
We offer a suite of online tools and a data portal designed to streamline access to and processing of biological data. Key features include:
- Automated Enzyme Kinetics Extractor: A user-friendly tool for extracting enzyme kinetics data from scientific literature.
- Golden Benchmark for Enzyme Kinetics: A simple interface for searching and browsing a collection of benchmark datasets, enabling the evaluation of enzyme kinetics extraction models.
- Golden Benchmark for Ribozyme Kinetics: A simple interface for searching and browsing a collection of benchmark datasets, enabling the evaluation of ribozyme kinetics extraction models.
- LLM Enzyme Kinetics Archive (LLENKA): An intuitive platform for searching and browsing a comprehensive dataset sourced from 3,435 articles. LLENKA provides the research community with a structured, high-quality resource of enzyme kinetics data, advancing future research endeavors.
Visit the Automated-Enzyme-Kinetics-Extractor for more details and to start using these tools.
You can also download the datasets here:
- Golden Benchmark for Enzyme Kinetics
- Golden Benchmark for Ribozyme Kinetics
- LLM Enzyme Kinetics Archive (LLENKA)
Ensure the required dependencies are installed:
```bash
pip install -r requirements.txt
```

Convert PDF files to Markdown format. Documents of up to 50 pages are processed by default (the page limit is customizable).
```python
from extract_pipeline import pdf_2_md

pdf_2_md()
```

Extract key biochemical data from Markdown files and save it in the response folder.
```python
from extract_pipeline import LLM_extract_data

LLM_extract_data()
```

s2_LLM_extract_data/LLM_response_aggregate.py is a Python script that aggregates the responses of four language models into a single response.
Place the Markdown files of the scientific literature in the data/md/ directory and the four model responses in the data/response/ directory; the script will process these responses and aggregate them into a single response, as sketched below.
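The expected layout then looks roughly like this (the subfolder names are hypothetical; one response set per model):

```
data/
├── md/           # Markdown files of the source papers
└── response/
    ├── model_a/  # hypothetical names; one folder of responses per model
    ├── model_b/
    ├── model_c/
    └── model_d/
```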
```bash
python LLM_response_aggregate.py
```

Compare the extracted data with the ground truth to assess accuracy.
```python
from extract_pipeline import evaluate_extracted_data

evaluate_extracted_data()
```

The repository is organized as follows:

```
.
├── analyze_code # Code for analyzing extracted data
│ ├── data # Data files used for analysis
│ │ └── ...
│ │
│ ├── analyzing.ipynb # Jupyter notebook for analyzing extracted data
│ ├── requirements.txt # Required dependencies
│ └── readme.md # Project overview and usage instructions
│
├── data # Data files used for extraction and evaluation
│ ├── pdf # PDF files to be processed
│ ├── md # Converted Markdown files
│ ├── txt # Extracted text files
│ ├── response # Extracted data files
│ └── results # Evaluation results
│
├── prompt # Prompt files
│ ├── p_3_2_0806.txt # Prompt for data extraction
│   └── p_2_0826.txt # Prompt for merging data
│
├── s1_pdf_2_md # PDF to Markdown conversion pipeline
│ ├── ocr_mathpix.py # High-performance PDF to Markdown conversion
│ ├── ocr_pymupdf.py # Free but less effective PDF to text conversion
│ ├── readme.md # Usage instructions
│ └── readme_pymupdf.md # Instructions for text conversion logic
│
├── s2_LLM_extract_data # LLM data extraction pipeline
│ ├── LLM_data_extraction.py # Main logic for data extraction
│ ├── LLM_response_aggregate.py # Aggregate responses
│ └── readme.md # Usage instructions
│
├── s3_evaluate_extracted_data # Evaluation pipeline
│ ├── evaluate_extracted_data.py # Main logic for evaluation
│ └── readme.md # Usage instructions
│
├── extract_pipeline.py # Main processing logic
├── readme.md # Project overview
└── requirements.txt # Dependency list
```
Parameters for pdf_2_md (a combined usage sketch follows these parameter lists):
- data_folder_dir: Path to the data folder, default is "data/".
- pdf_folder_dir: Path to the PDF folder, default is "data/pdf".
- md_folder_dir: Path to the Markdown folder, default is "data/md".
Parameters for LLM_extract_data:
- md_folder: Path to the Markdown folder, default is "data/md/".
- response_folder: Path to the response folder, default is "data/response/".
- prompt_extract_dir: Path to the extraction prompt file, default is "prompt/p_3_2_0806.txt".
- prompt_merge_dir: Path to the merging prompt file, default is "prompt/p_2_0826.txt".
Parameters for evaluate_extracted_data:
- response_dir: Path to the folder containing LLM extraction results, default is 'data/response/prompt_p_3_2_0806_claude-3-5-sonnet-20240620_128k_stream_max_tokens_8192_temperature_0.1'.
- ground_truth_dir: Path to the ground truth file, default is 'data/ground_truth/km_kcat_all.csv'.
- seq: Delimiter, default is "|".
- order: Target column index, default is -7.
- have_dir: Whether subdirectories exist, default is 0.
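These parameters presumably map to keyword arguments of the corresponding functions; a minimal sketch under that assumption, using the default values listed above:

```python
from extract_pipeline import pdf_2_md, LLM_extract_data, evaluate_extracted_data

# Assumption: the parameters listed above are keyword arguments of each function.
pdf_2_md(data_folder_dir="data/", pdf_folder_dir="data/pdf", md_folder_dir="data/md")

LLM_extract_data(
    md_folder="data/md/",
    response_folder="data/response/",
    prompt_extract_dir="prompt/p_3_2_0806.txt",
    prompt_merge_dir="prompt/p_2_0826.txt",
)

evaluate_extracted_data(
    ground_truth_dir="data/ground_truth/km_kcat_all.csv",
    seq="|",      # delimiter in the extracted tables
    order=-7,     # target column index (negative values count from the end)
    have_dir=0,   # 1 if the response files live in subdirectories
)
```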
This section describes the analyze_code directory. It contains a Jupyter notebook, analyzing.ipynb, with code for loading the extracted data, analyzing it, and visualizing the results.
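As a starting point in the notebook, a minimal sketch for loading an evaluation output (the file name is hypothetical; substitute an actual file from data/results/):

```python
import pandas as pd

# Hypothetical file name; replace with a real file from data/results/.
# Pass sep="|" instead if the file uses the pipeline's "|" delimiter.
df = pd.read_csv("data/results/evaluation_results.csv")

print(df.head())      # inspect the first few extracted entries
print(df.describe())  # summary statistics of the numeric columns
```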
The script uses the logging module for recording logs. By default, the log level is set to INFO. You can adjust the log level as needed.
```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```

- Ensure all paths and filenames are correct.
- Complete the pdf_2_md step successfully before running LLM_extract_data.
- Complete the LLM_extract_data step successfully before running evaluate_extracted_data.
Please cite this project if you find it useful in your research:
```bibtex
@article{Jiang2025.03.03.641178,
  author = {Jiang, Jinling and Hu, Jie and Xie, Siwei and Guo, Menghao and Dong, Yuhang and Fu, Shuai and Jiang, Xianyue and Yue, Zhenlei and Shi, Junchao and Zhang, Xiaoyu and Song, Minghui and Chen, Guangyong and Lu, Hua and Wu, Xindong and Guo, Pei and Han, Da and Sun, Zeyi and Qiu, Jiezhong},
  title = {Enzyme Co-Scientist: Harnessing Large Language Models for Enzyme Kinetic Data Extraction from Literature},
  elocation-id = {2025.03.03.641178},
  year = {2025},
  doi = {10.1101/2025.03.03.641178},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {The extraction of molecular annotations from scientific literature is critical for advancing data-driven research. However, traditional methods, which primarily rely on human curation, are labor-intensive and error-prone. Here, we present an LLM-based agentic workflow that enables automatic and efficient data extraction from literature with high accuracy. As a demonstration, our workflow successfully delivers a dataset containing over 91,000 enzyme kinetics entries from around 3,500 papers. It achieves an average F1 score above 0.9 on expert-annotated subsets of protein enzymes and can be extended to the ribozyme domain in fewer than 3 days at less than $90. This method opens up new avenues for accelerating the pace of scientific research.},
  URL = {https://www.biorxiv.org/content/early/2025/03/11/2025.03.03.641178},
  eprint = {https://www.biorxiv.org/content/early/2025/03/11/2025.03.03.641178.full.pdf},
  journal = {bioRxiv}
}
```
Thank you for using LLM-BioDataExtractor! We hope it helps you efficiently process and analyze a wide range of biochemical data from scientific literature.