SCP-116K Dataset Pipeline

This repository contains the code implementation for the paper: "SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain"

Paper: https://arxiv.org/abs/2501.15587

Dataset available at: https://huggingface.co/datasets/EricLu/SCP-116K

Pipeline Overview

This is a generalized pipeline for automatically extracting high-quality problem-solution pairs from various publicly available documents crawled from the internet. The pipeline consists of the following steps:

fileter_doc_from_lib_meta.py
- Filter the database metadata to identify candidate documents
transfer_pdf_to_text_with_4o.py
- Convert PDF documents to text format with enhanced OCR capabilities
get_doc_page_unit_start_index.py
- Generate page and unit indices for better content organization
split_doc_to_chunk_by_llm_index.py
- Split documents into manageable chunks using LLM-based indexing
extract_problem_and_solution_from_doc_text.py
- Extract potential problem-solution pairs from the processed text
filter_problem_and_solution.py
- Filter and validate the extracted problem-solution pairs
recall_solutions_for_problems.py
- Match problems with their corresponding solutions
judge_problems_and_solutions_match.py
- Verify and validate the matched problem-solution pairs
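
The order of the steps above can be sketched as a small driver script. This is only an illustration, not part of the repository: it assumes each script can be run as a standalone entry point with no command-line arguments, which this README does not confirm (the scripts may need input paths or API credentials configured first). The names `PIPELINE_STAGES`, `build_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names copied from the step list above, in the order given there.
PIPELINE_STAGES = [
    "fileter_doc_from_lib_meta.py",  # spelling matches the file in the repository
    "transfer_pdf_to_text_with_4o.py",
    "get_doc_page_unit_start_index.py",
    "split_doc_to_chunk_by_llm_index.py",
    "extract_problem_and_solution_from_doc_text.py",
    "filter_problem_and_solution.py",
    "recall_solutions_for_problems.py",
    "judge_problems_and_solutions_match.py",
]


def build_commands(repo_dir=".", python_exe=sys.executable):
    """Return one argv list per stage, in pipeline order.

    Assumption: every script is invoked without arguments. The real
    scripts may expect paths or credentials to be set beforehand.
    """
    repo = Path(repo_dir)
    return [[python_exe, str(repo / name)] for name in PIPELINE_STAGES]


def run_pipeline(repo_dir=".", dry_run=True):
    """Print the planned commands; execute them only when dry_run is False."""
    for argv in build_commands(repo_dir):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the run at the first stage that fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is executed until the scripts are configured.
    run_pipeline(dry_run=True)
```

Keeping the dry run as the default makes it possible to review the planned order before any stage that calls an LLM is actually launched.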

Usage

For detailed information about each step and how to use the pipeline, please refer to:

- The individual Python files in this repository
- The research paper

Citation

@misc{lu2025scp116khighqualityproblemsolutiondataset,
      title={SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain}, 
      author={Dakuan Lu and Xiaoyu Tan and Rui Xu and Tianchu Yao and Chao Qu and Wei Chu and Yinghui Xu and Yuan Qi},
      year={2025},
      eprint={2501.15587},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.15587}, 
}

License

The dataset is licensed under CC-BY-NC-SA 4.0. The code is licensed under the MIT License.
