HisDoc1B Dataset

The HisDoc1B dataset comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, exceeding existing datasets by more than 200 times in both the number of document images and the number of characters (see Table 1). It is also the only dataset with complete book-level annotations and punctuation annotations.

Dataset	#Books	#Document images	#Characters	#Character categories	Text punctuation
MTHv1[1]	-	1,500	521,370	4,058	×
MTHv2[2]	-	3,199	1,081,678	6,733	×
IC19 HDRC[3]	-	11,715	2,482,994	8,353	×
M5HisDoc[4]	-	8,000	4,367,360	16,151	×
CASIA-AHCDB[5]	-	-	2,276,740	10,350	×
HisDoc1B (Ours)	40,281	3,163,330 (270×)	1,082,544,808 (248×)	30,615 (1.9×)	✓

Table 1: Comparison of HisDoc1B with existing Chinese historical document datasets. The highest and second highest values within each column are denoted by bold and underline, respectively.
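The multipliers in the HisDoc1B row of Table 1 are consistent with dividing each HisDoc1B count by the largest count among the earlier datasets in the same column. The short check below is only an illustration of that arithmetic using the figures from Table 1; the variable names are ours, and the interpretation of the multipliers is an assumption inferred from the numbers.

```python
# Counts copied from Table 1. The dictionary keys are illustrative names.
hisdoc1b = {
    "document_images": 3_163_330,
    "characters": 1_082_544_808,
    "character_categories": 30_615,
}

# Counts for the earlier datasets, column by column ("-" entries omitted).
previous = {
    "document_images": [1_500, 3_199, 11_715, 8_000],
    "characters": [521_370, 1_081_678, 2_482_994, 4_367_360, 2_276_740],
    "character_categories": [4_058, 6_733, 8_353, 16_151, 10_350],
}

# Assumption: each multiplier is HisDoc1B divided by the largest earlier value.
ratios = {key: hisdoc1b[key] / max(previous[key]) for key in hisdoc1b}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

Rounded, these ratios match the 270×, 248×, and 1.9× figures given in the table.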

Usage & Download

OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTdPghMv281sKYsq0?e=fIuK65
BaiduYun: https://pan.baidu.com/s/1CQnfmHwh6hGigyvHNlmPCQ?pwd=aziq

The HisDoc1B dataset can only be used for non-commercial research purposes. Scholars or organizations wishing to use it should first complete the Application Form (see application-form in this repository) and email it to us (lianwen.jin@gmail.com or eelwjin@scut.edu.cn). With the form, please list or attach 1-2 of your publications from the past 6 years to demonstrate that you (or your team) conduct research in a related field such as Historical Document Analysis, Optical Character Recognition, or Document Image Processing. Currently, the dataset is freely available only to scholars in these fields. We will send you the decompression password for the dataset once your application has been received and approved.

Important Note

The original data is sourced from public channels such as the Internet, and its copyright remains with the original providers. The collated and annotated dataset presented here is for non-commercial use only and is currently licensed to universities and research institutions. To apply, please fill in the application form in accordance with the requirements specified on the dataset’s official website. The applicant must be a full-time employee of a university or research institute and must sign the application form. To simplify review, we recommend affixing an official seal (a seal of a secondary-level department is acceptable).

License

The HisDoc1B dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

Directory Format

The dataset is organized in the following directory format:

├── HisDoc1B
    ├── books
    │   ├── xxx.pdf/.djvu
    │   └── ...
    ├── annos
    │   ├── xxx.json
    │   └── ...
    ├── readme.md
    ├── book2im.py
    ├── read_anno.py
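
Given the layout above, a first step after decompression is usually to match each book file with its annotation file. The sketch below is not the bundled book2im.py or read_anno.py; it is a minimal illustration that assumes a book and its annotation share the same file stem (as the xxx placeholders suggest), and the function name pair_books_with_annos is hypothetical. It makes no assumption about the JSON schema.

```python
from pathlib import Path

# Book formats listed in the directory tree above.
BOOK_SUFFIXES = {".pdf", ".djvu"}


def pair_books_with_annos(root):
    """Match files in books/ with files in annos/ by file stem.

    Assumption (not confirmed by the README): books/xxx.pdf or
    books/xxx.djvu corresponds to annos/xxx.json.
    Returns (paired, unpaired), where paired maps each stem to a
    (book_path, anno_path) tuple and unpaired lists stems found on
    only one side.
    """
    root = Path(root)
    books = {
        p.stem: p
        for p in (root / "books").iterdir()
        if p.suffix.lower() in BOOK_SUFFIXES
    }
    annos = {p.stem: p for p in (root / "annos").glob("*.json")}

    paired = {
        stem: (books[stem], annos[stem])
        for stem in sorted(books.keys() & annos.keys())
    }
    unpaired = sorted(books.keys() ^ annos.keys())
    return paired, unpaired
```

Calling pair_books_with_annos("HisDoc1B") on the extracted dataset root returns the matched pairs plus any stems that lack a counterpart, which can help detect an incomplete download before further processing.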

Citation

Please cite our paper when using the dataset:

@article{shi2025large,
  title={A large-scale dataset for Chinese historical document recognition and analysis},
  author={Shi, Yongxin and Peng, Dezhi and Zhang, Yuyi and Cao, Jiahuan and Jin, Lianwen},
  journal={Scientific Data},
  volume={12},
  number={1},
  pages={169},
  year={2025},
  publisher={Nature Publishing Group UK London}
}

Contact

For any questions about the dataset, please email Prof. Jin (eelwjin@scut.edu.cn or lianwen.jin@gmail.com).
