The HisDoc1B dataset comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, surpassing existing datasets by more than 200 times in terms of scale (as shown in the below table). Additionally, it is the only dataset with complete book-level annotations and punctuation annotations.
| Dataset | #Books | #Document images | #Characters | #Character categories | Text punctuation |
|---|---|---|---|---|---|
| MTHv1[1] | - | 1,500 | 521,370 | 4,058 | × |
| MTHv2[2] | - | 3,199 | 1,081,678 | 6,733 | × |
| IC19 HDRC[3] | - | 11,715 | 2,482,994 | 8,353 | × |
| M5HisDoc[4] | - | 8,000 | 4,367,360 | 16,151 | × |
| CASIA-AHCDB[5] | - | - | 2,276,740 | 10,350 | × |
| HisDoc1B (Ours) | 40,281 | 3,163,330 (270×) | 1,082,544,808 (248×) | 30,615 (1.9×) | ✓ |
Table 1: Comparison of HisDoc1B with existing Chinese historical document datasets. The highest and second highest values within each column are denoted by bold and underline, respectively.
OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTdPghMv281sKYsq0?e=fIuK65
BaiduYun: https://pan.baidu.com/s/1CQnfmHwh6hGigyvHNlmPCQ?pwd=aziq
The HisDoc1B dataset can only be used for non-commercial research purposes. Scholars or organizations wishing to use the HisDoc1B dataset should first complete this Application Form and send it via email to us (lianwen.jin@gmail.com or eelwjin@scut.edu.cn). When submitting the application form to us, please list or attach 1-2 of your publications from the past 6 years to demonstrate that you (or your team) conduct research in the related research fields of Historical Document Analysis, Optical Character Recognition, Document Image Processing, and so on. Currently, this dataset is only freely available to scholars in the above-mentioned fields. We will send you the decompression password for the dataset after your letter has been received and approved.
The original data of the dataset is sourced from public channels such as the Internet, and its copyright shall remain with the original providers. The collated and annotated dataset presented in this case is for non-commercial use only and is currently licensed to universities and research institutions. To apply for the use of this dataset, please fill in the corresponding application form in accordance with the requirements specified on the dataset’s official website. The applicant must be a full-time employee of a university or research institute and is required to sign the application form. For the convenience of review, it is recommended to affix an official seal (a seal of a secondary-level department is acceptable).
The HisDoc1B dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.
The dataset is organized in the following directory format:
├── HisDoc1B
├── books
│ ├── xxx.pdf/.djvu
│ └── ...
├── annos
│ ├── xxx.json
│ └── ...
├── readme.md
├── book2im.py
├── read_anno.py
Please cite our paper when using the dataset:
@article{shi2025large,
title={A large-scale dataset for Chinese historical document recognition and analysis},
author={Shi, Yongxin and Peng, Dezhi and Zhang, Yuyi and Cao, Jiahuan and Jin, Lianwen},
journal={Scientific Data},
volume={12},
number={1},
pages={169},
year={2025},
publisher={Nature Publishing Group UK London}
}
For any questions about the dataset, please contact the authors by sending an email to Prof. Jin(eelwjin@scut.edu.cn, or lianwen.jin@gmail.com).