Review patent process by kermitt2 · Pull Request #1082 · grobidOrg/grobid

kermitt2 · 2024-02-05T21:21:43Z

This is an update and review of the task of patent and non-patent reference extraction from patent documents.

Support for Deep Learning models, with adapted segmentation of input sequences for training and prediction.
Batch processing for prediction.
Include a BidLSTM_CRF_FEATURES model, which significantly improves extraction accuracy for NPL references compared to CRF (+10 points F1-score) and is roughly on par for patent citations (+1 point F1-score). Note that training also supports all BERT flavors, in particular Google BERT for Patents.
Update of the mapping of US application prefix numbers to years (for patent publication number normalization according to epodoc), up to 2021/2022.
Review the XML serialization so that the result XML only includes paragraphs containing references. References are given for each paragraph (rather than all at the end), with position offsets relative to the paragraph (and not to the whole document as before). This improves readability, normally without requiring changes to the XML parser used to get the references.
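To illustrate the change in offsets described in the last item, here is a minimal Python sketch of re-basing document-level reference offsets onto their containing paragraph and dropping paragraphs without references. It is not the GROBID implementation: the `Paragraph` and `Reference` types and the `to_paragraph_offsets` function are hypothetical names introduced only for this example, and end offsets are assumed to be exclusive.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    start: int  # offset of the paragraph's first character in the full document
    text: str


@dataclass
class Reference:
    doc_start: int  # start offset in the full document
    doc_end: int    # end offset in the full document (exclusive, by assumption)


def to_paragraph_offsets(paragraphs, references):
    """Group references by containing paragraph and re-base their offsets.

    Returns a list of (paragraph_text, [(start, end), ...]) pairs. Paragraphs
    that contain no reference are left out of the result.
    """
    result = []
    for paragraph in paragraphs:
        paragraph_end = paragraph.start + len(paragraph.text)
        local_spans = [
            (ref.doc_start - paragraph.start, ref.doc_end - paragraph.start)
            for ref in references
            if paragraph.start <= ref.doc_start and ref.doc_end <= paragraph_end
        ]
        if local_spans:
            result.append((paragraph.text, local_spans))
    return result
```

With paragraph-relative offsets, a consumer can slice the paragraph text directly (`text[start:end]`) without needing the rest of the document.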

With a GPU and a DL model, using 8 threads, processing 500 EP B publications took 775 seconds.
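As a rough reading of this figure, the short Python sketch below converts the reported run into a throughput and an average time per document. The `throughput` helper is a hypothetical name used only here, and the numbers are simply the ones quoted above.

```python
def throughput(num_documents: int, elapsed_seconds: float) -> tuple[float, float]:
    """Return (documents per second, seconds per document) for a batch run."""
    if num_documents <= 0 or elapsed_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return num_documents / elapsed_seconds, elapsed_seconds / num_documents


# Figures reported above: 500 EP B publications in 775 seconds.
docs_per_second, seconds_per_doc = throughput(500, 775)
print(f"{docs_per_second:.2f} documents/s, {seconds_per_doc:.2f} s/document")
```

Since the run used 8 threads, the per-document value is an aggregate wall-clock average, not the latency of a single request.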

Related: grobid_client_python has been extended to process directories of patent files (ST36 or PDF), for example:

grobid_client --input /media/lopez/data/document-quality-data/citation_recognition/patent/ground_truth/ --output resources/test_out/ --n 8 processCitationPatentST36

TODO: proper 10-fold cross-evaluation of models, benchmark, and optionally automatic download of the fine-tuned BERT large model for patents.

coveralls · 2024-02-05T23:03:49Z

coverage: 39.783% (-0.1%) from 39.893%
when pulling 1ebc6c8 on review-patent
into 4816a7a on master.

kermitt2 added 13 commits February 4, 2024 19:57

review training parsing and selection for DL models

c943239

some training fixes

8c4f19a

update DL models

0be5097

remove outdated xml parser

a4eb584

refactor process for DL models batch

fbb238c

update US patent application mapping to year

a0b5bc2

review process and serialization

d66896e

update model

1550756

cleaning used model

a01d5ec

fix language code mismatch for Korean

80ca203

add additional tokenizer mode

53f8c1d

review method profile and fix test

0d524f7

extend default config

017bc28

kermitt2 marked this pull request as draft February 5, 2024 21:21

review sequence segmentation following max sequence length

08c0405

kermitt2 marked this pull request as ready for review February 6, 2024 12:44

kermitt2 added 4 commits February 6, 2024 15:51

add tests, cleaning

92d3c1d

add tests

5750ad7

fix usage of parameters

8282dad

review serialization

1ebc6c8

kermitt2 merged commit 269c897 into master Feb 7, 2024

lfoppiano added this to the 0.8.1 milestone Jun 9, 2024

lfoppiano deleted the review-patent branch March 21, 2026 20:57

kermitt2 merged 18 commits into master from review-patent
