
Review patent process #1082

Merged: kermitt2 merged 18 commits into master from review-patent on Feb 7, 2024

@kermitt2 (Collaborator) commented Feb 5, 2024

This is an update and review of the task of patent and non-patent reference extraction from patent documents.

  • Support for Deep Learning models, with adapted segmentation of input sequences for training and prediction.
  • Batch processing for prediction.
  • Include a BidLSTM_CRF_FEATURES model, which significantly improves extraction accuracy for NPL references compared to CRF (+10 points F1-score) and is roughly on par for patent citations (+1 point F1-score). Note that training also supports all BERT flavors, in particular Google BERT for Patents.
  • Update the mapping of US application prefix numbers to years (for patent publication number normalization according to epodoc), up to 2021/2022.
  • Review the XML serialization so that the result XML only includes paragraphs that contain references. References are now given for each paragraph (rather than all at the end), with position offsets relative to the paragraph (and not to the whole document as before). This improves readability and should normally not require changes to the XML parser used for getting the references.
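
To illustrate the first two points, here is a minimal sketch of cutting a long token sequence into overlapping windows and grouping those windows into prediction batches. It is not the code from this PR: the function names, the window size, the overlap, and the batch size are all hypothetical, and the actual segmentation strategy used by the models may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def segment(tokens: Sequence[str], max_len: int, overlap: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_index, window) pairs covering `tokens` with overlapping windows.

    Hypothetical helper: `max_len` and `overlap` are illustrative parameters,
    not settings taken from the PR.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    start = 0
    while start < len(tokens):
        yield start, list(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += step


def batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Group items into consecutive batches of at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(items), batch_size):
        yield list(items[i:i + batch_size])


if __name__ == "__main__":
    tokens = "a method as described in US 1234567 and in EP 0123456 B1".split()
    windows = list(segment(tokens, max_len=6, overlap=2))
    for batch in batches(windows, batch_size=2):
        print(batch)
```

Keeping the start index with each window makes it possible to map predicted labels back to positions in the original sequence; how labels in the overlapping regions are reconciled is left open here.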

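The change from document-level to paragraph-level offsets can be sketched as follows. This is only an illustration under stated assumptions: the paragraphs are assumed to be joined by a single separator character, the function name is hypothetical, and the real serialization code in the PR is not shown here.

```python
from typing import List, Tuple


def to_paragraph_offsets(
    paragraphs: List[str], doc_start: int, doc_end: int, sep_len: int = 1
) -> Tuple[int, int, int]:
    """Map a document-level [doc_start, doc_end) span to (paragraph_index, start, end).

    Assumes the document text is the paragraphs joined by a separator of
    `sep_len` characters (an assumption made for this sketch).
    """
    cursor = 0  # document-level offset where the current paragraph begins
    for index, text in enumerate(paragraphs):
        para_end = cursor + len(text)
        if cursor <= doc_start and doc_end <= para_end:
            return index, doc_start - cursor, doc_end - cursor
        cursor = para_end + sep_len
    raise ValueError("span does not fall inside a single paragraph")


if __name__ == "__main__":
    paragraphs = [
        "See US 1234567 for details.",
        "No citation here.",
        "As in EP 0123456 B1.",
    ]
    document = "\n".join(paragraphs)
    mention = "EP 0123456 B1"
    doc_start = document.index(mention)
    index, start, end = to_paragraph_offsets(paragraphs, doc_start, doc_start + len(mention))
    print(index, paragraphs[index][start:end])
```

With paragraph-relative offsets, a consumer can resolve a reference by looking only at the paragraph that carries it, without reconstructing the full document text.
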
With a GPU, the DL model and 8 threads, processing 500 EP B publications took 775 seconds (about 1.55 seconds per document).

Related: grobid_client_python has been extended to process directories of patent files (ST36 or PDF), for example

grobid_client --input /media/lopez/data/document-quality-data/citation_recognition/patent/ground_truth/  --output resources/test_out/ --n 8 processCitationPatentST36

TODO: proper 10-fold cross-evaluation of the models, a benchmark, and optionally automatic download of a fine-tuned BERT large model for patents.

@kermitt2 kermitt2 marked this pull request as draft February 5, 2024 21:21
coveralls commented Feb 5, 2024

Coverage Status

coverage: 39.783% (-0.1%) from 39.893% when pulling 1ebc6c8 on review-patent into 4816a7a on master.

@kermitt2 kermitt2 marked this pull request as ready for review February 6, 2024 12:44
@kermitt2 kermitt2 merged commit 269c897 into master Feb 7, 2024
@lfoppiano lfoppiano added this to the 0.8.1 milestone Jun 9, 2024
@lfoppiano lfoppiano deleted the review-patent branch March 21, 2026 20:57