This project classifies patent abstracts into relevant categories using large language models (LLMs) for financial insights. It leverages few-shot learning examples and keyword matching to refine predictions and provides functionalities for creating balanced datasets for training, validation, and testing.
The repository contains code to process, filter, and classify patent abstracts from a public dataset, enabling the identification of patents relevant to various financial and technological industries.
- Patent Dataset Handling: Loads, filters, and processes the public patents dataset.
- Few-Shot Learning: Incorporates few-shot examples to enhance classification prompts.
- Custom Industry Classification: Classifies patents based on industry-specific keywords.
- Balanced Dataset Creation: Generates datasets with equal positive and negative samples.
- Reviewer Allocation: Prepares review tables for manual verification and expert review.
- Python: Version 3.8 or above.
- Google Colab or Jupyter Notebook: For running the notebook.
- Install the following Python libraries:
pip install datasets google-generativeai pandas numpy matplotlib
- required_files: Contains all the files required for the project:
  - patent_classification.json: Few-shot examples for classification.
  - financial_topics.txt: List of financial topics and associated keywords.
  - Any other supporting files used in the notebook.
- notebooks: Contains the main Jupyter Notebook (code_pipeline.ipynb) to execute the pipeline.
- output_files: Where classified datasets and review tables are saved.
Ensure that file paths in code_pipeline.ipynb point to the required_files folder. For example:
file_path = 'required_files/financial_topics.txt'
Open code_pipeline.ipynb in Jupyter Notebook or Google Colab and execute the cells sequentially.
Set up your Google Generative AI API key in the notebook:
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
Update the industry variable in the notebook with the desired category:
industry = "Autonomous Vehicles"
Execute the classify_patents function to classify patent abstracts:
classified_data = classify_patents(total_abstracts_df_sets, industry, keywords, few_shot_examples)
Use the save_balanced_csv function to create a balanced dataset:
balanced_data = save_balanced_csv(classified_data, "output_files/balanced_dataset.csv", industry)
Generate a review table for manual validation:
review_table = prepare_review_table(classified_data)
review_table.to_csv('output_files/classified_review_set.csv', index=False)
- patent_classification.json: Contains few-shot examples categorized by industries.
- financial_topics.txt: A list of financial topics and their associated keywords.
- public_patents_dataset.pkl: Pickled dataset of processed patent abstracts.
Generates a prompt for the LLM using few-shot examples and keywords.
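As a rough illustration of how such a prompt might be assembled, here is a minimal sketch. The function name `build_prompt`, the prompt wording, and the `abstract`/`label` keys of each few-shot example are assumptions made for this example; the notebook's actual prompt layout and the schema of patent_classification.json may differ.

```python
def build_prompt(abstract, industry, keywords, few_shot_examples):
    """Assemble a few-shot classification prompt (hypothetical layout)."""
    lines = [
        f"Decide whether the patent abstract is relevant to the industry: {industry}.",
        f"Keywords associated with this industry: {', '.join(keywords)}.",
        "Answer with a single word: Yes or No.",
        "",
    ]
    # Each few-shot example is assumed to be a dict with 'abstract' and 'label' keys.
    for example in few_shot_examples:
        lines.append(f"Abstract: {example['abstract']}")
        lines.append(f"Answer: {example['label']}")
        lines.append("")
    # The abstract to classify goes last, with the answer left open for the model.
    lines.append(f"Abstract: {abstract}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Ending the prompt with an open "Answer:" line nudges the model to reply in the same one-word format as the examples.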
Uses the LLM to classify a patent abstract as relevant (Yes) or not (No) for a given industry.
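Model replies are free text, so the Yes/No decision usually has to be parsed defensively. The sketch below shows one possible approach; `parse_yes_no` and `classify_abstract` are hypothetical names, and `generate` stands for any function that takes a prompt string and returns the model's text reply (for instance, a thin wrapper around the LLM client used in the notebook).

```python
def parse_yes_no(reply):
    """Map a free-text model reply to True (Yes), False (No), or None (unclear)."""
    # Normalize case and treat trailing punctuation as whitespace.
    words = reply.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None  # Anything else is left for manual review.


def classify_abstract(prompt, generate):
    """Send the prompt through `generate` (str -> str) and parse the answer."""
    return parse_yes_no(generate(prompt))
```

Returning `None` for unclear replies keeps ambiguous cases out of both the positive and negative sets instead of silently guessing.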
classify_patents: Processes the entire dataset, filters by keywords, and classifies abstracts into positive and negative categories.
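The keyword filtering step can be pictured as a cheap pre-filter that runs before any LLM call. This is a minimal sketch, assuming case-insensitive whole-word matching; the helper names `matches_keywords` and `split_by_keywords` are illustrative, and the notebook may match keywords differently.

```python
import re


def matches_keywords(abstract, keywords):
    """Return True if any keyword occurs in the abstract as a whole word or phrase."""
    text = abstract.lower()
    return any(
        re.search(r"\b" + re.escape(keyword.lower()) + r"\b", text)
        for keyword in keywords
    )


def split_by_keywords(abstracts, keywords):
    """Split abstracts into keyword matches (LLM candidates) and the rest."""
    candidates = [a for a in abstracts if matches_keywords(a, keywords)]
    rest = [a for a in abstracts if not matches_keywords(a, keywords)]
    return candidates, rest
```

Word-boundary matching avoids false hits such as "lidar" inside "solidarity", at the cost of missing inflected forms that are not listed as keywords.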
save_balanced_csv: Creates a balanced dataset with equal positive and negative samples.
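One common way to balance classes is to downsample the larger class to the size of the smaller one. The sketch below assumes a pandas DataFrame with a `label` column holding "Yes" or "No"; the function name `balance`, the column name, and the fixed seed are assumptions for illustration, not necessarily how save_balanced_csv works internally.

```python
import pandas as pd


def balance(df, label_col="label", seed=0):
    """Downsample so that 'Yes' and 'No' rows appear in equal numbers."""
    positives = df[df[label_col] == "Yes"]
    negatives = df[df[label_col] == "No"]
    n = min(len(positives), len(negatives))
    balanced = pd.concat([
        positives.sample(n=n, random_state=seed),
        negatives.sample(n=n, random_state=seed),
    ])
    # Shuffle so positives and negatives are interleaved, then renumber rows.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

The result could then be written with `to_csv(path, index=False)`, matching the CSV output described in this README.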
prepare_review_table: Prepares a table for manual review, assigning reviewers and adding space for comments.
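Reviewer allocation can be as simple as a round-robin assignment over the classified rows. The following sketch assumes a pandas DataFrame as input; the function name `make_review_table`, the reviewer names, and the added column names are hypothetical and may not match what prepare_review_table produces.

```python
import pandas as pd


def make_review_table(df, reviewers):
    """Assign reviewers in round-robin order and add empty columns for feedback."""
    table = df.copy().reset_index(drop=True)
    table["reviewer"] = [reviewers[i % len(reviewers)] for i in range(len(table))]
    table["reviewer_label"] = ""  # To be filled in by the reviewer.
    table["comments"] = ""        # Free-text notes from the reviewer.
    return table
```

Round-robin assignment spreads the workload evenly, with reviewer counts differing by at most one row.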
- Balanced Dataset: Saved in output_files/balanced_dataset.csv.
- Classified Review Table: Saved in output_files/classified_review_set.csv.
To classify patents for a different industry:
- Update the industry variable with the new category name.
- Add relevant keywords to the industry_keywords dictionary in the notebook.
- Ensure that patent_classification.json includes few-shot examples for the chosen industry.
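Before running the pipeline for a new industry, a quick sanity check can catch a missing keyword list or missing few-shot examples. This sketch assumes both `industry_keywords` and the parsed contents of patent_classification.json are dictionaries keyed by industry name; that structure, the function name `check_industry_setup`, and the sample entries are assumptions for illustration.

```python
def check_industry_setup(industry, industry_keywords, few_shot_by_industry):
    """Return a list of problems found for the chosen industry (empty if none)."""
    problems = []
    if not industry_keywords.get(industry):
        problems.append(f"no keywords defined for {industry!r}")
    if not few_shot_by_industry.get(industry):
        problems.append(f"no few-shot examples for {industry!r}")
    return problems


# Hypothetical entries; replace with the notebook's real data.
industry_keywords = {"Autonomous Vehicles": ["lidar", "self-driving"]}
few_shot_by_industry = {"Autonomous Vehicles": [{"abstract": "...", "label": "Yes"}]}
print(check_industry_setup("Quantum Computing", industry_keywords, few_shot_by_industry))
```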
- API rate limits may slow processing; adjust the rpm parameter accordingly.
- Manual keyword updates are needed for new industries.
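To make the effect of a requests-per-minute setting concrete, here is a minimal sketch of a throttle that spaces out calls evenly. The class name `RateLimiter` and its interface are assumptions for this example; the notebook's own handling of its rpm parameter may differ.

```python
import time


class RateLimiter:
    """Space out calls so that at most `rpm` of them happen per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.min_interval = 60.0 / rpm  # Seconds required between two calls.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous call, if any.

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each LLM request keeps the request rate under the configured limit; a lower rpm means longer pauses and a slower overall run.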