SP-BTT-Patent-Classification

Overview

This project uses large language models (LLMs) to classify patent abstracts by industry relevance, in support of financial analysis. It combines few-shot examples with keyword matching to refine predictions, and it includes utilities for building balanced training, validation, and test datasets.

The repository contains code to process, filter, and classify patent abstracts from a public dataset, enabling the identification of patents relevant to various financial and technological industries.

Features

Patent Dataset Handling: Loads, filters, and processes the public patents dataset.
Few-Shot Learning: Incorporates few-shot examples to enhance classification prompts.
Custom Industry Classification: Classifies patents based on industry-specific keywords.
Balanced Dataset Creation: Generates datasets with equal positive and negative samples.
Reviewer Allocation: Prepares review tables for manual verification and expert review.

Installation

Prerequisites

Python: Version 3.8 or above.
Google Colab or Jupyter Notebook: For running the notebook.

Install the following Python libraries:

pip install datasets google-generativeai pandas numpy matplotlib

Folder Structure

required_files:
- Contains all the files required for the project:
  - patent_classification.json: Few-shot examples for classification.
  - financial_topics.txt: List of financial topics and associated keywords.
  - Any other supporting files used in the notebook.
notebooks:
- Contains the main Jupyter Notebook (code_pipeline.ipynb) to execute the pipeline.
output_files:
- Where classified datasets and review tables are saved.

Usage

Step 1: Update File Paths

Ensure that file paths in code_pipeline.ipynb point to the required_files folder. For example:

file_path = 'required_files/financial_topics.txt'
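Before running the pipeline, it can help to confirm that the expected inputs are actually present. The sketch below is a minimal check, assuming the file names listed in this README sit directly under required_files. The helper find_missing_files is hypothetical and is not part of the notebook.

```python
from pathlib import Path

# File names taken from this README; adjust if your copy differs.
REQUIRED_FILES = ["patent_classification.json", "financial_topics.txt"]


def find_missing_files(base_dir, names=REQUIRED_FILES):
    """Return the names that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]
```

Calling find_missing_files("required_files") returns an empty list when everything is in place, so a non-empty result points to a path that still needs updating.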

Step 2: Run the Notebook

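Open code_pipeline.ipynb in Jupyter Notebook or Google Colab and execute the cells sequentially. The settings described in Steps 3 and 4 live in notebook cells, so set them before running the cells that depend on them.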

Step 3: Configure the API

Set up your Google Generative AI API key in the notebook. In Google Colab, store it as a secret named GOOGLE_API_KEY and read it with:

from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
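The snippet above relies on google.colab, which is only importable inside Colab. For a local Jupyter session, one option is to fall back to an environment variable. This is a sketch rather than the notebook's own code: the helper load_api_key is hypothetical, and it assumes the key is exported as GOOGLE_API_KEY.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Read the API key from Colab secrets if available, else from the environment."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # Secret not defined or notebook access not granted; fall through.
            key = None
    if not key:
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

The returned value can then be passed to genai.configure(api_key=...) from the google.generativeai package.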

Step 4: Specify Industry

Update the industry variable in the notebook with the desired category:

industry = "Autonomous Vehicles"

Step 5: Generate Classifications

Execute the classify_patents function to classify patent abstracts:

classified_data = classify_patents(total_abstracts_df_sets, industry, keywords, few_shot_examples)

Step 6: Save Results

Use the save_balanced_csv function to create a balanced dataset:

balanced_data = save_balanced_csv(classified_data, "output_files/balanced_dataset.csv", industry)

Generate a review table for manual validation:

review_table = prepare_review_table(classified_data)
review_table.to_csv('output_files/classified_review_set.csv', index=False)

Files in `required_files`

patent_classification.json: Contains few-shot examples categorized by industries.
financial_topics.txt: A list of financial topics and their associated keywords.
public_patents_dataset.pkl: Pickled dataset of processed patent abstracts.

Key Functions

`generate_prompt`

Generates a prompt for the LLM using few-shot examples and keywords.
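The notebook's actual signature and the schema of patent_classification.json are not shown in this README, so the following is only an illustrative sketch of how a few-shot prompt might be assembled. The function name build_prompt and the example keys "abstract" and "label" are assumptions.

```python
def build_prompt(abstract, industry, keywords, few_shot_examples):
    """Assemble a Yes/No classification prompt from keywords and labeled examples."""
    lines = [
        f"Decide whether the patent abstract is relevant to the industry: {industry}.",
        "Answer with a single word: Yes or No.",
    ]
    if keywords:
        lines.append("Keywords associated with this industry: " + ", ".join(keywords))
    lines.append("")
    # Each example is assumed to look like {"abstract": "...", "label": "Yes"}.
    for example in few_shot_examples:
        lines.append(f"Abstract: {example['abstract']}")
        lines.append(f"Answer: {example['label']}")
        lines.append("")
    # The final block is left open so the model completes the answer.
    lines.append(f"Abstract: {abstract}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Ending the prompt with an open "Answer:" line mirrors the format of the examples, which tends to keep the reply short and easy to parse.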

`predict_industry`

Uses the LLM to classify a patent abstract as relevant (Yes) or not (No) for a given industry.
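Model replies are free text, so the Yes/No decision usually needs some normalization. The sketch below shows one way to do that, with the model call passed in as a plain callable so the parsing can be exercised without an API key. Both parse_yes_no and predict_with are hypothetical names, not functions from the notebook.

```python
def parse_yes_no(text):
    """Map a free-text model reply to 'Yes', 'No', or None if unclear."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return "Yes"
    if words[0] == "no":
        return "No"
    return None


def predict_with(ask_model, prompt):
    """Call ask_model(prompt) and normalize the reply; unclear replies become None."""
    return parse_yes_no(ask_model(prompt))
```

With the google-generativeai package, ask_model could be a small wrapper such as lambda p: model.generate_content(p).text, where model is a genai.GenerativeModel instance. That wiring is an assumption about how the notebook is set up, not a description of it.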

`classify_patents`

Processes the entire dataset, filters by keywords, and classifies abstracts into positive and negative categories.
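The README does not say whether the keyword filter matches substrings or whole words. As one possible approach, the sketch below uses case-insensitive whole-word matching, which avoids hits such as "lidar" inside "solidarity". The function names are hypothetical and operate on plain strings rather than the notebook's DataFrames.

```python
import re


def matches_keywords(abstract, keywords):
    """Return True if any keyword occurs in the abstract as a whole word or phrase."""
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, abstract, flags=re.IGNORECASE):
            return True
    return False


def filter_by_keywords(abstracts, keywords):
    """Keep only abstracts that mention at least one keyword."""
    return [text for text in abstracts if matches_keywords(text, keywords)]
```

Filtering before calling the LLM reduces the number of API requests, at the cost of missing relevant patents that use none of the listed keywords.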

`save_balanced_csv`

Creates a balanced dataset with equal positive and negative samples.
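A common way to get equal class sizes is to downsample every class to the size of the smallest one. The following is a minimal pandas sketch of that idea, not the notebook's save_balanced_csv itself; the function name balance_by_label and the column name "label" are assumptions, and the input is expected to be non-empty.

```python
import pandas as pd


def balance_by_label(df, label_col="label", seed=0):
    """Downsample each class to the size of the smallest class."""
    n = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so positives and negatives are interleaved in the output.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```

The result can be written out with to_csv(path, index=False). Fixing the seed keeps the sampled subset reproducible between runs.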

`prepare_review_table`

Prepares a table for manual review, assigning reviewers and adding space for comments.
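How reviewers are assigned is not specified in this README. One simple scheme is round-robin assignment, sketched below with pandas. The function name assign_reviewers and the column names "reviewer", "reviewer_label", and "comments" are illustrative assumptions.

```python
from itertools import cycle, islice

import pandas as pd


def assign_reviewers(df, reviewers):
    """Return a copy of df with round-robin 'reviewer' and empty review columns."""
    if not reviewers:
        raise ValueError("at least one reviewer is required")
    table = df.copy()
    # cycle() repeats the reviewer list; islice() cuts it to the number of rows.
    table["reviewer"] = list(islice(cycle(reviewers), len(table)))
    table["reviewer_label"] = ""
    table["comments"] = ""
    return table
```

Round-robin keeps each reviewer's workload within one row of the others. Working on a copy leaves the classified data unchanged.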

Outputs

Balanced Dataset:
- Saved in output_files/balanced_dataset.csv.
Classified Review Table:
- Saved in output_files/classified_review_set.csv.

Customization

To classify patents for a different industry:

Update the industry variable with the new category name.
Add relevant keywords to the industry_keywords dictionary in the notebook.
Ensure that patent_classification.json includes few-shot examples for the chosen industry.
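The checklist above can be verified in code before starting a long classification run. The sketch below assumes that both industry_keywords and the loaded contents of patent_classification.json are dictionaries keyed by industry name; that layout is a guess, and check_industry_setup is a hypothetical helper.

```python
def check_industry_setup(industry, industry_keywords, few_shot_examples):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not industry_keywords.get(industry):
        problems.append(f"no keywords defined for '{industry}'")
    if not few_shot_examples.get(industry):
        problems.append(f"no few-shot examples found for '{industry}'")
    return problems
```

An empty or missing entry is reported the same way as an absent key, since either would leave the prompt without industry-specific guidance.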

Limitations

API rate limits may slow processing; adjust the rpm (requests per minute) parameter to stay within your quota.
Manual keyword updates are needed for new industries.
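How the notebook applies its rpm parameter is not shown here. As an illustration of the general idea, the sketch below spaces calls at least 60/rpm seconds apart. The RateLimiter class is hypothetical; its clock and sleep arguments exist only so the behavior can be tested without real waiting.

```python
import time


class RateLimiter:
    """Space out calls so that consecutive calls are at least 60/rpm seconds apart."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling wait() immediately before each LLM request keeps the request rate at or below the chosen rpm. A lower rpm lengthens the run but reduces the chance of quota errors.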
