This is a resume parser application built using Streamlit, Spacy, Plotly, and Google Serp API. It helps in identifying the job title, skills, and experience of candidates from their resumes. Additionally, it provides a job search feature based on the job title and location using Google Serp API.
Follow the steps below to set up and run the application:
Clone this repository to your local machine using the following command:
git clone <repository-url>switch to the main directory
cd Resume-Reader
pip install -r requirements.txtstreamlit run main.pyWe have fine-tuned two models namely roberta-base for Named Entity Extraction and bert-base for generation of job title after getting the skills as keywords in input.
| Model Name | Training Time | GPU used | Inference Speed |
|---|---|---|---|
| roberta-base | 1 hour | Tesla-T4 (Colab) | 9.4 seconds |
| bert-base | 1 hour | P-100 (Kaggle) | 3.2 seconds |
This files defines a set of functions and configurations for parsing and processing text data from various formats such as PDF, DOCX, and HTML. Below is an explanation of each component:
-
parse_pdf(pdf_content):
- Uses PyMuPDF (fitz) library to parse text content from a PDF file.
- It iterates through each page of the PDF document, extracts text, and concatenates it.
-
extract_text_from_resume(html_content):
- Parses HTML content using BeautifulSoup.
- Extracts text from HTML elements and preprocesses it by removing extra whitespaces.
-
parse_docx(docx_content):
- Utilizes the python-docx library to extract text content from a DOCX file.
- It reads each paragraph from the document and concatenates them into a single text string.
-
preprocess_text(text):
- Removes extra whitespaces from the input text.
-
process_text(text):
- Splits the text into sentences based on full stops.
- Concatenates the sentences with a newline after each.
- colors: Defines a dictionary containing colors for different named entities.
- options: Configuration for entity visualization using spaCy's
displacymodule.- Specifies named entities (e.g., "JOB TITLE", "SKILLS") and their associated colors.
- PyMuPDF (fitz): For PDF parsing.
- BeautifulSoup (bs4): For HTML parsing.
- python-docx: For parsing DOCX files.
- spacy: For named entity recognition and visualization.
- plotly: For generating interactive visualizations like graphs and charts.
This code defines a Streamlit application for parsing and visualizing resume data, as well as performing job searches based on extracted information. Below is an explanation of each component:
- streamlit: Framework for building interactive web applications.
- spacy: For named entity recognition and visualization.
- plotly: For generating interactive visualizations like graphs and charts.
- pandas: For data manipulation and analysis.
-
Introduction Page:
- Displays a welcome message and an about section in the sidebar.
- Provides a file uploader to upload resumes in PDF, DOCX, or HTML format.
- Processes the uploaded file to extract entities (Job Title, Skills, etc.).
- Visualizes the extracted entities using spaCy's displacy module.
-
Process Uploaded File:
- Parses the uploaded file based on its format (PDF, DOCX, HTML).
- Utilizes spaCy for named entity recognition on the parsed text.
- Renders the parsed text and named entities using Streamlit components.
-
Visualization Page:
- Displays visualizations of extracted entities from the resume.
- Generates a sunburst chart using Plotly Express to visualize entity categories and values.
-
Search Jobs Page:
- Allows users to search for job opportunities based on extracted skills or job titles.
- Users can enter a query (job title or keyword) and location for the job search.
- Performs a job search using predefined functions from the 'jobs' module.
- Logs the job search event.
- File Logging: Logs events such as resume upload, entity extraction, visualization, and job searches to a log file named 'resume_parser.log'.
- Provides information about the time taken for model loading and inference.