The Problem:
An Indianapolis-based nonprofit that regulates student athletes receives student transcripts from high schools around the country. The transcripts contain the required information, but in thousands of different formats. As a result, employees had to enter the data into a database manually, a time-consuming and error-prone process that was costing the organization hundreds of man-hours per week. Other local companies had been tasked with parsing the data but were ultimately unsuccessful.
The Goal: Parsing the Data
Systematically identify courses and grades across academic years, regardless of transcript format, and parse the data from all transcripts into a standard format that can eventually be entered into the client’s transcript system. This reduces manual time and cost.
How We Solved It:
- Robosource used a text extraction library to pinpoint the x and y coordinates of each course name, course ID, and course grade.
- We then captured the coordinates of each data point within each document format.
- After parsing the data, we pre-scrubbed it before loading it into the database. Until now, those steps had to be done manually.
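
As a rough illustration of the coordinate-based approach described above, here is a minimal Python sketch. It is not Robosource's implementation: the `Word` type, the `TEMPLATE` x-ranges, and the function names are all hypothetical, and the input words stand in for whatever a text extraction library would return for one page. The idea is that each transcript format gets a template mapping fields to horizontal ranges; words are grouped into rows by y, assigned to fields by x, and then scrubbed before loading.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal position of the word on the page
    y: float  # vertical position of the word on the page


# Hypothetical template for ONE transcript format: each field owns an x-range.
TEMPLATE = {
    "course_id": (50, 120),
    "course_name": (120, 400),
    "grade": (400, 460),
}


def extract_rows(words, template, y_tolerance=2.0):
    """Group words into rows by y, then assign each word to a field by x."""
    rows = defaultdict(lambda: defaultdict(list))
    for w in words:
        row_key = round(w.y / y_tolerance)  # words with nearly equal y share a row
        for field, (x_min, x_max) in template.items():
            if x_min <= w.x < x_max:
                rows[row_key][field].append((w.x, w.text))
                break  # words outside every range are simply ignored
    return [
        {f: " ".join(t for _, t in sorted(parts)) for f, parts in fields.items()}
        for _, fields in sorted(rows.items())
    ]


def scrub(row):
    """Normalize a raw row; return None if a required field is missing."""
    required = ("course_id", "course_name", "grade")
    if not all(row.get(f) for f in required):
        return None
    return {
        "course_id": row["course_id"].strip().upper(),
        "course_name": " ".join(row["course_name"].split()),
        "grade": row["grade"].strip().upper(),
    }


# Made-up sample words, as if extracted from one page of a transcript.
words = [
    Word("MTH101", 55, 100.0), Word("Algebra", 125, 100.3),
    Word("I", 170, 100.0), Word("a-", 410, 100.0),
    Word("ENG201", 55, 114.0), Word("English", 125, 114.0),
    Word("Literature", 175, 114.0), Word("B+", 410, 114.0),
    Word("Page", 500, 114.0),       # outside every field range
    Word("Semester", 125, 128.0),   # row with no course ID or grade
    Word("1", 180, 128.0),
]

records = [r for r in map(scrub, extract_rows(words, TEMPLATE)) if r is not None]
for record in records:
    print(record)
```

In this sketch, supporting a new transcript format only means adding another template; the row grouping and scrubbing logic stay the same. A real pipeline would also need to handle multi-page documents, wrapped course names, and header rows.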
Results:
- Robosource demonstrated the ability to automate the process of identifying and capturing transcript data across thousands of formats. Phase 2 of this project will save the company hundreds of thousands of dollars annually.
- We are also beginning to apply machine learning to previously processed transcripts, so that when the algorithm recognizes a format, it assigns the transcript to the correct template.
- Future steps are to build an API to the database and to use Optical Character Recognition (OCR) to capture information from scanned transcripts and images.
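
To make the format-recognition step above more concrete, here is a small Python sketch of routing a transcript to a template. It is a simple similarity-based stand-in, not a trained machine learning model and not the project's actual algorithm: it compares the header words on a new transcript against hypothetical signatures of known formats using Jaccard overlap. The format names, signatures, and threshold are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical signatures: header tokens seen on previously processed formats.
KNOWN_FORMATS = {
    "format_a": {"course", "id", "title", "grade", "credit"},
    "format_b": {"subject", "code", "mark", "term", "units"},
}


def match_format(header_tokens, known_formats, threshold=0.5):
    """Return the best-matching format name, or None if nothing is close enough."""
    tokens = {t.lower() for t in header_tokens}
    best_name, best_score = None, 0.0
    for name, signature in known_formats.items():
        score = jaccard(tokens, signature)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


print(match_format(["Course", "ID", "Title", "Grade"], KNOWN_FORMATS))
print(match_format(["Student", "Name"], KNOWN_FORMATS))
```

Returning `None` for an unrecognized layout lets such transcripts fall back to manual review instead of being forced into the wrong template.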

