Inspiration

The inspiration for the project was to enable users to find relevant information quickly and easily without typing the exact spelling and sequence of phrases. If the user wants to know about something but does not remember the exact way to refer to the information in the document then the user can use our tool and find the relevant information quickly and easily.

What it does

The project takes a file as input (supports .pdf, .txt, .docx, .jpg, .png). The user enters a prompt (something they want to find inside the document). We then search for the most relevant information using our control f algorithm, which applies natural language processing for data cleanup and extraction of hot words. The output is sent to the frontend and displayed.

How we built it

We used the Flask framework for frontend and backend integration, with HTML and CSS for the frontend design. We take the file as input from the user, then check the file type using the magic library. Based on the file type, we send it to a text extraction function tailored to that format. The extracted text is sent, along with the prompt (user query), to the control f algorithm.
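The dispatch step can be sketched roughly as below. This is a minimal illustration, not our exact code: it assumes the python-magic package (which wraps libmagic) for MIME detection, and the extractor functions are hypothetical placeholders for the real per-format extraction logic.

```python
# Hedged sketch: MIME-based routing to per-format text extractors.
# python-magic is optional here; a caller may also pass the MIME type directly.
try:
    import magic  # python-magic: libmagic bindings for file-type detection
except ImportError:
    magic = None

def extract_plain(data):
    # Trivial extractor for plain text files.
    return data.decode("utf-8", errors="ignore")

def extract_pdf(data):
    # Placeholder: the real project uses a PDF text-extraction library here.
    raise NotImplementedError("PDF extraction not shown in this sketch")

EXTRACTORS = {
    "text/plain": extract_plain,
    "application/pdf": extract_pdf,
    # .docx, .png, and .jpg would map to their own extractors (OCR for images)
}

def extract_text(data, mime=None):
    """Detect the MIME type (if not given) and route to the matching extractor."""
    if mime is None and magic is not None:
        mime = magic.from_buffer(data, mime=True)
    extractor = EXTRACTORS.get(mime)
    if extractor is None:
        raise ValueError("unsupported file type: %s" % mime)
    return extractor(data)
```

The returned text, together with the user's query, is what the control f algorithm below operates on.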

The control f algorithm has three major components based on NLP (natural language processing). It returns the sentences from the text that best match the query.

  1. Data Cleanup of text and query

Both the input text and the query were cleaned to remove special unicode characters (e.g. \n), and lemmatization was performed. Lemmatization converts words to their root form: "running" becomes "run", "penguins" becomes "penguin".
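A toy version of this cleanup step is sketched below. The hand-written LEMMAS table is a stand-in for a real lemmatizer (e.g. NLTK or spaCy; the writeup does not name the library we used), included only so the sketch runs on its own.

```python
import re

# Toy stand-in for a real lemmatizer; covers only the examples from the text.
LEMMAS = {"running": "run", "penguins": "penguin"}

def clean(text):
    """Strip \n-style characters, normalize whitespace, and lemmatize."""
    text = re.sub(r"[\n\t\r]+", " ", text)       # remove special characters
    text = re.sub(r"\s{2,}", " ", text).strip()  # collapse repeated spaces
    words = [w.lower() for w in text.split()]
    return [LEMMAS.get(w, w) for w in words]     # map words to root forms
```

Both the document text and the query go through the same cleanup so that later word comparisons line up.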

  2. Getting Hot/Key Words

This next step involves getting the important words (keywords) from the query. The KeyBERT tool is used to determine all the keywords from the text, and that list is then filtered down to the words that appear in the query. (For example, if the keywords found are "learning" and "python" and the query was "learning about stuff", the only keyword left would be "learning".) Keywords cannot be stop words (e.g. "the", "a", "is"). Each keyword has a correlation score, e.g. ("learning", 0.7). The scores are artificially raised so that the highest-scoring keyword has a value of 1, with the other scores shifted up by the same amount; this will matter later. E.g. [("learning", 0.7), ("object", 0.5)] -> [("learning", 1), ("object", 0.8)].
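The score-raising rule implied by the example (every score shifted up by the same amount so the top keyword reaches exactly 1) can be written as a small helper:

```python
def boost_scores(keywords):
    """Shift every keyword score up so the highest one becomes exactly 1.
    keywords is a list of (word, score) pairs as produced by a keyword tool."""
    if not keywords:
        return []
    bump = 1.0 - max(score for _, score in keywords)
    return [(word, round(score + bump, 4)) for word, score in keywords]
```

With the example from above, [("learning", 0.7), ("object", 0.5)] becomes [("learning", 1.0), ("object", 0.8)].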

  3. Selection algorithm

Word2Vec is used to determine the similarity between words in the input text. It does this by looking at how close words appear to each other in the text and learning vector representations of them. We used the Skip-Gram version of Word2Vec, which learns a word's representation from the 5 context words surrounding it (how close they are in a sentence). The resulting similarity score (cosine similarity between word vectors) is used to decide whether or not to select a sentence. The selection algorithm runs on each sentence of the input text: every word in the sentence is compared to each of the keywords found in the previous phase. In short, we are trying to find the similarity between each keyword and the sentence. To do that, we compare each keyword to each word in the sentence using Word2Vec and record the highest similarity score for that keyword in the sentence. If the keyword itself appears in the sentence, its similarity score is 1 (e.g. if the hot word "learn" appears in the sentence "the human learn the task").

Hot words assigned a lesser importance in the previous step have an adjusted similarity score applied to them. In the context of the last example, the similarity score between "object" and a word in a sentence is multiplied by 0.8, and 0.2 is added to the result. This effectively gives "object" less significance in the next step.

The final step is calculating the average of all the best similarity scores for each hot word. This average is treated as the similarity of all the hot words to the sentence. If the average is above a certain threshold (0.9 for one keyword, 0.8 for two keywords, 0.7 for three or more; determined arbitrarily by what looked good), the sentence is significant enough according to our algorithm and is added to the output.

The code returns all the sentences found to be significant, forming the intelligent control f algorithm.

Challenges we ran into

Installing all the libraries across different computers was a challenge: we sometimes ran into dependency conflicts, or the installation exited due to a failed subprocess. We then had to fix the library's path and add it to the environment variables. There were also issues with GitHub, including but not limited to merge conflicts and stashing changes; one of our teammates had to re-clone the whole repository to resolve a conflict.

With Word2Vec, there was an issue of the similarity results being inaccurate (uniformly very high numbers). We originally tried to adjust the code to better handle these high numbers, but after looking at Stack Overflow posts and documentation we realized this was due to a lack of training epochs for Word2Vec. Another challenge was simply finding the right NLP tools. One of our group members had a lot of prior knowledge in NLP, but still had to search the internet for algorithms applicable to the problem.

Accomplishments that we're proud of

One thing we're proud of is being able to return search results that don't use all of the query words. Since the crux of our algorithm was flexibility, this was a huge success. In many results, the reasoning behind what the AI did can be approximated, which is nice. We're also proud that we were able to determine keywords in queries, allowing certain bits of information to be ranked as more relevant.

What we learned

We learned a lot about the NLP tools used to solve problems like this (such as KeyBERT and Word2Vec) and how they work. We also learned more about how to create a full-stack Flask app, and about the libraries needed to extract text.

We also learned about optical character recognition, which allowed us to extract characters using pytesseract, and how to have effective back-and-forth communication between the backend and frontend.

What's next for Intelligent Ctrl F

  1. Allowing more file types as input. We have confirmed .txt and .pdf work. We have put work into allowing .docx, .png, and .jpg, but that work isn't finalized, so we're hoping to complete it. We also hope to add text-to-speech.
  2. More testing. The more edge cases we test and the more we tune the code's parameters, the better the accuracy of the search results. We hope to do more testing to make our algorithm more intelligent.
  3. Returning more than just sentences. It would be nice to break the text into parts of sentences, e.g. splitting on commas and newlines, and see how that works.