'The Fountain of Knowledge' - A-Level NEA

A Search Engine and Summarisation tool for Students and Academics

(My A-Level NEA Project)


Playlist for Testing Videos

A-Level Computer Science NEA [75/75, 100%] • Apr 2022 - Feb 2023 • Programmed an academic search engine using web scraping and frequency analysis. Built in Python with a modular, functional structure, a fully integrated SQL database and a professional GUI. Also communicated the design process through comprehensive documentation.

Copyright © 2022-2024 Oscar Ryley. All rights reserved



Pre-Development Introduction

My idea for an A-level NEA project is a search engine tool or application to help with internet research and summarisation for scholarly reading. The title of this project is “The Fountain of Knowledge”.

       Internet research and learning through reading can be time consuming and confusing. My project would be used to look up information at different levels of complexity, starting with a brief description of the subject and some basic information about it, and going up to further resources aggregated by the algorithm. It would also be used to summarise the information that it searches for, as well as other text documents input by the user.

       Research and learning through the different websites and reading sources makes up a lot of my working life as a student. My dream for this project is to be able to collate and summarise all of the necessary information and further reading needed to learn more about any subject the user were to enter.

       The end users will be academics, students (especially of History, English and other humanities), and anyone who needs to conduct internet research on anything from a broad subject area to a specific event or item. Another problem this project could ease is reading large amounts of information, by providing definitions, descriptions and different analyses of the text. I will get input from colleagues who are students studying subjects that require research, especially those who primarily use internet research.

       The project product would be a user-friendly application that takes input through a text-box search feature. This would be used to research and gather relevant information through web scraping (in Python), primarily of Wikipedia and its linked sources to begin with, in order to collect large amounts of data to be stored in SQL databases and operated upon (again in Python) to conduct frequency and linguistic analysis. Code will be programmed within IDEs, and exe files could be produced so that the application can run without an IDE.
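The frequency-analysis step of that pipeline can be sketched as follows. This is a minimal illustration working on a plain text string; in the real project the text would first be scraped from the web, and the function name here is a hypothetical one, not necessarily the project's own:

```python
import re
from collections import Counter

def create_frequency_list(text: str) -> list:
    """Count word occurrences in a passage and return (word, count)
    pairs sorted by descending frequency. Illustrative sketch only;
    the project's real pipeline scrapes the text via get requests
    before analysing it."""
    # Lower-case the text and pull out alphabetic word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Counter.most_common gives (word, count) pairs, highest count first.
    return Counter(words).most_common()

sample = "The fountain of knowledge is a fountain of learning."
print(create_frequency_list(sample)[:2])  # → [('fountain', 2), ('of', 2)]
```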

       My project will meet the complexity criteria for an A-level project, as it will include the following items from Section A of the technical complexity table in the NEA specification. A complex data model in a database will be used within the data collection and storage section of the project, with linked SQL tables used to reduce complexity. A merge sort will also be performed on the data as part of the frequency-analysis approach to linguistic analysis. The summarisation area of the project could be defined by a user-defined algorithm, as the algorithm would be updated and changed to best fit the needs of the test base of end users.
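The merge sort mentioned above could, for example, order frequency data by count. The following is an illustrative sketch under that assumption, not the project's actual implementation:

```python
def merge_sort(pairs: list) -> list:
    """Merge sort on (word, count) pairs, ordering by count descending.
    Hypothetical sketch of the merge sort described in the text."""
    if len(pairs) <= 1:
        return pairs
    mid = len(pairs) // 2
    # Recursively sort each half, then merge the two sorted halves.
    left, right = merge_sort(pairs[:mid]), merge_sort(pairs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][1] >= right[j][1]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([("a", 1), ("b", 3), ("c", 2)]))
# → [('b', 3), ('c', 2), ('a', 1)]
```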


Thoughts on the Project Development Process

Personal reflection after an interview with my end user

All of the objectives and requirements of my end user have been met, with particular praise for the efficiency of the outcome, as it automates the first part of the research process. Particular highlights were the frequency-list summary, which helps the user judge whether a source would be useful for a specific research query, and the easy-to-use link buttons next to each source.

       The point was raised that saves add to the efficiency of long projects, loading data in a fraction of a second rather than the roughly estimated 10-15 second load time of a full web-scraping search. The time taken for a search that makes a large number of get requests (roughly 30 or so) is understandable; the database element helps to negate some of this downside, especially in big research projects.
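The saves mechanism described above can be illustrated with a minimal SQLite sketch. The table layout and function names here are assumptions for illustration, not the project's actual schema:

```python
import sqlite3

def save_search(conn, query: str, urls: list) -> None:
    """Store scraped result URLs so later loads can skip the slow
    web scrape. Hypothetical schema, not the project's own."""
    conn.execute("CREATE TABLE IF NOT EXISTS saves (query TEXT, url TEXT)")
    conn.executemany("INSERT INTO saves VALUES (?, ?)",
                     [(query, url) for url in urls])

def load_search(conn, query: str) -> list:
    """Reload previously saved URLs in a fraction of a second."""
    rows = conn.execute("SELECT url FROM saves WHERE query = ?", (query,))
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
save_search(conn, "fountains", ["https://en.wikipedia.org/wiki/Fountain"])
print(load_search(conn, "fountains"))
# → ['https://en.wikipedia.org/wiki/Fountain']
```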

       The suggested extension, if the project were taken further, of more summarisation features, such as highlighting different types of words or extending the summarisation algorithm, is a good point to raise, and I will explore further in my conclusion what could be improved if the project were extended.


What could be improved?

The main point of extension brought up within the interview and discussion with my primary user was more features being added to the Summarisation aspect of the project. As a proof of concept and an extension I have written up what the beginnings of this could look like within my frequency_analysis.py module:

def find_nouns(freq_list: list):
    """Function: find_nouns
    @params
    freq_list: a frequency list (as created by the function create_frequency_list)

    Returns a list of all of the nouns from a frequency list.
    """
    nouns = []
    # Load the reference list of every English noun from a text file.
    nouns_list = open_file("C:/Users/Oscar/Documents/NEA/NEA/Data/nouns.txt")
    for i in freq_list:
        # i[1] is the word in each frequency-list entry.
        if i[1] in nouns_list:
            nouns.append(i)
    return nouns

I have used another text file, which lists every noun in the English language, to find which words in the list are nouns. This may not be maximally efficient, but it is a workable solution. To signify the type of each word in the frequency list, a third index could be added to the 2D array, holding a variable known as word_type. How this summary data is displayed to the user is also another possible extension, with the GUI element possibly using colour to make clear to the user the nouns, verbs and adjectives, for example, in an extract.
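A sketch of that word_type extension might look like the following. The word sets and names here (NOUNS, VERBS, tag_word_types) are tiny illustrative stand-ins for the project's full word-list files, and entries are assumed to be [count, word] pairs:

```python
# Hypothetical stand-ins for the full nouns.txt / verbs.txt word lists.
NOUNS = {"fountain", "knowledge"}
VERBS = {"is", "flows"}

def tag_word_types(freq_list: list) -> list:
    """Extend each [count, word] entry with a third index, word_type,
    as suggested in the text. Illustrative sketch only."""
    tagged = []
    for count, word in freq_list:
        if word in NOUNS:
            word_type = "noun"
        elif word in VERBS:
            word_type = "verb"
        else:
            word_type = "other"
        tagged.append([count, word, word_type])
    return tagged

print(tag_word_types([[2, "fountain"], [1, "is"], [1, "a"]]))
# → [[2, 'fountain', 'noun'], [1, 'is', 'verb'], [1, 'a', 'other']]
```

The GUI layer could then map each word_type to a colour when rendering an extract.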

       One other small point for extension could be an algorithmic approach to search-query suggestions for the search box feature, which was briefly mentioned by some prospective users within the interview stage. However, this would be difficult to implement and may require much of the algorithmic workload to be handed to an external module that deals with word similarity, given that Google’s search results already do the same job.

       Given further research into this specific objective and how it could be solved using Google’s computed search suggestions, an approach using web scraping could also work. This would likely require the use of JSON files, create more data and clutter within the program itself, and slow the program down to an extent even when no searches are being made, as more frequent Google web scrapes would need to take place.
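The JSON-handling side of such an approach might look like the sketch below. It assumes the suggestion endpoint returns a body in the common ["query", [suggestions...]] shape; the exact response format would need checking against whatever endpoint was actually scraped:

```python
import json

def parse_suggestions(payload: str) -> list:
    """Parse the JSON body returned by a search-suggestion endpoint.
    Assumes a ["query", [suggestions...]] shape, which is an
    assumption, not a verified format."""
    query, suggestions = json.loads(payload)[:2]
    return suggestions

sample = '["fountain", ["fountain of knowledge", "fountain pen"]]'
print(parse_suggestions(sample))
# → ['fountain of knowledge', 'fountain pen']
```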

       The increased number of Google web scrapes could also become a problem, given an issue found earlier in this project where making too many get requests in a short amount of time, or in a robotic manner, caused a temporary IP block. Search-suggestion updates would therefore have to be infrequent if this were taken forward as an objective for future extension.
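One way to keep requests suitably infrequent is a simple throttle that enforces a minimum delay between get requests. This is a hedged sketch; the interval value is an illustrative guess, not a documented Google limit:

```python
import time

class Throttle:
    """Enforce a minimum delay between get requests so scrapes look
    less robotic. Illustrative sketch; the interval is a guess."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep only if the previous request was too recent.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=2.0)
# Call throttle.wait() immediately before each get request.
```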


What went well?

The project is able to successfully help with internet research through web scraping and summarisation algorithms (in the form of word frequency) for scholarly reading. It succeeds at separating the sources served to the academic target end user into four different levels of complexity: basic, intermediate, journalistic articles and academic papers. The issue of aggregating sources was solved through human (primary user) selection of useful and valuable sources.

       Features like the simple GUI design and easy-to-understand data output, along with the word frequency list produced, help the end user judge whether the sources provided would be useful for their research project and are therefore worth reading further. Easy access to the broad range of sources via the link buttons makes this selection and aggregation even easier to use and further aids academic research.

       All of my objectives have been met, as evidenced through the processes and testing earlier in this project. The feedback from the end user stressed the utility of the outcome for its initial use case, and all of their requirements have been satisfied; their recommendations for further exploration do not impede the system's current usefulness. I believe that, overall, I have succeeded in producing the outcome which I set out to at the beginning of this project, and that it is therefore a useful program for its target audience of academics.

