Inspiration

In this decade, beyond pure technological advancement, major tech companies are looking to sales and marketing to boost their revenue and industry recognition. By incorporating Machine Learning and Artificial Intelligence, we hope to help sales personnel interpret their customers' words and take the optimal next steps to advance a sale.

Architecture

Our project revolves around a backend and a frontend.

Backend

Our backend uses Flask to (1) receive data from React via a 'POST' route, (2) pass data between models, and (3) return the results to React.

In addition, as several of our models have conflicting dependency requirements, we decided to isolate each model in its own virtual environment, each with its own Flask app:

  1. WhisperX model (Entry point for backend) -> hosted on localhost:5000
  2. Snips-NLU and BERT -> hosted on localhost:5001

(1) We use request.files['audio'] to retrieve the audio data sent with multipart/form-data encoding before passing it to WhisperX for transcription.

(2) The transcribed text is passed via requests.post to localhost:5001, where the snips-nlu and BERT models are hosted. As the two models are independent of each other, we use multiprocessing to run them concurrently and shorten the processing time.

(3) Finally, a JSON response is passed back to localhost:5000 and, subsequently, to the React frontend.
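The concurrent step on localhost:5001 can be sketched as follows. `run_snips` and `run_bert` are stubs standing in for the real model calls; the point is that the two independent models each run in their own worker process and their results are merged into the one JSON payload returned to localhost:5000.

```python
# Sketch of running the two independent models concurrently.
# run_snips / run_bert are stubs for the real snips-nlu and BERT calls.
from multiprocessing import Pool

def run_snips(text):
    # Stand-in for snips-nlu intent extraction on the transcript.
    return {"todo": ["follow up with customer"]}

def run_bert(text):
    # Stand-in for BERT sentiment analysis on the transcript.
    return {"sentiment": "optimism"}

def analyse(text):
    # The models do not depend on each other, so each runs in its own
    # worker process; the results are merged into one JSON-able dict.
    with Pool(processes=2) as pool:
        snips_async = pool.apply_async(run_snips, (text,))
        bert_async = pool.apply_async(run_bert, (text,))
        return {**snips_async.get(), **bert_async.get()}
```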

Frontend

Our frontend is mainly built from React components, including a file-input component that allows users to upload an MP3 file. The frontend sends this file as a request to our backend Flask server, using Axios for the API integration.

The response from our backend is a JSON payload containing the results: a to-do list, the sentiment of the customer and salesperson, and the Automatic Speech Recognition (ASR) transcript of the audio. The frontend processes and displays these results in a user-friendly way, inserting a dividing line before each section for clarity and ease of understanding.

Main Features

Information Extractor

Our information extractor leverages Natural Language Understanding (NLU), a subset of Natural Language Processing (NLP). NLU enables the system to decompose the key aspects of the transcripts and thereby accurately determine user intent. We adopted NLU on the following grounds:

  1. Entity Recognition: NLU employs named entity recognition (NER) techniques to classify important information into predefined categories such as names, locations, and times in the given context. This is achieved using machine learning models trained to identify and categorize entities based on their context in the text. For instance, in the sentence "Schedule a meeting with John at 3 PM," NLU would recognize "John" as a person entity and "3 PM" as a time entity. This capability is crucial for extracting the key information needed for future planning and decision-making.

  2. Intent Recognition: Intent recognition involves interpreting the user's intent behind a statement by analyzing linguistic nuances and patterns. This is typically done using classification algorithms trained on labeled datasets. The model learns to associate certain phrases and sentence structures with specific intents. For example, phrases like "I need to" or "Can you" might indicate a request from the customer that the salesperson will need to follow up on.

To train the model, we prepared a labeled dataset of intents and entities relevant to customer-service call logs, such as scheduling a follow-up session or recording product data. From this annotated training data, the snips-nlu engine learns the patterns in the text that indicate specific intents and entities.

After training, the model makes inferences on user input and extracts the intent accordingly. The output is a To-Do List: the tasks the sales assistant needs to accomplish based on the call.
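The inference step can be sketched as below. The intent name `scheduleFollowUp` and the slot names are illustrative assumptions from our own dataset; the snips-nlu engine call itself is shown under the `__main__` guard since it requires a trained `SnipsNLUEngine` and the dataset file.

```python
# Sketch of turning a snips-nlu parse result into to-do entries.
# The intent and slot names ("scheduleFollowUp", "contact", "time")
# are assumptions matching our training dataset.
def parse_to_todos(parse_result):
    """Map a snips-nlu parse result (intent + slots) to to-do strings."""
    intent = parse_result["intent"]["intentName"]
    slots = {s["slotName"]: s["rawValue"] for s in parse_result.get("slots", [])}
    if intent == "scheduleFollowUp":
        return ["Schedule follow-up with {} at {}".format(
            slots.get("contact", "customer"), slots.get("time", "a later date"))]
    return []

if __name__ == "__main__":
    # Requires snips-nlu and a dataset.json in the snips dataset format.
    import io
    import json
    from snips_nlu import SnipsNLUEngine

    with io.open("dataset.json") as f:
        engine = SnipsNLUEngine().fit(json.load(f))
    print(parse_to_todos(engine.parse("schedule a meeting with John at 3 PM")))
```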

Sentiment Analysis

Sentiment analysis determines the connotation of digital text, which typically ranges across positive, negative, and neutral. It draws on Natural Language Processing and Computational Linguistics to identify, analyze, and ultimately classify the underlying intents and emotions behind the language.

For the implementation of sentiment analysis, we use the Bidirectional Encoder Representations from Transformers (BERT) model. We adopted BERT for this project for several reasons:

1. Bidirectional Contextual Understanding

BERT reads the text from left to right and right to left, providing a more comprehensive context when making predictions.

2. Masked Language Modeling (MLM)

As BERT is pre-trained on text in an unsupervised manner, it can make predictions when words are missing. In the context of sales calls, noise or interference that degrades audio quality may cause the transcript to drop words. BERT fills in the most likely words given the context, allowing for more accurate sentiment predictions.

3. Handling of Polysemy

As BERT takes the entire sentence and its context into account, it can differentiate words with multiple meanings based on that context. When a customer makes an emotional or ambiguous statement, BERT can factor this in to make a sound judgment.

4. Well-Versed in Different Languages

The particular model we use accepts text in both English and Chinese, two of the most widely spoken languages in the world, allowing it to be used in multiple countries.

We incorporated a pre-trained BERT model from HuggingFace, bert-multilingual-go-emtions, into our backend to provide sentiment analysis. It uses the class transformers.BertTokenizerFast as its tokenizer and the class transformers.BertForSequenceClassification as its model. As the model is pre-trained on GoEmotions, a Google Research dataset, the output comes from the 28 emotion categories of the GoEmotions taxonomy, namely 'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', and 'neutral'.
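The classification step can be sketched as follows. The label ordering below assumes the standard GoEmotions taxonomy order; the actual tokenizer/model loading is shown under the `__main__` guard since it downloads the checkpoint.

```python
# Sketch of mapping the model's 28 output logits to a GoEmotions label.
# Assumes the labels are ordered per the standard GoEmotions taxonomy.
import math

GO_EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

def top_emotion(logits):
    """Softmax over the 28 logits; return the (label, probability) pair."""
    exps = [math.exp(x - max(logits)) for x in logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return GO_EMOTIONS[best], probs[best]

if __name__ == "__main__":
    # Requires transformers + torch; downloads the pre-trained checkpoint.
    from transformers import BertTokenizerFast, BertForSequenceClassification

    name = "bert-multilingual-go-emtions"  # model id as used in our backend
    tokenizer = BertTokenizerFast.from_pretrained(name)
    model = BertForSequenceClassification.from_pretrained(name)
    inputs = tokenizer("Thank you so much for your help!", return_tensors="pt")
    logits = model(**inputs).logits[0].tolist()
    print(top_emotion(logits))
```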

This wide range of possible emotions allows sales assistants to make a more informed judgment about the customer.

What's Next

Given that the current implementation focuses mainly on information extracted from the call, our plans moving forward include providing recommendations to the sales assistant based on the call: a checklist built from the to-do list, plus other steps they can take to improve the situation or continue an upward sales trend. On the sentiment-analysis side, lead-scoring capabilities could also be incorporated to help sales assistants be more decisive when dealing with customers.
