(Please see the attached LaTeX file for the same information as below, with formatted images and algorithmic details.)
We are competing for the Bill.com challenge as a Visiting Team.
Superkey Feature Extraction
Superkey Determinant Features
To begin our project, we looked at the entries in users.csv to see what would uniquely differentiate each entry from all of the others. In the world of databases, a tuple composed of these features is called a superkey.
We decided the features making up the superkey are as follows:
- The name of the vendor, a literal.
- The address of the vendor, a literal.
- The date a transaction took place, a datetime.
- The transaction total, a float.
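As a sketch, this superkey can be represented as a simple tuple type. The field names and example values below are our own illustration, not entries from the dataset:

```python
from collections import namedtuple
from datetime import date

# Candidate superkey for one transaction; field names are our own choice.
Superkey = namedtuple("Superkey", ["vendor_name", "vendor_address",
                                   "transaction_date", "total"])

# Hypothetical example values, for illustration only.
key = Superkey(vendor_name="Sanyu's Stationary Shop",
               vendor_address="123 Example St",
               transaction_date=date(2018, 12, 3),
               total=4.75)
```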
Candidate Keys
However, because it may not be possible for us to correctly identify all of this information, it may not be possible to use our superkey to uniquely identify a record. In that case, the goal will be to extract as many of the four defining superkey features as possible and pivot off of those features for the matching algorithm. We hypothesize that transaction total, date, vendor address, and vendor name, in that order, will be the most uniquely identifying features of a transaction.
Named Entity Recognition
Motivated by a number of typographical errors in users.csv and the need to identify our superkey features from the OCR data, we decided to perform named entity recognition on a subset of the receipt images.
Named entity recognition consists of hand annotation of features of interest, followed by training a bidirectional recurrent neural network that models the relationship between the hand-labeled features and each feature in a bag of unknown OCR'ed features. The resulting model enables us to perform superkey extraction from OCR-like datasets.
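To make the annotation step concrete, here is a minimal sketch of what a hand-labeled training example might look like, using BIO tags. The tag scheme, example tokens, and helper function are our illustration, not the actual annotations:

```python
# Hypothetical hand-annotated example: OCR tokens paired with BIO tags
# for the superkey fields (VENDOR, DATE, TOTAL; labels are our assumption).
tokens = ["Sanyu's", "Stationary", "Shop", "12/03/2018", "TOTAL", "$4.75"]
tags   = ["B-VENDOR", "I-VENDOR", "I-VENDOR", "B-DATE", "O", "B-TOTAL"]

def spans(tokens, tags):
    """Collect (label, text) spans from a BIO-tagged token sequence."""
    out, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                out.append((cur_type, " ".join(cur)))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur:
            cur.append(tok)
        else:
            if cur:
                out.append((cur_type, " ".join(cur)))
            cur, cur_type = [], None
    if cur:
        out.append((cur_type, " ".join(cur)))
    return out
```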
Recognizing Total Transaction
We were very concerned about how the BiRNN model would perform when it came to identifying transaction totals, because each receipt contained many different dollar amounts and because the median of the distribution of prices on each receipt varied widely. For example, receipts from Sanyu's Stationary Shop often had totals under ten dollars, while receipts from some business-facing enterprises in the dataset had much larger line-item prices and totals. We therefore wanted to consider methods in addition to the BiRNN for identifying the transaction total from the OCR'ed data.
Possible Approach 1: Statistical Methods
Because two of our team members are primarily statisticians, the first approach we considered was a statistical one. Would it be possible to build a numerical model to predict the transaction total and then select the closest number from the list of possible total amounts?
The main disadvantage of this method is that if we trained on data aggregated from all the receipts, the wide range of typical transaction values across businesses, combined with each business's prices falling only within its own narrow range of line-item costs, would likely lead a statistical model to select the average of all transaction totals.
The large standard deviation for the distribution underlying all of the total transaction amounts means that guessing the average in every single case will create dangerous amounts of error.
Fitting a unique prior distribution for each type of receipt is also unlikely to succeed, because n for each receipt type is typically not high enough to properly fit priors.
Possible Approach 2: Deterministic Mathematics
Next, we considered whether it was possible to create an algorithm to deterministically find the total amount based on how transactions occur and what types of line items are included in receipts.
It is not as simple as selecting the largest dollar amount on the receipt, because if inexact change was used for the transaction, the cash tendered will be the largest number.
It is also not possible to fully rely on the OCR'ed coordinate data, because typically individual line items are printed first, followed by the total, followed by the change, and finally the cash tendered.
Below is the algorithm we used to attempt to determine the total amount. It begins by assuming the largest monetary amount on the receipt is the total, then checks whether that value can be created by summing any two of the other values in the list of monetary amounts (i.e., the sum of the true total and the change given). If so, the next-highest number is assigned as the total, and further checks are performed based on the common ordering of receipts. Additional alterations were made to account for edge cases and can be found in the attached LaTeX file.
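A minimal sketch of the first pass of this heuristic is below. The function name and tolerance value are ours, and the ordering-based follow-up checks and edge-case handling described above are omitted:

```python
from itertools import combinations

def estimate_total(amounts, tol=0.01):
    """First pass of the total-finding heuristic.

    Assume the largest amount is the total unless it can be formed by
    summing two other amounts (i.e., it is cash tendered = total + change),
    in which case fall back to the next-highest amount. The ordering-based
    checks from the full algorithm are not included in this sketch.
    """
    vals = sorted(set(amounts), reverse=True)
    if len(vals) < 3:
        return vals[0] if vals else None
    largest, rest = vals[0], vals[1:]
    is_tendered = any(abs(a + b - largest) <= tol
                      for a, b in combinations(rest, 2))
    return rest[0] if is_tendered else largest
```

For example, for a receipt listing line items 8.00 and 7.25, a 15.25 total, 20.00 tendered, and 4.75 change, this sketch detects that 20.00 = 15.25 + 4.75 and falls back to 15.25.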
Possible Approach 3: Graphical Decision Making
Finally, we could use the x, y coordinates from the OCR'ed data and the line-item labels produced by the BiRNN to try to find the transaction total based on the distance from the various monetary amounts to the "total" label on each receipt.
The most significant disadvantage of this strategy is that it propagates forward errors made by the BiRNN in labeling the total transaction label.
The distance computed is simple Euclidean distance, with only float-typed amounts eligible for selection.
The errors from BiRNN labeling and the errors from transaction-amount selection are independent, so the propagated error would simply be the product of the two error rates.
The graphic in the attached LaTeX file shows a likely case for the propagation of errors through this graphical approach. It would be difficult for the BiRNN to identify the total label because one of its most identifying features, the word "total", is obscured. While computing distances to all of the monetary amounts would work in this case, the method fails without recourse if BiRNN total-label identification fails.
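The distance-based selection can be sketched as follows. The coordinate format and function name are our assumptions about the OCR output, not the actual pipeline code:

```python
import math

def nearest_amount(total_label_xy, amount_tokens):
    """Pick the monetary amount whose (x, y) position is closest, in
    Euclidean distance, to the BiRNN-predicted 'total' label position.

    amount_tokens: list of (value, (x, y)) pairs; only float-typed
    amounts are assumed to have been kept.
    """
    lx, ly = total_label_xy
    return min(amount_tokens,
               key=lambda t: math.hypot(t[1][0] - lx, t[1][1] - ly))[0]
```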
BiRNN: Bidirectional Recurrent Neural Network
We used the BiRNN trained during named entity recognition, as shown in the images in the attached LaTeX file.
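For intuition, the forward pass of a minimal bidirectional RNN looks like the following. This is an illustrative NumPy sketch, not our trained model or its actual architecture:

```python
import numpy as np

def birnn_forward(x, Wf, Uf, Wb, Ub):
    """Concatenate a forward and a backward tanh RNN over a sequence.

    x: (T, input_dim) sequence of token features.
    Wf/Wb: (hidden_dim, input_dim) input weights; Uf/Ub: (hidden_dim,
    hidden_dim) recurrent weights. Returns (T, 2 * hidden_dim) states,
    so each position sees both left and right context.
    """
    T = x.shape[0]
    h_dim = Wf.shape[0]
    hf = np.zeros((T, h_dim))
    hb = np.zeros((T, h_dim))
    h = np.zeros(h_dim)
    for t in range(T):              # left-to-right pass
        h = np.tanh(Wf @ x[t] + Uf @ h)
        hf[t] = h
    h = np.zeros(h_dim)
    for t in reversed(range(T)):    # right-to-left pass
        h = np.tanh(Wb @ x[t] + Ub @ h)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)
```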
BERT: Bidirectional Encoder Representations from Transformers
Finally, to perform matching, we used BERT to fuzzy-match our superkey or candidate key against the entries in users.csv.
BERT uses an attention mechanism that attempts to understand the contextual relations between words. The model learns the context of each word by looking at the words to its left and right, enabling it to pick up on similarities even when the input is only a small fragment of the full string to be matched.
BERT is a bidirectional encoder: inputs are encoded on the way in, and predictive outputs are decoded only as the very last step.
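The matching step can be sketched generically as embedding plus cosine similarity. In our pipeline the embedding function would be a BERT encoder; the hashed character-trigram stand-in below is purely for illustration and is not BERT:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def best_match(query, candidates, embed):
    """Return the candidate whose embedding is most similar to the query.

    `embed` would be a BERT sentence encoder in our pipeline; any
    text -> vector function works for illustration.
    """
    q = embed(query)
    return max(candidates, key=lambda c: cosine(q, embed(c)))

def char_trigram_embed(text, dim=64):
    """Stand-in embedding: hashed character-trigram counts (NOT BERT)."""
    v = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        v[hash(t[i:i + 3]) % dim] += 1
    return v
```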
Hyperparameter Optimization: Fuzzy Matching Weights
Ideally, the weight of each item in the candidate key or superkey used for fuzzy matching would undergo hyperparameter optimization via k-fold cross-validation. Due to time constraints, we briefly tested a range of values by hand and selected a better-than-average combination from those we tried.
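The by-hand weight search corresponds roughly to the following sketch. The function names and the evaluation hook are ours; `eval_fn` stands in for running the full matching pipeline on a validation split:

```python
from itertools import product

def weighted_score(field_scores, weights):
    """Combine per-field similarity scores (name, address, date, total)
    into a single match score using the fuzzy-matching weights."""
    return sum(w * s for w, s in zip(weights, field_scores))

def hand_grid_search(eval_fn, grid):
    """Try every combination of candidate weights and keep the best.

    eval_fn(weights) -> validation accuracy is assumed to wrap the full
    matching pipeline; grid gives one tuple of candidate values per field.
    """
    return max(product(*grid), key=eval_fn)
```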
The accuracy of the final model was 30.4% on the holdout set.
Built With
- bert
- named-entity-recognition
- python