We were tasked with classifying an anonymized database of annual reports stored as XBRL files. The research papers we consulted on similar tasks showed that many of the common machine learning approaches rely on supervised learning, which requires labeled data. Our database, however, came with no labels. We therefore had to come up with a novel method to automatically flag reports that could indicate fraud.
While brainstorming, we considered the perspectives of both someone committing fraud and someone trying to detect it manually. We thought that someone committing fraud might want to either overstate or understate performance in the annual report, for investor or tax purposes, respectively. This helped us decide which metrics to check for divergence. We also thought that someone trying to detect fraud would probably look for performance measures that appear unusually high or unusually low given the values of the other variables in the annual report.
We then tried to model this manual detection process with a neural network. We trained the network on all of the reports, using a reported performance metric as the ground truth (the training target) and all the other reported variables as the input. Given a new report, we could then calculate the difference between the reported performance and the predicted performance, that is, the performance that would be expected given the values of the other variables.
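The comparison step can be sketched in a few lines of Python. This is only an illustration, not our actual code: it assumes the network's predictions are already available as a list, and the function name `flag_divergent_reports`, the z-score rule, and the threshold of 2.0 are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def flag_divergent_reports(reported, predicted, z_threshold=2.0):
    """Return (index, residual, z_score) for reports whose reported
    performance diverges unusually far from the predicted performance.

    The z-score rule and the default threshold are illustrative assumptions.
    """
    # Residual > 0: reported performance is higher than expected (possible overstatement).
    # Residual < 0: reported performance is lower than expected (possible understatement).
    residuals = [r - p for r, p in zip(reported, predicted)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # all residuals are identical, so nothing stands out

    flagged = []
    for i, res in enumerate(residuals):
        z = (res - mu) / sigma
        if abs(z) > z_threshold:
            flagged.append((i, res, z))
    return flagged


# Toy values, made up for illustration only.
reported = [100.0, 102.0, 98.0, 101.0, 99.0, 180.0, 100.0, 97.0]
predicted = [101.0, 100.0, 99.0, 100.0, 100.0, 100.0, 99.0, 98.0]

flagged = flag_divergent_reports(reported, predicted)
for index, residual, z in flagged:
    direction = "overstated" if residual > 0 else "understated"
    print(f"report {index}: residual {residual:+.1f}, z = {z:.2f} ({direction})")
```

In this toy data only the sixth report (index 5) is flagged, because its reported value is far above what the other variables would suggest. The sign of the residual indicates whether performance looks overstated or understated.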
An added bonus of using a neural network is that it should be able to cope with the large amount of missing data in the annual reports. Since some variables appeared in only one file, and no variable was common to all files, the network had to learn to draw a conclusion from whatever information was at hand.
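One common way to feed reports with different sets of variables into a fixed-size network input is to pair each value with a presence flag. The sketch below shows that idea only; it is an assumption about how such an encoding could look, not a description of what we built, and the function `encode_report` and the tag names are hypothetical.

```python
def encode_report(report, vocabulary):
    """Turn a report (dict of variable name -> value) into a fixed-length vector.

    The first half holds the values (0.0 where a variable is missing) and the
    second half holds a presence mask (1.0 = reported, 0.0 = missing), so a
    model can tell a genuine zero apart from an absent variable.
    """
    values = []
    mask = []
    for name in vocabulary:
        value = report.get(name)
        if value is None:
            values.append(0.0)
            mask.append(0.0)
        else:
            values.append(float(value))
            mask.append(1.0)
    return values + mask


# Hypothetical variable names, for illustration only.
vocabulary = ["Revenue", "TotalAssets", "Equity"]

report_a = {"Revenue": 1200.0, "Equity": 300.0}  # TotalAssets not reported
report_b = {"TotalAssets": 0.0}                  # a genuine zero, others missing

vector_a = encode_report(report_a, vocabulary)
vector_b = encode_report(report_b, vocabulary)
print(vector_a)
print(vector_b)
```

Both reports map to vectors of the same length, which is what a standard feed-forward network needs, and the mask keeps the missing `TotalAssets` in the first report distinct from the reported zero in the second.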
Because our method flags only a minority of the more than 30,000 reports, we think it would be best suited to investors who want to reduce the manpower required to detect fraud.

