Fraudulent actions often have patterns that deviate from the norm and are reflected in certain company key figures or in relations between key figures. Machine learning methods can be used to detect fraudulent patterns in company data. However, before these methods can effectively detect fraud, they must be trained accordingly.
The quality of the training data plays a crucial role. Without suitable, realistic data, the algorithms cannot learn and therefore cannot make reliable statements. Data from companies with proven fraudulent behaviour is particularly important here. At the same time, the focus of the analysis must be taken into account when choosing the data basis.
If large companies are being investigated, the training data should also come from large companies. A company's entire accounting data is generally not needed. A selection of ratios covering leverage, efficiency, profitability and insolvency risk, together with a target variable that classifies each case as fraudulent or non-fraudulent, can be sufficient. The best-known data set of this kind was compiled by Patricia M. Dechow and covers 146,000 US listed companies from 1993 to 2014.
On this basis, algorithms can be trained and then used for the AI-supported classification of companies to be checked into non-fraudulent and fraudulent. Various methods are possible, which pursue different approaches and thus also have different strengths and weaknesses.
Procedures such as artificial neural networks are so-called black-box procedures: their result must be accepted without insight into the inner workings. Other procedures, such as the CART algorithm (Classification and Regression Tree) or the RIPPER algorithm (Repeated Incremental Pruning to Produce Error Reduction), give the user a comprehensible result in the form of a decision tree or a rule set. Ensemble methods combine different algorithms so that the weaknesses of the individual methods are compensated for.
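To make the idea of a comprehensible rule more concrete, the following minimal sketch searches for a single threshold on one ratio by minimising the Gini impurity, which is the split criterion commonly associated with CART. It is not a full tree and not the implementation used for the results in this article. The function names `gini` and `best_threshold`, the leverage values and the labels are all hypothetical and serve only as illustration.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of binary labels (1 = fraudulent, 0 = non-fraudulent)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values: List[float], labels: List[int]) -> Tuple[Optional[float], float]:
    """Find the split 'value <= threshold' with the lowest weighted Gini impurity.

    Candidate thresholds are the midpoints between neighbouring distinct values.
    Returns (None, impurity of the unsplit data) if no split is possible.
    """
    best_t: Optional[float] = None
    best_impurity = gini(labels)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity


# Hypothetical leverage ratios and fraud labels, invented for illustration.
leverage = [0.2, 0.3, 0.35, 0.4, 0.8, 0.9]
is_fraud = [0, 0, 0, 0, 1, 1]

threshold, impurity = best_threshold(leverage, is_fraud)
print(f"Rule: flag as fraudulent if leverage > {threshold:.2f}")
```

A full decision tree repeats this search recursively on each resulting subset and across all ratios, which is why the final result can still be read as a chain of simple threshold rules.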
Unsupervised learning techniques can also be used for fraud detection. They are particularly useful when the fraud in question is difficult to detect. What all procedures have in common is that fraud predictions can only be expressed as probabilities. A classification as fraudulent therefore in no way means that the company has actually committed fraud in its annual financial statements. It merely means that the patterns show anomalies that should be subjected to a more detailed case-by-case examination.
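The step from a probability to a case-by-case examination can be sketched as follows. The example assumes that some trained model has already produced a fraud probability per company. The function name `flag_for_review`, the threshold of 0.5, the company names and the probabilities are hypothetical and not taken from the calculations described here.

```python
from typing import Dict, List, Tuple


def flag_for_review(
    fraud_probability: Dict[str, float], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return the companies at or above the threshold, highest probability first.

    A flagged company is a candidate for closer examination, not a proven case.
    """
    flagged = [
        (name, p) for name, p in fraud_probability.items() if p >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical model output, invented for illustration.
scores = {"Company A": 0.08, "Company B": 0.91, "Company C": 0.55}

review_queue = flag_for_review(scores, threshold=0.5)
for name, p in review_queue:
    print(f"{name}: estimated fraud probability {p:.0%}, examine in detail")
```

Where the threshold is set is a design choice: a lower value sends more companies to manual review, a higher value fewer.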
The strengths and weaknesses of the procedures are reflected in their performance in prototypical calculations using the above data set. The most important criterion here is the accuracy of the procedures. This lies in the range between 71% and 95% for the tested procedures. The analysed algorithms thus all offer a gain in information and can be used for the evaluation of clients in the context of an audit.
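For reference, accuracy is the share of cases that a procedure classifies correctly. The sketch below computes it for a handful of hypothetical labels. The function name `accuracy` and the label lists are assumptions for illustration and are unrelated to the data set used here.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent.
actual = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 0]

print(f"Accuracy: {accuracy(actual, predicted):.1%}")
```

Because a procedure that labels every company as non-fraudulent reaches an accuracy equal to the share of non-fraudulent cases, accuracy is best read together with the other quality indicators.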
For use in practice, only one of the various procedures, or even just a first-best solution, should be used. Taking the quality indicators as a whole, the most promising candidate is the Random Forest algorithm, not least because its results can be interpreted in a user-friendly way.
Method | Accuracy |
Classification and Regression Tree (CART) | 93.6% |
Random Forest | 94.8% |
Artificial neural networks | 94.7% |
Support Vector Machine | 95.2% |
Logistic regression | 90.7% |
Gaussian Naive Bayes | 92.5% |
Repeated Incremental Pruning to Produce Error Reduction (RIPPER) | 90.9% |
k-Nearest-Neighbour | 94.0% |
Balanced Bagging | 90.7% |
Ensemble | 94.6% |
Local outlier factor | 71.2% |
Isolation Forest | 84.7% |
k-Means | 94.0% |
The infrastructure (e.g. the database) necessary for the implementation of AI procedures for fraud detection is already available in oktant. I will explain exactly what this looks like in the next article.