Monitoring Financial Transactions using Supervised Machine Learning

Rushil Choksi
Aug 28, 2022

In the evolving era of modern technologies, security in the banking sector has become a significant concern for authorities, and AML (anti-money laundering) transaction monitoring tools help minimize fraudulent transactions. The transaction monitoring process draws on several features and metrics to reduce the risk of fraud. Given a set of input features and attributes, a classification algorithm can flag a transaction as either legitimate or fraudulent with a probability of certainty; based on that score, additional verification may be required before processing, or the transaction may be declined outright. Such techniques are especially valuable in anti-money laundering, where attackers attempt wire or bank transfers to forged accounts and the victim or the bank is penalized heavily, usually depending on the mode of payment.


AML technologies are prevalent in the banking industry because they allow vendors to authorize legitimate transactions and reduce the number of transactions initiated with malicious intent. They also help institutions earn their users' trust by monitoring various aspects of each transaction, such as the amount, the intended recipient, the mode of payment, and the geo-location from which the transaction originates. A model trained on these metrics can flag suspicious transactions with a high precision rate.


The primary motive behind developing this model is to accurately determine whether a transaction is fraudulent based on the retrieved data. The goal is to use random forest classification, letting multiple decision trees jointly predict whether a transaction is fraudulent.

Random forest classification is well suited here because it combines the votes of multiple decision trees to compute the resulting class variable. In contrast, other classification algorithms either take longer to compute, such as KNN, or do not reach the accuracy of a random forest.

The implementation proceeds as follows. First, add attributes to the data frame that are not readily available. Second, split the data into training and testing sets. Third, scale the dataset and use RandomForestClassifier() to fit the data, with input parameters such as the number of decision trees to compute. Finally, run predictions on the testing set, calculate the accuracy, and derive other vital metrics such as precision and recall. Once that is complete, we can look for areas where the classification could be improved. The dataset used in this implementation is available at the following URL.

Importing necessary modules & loading the unbiased dataset
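A minimal sketch of this step, assuming PaySim-style column names (`oldbalanceOrg`, `newbalanceDest`, `isFraud`) and a small inline frame standing in for the dataset CSV, whose actual file name is not given in the post:

```python
import pandas as pd

# The real dataset would be loaded from disk, e.g.:
# df = pd.read_csv("transactions.csv")   # placeholder file name

# Tiny inline stand-in so the sketch runs on its own; the actual
# dataset contains millions of rows with these kinds of fields.
df = pd.DataFrame({
    "type":           ["TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "amount":         [181.0, 181.0, 4024.36, 10000.0],
    "oldbalanceOrg":  [181.0, 181.0, 4024.36, 10000.0],
    "newbalanceOrig": [0.0, 0.0, 0.0, 0.0],
    "oldbalanceDest": [0.0, 21182.0, 0.0, 500.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 10500.0],
    "isFraud":        [1, 1, 0, 0],
})
print(df.shape)  # (rows, columns)
```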

Once the necessary modules have been imported and the data is loaded successfully, we can proceed to integrate features into the same that could be useful for the classification process.

Adding features to the current dataset to enhance model training
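The exact attributes added in the post are not shown; one common choice on this kind of transaction data, sketched below under that assumption, is a pair of balance-discrepancy features that measure how far the balances deviate from a clean debit and credit:

```python
import pandas as pd

df = pd.DataFrame({
    "amount":         [181.0, 4024.36],
    "oldbalanceOrg":  [181.0, 4024.36],
    "newbalanceOrig": [0.0, 0.0],
    "oldbalanceDest": [0.0, 0.0],
    "newbalanceDest": [0.0, 0.0],
})

# How far the originator's new balance deviates from old balance - amount.
df["errorBalanceOrig"] = df["newbalanceOrig"] + df["amount"] - df["oldbalanceOrg"]
# The analogous check on the receiver's side.
df["errorBalanceDest"] = df["oldbalanceDest"] + df["amount"] - df["newbalanceDest"]

print(df[["errorBalanceOrig", "errorBalanceDest"]])
```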

Next, we can analyze the data and retrieve basic descriptive statistics, which lets us inspect the distribution of the target variable.

Determining statistics for input dataset to verify output feature remains unbiased
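The balance check can be sketched with `value_counts()`, assuming the label column is named `isFraud` (a toy label column is used here so the snippet runs stand-alone):

```python
import pandas as pd

df = pd.DataFrame({"isFraud": [0, 0, 1, 0, 1, 0]})  # toy label column

# normalize=True returns class proportions rather than raw counts,
# exposing any heavy bias toward legitimate (0) transactions.
counts = df["isFraud"].value_counts(normalize=True)
print(counts)
```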

Once the analysis is complete and we are confident the data is not biased toward legitimate transactions, we can proceed to feature selection and data scaling. This verification is required to train the model effectively: if, say, 99% of transactions were legitimate, a trained model would likely report an accuracy of 0.98 to 0.99 simply by predicting the majority class, and a fair evaluation would not be possible.

Performing feature selection for the output RF and feature scaling using the standardized scaler
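The split-and-scale step can be sketched as below; the toy feature matrix and labels stand in for the selected dataframe columns and the fraud flag. Note that the scaler is fit on the training data only and its statistics are reused on the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the selected feature columns (X) and fraud labels (y).
X = np.array([[100.0, 0.0], [5000.0, 1.0], [250.0, 0.0], [9000.0, 1.0]])
y = np.array([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics
```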

Now that feature selection and data scaling have been performed using StandardScaler(), we can use RandomForestClassifier() to fit our scaled dataset and then run predictions on the testing dataset.

Using RFClassifier on the training data and computing the accuracy of testing data
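A sketch of the fit-predict-evaluate step, using synthetic data in place of the scaled train/test split; `n_estimators=8` mirrors the "only eight decision trees" reported below, while the other parameter choices are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled train/test split.
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=8, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"accuracy:  {acc:.3f}")
print(f"precision: {precision_score(y_test, y_pred):.3f}")
print(f"recall:    {recall_score(y_test, y_pred):.3f}")
```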


Based on the classification results, the originator’s old balance is the most important feature for determining whether a transaction should be categorized as fraudulent, followed closely by the new balance of the receiver’s account.

Based on the analysis of the model we have developed, an accuracy of 92.95% is achieved using the Random Forest classification algorithm with only eight decision trees. This level of accuracy is an excellent indicator of how well the model can determine whether a transaction with the given input features should be classified as fraudulent. Moreover, the F1 score, the harmonic mean of precision and recall, is also fairly high, averaging 0.93.
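As a quick check of the harmonic-mean relationship, with precision and recall both around 0.93 (illustrative values matching the ballpark reported here), the F1 score works out to the same figure:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.93, 0.93  # illustrative values
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.93
```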

Visualizing feature importance for the output dataset based on the classification algorithm
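The importance ranking behind such a chart can be read from the classifier's `feature_importances_` attribute; the feature names and synthetic data below are assumptions, with the first column constructed to dominate the label the way the originator's old balance does in the article:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Only the first column drives the label, standing in for the
# originator's old balance being the dominant feature.
y = (X[:, 0] > 0).astype(int)

feature_names = ["oldbalanceOrg", "newbalanceDest", "amount"]  # assumed names
clf = RandomForestClassifier(n_estimators=8, random_state=42).fit(X, y)

# feature_importances_ sums to 1.0; sorting gives the ranking that a
# bar chart (e.g. via matplotlib) would display.
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:15s} {imp:.3f}")
```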


Based on the analysis and findings above, we can show that a classification algorithm that computes the class variable not from a single decision path but from many decision trees combined yields a relatively higher accuracy level than comparable algorithms.

Deploying such an algorithm, which aggregates several trees and features, could reduce the number of fraudulent transactions and prevent them before they occur.



Rushil Choksi

Researcher @ ISI • Security Architect • DevSecOps • Cyber Security Enthusiast