Naive Bayes Classifier — Predicting Accounting Policy Compliance

Unlocking the Power of Machine Learning for More Efficient Financial Due Diligence

Arsalan Pardesi
7 min read · May 1, 2023
Photo by Alex Knight on Unsplash

As the world is being revolutionized by Artificial Intelligence, a good deal of automation has made its way into the M&A due diligence process. If you use one of the top legal firms, there is a high likelihood that an AI tool is taking a first pass through the contracts. Financial due diligence now regularly incorporates quick analytics on large top-line datasets, something that was uncommon 5–6 years ago. Having said that, there is still a lot to be desired when it comes to incorporating more efficient automation into the M&A due diligence process.

There are a number of areas where further automation can be explored, including accounting standards compliance, EBITDA normalization adjustments, and identification of NWC and debt-like items. In this article, we will explore the possibility of predicting whether a company's revenue recognition policy is in line with US GAAP or not. If you want to read more on how to automate EBITDA normalization adjustments, you can read my articles on the topic (EBITDA normalization — Part 1 and EBITDA normalization 2 — Part 2).

While there are more sophisticated ways to implement this automation, including Long Short-Term Memory networks (LSTMs), a type of Recurrent Neural Network (RNN) capable of learning long-term dependencies, or training a closed version of GPT (or other LLMs) once those become available for use-case-specific training, we will implement a Naïve Bayes classifier for our example.

What are the Naïve Bayes classifier and the Bayes Theorem?

Naïve Bayes may sound complicated, but it is probably one of the simplest algorithms to interpret once you understand the math behind it. We will implement a Naïve Bayes classifier in Python, but you could actually do the math by hand if you are up for the challenge. Before we dive into the specific implementation of Naïve Bayes, let us refresh our memory on the Bayes Theorem. You have probably studied it in a statistics class, even if you do not remember it.

Bayes Theorem

During the 18th century, Rev. Thomas Bayes grappled with the question of "how well do we know what we think we know?" His work on the theory was published posthumously. Later, Pierre-Simon Laplace, a French mathematician, developed a more complete formula for the theory, which is what we now know as the Bayes Theorem.

The Bayes Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. As such, it provides a way to update probabilities as new data is acquired or new evidence becomes available. The Bayesian view of probability differs from the frequentist view (the traditional view of probability that you likely studied in an intro statistics course): Bayesians think of probability as a measure of belief, making probability subjective and forward-looking, whereas frequentists take probability to describe the long-run frequency of past events.

To see the difference through the classic coin-flipping example: a frequentist would argue that if you flip a coin an infinite number of times, it will come up heads 50% of the time. For a Bayesian, the statement that there is a 50% chance of the coin coming up heads simply means you have no prior belief favoring one outcome over the other.

Simply stated, you can understand the Bayes Theorem as a way to calculate a conditional probability given some prior knowledge of conditions that might be related to the event. The basic formula is below:

P(A|B) = P(B|A) * P(A) / P(B)

You can simply understand the above as the posterior probability of A given some observation B.
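To make this concrete with made-up numbers: suppose 30% of the revenue recognition policies we review are non-compliant, and a phrase such as "bill-and-hold" appears in 40% of non-compliant policies but only 10% of compliant ones. Then:

P(non-compliant | phrase) = P(phrase | non-compliant) * P(non-compliant) / P(phrase)
= (0.4 * 0.3) / (0.4 * 0.3 + 0.1 * 0.7)
= 0.12 / 0.19 ≈ 0.63

Observing the phrase updates our belief that the policy is non-compliant from 30% to roughly 63%.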

Naïve Bayes

Naïve Bayes is an application of Bayes Theorem that assumes that all of the features (in our model) that are used to make a classification decision are independent of each other. This assumption simplifies the calculation of the likelihoods, as these can now be calculated separately for each feature.
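Formally, for a document with words w1, w2, …, wn and a class c, the independence assumption lets us factor the likelihood as:

P(w1, w2, …, wn | c) = P(w1 | c) * P(w2 | c) * … * P(wn | c)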

For our model, we will use a multinomial Naïve Bayes classifier, which assumes that the word counts within each class follow a multinomial distribution. This makes it very useful for text classification problems, as it works directly with the frequency of words.

Enough of the theory; let's get into the model.

So how do we use a Naïve Bayes classifier to analyze accounting standards compliance?

For the purpose of this exercise, we will be using a very simple training dataset and a simple testing dataset. Our goal is to train our model to learn which revenue recognition policies are in compliance with ASC 606 and which are not. We will then get the model to predict whether a target company’s revenue recognition policy is in line with ASC 606 or not.

We will be coding our model in Python.

Training data

Snapshot of training data

Test data

Snapshot of testing data
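The snapshots above are images in the original post. Both files follow the same two-column layout; the column names below come from the code, while the example rows are invented for illustration:

text,is_asc_606_compliant
"Revenue is recognized when control of the promised goods transfers to the customer",1
"Revenue is recognized upon shipment regardless of customer acceptance terms",0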

Step 1: Import the required libraries

import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import nltk

Step 2: Import Stopwords and Lemmatizers

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

We will use these later to preprocess the revenue recognition policies. Stop words are commonly used words (such as "the", "at", "and", etc.) that add no predictive value to our model, so we will remove them.

Lemmatization is used to find the root form of a word. This means the model does not have to learn every inflected variant separately, since most words can be understood through their root form.
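As a minimal illustration of what these two tools do (the example words are mine, and the NLTK downloads in the next step need to have run first):

print('the' in stopwords.words('english'))           # True -- a stop word
print(WordNetLemmatizer().lemmatize('obligations'))  # 'obligation'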

Step 3: Load our training dataset

# load the training data
df = pd.read_csv('./training_data.csv')

Step 4: Define a function to preprocess text data

# Preprocess text data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

# Define stop words to be removed
stop_words = set(stopwords.words('english'))

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Define function to preprocess text data
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Tokenize text into words
    words = nltk.word_tokenize(text)

    # Remove stop words
    words = [word for word in words if word not in stop_words]

    # Lemmatize words
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join words back into text
    text = ' '.join(words)

    return text

In this code block:

  1. We first download the required packages
  2. We then define the stop words to be used (English, in our example) and initialize the WordNetLemmatizer
  3. We then define our function, which does the following (see the quick check after this list):
  • Converts text to lowercase (important, because matching is otherwise case-sensitive),
  • Removes any non-alphanumeric characters,
  • Tokenizes the text into words,
  • Removes stop words, and
  • Lemmatizes the words.
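As a quick sanity check on the function (the sample sentence below is made up):

print(preprocess_text("The Company recognizes revenue when control of the goods transfers."))
# roughly: 'company recognizes revenue control good transfer'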

Step 5: Apply the function to our training dataset

# Apply preprocessing to training data
df['text'] = df['text'].apply(preprocess_text)

Step 6: Convert the is_asc_606_compliant column to Boolean

# Convert the is_asc_606_compliant column to boolean
df['is_asc_606_compliant'] = df['is_asc_606_compliant'].astype(bool)

Step 7: Extract features from the text using CountVectorizer

# extract features from the text using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])

You can simply think of this step as converting the text into a vector of word counts so that we can apply mathematical operations to it.
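A minimal sketch of what the vectorizer produces, using two made-up policy fragments:

toy = CountVectorizer()
counts = toy.fit_transform(["revenue recognized shipment", "revenue recognized contract"])
print(toy.get_feature_names_out())  # ['contract' 'recognized' 'revenue' 'shipment']
print(counts.toarray())             # [[0 1 1 1]
                                    #  [1 1 1 0]]

Each row is one document and each column counts the occurrences of one vocabulary word.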

Step 8: Train our Naïve Bayes classifier

# define the labels and train the Naive Bayes classifier
y = df['is_asc_606_compliant']
clf = MultinomialNB()
clf.fit(X, y)

Step 9: Load our test data and apply the preprocessing to it

# load the test data
test_data = pd.read_csv('./testing_data.csv')

# Apply preprocessing to testing data
test_data['text'] = test_data['text'].apply(preprocess_text)

# Convert is_asc_compliant column to boolean
test_data['is_asc_606_compliant'] = test_data['is_asc_606_compliant'].astype(bool)

Step 10: Extract features from the test data using the same CountVectorizer

# extract features from the test data using the same CountVectorizer
test_X = vectorizer.transform(test_data['text'])
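Note that we call transform here rather than fit_transform: the test data is encoded against the vocabulary learned from the training data, and any word the vectorizer has never seen is simply ignored.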

Step 11: Predict the labels for the test data using our trained model

# predict the labels for the test data
predicted = clf.predict(test_X)
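If you want the model's confidence rather than just hard labels, MultinomialNB also exposes class probabilities:

# probability of each class for every test policy (columns follow clf.classes_)
probabilities = clf.predict_proba(test_X)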

Step 12: Check the predictions

# adding a predicted column to the test data and changing it to boolean
test_data['Predicted'] = predicted
test_data['Predicted'] = test_data['Predicted'].astype(bool)

# checking the results
test_data
Snapshot of prediction from model

As we can see from the example above, our model correctly predicted whether the accounting policy is in compliance with ASC 606 in all cases.
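Rather than eyeballing the table, you can also score the predictions against the labels, for example with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

print(accuracy_score(test_data['is_asc_606_compliant'], test_data['Predicted']))
# 1.0 on this test set, per the results above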

Conclusion

While parts of the M&A due diligence process are being automated, much of it remains manual. This article illustrates a simple model for automating one piece of it: predicting compliance with the revenue recognition standard. I am sure that, with a larger training dataset, there are a number of use cases where automation can be introduced into the M&A due diligence process.

I would love to hear from you about what you think is the low-hanging fruit that can be automated. Please feel free to add your comments and thoughts. Alternatively, if you are in the market to disrupt the M&A due diligence process, I would love to connect and explore collaboration opportunities.
