The statistics are astounding. Data is everywhere and growing: 90% of all data was generated in the last 2 years! Much of it is text. Every minute we generate 456,000 Tweets, 510,000 Facebook comments, 16 million text messages, and 156 million emails. In addition, your own organization is probably collecting a significant amount of text data daily from customers and employees through sources like surveys, employee comments, chat logs, and transcribed phone calls. How can we mine all that potentially information-rich data for actionable insights to improve our customer experience and process efficiency? Manually reading comments is not going to cut it. We want to take advantage of machine learning and modern data science techniques.
The challenge is that Natural Language Processing (NLP) can be a confusing topic for newcomers, given the vast range of approaches and techniques for answering a variety of questions about text data. Plus, this is an area of active research, so things are changing all the time. The good news is we don’t need to understand deep neural networks and word embeddings to get started. In this post I’d like to share a great tool for classifying text documents into categories using a simple NLP technique called ‘bag of words’. We’ll use this technique to construct a classifier that you probably encounter every day: one that can pick out spam messages from a set of messages.
How do we approach the problem of teaching a computer to understand the topic of a given text? We could start by explicitly programming rules to parse sentences into parts of speech, identifying nouns, verbs and so on, so the computer can ‘understand’ what is being written. But as anyone who’s tried to learn a new language knows, there are always exceptions to the rules, and then exceptions to the exceptions. Plus, we need to take into account common spelling and grammatical mistakes, and the fact that even correctly written sentences are often ambiguous:
One morning I shot an elephant in my pajamas. How he got in my pajamas, I don’t know. – Groucho Marx
It’s clear this approach, if it’s even possible, is going to be prohibitively complicated. Instead, we can approach the problem from a different angle.
Bag of Words
By looking at each of the individual words in a given text, we can use the presence of specific words and their frequency to provide clues or evidence that a text is about a particular topic. This is called the ‘bag of words’ approach, because it doesn’t take into account word order or grammar, just the words (or sometimes phrases) themselves. It’s relatively simple, but also surprisingly effective.
For example, suppose we have three short text documents:
- The customer service was the worst.
- I was happy with the billing.
- I found the customer service excellent.
Using ‘bag of words’, sentence one is represented as “the” (2 times), “customer” (1), “service” (1), “was” (1), “worst” (1). We can do the same for all the documents and collect the counts in a matrix with one row per document and one column per word.
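The counts can be reproduced with a few lines of base R. This is only a sketch of the idea: the helper `tokenize` and the variable names are made up for this example, and packages such as quanteda do the same job far more efficiently.

```r
# The three example documents from the text above
docs <- c("The customer service was the worst.",
          "I was happy with the billing.",
          "I found the customer service excellent.")

# Hypothetical helper: lower-case, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  strsplit(x, "\\s+")[[1]]
}
tokens <- lapply(docs, tokenize)

# Vocabulary: every distinct word that appears in any document
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word, each cell a count
count_matrix <- t(vapply(tokens,
                         function(tok) as.integer(table(factor(tok, levels = vocab))),
                         integer(length(vocab))))
dimnames(count_matrix) <- list(paste0("doc", seq_along(docs)), vocab)
print(count_matrix)
```

Each row is now a numeric ‘fingerprint’ of one sentence; word order is gone, but the counts remain.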
Tokenization and Term Frequency (TF)
We call this ‘tokenization’, splitting our documents into parts like words, phrases or even letters. In this case, we’ve split into words and done a couple of other small things like converting all to lower case (so we don’t end up with ‘the’ and ‘The’ as separate words) and removing all punctuation. We’ve effectively turned our text documents into a numerical representation, which means we could now input this into a machine learning algorithm which uses these word frequencies as features. We can already see just by scanning the matrix, that sentence 1 & 3 are much closer to each other than say 1 and 2. Sentence 1 & 3 are more likely to come from the same category.
So you might be saying, ‘hold on a sec’, what about all the information we threw out by ignoring the word order, phrases and context? Don’t we need those? Nope. We are going to sacrifice all that complexity for ease of implementation. That’s ok because we are going to let the algorithm use statistical methods to tease out the topic information from the words themselves without ‘understanding’ what the writer is actually saying. That doesn’t mean we are done here, because there are a number of things we can do to improve our classifier by further quantifying how important each word is to the document and topic. If we consider that each non-zero entry in our matrix represents the ‘strength’ of a feature, i.e. how much information it provides about the topic, then we can evolve our method to ‘weight’ each of those occurrences with more granularity and so improve the quality of our features.
We have a number of ‘low value’ words in our documents like ‘I’, ‘the’, ‘was’, ‘with’. These words really don’t tell us anything about the topic of the document, so they add little value. They are often called ‘stop words’, and most text classification packages in R and Python have functions to remove them automatically based on a preset dictionary of these terms. You can also manually create a list of words that you think could be safely removed without impacting the quality of the classification model. Once we do that, we’re left with a smaller matrix that carries almost the same information.
Another thing we should consider, especially if we have documents of varying lengths, is to use each word’s proportion of the total (count / number of words in the document) instead of absolute counts. This normalizes our values so that long documents don’t dominate simply because they contain more words, which can help algorithm performance.
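Both steps are easy to try on the three example sentences. The sketch below uses base R only; the stop word list is a hand-made assumption containing just the four words named above, and the proportions are computed over the words that remain after stop word removal.

```r
docs <- c("The customer service was the worst.",
          "I was happy with the billing.",
          "I found the customer service excellent.")

# Lower-case, strip punctuation, split each document into words
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "\\s+")

# Hand-made stop word list (an assumption for this example)
stop_words <- c("i", "the", "was", "with")
tokens <- lapply(tokens, function(tok) tok[!tok %in% stop_words])

# Rebuild the count matrix on the reduced vocabulary
vocab <- sort(unique(unlist(tokens)))
counts <- t(vapply(tokens,
                   function(tok) as.integer(table(factor(tok, levels = vocab))),
                   integer(length(vocab))))
dimnames(counts) <- list(paste0("doc", seq_along(docs)), vocab)

# Term frequency as a proportion: divide each row by its document length
tf_prop <- counts / rowSums(counts)
print(round(tf_prop, 2))
```

The matrix shrinks from eleven columns to seven, and every row now sums to 1 regardless of how long the original sentence was.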
Stemming involves reducing an inflected (or derived) word to its base form, and is sometimes useful as part of pre-processing text data for classification. For example, ‘billing’, ‘billed’, ‘bills’, ‘billable’ could all be reduced to ‘bill’ as part of word stemming. The advantage is that it reduces the number of unique words and captures the similar meaning between all these words. Lemmatization is a more complex way of doing this that requires detailed dictionaries for each word, and for that reason we will ignore this method for now despite it having the cooler name.
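To make the idea concrete, here is a deliberately crude suffix-stripping sketch in base R. It is not the stemming algorithm that real packages use; `toy_stem` and its suffix list are invented purely for this illustration.

```r
# Toy stemmer: strip one of a few common suffixes from the end of each word.
# Hypothetical helper for illustration only; real stemmers apply many more rules.
toy_stem <- function(words) {
  sub("(able|ing|ed|s)$", "", words)
}

stems <- toy_stem(c("billing", "billed", "bills", "billable"))
print(stems)

# The same rule mangles words it should leave alone, e.g. "was" loses its "s"
print(toy_stem("was"))
```

All four ‘bill’ variants collapse into a single feature, but the second call shows why production stemmers need more careful rules than blind suffix removal.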
Inverse Document Frequency (IDF)
What about words that appear in every document versus words that appear in only one or two documents? Which carry more information for determining which category a document belongs to? If a word is common and appears in every document, it is not very useful and should have a low weight. Conversely, rare words that show up in only a few documents are more likely to be of value and should have a higher weight. To incorporate this into our weighting matrix, we can use something called the inverse document frequency, which weights uncommon words more highly than common words.
If we multiply the term frequency weighting (TF) by the inverse document frequency (IDF) we get a weighting metric called TF-IDF. It is widely used in classification and in document retrieval tasks like search engines. Again, most text classification packages in Python and R have a function to calculate these values for you.
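One common way to define the inverse document frequency of a term t is idf(t) = log10(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The sketch below applies that definition to the three example sentences in base R. The base-10 logarithm and the absence of smoothing are assumptions made for this example; packages differ on both.

```r
docs <- c("The customer service was the worst.",
          "I was happy with the billing.",
          "I found the customer service excellent.")

# Raw term counts, one row per document
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "\\s+")
vocab <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens,
               function(tok) as.integer(table(factor(tok, levels = vocab))),
               integer(length(vocab))))
dimnames(tf) <- list(paste0("doc", seq_along(docs)), vocab)

# Document frequency: in how many documents does each word appear?
n_docs <- nrow(tf)
doc_freq <- colSums(tf > 0)

# Inverse document frequency, then TF-IDF (each column scaled by its idf)
idf <- log10(n_docs / doc_freq)
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

Because ‘the’ appears in all three documents its idf is log10(3/3) = 0, so it drops out entirely, while a word found in only one document, such as ‘worst’, gets the largest weight.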
Pulling It All Together in R
In order to build our classifier we will want to follow a few basic steps based on the principles above. First we load our labelled data, which will have all the actual document data and the manual category label applied to each document. This will likely be the most time-consuming step, as it often requires you to manually label each document to be used to train and test the classifier. Next we will process the data as we outlined above and create our matrix (often called a document-term matrix or, in quanteda, a document-feature matrix). Then we will train the model on a portion of our data set using a machine learning algorithm. Then we use the remaining labelled data to test our model to see how accurately it classifies new data that it wasn’t trained on. Finally, we output various metrics to assess model performance and make further changes to improve performance if necessary.
So, let’s make it real and apply all these steps to our spam vs. ham classification example using R. The code is below and also in my GitHub repository. First we load the required packages and set parameters:
library(quanteda) # text classification package
library(quanteda.textmodels) # provides textmodel_nb() in quanteda 2.0 and later
library(tidyverse) # data manipulation
# set parameters
set.seed(1912)
train_prop <- 0.7 # % of data to use for training
Next, we’ll load the text messages that we want to classify. This data has 2 columns: one with the actual text data and the other with the spam or ham label. There are 5574 total documents. We also want to randomize the data before splitting it into train and test sets, and remove any blank rows.
# read data from csv
df <- read.table("SMSSpamCollection.txt", header=FALSE, sep="\t", quote="", stringsAsFactors=FALSE) # Ham/Spam test data
# prepare data
names(df) <- c("Label", "Text") # add column labels
df <- df[sample(nrow(df)),] # randomize data
df <- df %>% filter(Text != '') %>% filter(Label != '') # filter blank data
Many text mining packages in R and Python like our data to be in a data structure called a corpus, which is simply a collection of documents. From the corpus we start by creating our document-feature matrix (dfm). This data ends up having 5574 rows (one for each document), and 9127 columns (one for each word that appears in the corpus). These types of matrices usually have a lot of zeros in them, and are called ‘sparse’ matrices. We will want to do our best to reduce the number of columns or features to a more manageable number. This will be more efficient from a memory perspective and will also help our algorithm, as a large number of features can impact both performance and accuracy. Let’s start by applying word stemming, which reduces the number of columns to 7746 by combining similar words. We can then apply another function, dfm_trim, which allows us to remove very infrequent terms: here, terms that occur fewer than 5 times in total or in fewer than 3 documents. This results in 1763 columns or features for our algorithm. Finally, let’s use the dfm_tfidf function to apply the tf-idf weighting scheme, setting the calculation method for both the tf (count) and idf (inverse) terms. The quanteda package allows us to do all this data preparation with only a few short lines of code!
# create document corpus
df_corpus <- corpus(df$Text) # convert Text to corpus
docvars(df_corpus, "Label") <- df$Label # add classification label as docvar
# build document term matrix from corpus
# (tokenize first: recent quanteda releases no longer accept a corpus in dfm())
df_dfm <- dfm(tokens(df_corpus), tolower = TRUE)
# stem words
df_dfm <- dfm_wordstem(df_dfm)
# remove low frequency occurrence words
df_dfm <- dfm_trim(df_dfm, min_termfreq = 5, min_docfreq = 3)
# tf-idf weighting
df_dfm <- dfm_tfidf(df_dfm, scheme_tf = "count", scheme_df = "inverse")
Next, we split the data into a training and a testing set:
# split data train/test
size <- nrow(df) # number of documents (dim(df) would return both rows and columns)
train_end <- round(train_prop*size)
test_start <- train_end + 1
test_end <- size
df_train <- df[1:train_end,]
df_test <- df[test_start:test_end,]
df_dfm_train <- df_dfm[1:train_end,]
df_dfm_test <- df_dfm[test_start:test_end,]
Most of our work is now done; building and testing the model is straightforward, with only 2 lines of code. Here we use the textmodel_nb function from the quanteda family of packages (in recent releases it lives in the companion package quanteda.textmodels), which applies the Naive Bayes algorithm, to predict ham/spam. Of course, you could use a number of different algorithms here, often with similar results, and sometimes specific algorithms work better with specific datasets.
# build model with training set
df_classifier <- textmodel_nb(df_dfm_train, df_train$Label)
# test model with testing set
df_predictions <- predict(df_classifier, newdata = df_dfm_test)
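If you are curious what a Naive Bayes text classifier does with word counts, the following base R sketch trains a tiny multinomial version on four invented messages. It is a simplified illustration under my own assumptions (add-one smoothing, unseen words ignored, a made-up helper `predict_nb`), not a description of how the package implements it.

```r
# Toy training data, invented for this example
train_docs <- c("win cash prize now", "free prize claim now",
                "are we meeting for lunch", "see you at lunch tomorrow")
train_labels <- c("spam", "spam", "ham", "ham")

tokens <- strsplit(train_docs, " ")
vocab <- unique(unlist(tokens))
classes <- sort(unique(train_labels))

# Word counts per class: one row per class, one column per vocabulary word
class_counts <- t(vapply(classes, function(cl) {
  words <- unlist(tokens[train_labels == cl])
  as.numeric(table(factor(words, levels = vocab)))
}, numeric(length(vocab))))
dimnames(class_counts) <- list(classes, vocab)

# Add-one smoothed log-likelihood of each word given the class
log_lik <- log((class_counts + 1) / (rowSums(class_counts) + length(vocab)))

# Log prior: the share of training documents in each class
log_prior <- log(as.numeric(table(factor(train_labels, levels = classes))) /
                   length(train_labels))

# Hypothetical helper: score a new message under each class, pick the best
predict_nb <- function(text) {
  words <- strsplit(text, " ")[[1]]
  words <- words[words %in% vocab]  # ignore words never seen in training
  scores <- log_prior + rowSums(log_lik[, words, drop = FALSE])
  classes[which.max(scores)]
}

print(predict_nb("claim your free prize"))
print(predict_nb("lunch tomorrow"))
```

In the quanteda code above, textmodel_nb plays the role of the training step and predict plays the role of `predict_nb`, applied to every row of the test matrix at once.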
The predict function will return a vector with the predicted category for each document in the test set. We can compare that to the actual category labels to see how well our model is working using something called a confusion matrix.
| | Actual Ham | Actual Spam |
| Predicted Ham | True Negative | False Negative |
| Predicted Spam | False Positive | True Positive |
The most common classifier assessment metric is accuracy, which is simply the ratio of correct predictions over the total. While useful because it is easy to calculate and interpret, it can be misleading. Consider an example where spam represents 99% of the data, with only 1% ham (which feels like my email inbox some days). If we built a model that predicted spam for every document, we would have a 99% accurate model, but not a very useful one, since it wouldn’t identify any ham emails. As a result, there are a few more metrics that are commonly used. ‘Precision’, in simple terms, captures what proportion of predicted Spam is actually Spam, and ‘Recall’ captures what proportion of real Spam was predicted as Spam.
Finally, to create a single measure of model ‘goodness’, we can use something called the F1 score (sometimes F-score or F-measure), which is the harmonic mean of Precision and Recall.
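The 99% spam scenario is easy to simulate. The sketch below uses made-up labels (990 spam, 10 ham) and a ‘model’ that always answers spam; the numbers are illustrative only.

```r
# Made-up labels: 990 spam and 10 ham messages
actual <- factor(c(rep("spam", 990), rep("ham", 10)), levels = c("ham", "spam"))

# A useless model that predicts spam for everything
predicted <- factor(rep("spam", length(actual)), levels = c("ham", "spam"))

# Confusion matrix: rows are predictions, columns are actual labels
conf <- table(predicted, actual)
print(conf)

# Accuracy looks excellent...
accuracy <- sum(diag(conf)) / sum(conf)
# ...but the share of real ham that was recognised as ham is zero
ham_recall <- conf["ham", "ham"] / sum(conf[, "ham"])

cat("Accuracy:", accuracy, "\n")
cat("Share of ham correctly identified:", ham_recall, "\n")
```

Accuracy alone rewards the model for the class imbalance, which is why the class-specific metrics are worth reporting alongside it.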
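Written in terms of the confusion matrix cells, with Spam as the positive class, the standard definitions are as follows (TP, FP and FN abbreviate the True Positive, False Positive and False Negative counts):

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```

Because it is a harmonic mean, F1 is pulled towards the smaller of the two values, so a model cannot score well by excelling at only one of them.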
An F1 score of 1 indicates perfect precision and recall while 0 is the opposite. Let’s run the code below to see how our model performs.
# confusion matrix: rows are predictions, columns are actual labels
conf_matrix <- table(df_predictions, df_test$Label)
accuracy <- (conf_matrix[1,1] + conf_matrix[2,2]) / sum(conf_matrix)
precision <- conf_matrix[2,2] / sum(conf_matrix[2,])
recall <- conf_matrix[2,2] / sum(conf_matrix[,2])
f1_score <- 2 * ((precision * recall) / (precision + recall))
cat("Confusion Matrix:\n")
print(conf_matrix)
cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")

Confusion Matrix:
df_predictions  ham spam
          ham  1419   11
          spam   23  219
Accuracy: 0.9796651
Precision: 0.9049587
Recall: 0.9521739
F1 Score: 0.9279661
Wow, we were able to generate a model that is 98% accurate, with an F1 score of 0.93, with only a few lines of code! I leave it to the reader to experiment with the model to see if further improvement is possible. Generally speaking, improving the feature engineering that goes into the dfm is often the most effective lever (hint: in this case, try not stemming words). More training data also often improves performance. Assuming we’re happy with the performance, we can now use this model to predict categories for new unseen data without any additional work. Cool!
Hopefully this post has convinced you that you can begin building text classifiers today with only a few lines of code and some labelled data. In this example we had more than 5000 data points, which definitely helped accuracy. In my experience, it’s more feasible to label a few hundred documents in a few hours, and that should produce a starting accuracy of about 80%. Give it a try! Input your data into the code above and see the results. Let me know what you discover in the comments below. Note that you can use data that has more than two labelled categories without any changes to the code.
From this initial model there is a lot more you can do to get more sophisticated with your text classification efforts, and you are only limited by your time and willingness to learn. The simplifying assumption of using only single ‘context-free’ words as features made it possible to quickly create our model, but it will likely become a limiting factor as you try to increase accuracy even further.
You may want to experiment with including n-grams in your model, which just means sets of consecutive words. For example, bi-grams will include frequencies of all 2-word phrases in a document. The advantage is that you can capture some distinctions between things like ‘customer service’ and ‘service provider’, where a bag of words model treats the word ‘service’ as the same in both cases. Creating n-grams is done at the tokenization stage and most packages allow you to specify the ‘n’. You can also include interaction terms, which look for the co-occurrence of 2 words that are not right next to each other. The trick will be determining which of these new features are useful to the model while discarding the rest, as n-grams and interactions will increase your matrix size dramatically. This is where dimensionality reduction techniques will come in handy. You may also want to learn about newer methods like word2vec or doc2vec, which get better at capturing word context through a concept called word embeddings. I will review some of these methods in future posts, but I wanted to share the basic framework in a simple manner so you could get started today. Thanks for reading and happy text mining!
You don’t have to see the whole staircase, just take the first step – Martin Luther King Jr.