Elasticsearch Machine Learning and Spam Email Identification

Well known as a powerful search engine, Elasticsearch also helps users collect and transform data in real time. However, its data analytics capabilities, especially machine learning, are not as widely used. This blog will show how to create an ML-based classification model with Elasticsearch, using spam email identification as a practical example.

Background

It can be annoying to find a pile of unwanted emails in your inbox, and filtering out the spam is tiresome because the indicators of “junk” are fuzzy. Traditionally, spam emails are blocked by sender domain or email address, but maintaining a list of suspicious senders is an endless task. Among the variety of solutions, supervised machine learning techniques have proven fast and reliable at detecting spam based on the message content.

Supervised machine learning uses a training dataset to teach an algorithm to assign data to a specific category accurately. For spam detection, we will use an example set of spam and ham emails to create a classification model. The model learns the underlying patterns in the training data and uses them to make predictions on new emails.

The Elastic Stack makes it easy to ingest the dataset into Elasticsearch and to access the machine learning features through the Kibana UI.

Figure 1: An Overview of Elastic Stack

Elasticsearch machine learning supports the end-to-end workflow, from training and evaluation to deployment. The next section demonstrates the process of identifying spam with Elasticsearch step by step.

Creating a Classification Model with Elasticsearch

To identify spam with supervised machine learning, we will first use Elasticsearch to prepare the email dataset and extract features from it. Next, we will use Kibana to create a classification machine learning job on that dataset. Finally, we will evaluate how well the model identifies spam emails and explore how to deploy it at the ingestion level.

Preparing data

The dataset we are going to use was collected from the SpamAssassin site. We can upload the CSV file and view the data directly in the Kibana UI. The dataset contains two fields: “Body” and “Label”. A “Label” of 1 means the corresponding email is spam; a “Label” of 0 means it is not.

Figure 2: Fields in the Dataset

After indexing the data into Elasticsearch under the name “email-sample”, we can explore the data distribution in Kibana. As the bar chart shows, around 4000 records are tagged as “not spam” and around 2000 records as “spam”.

Figure 3: Distribution of Label
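
We can also check the label distribution from the Kibana Dev Tools console with a terms aggregation; a quick sketch against the “email-sample” index:

GET email-sample/_search
{
  "size": 0,
  "aggs": {
    "label_distribution": {
      "terms": { "field": "Label" }
    }
  }
}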

To validate the quality of this public dataset, we next need to deal with missing values. All emails with no “Body” or “Label” will be removed from the training dataset. After this cleanup, we have 5982 records ready for analysis.

Figure 4: Example Record without the “Label” Field
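
One way to drop the incomplete records is a delete-by-query that matches documents missing either field. This is a minimal sketch, assuming the “email-sample” index above:

POST email-sample/_delete_by_query
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "bool": { "must_not": { "exists": { "field": "Body" } } } },
        { "bool": { "must_not": { "exists": { "field": "Label" } } } }
      ]
    }
  }
}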

Feature Engineering

Before diving deeper into model development, we will extract more attributes from the raw data. This process is called feature engineering. By giving the machine learning algorithm more characteristics shared by the records, we can improve the performance of the classification model.

The new features to be generated are:

  • The length of an email message

We can use a script to calculate the length of the message for each record in our training data. 

PUT _scripts/email_length
{
  "script": {
    "lang": "painless",
    "source": """
      // Store the character count of the email body as a new field
      ctx['email_length'] = ctx['Body'].length();
    """
  }
}

To apply the script to the dataset, we need to create an ingest pipeline to transform the data.

PUT _ingest/pipeline/count_email_length
{
  "description": "This is to count the length of a text",
  "processors": [
    {
      "script": {
        "id": "email_length"
      }
    }
  ]
}
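
Before running the pipeline against the whole index, we can sanity-check it with the simulate API:

POST _ingest/pipeline/count_email_length/_simulate
{
  "docs": [
    { "_source": { "Body": "Click here to claim your free gift" } }
  ]
}

The response should show the document with a new "email_length" field added.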

  • The number of spam trigger words in an email

Since spam emails usually try to persuade the recipient to take action, certain words appear frequently in those messages to draw attention. Therefore, we can also predict how likely an email is to be spam based on how many of these trigger words it contains.

To keep things simple, we will use a selected list of keywords as an example. Here is a tag cloud of the selected corpus.

Figure 5: Keywords Tag Cloud

Similarly, we can use a script to calculate the occurrence of keywords in each email message. 

PUT _scripts/num_of_keyword
{
  "script": {
    "lang": "painless",
    "source": """
      def keywordList = ['act', 'apply', 'bonus', 'buy', 'call', 'cheap', 'click', 'earn',
                         'free', 'get', 'gift', 'limited', 'offer', 'order', 'save'];
      // Lowercase the body once, then count how many trigger words it contains
      def text = ctx['Body'].toLowerCase();
      def keywordCount = 0;
      for (def keyword : keywordList) {
        if (text.contains(keyword)) {
          keywordCount += 1;
        }
      }
      ctx['keyword_count'] = keywordCount;
    """
  }
}

We also build a pipeline to apply the script.

PUT _ingest/pipeline/count_keyword_freq
{
  "description": "This is to count the occurrence of keywords in a text",
  "processors": [
    {
      "script": {
        "id": "num_of_keyword"
      }
    }
  ]
}

The last step in pre-processing the data is to add these new attributes to the dataset.

Let's create a top-level pipeline that chains the pipelines built earlier and drops unnecessary columns.

PUT _ingest/pipeline/reindex_pipeline
{
  "description": "The top-level pipeline",
  "processors": [
    {
      "pipeline": {
        "name": "count_keyword_freq"
      }
    },
    {
      "pipeline": {
        "name": "count_email_length"
      }
    },
    {
      "remove": {
        "field": "column1"
      }
    }
  ]
}
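
As before, the simulate API lets us check the whole chain at once; a sketch with a hand-written sample document:

POST _ingest/pipeline/reindex_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "column1": "0",
        "Body": "Limited offer! Click to get your free gift",
        "Label": 1
      }
    }
  ]
}

The simulated output should contain "email_length" and "keyword_count" and no longer contain "column1".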

Finally, we can create a new index for our enriched training dataset using the pipeline.

POST _reindex
{
  "source": {
    "index": "email-sample"
  },
  "dest": {
    "index": "enriched-email-sample",
    "pipeline": "reindex_pipeline"
  }
}

Here is what the new dataset looks like with the extra fields.

Figure 6: Example Data with Feature Engineering
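
To spot-check the reindex, we can confirm the document count and pull back a single enriched record:

GET enriched-email-sample/_count

GET enriched-email-sample/_search
{
  "size": 1,
  "_source": ["email_length", "keyword_count", "Label"]
}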

Training Data in Elasticsearch 

Building a classification model with Elasticsearch is quite straightforward. We can access the Kibana Data Frame Analytics tab to use the machine learning wizard. 

Here we select “Classification” as the job type, “enriched-email-sample” as the source index, and “Label” as the dependent variable we want to predict. All the other fields will be included in the analysis.

Figure 7: Kibana Machine Learning UI
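
The same job can also be created from the Dev Tools console with the data frame analytics API. Below is a minimal sketch; the job name "spam_email_classification" and the destination index "spam-email-prediction" are illustrative choices:

PUT _ml/data_frame/analytics/spam_email_classification
{
  "source": { "index": "enriched-email-sample" },
  "dest": { "index": "spam-email-prediction" },
  "analysis": {
    "classification": {
      "dependent_variable": "Label",
      "training_percent": 80
    }
  }
}

POST _ml/data_frame/analytics/spam_email_classification/_start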

We can also look at the scatterplot matrix to understand the relationships between the fields. 

Figure 8: Scatterplot Matrix of Variables

The job will go through several phases to finish analysing the data and generating results. 

Figure 9: Stats of the Classification Job
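
Outside the UI, the same progress information is available from the stats endpoint (using the illustrative job name from earlier):

GET _ml/data_frame/analytics/spam_email_classification/_stats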

Evaluating the model

After the training process is complete, we will use the confusion matrix to evaluate the classification model. It shows the percentage of correctly predicted values in each category. 

For example, from the table we can see that 78% of label 1 emails were predicted as “spam” by the model, which is the True Positive Rate. Similarly, 79% of label 0 emails were tagged as “safe”, which is the True Negative Rate. We can also see how many label 1 emails were identified as “non-spam” (the False Negative Rate) and how many label 0 emails were assigned to the “spam” category (the False Positive Rate).

In addition, a ROC (Receiver Operating Characteristic) curve is also provided. It plots the True Positive Rate against the False Positive Rate at different classification thresholds. The higher the AUC (Area Under the Curve), the better the model is at separating the two classes.
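
Kibana surfaces these metrics automatically, but they can also be computed with the evaluate API against the results index. A sketch, assuming the illustrative “spam-email-prediction” destination index and the default results field, under which the prediction is written as "ml.Label_prediction":

POST _ml/data_frame/_evaluate
{
  "index": "spam-email-prediction",
  "evaluation": {
    "classification": {
      "actual_field": "Label",
      "predicted_field": "ml.Label_prediction",
      "metrics": {
        "multiclass_confusion_matrix": {}
      }
    }
  }
}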

In reality, this model can be improved in several ways. For example, we could use a larger training dataset or extract more features, such as the number of bigram-based keywords in an email. Nevertheless, tuning is an ongoing process, and there may be a tradeoff between False Positives and False Negatives.

Figure 10: Evaluation of the Classification Model

Deploying the model

Once we are satisfied with the evaluation results, we can deploy the model in Elasticsearch, for example, as a processor in an ingest pipeline, to make predictions against new data. 

PUT _ingest/pipeline/spam_prediction
{
  "description": "Filter spam emails using ML",
  "processors": [
    {
      "inference": {
        "model_id": "spam_email_classification-1618899599824"
      }
    }
  ]
}
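
We can try the pipeline on a new message with the simulate API. Note that the engineered features the model was trained on must be present, so in production the feature-engineering pipelines would run first; in this sketch we supply the values by hand, and the model_id must match your own trained model:

POST _ingest/pipeline/spam_prediction/_simulate
{
  "docs": [
    {
      "_source": {
        "Body": "Limited offer! Click to get your free gift",
        "email_length": 42,
        "keyword_count": 6
      }
    }
  ]
}

The response should include an "ml" object with the predicted label for the document.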

Conclusion

In this blog we have seen what Elasticsearch machine learning is and how simple it can be to build a classification model with the Elastic Stack. We selected spam email detection as an example to experiment with and successfully predicted around 80% of spam emails. Although this model needs further adjustment before it can be used in practice, it provides a starting point for using Elastic machine learning as a complement to the other algorithms adopted by email service providers.

Beyond spam email detection, we can extend the classification method to many more use cases, such as identifying whether a domain is risky or determining whether an application is malicious. In conclusion, the integrated solution provided by the Elastic Stack enables us to accelerate the deployment of machine learning and adapt quickly to changes.

Listen to our webinar for a hands-on demonstration of building the classification model. You can also visit our Skilledfield website to learn more about building an Elastic Stack environment and implementing machine learning with Elasticsearch.

Author: Ziqing(Astrid) Liu