Machine Learning Built-in

Exploratory Data Analysis

Exploratory data analysis is an essential step in data science. It helps the data scientist understand the data at hand and relate it to the business context.

The open-source tool I will use to visualize and analyze my data is Word Cloud, a data visualization tool for representing text data. The size of each word in the image represents the frequency or importance of that word in the training data.

Steps to take in this section:
  • Get the form data
  • Explore and analyze the data
  • Visualize the training data with Word Cloud & Bar Chart
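
The explore-and-visualize steps above can be sketched in a few lines of Python. This is a minimal illustration, not the exact pipeline used here: the sample messages, the "spam"/"ham" labels, and the helper names (tokenize, word_frequencies, text_bar_chart) are all assumptions made for the example, and a plain-text bar chart stands in for a plotted one. The same word counts are the kind of input a word cloud uses to decide how large to draw each word.

```python
from collections import Counter
import re

# Hypothetical sample messages; a real project would load its own data set.
messages = [
    ("spam", "Win a free prize now, claim your free prize"),
    ("ham", "Meeting moved to Monday, see the agenda attached"),
    ("spam", "Free offer: claim cash now"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(rows, label):
    """Count how often each word appears in messages with the given label."""
    counts = Counter()
    for row_label, text in rows:
        if row_label == label:
            counts.update(tokenize(text))
    return counts

def text_bar_chart(counts, top_n=5):
    """Render the top_n most frequent words as a plain-text bar chart."""
    lines = []
    for word, count in counts.most_common(top_n):
        lines.append(f"{word:<10} {'#' * count} ({count})")
    return "\n".join(lines)

spam_counts = word_frequencies(messages, "spam")
print(text_bar_chart(spam_counts))
```

Comparing the most frequent words in the spam and non-spam groups is a quick way to see which terms are likely to become strong features later on.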

There are some open-source data sets, such as the Spambase data set from the University of California, Irvine, and the Enron spam data set. However, these data sets are intended for education and testing and aren’t of much use for building production-level machine learning models.

Companies that host their own email servers can easily create specialized data sets that tune their machine learning models to the specific language of their line of work. For instance, the data set of a company that provides financial services will look very different from that of a construction company.

Training the machine learning model

Once you have processed the data and assigned the weights to the features, your machine learning model is ready to filter spam. When a new email comes in, the text is tokenized and run against the Bayes formula. Each term in the message body is multiplied by its weight, and the sum of the weights determines the probability that the email is spam. (In reality, the calculation is a bit more complicated, but to keep things simple, we’ll stick to the sum of weights.)
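
The sum-of-weights idea can be sketched as follows. This is a simplified illustration under stated assumptions, not a production filter: the weight values are made up, each weight is assumed to be a log-likelihood ratio (how much more likely a word is in spam than in legitimate mail), each distinct known word is counted once per occurrence, and the logistic function is one common way to turn the summed score into a probability. The names WEIGHTS, spam_score, and spam_probability are hypothetical.

```python
import math
import re

# Hypothetical per-word weights, assumed here to be log-likelihood ratios
# log(P(word | spam) / P(word | ham)) learned during training.
WEIGHTS = {"free": 2.0, "prize": 1.5, "claim": 1.0, "meeting": -2.0, "agenda": -1.5}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def spam_score(text, weights, prior_log_odds=0.0):
    """Sum the weights of the words in the message; unknown words add 0."""
    return prior_log_odds + sum(weights.get(tok, 0.0) for tok in tokenize(text))

def spam_probability(text, weights, prior_log_odds=0.0):
    """Map the summed score to a value between 0 and 1 with the logistic function."""
    return 1.0 / (1.0 + math.exp(-spam_score(text, weights, prior_log_odds)))

for email in ("Claim your free prize", "Meeting agenda for Monday"):
    print(email, "->", round(spam_probability(email, WEIGHTS), 3))
```

A message made only of words the model has never seen scores 0 and comes out at a probability of 0.5, which is why the prior_log_odds parameter is there: it lets the overall share of spam in the training data shift that default.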