By Marla Rosner and Keith Moore
Previously, we’ve covered the basics of machine learning, including AI, deep learning, and neural networks. This time, we’re going to take a deep dive into another term that takes a little longer to explain, but is creating an exciting new field within modern machine learning: automated model building.
Automated model building, often also referred to as meta-learning, is AI that designs AI systems. Humans can design AI models, of course, but it’s a lengthy and tricky process. These hand-built models often struggle to scale across large operations and cannot effectively handle edge cases that occur under extreme or unusual operating parameters.
Thus, automated model building has become increasingly important to machine learning, as it creates dynamic, accurate models that take less time to develop and can adapt to changing conditions without needing a human in the loop every step of the way.
There are four major steps in the process of automated model building: cleaning, feature generation, feature selection, and the construction of either a supervised or unsupervised model.
Cleaning
Not all data sets are created equal. To make a data set algorithmically usable, automated model building software needs to convert all data into a usable format, scale the data, and in some cases rebalance sample sizes.
Data also comes in a number of formats, such as categorical, numerical, and date/time. The software must convert all of it into a common format so it can be used together in a single model. Often this means converting categorical data to numeric data, or recognizing a date/time feature and ensuring that the data is treated as a time series.
Finally, data needs to be scaled and rebalanced so that all features share a single scale, without too wide a variation in the range of values, and so that sample sizes are similar. This ensures that all of the data can be meaningfully compared and used together.
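To make these cleaning steps concrete, here is a minimal sketch in Python using pandas and scikit-learn. The column names and values are hypothetical, and this is only one simple way to perform the conversions described above, not the workflow of any particular product.

```python
# A minimal cleaning sketch: convert categorical data to numeric, recognize a
# date/time feature, and scale numeric values. Column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "machine_type": ["pump", "fan", "pump", "fan"],        # categorical
    "temperature": [71.2, 64.5, 88.1, 59.9],               # numerical
    "timestamp": ["2021-01-01", "2021-01-02",
                  "2021-01-03", "2021-01-04"],              # date/time
})

# Convert categorical data to numeric via one-hot encoding
df = pd.get_dummies(df, columns=["machine_type"])

# Recognize the date/time feature so the data can be treated as a time series
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp").sort_index()

# Scale numeric values so all features share a comparable range
df[["temperature"]] = StandardScaler().fit_transform(df[["temperature"]])
```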
Feature Generation
Once data has been cleaned, it must be manipulated to generate more appropriate features for solving a particular problem. A feature is a piece of measurable information used in machine learning. For instance, the age, weight, and yearly income of a set of people would be three possible features of that set.
One challenge in generating features is determining how to window time steps. Automated model building can take care of this by using deep neural networks that automatically learn filters and adjust window sizes for the time frame being used.
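As a rough illustration of windowing, the sketch below slices a one-dimensional signal into fixed-size windows with NumPy. The signal and the window size of 10 are assumptions; in the automated setting described above, the window size would be learned or adjusted rather than hard-coded.

```python
# Slice a time series into overlapping fixed-size windows.
import numpy as np

signal = np.arange(100, dtype=float)   # hypothetical sensor readings
window = 10                            # assumed window size

# Each row is one example covering `window` consecutive time steps
windows = np.stack([signal[i:i + window]
                    for i in range(len(signal) - window + 1)])
print(windows.shape)  # (91, 10)
```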
Feature Selection
Once automated cleaning and feature generation have taken place, the data set is ready to be used to build a model. There are two types of learning models: supervised and unsupervised. The process of feature selection and model building will differ based on whether the resulting model is supervised or unsupervised.
Supervised Learning Models
In supervised learning, models are expected to learn from past data and produce useful predictions about unseen or future data. The models are trained by feeding them past data outfitted with labels specifying the values we would like the model to predict. The model is then tasked with learning from that data and generalizing what it learns to unseen examples.
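As a simple illustration of this train-then-predict loop (not a depiction of any automated tool), here is a sketch using scikit-learn on a synthetic labeled dataset.

```python
# Supervised learning in miniature: fit a model on labeled past data,
# then evaluate how well it generalizes to unseen examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)            # learn from labeled past data
print(model.score(X_test, y_test))     # accuracy on unseen examples
```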
There are many different approaches to automated supervised model building. One such method is the evolutionary algorithm employed by SparkCognition’s automated model building tool, Darwin. Darwin begins by analyzing characteristics of the input dataset and the specified problem, and applying past knowledge to construct an initial population of machine learning models that are highly likely to produce accurate predictions on the given supervised learning problem. In the case of neural networks, these models are then passed to an optimizer, which uses popular techniques like backpropagation to train the models.
Darwin then employs principles from evolutionary biology to further optimize the models. The process begins by generating thousands of neural network models and scoring them on their performance. After creation, the first generation of models is speciated, or clustered, based on shared characteristics. Within each species, the software identifies the elite models that performed best on the problem. These elite models can then be genetically mutated and optimized with deep learning-based backpropagation.
Once this is done, the altered elites are reintroduced to the general population of models. Over a number of generations, the models are refined until they are complex and sophisticated enough to accurately solve the problem at hand. The end result is a machine learning model that is highly optimized to the specific supervised learning problem and the data available to the user, with no manual work required.
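The sketch below illustrates the general shape of such an evolutionary loop: generate a population, score it, keep the elites, mutate them, and reintroduce them. It is not Darwin’s implementation; the population size, the mutation rule, the placeholder scoring function, and the omission of speciation and backpropagation are all simplifications for illustration.

```python
# An illustrative evolutionary search over model "genomes" (hyperparameters).
import random

def random_model():
    # A candidate model is represented here as a simple hyperparameter dict
    return {"layers": random.randint(1, 5), "units": random.choice([16, 32, 64])}

def score(model):
    # Placeholder fitness; in practice this would be validation performance
    # after training the candidate network with backpropagation
    return -abs(model["layers"] - 3) + model["units"] / 64

def mutate(model):
    child = dict(model)
    child["units"] = random.choice([16, 32, 64])
    return child

population = [random_model() for _ in range(50)]               # initial generation
for generation in range(10):
    elites = sorted(population, key=score, reverse=True)[:5]   # best performers
    offspring = [mutate(random.choice(elites)) for _ in range(45)]
    population = elites + offspring                             # reintroduce elites

best = max(population, key=score)
```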
Unsupervised Learning Models
Unsupervised models do not use labeled training data. Instead, they are generated by feeding an algorithm unlabeled data and allowing it to determine independently how the data should best be grouped.
For time series data, SparkCognition performs anomaly detection using a type of multilayer deep neural network known as a convolutional neural network. Another approach is an autoencoder: a neural network that compresses a feature set to the smallest size possible and then decompresses it with as little loss as possible. Think of this as a neural-network-based way to “zip” and “unzip” data like you would on a computer.
By using these processes, the model learns from past data so that it can identify anomalies in new data sets.
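To make the autoencoder idea concrete, here is a minimal sketch in Keras. The layer sizes, synthetic data, and anomaly threshold are assumptions for illustration, not SparkCognition’s production architecture.

```python
# A small autoencoder: compress the features, decompress them, and flag
# points that reconstruct poorly as anomalies.
import numpy as np
from tensorflow import keras

n_features = 20
normal_data = np.random.normal(size=(1000, n_features)).astype("float32")

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(8, activation="relu"),   # "zip": compress the features
    keras.layers.Dense(n_features),             # "unzip": reconstruct them
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal_data, normal_data, epochs=5, verbose=0)

# New points with unusually high reconstruction error are flagged as anomalies
new_data = np.random.normal(size=(10, n_features)).astype("float32")
errors = np.mean((autoencoder.predict(new_data) - new_data) ** 2, axis=1)
anomalies = errors > errors.mean() + 2 * errors.std()     # assumed threshold
```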
For non-time series data, anomaly detection can be performed by creating a random decision tree of features, and then calculating how long a path along the tree is needed to isolate any one data point. Anomalous data points will have much shorter isolation paths, as they share fewer features with the other data points.
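One widely available implementation of this random-tree idea is scikit-learn’s IsolationForest, sketched below on synthetic data as an illustration (the article does not specify which implementation is used in practice).

```python
# Tree-based anomaly detection: points isolated by unusually short paths
# are scored as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))
X = np.vstack([normal, outliers])

detector = IsolationForest(random_state=0).fit(X)
labels = detector.predict(X)   # -1 marks points with short isolation paths
```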
Once anomaly detection is performed, users have the option of clustering their data. This process is the same for both time series and non-time series data.
As with anomaly detection, there are many approaches that can be used to cluster data. One approach SparkCognition has used is to label points in a data set by their similarity to one another, and then feed the data from these labels through a neural network trained to determine whether two data points are similar or dissimilar.
Once the data has gone through this network, a Gaussian mixture model is employed to fit the data to multiple Gaussian distributions, forming clusters. At this point, any data points that do not fall into a cluster can be determined to be anomalies.
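The sketch below shows this last step with scikit-learn’s GaussianMixture on synthetic data: fit multiple Gaussian distributions, assign each point to a cluster, and treat points that fit none of the distributions well as anomalies. The number of components and the likelihood cutoff are assumptions.

```python
# Fit a Gaussian mixture model, assign clusters, and flag low-likelihood points.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
clusters = gmm.predict(X)                      # cluster assignment per point
log_likelihood = gmm.score_samples(X)
anomalies = log_likelihood < np.percentile(log_likelihood, 1)  # assumed cutoff
```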
All of these processes occur inside automated model building software when it creates a new model. A human can perform these tasks as well, but with so many complex steps needed to create a quality model, it’s easy to see why it often takes a team of data scientists months or longer to build one. Automated model building, on the other hand, can shorten this process to weeks or even days. It’s time to let data scientists handle the human tasks that require critical and creative thinking, and let AI handle building more AI.