By: Karl Weinmeister and Jarred Capellman
As the saying goes, “a chain is only as strong as its weakest link.” In machine learning applications, feature extraction is often that weakest link. It ensures that the right data is pulled efficiently from relevant data sources and then transformed into features: the inputs a machine learning model uses to make a prediction.
In this blog post, we’ll discuss why performance tuning during feature extraction is important and how to improve this process, using examples from SparkCognition’s machine learning (ML)-based anti-malware solution, DeepArmor. Specifically, we’ll focus on feature extraction, because for approaches like decision trees, this can be the most time-consuming part of the process.
When is the right time?
Performance tuning is never the first task for a data science team. Understanding the business problem, exploring the data, and creating a model should all be done first.
That said, after an iteration or two of the data science project flow, the team has typically identified the most relevant features, and experimentation shifts to iterating on models built from them. At that point, it is worth inserting performance tuning into the process to speed up those model iterations, which can significantly reduce costs during the research phase.
In the simple data science process above, performance tuning steps can be inserted at both the “Evaluate” and “Prepare Data” stages. The first step is to baseline the feature extraction time alongside the usual model accuracy metrics. In the next iteration, the insights gleaned from that baseline can be used to experiment with alternate approaches in data preparation, which can then be measured and compared against it.
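The baselining step above can be sketched in a few lines. This is a minimal illustration, not DeepArmor’s actual pipeline: `extract_features` is a hypothetical placeholder for whatever extraction logic a project uses, and the point is simply to record extraction time next to the model’s other metrics so later iterations have something to compare against.

```python
import time

def extract_features(samples):
    # Hypothetical placeholder extraction: a tiny histogram-style
    # count of values 0 and 1 per sample, standing in for real logic.
    return [[s.count(v) for v in (0, 1)] for s in samples]

def baseline(samples):
    # Time the extraction step and report it alongside other metrics.
    start = time.perf_counter()
    features = extract_features(samples)
    elapsed = time.perf_counter() - start
    return {"extraction_seconds": elapsed, "n_samples": len(features)}

metrics = baseline([[0, 1, 1], [1, 0, 0, 0]])
print(metrics["n_samples"])  # 2
```

Storing `extraction_seconds` with each experiment run makes regressions visible as soon as a data-preparation change slows the pipeline down.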
What are the benefits?
The primary benefits of performance tuning are cost and time, and they can be seen during the research phase of the project, long before a model is even released to production. Those benefits will spill over to production, reducing the cost of operating the solution and improving the user experience.
With massive amounts of data, it is common for data scientists to partition data across multiple nodes in a distributed storage environment like HDFS or Amazon S3, and then to run compute jobs using an execution engine such as Apache Spark. Experiments are run over and over again to iteratively improve the solution. By improving the speed of feature extraction, answers arrive more quickly and the cost of research jobs decreases. When performance improvements are multiplied across many processes and many nodes, even small gains scale up quickly. Ultimately, that translates into accelerating the project’s schedule or reducing its cost. It can also improve the accuracy of the models produced, by enabling more experiments to be run on more data within the same schedule and budget.
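To make the scaling argument concrete, here is a back-of-the-envelope calculation. All of the numbers (node count, job volume, job duration, speedup) are illustrative assumptions, not figures from the post:

```python
# Assumed workload: 200 research jobs per day, each occupying 50 nodes
# for 30 minutes, with feature extraction sped up by 10%.
nodes_per_job = 50
jobs_per_day = 200
baseline_minutes_per_job = 30
speedup = 0.10  # 10% faster feature extraction

# Node-minutes saved per day, converted to node-hours.
minutes_saved = nodes_per_job * jobs_per_day * baseline_minutes_per_job * speedup
node_hours_saved_per_day = minutes_saved / 60
print(node_hours_saved_per_day)  # 500.0
```

Under these assumptions, a modest 10% gain in feature extraction frees roughly 500 node-hours of compute per day, which is why small per-job improvements add up at cluster scale.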
In production, the end user also benefits from performance tuning. As an anti-malware solution, DeepArmor needs to quickly detect and mitigate any threats, so the machine learning model is actually deployed on the end user’s device. Keeping the feature extraction process extremely fast ensures that the product can react in real time, and also maintains low CPU usage. In a hosted application scenario, these performance benefits not only translate into lower operating expenses for servers, but also provide protection in an offline scenario.
Performance tuning for feature extraction is a valuable practice that can deliver large benefits to a data science team. Join us in part two to learn how to implement performance tuning in your own operations.