At SparkCognition, we create machine learning models that can predict some amazing things — like forecasting the failure of a gas turbine days ahead of time instead of hours. However, as our company grew to accommodate our customer base, we ran into a problem inherent in the field: Models produced by different teams of data scientists can vary widely in terms of the inputs they require and how their outputs are interpreted.
These differences can make it difficult to adapt existing models for new data pipeline infrastructures and can force us to train new models to fit an old pipeline. Are there more efficient ways to deal with this issue? It turns out data science and data pipeline infrastructure can benefit from my field of expertise: software development.
Standardizing Software Boundaries
Let’s imagine a scenario where a picture sharing company has an app that allows users to purchase pictures and have them printed and shipped via a postal service. Each module of the app could have its own user management logic, developed independently by different teams of developers. At first, the picture sharing system in Figure 1 was implemented. Users could log in and upload their pictures, which were then shared with other users.
The company then added an accounting system to allow users to pay another user for the right to print their picture. However, the existing user management logic was hard to extract from the picture sharing system, so the accounting team decided to write their own. Unfortunately, this caused user management code to be duplicated, and bug fixes applied to one system were not necessarily applied to the others. Or worse, if the user management code were shared by all systems, code changes made by the shipping team could create bugs in the printing system without anyone noticing.
Software developers learn early in their careers the importance of properly encapsulated abstractions with clearly defined boundaries. Each of these abstractions, or modules, can be stitched together via its exposed interface — whether an API or other publicly exposed function calls — to form more complex systems that achieve various business goals.
There are considerable benefits to sharing modules. In Figure 2, the user management system is implemented only once and re-used by any other module that needs to manage its users. Developers can focus on the specific features of their own modules, reducing their cognitive load and decreasing the number of developers needed to work on each module.
Furthermore, each abstraction exposes a public interface that acts as a protective layer for other parts of the system. When the internal details of a module change, other modules will not be affected as long as the public interface remains the same. Another benefit is that new features and bug fixes to the user management system are immediately available to everyone.
There is one more crucial benefit: A module can be replaced with brand new code that works completely differently internally while implementing the exact same interface. Users of this new module will never know of the change (nor will they care). This has huge implications for how easy it is to update a complex software system over time. A team that has spent the time and effort to properly abstract a storage module for a blogging engine, for example, would have the freedom to change how the data is stored without impacting the rest of the business.
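To make this concrete, here is a minimal Python sketch of such a storage abstraction. All of the names here (`BlogStorage`, `InMemoryStorage`, and so on) are hypothetical, invented for illustration — the point is only that callers depend on the interface, not on any particular implementation:

```python
from abc import ABC, abstractmethod


class BlogStorage(ABC):
    """Public interface for the blogging engine's storage module."""

    @abstractmethod
    def save_post(self, post_id: str, body: str) -> None:
        """Persist a post under the given id."""

    @abstractmethod
    def load_post(self, post_id: str) -> str:
        """Return the body of a previously saved post."""


class InMemoryStorage(BlogStorage):
    """One implementation; could later be swapped for a database-backed one
    without touching any code that uses BlogStorage."""

    def __init__(self):
        self._posts = {}

    def save_post(self, post_id: str, body: str) -> None:
        self._posts[post_id] = body

    def load_post(self, post_id: str) -> str:
        return self._posts[post_id]


# The rest of the blogging engine only ever sees the BlogStorage interface:
storage: BlogStorage = InMemoryStorage()
storage.save_post("hello", "First post!")
print(storage.load_post("hello"))  # -> First post!
```

Swapping `InMemoryStorage` for, say, a database-backed class is invisible to the rest of the system as long as `save_post` and `load_post` keep the same signatures.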
Creating a Machine Learning Model Interface
What does this have to do with machine learning models?
Simply this: If a machine learning model can be abstracted into a module with a public interface that is unlikely to change, then the model can be swapped with another model without having to modify the analytics platform that uses it. A model could be re-trained or re-worked, and plugged back into the system with no further changes needed. Even better, new models could be created for different scenarios or customers and be used by existing data pipelines.
What kind of interface can we create around a machine learning model? Our early efforts at SparkCognition were to create a model runner, a piece of code whose inputs are 1) the file location of a CSV file containing data that we wish to run through the model and 2) the file location where the model will save a CSV file with prediction or classification results. This model runner is the interface that abstracts away the actual model, which may be written in any language, launched directly from the model runner, or spun up as a separate process on the machine.
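A minimal sketch of that model-runner contract might look like the following. The class names and the toy threshold model are illustrative assumptions, not SparkCognition's actual code — the real model behind the runner could be written in any language; what matters is the two-file-path interface:

```python
import csv
from abc import ABC, abstractmethod


class ModelRunner(ABC):
    """The contract: read data from one CSV path, write results to another."""

    @abstractmethod
    def run(self, input_csv_path: str, output_csv_path: str) -> None:
        """Read rows from input_csv_path and write predictions to output_csv_path."""


class ThresholdModelRunner(ModelRunner):
    """Toy stand-in for a real model: flags rows whose 'temperature'
    column exceeds a threshold."""

    def __init__(self, threshold: float = 100.0):
        self.threshold = threshold

    def run(self, input_csv_path: str, output_csv_path: str) -> None:
        with open(input_csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        with open(output_csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["prediction"])
            writer.writeheader()
            for row in rows:
                flag = int(float(row["temperature"]) > self.threshold)
                writer.writerow({"prediction": flag})
```

Because the data pipeline only ever calls `run(input_path, output_path)`, a retrained or entirely different model can be dropped in behind the same call without any pipeline changes.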
Since then, however, our concept of a model has evolved a little and we are shifting toward model packages, which include extra detail and metadata about the model itself that we want to expose to our customers. For example, a classification model package includes a model runner, a list of feature importances for each cluster it detected, two-dimensional cluster coordinates so a website can plot a graph, and more. Our data science and infrastructure teams have begun a standardization effort: our model runners still expect two file paths for their input dataset and output result file, but we have also added accessor functions to retrieve the rest of the information that the model package offers.
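As a rough sketch of what such a package interface could look like — again with hypothetical names, not the actual SparkCognition API — the runner contract stays the same and the metadata accessors sit alongside it:

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class ClassificationModelPackage(ABC):
    """A model package: the runner contract plus metadata accessors."""

    @abstractmethod
    def run(self, input_csv_path: str, output_csv_path: str) -> None:
        """Same two-file-path contract as the plain model runner."""

    @abstractmethod
    def feature_importances(self) -> Dict[str, Dict[str, float]]:
        """Per-cluster feature importances, e.g. {'cluster_0': {'temp': 0.8}}."""

    @abstractmethod
    def cluster_coordinates(self) -> Dict[str, Tuple[float, float]]:
        """2-D coordinates for each cluster, so a website can plot a graph."""


class DemoPackage(ClassificationModelPackage):
    """Trivial implementation with canned metadata, for illustration only."""

    def run(self, input_csv_path: str, output_csv_path: str) -> None:
        pass  # a real package would invoke its model here

    def feature_importances(self) -> Dict[str, Dict[str, float]]:
        return {"cluster_0": {"temperature": 0.8, "pressure": 0.2}}

    def cluster_coordinates(self) -> Dict[str, Tuple[float, float]]:
        return {"cluster_0": (0.1, 0.9)}
```

Any new classification model that implements this interface plugs into the existing pipeline and front end unchanged, which is what makes the minimal-lead-time delivery described below possible.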
This standardization effort created an interface that any new model package containing a classification model must follow. Because of this effort, we are now able to deliver new unsupervised learning models to new customers by re-using existing data pipelines and infrastructure with minimal lead time. This shortens the timeline for clients to implement our products and allows them to rapidly gain insights into their business.
Want more from Louis? Watch his video explaining the distributed filesystem Hadoop.