Machine Learning (ML) powers an increasing number of the applications and services that we use daily. For organizations that are beginning to leverage datasets to generate business insights, the next step after developing and training a model is deploying it for use in production. That could mean integrating it directly into an application or website, or making the model available as a service.
As ML continues to mature, the emphasis starts to shift from development towards deployment. You need to move from developing models to real-world production scenarios, where the concerns are inference performance, scaling, load balancing, training time, reproducibility and visibility.
In previous posts we’ve explored how to save and load trained models with TensorFlow so that they can be served for inference. This means we can make them available to end users of our application or service to make predictions.
Training models in production will bring any latent performance concerns to the fore very quickly. During development, long training times may not seem urgent, but they become show-stoppers in production. Depending on the type of model you’ve built, it may remain relatively static or you may have to retrain it regularly on new data.
Models for certain classes of problems, like face recognition, generally don’t need to be retrained. A recommendation engine, however, needs constant retraining to remain current as the data changes and evolves. Unless you retrain it in production, such a model can quickly become stale and its predictions useless.
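One way to make the retraining decision explicit is a small policy check that runs on a schedule. The sketch below is only an illustration: the function name `needs_retraining`, the seven-day age limit, and the 10,000-sample threshold are all assumptions that you would tune for your own model and data.

```python
from datetime import datetime, timedelta


def needs_retraining(last_trained, new_samples, now,
                     max_age=timedelta(days=7), min_new_samples=10_000):
    """Return True when the model is too old or enough new data has arrived.

    Both thresholds are hypothetical defaults, not recommendations.
    """
    too_old = now - last_trained >= max_age
    enough_new_data = new_samples >= min_new_samples
    return too_old or enough_new_data
```

A scheduled job could call `needs_retraining(last_trained, new_samples, datetime.now())` and start a training run only when it returns True, so a relatively static model is left alone while a fast-moving one is refreshed.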
In any production scenario, reproducible builds become critical when you have to roll back or isolate a problem. In ML, where you are largely managing data rather than code, traditional source control isn’t as well suited to the job. Tools like Pachyderm, which is essentially ‘git’ for ML datasets, let you manage and version the training data for your model.
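To see why versioning the data matters, consider recording a fingerprint of the training set alongside each trained model. The sketch below is not how Pachyderm works; it is a minimal standard-library illustration of the idea, and the names `dataset_fingerprint` and `build_manifest` are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Hash a list of JSON-serializable training records into one hex digest."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the encoding independent of dict key order
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_manifest(model_name, records):
    """Bundle the metadata needed to tie a trained model back to its data."""
    return {
        "model": model_name,
        "num_records": len(records),
        "data_sha256": dataset_fingerprint(records),
    }
```

Storing the manifest next to the saved model means that, when a prediction looks wrong, you can check whether the data changed between two deployments. Note that in this sketch the order of the records affects the digest.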
There are additional concerns around data pipelines with extremely large data sets, especially if you are using a GPU to train your models. You may quickly find yourself bottlenecked if your database queries are designed to run on a CPU and aren’t suited to the constraints of GPUs. There is increasing work on designing data pipelines that work end-to-end with the training environment. GPU database systems like MapD are emerging as potential tools to solve this problem.
For many of the big players, cloud-based ML services are becoming a primary focus. Amazon, Google, and Microsoft Azure all offer sophisticated AI-as-a-service tools that let developers build ML models and deploy them to the cloud for use in their applications and services.
These services offer cloud-hosted models, training in the cloud, and even pre-trained models for common use cases that can be spun up as needed and scaled with the size of the problem.
Services like Google’s DialogFlow help you integrate Natural Language Processing (NLP) into your app, while Amazon’s Rekognition brings object recognition in images to your application by leveraging common datasets and models.
In addition to the major cloud providers, there are specialized AI platforms like H2O.ai that offer robust tools for model deployment and scaling, making it easier to manage your data, models and deployments. And for those looking for an “out of the box” model solution, service providers like MachineBox.io offer ready-to-use models for common tasks like sentiment analysis or facial recognition.
Rolling your own Service
You might look to build your own service in your language of choice if you have a proprietary system that you need to integrate with, or if you need an on-prem solution to host your ML model. Languages like Go and Python have robust sets of tools and libraries to support development for data science and microservices. A microservice architecture is particularly suited to hosting ML models, and a simple inference API is relatively straightforward to build.
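To give a rough sense of how small such an inference API can be, here is a sketch that uses only Python’s standard library. Everything specific in it is an assumption made for illustration: the `/predict` endpoint, the `{"features": [...]}` request shape, port 8080, and above all the `predict` function, which is a placeholder that averages its inputs rather than calling a real trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model (hypothetical): returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_predict(raw_body):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("empty feature list")
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "body must be JSON with a non-empty 'features' list"}
    return 200, {"prediction": predict(features)}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        body = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8080):
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), InferenceHandler)
```

Keeping the request handling in `handle_predict`, separate from the HTTP plumbing, makes the service easy to test without opening a socket, and swapping in a real model only means replacing `predict`.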
In our next installment we’ll walk through building a simple inference API using Python and Flask to host a TensorFlow model. We’ll work through all the steps from freezing and saving the model to deploying and testing the service — and share all the code on GitHub.
Moving your models from development into production? Looking for tips on performance optimization for TensorFlow — including how to get order-of-magnitude speed improvements? Check out our recent webinar on “Optimizing Machine Learning with TensorFlow” where we walk through a number of gains that are available now.