In our webinar “Optimizing Machine Learning with TensorFlow”, we gave an overview of some of the impressive optimizations Intel has made to TensorFlow for its own hardware.
You can find a link to the archived video here.
During the webinar, Mohammad Ashraf Bhuiyan, Senior Software Engineer in Intel’s Artificial Intelligence Group, and I spoke about some of the common use cases that require optimization, and we presented benchmarks demonstrating order-of-magnitude speed improvements when running on Intel hardware.
TensorFlow, Google’s library for machine learning (ML), has become the most popular machine learning library in a fast-growing ecosystem. It has over 77k stars on GitHub and is widely used in a growing number of business-critical applications.
The ML ecosystem and toolset continue to grow, evolve, and mature. Plus, both the ecosystem and toolset have become more accessible even to those new to ML. This accessibility extends from developing and training models to the deployment of models into production scenarios.
Real-world production scenarios may include tens of thousands or even millions of data points in complex models that could potentially have thousands of features. At that scale, optimized models are critical to both the success and basic functioning of your ML application. Training is often the most computationally expensive part of the process, as processing huge data sets can take on the order of hours, days, or even weeks depending on the complexity of your model. In order to keep pace with the rate of new data acquisition, especially in the case of real-time applications, you need to be able to train your model at a reasonable rate. And if you’re training in the cloud on services like AWS or Azure, you are also likely interested in reducing CPU and memory usage.
Thankfully, even though the deployment of deep-learning models to solve business problems is relatively new, the techniques that yield good computational performance are not. A number of common optimization techniques can be applied to the TensorFlow library in order to reap some impressive gains in performance.
Data Storage Optimizations
Reducing the total number of allocations can have a dramatic impact on your model’s training time, in terms of both memory usage and speed. The time for an individual allocation may be small, but scaled up to millions of rows of data, the potential time saved is significant.
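As a rough illustration of the allocation point, here is a minimal NumPy sketch rather than TensorFlow code. The function names grow_by_append and fill_preallocated are hypothetical, chosen only for this example. The first version allocates and copies a new array on every iteration; the second allocates once and fills rows in place.

```python
import numpy as np

def grow_by_append(rows):
    """Build the array row by row; np.append allocates and copies each time."""
    out = np.empty((0, 3))
    for i in range(rows):
        out = np.append(out, [[i, i, i]], axis=0)
    return out

def fill_preallocated(rows):
    """Allocate the full array once, then write each row in place."""
    out = np.empty((rows, 3))
    for i in range(rows):
        out[i] = i  # broadcasts the scalar across the row
    return out

same = np.array_equal(grow_by_append(100), fill_preallocated(100))
print("identical results:", same)
```

Both functions produce the same data; only the number of allocations differs, and that difference grows with the number of rows.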
Converting data between formats can also seem inexpensive when you look at a single conversion. But conversions between floats and integers, or between levels of precision, add up to serious time when multiplied across large graphs and datasets.
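The conversion cost can be sketched the same way. The example below is a hypothetical NumPy illustration, not taken from the webinar. The helper names, the tiny input array, and the step count are all assumptions. It contrasts casting integer data to float32 on every step with casting it once and reusing the result.

```python
import numpy as np

raw = np.arange(12, dtype=np.int64).reshape(4, 3)  # stand-in for integer input data
weights = np.ones(3, dtype=np.float32)

def convert_every_step(data, w, steps):
    total = np.float32(0.0)
    for _ in range(steps):
        # astype creates a new float32 copy on every step
        total += (data.astype(np.float32) @ w).sum()
    return total

def convert_once(data, w, steps):
    data32 = data.astype(np.float32)  # single conversion, reused below
    total = np.float32(0.0)
    for _ in range(steps):
        total += (data32 @ w).sum()
    return total

print(convert_every_step(raw, weights, 5), convert_once(raw, weights, 5))
```

The two functions compute the same value. The second simply hoists the conversion out of the loop, which is the kind of saving that matters once the loop runs across a large dataset.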
Changing the layout of the graph itself can yield substantial gains. These gains come from eliminating redundant steps in the graph or from replacing key graph nodes with better-optimized versions of equivalent instructions.
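To make "eliminating redundant steps" concrete, here is a toy sketch in plain Python. It does not use TensorFlow's graph representation or its optimizer; the tuple-based graph and the simplify and count_ops helpers are invented purely for illustration. It removes back-to-back transposes, which cancel each other out.

```python
# A node is either a string (an input name) or a tuple: (op_name, child_node).

def simplify(node):
    """Return an equivalent graph with consecutive transposes removed."""
    if isinstance(node, str):
        return node
    op, child = node
    child = simplify(child)
    if op == "transpose" and isinstance(child, tuple) and child[0] == "transpose":
        return child[1]  # transpose(transpose(x)) is just x
    return (op, child)

def count_ops(node):
    """Count the operations in a graph."""
    return 0 if isinstance(node, str) else 1 + count_ops(node[1])

graph = ("relu", ("transpose", ("transpose", ("scale", "x"))))
optimized = simplify(graph)
print(count_ops(graph), "ops before,", count_ops(optimized), "ops after")
```

A real graph optimizer applies many such rewrite rules, but the principle is the same: the optimized graph computes the same result with fewer nodes.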
TensorFlow is well suited to parallel execution. Making full use of the cores, nodes, and threads that your hardware environment provides can deliver critical speed improvements.
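One common way to influence threading for OpenMP-based math libraries is through environment variables set before the library is imported. The sketch below is a starting point under stated assumptions, not tuning advice from the webinar. OMP_NUM_THREADS is the standard OpenMP variable, while KMP_BLOCKTIME and KMP_AFFINITY apply only to Intel's OpenMP runtime. The values shown are illustrative and should be measured on your own hardware.

```python
import os

# Logical CPUs visible to Python; os.cpu_count() can return None.
# Assumption: tuning guides often prefer the physical core count instead.
cores = os.cpu_count() or 1

# Standard OpenMP setting: maximum threads used inside one parallel region.
os.environ["OMP_NUM_THREADS"] = str(cores)

# Intel OpenMP runtime settings (ignored by other runtimes). Illustrative values.
os.environ["KMP_BLOCKTIME"] = "1"  # ms a thread waits before sleeping
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin threads to cores

print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```

These variables must be set before the numerical library is first imported in the process. TensorFlow also exposes its own intra-op and inter-op thread settings, which are not shown here.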
As mentioned above, replacing an instruction or set of instructions with an equivalent but better-optimized version can sometimes result in large gains. Intel provides a set of optimized MKL (Math Kernel Library) instructions that are adapted for TensorFlow. These can deliver large speed gains on common math operations when using Intel hardware. Replacing a MATMUL operation with its MKL equivalent may result in a dramatic speed improvement that becomes more pronounced as the size of the dataset increases.
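The idea of swapping in an optimized implementation of the same operation can be sketched with NumPy. This is an analogy, not TensorFlow's MKL kernels. The naive_matmul helper is hypothetical, and np.matmul dispatches to whichever BLAS library NumPy was built against, which is MKL only in MKL-enabled distributions.

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook triple-loop matrix multiply, for comparison only."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

rng = np.random.default_rng(0)
a = rng.random((40, 40))
b = rng.random((40, 40))

start = time.perf_counter()
slow = naive_matmul(a, b)
naive_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = np.matmul(a, b)  # delegates to the underlying BLAS library
blas_seconds = time.perf_counter() - start

print(f"naive: {naive_seconds:.4f} s, np.matmul: {blas_seconds:.6f} s")
print("results match:", np.allclose(slow, fast))
```

The two calls return the same matrix; only the implementation behind the operation changes, which is the same principle as substituting an MKL-backed node in a graph.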
ActivePython ships with MKL optimizations enabled for many common numerical and scientific computing libraries. New optimizations will be added as they become available. Let’s take a look at a simple Python example that is running in ActivePython to show the speedup at the individual instruction level.
In this example, the code computes the eigenvalues and normalized eigenvectors of a random square matrix of increasing size:
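import numpy as np

# Eigen-decompose random square matrices of increasing size (starting at 1x1).
for nSize in range(1, 10):
    a = np.random.rand(nSize, nSize)
    # eig returns the eigenvalues and the normalized eigenvectors
    result = np.linalg.eig(a)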
The benchmark results from running this code show a clear 4x improvement using ActivePython with MKL over the vanilla Python distro:
It’s also clear from the graph that the gains become more pronounced as the size of the data set increases. Placed within a TensorFlow graph and applied across a huge training set, these optimizations taken together can add up to order-of-magnitude performance improvements.
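If you want to take a similar measurement on your own machine, a timing harness along these lines is one option. It is a sketch, not the benchmark used for the results above. The time_eig helper, the matrix sizes, and the repeat count are assumptions chosen to keep the run short.

```python
import time
import numpy as np

def time_eig(n, repeats=3):
    """Return the best wall-clock time (seconds) for eig on an n x n matrix."""
    a = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.linalg.eig(a)
        best = min(best, time.perf_counter() - start)
    return best

# Illustrative sizes only; larger matrices make differences easier to see.
timings = {n: time_eig(n) for n in (50, 100, 200)}
for n, seconds in timings.items():
    print(f"{n}x{n}: {seconds:.4f} s")
```

Running the same script under two Python distributions gives a like-for-like comparison. Calling np.show_config() prints the BLAS and LAPACK libraries that a given NumPy build is linked against, which helps confirm whether MKL is actually in use.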
For more detail on:
- How these optimizations were implemented in TensorFlow
- How to integrate them into your own project to realize these order-of-magnitude speed improvements
- Why now is the time to try out TensorFlow on a CPU
- Where CPUs excel over GPUs
- And much, much more...