Following up on our introduction to oneML (oneML — an optimized and portable machine learning SDK), we would now like to take a deeper look at the role this AI/ML SDK plays in the life cycle of applications and models built on experimental and state-of-the-art artificial intelligence.
From research to production
AI and ML are evolving very quickly, and incredible advancements are shared with the public on a daily basis. Researchers are hard at work developing the next state-of-the-art (SOTA) model for the latest trend in their field, while the public patiently waits to get access to their work and play around with the latest applications and models.
Between the work of researchers and the expectations of end users, though, there is a huge gap: how to bring new, experimental research models to production as efficiently as possible.
Why is filling this gap necessary? Why not simply deploy the same work/code that researchers use?
The limitations of research code include, but are not limited to:
- Scalability: the code used for experiments, model training, and model evaluation is not designed for serving end users, and it can hit bottlenecks when it needs to scale up as the number of users of the service grows
- Performance: the latency of the service deeply impacts the user experience; applying optimization techniques to the application and/or making sure the model is compiled for production are some of the steps that can be taken to make a research project production-ready
- Cost: deploying applications that serve large deep learning models to many users (in the millions) demands a lot of computing resources and is therefore very expensive to maintain. The cost can be greatly reduced by improving the overall efficiency of the service, so that it demands fewer resources and can run on less powerful hardware without degrading the user experience
Filling the gap
Now, let's explore how the oneML SDK helps researchers and engineers efficiently transition from research to production.
Scalability
oneML provides an all-in-one standalone library with many high-level APIs that are easy to integrate and use in a wide variety of scenarios. Any service that uses the SDK is easier to scale because there is no need to download, install, or set up any dependency. Moreover, the library itself is very small (a few MBs), and the optimization techniques applied to the models also reduce the size of the models themselves, which helps the application start up more quickly and run more smoothly.
Performance
oneML's code is designed with performance in mind and is implemented in C++, which makes execution faster than in many other programming languages. Usually, though, the bottleneck of a deep learning application in production is the deep learning model itself because, on average, it consumes far more resources than anything else in the service. Thus, a lot of effort in our SDK has gone into optimizing models by using multiple frameworks to simplify, prune, quantize, and, finally, compile each model into a much faster version of its research counterpart.
Cost
Efficiency is another major area of focus for oneML, and it's no surprise that this results in lower running costs for applications that integrate the SDK. Namely, it becomes possible to serve more users on the same hardware, to serve the same number of users on cheaper (less powerful) hardware, or, even better, a combination of the two.
The process
Here are some key techniques that are part of our process of bringing AI/ML applications and models to production:
- Model Quantization: the SDK can include tools for model quantization, which is the process of converting a high-precision deep learning model into a lower-precision representation, for example converting a 32-bit floating-point model to an 8-bit integer model. This reduces the memory footprint of the model, allowing it to be stored and processed more efficiently. Lower-precision models also require fewer computational resources during inference, leading to faster inference times and reduced running costs, especially in scenarios where computational resources are limited, such as edge devices or embedded systems (see the quantization sketch after this list).
- Model Compression: deep learning models can be quite large, with millions or even billions of parameters. oneML can provide techniques for model compression, such as pruning, which involves removing redundant or unnecessary parameters from the model. This reduces the size of the model, making it more compact and easier to deploy in production environments. Smaller models also require fewer computational resources during inference, leading to improved performance and reduced running costs (see the pruning sketch after this list).
- Hardware Acceleration: many modern AI deployments leverage specialized hardware accelerators, such as GPUs or TPUs, to accelerate model inference. Our SDK can include optimizations that take advantage of these hardware accelerators, such as optimized libraries or runtime environments, to leverage the full potential of the hardware and achieve faster inference times. Hardware acceleration can significantly improve model performance and reduce inference latency, leading to more responsive AI applications.
- Runtime Optimizations: oneML can include runtime optimizations, such as kernel fusion, tensor layout optimization, or graph optimization, which are designed to optimize the execution of deep learning models during inference. These optimizations can improve the efficiency of model inference, reducing computational overheads and improving overall system performance. For example, kernel fusion can combine multiple operations into a single operation, reducing memory transfers and computation, and improving inference speed.
- Resource Monitoring and Management: the SDK can also provide tools for monitoring and managing computational resources during model inference. This can include features such as dynamic batching, which optimizes the batch size of input data during inference based on the available system resources, reducing the overhead of data transfers and computation. Resource monitoring and management can help optimize resource utilization, leading to improved scalability and reduced running costs.
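To make the quantization step more concrete, here is a minimal sketch using TensorFlow Lite's post-training full-integer quantization; the saved-model path, input shape, and calibration generator are hypothetical placeholders, not part of oneML's actual tooling:

```python
# Minimal post-training INT8 quantization sketch (hypothetical paths and shapes).
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Small calibration set; random tensors stand in for real input images.
    for _ in range(100):
        yield [np.random.rand(1, 112, 112, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full-integer quantization: weights and activations both in int8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Full-integer quantization like this typically cuts the model size to roughly a quarter of its float32 footprint and speeds up CPU inference, at the price of a small accuracy drop that always has to be validated.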
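Likewise, here is a minimal pruning sketch with the tensorflow_model_optimization toolkit, assuming a Keras model; the architecture, sparsity schedule, and fine-tuning details are illustrative, not the exact recipe used in oneML:

```python
# Minimal magnitude-based pruning sketch (hypothetical model and schedule).
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.applications.ResNet50(weights=None, input_shape=(112, 112, 3))

# Ramp sparsity from 0% to 80% of the weights over 1000 fine-tuning steps.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam", loss="categorical_crossentropy")
# ... fine-tune here with tfmot.sparsity.keras.UpdatePruningStep() in the callbacks ...

# Strip the pruning wrappers before export so only the sparse weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("resnet50_pruned")
```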
All these steps are achieved by integrating multiple advanced open-source deep learning frameworks (e.g., TVM, MNN, OpenVINO, TensorRT) with proprietary code that, in the end, makes sure each deployment is optimized for its specific environment.
Real-life example
FRVT (Face Recognition Vendor Test) by NIST (National Institute of Standards and Technology) is a standardized test to evaluate the performance of face recognition algorithms. As part of Sertis' effort to consolidate itself as a leading AI firm worldwide, our AI research and Machine Learning Engineering teams have collaborated to take part in this test.
To participate in the test, submitted algorithms must comply with strict specifications in terms of structure and performance. One of these requirements concerns the latency of the whole application, including the inference time of any deep learning or machine learning models used. This makes it a very good example of how leveraging a production-grade AI SDK like oneML can impact a deployment scenario.
TVM
For this specific use case, we knew the exact hardware our algorithm would run on, which gave us the chance to optimize our code and models for precisely the architecture, instruction set, and hardware specifications that would be used during the test.
We decided to use TVM as our framework of choice because of its optimization capabilities, minimal deployment impact, and ease of use. For FRVT, the research-to-production process would look something like the following steps:
- Get inference Python code from researchers
- Translate inference code into optimized C++ code
- Optimize the deep learning model with some of the techniques listed in The process (in TVM, this step is pretty much automatic; the user only needs to set some desired requirements/specifications and the optimizer keeps iterating until it has found the best configuration; see the sketch after these steps)
- Validate optimized model performance and accuracy
- Integrate the new C++ code and the optimized model into the SDK
- Test the newly integrated functionalities of the SDK (unit and integration tests)
- Test the public APIs of the SDK (end-to-end tests)
- Build a release deployment package of the SDK
- Wrap oneML’s API with some application code (NIST)
- Compile and run an application (NIST)
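To give an idea of what the TVM part of this flow can look like in practice, here is a minimal sketch that imports an ONNX export of the model, auto-tunes it for a fixed x86 CPU, and compiles a deployable library; the file names, input shape, and target flags are illustrative assumptions, not the exact configuration used for FRVT:

```python
# Minimal TVM auto-tuning and compilation sketch (hypothetical model and target).
import onnx
import tvm
from tvm import relay, auto_scheduler

onnx_model = onnx.load("face_embedding.onnx")
shape_dict = {"input": (1, 3, 112, 112)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Target the exact CPU micro-architecture used in deployment.
target = tvm.target.Target("llvm -mcpu=broadwell")

# Extract tuning tasks and let the auto-scheduler search for fast kernels.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_options = auto_scheduler.TuningOptions(
    num_measure_trials=2000,
    measure_callbacks=[auto_scheduler.RecordToFile("tuning_log.json")],
)
tuner.tune(tune_options)

# Compile with the tuned schedules and export a shared library for the C++ runtime.
with auto_scheduler.ApplyHistoryBest("tuning_log.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
lib.export_library("face_embedding_tuned.so")
```

The auto-scheduler repeatedly benchmarks candidate kernels on the target machine and logs the best ones, so the final relay.build call picks up the tuned schedules, and the exported library can then be loaded from the C++ code through TVM's runtime.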
Final results
During our engineering process, we have examined and tested a wide variety of optimization techniques for both our deep learning models as well as our C++ code. Here we’re going to recap the journey from the performance (latency) point of view.
The most relevant constraint for the FRVT deployment is that no hardware accelerator is available, so the whole algorithm (the API, including model inference) must run in less than a second on a single thread of a CPU (Intel Xeon E5-2630 v4).
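As a rough illustration of how such a single-thread latency budget can be checked, here is a minimal sketch that caps the runtime's thread pools and times a compiled TVM module; the module path and input name are hypothetical:

```python
# Minimal single-thread latency check sketch (hypothetical module and input name).
import os
import time

# Restrict common thread pools to one worker before any runtime is imported.
os.environ["TVM_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import tvm
from tvm.contrib import graph_executor

lib = tvm.runtime.load_module("face_embedding_tuned.so")
module = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))

dummy_input = np.random.rand(1, 3, 112, 112).astype("float32")
module.set_input("input", dummy_input)

# Warm up once, then average wall-clock time over repeated runs.
module.run()
runs = 50
start = time.perf_counter()
for _ in range(runs):
    module.run()
latency_ms = (time.perf_counter() - start) / runs * 1000
print(f"average latency: {latency_ms:.1f} ms")
```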
The following results are representative of a ResNet50-like model benchmarked on a single thread of a CPU (Intel i5-7500). Our optimization journey evolved as follows:
- TensorFlow → ~2000ms (ORIGINAL RESEARCH MODEL)
- TensorFlow Lite → ~1000ms
- oneDNN → 180ms
- OpenVINO → 150ms
- TVM (w/o tuning) → 200ms
- TVM (w/ tuning) → 140ms (PRODUCTION-READY MODEL)
In the end, the production-ready model reached a speedup of ~14x over the original research model (from ~2000ms down to 140ms), enabling Sertis to deploy its algorithm for FRVT and greatly increase the efficiency of all the processes related to this application.
Takeaways
- It is crucially important to productionize research models for deployment in order to have an efficient, performant system in production.
- Nowadays, it is fairly easy to optimize deep learning and machine learning models, and doing so usually benefits the system greatly, so it is well worth the effort.
- Close collaboration between research and engineering teams is key to working efficiently when trying to fill the gap between research and production.
References
- GitHub — sertiscorp/oneML-bootcamp: Intro and sample apps to showcase oneML functionalities and possible use cases: Face Detection, Face Identification, Face Embedding, Vehicle Detection, EKYC, Person Attack Detection.
- oneML — an optimized and portable machine learning SDK
- GitHub — apache/tvm: Open deep learning compiler stack for cpu, gpu and specialized accelerators
- GitHub — alibaba/MNN: MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
- GitHub — openvinotoolkit/openvino: OpenVINO™ Toolkit repository
- GitHub — NVIDIA/TensorRT: NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
- GitHub — tensorflow/tensorflow: An Open Source Machine Learning Framework for Everyone
- GitHub — oneapi-src/oneDNN: oneAPI Deep Neural Network Library (oneDNN)
- FRVT 1:1 Verification
Written by: Sertis Machine Learning Engineer team
Originally published at https://www.sertiscorp.com/sertis-vision-lab