DeepSpeed-MII, a new open-source Python library from the DeepSpeed team, accelerates inference for over 20,000 popular deep learning models.
Even though open-source software has made AI more accessible, inference latency and cost remain two major barriers to its widespread adoption.
Although system innovations can reduce the latency and cost of DL model inference, they are not yet generally accessible. Low-latency, low-cost inference therefore remains largely out of reach, because many data scientists lack the expertise to correctly identify and apply the set of system optimizations relevant to a given model. This lack of accessibility stems mostly from the complexity of the DL inference landscape, which spans wide variation in model size, architecture, system performance characteristics, and hardware requirements.
DeepSpeed-MII
Microsoft Research created DeepSpeed-MII, a new open-source Python library, to encourage the wider adoption of low-latency, low-cost inference of high-performance models. MII exposes highly optimized implementations of thousands of commonly used DL models.
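The typical workflow is a deploy-and-query pattern. The sketch below follows the examples in the MII repository; the model name, deployment name, and generation arguments are illustrative and may differ across MII versions.

```python
# Minimal sketch of deploying and querying a model with MII
# (assumes `pip install deepspeed-mii`; argument names follow the MII examples).
import mii

# Deploy an optimized inference endpoint for a Hugging Face model.
mii.deploy(
    task="text-generation",
    model="bigscience/bloom-560m",        # illustrative model choice
    deployment_name="bloom560m_deployment",
)

# Obtain a handle to the deployment and send a query.
generator = mii.mii_query_handle("bloom560m_deployment")
result = generator.query(
    {"query": ["DeepSpeed is", "Seattle is"]},
    do_sample=True,
    max_new_tokens=30,
)
print(result)
```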
For low-latency, low-cost inference, MII employs DeepSpeed-Inference optimizations such as deep fusion for transformers, automated tensor slicing for multi-GPU inference, and ZeroQuant quantization. As a result, these models can be deployed quickly, simply, and affordably, either on-premises or on Azure via AML.
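As a rough illustration of how such optimizations might be requested, the sketch below passes a configuration dictionary to `mii.deploy`. The `mii_config` keys shown ("tensor_parallel", "dtype") are taken from the library's documented options but should be treated as illustrative rather than authoritative.

```python
# Sketch: requesting multi-GPU tensor slicing and reduced precision through MII's
# configuration dictionary (keys are illustrative; verify against the installed version).
import mii

mii.deploy(
    task="text-generation",
    model="facebook/opt-1.3b",             # illustrative model choice
    deployment_name="opt13b_deployment",
    mii_config={
        "tensor_parallel": 2,              # slice the model across 2 GPUs
        "dtype": "fp16",                   # half-precision inference
    },
)
```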
Under the hood, MII is powered by DeepSpeed-Inference. MII automatically tunes DeepSpeed-Inference based on the model type, batch size, and available hardware resources to decrease latency and boost throughput. To identify the underlying PyTorch model architecture and replace it with an optimized implementation, MII and DeepSpeed-Inference use a number of pre-specified model injection policies. As a result, DeepSpeed-Inference's comprehensive set of optimizations is instantly available to the tens of thousands of widely used models that MII supports.
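The sketch below illustrates the kind of kernel injection MII relies on by calling DeepSpeed-Inference directly on a Hugging Face model. The arguments follow DeepSpeed-Inference examples from the same era and are illustrative, not a description of MII's internals.

```python
# Sketch: handing a PyTorch/Hugging Face model to DeepSpeed-Inference, which swaps
# supported transformer blocks for optimized fused kernels via its injection policies.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree (1 = single GPU)
    dtype=torch.half,                 # run in fp16
    replace_with_kernel_inject=True,  # apply the pre-specified injection policies
)

inputs = tokenizer("DeepSpeed-MII is", return_tensors="pt").to("cuda")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```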
Open-source model repositories
Thousands of such transformer models are available in open-source model repositories, including Hugging Face, FairSeq, EleutherAI, and others. MII covers applications such as text generation, question answering, and classification, and supports models with hundreds of millions of parameters and larger, such as BERT, RoBERTa, GPT, OPT, and BLOOM. It also enables Stable Diffusion and other modern image-generation models. Inference workloads may be latency-critical or cost-sensitive, and MII aims to minimize both.
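Other tasks use the same deploy-and-query interface. The sketch below shows a question-answering deployment; the model choice is illustrative, and the query fields ("question"/"context") mirror the Hugging Face pipeline format as used in the MII examples, so they should be verified against the installed version.

```python
# Sketch: deploying a question-answering model through MII's task interface.
import mii

mii.deploy(
    task="question-answering",
    model="deepset/roberta-base-squad2",   # illustrative model choice
    deployment_name="qa_deployment",
)

qa = mii.mii_query_handle("qa_deployment")
answer = qa.query({
    "question": "What does MII optimize?",
    "context": "DeepSpeed-MII reduces the latency and cost of deep learning inference.",
})
print(answer)
```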
MII can use one of two variants of DeepSpeed-Inference. The first, ds-public, is part of the public DeepSpeed library and contains most of the optimizations described above. The second, ds-azure, is available through MII to users of Microsoft Azure and offers tighter integration with Azure. MII instances built on these two variants are referred to as MII-Public and MII-Azure, respectively.
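As a rough sketch of how one might target the two deployment modes, the example below assumes a `deployment_type` argument and a `DeploymentType` enum with LOCAL and AML members, as seen in the MII examples; both names are assumptions to verify against the installed version.

```python
# Sketch: choosing between a local (MII-Public) and an AML (MII-Azure) deployment.
# The DeploymentType enum and its members are assumed from the MII examples.
import mii

# Local deployment backed by the public DeepSpeed-Inference (MII-Public).
mii.deploy(
    task="text-generation",
    model="gpt2",
    deployment_name="gpt2_local",
    deployment_type=mii.DeploymentType.LOCAL,
)

# Azure ML deployment backed by ds-azure (MII-Azure); generates AML deployment assets.
mii.deploy(
    task="text-generation",
    model="gpt2",
    deployment_name="gpt2_aml",
    deployment_type=mii.DeploymentType.AML,
)
```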
Conclusion
Compared to the open-source PyTorch implementation (the baseline), MII-Public and MII-Azure deliver considerable latency and cost benefits, although the gains vary with the type of generation task. For latency-critical applications with a batch size of one, MII can cut latency by up to 6x across open-source models and workloads. To obtain the lowest cost, the team measured baseline and MII throughput at large batch sizes, showing that MII can drastically lower the inference cost of expensive language models such as BLOOM and OPT.
MII-Public can run both locally and on any cloud service. For deployment, MII implements a straightforward gRPC server and exposes a gRPC inference endpoint for queries. MII-Azure deployments are supported through Azure and AML Inference. The researchers also expect their work to make many more models easier to deploy. By rapidly reducing inference latency and cost, MII enables more advanced AI capabilities in applications and products.