Note: Special thanks to Less Wright, Partner Engineer, Meta AI, for review of and additional insights into the post.
This post on creating efficient artificial intelligence (AI) is the second in the “Summer of open source” series. This series aims to provide a handful of useful resources and learning content in areas where open source projects are creating impact across Meta and beyond. Follow along as we explore other areas where Meta Open Source is moving the industry forward by sharing innovative, scalable tools.
Since its initial release in 2016, PyTorch has been widely used in the deep learning community, and its roots in research are now consistently expanding for use in production scenarios. In an exciting time for machine learning (ML) and artificial intelligence (AI), where novel methods and use cases for AI models continue to expand, PyTorch has reached the next chapter in its history as it moves to the newly established, independent PyTorch Foundation under the Linux Foundation umbrella. The foundation is made up of a diverse governing board including representatives from AMD, Amazon Web Services, Google Cloud, Microsoft Azure and Nvidia, and the board is intended to expand over time. The mission includes driving adoption of AI tooling through vendor-neutral projects and making open source tools, libraries and other components accessible to everyone. The move to the foundation will also enable PyTorch and its open source community to continue to accelerate the path from prototyping to production for AI and ML.
PyTorch is a great example of the power of open source. As one of the early open source deep learning frameworks, PyTorch has allowed people from across disciplines to experiment with deep learning and apply their work in wide-ranging fields. PyTorch supports everything from experiments in search applications to autonomous vehicle development to ground-penetrating radar, and these are only a few of its more recent applications. Pairing a versatile library of AI tools with the open source community unlocks the ability to quickly iterate on and adapt technology at scale for many different uses.
As AI is being implemented more broadly, models are trending up in size to tackle more complex problems, but this also means that the resources needed to train these models have increased substantially. Fortunately, many folks in the developer community have recognized the need for models to use fewer resources—both from a practical and environmental standpoint. This post will explore why quantization and other types of model compression can be a catalyst for efficient AI.
Most of this post explores some intermediate and advanced features of PyTorch. If you're a beginner looking to get started, or an expert coming from another library, it's easiest to start with some basics. Check out the beginner's guide to PyTorch, which includes an introduction to a complete ML workflow using the Fashion MNIST dataset.
There are many pathways to making AI more efficient. Codesigning hardware and software to optimize for AI can be highly effective, but bespoke hardware-software solutions take considerable time and resources to develop. Creating faster and smaller architectures is another path to efficiency, but many of these architectures suffer from accuracy loss when compared to larger models, at least for the time being. A simpler approach is to find ways of reducing the resources that are needed to train and serve existing models. In PyTorch, one way to do that is through model compression using quantization.
Quantization is a mathematical technique that has been used to create lossy digital music files and convert analog signals to digital ones. By executing mathematical calculations with reduced precision, quantization allows for significantly higher performance on many hardware platforms. So why use quantization to make AI more efficient? Results show that in certain cases, using this relatively simple technique can result in dramatic speedups (2-4 times) for model inference.
The parameters that make up a deep learning model are typically decimal numbers stored in floating point (FP) precision; each parameter requires either 16 or 32 bits of memory. With quantization, numbers are often converted to INT4 or INT8, which occupy only 4 or 8 bits. This reduces how much memory a model requires. Additionally, chip manufacturers include specialized arithmetic logic that makes operations on integers faster than operations on floating-point numbers.
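As a rough illustration of the memory savings (assuming PyTorch is installed), compare the footprint of an FP32 weight matrix with an INT8 one of the same shape:

```python
import torch

# A float32 value occupies 4 bytes; an int8 value occupies 1 byte.
weights_fp32 = torch.randn(1024, 1024)                                # FP32 weights
weights_int8 = torch.randint(-128, 127, (1024, 1024), dtype=torch.int8)

fp32_bytes = weights_fp32.element_size() * weights_fp32.nelement()
int8_bytes = weights_int8.element_size() * weights_int8.nelement()

print(fp32_bytes // int8_bytes)  # 4 (the int8 tensor is 4x smaller)
```

The same 4x ratio applies per parameter across an entire model's weights, which is where the memory reduction comes from.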
There are three methods of quantization that can be used for training models: dynamic, static and quantization-aware training (QAT). A brief overview of the benefits and weaknesses of each is given in the table below. To learn how to implement each of these in your AI workflows, read the Practical Quantization in PyTorch blog post.

| Method | Benefits | Weaknesses |
| --- | --- | --- |
| Dynamic quantization | Easy to use with a single API call; more robust to distribution drift | Additional overhead in every forward pass |
| Static quantization (post-training quantization) | Faster inference than dynamic quantization by eliminating per-pass overhead | May need regular recalibration for distribution drift |
| Quantization-aware training (QAT) | Higher accuracy than the other methods; faster inference than dynamic | High computational cost |
Quantization isn’t the only way to make PyTorch-powered AI more efficient. Features are updated regularly, and below are a few other ways that PyTorch can improve AI workflows:
Inference mode: Use this mode when writing PyTorch code that will only be run for inference. Inference mode changes some of the assumptions PyTorch makes when working with tensors in order to speed up inference. By telling PyTorch that you won't use tensors for certain operations later (in this case, autograd), it can skip bookkeeping and make code run faster in these specific scenarios.
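In practice, inference mode is a context manager you wrap around your forward passes. A minimal sketch:

```python
import torch

model = torch.nn.Linear(16, 4).eval()
x = torch.randn(8, 16)

# inference_mode() disables autograd tracking and tensor version counting;
# tensors created inside it cannot be used in a later backward pass.
with torch.inference_mode():
    out = model(x)

print(out.requires_grad)  # False
```

Compared to `torch.no_grad()`, inference mode makes stronger guarantees to PyTorch, which is what allows the extra speedup.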
Low precision: Quantization works only at inference time, that is, after you have trained your model. For the training process itself, PyTorch uses AMP, or automatic mixed precision training, to find the best format based on which tensors are used (FP16, FP32 or BF16). Low-precision deep learning in PyTorch has several advantages. It can help lower the size of a model, reduce the memory that is required to train models and decrease the power that is needed to run models. To learn more, check out this tutorial for using AMP with CUDA-capable GPUs.
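A minimal autocast sketch is shown below. To keep the example runnable without a GPU it uses the CPU backend with BF16; on a CUDA-capable GPU you would typically use `device_type="cuda"` with FP16 and pair it with a gradient scaler for the backward pass.

```python
import torch

model = torch.nn.Linear(32, 8)
x = torch.randn(4, 32)

# Autocast selects a lower-precision dtype per operation. On CUDA you would
# use device_type="cuda" with torch.float16 plus torch.cuda.amp.GradScaler.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16
```

Operations that are numerically sensitive (such as reductions) stay in FP32 under autocast, which is why the technique is called mixed precision rather than simply low precision.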
Channels last: When it comes to vision models, NHWC, otherwise known as channels-last, is a faster tensor memory format in PyTorch. Having data stored in the channels-last format accelerates operations in PyTorch. Formatting input tensors as channels-last reduces the overhead that is needed for conversion between different format types, resulting in faster inference.
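Switching a vision model and its inputs to channels-last only takes a memory-format conversion on each side. A minimal sketch:

```python
import torch

model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
model = model.to(memory_format=torch.channels_last)

# The tensor keeps its logical NCHW shape; only the underlying
# memory layout (strides) changes to NHWC.
x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)
out = model(x)

print(out.is_contiguous(memory_format=torch.channels_last))  # True
```

Note that the logical shape stays NCHW throughout; channels-last is purely a change in how the data is laid out in memory, so no other code needs to change.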
Optimize for inference: This TorchScript prototype implements some generic optimizations that should speed up models in all environments, and it can also prepare models for inference with build-specific settings. Primary use cases include vision models on CPUs (and GPUs) at this point. Since this is a prototype, you may run into issues; if you do, raise them on the PyTorch GitHub repository.
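As a sketch of the workflow (prototype API, so details may change between releases), you script the model first and then pass it through the optimizer:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).eval()

# Script the model, then apply generic inference optimizations
# such as conv-batchnorm folding (prototype API).
scripted = torch.jit.script(model)
optimized = torch.jit.optimize_for_inference(scripted)

out = optimized(torch.randn(1, 3, 16, 16))
print(out.shape)  # torch.Size([1, 8, 14, 14])
```

The model must be in eval mode before scripting, since optimizations like batch-norm folding are only valid for inference.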
Novel methods for accelerating AI workflows are regularly explored on the PyTorch blog. It's a great place to keep up with techniques like the recent BetterTransformer, which increases speed and throughput in Transformer models by up to 2x for common execution scenarios. If you're interested in learning how to implement specific features in PyTorch, the recipes page allows you to search by categories like model optimization, distributed training and interpretability. This post is only a sampling of how tools like PyTorch are moving open source and AI forward.
To stay up to date with the latest in Meta Open Source for artificial intelligence and machine learning, visit our open source site, subscribe to our YouTube channel, or follow us on Facebook, Twitter and LinkedIn.