AI is creeping into everything from smartphones to cars, but tailoring neural networks to each new bit of hardware is expensive and contributes to AI’s rapidly growing carbon footprint. So MIT researchers found a way to train a single network that can run on many different kinds of processors.
The potential environmental impact of AI was brought into stark relief last year, when researchers at the University of Massachusetts, Amherst published a paper showing that the energy required to train a single neural network could lead to CO2 emissions nearly five times the lifetime emissions of the average American car.
That study focused on the leading natural language processing models, which are vast compared to the average neural network. But training even smaller networks can have a significant environmental impact, and if you want to deploy them on a range of different devices the networks have to be tailored for each one.
While this process can be done manually, it increasingly relies on energy-intensive “neural architecture search” (NAS) algorithms. These techniques automate the process of trying out different network architectures to find ones perfectly suited to the device’s hardware constraints, like memory capacity, processor size, or battery life.
With the growth of the Internet of Things and efforts to push AI into edge devices like smartphones and smart speakers, the cost of doing this is ballooning. So the MIT team decided to devise a different approach that trains a single “once-for-all” (OFA) neural network that contains many smaller sub-networks suited to different kinds of hardware.
This could significantly reduce the number of models that need to be trained. And in a paper presented at the International Conference on Learning Representations last week, the researchers showed the OFA net can actually outperform architectures produced by several state-of-the-art NAS approaches.
Obviously, simultaneously optimizing so many architectures is not simple, because every tweak made to one sub-network will have knock-on effects on the others. And the researchers found that each OFA net can be made up of more than 10 quintillion different architectures that all have to fulfill the same task.
Their solution was a new “progressive shrinking” algorithm that starts by optimizing the biggest possible network for the task at hand. It then fine-tunes that network so that it features a slightly smaller sub-network that can also solve the task without impacting the performance of the larger one. This process is repeated over and over to produce many networks of varying sizes, all nested within each other like Russian dolls.
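To make the Russian-doll idea concrete, here is a minimal toy sketch in Python. It is not the researchers' code: the "network" is just a shared list of weights, a sub-network of a given width uses only the first few of them, and the task, the widths, the learning rate, and every function name are assumptions made up for illustration. Training starts with the largest width, then unlocks progressively smaller ones while still sampling the larger ones so they keep their performance.

```python
import random

random.seed(0)

FULL_WIDTH = 8  # hypothetical size of the largest network
# Toy task (an assumption): the target is a weighted sum of the inputs,
# with early inputs mattering most.
TRUE_W = [1.0 / (i + 1) for i in range(FULL_WIDTH)]


def make_batch(n=32):
    """Draw n random inputs and their noise-free targets."""
    xs = [[random.gauss(0, 1) for _ in range(FULL_WIDTH)] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(TRUE_W, row)) for row in xs]
    return xs, ys


def predict(weights, row, width):
    # A sub-network of a given width uses only the first `width` shared weights.
    return sum(weights[i] * row[i] for i in range(width))


def mse(weights, width, xs, ys):
    """Mean squared error of the sub-network with the given width."""
    return sum((predict(weights, r, width) - y) ** 2 for r, y in zip(xs, ys)) / len(ys)


def sgd_step(weights, width, xs, ys, lr=0.05):
    """One gradient step that updates only the weights this sub-network uses."""
    grads = [0.0] * width
    for row, y in zip(xs, ys):
        err = predict(weights, row, width) - y
        for i in range(width):
            grads[i] += 2 * err * row[i] / len(ys)
    for i in range(width):
        weights[i] -= lr * grads[i]


def progressive_shrinking(widths=(8, 6, 4, 2), steps_per_phase=300):
    """Train the largest width first, then unlock smaller ones phase by phase."""
    weights = [0.0] * FULL_WIDTH
    unlocked = []
    for width in widths:
        unlocked.append(width)
        for _ in range(steps_per_phase):
            xs, ys = make_batch()
            # Sample among every size unlocked so far, so the larger
            # sub-networks continue to be trained alongside the new one.
            sgd_step(weights, random.choice(unlocked), xs, ys)
    return weights


if __name__ == "__main__":
    shared = progressive_shrinking()
    test_xs, test_ys = make_batch(256)
    for w in (8, 6, 4, 2):
        print(f"width {w}: test error {mse(shared, w, test_xs, test_ys):.4f}")
```

In this toy, one set of shared weights ends up serving every width, with the smaller sub-networks trading some accuracy for size. Real convolutional networks can presumably be shrunk in more ways than a single width setting, which is part of why the number of possible sub-networks gets so large.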
When it comes to deploying the AI on a particular device, a simple search algorithm trawls through all of these sub-networks to find one suitable for that processor. In tests on a Samsung Note8, Google Pixel 1 and 2, three NVIDIA GPUs, and an Intel Xeon CPU, an OFA net trained to classify images matched or beat networks tailored individually by leading NAS approaches.
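The deployment step can be pictured as picking the most accurate sub-network that fits a device's budget. The Python sketch below is purely illustrative: the search space of depths and widths, the latency and accuracy formulas, and all names are hypothetical stand-ins, not figures or methods from the paper.

```python
from itertools import product

# Hypothetical search space: each sub-network is described by (depth, width).
DEPTHS = (2, 3, 4)
WIDTHS = (16, 32, 64)


def estimated_latency_ms(depth, width, device_speed):
    # Made-up cost model: work grows with depth * width,
    # and a faster device divides it down.
    return depth * width / device_speed


def estimated_accuracy(depth, width):
    # Made-up proxy: bigger sub-networks score higher, with diminishing returns.
    return 1.0 - 1.0 / (depth * width) ** 0.5


def pick_subnet(device_speed, latency_budget_ms):
    """Return the best (depth, width) within the budget, or None if nothing fits."""
    feasible = [
        (d, w)
        for d, w in product(DEPTHS, WIDTHS)
        if estimated_latency_ms(d, w, device_speed) <= latency_budget_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda cfg: estimated_accuracy(*cfg))


if __name__ == "__main__":
    # Hypothetical device speeds; the same trained net yields a different pick for each.
    for name, speed in (("slow phone", 4.0), ("fast GPU", 100.0)):
        print(name, "->", pick_subnet(speed, latency_budget_ms=10.0))
```

This toy space has only nine candidates, so checking every one is trivial. A space containing quintillions of sub-networks could not be enumerated this way and would need a more selective search.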
The approach is still very computationally intensive. Training the OFA net took 1,200 GPU hours, compared to half that for most of the NAS approaches the researchers compared against in their paper. As they point out, though, as soon as you’re training networks for more than a handful of devices, the stats start to look a lot more favorable, because the OFA training cost stays the same while the cost of the other approaches rises linearly with the number of devices.
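The arithmetic is easy to check with the article's round numbers. The short Python sketch below assumes that a NAS run costs exactly 600 GPU hours and has to be repeated for every device, and it ignores whatever the OFA search step costs per device. Both are simplifications for illustration.

```python
OFA_TRAIN_GPU_HOURS = 1200      # one-off training cost quoted above
NAS_GPU_HOURS_PER_DEVICE = 600  # "half that", assumed to be paid again per device


def total_gpu_hours(num_devices):
    """Return (ofa_hours, nas_hours) for a given number of target devices."""
    ofa_hours = OFA_TRAIN_GPU_HOURS  # flat: train once, reuse everywhere
    nas_hours = NAS_GPU_HOURS_PER_DEVICE * num_devices  # grows linearly
    return ofa_hours, nas_hours


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 40):
        ofa, nas = total_gpu_hours(n)
        print(f"{n:>2} devices: OFA {ofa:>5} h, per-device NAS {nas:>5} h")
```

Under these assumptions the two approaches break even at two devices, and the single OFA net is cheaper from the third device onward.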
Nonetheless, this means the approach will only prove useful if models are being deployed on many different kinds of hardware. At present that’s not so common, as most AI is still being implemented in big centralized servers.
But things are changing as we try to build smarts into ever more products, and as companies respond to privacy-conscious customers who want to keep their data on their devices rather than sending it to the cloud. When that trend catches on, finding a way to cut both the financial and environmental costs could be crucial.