If you want to see what’s next in AI, just follow the data. ChatGPT and DALL-E trained on troves of internet data. Generative AI is making inroads in biotechnology and robotics thanks to existing or newly assembled datasets. One way to glance ahead, then, is to ask: What colossal datasets are still ripe for the picking?
Recently, a new clue emerged.
In a blog post, gaming company Niantic said it’s training a new AI on millions of real-world images collected by Pokémon Go players and by users of its Scaniverse app. Inspired by the large language models powering chatbots, the company calls its algorithm a “large geospatial model” and hopes it’ll be as fluent in the physical world as ChatGPT is in the world of language.
Follow the Data
This moment in AI is defined by algorithms that generate language, images, and increasingly, video. With OpenAI’s DALL-E and ChatGPT, anyone can use everyday language to get a computer to whip up photorealistic images or explain quantum physics. Now, the company’s Sora algorithm is applying a similar approach to video generation. Others are competing with OpenAI, including Google, Meta, and Anthropic.
The crucial insight that gave rise to these models: The rapid digitization of recent decades is useful for more than entertaining and informing us humans—it’s food for AI too. Few would have viewed the internet in this way at its advent, but in hindsight, humanity has been busy assembling an enormous educational dataset of language, images, code, and video. For better or worse—there are several copyright infringement lawsuits in the works—AI companies scraped all that data to train powerful AI models.
Now that they know the basic recipe works well, companies and researchers are looking for more ingredients.
In biotech, labs are training AI on collections of molecular structures built over decades and using it to model and generate proteins, DNA, RNA, and other biomolecules to speed up research and drug discovery. Others are testing large AI models in self-driving cars and in warehouse and humanoid robots—both as a better way to tell robots what to do and as a way to teach them how to navigate and move through the world.
Of course, for robots, fluency in the physical world is crucial. Just as language is endlessly complex, so too are the situations a robot might encounter. Robot brains coded by hand can never account for all the variation. That’s why researchers are now building large datasets with robots in mind. But they’re nowhere near the scale of the internet, where billions of humans have been working in parallel for a very long time.
Might there be an internet for the physical world? Niantic thinks so. It’s called Pokémon Go. But the hit game is only one example. Tech companies have been creating digital maps of the world for years. Now, it seems likely those maps will find their way into AI.
Pokémon Trainers
Released in 2016, Pokémon Go was an augmented reality sensation.
In the game, players track down digital characters—or Pokémon—that have been placed all over the world. Using their phones as a kind of portal, players see characters superimposed on a physical location—say, sitting on a park bench or loitering by a movie theater. A newer offering, Pokémon Playground, allows users to embed characters at locations for other players. All this is made possible by the company’s detailed digital maps.
Niantic’s Visual Positioning System (VPS) can determine a phone’s position down to the centimeter from a single image of a location. In part, VPS assembles 3D maps of locations using classical techniques, but the system also relies on a network of machine learning algorithms—one or more per location—trained on years of player images and scans taken at various angles, times of day, and seasons, each stamped with a position in the world.
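Niantic hasn’t published code, but the core job of each per-location model is easy to picture: take an image and return a camera pose. The sketch below is a rough, hypothetical illustration of that idea, assuming a simple supervised pose-regression setup; the architecture, names, and training details are mine, not Niantic’s.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LocationPoseNet(nn.Module):
    """Hypothetical per-location model: image -> camera pose (position + orientation)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()           # reuse the CNN purely as a feature extractor
        self.backbone = backbone
        self.position = nn.Linear(512, 3)     # x, y, z relative to the location's local map
        self.orientation = nn.Linear(512, 4)  # orientation as a quaternion

    def forward(self, image):
        features = self.backbone(image)
        return self.position(features), self.orientation(features)

# Training pairs would be player photos stamped with the pose recorded at capture time.
model = LocationPoseNet()
images = torch.randn(8, 3, 224, 224)                         # a batch of scans of one location
pred_pos, pred_rot = model(images)
loss = nn.functional.mse_loss(pred_pos, torch.zeros(8, 3))   # placeholder target positions
```

In VPS, one or more models like this exist for every mapped location; the foundation-model idea described next would replace that fleet with a single shared network.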
“As part of Niantic’s Visual Positioning System (VPS), we have trained more than 50 million neural networks, with more than 150 trillion parameters, enabling operation in over a million locations,” the company wrote in its recent blog post.
Now, Niantic wants to go further.
Instead of millions of individual neural networks, they want to use Pokémon Go and Scaniverse data to train a single foundation model. Whereas individual models are constrained by the images they’ve been fed, the new model would generalize across all of them. Confronted with the front of a church, for example, it would draw on all the churches and angles it’s seen—front, side, rear—to visualize parts of the church it hasn’t been shown.
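In code terms, the shift is from millions of separate networks to one set of shared weights that can be queried about any place. The toy sketch below captures that idea only in spirit, with a made-up architecture and hypothetical names: a single model pools whatever views of a place it is given and guesses at a viewpoint it hasn’t seen.

```python
import torch
import torch.nn as nn

class GeospatialFoundationModel(nn.Module):
    """Illustrative sketch only: one shared network for every location,
    trained to predict unseen viewpoints from observed ones."""
    def __init__(self, dim=256):
        super().__init__()
        self.encode_view = nn.Sequential(nn.Linear(3 * 64 * 64, dim), nn.ReLU())
        self.encode_pose = nn.Linear(7, dim)          # position + quaternion of the query view
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 3 * 64 * 64))

    def forward(self, observed_views, query_pose):
        # Pool whatever views of a place exist (the front of the church, say) ...
        scene = self.encode_view(observed_views.flatten(1)).mean(dim=0, keepdim=True)
        # ... and render a guess at a viewpoint it was never shown (the rear).
        return self.decoder(scene + self.encode_pose(query_pose))

model = GeospatialFoundationModel()
views = torch.randn(5, 3, 64, 64)      # five scans of one church
query = torch.randn(1, 7)              # a viewpoint no one has scanned
predicted_view = model(views, query)   # a flattened 64x64 image guess
```

The point of the single model is that regularities learned from millions of scanned churches, benches, and storefronts can fill in the unseen parts of any particular one.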
This is a bit like what we humans do as we navigate the world. We might not be able to see around a corner, but we can guess what’s there—it might be a hallway, the side of a building, or a room—and plan for it, based on our point of view and experience.
Niantic writes that a large geospatial model would allow it to improve augmented reality experiences. But it also believes such a model might power other applications, including in robotics and autonomous systems.
Getting Physical
Niantic believes it’s in a unique position because it has an engaged community contributing a million new scans a week. In addition, those scans are captured from a pedestrian’s point of view rather than from the street, as with Google Maps or self-driving cars. They’re not wrong.
If we take the internet as an example, then the most powerful new datasets may be collected by millions, or even billions, of humans working in concert.
At the same time, Pokémon Go isn’t comprehensive. Though locations span continents, they’re sparse in any given place, and whole regions are completely dark. Further, other companies, perhaps most notably Google, have long been mapping the globe. But unlike the internet, these datasets are proprietary and splintered.
Whether that matters—that is, whether an internet-sized dataset is needed to make a generalized AI that’s as fluent in the physical world as LLMs are in the verbal—isn’t clear.
But it’s possible a more complete dataset of the physical world arises from something like Pokémon Go, only supersized. This has already begun with smartphones, which have sensors to take images, videos, and 3D scans. In addition to AR apps, users are increasingly being incentivized to use these sensors with AI, like taking a picture of a fridge and asking a chatbot what to cook for dinner. New devices, like AR glasses, could expand this kind of usage, yielding a data bonanza for the physical world.
Of course, collecting data online is already controversial, and privacy is a big issue. Extending those problems to the real world is less than ideal.
After 404 Media published an article on the topic, Niantic added a note, “This scanning feature is completely optional—people have to visit a specific publicly-accessible location and click to scan. This allows Niantic to deliver new types of AR experiences for people to enjoy. Merely walking around playing our games does not train an AI model.” Other companies, however, may not be as transparent about data collection and use.
It’s also not certain that building new algorithms inspired by large language models will be straightforward. MIT, for example, recently built a new architecture aimed specifically at robotics. “In the language domain, the data are all just sentences,” Lirui Wang, the lead author of a paper describing the work, told TechCrunch. “In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture.”
Regardless, researchers and companies will likely continue exploring areas where LLM-like AI may be applicable. And perhaps as each new addition matures, it will be a bit like adding a brain region—stitch them together and you get machines that think, speak, write, and move through the world as effortlessly as we do.
Image: Kamil Switalski on Unsplash