Picking out separate objects in a visual scene seems intuitive to us, but machines struggle with this task. Now a new AI model from Meta has developed a broad idea of what an object is, allowing it to separate out objects even if it’s never seen them before.
It might seem like a fairly prosaic computer vision task, but parsing an image and working out where one object ends and another begins is a fundamental skill; without it, a host of more complicated tasks would be intractable.
“Object segmentation” is nothing new; AI researchers have worked on it for years. But typically, building these models has been a time-consuming process requiring lots of human annotation of images and considerable computing resources. The resulting models were also usually highly specialized to particular use cases.
Now though, researchers at Meta have unveiled the Segment Anything Model (SAM), which is able to cut out any object in any scene, regardless of whether it’s seen anything like it before. The model can also do this in response to a variety of different prompts, from text descriptions to mouse clicks or even eye-tracking data.
“SAM has learned a general notion of what objects are, and it can generate masks for any object in any image or any video,” the researchers wrote in a blog post. “We believe the possibilities are broad, and we are excited by the many potential use cases we haven’t even imagined yet.”
Key to the development of the model was a massive new dataset of 1.1 billion segmentation masks. A mask marks a region of an image that has been isolated and annotated to denote that it contains a particular object. The dataset was created through a combination of manual human annotation of images and automated processes, and is by far the largest collection of this type assembled to date.
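Concretely, a segmentation mask can be represented as a boolean array the same shape as the image, flagging exactly the pixels that belong to one object. A minimal NumPy sketch (the toy image and object below are illustrative, not drawn from Meta’s dataset):

```python
import numpy as np

# A tiny 6x6 grayscale "image": a bright 3x3 square (the "object")
# on a dim background.
image = np.full((6, 6), 30, dtype=np.uint8)
image[1:4, 2:5] = 200

# A segmentation mask is a boolean array of the same shape that is
# True exactly where the object's pixels are.
mask = image > 100

# The mask isolates the object: count its pixels and cut it out,
# zeroing everything else.
num_object_pixels = int(mask.sum())  # 9 pixels in the 3x3 square
cutout = np.where(mask, image, 0)

print(num_object_pixels)  # 9
```

A dataset entry pairs an image with many such masks, one per annotated object.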
By training on such a massive dataset, Meta’s researchers say the model has developed a general concept of what an object is, which allows it to segment things it hasn’t even seen before. This ability to generalize led the researchers to dub SAM a “foundation model,” a controversial term used to describe other massive pre-trained models such as OpenAI’s GPT series, whose capabilities are supposedly so general they can serve as the foundations for a host of applications.
Image segmentation is a key ingredient in a wide range of computer vision tasks. If you can’t separate out the different components of a scene, it’s hard to do anything more complicated with it. In their blog, the researchers say the model could prove invaluable in video and image editing, or help with the analysis of scientific imagery.
Perhaps more pertinently for the company’s metaverse ambitions, they provide a demo of how it could be used in conjunction with a virtual reality headset to select specific objects based on the user’s gaze. They also say it could potentially be paired with a large language model to create a multi-modal system able to understand both the visual and textual content of a web page.
The ability to deal with a wide range of prompts makes the system particularly flexible. In a web page demoing the new model, the company shows that after analyzing an image it can be prompted to separate out specific objects by simply clicking on them with a mouse cursor, typing a description of what you want to segment, or just breaking up the entire image into separate objects.
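The click-prompt idea can be sketched in miniature: given a set of candidate object masks for an image, a click selects whichever mask contains that pixel. This is a conceptual toy, not Meta’s implementation; the real model predicts masks directly from the prompt, and its open-source interface takes point coordinates rather than precomputed masks.

```python
import numpy as np

def select_mask_by_click(masks, click):
    """Return the candidate mask containing the clicked pixel (row, col).

    `masks` is a list of boolean arrays, one per candidate object. A real
    promptable model would score overlapping candidates; this toy simply
    returns the first match, or None if the click lands on background.
    """
    row, col = click
    for mask in masks:
        if mask[row, col]:
            return mask
    return None

# Two toy object masks in a 5x5 image: a square and a single pixel.
square = np.zeros((5, 5), dtype=bool)
square[1:3, 1:3] = True
dot = np.zeros((5, 5), dtype=bool)
dot[4, 4] = True

selected = select_mask_by_click([square, dot], click=(2, 2))
print(selected is square)  # True: the click landed inside the square
```

Text prompts work analogously in spirit, with the description rather than a pixel steering which mask the model produces.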
And most importantly, the company is open-sourcing both the model and the dataset for research purposes so that others can build on its work. This is the same approach the company took with its LLaMA large language model, which was rapidly leaked online and spurred a wave of experimentation by hobbyists and hackers.
Whether the same will happen with SAM remains to be seen, but either way it’s a gift to the AI research community that could accelerate progress on a host of important computer vision problems.
Image Credit: Meta AI