Meta's SAM 3: The Eyes for Language Models
SAM 3 can segment objects via a prompt. The AI model is fun as an editing tool, but also helpful for data labeling and essential for robotics.
Dogs segmented via a prompt.
(Image: Meta)
SAM stands for Segment Anything Model. With this AI model it is possible to segment objects in pictures and videos. The newest version of SAM, SAM 3, comes as three models: SAM 3, SAM 3D Objects and SAM 3D Body. We talked with Nikhila Ravi, Research Engineer at Meta, about use cases and how SAM works.
What exactly is SAM? Is it a new model? Is it gen AI? It’s not a Large Language Model itself.
So, SAM is really a series of models. We've been working on the Segment Anything project for the past four years. We released SAM1 in 2023. With SAM1 you could point or click on an object and it would predict the pixel-perfect boundary of that object. SAM1 was only for images. Then in 2024, we released SAM2, which was doing the same thing, but in videos. Now you can click on an object and get this pixel-perfect boundary of that object on every single frame of the video. SAM3 is a completely new model where you can specify what you want to outline with text or visual examples.
And by that I mean, instead of having to click on every single person, for example, you can now type the text "person" and the model will automatically find every single person and draw the boundary around them. So we've created an entirely new interface for segmentation. And maybe the crucial distinction compared to generative models is that SAM is predicting a pixel-wise probability that each pixel belongs to the target object. It's not generating any new pixels; it's like taking a highlighter and highlighting something that's already in the image.
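To make the highlighter analogy concrete, here is a minimal sketch in plain NumPy (not Meta's code): the model's output is a per-pixel probability map, which gets thresholded into a binary mask and used to tint pixels that are already there, without generating any new ones.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # stand-in RGB image
prob_map = rng.random((4, 4))                                 # stand-in per-pixel probabilities from a model

mask = prob_map > 0.5                                         # pixel-wise decision: object or not
highlighted = image.astype(float)
highlighted[mask] = 0.5 * highlighted[mask] + 0.5 * np.array([255.0, 255.0, 0.0])  # tint existing pixels
highlighted = highlighted.astype(np.uint8)

print(mask.astype(int))  # 1 where the pixel is predicted to belong to the object
```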
You can just write "person" as a prompt and SAM will find the person in the picture. Is there something like a Large Language Model inside SAM that is used for that process?
Yeah, that's a great question. One of the things we really wanted to do was enable this kind of open vocabulary text interface. But what we decided to do was actually limit it to short text phrases. "Person" is a very simple example; you can do much more complex things like "yellow school bus" or "red striped umbrella". But you can only use these short phrases of two or three words. The reason we do that is that we don't want to use a Large Language Model inside the model; we actually have a small text encoder. We use SAM3 for many real world use cases, including for some products at Meta, and it needs to run fast. We could have chosen to use an LLM, but we decided to constrain it to these short phrases so that we can run it fast for product use cases as well.
I tried the Playground and it was so much fun. But what else is SAM already used for?
Definitely precise image and video editing is very much a use case; that's what we highlighted in the playground and what we're using internally for Instagram Edits and other products. Internally we also use it for visual data labeling. SAM1 and SAM2 sped up data labeling a lot. Previously, you had to draw the boundary around the object manually. With SAM1, you could just click on the object and you would get the boundary. But if there were, say, five dogs in the image, you'd have to manually click on each of the five dogs. SAM3 is another step in speeding up that process, because now you can just type "dog" and get masks for all the dogs. And as part of the SAM3 launch we actually collaborated with Roboflow, a visual data labeling company that's been integrating SAM1, SAM2, and now SAM3.
They're very excited about how much it's going to speed up data labeling workflows for all their customers, and these are diverse real world use cases. In medicine, for example, many scientists have to count cells, more specifically how many cells are on a microscope slide, and they have to do it manually. With SAM3 you can now just use a text prompt like "cell", or you can draw a box around one example and SAM3 will find all the other instances (a small sketch of the counting step follows below).
(Image: Meta)
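A hedged sketch of that counting step, assuming the model has already returned one binary mask per detected instance; the actual SAM 3 output format may differ:

```python
import numpy as np

# stand-in output: three instance masks on a tiny 6x6 "slide"
masks = [np.zeros((6, 6), dtype=bool) for _ in range(3)]
masks[0][0:2, 0:2] = True
masks[1][3:5, 1:3] = True
masks[2][2:4, 4:6] = True

cell_count = len(masks)                    # one mask per detected cell
areas_px = [int(m.sum()) for m in masks]   # size of each cell in pixels

print(f"cells found: {cell_count}, areas: {areas_px}")
```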
There are industrial use cases, too. Robotics is another big application that's very top of mind right now. Especially the video capabilities and the real-time aspect are very interesting, because as you're navigating an environment you need to find all the different objects. Say you have a robot that can pick up objects: you need to know where the objects are. And data labeling could be for anything, for example an industrial manufacturing line where a company builds some new component and wants to count how many components are being manufactured.
There was a time when humans had to click on everything and describe everything, and now it's way easier and faster, right?
Yeah, it automates that process. Previously you had to do everything manually. Now we can have this model in the loop: you prompt the model, it gives you a response, you maybe make a few corrections, but about 80 percent of the prediction is already correct. That speeds up the whole process.
Now we have SAM used in the Playground, we have SAM used for data labeling. Developers can use SAM3, too. What’s next? In which direction are you working?
The research part is really the fundamental innovation of creating this new interface. We like to think about it in terms of: what is a fundamental innovation that then unlocks many new downstream use cases? And so the editing use cases are fun, but all these real world use cases really show that the model has generalization capability.
The developer tools, so to speak, the code and the models, we like to release because we benefit from the community building on top of them as well. Some of the things the community built on top of SAM2 are what we actually used for SAM3: new benchmarks, some model improvements the open source community made, and some new datasets the community built.
That's why you're keeping the open source strategy, right?
Yes, for SAM it's been really impactful to have that component.
What is the next bigger, maybe fundamental problem? Is there a bigger picture you are working towards?
I think one of the things we showed in the SAM3 paper is how SAM3 can be a visual primitive for MLLMs, multimodal large language models. SAM3 is really good at localizing, that is, predicting the pixel-wise mask, while MLLMs are really good at reasoning and planning and have all this additional world knowledge. And we've shown how you can combine an MLLM with SAM3 to do more complex visual tasks. So SAM3 is kind of like the eyes, the MLLM is like the brain, and they're working together. We showed this experiment in the SAM3 paper, and I think that is definitely a very interesting direction going forward.
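A rough sketch of that division of labor; both helper functions below are hypothetical placeholders rather than a real Meta API. The idea is that the MLLM decides which short phrases to look for, the segmenter localizes them, and the MLLM then reasons over the results.

```python
from typing import Dict, List

def mllm_plan(question: str) -> List[str]:
    """Hypothetical MLLM call: break a complex question into short noun phrases."""
    return ["mug", "laptop"]  # placeholder output

def sam3_segment(image, phrase: str) -> List[Dict]:
    """Hypothetical segmentation call: one mask and box per instance of the phrase."""
    return [{"phrase": phrase, "box": (0, 0, 10, 10), "mask": None}]  # placeholder

def answer(image, question: str) -> str:
    detections = []
    for phrase in mllm_plan(question):                   # "brain": decide what to look for
        detections.extend(sam3_segment(image, phrase))   # "eyes": localize it in the image
    # "brain" again: reason over the localized objects (distances, counts, relations)
    return f"found {len(detections)} relevant objects for: {question!r}"

print(answer(image=None, question="Which mug is closest to the laptop?"))
```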
And do you think this is needed for an AGI or an AMI or Superintelligence or something like that?
Definitely for robotics. Robotics is a nice example because it encapsulates a lot of different use cases. To have embodied agents that can navigate the world and perform tasks, you need localization and tracking of objects, so there's definitely an important capability needed there. And then just broadly: how do you connect vision and language together more closely? Because the amount of visual data we have in the world is significantly larger than the amount of text data. Being able to connect and understand visual content as deeply as we do text is key. We need the eyes, and the eyes at the moment are still very primitive compared to human eyes.
In the near future, what can we expect?
There are some shorter-term things that we want to do, like trying to make the model even a bit faster. Right now it's very fast on images. On videos, it's real time for about five objects, but then the inference time scales with the number of objects. So we have some ideas about how to make inference faster. There's some low-hanging fruit that we want to pick.
I wondered if cutouts, maybe for products in online shops, are a use case? Or is that way too simple for SAM? Because there is a SAM 3D, too.
SAM 3D is a separate model. We actually released three different models: SAM 3, SAM 3D Objects, and SAM 3D Body. For the use case that you mentioned, SAM 3 plus SAM 3D Objects could be a nice application. We actually did this with Facebook Marketplace. We built a capability where anyone selling home decor products on Facebook Marketplace has the option to turn their listing into a 3D object that the buyer can then view in augmented reality. This used SAM3, because we needed to mask the object, and then SAM 3D to actually lift that into 3D. This was a project I was particularly excited to be involved in, because it's something I don't think we could have imagined being able to do five years ago, and now we can do it.
(Image: Meta)
And what are the limitations? I think I read about hands being the problem for SAM 3D Body. It's always the hands.
There are different limitations for the different models. I think the SAM3 model's limitation clearly is the short text phrases, so that's something we hope to solve. There are also very niche domains that require specialized knowledge, for example how to interpret X-rays. We didn't incorporate that kind of knowledge into the model because we don't have data for that. So for those kinds of use cases, people will have to collect data and fine-tune the model. But we do provide instructions on how to do fine-tuning in the code release for SAM 3. For SAM 3D, there's the hand reconstruction. There are also efforts to improve the speed of the SAM 3D models so that they can run faster.
(emw)