Segment Anything for Videos
Track-Anything applies SAM, XMem, and E2FGVI to videos, providing video inpainting capabilities, as well as video and multi-object tracking with segmentation masks.
Track-Anything applies SAM, XMem, and E2FGVI to videos instead of still images. It's capable of video inpainting as well as video and multi-object tracking with segmentation masks. The project can be run from the command line or via Hugging Face Spaces.
Though at the time of this writing the Hugging Face demo appears to have run out of memory, the README does include demonstrations and examples.

The user interface lets you track specific objects simply by clicking on them, and it even lets you correct the masks interactively! A rough sketch of that click-prompt step is shown below.
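To make the click interaction concrete, here is a minimal sketch of a single click prompt using the official segment-anything package. The checkpoint filename, video path, and click coordinates are placeholder assumptions, and this only covers the first-frame prompt: Track-Anything layers XMem mask propagation and E2FGVI inpainting on top of this step.

```python
# A minimal sketch of the click-prompt step, assuming the official
# `segment-anything` package and an OpenCV-readable local video file.
# The checkpoint path, video path, and click coordinates are assumptions.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint and wrap it in a promptable predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read the first frame of the video and hand it to SAM (RGB expected).
cap = cv2.VideoCapture("input.mp4")
ok, frame = cap.read()
cap.release()
predictor.set_image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# A single positive click on the object of interest; a negative click
# (label 0) is how an interactive correction would be expressed.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[640, 360]]),  # (x, y) of the clicked pixel
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=False,
)
object_mask = masks[0]  # boolean mask for the selected object
```

In Track-Anything, this first-frame mask is then propagated across the remaining frames by XMem, and E2FGVI can optionally fill in the masked region to remove the object from the video entirely.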
SAM for videos would be even more useful if it eventually supported other detection types, such as keypoint/pose estimation. Training a Pose Anything Model (PAM) would most likely be more difficult, though, since keypoints vary from object to object.
And what if we could generalize this model even further? What if, for any image or video, a model could generate pose coordinates, instance/semantic/panoptic segmentation masks, a bounding box, and an identity for each instance within that image or video frame? And what if it were combined with their Caption-Anything model? For any image or video, objects could be identified, segmented, bounded, classified, and, on top of all that, described!
This research direction seems reminiscent of the general-purpose, do-it-all LLM: instead of building a dozen task-specific models, one LLM might handle every task just as well, if not better.
The greater takeaway here is this: what if we had one super-model that generalized across a wide variety of computer vision tasks, the way LLMs do in NLP?