
MIT researchers teach AI to locate personalized objects in scenes


Overview: Beyond general recognition

Vision-language models (VLMs) blend visual understanding with language processing, enabling them to recognize broad categories like “dog” or “car.” But users increasingly want these systems to locate a specific, personalized object—think your French bulldog Bowser or a child’s backpack—across different moments in time. A team from MIT and the MIT-IBM Watson AI Lab has introduced a training technique that teaches VLMs to localize personalized objects more reliably, even when the object must be identified from context rather than memorized labels.

The challenge: why personalization is hard for VLMs

Previous attempts to fine-tune VLMs often relied on datasets drawn from general, randomly sampled images. Such data lack coherence: a single object might appear against varying backgrounds across unrelated images, giving the model no consistent thread for following the same item through time. And while large language models (LLMs) show strong in-context learning, VLMs don’t automatically inherit this ability once a visual component is paired with language. The result is a gap: models can name categories well but struggle to locate the same personalized object in new scenes.

The approach: context-driven localization using video-tracking data

The researchers devised a dataset strategy that leverages video-tracking data. They curate sequences showing the same object across multiple frames, then convert these sequences into multi-image inputs paired with contextual questions about the object’s location. The key idea is to force the model to rely on environmental cues—lighting, shape, relative position, texture—rather than simply recalling a label associated with the object from pretraining.
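
As a rough illustration of this data-construction idea, the sketch below turns a tracked object’s frames into a multi-image training sample with a context-dependent localization question. The data structures, field names, and question template here are hypothetical simplifications for illustration, not the team’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TrackedFrame:
    image_path: str                          # one frame from the source video
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) of the tracked object

def build_sample(track: List[TrackedFrame], name: str, n_context: int = 3) -> Dict:
    """Turn one object track into a multi-image localization sample.

    The first n_context frames (with their boxes) serve as in-context
    examples; the final frame is the query whose box the model must predict.
    """
    context, query = track[:n_context], track[n_context]
    return {
        "images": [f.image_path for f in context] + [query.image_path],
        "context_boxes": [f.box for f in context],
        "question": (
            f"{name} is marked in the first {n_context} images. "
            f"Where is {name} in the last image?"
        ),
        "answer_box": query.box,  # supervision target for fine-tuning
    }
```

Because every frame in a track shows the same object in a changing scene, samples built this way reward the model for reading the surrounding context rather than recalling a category label.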

To prevent the model from “cheating” by exploiting pretrained associations (for example, linking the word “tiger” to a tiger in any frame), the team replaced real object names with pseudo-names like “Charlie.” This prevents the model from leaning on prior class knowledge and nudges it to infer location from context and visual cues present in the current scene.
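
One simple way to picture the pseudo-naming step is a substitution over the question text before training. The name pool, helper function, and example prompt below are illustrative assumptions rather than the published method.

```python
import random
from typing import Tuple

# Hypothetical pool of neutral pseudo-names standing in for real class labels.
PSEUDO_NAMES = ["Charlie", "Milo", "Nova", "Pixel"]

def anonymize(question: str, real_name: str, rng: random.Random) -> Tuple[str, str]:
    """Swap a real class label (e.g. 'tiger') for a pseudo-name (e.g. 'Charlie').

    Returning the chosen alias lets the same pseudo-name be reused consistently
    in every frame and question that refers to this object.
    """
    alias = rng.choice(PSEUDO_NAMES)
    return question.replace(real_name, alias), alias

rng = random.Random(0)
question = "tiger appears in every frame; locate tiger in the last one."
masked, alias = anonymize(question, "tiger", rng)
print(masked)  # e.g. "Charlie appears in every frame; locate Charlie in the last one."
```

With the class word removed, the only route to the right answer runs through the visual evidence in the frames themselves.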

Results: notable gains in personalized localization

Fine-tuning VLMs on the new dataset improved personalized localization accuracy by about 12 percent on average, with gains reaching 21 percent when pseudo-names were used. The improvements were especially pronounced as model size increased, suggesting the method scales well with larger architectures.

Importantly, the researchers reported that their technique preserves the model’s broad language-vision abilities while boosting its ability to locate a specific object across frames. This balance is crucial for real-world use, where a system must both recognize a concept and pinpoint a particular instance within a dynamic scene.

Potential applications and future directions

Contextual, on-the-fly localization could benefit a wide range of tasks. In daily life, AI assistants could help visually impaired users find a specific item in a room or track a child’s backpack through a crowd. In robotics or augmented reality, systems could monitor and locate a designated object across time without requiring repeated task-specific retraining. For ecological monitoring, researchers could track a particular animal across video feeds, reducing manual labeling effort.

The MIT team also plans to probe why in-context learning from base LLMs doesn’t automatically transfer to VLMs and to explore further mechanisms that boost localization without additional data collection. The work reframes few-shot localization as an instruction-tuning problem, underlining a data-centric path to more capable, context-aware AI systems.

Commentary from the researchers

Lead author Jehanzeb Mirza emphasizes that the ultimate goal is for models to learn from context as humans do: “If a model can infer how to perform the task from context after seeing a few examples, we could avoid retraining for every new task.” The study’s collaborators stress the practical significance of fast, instance-specific grounding for real-world workflows, from robotics to assistive technologies.