Categories: Technology / Artificial Intelligence

New Method Lets AI Locate Personalised Objects Across Scenes

A collaboration between MIT and the MIT-IBM Watson AI Lab has yielded a practical breakthrough in generative vision-language models (VLMs). The researchers developed a training regime that enables these models to locate personalised objects across different scenes, a capability such models have traditionally found difficult. The work aims to bridge the gap between recognising generic object categories and identifying a specific item, such as a particular cat or chair, amid varying lighting, poses, and backgrounds.

How the Method Works

Vision-language models are typically strong at classifying objects into broad categories such as dogs or chairs. However, they struggle when asked to pinpoint your particular dog or your own chair within a busy scene. To address this, the MIT-led team reframed the problem as a fine-tuning task anchored in video-tracking data. In a tracking sequence, the same object appears across many consecutive frames, providing a natural way to teach the model what uniquely identifies that object beyond generic category cues.
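The article does not publish the data pipeline, but the idea can be sketched. The snippet below is a minimal, hypothetical illustration of how frames from one video track might be assembled into a fine-tuning example: a few context frames introduce the object under a pseudo-name, and a later query frame asks the model to localise it again. The class names, bounding-box format, and the "<image>" placeholder convention are assumptions for illustration, not the authors' actual code.

```python
# Hypothetical sketch: turning one video track into a personalised-localisation
# fine-tuning example (a few context frames + one query frame). Not the authors'
# pipeline; names and formats are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalised to [0, 1]

@dataclass
class TrackedFrame:
    image_path: str   # path to the video frame
    box: Box          # where the tracked object appears in this frame

@dataclass
class LocalizationExample:
    context: List[TrackedFrame]  # frames that introduce the personalised object
    query: TrackedFrame          # frame in which the model must find it again
    pseudo_name: str             # e.g. "Charlie", deliberately not a category label

def build_example(track: List[TrackedFrame], pseudo_name: str,
                  n_context: int = 3) -> LocalizationExample:
    """Use the first few frames of a track as context and the last frame as the query."""
    if len(track) <= n_context:
        raise ValueError("track too short for a context/query split")
    return LocalizationExample(context=track[:n_context],
                               query=track[-1],
                               pseudo_name=pseudo_name)

def to_prompt(example: LocalizationExample) -> str:
    """Render the example as interleaved image/text input; <image> stands in for pixels."""
    lines = [f"<image> {example.pseudo_name} is at {frame.box}."
             for frame in example.context]
    lines.append(f"<image> Where is {example.pseudo_name}?")
    return "\n".join(lines)

if __name__ == "__main__":
    track = [TrackedFrame(f"frame_{i:03d}.jpg", (0.1 + 0.05 * i, 0.2, 0.4 + 0.05 * i, 0.6))
             for i in range(6)]
    example = build_example(track, pseudo_name="Charlie")
    print(to_prompt(example))
    print("supervision target:", example.query.box)  # the box the model should predict
```

During fine-tuning, the model would be rewarded for predicting the query frame's box, so it has to learn what makes this object this object, rather than matching a category word.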

Crucially, the researchers referred to objects by pseudo-names, such as “Charlie”, rather than by their real names. This prevents the model from falling back on memorised word-object associations. With these arbitrary identifiers, the model must reason about context: scene layout, appearance cues, occlusions, and the object’s position relative to its surroundings. The net effect is a model that generalises better to new scenes without becoming biased toward memorised labels.
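To make the idea concrete, here is a small, hypothetical illustration of the pseudo-name substitution itself: the real category word in a training caption is swapped for an arbitrary identifier, so the text no longer gives away what the object is. The name pool and replacement logic below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical illustration of pseudo-name substitution: replace the category
# word in a caption with an arbitrary identifier (e.g. "dog" -> "Charlie"),
# so the model must rely on visual context rather than the label itself.
import random

PSEUDO_NAMES = ["Charlie", "Milo", "Nova", "Pip", "Quill"]  # assumed name pool

def assign_pseudo_name(rng: random.Random) -> str:
    """Draw an identifier that carries no information about the object's true category."""
    return rng.choice(PSEUDO_NAMES)

def rewrite_caption(caption: str, real_label: str, pseudo_name: str) -> str:
    """Swap 'the <label>' for the pseudo-name in a training caption."""
    return caption.replace(f"the {real_label}", pseudo_name)

if __name__ == "__main__":
    rng = random.Random(42)
    name = assign_pseudo_name(rng)
    print(rewrite_caption("Where is the dog in this frame?", "dog", name))
    # e.g. -> "Where is Charlie in this frame?"
```

In practice a fresh name would be drawn per object, so the identifier never correlates with any particular category across the training set.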

What the Results Show

The training approach produced a noticeable gain in personalised localisation. On average, models trained with this method showed a 12% improvement in correctly localising personalised objects across new scenes. In some configurations, particularly those using pseudo-names, the gains reached 21%. Importantly, these improvements did not come at the expense of the model’s overall object recognition performance: the balance between specialised localisation and general recognition was preserved, demonstrating the method’s practicality for real-world applications.

Why This Matters for Real-World Applications

The ability to localise personalised objects has wide-reaching implications across several sectors. In smart homes, cameras could reliably recognise a pet or a cherished item, improving security and convenience. For assistive technology, systems could help visually impaired users find specific objects in cluttered environments, enhancing independence and safety. In robotics, precise localisation of a chosen object could streamline manipulation tasks and improve interaction with human operators. Beyond households, surveillance systems could better track specific animals for ecological monitoring or study, while research deployments in diverse environments could benefit from improved generalisation with limited example data.

Why The Pseudo-Name Strategy Helps

The use of pseudo-names stands out as a simple yet effective strategy. It steers the model away from memorising exact label-object mappings and encourages a more nuanced inference based on contextual clues. This aligns with how humans identify items—by combining appearance, position, motion, and scene context rather than relying solely on a label. The result is robust localisation across scenes with minimal retraining when new objects are introduced.

Future Directions

Researchers may explore extending this approach to a broader range of personalised objects and more dynamic environments. Additional work could investigate how this method interacts with multimodal inputs, such as audio cues, to further reinforce object localisation. As AI systems become more capable of following personalised identifiers, the line between generic recognition and personalised understanding will continue to blur, opening doors to smarter, more adaptive computer vision in everyday devices.