Re-Aligning Language to Visual Objects
with an Agentic Workflow

1 VCIP, CS, Nankai University    2 SenseTime    
*Equal Contribution     #Corresponding Author

ICLR 2025

🔥[Keypoint] Beyond serving as assistants that enhance human productivity, agents hold a deeper value: they can establish workflows that act as a flywheel, sustaining high-value data assets across AI industries. Our paper applies this idea in the multimodal domain to demonstrate that potential.

⭐️Specifically, our workflow is designed to enhance the performance of language-based object detection (LOD) models from a data-centric perspective by improving the alignment quality between language and visual objects.

Abstract

Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is used to improve the generalization of LOD models. During training, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, which facilitates scaling up the training data. In this process, we observe that VLM hallucinations introduce inaccurate object descriptions (e.g., object name, color, and shape) that degrade vision-language (VL) alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM that re-aligns language to visual objects by adaptively adjusting image and text prompts. We name this workflow Real-LOD; it includes planning, tool use, and reflection steps. Given an image with detected objects and raw VLM language expressions, Real-LOD automatically reasons about its state and arranges an action based on our neural symbolic designs (i.e., planning). The action adaptively adjusts the image and text prompts and sends them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted cyclically to gradually improve the language descriptions so that they re-align to the visual objects. We construct a dataset that contains a small amount of data, 0.18M images with re-aligned language expressions, and train a prevalent LOD model that surpasses existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, shows the potential to preserve data quality while scaling up data quantity, which further improves LOD performance from a data-alignment perspective.
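
The cyclic planning, tool use, and reflection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` record, the action names, the feedback strings, and the `plan`, `realign`, `toy_describe`, and `toy_reflect` functions are all hypothetical stand-ins. In particular, the rule-based `plan` only mimics the role of the LLM-controlled planner, and the toy callables replace real VLM and LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """One detected object plus its current language expression (hypothetical schema)."""
    image_id: str
    box: Tuple[int, int, int, int]
    expression: str
    history: List[str] = field(default_factory=list)  # actions taken so far


def plan(feedback: str) -> str:
    """Planning: choose the next action from the reflection feedback.

    A toy rule table; the paper uses an LLM with neural symbolic designs here.
    """
    if "wrong object" in feedback:
        return "crop_object"          # assumed action: adjust the image prompt
    if "ambiguous" in feedback:
        return "highlight_object"     # assumed action: adjust the image prompt
    return "rewrite_text_prompt"      # assumed action: adjust the text prompt


def realign(
    sample: Sample,
    describe: Callable[[Sample, str], str],
    reflect: Callable[[Sample], str],
    max_rounds: int = 3,
) -> Sample:
    """Run plan -> tool use -> reflection until aligned or the round budget is spent."""
    feedback = reflect(sample)
    for _ in range(max_rounds):
        if feedback == "aligned":
            break
        action = plan(feedback)                          # planning
        sample.history.append(action)
        sample.expression = describe(sample, action)     # tool use (VLM re-description)
        feedback = reflect(sample)                       # reflection (second LLM)
    return sample


def toy_describe(sample: Sample, action: str) -> str:
    """Stand-in for a VLM call: only cropping fixes the description in this toy."""
    if action == "crop_object":
        return "a brown dog lying on the grass"
    return sample.expression


def toy_reflect(sample: Sample) -> str:
    """Stand-in for the reflection LLM: the object in this toy image is a dog."""
    return "aligned" if "dog" in sample.expression else "wrong object"


sample = Sample("img_0", (10, 20, 200, 180), "a white cat on the sofa")
result = realign(sample, toy_describe, toy_reflect)
print(result.expression, result.history)
```

In this sketch the loop stops as soon as the reflection step reports alignment, and `max_rounds` bounds the number of re-descriptions so a sample that never aligns cannot cycle forever. How the actual workflow decides when to stop is not specified in the text above.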

Example of Real-LOD

Visual Demo of Real-LOD

Visual Demo of Real-Model

BibTeX


@inproceedings{chen2025realigning,
    title={Re-Aligning Language to Visual Objects with an Agentic Workflow},
    author={Yuming Chen and Jiangyan Feng and Haodong Zhang and Lijun GONG and Feng Zhu and Rui Zhao and Qibin Hou and Ming-Ming Cheng and Yibing Song},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=MPJ4SMnScw}
}
                

Contact

Feel free to contact us at chenyuming[AT]mail.nankai.edu.cn!

This website's source code is borrowed from Xin Jin.