This paper focuses on situation recognition, which involves identifying not just the objects in an image, but also the actions (verbs) and the semantic roles played by various entities (e.g., agent, tool, location).
The researchers address the limitations of existing situation recognition models by leveraging the CLIP (Contrastive Language-Image Pre-training) framework to improve how machines describe complex human-object interactions.
They introduce methods to adapt CLIP's powerful visual-linguistic representations specifically for the task of generating structured descriptions that capture the "who, what, and where" of a scene.
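As an illustration, such a structured description can be represented as a situation frame: a verb paired with a mapping from semantic roles to the entities that fill them. The verb, role names, and fillers below are hypothetical examples for illustration, not taken from the paper:

```python
# A minimal sketch of a situation frame: a verb plus semantic roles.
# The verb, role names, and fillers are illustrative, not from the paper.
frame = {
    "verb": "riding",
    "roles": {
        "agent": "man",      # who performs the action
        "vehicle": "horse",  # what is used or acted upon
        "place": "beach",    # where the action happens
    },
}

def describe(frame):
    """Flatten a situation frame into a short 'who, what, where' phrase."""
    roles = frame["roles"]
    return f"{roles['agent']} {frame['verb']} {roles['vehicle']} at {roles['place']}"

print(describe(frame))  # → man riding horse at beach
```

The key point is that the output is structured (verb plus typed roles) rather than a free-form caption, which is what makes the task harder than plain image classification.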
The paper demonstrates that by effectively fine-tuning or prompting CLIP, models can achieve significantly higher accuracy in recognizing verbs and their associated semantic roles than previous state-of-the-art systems.
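To make the prompting idea concrete, here is a minimal sketch of CLIP-style zero-shot verb classification: each candidate verb becomes a text prompt, and the verb whose text embedding is most cosine-similar to the image embedding wins. The embeddings below are random stand-ins for real CLIP encoder outputs, and the function and prompt names are illustrative, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP embeddings: in a real system these come from CLIP's
# image and text encoders; here they are random vectors for illustration.
dim = 512
verb_prompts = [
    "a photo of a person riding",
    "a photo of a person cooking",
    "a photo of a person jumping",
]
text_embeds = rng.normal(size=(len(verb_prompts), dim))
# Simulate an image whose embedding is close to the "cooking" prompt.
image_embed = text_embeds[1] + 0.1 * rng.normal(size=dim)

def classify_verb(image_embed, text_embeds, prompts):
    """Return the prompt whose embedding has the highest cosine similarity."""
    img = image_embed / np.linalg.norm(image_embed)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarities, since both sides are normalized
    return prompts[int(np.argmax(sims))]

print(classify_verb(image_embed, text_embeds, verb_prompts))
```

Fine-tuning, by contrast, would update the encoders (or lightweight adapters on top of them) so that these similarities are trained directly on situation-recognition labels rather than relied on zero-shot.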