AI models guess instead of asking for help, researchers find


Multimodal language models are fundamentally incapable of recognizing when they lack necessary input.



Key Points

  • The Gap Between Sight and Understanding
  • The Illusion of Proactivity
  • Training for True Contextual Awareness

Overview

Multimodal language models are fundamentally incapable of recognizing when they lack necessary input. Instead of flagging missing data or requesting clarification—a basic skill expected of human intelligence—these advanced systems often hallucinate plausible but incorrect answers. Researchers developed the ProactiveBench benchmark to systematically test whether current AI models can identify missing visual information and proactively request the required help. The findings reveal a deep architectural flaw: when presented with tasks that require human intervention, most state-of-the-art models default to guessing, a failure that severely undermines their utility in real-world applications.

The ProactiveBench benchmark is designed to test scenarios impossible to solve without external human input, forcing models to identify hidden objects, clean noisy images, or interpret rough sketches. Testing 22 leading multimodal models, including GPT-4.1 and LLaVA-OV, demonstrated a dramatic performance drop. While these models achieve high accuracy when information is clear, their performance plummets by over 60 percent when the input is compromised or incomplete.

The Gap Between Sight and Understanding

The most striking evidence of the models' deficiency comes from the ROD dataset, which tests object identification when items are physically blocked from view. In ideal conditions, models maintain an impressive accuracy rate of 98.3 percent. However, when objects are obscured, that accuracy collapses to a mere 8.2 percent. This disparity confirms that the models are excellent at pattern matching when the data is pristine, but they lack the contextual awareness to understand why they cannot complete a task.

This failure is not limited by model size or architecture. Researchers found that a smaller model, InternVL3-1B, outperformed a significantly larger counterpart, InternVL3-8B, in certain tests. Furthermore, the choice of the underlying language model proved critical; for instance, LLaVA-NeXT paired with Vicuna achieved a much higher success rate than the same setup using Mistral. This suggests that while overall model complexity is important, the specific integration and fine-tuning of the underlying components dictate true contextual capability.


The Illusion of Proactivity

The research further exposed that what looks like "proactivity" is often just low-bar guessing rather than genuine understanding. The researchers stress-tested the models by replacing valid, helpful suggestions with nonsensical alternatives, such as suggesting "Rewind the video" during a sketching task. Models that had previously appeared helpful picked these meaningless options with surprising enthusiasm.

This suggests the models are not reasoning about the need for information; they are merely selecting the most statistically probable next action based on their training data, regardless of whether that action makes logical sense in the given context. Even attempting to guide the models using explicit hints within the prompt or conversation history only marginally improves performance, nudging accuracy up to 25.8 percent—a number that still fails to beat chance on average.


Training for True Contextual Awareness

The findings, while sobering, did pinpoint a viable path forward. The core problem is one of training methodology, not just model scale. The breakthrough involves using reinforcement learning, specifically Group-Relative Policy Optimization (GRPO). By fine-tuning models like LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B using GRPO on thousands of examples, researchers were able to train the models to understand the difference between a correct guess and a genuine knowledge gap.
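At its core, GRPO replaces a learned value baseline with a group-relative one: several responses are sampled per prompt, and each response's reward is normalized against the group's mean and standard deviation to produce an advantage signal. A minimal sketch of that advantage computation (function and variable names here are illustrative, not taken from the paper):

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled response's reward is
    normalized against the mean and standard deviation of its group,
    so no separate value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0:
        return [0.0] * n  # all responses scored equally: no learning signal
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to the same prompt, scored by a reward function
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above their group's average get a positive advantage and are reinforced; below-average ones are penalized, which is what pushes the model toward the higher-rewarded behavior during fine-tuning.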

The key mechanism here is the reward function. Instead of rewarding any proactive suggestion, the reward function is weighted to score correct predictions significantly higher than any suggested action. This forces the model to develop a genuine internal mechanism: it must only ask for help when it is truly stuck, thereby developing a more sophisticated and reliable understanding of its own limitations.
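The paper's exact reward weights are not reproduced here, but the weighting idea can be sketched as follows, with all names and values being illustrative assumptions: a correct direct answer scores highest, a justified request for help scores lower, and a wrong guess or an unnecessary request scores nothing.

```python
def reward(answered_correctly, asked_for_help, help_was_needed,
           w_correct=1.0, w_ask=0.4):
    """Hypothetical reward shaping: correct predictions outweigh
    proactive suggestions, so asking only pays off when the model
    genuinely cannot answer on its own."""
    if answered_correctly:
        return w_correct  # best outcome: solve the task directly
    if asked_for_help and help_was_needed:
        return w_ask      # justified request for missing input
    return 0.0            # wrong guess, or an unnecessary request
```

Because `w_ask` is strictly smaller than `w_correct`, the model cannot game the reward by asking for help on every task; the only way to maximize return is to answer when it can and ask only when it is truly stuck.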