Overview
The performance gap between simple and complex data visualization remains a major hurdle for generative AI. A new benchmark, RealChart2Code, has demonstrated that even the most advanced proprietary models lose nearly half their accuracy when generating code for intricate, multi-part charts built from real-world datasets. While AI can effortlessly recreate basic charts from simple images, the task shifts dramatically when the visualization requires synthesizing complex layouts, multiple data sources, and nuanced code logic.
The RealChart2Code benchmark, developed by researchers at several Chinese universities, tests 14 leading models—five proprietary and nine open-weight—on over 2,800 test cases derived from massive, real-world Kaggle datasets. Unlike previous tests that relied on synthetic data or single-chart formats, this benchmark forces models to handle composite layouts, 50 different chart types, and raw data files totaling roughly 860 million rows.
The findings reveal a stark "complexity gap." Models that achieve near-perfect scores on simpler tasks plummet in performance when confronted with the messy reality of professional data science workflows. This suggests that while the current generation of LLMs excels at pattern recognition and basic code syntax, they fundamentally struggle with the deep structural and relational logic inherent in advanced data analysis.
The Performance Divide Between Proprietary and Open Models
The test results established a clear hierarchy, though the top proprietary models still showed significant weakness. Anthropic’s Claude 4.5 Opus posted the highest average score of 8.2 across eight visual accuracy criteria. Google’s Gemini 3 Pro Preview followed closely with 8.1, securing the top spot for basic chart replication with a score of 9.0. However, OpenAI's GPT-5.1 lagged considerably at 5.4.
The disparity was even more pronounced when comparing proprietary and open-weight architectures. The best open-weight performers, such as Qwen3-VL-235B and Intern-VL-3.5-241B, scored only 3.6 and 3.4, respectively—less than half the performance of the leading proprietary models. At the low end, the smaller DeepSeek-VL-7B achieved a pass rate of just 9.7 percent on chart replication, meaning the generated code failed to execute correctly in over 90 percent of cases.

Failure Modes: Syntax vs. Semantics
The error analysis uncovered two distinct failure patterns, providing crucial insights into the current limitations of the technology. Proprietary models, such as Claude 4.5, rarely produced simple syntax errors. Their primary weakness lay in data assignment: the generated visual structure might appear correct, but the individual data series were often mapped to the wrong axes, or the visual attributes failed to align with the specified data parameters.
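This failure mode is subtle because the code runs without error. A minimal sketch of what such a data-assignment bug looks like (the variable names and values here are illustrative, not taken from the benchmark):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
revenue = [120, 135, 150]
costs = [90, 95, 110]

fig, ax = plt.subplots()
# Correct mapping: each series is drawn under its own label.
ax.plot(months, revenue, label="revenue")
ax.plot(months, costs, label="costs")
# A data-assignment failure swaps series and labels, e.g.
#   ax.plot(months, costs, label="revenue")
# The code still executes and the chart still renders, but it
# silently misrepresents the data -- the error class the benchmark
# attributes to proprietary models.
ax.legend()
```

Because nothing crashes, this class of error can only be caught by comparing the rendered output against the intended data, which is exactly what the benchmark's visual-accuracy criteria score.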
Conversely, open-weight models exhibited failures at the code execution level. These models frequently hallucinated non-existent libraries or called invalid functions. Qwen3-VL-235B, for instance, was observed generating invalid API calls, such as a nonexistent Matplotlib style parameter, in approximately 20 percent of instances. Furthermore, even when the code executed, layout problems—like overlapping subplots or broken grid structures—were common.
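Unlike a silent data swap, a hallucinated style name fails immediately. A short sketch of this failure mode, using the hypothetical style name "seaborn-modern" (the benchmark paper's actual example is not specified in this article):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# A hallucinated style name raises before any chart is drawn:
# Matplotlib signals unknown styles with an OSError.
try:
    plt.style.use("seaborn-modern")  # hypothetical invalid name
except OSError as exc:
    print("style error:", exc)

# Guarding against this failure mode is cheap: Matplotlib publishes
# the valid names in plt.style.available.
assert "seaborn-modern" not in plt.style.available
```

This asymmetry explains the two failure profiles: execution-level errors like this one are trivially detectable by running the code, while the data-assignment errors typical of proprietary models are not.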
The benchmark also tested a critical, real-world skill: iterative refinement. This task simulates a developer workflow where the model must fix broken code through a back-and-forth dialogue. While proprietary models showed an advantage here, the overall difficulty of maintaining state and correcting complex, multi-layered errors remains a significant challenge for all tested architectures.
The Implication of the "Complexity Gap"
The most telling finding is the dramatic drop-off observed across all models when moving from simple to complex tasks. Gemini 3 Pro Preview, for example, scored over 96 percent (normalized) on the simpler ChartMimic benchmark but dropped to around 50 percent on RealChart2Code. For open-weight models, the decline was even steeper: Qwen3-VL-235B scored roughly 85 percent on ChartMimic but fell below 25 percent on the comprehensive RealChart2Code test.
This "complexity gap" suggests that current AI models are highly effective at interpolation—filling in known patterns—but struggle with extrapolation and deep structural reasoning. Generating a complex chart is not merely a matter of translating pixels to code; it requires understanding the underlying data relationships, managing dependencies between multiple data streams, and maintaining logical consistency across an entire visualization ecosystem.