DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Abstract

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this issue, we introduce DSAEval, a benchmark consisting of 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured modalities such as images and text. DSAEval includes three key features: (1) Multi-modal Environment Perception, enabling agents to interpret observations from both textual and visual modalities; (2) Multi-Query Interactions, reflecting the iterative and cumulative workflow of real-world data science projects; and (3) Multi-Dimensional Evaluation, providing comprehensive assessment across reasoning processes, code generation, and final analytical results. We systematically evaluate 13 advanced agentic LLMs on DSAEval. Experimental results show that Claude-Sonnet-4.5 achieves the strongest overall performance, while MiMo-V2-Pro and GPT-5.2 demonstrate the highest efficiency in execution duration and reasoning steps, respectively. In addition, MiMo-V2-Flash emerges as the most cost-effective model. We further show that multimodal perception consistently enhances performance on vision-related tasks, with improvements ranging from 2.04% to 11.30%. Overall, although current data science agents perform strongly on structured data and routine analytical workflows, substantial challenges remain in unstructured domains. Finally, we discuss critical insights and outline future research directions for autonomous data science agents.

Benchmark Statistics

Dataset Distribution

Figure 2: Distribution of the DSAEval benchmark across data types (left), domains (center), and task types (right).

Comparison with Existing Benchmarks

Benchmark Datasets Questions Hetero. Data Vision-modal Obs Multi-step Deep Learning Eval. Format
DS-1000 1,000 Close (CUT)
Infiagent-DABench 52 257 Close (EM)
DA-Code 500 500 Close (MC)
MLAgentBench 13 13 Close (MC)
MLE-Bench 75 75 Close (MC)
DSEval 294 825 Close (EM)
DSCodeBench 1,000 Close (CUT)
DSBench 112 540 Close (MCQ, MC)
DABstep - 450 Close (EM)
DSAEval (Ours) 285 641 Open (Reason, Code, Report)

CUT = Code Unit Tests, EM = Exact/Structured Matching Against Ground Truth, MC = Metric Comparison, MCQ = Multiple Choice Question.

🏆 Leaderboard & Overall Performance

Figure 3: Overall Model Performance. Claude-Sonnet-4.5 leads the benchmark.

Rank  Model                      Total Score  Reasoning  Code   Result
1     Claude-Sonnet-4.5          8.164        8.970      8.590  7.240
2     MiMo-V2-Pro                7.912        8.350      8.730  6.970
3     GPT-5.2                    7.713        8.270      8.400  6.780
4     Minimax-m2.7               7.699        8.200      8.610  6.640
5     MiMo-V2-Flash              7.644        8.140      8.540  6.600
6     Minimax-m2                 7.642        8.100      8.440  6.700
7     Gemini-3-pro               7.309        7.960      8.310  6.070
8     Grok-4.1-fast              7.254        7.870      8.070  6.180
9     GPT-5-nano                 7.069        7.700      7.850  6.010
10    DeepSeek-v3.2              7.030        7.470      7.830  6.100
11    GLM-4.6v                   6.874        7.500      7.800  5.710
12    Qwen3-VL-30B-A3B-Thinking  5.324        6.560      5.320  4.400
13    Ministral-14b-2512         5.182        5.880      5.740  4.240
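
How the Total Score combines the three dimensions is not stated on this page. The leaderboard numbers are, however, consistent with a weighted sum of 0.3 x Reasoning + 0.3 x Code + 0.4 x Result. The sketch below recomputes each total under that assumption. The weights and the helper names (`weighted_total`, `max_abs_error`) are inferred for illustration and are not taken from the benchmark's official scoring code.

```python
# Scores copied from the leaderboard: (reasoning, code, result, reported_total).
LEADERBOARD = {
    "Claude-Sonnet-4.5": (8.970, 8.590, 7.240, 8.164),
    "MiMo-V2-Pro": (8.350, 8.730, 6.970, 7.912),
    "GPT-5.2": (8.270, 8.400, 6.780, 7.713),
    "Minimax-m2.7": (8.200, 8.610, 6.640, 7.699),
    "MiMo-V2-Flash": (8.140, 8.540, 6.600, 7.644),
    "Minimax-m2": (8.100, 8.440, 6.700, 7.642),
    "Gemini-3-pro": (7.960, 8.310, 6.070, 7.309),
    "Grok-4.1-fast": (7.870, 8.070, 6.180, 7.254),
    "GPT-5-nano": (7.700, 7.850, 6.010, 7.069),
    "DeepSeek-v3.2": (7.470, 7.830, 6.100, 7.030),
    "GLM-4.6v": (7.500, 7.800, 5.710, 6.874),
    "Qwen3-VL-30B-A3B-Thinking": (6.560, 5.320, 4.400, 5.324),
    "Ministral-14b-2512": (5.880, 5.740, 4.240, 5.182),
}

# ASSUMPTION: weights inferred from the table above, not stated on the page.
WEIGHTS = (0.3, 0.3, 0.4)


def weighted_total(reasoning, code, result, weights=WEIGHTS):
    """Combine the three dimension scores into one total (assumed formula)."""
    w_reasoning, w_code, w_result = weights
    return w_reasoning * reasoning + w_code * code + w_result * result


def max_abs_error(rows=LEADERBOARD):
    """Largest gap between a reported total and the recomputed one."""
    return max(
        abs(weighted_total(r, c, res) - total)
        for r, c, res, total in rows.values()
    )


if __name__ == "__main__":
    for model, (r, c, res, total) in LEADERBOARD.items():
        recomputed = weighted_total(r, c, res)
        print(f"{model:27s} reported={total:.3f} recomputed={recomputed:.3f}")
    print(f"max abs error: {max_abs_error():.6f}")
```

Under these assumed weights the Result dimension counts slightly more than Reasoning or Code, so a model with strong code scores but weaker final results (for example MiMo-V2-Pro against Claude-Sonnet-4.5) can still rank lower overall.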

📊 In-Depth Analysis

Fine-Grained Capabilities

Figure 4: Performance breakdown by Domain (Left) and Task Type (Right).


Efficiency & Cost-Effectiveness

Figure 5: Efficiency & Cost-Effectiveness
