DSAEval: Evaluate Data Science Agent By Large Scale Real World Data Science Problems


Abstract

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this issue, we introduce DSAEval, a benchmark consisting of 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured modalities such as images and text. DSAEval includes three key features: (1) Multi-modal Environment Perception, enabling agents to interpret observations from both textual and visual modalities; (2) Multi-Query Interactions, reflecting the iterative and cumulative workflow of real-world data science projects; and (3) Multi-Dimensional Evaluation, providing comprehensive assessment across reasoning processes, code generation, and final analytical results. We systematically evaluate 13 advanced agentic LLMs on DSAEval. Experimental results show that Claude-Sonnet-4.5 achieves the strongest overall performance, while MiMo-V2-Pro and GPT-5.2 demonstrate the highest efficiency in execution duration and reasoning steps, respectively. In addition, MiMo-V2-Flash emerges as the most cost-effective model. We further show that multimodal perception consistently enhances performance on vision-related tasks, with improvements ranging from 2.04% to 11.30%. Overall, although current data science agents perform strongly on structured data and routine analytical workflows, substantial challenges remain in unstructured domains. Finally, we discuss critical insights and outline future research directions for autonomous data science agents.

Benchmark Statistics

Figure 2: Distribution of DSAEval Benchmark. Covering diverse Data Types (Left), Domains (Center), and Task Types (Right).

Comparison with Existing Benchmarks

Benchmark	Datasets	Questions	Eval. Format
DS-1000	-	1,000	Closed (CUT)
InfiAgent-DABench	52	257	Closed (EM)
DA-Code	500	500	Closed (MC)
MLAgentBench	13	13	Closed (MC)
MLE-Bench	75	75	Closed (MC)
DSEval	294	825	Closed (EM)
DSCodeBench	-	1,000	Closed (CUT)
DSBench	112	540	Closed (MCQ, MC)
DABstep	-	450	Closed (EM)
DSAEval (Ours)	285	641	Open (Reason, Code, Report)

CUT = Code Unit Tests, EM = Exact/Structured Matching Against Ground Truth, MC = Metric Comparison, MCQ = Multiple Choice Question.

🏆 Leaderboard & Overall Performance

Figure 3: Overall Model Performance. Claude-Sonnet-4.5 leads the benchmark.

Rank	Model	Total Score	Reasoning	Code	Result
1	Claude-sonnet-4.5	8.164	8.970	8.590	7.240
2	Mimo-v2-pro	7.912	8.350	8.730	6.970
3	GPT-5.2	7.713	8.270	8.400	6.780
4	Minimax-m2.7	7.699	8.200	8.610	6.640
5	Mimo-v2-flash	7.644	8.140	8.540	6.600
6	Minimax-m2	7.642	8.100	8.440	6.700
7	Gemini-3-pro	7.309	7.960	8.310	6.070
8	Grok-4.1-fast	7.254	7.870	8.070	6.180
9	GPT-5-nano	7.069	7.700	7.850	6.010
10	DeepSeek-v3.2	7.030	7.470	7.830	6.100
11	GLM-4.6v	6.874	7.500	7.800	5.710
12	Qwen3-VL-30B-A3B-Thinking	5.324	6.560	5.320	4.400
13	Ministral-14b-2512	5.182	5.880	5.740	4.240
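
This page does not say how Total Score is aggregated from the three sub-scores. The numbers in the leaderboard are, however, consistent with a weighted sum of 0.3 × Reasoning + 0.3 × Code + 0.4 × Result. The sketch below checks that reading against every row; the weights are an inference from the published figures, not a documented formula, and the names `ROWS`, `WEIGHTS`, `weighted_total`, and `check_rows` are illustrative.

```python
# Leaderboard rows copied from the table: (model, total, reasoning, code, result).
ROWS = [
    ("Claude-sonnet-4.5", 8.164, 8.970, 8.590, 7.240),
    ("Mimo-v2-pro", 7.912, 8.350, 8.730, 6.970),
    ("GPT-5.2", 7.713, 8.270, 8.400, 6.780),
    ("Minimax-m2.7", 7.699, 8.200, 8.610, 6.640),
    ("Mimo-v2-flash", 7.644, 8.140, 8.540, 6.600),
    ("Minimax-m2", 7.642, 8.100, 8.440, 6.700),
    ("Gemini-3-pro", 7.309, 7.960, 8.310, 6.070),
    ("Grok-4.1-fast", 7.254, 7.870, 8.070, 6.180),
    ("GPT-5-nano", 7.069, 7.700, 7.850, 6.010),
    ("DeepSeek-v3.2", 7.030, 7.470, 7.830, 6.100),
    ("GLM-4.6v", 6.874, 7.500, 7.800, 5.710),
    ("Qwen3-VL-30B-A3B-Thinking", 5.324, 6.560, 5.320, 4.400),
    ("Ministral-14b-2512", 5.182, 5.880, 5.740, 4.240),
]

# ASSUMPTION: weights inferred from the numbers above, not stated on the page.
WEIGHTS = (0.3, 0.3, 0.4)  # (reasoning, code, result)


def weighted_total(reasoning, code, result, weights=WEIGHTS):
    """Recompute a total score as a weighted sum, rounded to three decimals."""
    w_reasoning, w_code, w_result = weights
    return round(w_reasoning * reasoning + w_code * code + w_result * result, 3)


def check_rows(rows=ROWS, tol=1e-9):
    """Return the models whose reported total differs from the recomputed one."""
    return [
        model
        for model, total, reasoning, code, result in rows
        if abs(weighted_total(reasoning, code, result) - total) > tol
    ]


if __name__ == "__main__":
    # An empty list means every reported total matches the assumed weighting.
    print("mismatches:", check_rows())
```

If the assumed weighting holds, Result carries the largest share of the total, which would explain why models with similar Reasoning and Code scores can still be separated by their Result column.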

📊 In-Depth Analysis

Fine-Grained Capabilities

Figure 4: Performance breakdown by Domain (Left) and Task Type (Right).

Efficiency & Cost-Effectiveness

Figure 5: Efficiency & Cost-Effectiveness