Figure 2: Distribution of the DSAEval Benchmark, covering diverse data types (left), domains (center), and task types (right).
| Benchmark | Datasets | Questions | Hetero. Data | Vision-modal Obs. | Multi-step | Deep Learning |
|---|---|---|---|---|---|---|
| DS-1000 (Lai et al., 2023) | 1,000 | | | | | |
| Infiagent-DABench (Hu et al., 2024) | 52 | 257 | | | | |
| DA-Code (Huang et al., 2024b) | 500 | 500 | | | | |
| MLAgentBench (Huang et al., 2024a) | 13 | 13 | | | | |
| DSEval (Zhang et al., 2024) | 299 | 825 | | | | |
| DSCodeBench (Ouyang et al., 2025) | 1,000 | | | | | |
| DABstep (Egg et al., 2025) | – | 450 | | | | |
| DSAEval (Ours) | 285 | 641 | | | | |
Figure 3: Overall Model Performance. Claude-Sonnet-4.5 leads the benchmark.
| Rank | Model | Total Score | Reasoning | Code | Result |
|---|---|---|---|---|---|
| 1 | Claude-Sonnet-4.5 | 8.164 | 8.970 | 8.590 | 7.240 |
| 2 | GPT-5.2 | 7.713 | 8.270 | 8.400 | 6.780 |
| 3 | Mimo-v2-flash | 7.644 | 8.140 | 8.540 | 6.600 |
| 4 | Minimax-m2 | 7.642 | 8.100 | 8.440 | 6.700 |
| 5 | Gemini-3-pro | 7.309 | 7.960 | 8.310 | 6.070 |
| 6 | Grok-4.1-fast | 7.254 | 7.870 | 8.070 | 6.180 |
| 7 | GPT-5-nano | 7.069 | 7.700 | 7.850 | 6.010 |
| 8 | DeepSeek-v3.2 | 7.030 | 7.470 | 7.830 | 6.100 |
| 9 | GLM-4.6v | 6.874 | 7.500 | 7.800 | 5.710 |
| 10 | Qwen3-VL-30B-Thinking | 5.324 | 6.560 | 5.320 | 4.400 |
| 11 | Ministral-14b-2512 | 5.182 | 5.880 | 5.740 | 4.240 |
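For readers who want to work with these numbers directly, the Figure 3 leaderboard can be transcribed and sanity-checked with a short script. The rows below are copied from the table above; note that the weighting used to combine the three sub-scores into the Total Score is not specified here, so the script only checks the ordering and the top-two gap rather than re-deriving totals.

```python
# Leaderboard rows transcribed from Figure 3:
# (model, total, reasoning, code, result)
leaderboard = [
    ("Claude-Sonnet-4.5",     8.164, 8.970, 8.590, 7.240),
    ("GPT-5.2",               7.713, 8.270, 8.400, 6.780),
    ("Mimo-v2-flash",         7.644, 8.140, 8.540, 6.600),
    ("Minimax-m2",            7.642, 8.100, 8.440, 6.700),
    ("Gemini-3-pro",          7.309, 7.960, 8.310, 6.070),
    ("Grok-4.1-fast",         7.254, 7.870, 8.070, 6.180),
    ("GPT-5-nano",            7.069, 7.700, 7.850, 6.010),
    ("DeepSeek-v3.2",         7.030, 7.470, 7.830, 6.100),
    ("GLM-4.6v",              6.874, 7.500, 7.800, 5.710),
    ("Qwen3-VL-30B-Thinking", 5.324, 6.560, 5.320, 4.400),
    ("Ministral-14b-2512",    5.182, 5.880, 5.740, 4.240),
]

# Sanity check: rows are ordered by Total Score, descending.
totals = [row[1] for row in leaderboard]
assert totals == sorted(totals, reverse=True)

# Margin between the top two models on Total Score.
gap = round(leaderboard[0][1] - leaderboard[1][1], 3)
print(f"Top model: {leaderboard[0][0]}, lead over #2: {gap}")
```

Running this confirms the ranking shown in the table, with Claude-Sonnet-4.5 leading GPT-5.2 by 0.451 points on Total Score.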
Figure 4: Performance breakdown by Domain (Left) and Task Type (Right).
Figure 5: Efficiency Analysis. (Left) Total Score vs. Tokens. (Right) Total Score vs. Price.