Figure 2: Distribution of the DSAEval benchmark across data types (left), domains (center), and task types (right).
| Benchmark | Datasets | Questions | Hetero. Data | Vision-modal Obs | Multi-step | Deep Learning | Eval. Format |
|---|---|---|---|---|---|---|---|
| DS-1000 | | 1,000 | | | | | Close (CUT) |
| Infiagent-DABench | 52 | 257 | | | | | Close (EM) |
| DA-Code | 500 | 500 | | | | | Close (MC) |
| MLAgentBench | 13 | 13 | | | | | Close (MC) |
| MLE-Bench | 75 | 75 | | | | | Close (MC) |
| DSEval | 294 | 825 | | | | | Close (EM) |
| DSCodeBench | | 1,000 | | | | | Close (CUT) |
| DSBench | 112 | 540 | | | | | Close (MCQ, MC) |
| DABstep | - | 450 | | | | | Close (EM) |
| DSAEval (Ours) | 285 | 641 | | | | | Open (Reason, Code, Report) |
CUT = Code Unit Tests, EM = Exact/Structured Matching Against Ground Truth, MC = Metric Comparison, MCQ = Multiple Choice Question.
Figure 3: Overall Model Performance. Claude-Sonnet-4.5 leads the benchmark with a total score of 8.164.
| Rank | Model | Total Score | Reasoning | Code | Result |
|---|---|---|---|---|---|
| 1 | Claude-sonnet-4.5 | 8.164 | 8.970 | 8.590 | 7.240 |
| 2 | Mimo-v2-pro | 7.912 | 8.350 | 8.730 | 6.970 |
| 3 | GPT-5.2 | 7.713 | 8.270 | 8.400 | 6.780 |
| 4 | Minimax-m2.7 | 7.699 | 8.200 | 8.610 | 6.640 |
| 5 | Mimo-v2-flash | 7.644 | 8.140 | 8.540 | 6.600 |
| 6 | Minimax-m2 | 7.642 | 8.100 | 8.440 | 6.700 |
| 7 | Gemini-3-pro | 7.309 | 7.960 | 8.310 | 6.070 |
| 8 | Grok-4.1-fast | 7.254 | 7.870 | 8.070 | 6.180 |
| 9 | GPT-5-nano | 7.069 | 7.700 | 7.850 | 6.010 |
| 10 | DeepSeek-v3.2 | 7.030 | 7.470 | 7.830 | 6.100 |
| 11 | GLM-4.6v | 6.874 | 7.500 | 7.800 | 5.710 |
| 12 | Qwen3-VL-30B-A3B-Thinking | 5.324 | 6.560 | 5.320 | 4.400 |
| 13 | Ministral-14b-2512 | 5.182 | 5.880 | 5.740 | 4.240 |
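The page does not state how the Total Score is aggregated from the three sub-scores. The leaderboard numbers are, however, consistent with a weighted sum of 0.3 × Reasoning + 0.3 × Code + 0.4 × Result. The sketch below checks that reading against every row. The weights are an assumption inferred from the table, and the names `LEADERBOARD`, `WEIGHTS`, `weighted_total`, and `max_abs_error` are illustrative only.

```python
# Rows copied from the leaderboard table: (total, reasoning, code, result).
LEADERBOARD = {
    "Claude-sonnet-4.5": (8.164, 8.970, 8.590, 7.240),
    "Mimo-v2-pro": (7.912, 8.350, 8.730, 6.970),
    "GPT-5.2": (7.713, 8.270, 8.400, 6.780),
    "Minimax-m2.7": (7.699, 8.200, 8.610, 6.640),
    "Mimo-v2-flash": (7.644, 8.140, 8.540, 6.600),
    "Minimax-m2": (7.642, 8.100, 8.440, 6.700),
    "Gemini-3-pro": (7.309, 7.960, 8.310, 6.070),
    "Grok-4.1-fast": (7.254, 7.870, 8.070, 6.180),
    "GPT-5-nano": (7.069, 7.700, 7.850, 6.010),
    "DeepSeek-v3.2": (7.030, 7.470, 7.830, 6.100),
    "GLM-4.6v": (6.874, 7.500, 7.800, 5.710),
    "Qwen3-VL-30B-A3B-Thinking": (5.324, 6.560, 5.320, 4.400),
    "Ministral-14b-2512": (5.182, 5.880, 5.740, 4.240),
}

# ASSUMPTION: weights inferred from the published numbers, not stated on the page.
WEIGHTS = (0.3, 0.3, 0.4)  # (reasoning, code, result)


def weighted_total(reasoning, code, result, weights=WEIGHTS):
    """Combine the three sub-scores into one total using the assumed weights."""
    w_reasoning, w_code, w_result = weights
    return w_reasoning * reasoning + w_code * code + w_result * result


def max_abs_error(table=LEADERBOARD):
    """Largest gap between a reported total and the recomputed weighted sum."""
    return max(
        abs(total - weighted_total(reasoning, code, result))
        for total, reasoning, code, result in table.values()
    )


if __name__ == "__main__":
    for model, (total, reasoning, code, result) in LEADERBOARD.items():
        recomputed = weighted_total(reasoning, code, result)
        print(f"{model:28s} reported={total:.3f} recomputed={recomputed:.3f}")
    print(f"max abs error: {max_abs_error():.6f}")
```

If the assumed weighting holds, the Result column carries the most weight, which would explain why models with similar Reasoning and Code scores are separated mainly by their Result score.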
Figure 4: Performance breakdown by Domain (Left) and Task Type (Right).
Figure 5: Efficiency & Cost-Effectiveness