ToolGym: An Open-world Tool-using Environment for Scalable Agent Testing
Built on 5,571 tools across 204 real-world apps, ToolGym enables realistic evaluation of LLM agents with long-horizon workflows, wild constraints, and robustness testing.
ToolGym addresses critical gaps in existing benchmarks for tool-using agents
Automatically synthesizes long-horizon, multi-tool workflows with realistic in-the-wild constraints, enabling scalable task generation.
Injects realistic failures (timeouts, rate limits, corrupted responses) to stress-test agent robustness under non-ideal conditions; see the fault-injection sketch below.
Separates deliberate reasoning from step-wise execution, enabling cleaner analysis of planning vs. execution capabilities; see the plan-then-execute sketch below.
Provides semantic search over 5,571 tools using BGE-M3 embeddings, enabling agents to discover tools in a massive registry; see the retrieval sketch below.
Scores agents via multi-model evaluation with majority voting across GPT-4o, GPT-5.1, and DeepSeek-V3.2, yielding stable, human-aligned scores; see the aggregation sketch below.
Collects high-quality trajectories for fine-tuning: models trained on 1,170 ToolGym samples outperform baselines trained on 119k samples.
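To make the failure injection concrete, here is a minimal sketch of a wrapper that routes every tool call through a fault layer and randomly surfaces a timeout, a rate-limit error, or a corrupted payload. The `FaultInjector` class, its failure modes, and the rates are illustrative assumptions, not ToolGym's actual API.

```python
import random

# Illustrative failure-injection wrapper; names and rates are assumptions,
# not ToolGym's actual API.
class FaultInjector:
    def __init__(self, failure_rate: float = 0.2, seed: int = 0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def call(self, tool_fn, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            mode = self.rng.choice(["timeout", "rate_limit", "corrupt"])
            if mode == "timeout":
                raise TimeoutError("tool call timed out")
            if mode == "rate_limit":
                raise RuntimeError("429: rate limit exceeded")
            # Corrupt mode: return a truncated payload the agent must detect.
            result = str(tool_fn(*args, **kwargs))
            return result[: len(result) // 2]
        return tool_fn(*args, **kwargs)

# The agent loop calls tools through the injector and is graded on whether
# it detects and recovers from the injected faults.
injector = FaultInjector(failure_rate=0.3)
try:
    forecast = injector.call(lambda city: {"temp_c": 21}, "Paris")
except (TimeoutError, RuntimeError):
    forecast = None  # the agent should retry, back off, or fall back
```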
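The planner/executor split might look something like the following sketch: one deliberate reasoning call decomposes the task into steps, and a separate step-wise call carries out each step. All function names here are hypothetical; `llm` stands for any text-in/text-out model call.

```python
# Hypothetical plan-then-execute loop; `llm` is any text-in/text-out callable.
def plan(llm, task: str) -> list[str]:
    """Deliberate reasoning: decompose the task into an ordered step list."""
    steps = llm(f"Decompose into numbered tool-use steps:\n{task}")
    return [s for s in steps.splitlines() if s.strip()]

def execute(llm, step: str, history: list[str]) -> str:
    """Step-wise execution: select and invoke one tool for a single step."""
    return llm(f"History so far: {history}\nCarry out this step: {step}")

def run(llm, task: str) -> list[str]:
    history: list[str] = []
    for step in plan(llm, task):
        history.append(execute(llm, step, history))
    return history
```

Keeping the two phases separate lets a failed task be attributed to a bad plan or to a bad execution step, which is what supports the planning-vs-execution analysis described above.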
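Tool discovery over the registry can be reproduced with off-the-shelf BGE-M3 embeddings. The snippet below is a minimal sketch using the FlagEmbedding package, with invented tool descriptions standing in for the 5,571-tool registry; ToolGym's actual retrieval pipeline may differ.

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel  # pip install -U FlagEmbedding

# Embed every tool description once; embed each query at runtime.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

tool_docs = [  # invented examples standing in for the full registry
    "github.search_issues: search issues and pull requests in a repository",
    "weather.get_forecast: fetch a multi-day forecast for a city",
    "calendar.create_event: add an event to the user's calendar",
]
tool_vecs = np.asarray(model.encode(tool_docs)["dense_vecs"])
query_vec = np.asarray(
    model.encode(["find open bugs about rate limits"])["dense_vecs"]
)[0]

# Cosine similarity between the query and every tool description.
scores = (tool_vecs @ query_vec) / (
    np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec)
)
for i in np.argsort(-scores)[:2]:
    print(f"{scores[i]:.3f}  {tool_docs[i]}")
```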
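Multi-judge scoring could be aggregated as in the sketch below: each judge model returns a verdict, binary outcomes are combined by majority vote, and scalar scores by median. The median rule is our assumption (the description above specifies only majority voting), and `judge` is a placeholder for a real API call to each judge model.

```python
from collections import Counter
from statistics import median

JUDGES = ["gpt-4o", "gpt-5.1", "deepseek-v3.2"]

def judge(model: str, trajectory: str) -> dict:
    # Placeholder: prompt `model` with a grading rubric and parse its verdict.
    return {"passed": True, "quality": 4.0}

def aggregate(trajectory: str) -> dict:
    verdicts = [judge(m, trajectory) for m in JUDGES]
    # Majority vote on the binary pass/fail verdict.
    passed = Counter(v["passed"] for v in verdicts).most_common(1)[0][0]
    # Median on scalar scores resists a single outlier judge.
    quality = median(v["quality"] for v in verdicts)
    return {"passed": passed, "quality": quality}
```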
Evaluation results on the ToolGym benchmark (pass@3). Column groups: Answer Quality (Comp., Grnd.); Tool-Use Robustness (Succ.%, Recov.%); Constraint Following (Flex.%, Order%, Info%, Fmt%, Trade%); Long-horizon (#Calls, #Turns, Prog., Decomp.).

| Rank | Model | Overall | Comp. | Grnd. | Succ.% | Recov.% | Flex.% | Order% | Info% | Fmt% | Trade% | #Calls | #Turns | Prog. | Decomp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | gemini-3-pro-preview (Proprietary) | 5.87 | 4.75 | 2.58 | 88.8 | 89.0 | 68.8 | 53.8 | 97.8 | 53.9 | 13.3 | 47.86 | 3.20 | 7.60 | 8.66 |
| 🥈 2 | claude-opus-4.5 (Proprietary) | 5.42 | 4.70 | 2.93 | 92.7 | 83.7 | 60.8 | 65.4 | 73.3 | 51.0 | 33.7 | 45.16 | 4.01 | 6.41 | 7.72 |
| 🥉 3 | deepseek-v3.2 (Open-weight) | 4.97 | 4.00 | 2.18 | 87.5 | 90.6 | 72.4 | 73.1 | 70.0 | 39.5 | 17.9 | 21.73 | 4.92 | 6.46 | 8.04 |
| 4 | glm-4.6v (Open-weight) | 4.86 | 4.01 | 1.18 | 84.8 | 71.5 | 57.3 | 75.6 | 52.2 | 34.2 | 11.5 | 18.03 | 3.27 | 7.20 | 8.50 |
| 5 | grok-4 (Proprietary) | 4.78 | 3.80 | 1.95 | 87.8 | 89.0 | 63.6 | 64.1 | 92.2 | 68.3 | 35.5 | 27.37 | 2.55 | 6.02 | 8.28 |
| 6 | gpt-oss-120b (Open-weight) | 4.66 | 3.42 | 1.28 | 86.3 | 72.7 | 59.7 | 87.2 | 38.9 | 35.8 | 13.3 | 14.40 | 3.14 | 6.53 | 8.10 |
| 7 | gpt-5.2 (Proprietary) | 4.43 | 3.42 | 3.80 | 85.5 | 79.3 | 55.4 | 71.6 | 37.2 | 12.4 | 12.5 | 29.20 | 2.30 | 5.62 | 7.73 |
| 8 | qwen3-235b-a22b (Open-weight) | 3.53 | 2.56 | 1.17 | 87.9 | 88.1 | 66.1 | 80.8 | 43.3 | 31.3 | 8.6 | 11.15 | 4.41 | 6.93 | 8.51 |
| 9 | gpt-4o-mini (Proprietary) | 3.07 | 1.13 | 0.85 | 87.5 | 50.6 | 39.7 | 85.9 | 46.7 | 3.3 | 0.0 | 51.71 | 6.45 | 6.00 | 7.71 |
Insights from evaluating state-of-the-art LLMs on ToolGym
All LLMs exhibit strong planning ability (Goal Decomposition: 7.7-8.6), but execution abilities vary significantly, causing large gaps in task success rate.
Constraint following, rather than tool invocation, is the dominant failure mode. Models struggle with multi-source verification and explicit trade-offs.
Recovery rates range from 50.6% to 90.6% across models. Open-weight models like DeepSeek-v3.2 show competitive robustness against proprietary alternatives.
If you find ToolGym useful, please cite our paper
@inproceedings{toolgym2025,
  title={ToolGym: An Open-world Tool-using Environment for Scalable Agent Testing and Data Curation},
  author={Anonymous},
  booktitle={ACL},
  year={2025}
}
ToolGym is built on real-world MCP tools curated from Smithery, a comprehensive registry for Model Context Protocol servers. We thank Henry Mao and the Smithery team for providing access to their platform and tool ecosystem.
Our codebase builds upon MCP-Universe, an MCP benchmark for evaluating LLMs with real-world tools. We greatly appreciate their open-source contributions, which laid the foundation for ToolGym.