ToolGym: An Open-world Tool-using Environment for Scalable Agent Testing
Built on 5,571 tools across 204 real-world apps, ToolGym enables realistic evaluation of LLM agents with long-horizon workflows, wild constraints, and robustness testing.
ToolGym addresses critical gaps in existing benchmarks for tool-using agents
Automatically synthesizes long-horizon, multi-tool workflows with realistic in-the-wild constraints, enabling scalable task generation.
Injects realistic failures (timeouts, rate limits, corrupted responses) to stress-test agent robustness under non-ideal conditions; see the fault-injection sketch below.
Separates deliberate reasoning from step-wise execution, enabling cleaner analysis of planning vs. execution capabilities; see the plan-then-execute sketch below.
Provides semantic search over 5,571 tools using BGE-M3 embeddings, enabling agents to discover tools in a massive registry; see the retrieval sketch below.
Scores agents via multi-model evaluation with majority voting across GPT-4o, GPT-5.1, and DeepSeek-V3.2, yielding stable, human-aligned scores; see the aggregation sketch below.
Collects high-quality trajectories for fine-tuning: models trained on 1,170 ToolGym samples outperform baselines trained on 119k samples.
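To make the failure injection concrete, here is a minimal sketch of a wrapper that routes every tool call through a fault layer and randomly surfaces a timeout, a rate-limit error, or a corrupted payload. The `FaultInjector` class, its failure modes, and the rates are illustrative assumptions, not ToolGym's actual API.

```python
import random

# Illustrative failure-injection wrapper; names and rates are assumptions,
# not ToolGym's actual API.
class FaultInjector:
    def __init__(self, failure_rate: float = 0.2, seed: int = 0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def call(self, tool_fn, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            mode = self.rng.choice(["timeout", "rate_limit", "corrupt"])
            if mode == "timeout":
                raise TimeoutError("tool call timed out")
            if mode == "rate_limit":
                raise RuntimeError("429: rate limit exceeded")
            # Corrupt mode: return a truncated payload the agent must detect.
            result = str(tool_fn(*args, **kwargs))
            return result[: len(result) // 2]
        return tool_fn(*args, **kwargs)

# The agent loop calls tools through the injector and is graded on whether
# it detects and recovers from the injected faults.
injector = FaultInjector(failure_rate=0.3)
try:
    forecast = injector.call(lambda city: {"temp_c": 21}, "Paris")
except (TimeoutError, RuntimeError):
    forecast = None  # the agent should retry, back off, or fall back
```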
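The planner/executor split might look something like the following sketch: one deliberate reasoning call decomposes the task into steps, and a separate step-wise call carries out each step. All function names here are hypothetical; `llm` stands for any text-in/text-out model call.

```python
# Hypothetical plan-then-execute loop; `llm` is any text-in/text-out callable.
def plan(llm, task: str) -> list[str]:
    """Deliberate reasoning: decompose the task into an ordered step list."""
    steps = llm(f"Decompose into numbered tool-use steps:\n{task}")
    return [s for s in steps.splitlines() if s.strip()]

def execute(llm, step: str, history: list[str]) -> str:
    """Step-wise execution: select and invoke one tool for a single step."""
    return llm(f"History so far: {history}\nCarry out this step: {step}")

def run(llm, task: str) -> list[str]:
    history: list[str] = []
    for step in plan(llm, task):
        history.append(execute(llm, step, history))
    return history
```

Keeping the two phases separate lets a failed task be attributed to a bad plan or to a bad execution step, which is what supports the planning-vs-execution analysis described above.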
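Tool discovery over the registry can be reproduced with off-the-shelf BGE-M3 embeddings. The snippet below is a minimal sketch using the FlagEmbedding package, with invented tool descriptions standing in for the 5,571-tool registry; ToolGym's actual retrieval pipeline may differ.

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel  # pip install -U FlagEmbedding

# Embed every tool description once; embed each query at runtime.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

tool_docs = [  # invented examples standing in for the full registry
    "github.search_issues: search issues and pull requests in a repository",
    "weather.get_forecast: fetch a multi-day forecast for a city",
    "calendar.create_event: add an event to the user's calendar",
]
tool_vecs = np.asarray(model.encode(tool_docs)["dense_vecs"])
query_vec = np.asarray(
    model.encode(["find open bugs about rate limits"])["dense_vecs"]
)[0]

# Cosine similarity between the query and every tool description.
scores = (tool_vecs @ query_vec) / (
    np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec)
)
for i in np.argsort(-scores)[:2]:
    print(f"{scores[i]:.3f}  {tool_docs[i]}")
```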
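Multi-judge scoring could be aggregated as in the sketch below: each judge model returns a verdict, binary outcomes are combined by majority vote, and scalar scores by median. The median rule is our assumption (the description above specifies only majority voting), and `judge` is a placeholder for a real API call to each judge model.

```python
from collections import Counter
from statistics import median

JUDGES = ["gpt-4o", "gpt-5.1", "deepseek-v3.2"]

def judge(model: str, trajectory: str) -> dict:
    # Placeholder: prompt `model` with a grading rubric and parse its verdict.
    return {"passed": True, "quality": 4.0}

def aggregate(trajectory: str) -> dict:
    verdicts = [judge(m, trajectory) for m in JUDGES]
    # Majority vote on the binary pass/fail verdict.
    passed = Counter(v["passed"] for v in verdicts).most_common(1)[0][0]
    # Median on scalar scores resists a single outlier judge.
    quality = median(v["quality"] for v in verdicts)
    return {"passed": passed, "quality": quality}
```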
Evaluation results on the ToolGym benchmark (pass@3). Column groups: Answer Quality (Comp., Grnd.); Tool-Use Robustness (Succ.%, Recov.%); Constraint Following (Flex.%, Order%, Info%, Fmt%, Trade%); Long-horizon (#Calls, #Turns, Prog., Decomp.).

| Rank | Model | Overall | Comp. | Grnd. | Succ.% | Recov.% | Flex.% | Order% | Info% | Fmt% | Trade% | #Calls | #Turns | Prog. | Decomp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | gemini-3-pro-preview (Proprietary) | 5.87 | 4.75 | 2.58 | 88.8 | 89.0 | 68.8 | 53.8 | 97.8 | 53.9 | 13.3 | 47.86 | 3.20 | 7.60 | 8.66 |
| 🥈 2 | claude-opus-4.5 (Proprietary) | 5.42 | 4.70 | 2.93 | 92.7 | 83.7 | 60.8 | 65.4 | 73.3 | 51.0 | 33.7 | 45.16 | 4.01 | 6.41 | 7.72 |
| 🥉 3 | deepseek-v3.2 (Open-weight) | 4.97 | 4.00 | 2.18 | 87.5 | 90.6 | 72.4 | 73.1 | 70.0 | 39.5 | 17.9 | 21.73 | 4.92 | 6.46 | 8.04 |
| 4 | glm-4.6v (Open-weight) | 4.86 | 4.01 | 1.18 | 84.8 | 71.5 | 57.3 | 75.6 | 52.2 | 34.2 | 11.5 | 18.03 | 3.27 | 7.20 | 8.50 |
| 5 | grok-4 (Proprietary) | 4.78 | 3.80 | 1.95 | 87.8 | 89.0 | 63.6 | 64.1 | 92.2 | 68.3 | 35.5 | 27.37 | 2.55 | 6.02 | 8.28 |
| 6 | gpt-oss-120b (Open-weight) | 4.66 | 3.42 | 1.28 | 86.3 | 72.7 | 59.7 | 87.2 | 38.9 | 35.8 | 13.3 | 14.40 | 3.14 | 6.53 | 8.10 |
| 7 | gpt-5.2 (Proprietary) | 4.43 | 3.42 | 3.80 | 85.5 | 79.3 | 55.4 | 71.6 | 37.2 | 12.4 | 12.5 | 29.20 | 2.30 | 5.62 | 7.73 |
| 8 | qwen3-235b-a22b (Open-weight) | 3.53 | 2.56 | 1.17 | 87.9 | 88.1 | 66.1 | 80.8 | 43.3 | 31.3 | 8.6 | 11.15 | 4.41 | 6.93 | 8.51 |
| 9 | gpt-4o-mini (Proprietary) | 3.07 | 1.13 | 0.85 | 87.5 | 50.6 | 39.7 | 85.9 | 46.7 | 3.3 | 0.0 | 51.71 | 6.45 | 6.00 | 7.71 |
Insights from evaluating state-of-the-art LLMs on ToolGym
All LLMs exhibit strong planning ability (Goal Decomposition: 7.7-8.6), but execution abilities vary significantly, causing large gaps in task success rate.
Constraint following, rather than tool invocation, is the dominant failure mode. Models struggle with multi-source verification and explicit trade-offs.
Recovery rates range from 50.6% to 90.6% across models. Open-weight models like DeepSeek-v3.2 show competitive robustness against proprietary alternatives.
If you find ToolGym useful, please cite our paper
@inproceedings{toolgym2025,
  title={ToolGym: An Open-world Tool-using Environment for Scalable Agent Testing and Data Curation},
  author={Anonymous},
  booktitle={ACL},
  year={2025}
}
ToolGym is built on real-world MCP tools curated from Smithery, a comprehensive registry for Model Context Protocol servers. We thank Henry Mao and the Smithery team for providing access to their platform and tool ecosystem.
Our codebase builds upon MCP-Universe, an MCP benchmark for evaluating LLMs with real-world tools. We greatly appreciate their open-source contributions, which laid the foundation for ToolGym.