ToolEval Leaderboard

An Automatic Evaluation for Tool Learning

[Leaderboard charts: Win Rate of Methods, overall and on each subset]

About ToolEval

ToolEval is an automatic evaluator built for tool learning. It incorporates two evaluation metrics, Pass Rate and Win Rate (Preference):

- **Pass Rate**: the proportion of instructions completed successfully within a limited number of OpenAI API calls.
- **Win Rate (Preference)**: measured by comparing two answers (action sequences) for a given instruction. We pre-define a set of criteria for a better answer, organized as prompts for ChatGPT. We provide the test instruction and two candidate answers to the evaluator and obtain its preference. We evaluate each answer pair multiple times to improve reliability, then calculate the **Win Rate** (the percentage of times an answer is preferred by the evaluator). More details can be found in our paper.
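As a minimal sketch only, assuming the evaluator's raw outputs are a per-instruction pass/fail flag and a list of repeated pairwise votes, the two metrics could be aggregated roughly as follows (function and variable names are illustrative, not taken from the ToolEval codebase):

```python
from collections import Counter

def pass_rate(results: list[bool]) -> float:
    """Fraction of instructions solved within the API-call budget."""
    return sum(results) / len(results)

def win_rate(preferences: list[list[str]]) -> float:
    """preferences[i] holds the evaluator's repeated votes ('A' or 'B')
    for instruction i; the majority vote decides the preferred answer."""
    wins = 0
    for votes in preferences:
        majority, _ = Counter(votes).most_common(1)[0]
        if majority == "A":  # 'A' denotes the candidate method's answer (assumption)
            wins += 1
    return wins / len(preferences)

# Example: 3 instructions, each judged 3 times by the evaluator
print(pass_rate([True, False, True]))                                  # ~0.67
print(win_rate([["A", "A", "B"], ["B", "B", "B"], ["A", "A", "A"]]))   # ~0.67
```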

To validate the reliability of the ChatGPT evaluator on both pass rate and win rate, we sample solution pairs from four different methods (ChatGPT+ReACT, ChatGPT+DFSDT, ToolLLaMA+DFSDT and GPT4+DFSDT), with 300 test instructions for each method. We then ask human annotators to label the pass rate for ChatGPT+DFSDT, ToolLLaMA+DFSDT and GPT4+DFSDT, and the win rate between ChatGPT+ReACT and ChatGPT+DFSDT. Our ChatGPT evaluator reaches a high agreement of **87.1%** on pass rate and **80.3%** on win rate with the human annotators. This shows that our evaluator produces evaluation results highly similar to humans' and can be regarded as a credible evaluator that simulates human judgement of pass rate and win rate.
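For intuition, the agreement figures can be read as simple label-match accuracy between the ChatGPT evaluator and the human annotators; a hypothetical sketch (names are ours, not from the paper):

```python
def agreement(evaluator_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of sampled instructions where the evaluator's judgement
    matches the human annotation."""
    assert len(evaluator_labels) == len(human_labels)
    matches = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return matches / len(human_labels)

# e.g. agreement(["pass", "fail", "pass"], ["pass", "pass", "pass"]) == 2/3
```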

Adding new methods or evaluators

We welcome new method contributions to the leaderboard from the community. Please follow the steps in the ToolEval repository for more information.

ToolEval limitations

ToolEval is not a comprehensive evaluation of a method's abilities; it only reflects how well a method uses tools to solve problems. The automatic evaluators are not perfect and can be wrong under certain circumstances. We encourage the community to contribute more robust, safe and ethical evaluators to the project.