APIEval-20 is a black-box benchmark for API testing agents. Each agent gets only a JSON schema and one sample payload, then generates a test suite. We run those tests against live reference APIs with planted bugs and score bug detection, API coverage, and efficiency. Unlike LLM-as-judge evals, scoring is fully objective: a bug is either caught or it isn’t. Tasks span auth, errors, pagination, schemas, and multi-step flows. Open on Hugging Face.
Hey Product Hunt,
I’m Abhishek, CEO of KushoAI.
We built APIEval-20 because many AI agents now claim to handle API testing, but there was no reliable way to verify those claims.
The evaluations we found usually had one of three gaps. They assumed source code access, depended on detailed documentation, or checked whether the output looked valid instead of measuring actual bugs found.
That felt far from how most teams test APIs in practice.
So we built a black-box benchmark.
Schema and payload in. Nothing else.
The agent generates a test suite. We run those tests against live reference APIs with planted bugs. The score comes from what the agent actually catches: bug detection, API coverage, and efficiency.
No LLM judges. No subjective calls. A bug is either caught or missed.
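To make the contract concrete, here's a minimal sketch of what a task input and the objective scoring could look like. The field names, schema, and scoring formulas are illustrative assumptions, not the published dataset format.

```python
# Illustrative sketch of an APIEval-20-style task and its scoring.
# Field names and structure are assumptions, not the actual dataset format.

# What the agent sees: one JSON schema and one sample payload. Nothing else.
task_input = {
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string", "format": "email"},
            "role": {"type": "string", "enum": ["admin", "member"]},
        },
        "required": ["id", "email"],
    },
    "sample_payload": {"id": 42, "email": "jane@example.com", "role": "member"},
}

def score(planted: set, caught: set, tests_run: int) -> dict:
    """Objective scoring: each planted bug is either caught or missed.
    (API coverage omitted here for brevity.)"""
    detected = planted & caught
    return {
        "bug_detection": len(detected) / len(planted),
        "efficiency": len(detected) / tests_run,  # bugs found per test executed
    }

# A run that catches 1 of 2 planted bugs with 25 generated tests:
print(score({"null-email", "pagination-off-by-one"}, {"null-email"}, tests_run=25))
# -> {'bug_detection': 0.5, 'efficiency': 0.04}
```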
The part I’m most proud of is the complexity taxonomy. Sending nulls to every field is easy. The real test is whether an agent can reason about field relationships, auth behavior, pagination, error handling, schema constraints, and multi-step flows. That is where stronger agents start to separate from weaker ones.
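For a sense of what the harder tiers demand, here is the kind of test an agent would need to generate. The endpoint, parameters, and response shape are hypothetical stand-ins; real suites run against the benchmark's reference APIs.

```python
# Sketch of a higher-tier test: pagination consistency, not just null-blasting.
# Endpoint, params, and response shape are hypothetical.
import requests

BASE_URL = "https://api.example.com"  # placeholder for a reference API

def test_pagination_no_duplicates_or_gaps():
    seen_ids, page = [], 1
    while True:
        resp = requests.get(f"{BASE_URL}/items", params={"page": page, "limit": 10})
        assert resp.status_code == 200, f"unexpected status on page {page}"
        items = resp.json()["items"]
        if not items:
            break
        seen_ids.extend(item["id"] for item in items)
        page += 1
    # A planted off-by-one bug surfaces as a duplicated or skipped record.
    assert len(seen_ids) == len(set(seen_ids)), "duplicate records across pages"
```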
APIEval-20 is open on Hugging Face. We are also putting together a separate research report with a leaderboard comparing major AI agents. If you run your agent on the benchmark before then, we would love to include your results.
Two questions for the community:
1. What domains or API patterns should we add next?
2. If you are building a testing tool or agent, would you want your results included in the leaderboard?
I’ll be here all day. Drop a comment or reach us at [email protected]
APIEval-20 was hunted by Abhishek Saikia. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community. See the full all-time top hunters leaderboard to discover who is shaping the Product Hunt ecosystem.