APIEval-20 is a black-box benchmark for API testing agents. Each agent gets only a JSON schema and one sample payload, then generates a test suite. We run those tests against live reference APIs with planted bugs and score bug detection, API coverage, and efficiency. Unlike LLM-as-judge evals, scoring is fully objective: a bug is either caught or it isn’t. Tasks span auth, errors, pagination, schemas, and multi-step flows. Open on Hugging Face.
Hey Product Hunt,
I’m Abhishek, CEO of KushoAI.
We built APIEval-20 because API testing is now a common claim across AI agents, but there was no reliable way to verify it.
The evaluations we found usually had one of three gaps: they assumed source-code access, depended on detailed documentation, or checked whether the output looked valid instead of measuring the bugs actually found.
That felt far from how most teams test APIs in practice.
So we built a black-box benchmark.
Schema and payload in. Nothing else.
The agent generates a test suite. We run those tests against live reference APIs with planted bugs. The score comes from what the agent actually catches: bug detection, API coverage, and efficiency.
No LLM judges. No subjective calls. A bug is either caught or missed.
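To make the scoring concrete, here is a minimal sketch of how an objective score along those three axes could be computed. All names here (`TestResult`, `score`, the bug IDs) are illustrative assumptions, not the actual APIEval-20 harness:

```python
# Sketch of objective scoring: detection, coverage, efficiency.
# TestResult and score() are hypothetical, not the real harness.
from dataclasses import dataclass, field


@dataclass
class TestResult:
    endpoint: str                       # endpoint the test exercised
    bugs_triggered: set = field(default_factory=set)  # planted-bug IDs surfaced


def score(results, planted_bugs, all_endpoints):
    # Union of every planted bug any test in the suite surfaced.
    caught = set()
    for r in results:
        caught |= r.bugs_triggered
    caught &= planted_bugs

    detection = len(caught) / len(planted_bugs)
    coverage = len({r.endpoint for r in results}) / len(all_endpoints)
    # Bugs caught per test, so bloated suites score lower here.
    efficiency = len(caught) / max(len(results), 1)
    return {"detection": detection, "coverage": coverage, "efficiency": efficiency}
```

Because every bug has an ID and either appears in `caught` or does not, there is nothing for a judge model to adjudicate.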
The part I’m most proud of is the complexity taxonomy. Sending nulls to every field is easy. The real test is whether an agent can reason about field relationships, auth behavior, pagination, error handling, schema constraints, and multi-step flows. That is where stronger agents start to separate from weaker ones.
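As a toy illustration of that gap, here is an easy case next to a multi-step case, written as plain test-case data. The `/orders` endpoints, field names, and status codes are hypothetical examples, not tasks from the benchmark:

```python
# Hypothetical test cases illustrating the easy vs. complex tiers.
# Endpoints and status codes are made up for illustration.

# Easy tier: blunt fuzzing that almost any agent can generate.
naive_case = {
    "name": "null every field",
    "request": {"method": "POST", "path": "/orders",
                "body": {"item_id": None, "quantity": None}},
    "expect": {"status": 400},
}

# Harder tier: a multi-step flow where a planted bug only surfaces
# at step 3, after the agent has chained state across requests.
multi_step_case = {
    "name": "deleted order is no longer retrievable",
    "steps": [
        {"method": "POST", "path": "/orders",
         "body": {"item_id": "sku-123", "quantity": 1},
         "expect": 201, "save_as": "order_id"},
        {"method": "DELETE", "path": "/orders/{order_id}", "expect": 204},
        # An API that soft-deletes incorrectly would still return 200 here.
        {"method": "GET", "path": "/orders/{order_id}", "expect": 404},
    ],
}
```

Generating the second case requires the agent to infer a resource lifecycle from the schema alone, which is exactly where stronger agents pull ahead.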
APIEval-20 is open on Hugging Face. We are also putting together a leaderboard comparing major AI agents in a separate research report. If you run your agent on the benchmark before then, we would love to include your results.
Two questions for the community:
1. What domains or API patterns should we add next?
2. If you are building a testing tool or agent, would you want your results included in the leaderboard?
I’ll be here all day. Drop a comment or reach us at [email protected]
the black box scoring is the right call, been skeptical of llm-as-judge for anything that has an objective answer. curious about the multi step flows though, if a bug only shows up at step 3 does the agent get credit for catching it or does it need to find it proactively from the schema alone?
Really like the black-box setup. Feels much closer to how teams actually test APIs than benchmarks that assume source code access. Curious how you’re thinking about the planted bugs: do auth, pagination, schema issues, multi-step flows, etc. all count the same, or are you planning to weight them by severity/commonness?
Nice. I thought LLM-as-judge was what we needed in some cases.
Do you have a classifier to pick one approach vs the other?
Do you publish per bug breakdowns so people can see exactly what types of failures each agent misses?
About APIEval-20 on Product Hunt
“An open benchmark for AI agents that test APIs”
APIEval-20 launched on Product Hunt on May 8th, 2026 and earned 117 upvotes and 10 comments, placing #12 on the daily leaderboard.
APIEval-20 was featured in API (98.1k followers), Developer Tools (512.4k followers) and Artificial Intelligence (468.5k followers) on Product Hunt. Together, these topics include over 171.5k products, making this a competitive space to launch in.
Who hunted APIEval-20?
APIEval-20 was hunted by Abhishek Saikia. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community.