I am in principle still working on something along those lines: https://github.com/JohannesGaessler/elo_hellm . The intent is to have quality control for the training code, but there are other things that I need to take care of first: the benchmarks took a long time even with small models on 6x RTX 4090, so I decided that I first need to work on better server throughput (to reduce GPU time, particularly for multi-GPU use) and on better memory allocation (to reduce person time). From my end I would design it like this:
Steps 1-3 could very well live within llama.cpp and I would be happy to upstream them. The ETA from my end would be several months at the very least (but I would also be happy to help with reviewing if someone else is interested in working on it in the meantime). Also, particularly for step 2, I think we have very different ideas of what should be implemented. One could in principle simply send all HTTP requests to a single server, but in my experience that just takes way too long because some of the benchmarks have 10000+ questions. So I would be using a batch system like HTCondor for that part in order to make it scalable, but that would very much go against the idea of making the tool "simple". It may make sense to define specs for intermediate formats and to make the way models are evaluated modular.
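To make the single-server bottleneck concrete, here is a minimal sketch (not the elo_hellm design) of dispatching benchmark questions to one OpenAI-compatible endpoint with a thread pool. The endpoint URL, worker count, and helper names are illustrative assumptions, not anything from the project:

```python
# Minimal sketch: send benchmark questions to a single OpenAI-compatible
# /v1/chat/completions endpoint using a thread pool (stdlib only).
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server address

def ask(question: str) -> str:
    payload = json.dumps({
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_benchmark(questions: list[str], max_workers: int = 32) -> list[str]:
    # Client-side concurrency only helps up to whatever the server can batch;
    # beyond that, requests queue up and wall-clock time grows with question count.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, questions))

if __name__ == "__main__":
    for answer in run_benchmark(["What is 2+2?", "Name the capital of France."]):
        print(answer)
```

Even with many client threads, throughput is capped by the single server, which is why spreading the questions over several server instances via a scheduler scales better, at the cost of simplicity.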
Recently I looked at NVIDIA's Evaluator library. The idea is to be able to run various evaluations (AIME, MMLU, GSM8K, etc.) against an OpenAI-compatible endpoint. As an idea and as a user interface I think it is good.
However, I tried to get it running on my Mac and quickly hit problems. Besides having to provide 3 or 4 API keys for various services, the library requires installing Docker and downloading several containers, and the runs then get queued in some sort of manager/scheduler. Overall it looks extremely over-engineered and hard to get running; in the end I was not able to make it work, so I gave up.
Still, I think the idea for such a tool is good and I think we should build one. It would be a very basic Python script, with near-zero dependencies, that runs most of the evals against a specified endpoint.
Requirements:
The best version of this that I know of is the gpt-oss evals, though it is also quite encumbered and over-engineered for my taste. Plus it has some issues.
So I think there is room for such a basic tool. If anyone is interested in implementing it, I can help guide the effort.
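For illustration, here is a minimal sketch of the kind of near-zero-dependency script described above. The data format (a local JSONL file with "question" and "answer" fields) and the GSM8K-style last-number exact-match scoring are my assumptions, not part of the proposal; a real tool would need per-benchmark loaders and scoring:

```python
# Tiny eval runner sketch: stdlib only, queries an OpenAI-compatible endpoint
# and reports exact-match accuracy on the last number in each reply.
import argparse
import json
import re
import urllib.request

def query(endpoint: str, question: str) -> str:
    payload = json.dumps({
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def last_number(text: str) -> str | None:
    # Extract the final number in the reply, ignoring thousands separators.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def main() -> None:
    ap = argparse.ArgumentParser(description="Tiny eval runner for an OpenAI-compatible endpoint")
    ap.add_argument("--endpoint", default="http://localhost:8080/v1/chat/completions")
    ap.add_argument("--data", required=True, help="JSONL with 'question' and 'answer' fields")
    args = ap.parse_args()

    correct = total = 0
    with open(args.data) as f:
        for line in f:
            item = json.loads(line)
            reply = query(args.endpoint, item["question"])
            correct += last_number(reply) == str(item["answer"])
            total += 1
    print(f"accuracy: {correct}/{total} = {correct / max(total, 1):.3f}")

if __name__ == "__main__":
    main()
```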