I am in principle still working on something along those lines: https://github.com/JohannesGaessler/elo_hellm . The intent is to have quality control for the training code, but there are other things that I need to take care of first: the benchmarks took a long time even with small models on 6x RTX 4090, so I decided that I first need to work on better server throughput (to reduce GPU time, particularly for multi-GPU use) and on better memory allocation (to reduce person time). From my end I would design it like this:
Steps 1-3 could very well live within llama.cpp and I would be happy to upstream them. The ETA from my end would be several months at the very least (but I would also be happy to help with reviewing if someone else is interested in working on it in the meantime). Also, particularly for step 2, I think we have very different ideas of what should be implemented. One could in principle simply send all HTTP requests to a single server, but in my experience that just takes way too long because some of the benchmarks have 10000+ questions. So I would be using a batch system like HTCondor for that part in order to make it scalable, but that would very much go against the idea of making the tool "simple". It may make sense to define specs for intermediate formats and to make the way models are evaluated modular.
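To make the single-server bottleneck concrete, here is a minimal sketch (not the elo_hellm design) of dispatching benchmark questions to one OpenAI-compatible endpoint with a thread pool. The endpoint URL, worker count, and helper names are illustrative assumptions, not anything from the project:

```python
# Minimal sketch: send benchmark questions to a single OpenAI-compatible
# /v1/chat/completions endpoint using a thread pool (stdlib only).
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server address

def ask(question: str) -> str:
    payload = json.dumps({
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_benchmark(questions: list[str], max_workers: int = 32) -> list[str]:
    # Client-side concurrency only helps up to whatever the server can batch;
    # beyond that, requests queue up and wall-clock time grows with question count.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, questions))

if __name__ == "__main__":
    for answer in run_benchmark(["What is 2+2?", "Name the capital of France."]):
        print(answer)
```

Even with many client threads, throughput is capped by the single server, which is why spreading the questions over several server instances via a scheduler scales better, at the cost of simplicity.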
Recently I looked at NVIDIA's Evaluator library. The idea is to be able to run various evaluations (AIME, MMLU, GSM8K, etc.) against an OpenAI-compatible endpoint. As an idea and as a user interface I think it is good.
However, I tried to get it running on my Mac and quickly hit problems. Besides having to provide 3 or 4 API keys for various services, the library requires installing Docker and downloading several containers, and the runs then get queued in some sort of manager/scheduler. Overall it looks extremely over-engineered and hard to get running; in the end I was not able to make it work, so I gave up.
Still, I think the idea for such a tool is good and I think we should build one. It would be a very basic Python script, with near-zero dependencies, that runs most of the evals against a specified endpoint.
Requirements:
The best version of this that I know of is the gpt-oss evals, though it is also quite encumbered and over-engineered for my taste. Plus it has some issues.
So I think there is room for such a basic tool. If anyone is interested in implementing it, I can help guide the effort.
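For illustration, here is a minimal sketch of the kind of near-zero-dependency script described above. The data format (a local JSONL file with "question" and "answer" fields) and the GSM8K-style last-number exact-match scoring are my assumptions, not part of the proposal; a real tool would need per-benchmark loaders and scoring:

```python
# Tiny eval runner sketch: stdlib only, queries an OpenAI-compatible endpoint
# and reports exact-match accuracy on the last number in each reply.
import argparse
import json
import re
import urllib.request

def query(endpoint: str, question: str) -> str:
    payload = json.dumps({
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def last_number(text: str) -> str | None:
    # Extract the final number in the reply, ignoring thousands separators.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def main() -> None:
    ap = argparse.ArgumentParser(description="Tiny eval runner for an OpenAI-compatible endpoint")
    ap.add_argument("--endpoint", default="http://localhost:8080/v1/chat/completions")
    ap.add_argument("--data", required=True, help="JSONL with 'question' and 'answer' fields")
    args = ap.parse_args()

    correct = total = 0
    with open(args.data) as f:
        for line in f:
            item = json.loads(line)
            reply = query(args.endpoint, item["question"])
            correct += last_number(reply) == str(item["answer"])
            total += 1
    print(f"accuracy: {correct}/{total} = {correct / max(total, 1):.3f}")

if __name__ == "__main__":
    main()
```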