Commit a1ae99e

Add results for LiveCodeBench and SimpleQA
1 parent d845eef commit a1ae99e

1 file changed: +7 -7 lines changed

README.md

Lines changed: 7 additions & 7 deletions
@@ -29,12 +29,12 @@ This project is a work in progress, and the provided code is in an early experim
 
 ### Comparison of CePO with default settings and base model
 
-| Method | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX |
-| -------------------------- | ------- | --------------- | ---- | ---- |
-| Llama 3.1 70B | 41.6 | 72.9 | 41.7 | 64.2 |
-| Llama 3.3 70B | 51.0 | 78.6 | 49.1 | 72.6 |
-| Llama 3.1 405B | 49.8 | 79.2 | 50.7 | 73.0 |
-| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 55.5 | 80.1 |
+| Method | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | LiveCodeBench (pass@1) | Simple QA |
+| -------------------------: | :-----: | :-------------: | :--: | :--: | :--------------------: | :-------: |
+| Llama 3.1 70B | 41.6 | 72.9 | 41.7 | 64.2 | 24.5 | 14.7 |
+| Llama 3.3 70B | 51.0 | 78.6 | 49.1 | 72.6 | 27.1 | 20.9 |
+| Llama 3.1 405B | 49.8 | 79.2 | 50.7 | 73.0 | 31.8 | 13.5 |
+| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 55.5 | 80.1 | 31.9 | 22.6 |
 
 ### Ablation studies
 
@@ -43,7 +43,7 @@ We conducted ablation studies to evaluate the impact of various hyperparameters
 Interestingly, the self-critique and quality improvement capabilities of existing off-the-shelf models do not always scale proportionally with increased inference compute. Addressing this limitation remains a key focus, and we plan to explore custom model fine-tuning as a potential solution in the future.
 
 | bestofn_n | planning_n | planning_m | bestofn_rating_type | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | Comments |
-| --------- | ---------- | ---------- | ------------------- | ------- | --------------- | ----- | ----- | -------------- |
+| :-------: | :--------: | :--------: | :-----------------: | :-----: | :-------------: | :---: | :---: | :------------- |
 | 3 | 3 | 6 | absolute | 69.6 | 84.8 | 55.5 | 80.1 | Default config |
 | 3 | 3 | 6 | pairwise | 67.7 | 83.5 | 55.6 | 79.8 | |
 | 3 | 2 | 5 | absolute | 67.1 | 85.1 | 55.1 | 79.0 | |
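
To make the updated comparison table easier to read, the following is a minimal sketch that computes the absolute gain of CePO over its base model (Llama 3.3 70B) on each benchmark, including the two columns added in this commit. The scores are copied from the table in the diff; the names `BENCHMARKS`, `SCORES`, and `gains` are illustrative and are not part of the repository.

```python
# Scores copied from the updated comparison table in README.md.
# BENCHMARKS, SCORES and gains() are illustrative names, not project code.
BENCHMARKS = [
    "Math-L5",
    "MMLU-Pro (Math)",
    "GPQA",
    "CRUX",
    "LiveCodeBench (pass@1)",
    "Simple QA",
]

SCORES = {
    "Llama 3.1 70B": [41.6, 72.9, 41.7, 64.2, 24.5, 14.7],
    "Llama 3.3 70B": [51.0, 78.6, 49.1, 72.6, 27.1, 20.9],
    "Llama 3.1 405B": [49.8, 79.2, 50.7, 73.0, 31.8, 13.5],
    "CePO (using Llama 3.3 70B)": [69.6, 84.8, 55.5, 80.1, 31.9, 22.6],
}


def gains(method: str, baseline: str) -> dict[str, float]:
    """Return per-benchmark score differences (method minus baseline)."""
    return {
        name: round(m - b, 1)
        for name, m, b in zip(BENCHMARKS, SCORES[method], SCORES[baseline])
    }


cepo_gains = gains("CePO (using Llama 3.3 70B)", "Llama 3.3 70B")
for name, delta in cepo_gains.items():
    print(f"{name}: {delta:+.1f}")
```

Passing `"Llama 3.1 405B"` as the baseline instead gives the comparison against the larger model using the same table values.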
