A comprehensive toolkit for evaluating Large Language Models (LLMs) on programming tasks using structured benchmarks. Test multiple providers (OpenAI, Claude, Gemini, AWS Bedrock, Grok) with a unified interface.
- 🎯 Multi-Provider Support - Test OpenAI, Claude, Gemini, Bedrock, and Grok models
- 📊 Comprehensive Benchmarks - HumanEval, ExtendedEval, and AppliedEval datasets
- 🔍 Automated Evaluation - Test LLM-generated code against structured test cases
- 📈 Difficulty Analysis - Classify problems by complexity metrics
- 🧪 Coverage Analysis - Assess edge case coverage in test suites
This toolkit consists of four main components:
- Model Execution - Run multiple LLM providers on programming benchmarks
- Result Evaluation - Automatically test LLM-generated code against test cases
- Difficulty Analysis - Classify problems by complexity metrics
- Coverage Analysis - Assess test case comprehensiveness
- `example_usage_multiple_models.py` - Execute multiple LLM models on benchmark datasets
- `E1_evaluate_results_of_AppliedEval.py` - Evaluate LLM outputs for AppliedEval.jsonl
- `E2_evaluate_results_of_ExtendedEval.py` - Evaluate LLM outputs for ExtendedEval.jsonl
- `M5_DifficultyLevel.py` - Analyze and classify problem difficulty
- `M4_TestCoverage.py` - Analyze test coverage and edge case handling
```bash
git clone https://github.com/AuthEceSoftEng/llm-code-benchmark.git
cd llm-code-benchmark

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies (includes Multi-LLM SDK)
pip install -r requirements.txt
```

Note: This project uses the Multi-LLM SDK for unified LLM provider access.
The easiest way is to use a `.env` file:

```bash
# Copy the example env file
cp .env.example .env

# Edit .env and add your API keys
nano .env  # or use your preferred editor
```

Your `.env` file should look like:

```
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...
# Add other keys as needed
```

The script will automatically load these keys when it runs!
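Under the hood, loading the keys typically amounts to something like the minimal sketch below. This assumes the `python-dotenv` package; the actual loading logic lives inside the scripts and the Multi-LLM SDK.

```python
# Minimal sketch of the key-loading step (assumes python-dotenv;
# the real logic is handled by the scripts / Multi-LLM SDK).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ

openai_key = os.getenv("OPENAI_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")

if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is missing - check your .env file")
```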
Classify problems by complexity using multiple metrics:
```bash
python M5_DifficultyLevel.py \
    --input AppliedEval.jsonl \
    --out difficulty_analysis.csv
```

This script computes a difficulty score and difficulty category (Easy / Medium / Hard / Challenging) for each task in a HumanEval-style dataset. It analyzes:
- The canonical solution using AST (branching, loops, recursion, depth…)
- The prompt (length + constraint words)
- The tests (assert count + pattern features)
What it produces:
- A CSV file where each row includes:
  - code features
  - prompt features
  - test features
  - difficulty score
  - difficulty bucket
Metrics Analyzed:
- Code complexity (branching, loops, nesting depth)
- Recursion usage
- Exception handling
- Test case characteristics
- Prompt complexity and constraints
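To give a sense of how such metrics are derived, here is a minimal sketch of AST-based counting over a canonical solution. The function and metric names are illustrative, not the actual API of `M5_DifficultyLevel.py`.

```python
# Illustrative sketch of AST-based complexity counting
# (names are hypothetical, not M5_DifficultyLevel.py's actual API).
import ast

def code_metrics(source: str) -> dict:
    tree = ast.parse(source)
    branches = loops = excepts = 0
    max_depth = 0

    def visit(node, depth):
        nonlocal branches, loops, excepts, max_depth
        max_depth = max(max_depth, depth)
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.If, ast.IfExp)):
                branches += 1
            elif isinstance(child, (ast.For, ast.While)):
                loops += 1
            elif isinstance(child, ast.Try):
                excepts += 1
            visit(child, depth + 1)

    visit(tree, 0)
    return {"branches": branches, "loops": loops,
            "excepts": excepts, "max_depth": max_depth}

print(code_metrics(
    "def f(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            total += x\n"
    "    return total"
))
```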
Assess edge case coverage in test suites:
```bash
python M4_TestCoverage.py \
    --input AppliedEval.jsonl \
    --out coverage_report.csv \
    --markdown coverage_summary.md \
    --summary
```
This script analyzes a HumanEval-style JSONL dataset and produces coverage metrics for the test cases. It checks patterns such as negatives, zeros, floats, large integers, Unicode, exceptions, empty lists/strings, and more.
What it produces:
- A CSV file summarizing per-task test coverage
- An optional Markdown report
- A command-line summary with coverage statistics
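The resulting CSV can be inspected with any tabular tool, for example (assuming pandas is available; column names are whatever the script emits):

```python
# Quick look at the generated coverage report (assumes pandas is installed).
import pandas as pd

df = pd.read_csv("coverage_report.csv")
print(df.columns.tolist())   # per-task coverage columns produced by the script
print(df.describe())         # rough distribution of the numeric coverage metrics
```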
Coverage Categories:
- Negative numbers
- Zero values
- Large integers (5+ digits)
- Floating-point numbers
- Empty collections (lists, strings)
- None/null values
- Unicode characters
- Exception handling
- Whitespace/escape sequences
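Under the hood, this kind of check amounts to pattern matching over the test source. The sketch below shows the idea; the regexes and category names are illustrative, not the exact rules in `M4_TestCoverage.py`.

```python
# Rough sketch of edge-case pattern detection over test source code
# (regexes and category names are illustrative, not the script's exact rules).
import re

PATTERNS = {
    "negative_numbers": re.compile(r"-\d"),
    "zero_values": re.compile(r"\b0\b"),
    "large_integers": re.compile(r"\b\d{5,}\b"),
    "floats": re.compile(r"\b\d+\.\d+\b"),
    "empty_collections": re.compile(r"\[\]|\(\)|''"),  # empty list/tuple/string
    "none_values": re.compile(r"\bNone\b"),
    "exceptions": re.compile(r"assertRaises|pytest\.raises|\braise\b"),
}

def coverage_flags(test_source: str) -> dict:
    """Return a {category: bool} map plus the number of asserts found."""
    flags = {name: bool(rx.search(test_source)) for name, rx in PATTERNS.items()}
    flags["assert_count"] = len(re.findall(r"\bassert\b", test_source))
    return flags

sample = "assert f([]) == 0\nassert f([-1, 0, 2.5]) == 1.5"
print(coverage_flags(sample))
```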
```bash
python3 example_usage_multiple_models.py \
    --dataset AppliedEval.jsonl \
    --field prompt \
    --out results.jsonl
```

```bash
# Configure which models to test in example_usage_multiple_models.py
# Then run:
python3 example_usage_multiple_models.py \
    --dataset AppliedEval.jsonl \
    --field prompt \
    --out results.jsonl

# Evaluate the results
python E1_evaluate_results_of_AppliedEval.py \
    --results results.jsonl \
    --out evaluation_report.jsonl \
    --verbose
```

### Basic evaluation

#### For AppliedEval.jsonl

```bash
python E1_evaluate_results_of_AppliedEval.py \
    --results results.jsonl \
    --out evaluation_report.jsonl \
    --verbose
```

#### For ExtendedEval.jsonl

```bash
python E2_evaluate_results_of_ExtendedEval.py \
    --results results.jsonl \
    --out evaluation_report.jsonl \
    --verbose
```

Output: The evaluation script will:
- Extract Python code from LLM responses
- Run code against structured test cases
- Handle both function-based and class-based tasks
- Generate a detailed report with pass/fail status
- Display final accuracy metrics
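Conceptually, the evaluation loop looks roughly like the sketch below. It is simplified: the JSONL field names ("response", "test") are assumptions about the record schema, and the real E1/E2 scripts additionally handle class-based tasks, sandboxing, and detailed reporting.

```python
# Simplified sketch of the evaluation loop (illustrative only).
import json
import re

FENCE = "`" * 3  # avoids a literal triple backtick inside this example
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_code(response: str) -> str:
    """Pull the first fenced code block out of an LLM response, else return it as-is."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else response

def run_task(llm_response: str, test_code: str) -> bool:
    """Exec the generated code and its tests in a shared namespace; True = all tests pass."""
    namespace: dict = {}
    try:
        exec(extract_code(llm_response), namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

passed = total = 0
with open("results.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        # Field names below ("response", "test") are assumptions about the JSONL schema.
        total += 1
        passed += run_task(record.get("response", ""), record.get("test", ""))

print(f"Accuracy: {passed}/{total}")
```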
```bash
# Reinstall dependencies
pip install -r requirements.txt
```

```bash
# Make sure .env file exists and has your keys
cp .env.example .env

# Edit .env with your actual API keys
```