38 commits
52bc77f feat(serverless): add worker fitness check system (deanq, Dec 12, 2025)
3b3a27a feat(serverless): add worker fitness check system (deanq, Dec 12, 2025)
757f334 docs: moved serverless architecture doc (deanq, Dec 12, 2025)
d607f4c docs: document fitness check system (deanq, Dec 13, 2025)
65d08c8 feat(build): add GPU test binary build infrastructure (deanq, Dec 17, 2025)
f39e7fc feat(serverless): implement GPU fitness check system (deanq, Dec 17, 2025)
87635e5 build(dist): package GPU test binary for distribution (deanq, Dec 17, 2025)
74cf3c1 test(serverless): add GPU fitness check tests (deanq, Dec 17, 2025)
473f10c test(performance): disable GPU check in cold start benchmarks (deanq, Dec 17, 2025)
9e77a15 docs(serverless): document GPU fitness check system (deanq, Dec 17, 2025)
edf4694 fix(fitness): defer GPU check registration to avoid circular imports (deanq, Dec 18, 2025)
3fbc0ba fix(logging): use warn() instead of warning() for RunPodLogger (deanq, Dec 18, 2025)
d8833ea fix(logging): fix RunPodLogger.warning() call in rp_scale (deanq, Dec 18, 2025)
4e6ab53 fix(gpu-fitness): correct import path for rp_cuda (deanq, Dec 18, 2025)
bf5cb38 fix(test): correct mock patch target for binary path resolution test (deanq, Dec 18, 2025)
349cf43 build(gpu-binary): replace ARM binary with x86-64 compiled version (deanq, Dec 18, 2025)
f764971 feat(system-fitness): add system resource fitness checks (deanq, Dec 18, 2025)
83d68ee refactor(fitness): integrate system fitness checks auto-registration (deanq, Dec 18, 2025)
06cdb94 build(deps): add psutil for system resource checking (deanq, Dec 18, 2025)
309e9c1 test(system-fitness): add comprehensive test suite for system fitness… (deanq, Dec 18, 2025)
9238ec1 test(fitness): update fixtures to handle system checks auto-registration (deanq, Dec 18, 2025)
4fc7c16 docs: document built-in system fitness checks with configuration (deanq, Dec 18, 2025)
e957958 feat(cuda-init): add CUDA device initialization fitness check (deanq, Dec 18, 2025)
d81e81a docs: document CUDA device initialization fitness check (deanq, Dec 18, 2025)
5dde5cf chore: reduce minimum disk space requirement to 1GB (deanq, Dec 18, 2025)
0e16910 fix(cuda): suppress nvidia-smi stderr on CPU-only workers (deanq, Dec 18, 2025)
1f7e83d fix(cuda): parse actual CUDA version from nvidia-smi, not driver version (deanq, Dec 18, 2025)
5e35792 refactor(disk-check): use percentage-based disk space validation (deanq, Dec 19, 2025)
4b1cf9c docs: update disk space check documentation for percentage-based vali… (deanq, Dec 19, 2025)
6c02a98 fix(disk-check): remove redundant /tmp check in containers (deanq, Dec 19, 2025)
bd1d464 fix(tests): update CUDA tests to match implementation (deanq, Dec 19, 2025)
3c69761 fix(fitness): address PR feedback on fitness checks system (deanq, Dec 19, 2025)
766807b fix(fitness): resolve CodeQL code quality issues (deanq, Dec 19, 2025)
7a6774f refactor(gpu-fitness): remove redundant is_available() call in fallback (deanq, Dec 19, 2025)
dcabce4 fix(fitness): resolve unresolved PR feedback comments (deanq, Dec 19, 2025)
7f31c04 fix(fitness): resolve CodeQL and Copilot feedback comments (deanq, Dec 19, 2025)
c4ccc0c fix(fitness): remove unused mock variable assignments in tests (deanq, Dec 19, 2025)
bc1a809 fix(ruff): resolve all remaining linting errors (deanq, Dec 19, 2025)
265 changes: 265 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,265 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview
Runpod Python is a dual-purpose library: a GraphQL API wrapper for Runpod cloud services and a serverless worker SDK for custom endpoint development. The project supports both synchronous and asynchronous programming patterns.

## Development Environment
- **Python versions**: 3.8-3.11 (3.8+ required)
- **Build system**: setuptools with setuptools_scm for automatic versioning from git tags
- **Dependency management**: uv with uv.lock for deterministic builds
- **Package installation**: `uv sync --group test` for development dependencies
- **Lock file**: `uv.lock` ensures reproducible dependency resolution

## Build & Development Commands

### Environment Setup
```bash
# Install package with development dependencies
uv sync --group test

# Install all dependency groups (includes dev and test)
uv sync --all-groups

# Install from source (editable) - automatically done by uv sync
uv sync --group test

# Install latest development version
uv pip install git+https://github.com/runpod/runpod-python.git
```

### Testing
```bash
# Run full test suite with 90% coverage requirement
uv run pytest

# Run tests with coverage report (matches CI configuration)
uv run pytest --durations=10 --cov=runpod --cov-report=xml --cov-report=term-missing --cov-fail-under=90

# Run specific test modules
uv run pytest tests/test_api/
uv run pytest tests/test_serverless/
uv run pytest tests/test_cli/

# Test with timeout (120s max per test) - configured in pytest.ini
uv run pytest --timeout=120 --timeout_method=thread
```

### CLI Development & Testing
```bash
# Test CLI commands (entry point: runpod.cli.entry:runpod_cli)
uv run runpod --help
uv run runpod config # Configuration wizard
uv run runpod pod # Pod management
uv run runpod project # Serverless project scaffolding
uv run runpod ssh # SSH connection management
uv run runpod exec # Remote execution

# Local serverless worker testing
uv run python worker.py --rp_serve_api # Start local test server for worker development
```

### Package Building
```bash
# Build distributions (uses setuptools_scm for versioning)
uv build

# Verify package
uv run twine check dist/*

# Version is automatically determined from git tags
# No manual version updates needed in code
```

## Code Architecture

### Dual-Mode Operation Pattern
The library operates in two distinct modes:
1. **API Mode** (`runpod.api.*`): GraphQL wrapper for Runpod web services
2. **Worker Mode** (`runpod.serverless.*`): SDK for building serverless functions

### Key Modules Structure

#### `/runpod/api/` - GraphQL API Wrapper
- `ctl_commands.py`: High-level API functions (pods, endpoints, templates, users)
- `graphql.py`: Core GraphQL query execution engine
- `mutations/`: GraphQL mutations (create/update/delete operations)
- `queries/`: GraphQL queries (read operations)

#### `/runpod/serverless/` - Worker SDK
- `worker.py`: Main worker orchestration and job processing loop
- `modules/rp_handler.py`: Request/response handling for serverless functions
- `modules/rp_fastapi.py`: Local development server (FastAPI-based)
- `modules/rp_scale.py`: Auto-scaling and concurrency management
- `modules/rp_ping.py`: Health monitoring and heartbeat system

#### `/runpod/cli/` - Command Line Interface
- `entry.py`: Main CLI entry point using Click framework
- `groups/`: Modular command groups (config, pod, project, ssh, exec)
- Uses Click framework with rich terminal output and progress bars

#### `/runpod/endpoint/` - Client SDK
- `runner.py`: Synchronous endpoint interaction
- `asyncio/asyncio_runner.py`: Asynchronous endpoint interaction
- Supports both sync and async programming patterns

### Async/Sync Duality Pattern
The codebase maintains both synchronous and asynchronous interfaces throughout:
- Endpoint clients: `endpoint.run()` (async) vs `endpoint.run_sync()` (sync); see the async sketch after this list
- Worker processing: Async job handling with sync compatibility
- HTTP clients: aiohttp for async, requests for sync operations
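
A minimal sketch of the async side of this duality. The `AsyncioEndpoint`/`AsyncioJob` names below are inferred from `asyncio/asyncio_runner.py`; treat them as assumptions rather than a confirmed public surface:

```python
import asyncio

import aiohttp
import runpod
# Assumption: these names are exported from runpod/endpoint/asyncio/asyncio_runner.py
from runpod import AsyncioEndpoint, AsyncioJob

runpod.api_key = "your_api_key"

async def main():
    # The async client takes a caller-owned aiohttp session (the async HTTP stack)
    async with aiohttp.ClientSession() as session:
        endpoint = AsyncioEndpoint("ENDPOINT_ID", session)
        job: AsyncioJob = await endpoint.run({"input": "data"})
        output = await job.output()  # await completion instead of blocking a thread
        print(output)

asyncio.run(main())
```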

## Testing Requirements

### Test Coverage Standards
- **Minimum coverage**: 90% (enforced by pytest.ini configuration)
- **Test timeout**: 120 seconds per test (configured in pytest.ini)
- **Test structure**: Mirrors source code organization exactly
- **Async mode**: Auto-enabled via pytest.ini for seamless async testing
- **Coverage configuration**: Defined in pyproject.toml with omit patterns

### Local Serverless Testing
The project includes sophisticated local testing capabilities:
- `tests/test_serverless/local_sim/`: Mock Runpod environment
- Local development server via `python worker.py --rp_serve_api` (exercised in the sketch below)
- Integration testing with worker state simulation
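
A quick way to exercise the local server from another terminal. This is a sketch: the default port 8000 and the `/runsync` route are assumptions about the FastAPI test server.

```python
import requests

# Assumes `python worker.py --rp_serve_api` is already running,
# listening on the default port with a synchronous /runsync route.
resp = requests.post(
    "http://localhost:8000/runsync",
    json={"input": {"prompt": "hello"}},
    timeout=30,
)
print(resp.status_code, resp.json())
```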

### Async Testing
- Uses `pytest-asyncio` for async test support (see the sketch after this list)
- `asynctest` for advanced async mocking
- Comprehensive coverage of both sync and async code paths
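
A minimal async test sketch under the auto mode configured in pytest.ini; `AsyncMock` stands in for a real handler:

```python
from unittest.mock import AsyncMock

# asyncio_mode is auto-enabled via pytest.ini, so a bare `async def test_`
# is collected and awaited without an explicit @pytest.mark.asyncio marker.
async def test_async_handler_returns_output():
    handler = AsyncMock(return_value={"output": "ok"})

    result = await handler({"input": {"x": 1}})

    assert result == {"output": "ok"}
    handler.assert_awaited_once()
```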

## Development Patterns

### Worker Development Workflow
```python
# Basic serverless worker pattern
import runpod

def handler_function(job):
    job_input = job["input"]
    result = job_input  # Process input here
    return {"output": result}

# Start worker (production)
runpod.serverless.start({"handler": handler_function})

# Local testing
# python worker.py --rp_serve_api
```

### API Usage Pattern
```python
import runpod

# Set API key
runpod.api_key = "your_api_key"

# Async endpoint usage
endpoint = runpod.Endpoint("ENDPOINT_ID")
run_request = endpoint.run({"input": "data"})
result = run_request.output() # Blocks until complete

# Sync endpoint usage
result = endpoint.run_sync({"input": "data"})
```

### Error Handling Architecture
- Custom exceptions in `runpod/error.py` (usage sketched after this list)
- GraphQL error handling in API wrapper
- Worker error handling with job state management
- HTTP client error handling with retry logic (aiohttp-retry)
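
A sketch of consuming those custom exceptions. The `AuthenticationError`/`QueryError` names are read off `runpod/error.py` and should be double-checked there:

```python
import runpod
# Assumption: exception names as defined in runpod/error.py
from runpod.error import AuthenticationError, QueryError

runpod.api_key = "your_api_key"

try:
    pods = runpod.get_pods()
except AuthenticationError:
    # API key missing or rejected by the GraphQL backend
    raise SystemExit("Set a valid runpod.api_key before calling the API")
except QueryError as err:
    # The GraphQL backend responded with an error payload
    print(f"Query failed: {err}")
```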

## CI/CD Pipeline

### GitHub Actions Workflows
- **CI-pytests.yml**: Unit tests across Python 3.8, 3.9, 3.10.15, 3.11.10 matrix using uv
- **CI-e2e.yml**: End-to-end integration testing
- **CI-codeql.yml**: Security analysis
- **CD-publish_to_pypi.yml**: Production PyPI releases with release-please automation
- **CD-test_publish_to_pypi.yml**: Test PyPI releases
- **vhs.yml**: VHS demo recording workflow
- **Manual workflow dispatch**: Available for force publishing without release-please

### Version Management
- Uses `setuptools_scm` for automatic versioning from git tags
- No manual version updates required in source code
- Version file generated at `runpod/_version.py` (read back in the sketch below)
- **Release-please automation**: Automated releases based on conventional commits
- **Worker notification**: Automatically notifies runpod-workers repositories on release
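
Since the version file is generated, code can read it back rather than hard-coding a version. A sketch, assuming setuptools_scm's default file template (which exposes both `version` and `__version__`):

```python
# Sketch: _version.py is produced by setuptools_scm at build/install time.
from runpod._version import __version__

print(__version__)  # e.g. "1.7.0" on a tag, or a .devN+g<sha> string between tags
```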

## Key Dependencies

### Production Dependencies (requirements.txt)
- `aiohttp[speedups]`: Async HTTP client (primary)
- `fastapi[all]`: Local development server and API framework
- `click`: CLI framework
- `boto3`: AWS S3 integration for file operations
- `paramiko`: SSH client functionality
- `requests`: Sync HTTP client (fallback/compatibility)

### Development Dependencies (pyproject.toml dependency-groups)
- **test group**: `pytest`, `pytest-asyncio`, `pytest-cov`, `pytest-timeout`, `faker`, `nest_asyncio`
- **dev group**: `build`, `twine` for package building and publishing
- **Lock file**: `uv.lock` provides deterministic dependency resolution across environments
- **Dynamic dependencies**: Production deps loaded from `requirements.txt` via pyproject.toml

## Build System Configuration

### pyproject.toml as Primary Configuration
- **Project metadata**: Name, version, description, authors defined in pyproject.toml
- **Build system**: Uses setuptools with setuptools_scm backend
- **Dependency management**: Hybrid approach with requirements.txt for production deps
- **CLI entry points**: Defined in `[project.scripts]` section
- **Tool configurations**: pytest coverage settings, setuptools_scm configuration

### Legacy Compatibility
- **setup.py**: Maintained for backward compatibility but not primary configuration
- **requirements.txt**: Still used for production dependencies, loaded dynamically
- **Version management**: Automated via setuptools_scm, no manual updates needed

## Project-Specific Conventions

### GraphQL Integration
- All Runpod API interactions use GraphQL exclusively
- Mutations and queries are separated into distinct modules
- GraphQL client handles authentication and error responses (see the sketch below)
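
A sketch of how a read flows through the wrapper. Prefer the high-level helpers in `ctl_commands.py`; the `run_graphql_query` name is taken from `graphql.py` and is an assumption:

```python
import runpod
from runpod.api.graphql import run_graphql_query  # assumed low-level entry point

runpod.api_key = "your_api_key"

# High-level helper (preferred): wraps a query from runpod/api/queries/
pods = runpod.get_pods()

# Low-level equivalent: the engine attaches the API key to the request
# and raises on GraphQL error responses.
raw = run_graphql_query("query Pods { myself { pods { id name } } }")
```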

### CLI Design Philosophy
- Modular command groups using Click
- Rich terminal output with progress indicators
- Configuration wizard for user onboarding
- SSH integration for pod access

### Serverless Worker Architecture
- Auto-scaling based on job queue depth (see the `concurrency_modifier` sketch after this list)
- Health monitoring with configurable intervals
- Structured logging throughout worker lifecycle
- Local development server mirrors production environment
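
A sketch tying two of these points together: the `concurrency_modifier` hook used for auto-scaling, and `RunPodLogger` for structured logs. The exact calling convention of the hook is an assumption; verify against `rp_scale.py`:

```python
import runpod

log = runpod.RunPodLogger()

def handler(job):
    log.info(f"processing job {job.get('id')}")  # structured worker logging
    return {"output": job["input"]}

def concurrency_modifier(current_concurrency: int) -> int:
    # Assumed contract: called between polls with the current concurrency,
    # returns the desired concurrency for the next iteration.
    desired = 4  # e.g. derive from queue depth or GPU memory headroom
    return max(1, min(desired, current_concurrency + 1))

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```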

### File Organization Principles
- Source code mirrors API/functional boundaries
- Tests mirror source structure exactly
- Clear separation between API wrapper and worker SDK
- CLI commands grouped by functional area

## Testing Strategy Notes

When working with this codebase:
- Always run full test suite before major changes (`uv run pytest`)
- Use local worker testing for serverless development (`--rp_serve_api` flag)
- Integration tests require proper mocking of Runpod API responses
- Async tests require careful setup of event loops and timeouts
- **Lock file usage**: `uv.lock` ensures reproducible test environments
- **CI/CD integration**: Tests run automatically on PR with uv for consistent results

## Modern Development Workflow

### Key Improvements
- **uv adoption**: Faster dependency resolution and installation
- **Lock file management**: `uv.lock` ensures deterministic builds across environments
- **Release automation**: release-please handles versioning and changelog generation
- **Worker ecosystem**: Automated notifications to dependent worker repositories
- **Manual override**: Workflow dispatch allows manual publishing when needed
- **Enhanced CI**: Python version matrix testing with uv for improved reliability
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
include runpod/serverless/binaries/gpu_test
include runpod/serverless/binaries/README.md
include build_tools/gpu_test.c
include build_tools/compile_gpu_test.sh
43 changes: 43 additions & 0 deletions README.md
@@ -106,6 +106,49 @@ You can also test your worker locally before deploying it to Runpod. This is use
python my_worker.py --rp_serve_api
```

### Worker Fitness Checks

Fitness checks allow you to validate your worker environment at startup before processing jobs. If any check fails, the worker exits immediately, allowing your orchestrator to restart it.

```python
# my_worker.py

import runpod
import torch

# Register fitness checks using the decorator
@runpod.serverless.register_fitness_check
def check_gpu_available():
    """Verify GPU is available."""
    if not torch.cuda.is_available():
        raise RuntimeError("GPU not available")

@runpod.serverless.register_fitness_check
def check_disk_space():
    """Verify sufficient disk space."""
    import shutil

    stat = shutil.disk_usage("/")
    free_gb = stat.free / (1024**3)
    if free_gb < 10:
        raise RuntimeError(f"Insufficient disk space: {free_gb:.2f}GB free")

def handler(job):
    job_input = job["input"]
    # Your handler code here
    return {"output": "success"}

# Fitness checks run before handler initialization (production only)
runpod.serverless.start({"handler": handler})
```

**Key Features:**
- Supports both synchronous and asynchronous check functions (async example below)
- Checks run only once at worker startup (production mode)
- Runs before handler initialization and job processing begins
- Any check failure exits with code 1 (worker marked unhealthy)
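
Async checks are registered the same way as sync ones. A minimal sketch, assuming a hypothetical local model server exposing a `/health` route:

```python
import aiohttp
import runpod

@runpod.serverless.register_fitness_check
async def check_model_server():
    """Async checks are awaited at startup, just like sync checks are called."""
    # Hypothetical dependency; replace the URL with your own service.
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get("http://localhost:5000/health") as resp:
            if resp.status != 200:
                raise RuntimeError(f"Model server unhealthy: HTTP {resp.status}")
```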

See [Worker Fitness Checks](https://github.com/runpod/runpod-python/blob/main/docs/serverless/worker_fitness_checks.md) documentation for more examples and best practices.

## 📚 | API Language Library (GraphQL Wrapper)

When interacting with the Runpod API you can use this library to make requests to the API.
49 changes: 49 additions & 0 deletions build_tools/compile_gpu_test.sh
@@ -0,0 +1,49 @@
#!/bin/bash
# Compile gpu_test binary for Linux x86_64 with CUDA support
# Usage: ./compile_gpu_test.sh
# Output: ../runpod/serverless/binaries/gpu_test

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
OUTPUT_DIR="$SCRIPT_DIR/../runpod/serverless/binaries"
CUDA_VERSION="${CUDA_VERSION:-11.8.0}"
UBUNTU_VERSION="${UBUNTU_VERSION:-ubuntu22.04}"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

echo "Compiling gpu_test binary..."
echo "CUDA Version: $CUDA_VERSION"
echo "Ubuntu Version: $UBUNTU_VERSION"
echo "Output directory: $OUTPUT_DIR"

# Build in Docker container with NVIDIA CUDA development environment
docker run --rm \
    -v "$SCRIPT_DIR:/workspace" \
    "nvidia/cuda:${CUDA_VERSION}-devel-${UBUNTU_VERSION}" \
    bash -c "
        cd /workspace && \
        nvcc -O3 \
            -arch=sm_70 \
            -gencode=arch=compute_70,code=sm_70 \
            -gencode=arch=compute_75,code=sm_75 \
            -gencode=arch=compute_80,code=sm_80 \
            -gencode=arch=compute_86,code=sm_86 \
            -o gpu_test \
            gpu_test.c -lnvidia-ml -lcudart_static && \
        echo 'Compilation successful' && \
        file gpu_test
    "

# Move binary to output directory
if [ -f "$SCRIPT_DIR/gpu_test" ]; then
    mv "$SCRIPT_DIR/gpu_test" "$OUTPUT_DIR/gpu_test"
    chmod +x "$OUTPUT_DIR/gpu_test"
    echo "Binary successfully created at: $OUTPUT_DIR/gpu_test"
    echo "Binary info:"
    file "$OUTPUT_DIR/gpu_test"
else
    echo "Error: Compilation failed, binary not found"
    exit 1
fi