Commit 5f9c46b

mpryc and weshayutin authored
PR to analyze failures based on the Tigers PR. (#2038)
* PR to analyze failures based on the Tigers PR.

  Original PR: #2034
  Author: Tiger Kaovilai <tkaovila@redhat.com>
  Date: Fri Nov 28 09:17:23 2025 -0500

  refactor: update runtime permissions in failure analysis documentation and scripts

  - Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
  - Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
  - Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

  These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

  Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* update to ignore hypershift error

* remove parks app for now, to stablize

---------

Co-authored-by: Wesley Hayutin <weshayutin@gmail.com>
1 parent f75a38d commit 5f9c46b

8 files changed: +2119 -43 lines changed
.claude/config.json

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+{
+  "permissions": {
+    "allow": [
+      "Read",
+      "Glob",
+      "Grep",
+      "Bash(ls:*)",
+      "Bash(cat:*)",
+      "Bash(head:*)",
+      "Bash(tail:*)",
+      "Bash(grep:*)",
+      "Bash(sed:*)",
+      "Bash(awk:*)",
+      "Bash(find:*)",
+      "Bash(tree:*)",
+      "Bash(wc:*)",
+      "Bash(sort:*)",
+      "Bash(uniq:*)",
+      "Bash(cut:*)",
+      "Bash(tr:*)",
+      "Bash(jq:*)",
+      "Bash(less:*)",
+      "Bash(more:*)",
+      "Bash(file:*)",
+      "Bash(du:*)",
+      "Bash(stat:*)",
+      "Bash(zcat:*)",
+      "Bash(gunzip:*)",
+      "Bash(tar:*)"
+    ],
+    "deny": [
+      "Write",
+      "Edit",
+      "Bash(rm:*)",
+      "Bash(curl:*)",
+      "Bash(wget:*)",
+      "Bash(git:push*)",
+      "Bash(docker:*)",
+      "Bash(kubectl:delete*)",
+      "Bash(kubectl:apply*)",
+      "Bash(make:*)",
+      "WebFetch",
+      "WebSearch"
+    ]
+  }
+}
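
Note on the allow/deny lists above: the following is a minimal sketch of how such rules might be evaluated, not the tool's actual matcher. It assumes that a rule like `Bash(ls:*)` matches a command whose first word is `ls`, and that a deny match takes precedence over an allow match; neither assumption is stated in this commit. The function name `classify_command` and the `unlisted` result are hypothetical, and only a subset of the rules is copied in.

```shell
#!/bin/sh
# Hypothetical helper: classify a command line against a few of the rules
# from .claude/config.json. Assumes first-word matching and deny-before-allow.
classify_command() {
    cmd=$1
    # Subset of the "deny" list (rules with an unambiguous command name).
    for prefix in rm curl wget docker make; do
        case "$cmd" in
            "$prefix"|"$prefix "*) echo "deny"; return 0 ;;
        esac
    done
    # Subset of the "allow" list.
    for prefix in ls cat head tail grep jq tar; do
        case "$cmd" in
            "$prefix"|"$prefix "*) echo "allow"; return 0 ;;
        esac
    done
    # Not covered by either list in this sketch.
    echo "unlisted"
}
```

For example, `classify_command 'tar -tzf must-gather.tar.gz'` falls under the allow subset, while `classify_command 'curl https://example.com'` is denied.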

CLAUDE.md

Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+OADP (OpenShift API for Data Protection) is a Kubernetes operator that installs and manages Velero for backup and restore operations in OpenShift clusters. It extends Velero with OpenShift-specific features like Security Context Constraints (SCC), cloud credential management, and monitoring integration.
+
+## Prerequisites
+
+**Go Version**: Go 1.24.0 (with toolchain go1.24.5)
+
+**macOS Users**: Install GNU sed (required for bundle generation and other targets)
+
+```bash
+brew install gnu-sed
+```
+
+**Container Tool**: Docker or Podman (auto-detected, defaults to Docker if available)
+
+- Override with: `CONTAINER_TOOL=podman make <target>`
+
+**Tool Version Checking**: Run `make versions` to check all tool versions and detect mismatches
+
+## Development Commands
+
+### Essential Commands
+
+```bash
+# Discovery and validation
+make help # Display all available targets with descriptions
+make versions # Check tool versions and detect mismatches
+
+# Development workflow
+make test # Run unit tests, linting, and validation (recommended before commits)
+make build # Build manager binary
+make deploy-olm # Deploy for testing via OLM (recommended for PR testing)
+make undeploy-olm # Remove OLM deployment
+
+# Code generation (run after API changes)
+make generate # Generate DeepCopy methods
+make manifests # Generate CRDs and RBAC manifests
+make bundle # Generate OLM bundle
+make api-isupdated # Check if API is up to date
+make bundle-isupdated # Check if bundle is up to date
+
+# Linting and formatting
+make lint # Run golangci-lint
+make lint-fix # Fix linting issues automatically
+make fmt # Format code with go fmt
+
+# Special targets
+make update-non-admin-manifests # Update NAC manifests from external repo
+```
+
+### Testing Commands
+
+```bash
+make test-e2e # Run end-to-end tests (requires setup)
+make test-e2e-setup # Setup E2E test environment
+make test-e2e-cleanup # Cleanup after E2E tests
+
+# Test variations
+TEST_VIRT=true make test-e2e # Run virtualization tests
+TEST_UPGRADE=true make test-e2e # Run upgrade tests
+TEST_CLI=true make test-e2e # Run CLI-based tests
+
+# Run focused tests
+GINKGO_ARGS="--ginkgo.focus='test name'" make test-e2e
+```
+
+### Cloud Authentication Deployment
+
+Deploy OADP with cloud-native authentication (STS, Workload Identity, WIF):
+
+```bash
+make deploy-olm-stsflow # Deploy with standardized flow UI (interactive)
+make deploy-olm-stsflow-aws # Deploy with AWS STS
+make deploy-olm-stsflow-gcp # Deploy with GCP Workload Identity Federation
+make deploy-olm-stsflow-azure # Deploy with Azure Workload Identity
+```
+
+These targets automate cloud credential setup using cloud-native identity providers instead of manual credential files. The standardized flow provides an interactive UI for configuration.
+
+### E2E Test Setup Requirements
+
+E2E tests require these environment variables:
+
+- `OADP_CRED_FILE`: Path to backup location credentials
+- `OADP_BUCKET`: S3 bucket name for backups
+- `CI_CRED_FILE`: Path to snapshot location credentials
+- `VSL_REGION`: Volume snapshot location region
+- `BSL_REGION`: Backup storage location region (optional, defaults to us-east-1)
+
+**Test Labels**: Tests are filtered by cloud provider labels: `aws`, `gcp`, `azure`, `ibmcloud`, `virt`, `hcp`, `cli`, `upgrade`
+
+**Common Test Issues**:
+
+- ttl.sh images expire after TTL_DURATION (default 1h), which may cause test failures if running tests long after initial deployment
+
+## Important Environment Variables
+
+**Operator Configuration**:
+
+- `IMG`: Custom operator image (default: `quay.io/konveyor/oadp-operator:latest`)
+- `VERSION`: Override version (default: `99.0.0`)
+- `OADP_TEST_NAMESPACE`: Namespace for operator (default: `openshift-adp`)
+
+**Image Build and Registry**:
+
+- `CONTAINER_TOOL`: Container tool to use (`docker` or `podman`, auto-detected)
+- `TTL_DURATION`: ttl.sh image expiry time (default: `1h`, max: `24h`)
+- `BUNDLE_IMG`: Custom bundle image
+
+**Cloud Provider Credentials** (for E2E tests):
+
+- `OADP_CRED_FILE`, `OADP_BUCKET`, `CI_CRED_FILE`: Backup/snapshot credentials
+- `VSL_REGION`, `BSL_REGION`: Cloud regions for volume/backup storage locations
+
+## Git Repository Information
+
+**Upstream Repository**: `openshift/oadp-operator`
+
+**IMPORTANT - Pull Request Target**: Always target `oadp-dev` branch for PRs, NOT `main`
+
+**Branch Structure**:
+
+- Development branch: `oadp-dev` (target for all PRs)
+- Release branches: `oadp-major.minor` (e.g., `oadp-1.4`, `oadp-1.5`)
+- Many remote branches from various contributors exist
+
+You can verify the current default branch with `git ls-remote --symref upstream HEAD`.
+
+## Architecture Overview
+
+### Core APIs (Custom Resources)
+
+- **DataProtectionApplication (DPA)**: Primary resource that configures the entire OADP/Velero stack
+- **CloudStorage**: Manages cloud storage configurations for backup locations
+- **DataProtectionTest**: Framework for testing backup/restore operations
+- **Non-Admin resources**: Enable multi-tenant backup scenarios (NonAdminBackup, NonAdminRestore)
+
+### Key Controllers
+
+- **DataProtectionApplicationReconciler**: Main controller that orchestrates Velero deployment and configuration
+- **CloudStorageReconciler**: Manages cloud storage backend setup
+- **DataProtectionTestReconciler**: Handles data protection testing workflows
+
+### Package Structure
+
+- `api/v1alpha1/`: CRD type definitions and API schemas
+- `internal/controller/`: Controller implementations and reconciliation logic
+- `pkg/credentials/`: Cloud credential management and authentication flows
+- `pkg/velero/`: Velero-specific utilities and integration code
+- `pkg/cloudprovider/`: Multi-cloud provider abstractions (AWS, Azure, GCP, IBM)
+- `tests/e2e/`: Comprehensive end-to-end test suites using Ginkgo
+
+### Integration Points
+
+The operator manages these key integrations:
+
+- **Velero**: Core backup/restore engine with OpenShift-specific patches
+- **Cloud Providers**: AWS (including STS), Azure (Workload Identity), GCP (WIF), IBM Cloud, OpenStack
+- **OpenShift**: SCC management, monitoring integration, image registry
+- **Storage**: CSI snapshots, data mover functionality for cross-cluster scenarios
+
+### Development Workflow
+
+1. Use `make deploy-olm` for testing code changes (builds and deploys current branch)
+2. Always run `make test` before committing to validate code quality
+3. For API changes: run `make generate && make manifests && make bundle`
+4. E2E tests require cloud credentials and should be run in appropriate test environments
+5. The operator follows standard controller-runtime patterns with comprehensive validation and status reporting
+
+### Special Features
+
+- **Multi-cloud standardized authentication**: Supports cloud-native identity (STS, WIF, Workload Identity)
+- **Non-admin backup**: Multi-tenant backup capabilities for namespace-scoped users
+- **Data mover**: Cross-cluster backup/restore using VolSync integration
+- **OpenShift Virtualization**: Backup/restore support for KubeVirt VMs
+- **Must-gather integration**: Diagnostic collection for troubleshooting
+
+### Bundle and Release Management
+
+- Uses OLM (Operator Lifecycle Manager) for deployment and upgrades
+- Bundle generation includes multiple service accounts (velero, non-admin-controller)
+- Supports multiple channels (dev, stable) for different release streams
+- Version compatibility matrix maintained in `PARTNERS.md`
+
+When making changes, always consider the multi-cloud nature of the operator and test against the comprehensive E2E suite that covers various cloud providers and backup scenarios.
+
+## CI/Prow Testing
+
+E2E tests in presubmit CI are automatically triggered via OpenShift's Prow infrastructure:
+
+**CI Configuration**: Tests are defined in the [openshift/release](https://github.com/openshift/release) repository at:
+- `ci-operator/config/openshift/oadp-operator/openshift-oadp-operator-oadp-dev__4.20.yaml`
+
+**Test Container Image**: The `test-oadp-operator` image is built from [build/ci-Dockerfile](build/ci-Dockerfile), which:
+- Uses `quay.io/konveyor/builder` as the base image
+- Installs kubectl for cluster operations
+- Downloads Go dependencies and prepares the build environment
+- Provides the runtime environment for executing E2E tests in CI
+
+**How it works**:
+1. When a PR is opened against `oadp-dev`, Prow automatically triggers configured test jobs
+2. The ci-Dockerfile builds a test container with all necessary dependencies
+3. E2E tests run inside this container against a provisioned OpenShift cluster
+4. Test results are reported back to the PR
+
+**Viewing test results**: Check the PR's "Checks" tab or visit [prow.ci.openshift.org](https://prow.ci.openshift.org) for detailed test logs.
+
+### Automated Failure Analysis with Claude
+
+When E2E tests fail in Prow CI, Claude Code automatically analyzes the failures and generates a comprehensive report.
+
+**How it works**:
+
+1. After test execution completes with failures, the analysis script (`tests/e2e/scripts/analyze_failures.sh`) is invoked
+2. Claude runs in headless mode (`--print` flag) for non-interactive CI automation via Vertex AI
+3. Claude analyzes artifacts written by the E2E test code: JUnit reports, must-gather diagnostics, and per-test pod logs
+4. A detailed markdown report is generated at `${ARTIFACT_DIR}/claude-failure-analysis.md`
+5. The report includes root cause analysis, known flake detection, and actionable recommendations
+
+**Important**: Claude analyzes only artifacts generated during test execution (JUnit, must-gather, per-test logs). Prow's build-log.txt is written by CI infrastructure after tests complete and is not available during analysis.
+
+**Accessing the analysis**:
+
+- Find `claude-failure-analysis.md` in the Prow artifacts directory alongside other test outputs
+- URL pattern: `https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oadp-operator/<PR>/<job-name>/<build-id>/artifacts/claude-failure-analysis.md`
+
+**Configuration**:
+
+- Analysis requires Vertex AI credentials configured in the CI environment
+- Gracefully skips if credentials are not available (no impact on test execution)
+- Can be disabled by setting `SKIP_CLAUDE_ANALYSIS=true`
+- **Automatic secret redaction**: API keys, tokens, passwords, and credentials are automatically redacted from output
+
+For more details, see the [design document](docs/design/claude-prow-failure-analysis_design.md).
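
Note on the "Configuration" bullets in CLAUDE.md: `analyze_failures.sh` itself is not shown in this part of the diff, so the following is only a sketch of the gating the document describes (run on failure, skip when disabled, skip when credentials are missing). The function name `analysis_decision` and its output strings are hypothetical. `SKIP_CLAUDE_ANALYSIS` and `GOOGLE_APPLICATION_CREDENTIALS` are the variable names that appear in this commit, but how the real script checks them is an assumption.

```shell
#!/bin/sh
# Hypothetical gate: decide whether failure analysis should run.
# $1 is the exit code of the E2E test run.
analysis_decision() {
    exit_code=$1
    # Analysis is described as running only after failures.
    if [ "$exit_code" -eq 0 ]; then
        echo "skip: tests passed"
        return 0
    fi
    # Explicit opt-out documented in CLAUDE.md.
    if [ "${SKIP_CLAUDE_ANALYSIS:-}" = "true" ]; then
        echo "skip: disabled"
        return 0
    fi
    # Graceful skip when no credentials file is available (assumed check).
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] || [ ! -f "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "skip: no credentials"
        return 0
    fi
    echo "run"
}
```

Every branch returns 0, which mirrors the stated goal that a skipped analysis has no impact on test execution.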

Makefile

Lines changed: 14 additions & 1 deletion
@@ -870,7 +870,20 @@ test-e2e: test-e2e-setup install-ginkgo ## Run E2E tests against OADP operator i
     -kvm_emulation=$(KVM_EMULATION) \
     -hco_upstream=$(HCO_UPSTREAM) \
     -skipMustGather=$(SKIP_MUST_GATHER) \
-    $(HCP_EXTERNAL_ARGS)
+    $(HCP_EXTERNAL_ARGS) \
+    || EXIT_CODE=$$?; \
+    if [ "$(OPENSHIFT_CI)" = "true" ]; then \
+        if [ -f /var/run/oadp-credentials/gcp-claude-code-credentials ]; then \
+            export GOOGLE_APPLICATION_CREDENTIALS=/var/run/oadp-credentials/gcp-claude-code-credentials; \
+            export CLAUDE_CODE_USE_VERTEX=1; \
+            export CLOUD_ML_REGION=global; \
+            if [ -f /var/run/oadp-credentials/gcp-claude-code-project-id ]; then \
+                export ANTHROPIC_VERTEX_PROJECT_ID=$$(cat /var/run/oadp-credentials/gcp-claude-code-project-id); \
+            fi; \
+        fi; \
+        ./tests/e2e/scripts/analyze_failures.sh $${EXIT_CODE:-0}; \
+    fi; \
+    exit $${EXIT_CODE:-0}
 
 .PHONY: test-e2e-cleanup
 test-e2e-cleanup: login-required
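
Note on the Makefile hunk: the recipe captures the test command's exit status, runs a post-processing step, and then exits with the original status. Outside Make, where `$$` becomes a single `$`, the same pattern can be sketched as below. The function name `run_with_post_step` and the `echo` standing in for the analysis script are hypothetical.

```shell
#!/bin/sh
# Sketch of the "capture, post-process, re-raise" pattern used in test-e2e.
# "$@" is the test command to run.
run_with_post_step() {
    rc=0
    # On failure, remember the exit status instead of aborting.
    "$@" || rc=$?
    # The post-step always runs and receives the captured status
    # (stand-in for ./tests/e2e/scripts/analyze_failures.sh).
    echo "post-step saw exit code ${rc}"
    # Propagate the original status so CI still sees the failure.
    return "$rc"
}
```

The recipe leaves `EXIT_CODE` unset on success and relies on the `${EXIT_CODE:-0}` default, whereas this sketch initializes `rc=0` up front; the two are equivalent as long as `EXIT_CODE` is not already set in the environment.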

build/ci-Dockerfile

Lines changed: 13 additions & 2 deletions
@@ -5,11 +5,22 @@ WORKDIR /go/src/github.com/openshift/oadp-operator
 
 COPY ./ .
 
-# Install kubectl
-RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && \
+# Make analysis script executable for CI execution
+RUN chmod +x tests/e2e/scripts/analyze_failures.sh
+
+# Install kubectl (multi-arch)
+ARG TARGETARCH
+RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/${TARGETARCH}/kubectl" && \
     chmod +x kubectl && \
     mv kubectl /usr/local/bin/
 
+# Install Node.js and Claude CLI
+# Using NodeSource setup script for RHEL-based images
+RUN curl -fsSL https://rpm.nodesource.com/setup_20.x | bash - && \
+    dnf install -y nodejs && \
+    npm install -g @anthropic-ai/claude-code && \
+    dnf clean all
+
 RUN go mod download && \
     mkdir -p $(go env GOCACHE) && \
     chmod -R 777 ./ $(go env GOCACHE) $(go env GOPATH)
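
Note on the kubectl change: the download URL now takes its architecture segment from `TARGETARCH` instead of a hard-coded `amd64`. The sketch below shows how that URL is assembled. The function name `kubectl_url` is hypothetical, and the fallback to `amd64` when the architecture is empty is an assumption added for illustration; the Dockerfile in this commit has no such fallback, so a builder that does not populate `TARGETARCH` would likely produce a URL with an empty path segment.

```shell
#!/bin/sh
# Hypothetical helper: build the kubectl download URL used in build/ci-Dockerfile.
# $1 is the release version (e.g. the contents of stable.txt), $2 the target arch.
kubectl_url() {
    version=$1
    # Assumed fallback: default to amd64 if no architecture is supplied.
    arch=${2:-amd64}
    printf 'https://dl.k8s.io/release/%s/bin/linux/%s/kubectl\n' "$version" "$arch"
}
```

For example, `kubectl_url v1.30.0 arm64` yields the arm64 binary path, and `kubectl_url v1.30.0` falls back to amd64 (the version string here is only a placeholder).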
