-
Notifications
You must be signed in to change notification settings - Fork 5
Add archive verification command #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds archive verification functionality to validate that archived fastq files match their recorded MD5 checksums. The implementation includes a new verify_archive helper function, a command-line entry point verify_main, and corresponding tests for both success and failure scenarios.
Key Changes:
- Added
verify_archivefunction to validate MD5 checksums of archived files - Exposed verification functionality via new
verify_backupCLI command - Added tests covering successful verification and corruption detection
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| seqBackupLib/backup.py | Implements verify_archive function for MD5 validation and verify_main CLI entry point |
| test/test_backup.py | Adds tests for successful verification and detection of file corruption |
| pyproject.toml | Registers new verify_backup script entry point |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for line in md5_file: | ||
| expected = line.strip().split("\t") | ||
| if len(expected) != 2: | ||
| raise ValueError(f"Invalid md5 line in {md5_fp}: {line}") |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message could be more helpful. When an invalid MD5 line is encountered, the error includes the entire line in the message, but doesn't indicate which specific field or format issue was detected.
Consider improving the error message to be more specific:
raise ValueError(
f"Invalid md5 line in {md5_fp}: expected 2 tab-separated values, "
f"got {len(expected)} in line: {line.strip()}"
)| raise ValueError(f"Invalid md5 line in {md5_fp}: {line}") | |
| raise ValueError( | |
| f"Invalid md5 line in {md5_fp}: expected 2 tab-separated values, " | |
| f"got {len(expected)} in line: {line.strip()}" | |
| ) |
|
|
||
|
|
||
| def verify_archive(archive_dir: Path) -> bool: | ||
| md5_files = list(archive_dir.glob("*.md5")) | ||
| if not md5_files: | ||
| raise FileNotFoundError(f"No md5 file found in {archive_dir}") | ||
|
|
||
| if len(md5_files) > 1: | ||
| warnings.warn( | ||
| f"Multiple md5 files found in {archive_dir}. Using {md5_files[0].name}." | ||
| ) | ||
|
|
||
| md5_fp = md5_files[0] | ||
| missing_files = [] | ||
| mismatched_hashes = [] | ||
|
|
||
| with open(md5_fp) as md5_file: | ||
| for line in md5_file: | ||
| expected = line.strip().split("\t") | ||
| if len(expected) != 2: | ||
| raise ValueError(f"Invalid md5 line in {md5_fp}: {line}") | ||
| filename, expected_md5 = expected | ||
| file_fp = archive_dir / filename | ||
|
|
||
| if not file_fp.is_file(): | ||
| missing_files.append(filename) | ||
| continue | ||
|
|
||
| computed_md5 = return_md5(file_fp) | ||
| if computed_md5 != expected_md5: |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Potential performance issue: The function processes all files before reporting any errors. For large archives with many files, this means the entire verification must complete even if the first file fails.
While collecting all errors provides comprehensive feedback, consider whether early termination on first failure would be more appropriate for this use case. If comprehensive reporting is desired, the current approach is acceptable but could benefit from progress logging for large archives.
|
|
||
| with open(md5_fp) as md5_file: | ||
| for line in md5_file: | ||
| expected = line.strip().split("\t") |
Copilot
AI
Nov 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MD5 file parsing doesn't handle empty lines or lines with only whitespace. If the MD5 file contains blank lines, they will cause a ValueError when len(expected) != 2.
Consider adding a check to skip empty lines:
for line in md5_file:
line = line.strip()
if not line: # Skip empty lines
continue
expected = line.split("\t")
if len(expected) != 2:
raise ValueError(f"Invalid md5 line in {md5_fp}: {line}")| expected = line.strip().split("\t") | |
| line = line.strip() | |
| if not line: # Skip empty or whitespace-only lines | |
| continue | |
| expected = line.split("\t") |
Summary
Testing
Codex Task