Skip to content

Conversation

@Ulthran
Copy link
Contributor

@Ulthran Ulthran commented Nov 18, 2025

Summary

  • add a verification command and helper to validate archived files against recorded md5 sums
  • expose the new verification entry point via pyproject scripts
  • cover verification success and failure cases with new tests

Testing

  • pytest

Codex Task

Copilot AI review requested due to automatic review settings November 18, 2025 16:21
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds archive verification functionality to validate that archived fastq files match their recorded MD5 checksums. The implementation includes a new verify_archive helper function, a command-line entry point verify_main, and corresponding tests for both success and failure scenarios.

Key Changes:

  • Added verify_archive function to validate MD5 checksums of archived files
  • Exposed verification functionality via new verify_backup CLI command
  • Added tests covering successful verification and corruption detection

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
seqBackupLib/backup.py Implements verify_archive function for MD5 validation and verify_main CLI entry point
test/test_backup.py Adds tests for successful verification and detection of file corruption
pyproject.toml Registers new verify_backup script entry point

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

for line in md5_file:
expected = line.strip().split("\t")
if len(expected) != 2:
raise ValueError(f"Invalid md5 line in {md5_fp}: {line}")
Copy link

Copilot AI Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message could be more helpful. When an invalid MD5 line is encountered, the error includes the entire line in the message, but doesn't indicate which specific field or format issue was detected.

Consider improving the error message to be more specific:

raise ValueError(
    f"Invalid md5 line in {md5_fp}: expected 2 tab-separated values, "
    f"got {len(expected)} in line: {line.strip()}"
)
Suggested change
raise ValueError(f"Invalid md5 line in {md5_fp}: {line}")
raise ValueError(
f"Invalid md5 line in {md5_fp}: expected 2 tab-separated values, "
f"got {len(expected)} in line: {line.strip()}"
)

Copilot uses AI. Check for mistakes.
Comment on lines 117 to +146


def verify_archive(archive_dir: Path) -> bool:
md5_files = list(archive_dir.glob("*.md5"))
if not md5_files:
raise FileNotFoundError(f"No md5 file found in {archive_dir}")

if len(md5_files) > 1:
warnings.warn(
f"Multiple md5 files found in {archive_dir}. Using {md5_files[0].name}."
)

md5_fp = md5_files[0]
missing_files = []
mismatched_hashes = []

with open(md5_fp) as md5_file:
for line in md5_file:
expected = line.strip().split("\t")
if len(expected) != 2:
raise ValueError(f"Invalid md5 line in {md5_fp}: {line}")
filename, expected_md5 = expected
file_fp = archive_dir / filename

if not file_fp.is_file():
missing_files.append(filename)
continue

computed_md5 = return_md5(file_fp)
if computed_md5 != expected_md5:
Copy link

Copilot AI Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Potential performance issue: The function processes all files before reporting any errors. For large archives with many files, this means the entire verification must complete even if the first file fails.

While collecting all errors provides comprehensive feedback, consider whether early termination on first failure would be more appropriate for this use case. If comprehensive reporting is desired, the current approach is acceptable but could benefit from progress logging for large archives.

Copilot uses AI. Check for mistakes.

with open(md5_fp) as md5_file:
for line in md5_file:
expected = line.strip().split("\t")
Copy link

Copilot AI Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MD5 file parsing doesn't handle empty lines or lines with only whitespace. If the MD5 file contains blank lines, they will cause a ValueError when len(expected) != 2.

Consider adding a check to skip empty lines:

for line in md5_file:
    line = line.strip()
    if not line:  # Skip empty lines
        continue
    expected = line.split("\t")
    if len(expected) != 2:
        raise ValueError(f"Invalid md5 line in {md5_fp}: {line}")
Suggested change
expected = line.strip().split("\t")
line = line.strip()
if not line: # Skip empty or whitespace-only lines
continue
expected = line.split("\t")

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants