Add Zenodo data source with fetch implementation #250
base: main
Conversation
- Add zenodo_fetch.py with REST API integration
- Implements CC license filtering and standardization
- Uses shared retry mechanism for robust API handling

- Add pycountry dependency to Pipfile
- Add get_language_name() function to shared.py
- Supports language code and name standardization for zenodo_fetch.py

- Add pycountry to Pipfile for language standardization
- Update Pipfile.lock with resolved dependencies
- Required for zenodo_fetch.py language mapping functionality

- Required for language standardization in zenodo_fetch.py

- Ensure scripts can be executed directly on remote systems
- Required for proper script execution in CI/CD environments
scripts/1-fetch/zenodo_fetch.py
Outdated
    # Other CC licenses
    "cc-nc": "CC NC",
    "cc-sa": "CC SA",
    "cc-nd": "CC ND",
These are not valid legal tools (missing version).
@TimidRobot, Thanks. I have updated the script to use valid licenses with version numbers and organised them into the sections of the script where they belong.
scripts/1-fetch/zenodo_fetch.py
Outdated
    # CC Certification
    "cc-certification-1.0-us": "CC CERTIFICATION 1.0 US",
    # Legacy CC licenses
    "cc-publicdomain": "CC Public Domain",
Which legal tools does this refer to?
- CERTIFICATION 1.0 US
- (former URL: https://creativecommons.org/licenses/publicdomain/)
- (former URL:
- PDM 1.0
@TimidRobot, I have gone over these license selections and updated them to use valid version numbers. I have also removed the legacy and retired licenses, since the returned data contained none of them.
scripts/1-fetch/zenodo_fetch.py
Outdated
    license_id = license_data.get("id", "")  # API returns lowercase IDs

    # Focus on Creative Commons licenses - map to standardized names
    cc_license_mapping = {
The overwhelming majority of this mapping could be replaced with .upper().replace("-", " ")
@TimidRobot, Thanks for pointing this out, it has been updated.
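A minimal sketch of the suggestion in this thread, assuming the license IDs are lowercase and hyphenated as in the diff excerpts above. The helper name `normalize_license_id` and the sample IDs are illustrative, not taken from the PR.

```python
def normalize_license_id(license_id: str) -> str:
    """Derive a display name from a lowercase, hyphenated license ID.

    Hypothetical helper for illustration; not code from zenodo_fetch.py.
    """
    return license_id.upper().replace("-", " ")


if __name__ == "__main__":
    # Sample IDs are assumptions chosen to show the transformation.
    for sample in ("cc-by-4.0", "cc-by-nc-sa-4.0", "cc0-1.0"):
        print(sample, "->", normalize_license_id(sample))
```

Any ID whose standardized name does not follow this pattern would still need an explicit mapping entry, which is presumably why the comment says "overwhelming majority" rather than "all".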
scripts/1-fetch/zenodo_fetch.py
Outdated
    try:
        response = session.get(
            ZENODO_API_BASE_URL, params=params, timeout=60
60s seems too long for a timeout, especially with the records capped at 300.
scripts/1-fetch/zenodo_fetch.py
Outdated
    except requests.RequestException as e:
        LOGGER.error(f"Error fetching Zenodo records: {e}")
        raise
    except json.JSONDecodeError as e:
        LOGGER.error(f"Error parsing JSON response: {e}")
        raise
please raise shared.QuantifyingException instead of logging and raising
I already gave you feedback about using this exception:
The pattern is also throughout the fetch scripts in the main branch.
Please focus on the quality of submissions, not quantity.
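A minimal sketch of the pattern requested in this thread: raise the project exception directly instead of logging and re-raising. The `QuantifyingException` class below is a local stand-in whose constructor signature is an assumption; the real class lives in the project's `shared` module and may differ. The function name `parse_records` is also hypothetical.

```python
import json


class QuantifyingException(Exception):
    """Stand-in for shared.QuantifyingException (signature is assumed)."""

    def __init__(self, message, exit_code=1):
        self.message = message
        self.exit_code = exit_code
        super().__init__(message)


def parse_records(body: str):
    """Parse a response body, converting parse failures to the project
    exception (hypothetical helper, not code from zenodo_fetch.py)."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as e:
        # No LOGGER.error() followed by a bare raise: the exception carries
        # the message and the top-level handler decides how to report it.
        raise QuantifyingException(
            f"Error parsing JSON response: {e}"
        ) from e
```

With this shape, the error is reported once at the top level rather than logged at every call site.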
- Change MAX_RECORDS_PER_REQUEST from 300 to 25 for unauthenticated calls
- Fix query parameter handling to only include when not empty
- Change default sort from 'bestmatch' to 'newest'
- Update base_query from '*' to empty string for all records
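The parameter handling described in this commit message could look roughly like the sketch below. The function name `build_params` and the query parameter name `q` are assumptions; `page`, `size`, the `newest` sort value, and the limit of 25 come from this PR's text.

```python
from urllib.parse import urlencode

MAX_RECORDS_PER_REQUEST = 25  # value named in the commit message


def build_params(
    query: str = "", page: int = 1, sort: str = "newest"
) -> dict:
    """Build request parameters (hypothetical helper for illustration)."""
    params = {"page": page, "size": MAX_RECORDS_PER_REQUEST, "sort": sort}
    # Only include the query when it is not empty; an empty base query
    # is taken to mean "all records". The name "q" is an assumption.
    if query:
        params["q"] = query
    return params


if __name__ == "__main__":
    print(urlencode(build_params()))
    print(urlencode(build_params("publication_date:[2020-01-01 TO *]")))
```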
Fixes
Problem Statement
The Quantifying the Commons project currently lacks coverage of academic and research outputs that are openly licensed. Zenodo, operated by CERN, hosts over 2 million research artifacts including publications, datasets, software, and presentations with Creative Commons licensing. This represents a significant gap in our quantification of the commons, particularly in academic and scientific domains.
Solution Overview
Integrate Zenodo's REST API to fetch and process Creative Commons licensed works, expanding our coverage to include:
Technical Implementation
API Details
Primary API: Zenodo REST API v1
- https://zenodo.org/api/records
Alternative APIs Considered:
- https://zenodo.org/oai2d (rejected due to limited metadata)
Query Strategy
Primary Query Approach:
Pagination Strategy:
- page and size parameters (max 1000 records per page)
- links.next for continuation
Filtering Criteria:
- access_right:open - Only open access works
- created:[2000-01-01 TO *]
- publication, dataset, software, presentation
Data Processing Strategy
Metadata Extraction:
Deduplication Logic:
API Limitations & Implementation Strategy
The Zenodo REST API implementation addresses operational constraints through specific technical approaches implemented in the fetch script.
Rate Limiting Handling: The implementation uses the project's standardized retry mechanism via shared.STATUS_FORCELIST through an HTTPAdapter with a Retry strategy (3 retries, backoff_factor=2). This handles transient HTTP errors, including the 504 Gateway Timeouts that commonly occur during sustained Zenodo API usage. A 2-second delay between requests (time.sleep(2.0)) keeps the script within rate limits while maintaining reasonable throughput.
Query Result Management: The implementation uses a broad query approach ("*") because of limitations observed in Zenodo's search infrastructure. In testing, license-specific queries (e.g., license:"cc-by-4.0") consistently resulted in 503/504 errors, while simple queries worked reliably. Optional date filtering (publication_date:[YYYY-01-01 TO *] when --dates-back is specified) is supported. Creative Commons license filtering is done by examining the structured metadata.license.id field in each record's API response and mapping CC license identifiers to standardized names. Records are processed in batches of 100 per page with configurable fetch limits.
Data Quality Considerations: License information may be incomplete for older records, so records with "Unknown" or "No License" classifications are filtered out during processing. Geographic data extraction relies on available author affiliation metadata and does not attempt to supplement missing information.
Timeout and Error Handling: The script sets a 60-second timeout for API requests and relies on the standardized retry mechanism for error handling. Failed extractions are logged but don't halt the entire process, so the script keeps running even when metadata quality is inconsistent.
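The paging and client-side license filtering described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code in zenodo_fetch.py: the `fetch_page` callable, the `hits.hits` response shape, and the rule that a Creative Commons ID starts with "cc" are all assumptions, while `metadata.license.id`, `links.next`, the fetch limit, and the delay between requests come from the description above.

```python
import time
from typing import Callable, Iterator


def license_id_of(record: dict) -> str:
    """Return the lowercase ID from metadata.license.id, or '' if absent."""
    license_data = record.get("metadata", {}).get("license")
    if not isinstance(license_data, dict):
        return ""
    return str(license_data.get("id", "")).lower()


def is_cc_licensed(record: dict) -> bool:
    """Assumption: Creative Commons IDs start with 'cc' (cc-by-4.0, cc0-1.0)."""
    return license_id_of(record).startswith("cc")


def iter_cc_records(
    fetch_page: Callable[[int], dict], limit: int, delay: float = 0.0
) -> Iterator[dict]:
    """Yield CC-licensed records, examining at most `limit` records.

    `fetch_page(page)` is a hypothetical callable returning one decoded
    JSON page; the real script would issue the HTTP request there.
    """
    examined = 0
    page = 1
    while examined < limit:
        payload = fetch_page(page)
        hits = payload.get("hits", {}).get("hits", [])  # assumed shape
        if not hits:
            break
        for record in hits:
            if examined >= limit:
                break
            examined += 1
            # Records without a CC license ID are skipped, not errors.
            if is_cc_licensed(record):
                yield record
        # Stop when the response offers no continuation link.
        if not payload.get("links", {}).get("next"):
            break
        page += 1
        time.sleep(delay)  # the description above uses 2.0 seconds
```

Passing the page fetcher in as a callable keeps the filtering logic testable without network access.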
Implementation Details
Dependencies
- requests - HTTP client with retry logic
- pygments - Syntax highlighting for error output
- urllib3 - HTTP adapter and retry utilities
Error Handling Strategy
Useful Links
Zenodo Resources:
License Information:
Checklist
- Update index.md).
- main or master).
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."