@Goziee-git Goziee-git commented Nov 18, 2025

Fixes

Problem Statement

The Quantifying the Commons project currently lacks coverage of academic and research outputs that are openly licensed. Zenodo, operated by CERN, hosts over 2 million research artifacts including publications, datasets, software, and presentations with Creative Commons licensing. This represents a significant gap in our quantification of the commons, particularly in academic and scientific domains.

Solution Overview

Integrate Zenodo's REST API to fetch and process Creative Commons licensed works, expanding our coverage to include:

  • Academic publications and preprints
  • Research datasets
  • Open source software releases
  • Conference presentations and posters
  • Institutional repository content

Technical Implementation

API Details

Primary API: Zenodo REST API v1

  • Base URL: https://zenodo.org/api/records
  • Documentation: https://developers.zenodo.org/
  • Rate Limits: 100 requests/minute, 2000 requests/hour
  • Authentication: Public API (no key required for read operations)

Alternative APIs Considered:

  • OAI-PMH endpoint: https://zenodo.org/oai2d (rejected due to limited metadata)
  • Legacy API: Deprecated, not recommended

Query Strategy

Primary Query Approach:

GET /api/records?q=access_right:open AND (license:"cc-*" OR license:"CC-*" OR license:"public-domain")

Pagination Strategy:

  • Use page and size parameters (max 1000 records per page)
  • Implement cursor-based pagination for large datasets
  • Track links.next for continuation
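The continuation step above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: it assumes the response JSON carries records under hits.hits and a continuation URL under links.next, and the fetch_page callable and max_records parameter are hypothetical names introduced here so the loop can be exercised without network access.

```python
from typing import Callable, Iterator, Optional


def iter_records(
    fetch_page: Callable[[str], dict],
    first_url: str,
    max_records: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records page by page, following links.next until it is absent."""
    url: Optional[str] = first_url
    yielded = 0
    while url:
        page = fetch_page(url)
        # Assumed response shape: {"hits": {"hits": [...]}, "links": {"next": ...}}
        for record in page.get("hits", {}).get("hits", []):
            if max_records is not None and yielded >= max_records:
                return
            yield record
            yielded += 1
        # No "next" link means the last page has been reached.
        url = page.get("links", {}).get("next")
```

In a real fetch, fetch_page would wrap an HTTP GET that returns the decoded JSON body; passing it in as a parameter keeps the pagination logic easy to test in isolation.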

Filtering Criteria:

  1. access_right:open - Only open access works
  2. License filtering: CC-BY, CC-BY-SA, CC-BY-NC, CC-BY-ND, CC0, Public Domain
  3. Date range filtering: created:[2000-01-01 TO *]
  4. Resource type inclusion: publication, dataset, software, presentation
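As a rough sketch, the first three criteria could be combined into a single query string like this. The build_query helper and its parameters are hypothetical, the license identifiers passed in are illustrative, and resource type filtering is left out because the exact search field name is not given here.

```python
from typing import Iterable, Optional


def build_query(license_ids: Iterable[str], created_from: Optional[str] = None) -> str:
    """Combine the open-access, license, and date criteria into one query string."""
    clauses = ["access_right:open"]
    license_terms = " OR ".join(f'license:"{lid}"' for lid in license_ids)
    if license_terms:
        clauses.append(f"({license_terms})")
    if created_from:
        # Open-ended range: everything created on or after the given date.
        clauses.append(f"created:[{created_from} TO *]")
    return " AND ".join(clauses)


# Example: two license identifiers and the date floor from the criteria above.
query = build_query(["cc-by-4.0", "cc-by-sa-4.0"], created_from="2000-01-01")
```

The resulting string would be passed as the q parameter of the search request.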

Data Processing Strategy

Metadata Extraction:

  • DOI and Zenodo record ID
  • License type and version
  • Creation/publication date
  • Geographic metadata (author affiliations)
  • Subject classification (keywords, communities)
  • File formats and sizes
  • Citation counts and download metrics
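A minimal sketch of pulling some of these fields out of a single record, tolerating missing keys. The key names (id, doi, metadata.license.id, metadata.publication_date, metadata.creators[].affiliation, metadata.keywords) are assumptions about the response shape and should be checked against real API output; extract_fields is a hypothetical helper, not a function from this PR.

```python
from typing import Any


def extract_fields(record: dict[str, Any]) -> dict[str, Any]:
    """Flatten the fields of interest, using None or empty lists for gaps."""
    metadata = record.get("metadata") or {}
    license_info = metadata.get("license") or {}
    creators = metadata.get("creators") or []
    return {
        "record_id": record.get("id"),
        "doi": record.get("doi"),
        "license_id": license_info.get("id"),
        "publication_date": metadata.get("publication_date"),
        # Affiliations are the only geographic signal; missing ones are skipped.
        "affiliations": [c["affiliation"] for c in creators if c.get("affiliation")],
        "keywords": metadata.get("keywords") or [],
    }
```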

Deduplication Logic:

  • Primary: DOI matching across sources
  • Secondary: Title + author fuzzy matching
  • Handle Zenodo-specific versioning (concept DOIs vs version DOIs)
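The versioning rule could look something like the sketch below, which treats all versions sharing a concept DOI as one work. It assumes each record exposes conceptdoi and doi keys at the top level; dedupe_by_concept is a hypothetical helper, and the fuzzy title and author matching is not covered.

```python
def dedupe_by_concept(records):
    """Keep the first record seen per work: key on conceptdoi, else doi, else id."""
    seen = set()
    unique = []
    for record in records:
        key = record.get("conceptdoi") or record.get("doi") or record.get("id")
        if key is not None and key in seen:
            continue  # another version of a work that was already counted
        if key is not None:
            seen.add(key)
        unique.append(record)
    return unique
```

This keeps whichever version appears first in the input; keeping the latest version instead would require sorting by date before deduplicating.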

API Limitations & Implementation Strategy

The Zenodo REST API implementation addresses operational constraints through specific technical approaches implemented in the fetch script.

Rate Limiting Handling: The implementation uses the project's standardized retry mechanism: an HTTPAdapter configured with a Retry strategy (3 retries, backoff_factor=2) and shared.STATUS_FORCELIST. This covers the transient HTTP errors listed there, including the 504 Gateway Timeouts that commonly occur during sustained Zenodo API usage. A 2-second delay between requests (time.sleep(2.0)) caps the rate at 30 requests per minute (1,800 per hour), which stays under both limits listed above while maintaining reasonable throughput.
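For readers unfamiliar with the pattern, a session configured this way might look roughly like the following. It is a sketch, not the project's shared code: the STATUS_FORCELIST values below are a hypothetical stand-in for shared.STATUS_FORCELIST, and get_session is an illustrative name.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical stand-in for shared.STATUS_FORCELIST; the project's values may differ.
STATUS_FORCELIST = [408, 429, 500, 502, 503, 504]


def get_session() -> requests.Session:
    """Return a session that retries listed status codes with exponential backoff."""
    retry = Retry(total=3, backoff_factor=2, status_forcelist=STATUS_FORCELIST)
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

Requests made through this session are retried automatically when the server answers with one of the listed status codes, so the calling code only sees an error once the retries are exhausted.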

Query Result Management: The implementation uses a broad query approach ("*") due to validated limitations with Zenodo's search infrastructure. Testing confirms that license-specific queries (e.g., license:"cc-by-4.0") consistently result in 503/504 errors, while simple queries work reliably. Optional date filtering (publication_date:[YYYY-01-01 TO *] when --dates-back is specified) is supported. Creative Commons license filtering occurs by examining the structured metadata.license.id field from each record's API response, mapping CC license identifiers to standardized names. Records are processed in batches of 100 per page with configurable fetch limits.
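The client-side license filtering described above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: it treats any identifier starting with "cc" as a Creative Commons tool, derives the display name with a plain upper-case and hyphen-to-space conversion, and the function names are hypothetical.

```python
from collections import Counter


def standardize_license(license_id):
    """Return a display name for a CC identifier, or None for anything else."""
    lid = (license_id or "").lower()
    if not lid.startswith("cc"):
        return None
    return lid.upper().replace("-", " ")


def count_cc_licenses(records):
    """Tally records per CC license, skipping records without a CC identifier."""
    counts = Counter()
    for record in records:
        license_info = (record.get("metadata") or {}).get("license") or {}
        name = standardize_license(license_info.get("id", ""))
        if name:
            counts[name] += 1
    return counts
```

Identifiers that need a name the simple conversion cannot produce would still need explicit overrides.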

Data Quality Considerations: License information may be incomplete for older records, so the implementation filters out records classified as "Unknown" or "No License" during processing. Geographic data extraction relies on the author affiliation metadata that is available and does not attempt to supplement missing information.

Timeout and Error Handling: The script sets a 60-second timeout for API requests and relies on the standardized retry mechanism for transient errors. Failed extractions are logged but don't halt the process, so a fetch can complete even when metadata quality is inconsistent.
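The log-and-continue behaviour for failed extractions could be sketched like this. The process_records helper, the extract callable, and the choice of exception types are assumptions made for illustration.

```python
import logging

LOGGER = logging.getLogger(__name__)


def process_records(records, extract):
    """Apply extract() to each record; log failures and keep going."""
    results = []
    failures = 0
    for record in records:
        try:
            results.append(extract(record))
        except (KeyError, TypeError, ValueError) as e:
            # One malformed record should not abort the whole fetch.
            failures += 1
            LOGGER.warning("Skipping record %s: %s", record.get("id"), e)
    return results, failures
```

Returning the failure count alongside the results makes it possible to report how many records were dropped at the end of a run.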

Implementation Details

Dependencies

  • requests - HTTP client with retry logic
  • pygments - Syntax highlighting for error output
  • urllib3 - HTTP adapter and retry utilities
  • Standard library modules (json, csv, datetime, collections)

Error Handling Strategy

  • Retry logic for transient failures (network, rate limits)
  • Graceful degradation for incomplete metadata
  • Comprehensive logging for debugging
  • Checkpoint/resume capability for large fetches
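One possible shape for the checkpoint/resume item is sketched below. It is not taken from the fetch script: the file layout, the state keys (next_page, records_fetched), and both function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state via a temp file and rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return saved state, or a fresh starting state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_page": 1, "records_fetched": 0}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A fetch loop would call load_checkpoint once at startup and save_checkpoint after each page has been processed successfully.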

Useful Links

Zenodo Resources:

License Information:

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

- Add zenodo_fetch.py with REST API integration
- Implements CC license filtering and standardization
- Uses shared retry mechanism for robust API handling
@Goziee-git Goziee-git requested review from a team as code owners November 18, 2025 19:58
@Goziee-git Goziee-git requested review from Shafiya-Heena and TimidRobot and removed request for a team November 18, 2025 19:58
- Add pycountry dependency to Pipfile
- Add get_language_name() function to shared.py
- Supports language code and name standardization for zenodo_fetch.py
- Add pycountry to Pipfile for language standardization
- Update Pipfile.lock with resolved dependencies
- Required for zenodo_fetch.py language mapping functionality
- Required for language standardization in zenodo_fetch.py
- Ensure scripts can be executed directly on remote systems
- Required for proper script execution in CI/CD environments
Comment on lines 267 to 270
# Other CC licenses
"cc-nc": "CC NC",
"cc-sa": "CC SA",
"cc-nd": "CC ND",
Member

These are not valid legal tools (missing version).

@TimidRobot, thanks. I have updated the script to use valid licenses with version numbers and organised them into the sections where they belong in the script.

# CC Certification
"cc-certification-1.0-us": "CC CERTIFICATION 1.0 US",
# Legacy CC licenses
"cc-publicdomain": "CC Public Domain",
Member

Which legal tools does this refer to?

@TimidRobot, I have gone over these license selections and updated them to use valid version numbers. I have also removed the legacy and retired licenses, since the returned data contained none of them.

license_id = license_data.get("id", "") # API returns lowercase IDs

# Focus on Creative Commons licenses - map to standardized names
cc_license_mapping = {
Member

The overwhelming majority of this mapping could be replaced with .upper().replace("-", " ")

Contributor Author

@TimidRobot, Thanks for pointing this out, it has been updated.


try:
    response = session.get(
        ZENODO_API_BASE_URL, params=params, timeout=60
Member

60s seems too long for a timeout, especially with the records capped at 300.

Comment on lines 315 to 320
except requests.RequestException as e:
    LOGGER.error(f"Error fetching Zenodo records: {e}")
    raise
except json.JSONDecodeError as e:
    LOGGER.error(f"Error parsing JSON response: {e}")
    raise
Member

please raise shared.QuantifyingException instead of logging and raising

Member

I already gave you feedback about using this exception:

The pattern is also throughout the fetch scripts in the main branch.

Please focus on the quality of submissions, not quantity.

@TimidRobot TimidRobot self-assigned this Nov 24, 2025
Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Add Zenodo as Data Source for Commons Quantification

3 participants