Add Zenodo data source with fetch implementation #250

Goziee-git · 2025-11-18T19:58:29Z

Fixes

Fixes Add Zenodo as Data Source for Commons Quantification #249 by @Goziee-git

Problem Statement

The Quantifying the Commons project currently lacks coverage of academic and research outputs that are openly licensed. Zenodo, operated by CERN, hosts over 2 million research artifacts including publications, datasets, software, and presentations with Creative Commons licensing. This represents a significant gap in our quantification of the commons, particularly in academic and scientific domains.

Solution Overview

Integrate Zenodo's REST API to fetch and process Creative Commons licensed works, expanding our coverage to include:

Academic publications and preprints
Research datasets
Open source software releases
Conference presentations and posters
Institutional repository content

Technical Implementation

API Details

Primary API: Zenodo REST API v1

Base URL: https://zenodo.org/api/records
Documentation: https://developers.zenodo.org/
Rate Limits: 100 requests/minute, 2000 requests/hour
Authentication: Public API (no key required for read operations)

Alternative APIs Considered:

OAI-PMH endpoint: https://zenodo.org/oai2d (rejected due to limited metadata)
Legacy API: Deprecated, not recommended

Query Strategy

Primary Query Approach:

GET /api/records?q=access_right:open AND (license:"cc-*" OR license:"CC-*" OR license:"public-domain")

Pagination Strategy:

Use page and size parameters (max 1000 records per page)
Implement cursor-based pagination for large datasets
Track links.next for continuation
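
The pagination steps above can be sketched as follows. This is a minimal sketch, assuming each response page is a dict with a `hits.hits` list and an optional `links.next` URL. `iter_records` and `fetch_page` are hypothetical names; `fetch_page` stands in for the actual HTTP request so the loop can be exercised without network access.

```python
from typing import Callable, Iterator, Optional


def iter_records(
    first_url: str,
    fetch_page: Callable[[str], dict],
    max_records: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records page by page, following links.next until it is absent."""
    url: Optional[str] = first_url
    seen = 0
    # Stop early once the configured fetch limit has been reached.
    while url and (max_records is None or seen < max_records):
        page = fetch_page(url)
        for record in page.get("hits", {}).get("hits", []):
            if max_records is not None and seen >= max_records:
                return
            yield record
            seen += 1
        # A missing links.next means the last page has been consumed.
        url = page.get("links", {}).get("next")
```

Passing the fetch function in as a parameter keeps the continuation logic independent of the HTTP client, so the same loop works with a retrying session or with canned pages in a test.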

Filtering Criteria:

access_right:open - Only open access works
License filtering: CC-BY, CC-BY-SA, CC-BY-NC, CC-BY-ND, CC0, Public Domain
Date range filtering: created:[2000-01-01 TO *]
Resource type inclusion: publication, dataset, software, presentation

Data Processing Strategy

Metadata Extraction:

DOI and Zenodo record ID
License type and version
Creation/publication date
Geographic metadata (author affiliations)
Subject classification (keywords, communities)
File formats and sizes
Citation counts and download metrics

Deduplication Logic:

Primary: DOI matching across sources
Secondary: Title + author fuzzy matching
Handle Zenodo-specific versioning (concept DOIs vs version DOIs)
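
The versioning rule above could be handled along these lines. This is a sketch under assumptions, not the fetch script's code: it assumes each record dict exposes `conceptdoi`, `doi`, and an ISO 8601 `created` timestamp (field names should be checked against real API responses), and `dedupe_by_concept` is a hypothetical helper.

```python
def dedupe_by_concept(records):
    """Keep the newest version per concept DOI; return (deduped, without_doi)."""
    latest = {}
    without_doi = []
    for record in records:
        # Fall back to the version DOI when no concept DOI is present.
        key = record.get("conceptdoi") or record.get("doi")
        if not key:
            # No DOI at all: hand these to the title + author fuzzy pass.
            without_doi.append(record)
            continue
        key = key.lower()  # DOIs are case-insensitive.
        current = latest.get(key)
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if current is None or record.get("created", "") > current.get("created", ""):
            latest[key] = record
    return list(latest.values()), without_doi
```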

API Limitations & Implementation Strategy

The Zenodo REST API implementation addresses operational constraints through specific technical approaches implemented in the fetch script.

Rate Limiting Handling: The implementation uses the project's standardized retry mechanism via shared.STATUS_FORCELIST through an HTTPAdapter with Retry strategy (3 retries, backoff_factor=2). This retries transient HTTP errors whose status codes are in that list, including the 504 Gateway Timeouts that commonly occur during sustained Zenodo API usage. A 2-second delay between requests (time.sleep(2.0)) caps the request rate at 30 per minute at most, below the stated limit of 100 requests/minute, while maintaining reasonable throughput.
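
A retry-enabled session of the kind described above might look like the following. This is a sketch, not the project's code: `STATUS_FORCELIST` here is a stand-in for `shared.STATUS_FORCELIST`, whose actual contents are not shown in this PR description, and `get_session` is a hypothetical helper name.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Assumption: placeholder for shared.STATUS_FORCELIST; the real list may differ.
STATUS_FORCELIST = [408, 429, 500, 502, 503, 504]


def get_session():
    """Build a requests session that retries listed status codes with backoff."""
    retry = Retry(total=3, backoff_factor=2, status_forcelist=STATUS_FORCELIST)
    session = requests.Session()
    # Mount the adapter so every https:// request goes through the retry policy.
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

In a fetch loop, the session would be created once and reused for every page request, with the fixed delay applied between calls.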

Query Result Management: The implementation uses a broad query approach ("*") due to validated limitations with Zenodo's search infrastructure. Testing confirms that license-specific queries (e.g., license:"cc-by-4.0") consistently result in 503/504 errors, while simple queries work reliably. Optional date filtering (publication_date:[YYYY-01-01 TO *] when --dates-back is specified) is supported. Creative Commons license filtering occurs by examining the structured metadata.license.id field from each record's API response, mapping CC license identifiers to standardized names. Records are processed in batches of 100 per page with configurable fetch limits.
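
The client-side license filtering described above can be illustrated with a small classifier. This is a hedged sketch: it assumes `metadata.license` is a dict with an `id` key, and both the `standardize_license` helper and the output naming scheme ("CC BY 4.0" and so on) are illustrative rather than the mapping used in the fetch script.

```python
import re

# Matches identifiers such as "cc-by-4.0" or "cc-by-nc-sa-3.0" (after lowercasing).
_CC_PATTERN = re.compile(r"^cc-(by(?:-nc)?(?:-nd|-sa)?)-(\d\.\d)$")


def standardize_license(record):
    """Return a standardized CC license name, or None for non-CC records."""
    license_info = record.get("metadata", {}).get("license") or {}
    license_id = (license_info.get("id") or "").lower()
    if license_id == "cc0-1.0":
        return "CC0 1.0"
    match = _CC_PATTERN.match(license_id)
    if match:
        # e.g. "by-nc-sa" + "3.0" -> "CC BY-NC-SA 3.0"
        return f"CC {match.group(1).upper()} {match.group(2)}"
    return None
```

Records for which the function returns `None` would simply be skipped, which keeps the API query itself broad and simple.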

Data Quality Considerations: License information may be incomplete for older records, so records classified as "Unknown" or "No License" are filtered out during processing. Geographic data extraction relies on the author affiliation metadata that is available and does not attempt to supplement missing information.

Timeout and Error Handling: The script sets a 60-second timeout for API requests and relies on the standardized retry mechanism for transient failures. Failed extractions are logged but don't halt the entire process, so a fetch can complete even when metadata quality is inconsistent.
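
The "log and continue" behavior for failed extractions could be structured roughly as below. This is an illustrative sketch: `extract_all`, the `extract` callable, and the logger name are hypothetical, and the set of caught exceptions is an assumption about what malformed metadata typically raises.

```python
import logging

logger = logging.getLogger("zenodo_fetch_sketch")


def extract_all(records, extract):
    """Apply extract() to each record, logging failures instead of aborting."""
    rows = []
    failures = 0
    for record in records:
        try:
            rows.append(extract(record))
        except (KeyError, TypeError, ValueError) as error:
            # Incomplete or malformed metadata: note it and move on.
            failures += 1
            logger.warning("Skipping record %s: %r", record.get("id"), error)
    return rows, failures
```

Returning the failure count alongside the rows makes it easy to report how much of a fetch was affected by inconsistent metadata.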

Implementation Details

Dependencies

requests - HTTP client with retry logic
pygments - Syntax highlighting for error output
urllib3 - HTTP adapter and retry utilities
Standard library modules (json, csv, datetime, collections)

Error Handling Strategy

Retry logic for transient failures (network, rate limits)
Graceful degradation for incomplete metadata
Comprehensive logging for debugging
Checkpoint/resume capability for large fetches
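
The checkpoint/resume item above is not described further in this PR, so the following is only one possible shape for it. The function names, the file layout, and the `next_url` / `records_fetched` keys are all hypothetical.

```python
import json
import os


def save_checkpoint(path, next_url, records_fetched):
    """Write the checkpoint atomically so an interrupted run leaves a valid file."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"next_url": next_url, "records_fetched": records_fetched}, handle)
    # os.replace swaps the file in one step, so readers never see a partial write.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A fetch loop would call `save_checkpoint` after each page and, on startup, resume from `load_checkpoint(...)["next_url"]` when a checkpoint is present.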

Useful Links

Zenodo Resources:

[Main site]: https://zenodo.org/
[API Documentation]: https://developers.zenodo.org/
[REST API Reference]: https://zenodo.org/api/
[OAI-PMH endpoint]: https://zenodo.org/oai2d
[Communities]: https://zenodo.org/communities/

License Information:

[Zenodo License Policy]: https://about.zenodo.org/policies/
[Supported Licenses]: https://zenodo.org/account/settings/github/

Checklist

I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
My pull request doesn't include code or content generated with AI.
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

- Add zenodo_fetch.py with REST API integration
- Implements CC license filtering and standardization
- Uses shared retry mechanism for robust API handling

- Add pycountry dependency to Pipfile
- Add get_language_name() function to shared.py
- Supports language code and name standardization for zenodo_fetch.py

- Add pycountry to Pipfile for language standardization
- Update Pipfile.lock with resolved dependencies
- Required for zenodo_fetch.py language mapping functionality

- Required for language standardization in zenodo_fetch.py

- Ensure scripts can be executed directly on remote systems
- Required for proper script execution in CI/CD environments

TimidRobot · 2025-11-24T09:45:45Z