OSF Preprints: External Version of Record Analysis

Author

Alex Jakubow

Published

February 14, 2026

Introduction

This report analyzes the extent to which preprints hosted on the Open Science Framework (OSF) link to external versions of record (VOR). These linkages matter for tracking the scholarly publishing pathway and for assessing how preprints transition to formal publications. The analysis also estimates the proportion of OSF preprints that have a presence outside the OSF ecosystem, which can inform strategies for improving discoverability, integrating with other scholarly platforms, and navigating any potential plans to deprecate preprints on OSF.

Research Question

How many preprints on OSF point to an external version of record that lives somewhere other than OSF?

These external versions could be:

  • Carbon copies (or near copies) hosted on other preprint servers
  • Versions of the same preprint eventually published in academic journals
  • Other forms of the work published in different venues

Methodology

Data Source

The analysis uses preprint data from the OSF database, specifically focusing on preprints that meet the following criteria:

  • Public: Preprint is publicly accessible
  • Not deleted: Preprint has not been marked as deleted
  • Accepted: Machine state is “accepted”
  • Published: Preprint is marked as published
  • No existing article DOI: The preprint's article_doi field is empty

Sample

A random sample of 10,000 preprints was selected from the OSF database using a fixed random seed (8675309) for reproducibility.
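The selection and sampling steps can be sketched as follows. This is a minimal Python illustration, not the author's R code: the field names (is_public, deleted, machine_state, is_published, article_doi) are hypothetical stand-ins for the OSF database columns, and a Python generator seeded with 8675309 will not reproduce the same draw as the R pipeline.

```python
import random

SEED = 8675309        # seed reported in the text
SAMPLE_SIZE = 10_000  # sample size reported in the text


def is_eligible(preprint: dict) -> bool:
    """Apply the five inclusion criteria. All field names are hypothetical."""
    return bool(
        preprint["is_public"]
        and not preprint["deleted"]
        and preprint["machine_state"] == "accepted"
        and preprint["is_published"]
        and not preprint["article_doi"]
    )


def draw_sample(preprints: list, k: int = SAMPLE_SIZE, seed: int = SEED) -> list:
    """Filter to eligible preprints, then sample without replacement."""
    eligible = [p for p in preprints if is_eligible(p)]
    rng = random.Random(seed)  # local generator, so the draw is repeatable
    return rng.sample(eligible, min(k, len(eligible)))


# Toy records standing in for database rows.
toy = [
    {"id": i, "is_public": True, "deleted": False, "machine_state": "accepted",
     "is_published": True, "article_doi": None}
    for i in range(50)
]
toy[0]["article_doi"] = "10.1000/example"  # already linked: excluded
toy[1]["is_public"] = False                # not public: excluded

sample_a = draw_sample(toy, k=10)
sample_b = draw_sample(toy, k=10)
```

Because the generator is created inside the function with a fixed seed, repeated calls on the same input return the same sample.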

DOI Lookup Process

For each preprint DOI, we queried the OpenAlex API to retrieve:

  • Location information (where the work appears)
  • Open access status
  • Version information (preprint, accepted manuscript, published version)

The lookup process involved:

  1. Extracting preprint DOIs from the OSF database
  2. Making parallel API requests to OpenAlex (throttled to respect rate limits)
  3. Saving successful and failed requests to log files
  4. Processing the JSON responses to extract location data
  5. Filtering out locations that point back to OSF itself
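Steps 4 and 5 above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the logic in openalex.r: the OSF test (the substring "osf.io" in a location's landing_page_url) is a guess at a reasonable heuristic, the helper names are hypothetical, and the example DOI is made up. The locations, landing_page_url, is_oa, and version fields are part of the OpenAlex work response.

```python
import json
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works/"


def openalex_url(doi: str) -> str:
    """Build the OpenAlex lookup URL for a bare DOI such as '10.1234/abc'."""
    return f"{OPENALEX_WORKS}https://doi.org/{doi}"


def fetch_work(doi: str, timeout: float = 30.0) -> dict:
    """Fetch one work record (network call; not exercised in the example)."""
    with urllib.request.urlopen(openalex_url(doi), timeout=timeout) as resp:
        return json.load(resp)


def is_osf_location(location: dict) -> bool:
    """Assumed heuristic: treat a location as OSF if its URL mentions osf.io."""
    url = (location.get("landing_page_url") or "").lower()
    return "osf.io" in url


def external_locations(work: dict) -> list:
    """Return the locations of a work that do not point back to OSF."""
    return [loc for loc in (work.get("locations") or [])
            if not is_osf_location(loc)]


# Hand-written response fragment with a made-up DOI, for illustration only.
example_work = {
    "doi": "https://doi.org/10.31234/osf.io/abcde",
    "locations": [
        {"landing_page_url": "https://doi.org/10.31234/osf.io/abcde",
         "is_oa": True, "version": "submittedVersion"},
        {"landing_page_url": "https://example.org/article/1",
         "is_oa": False, "version": "publishedVersion"},
    ],
}

external = external_locations(example_work)
```

A substring test like this is deliberately coarse. It would also drop any external page whose URL happens to contain "osf.io", which is the kind of edge case a manual check of the filter should cover.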

Results

Sample Size and API Success Rate

The first step in our analysis is understanding how many DOI lookups were successful.

Table 1: DOI Lookup Success Rate
Metric                           Value
Total preprint DOIs attempted    10,000
Successful lookups               9,939 (99.4%)
Failed lookups                   61 (0.6%)
Success rate                     99.4%

Key Finding: The OpenAlex API returned data for 99.4% of the queried DOIs (9,939 of 10,000), so OpenAlex coverage of OSF preprint DOIs is high in this sample.

External Location Funnel Analysis

The funnel below tracks how many preprints remain at each stage: a successful OpenAlex lookup, at least one external (non-OSF) location, and at least one external open access version.

Table 2: External Location Funnel
Stage                                               Count    % of Previous Stage    % of Total Sample
1. Successful OpenAlex lookups                      9,939    NA                     99.4
2. With at least one external location (non-OSF)    364      3.7                    3.6
3. With at least one external open access version   173      47.5                   1.7

Key Findings:

  • 3.7% of successfully looked-up preprints have at least one external location (non-OSF)
  • Of those with external locations, 47.5% have at least one open access version
  • Overall, 1.7% of the original sample has an external open access version
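The two percentage columns in Table 2 use different denominators, which is why stage 2 reads 3.7% in one column and 3.6% in the other. The short Python sketch below recomputes both from the reported counts; the function and variable names are illustrative.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as in Table 2."""
    return round(100 * part / whole, 1)


sample_size = 10_000
stages = [
    ("Successful OpenAlex lookups", 9_939),
    ("With at least one external location", 364),
    ("With at least one external open access version", 173),
]

rows = []
previous = None
for name, count in stages:
    rows.append({
        "stage": name,
        "count": count,
        # Share of the preceding stage (undefined for the first stage).
        "pct_of_previous": None if previous is None else pct(count, previous),
        # Share of all 10,000 sampled DOIs, including failed lookups.
        "pct_of_total": pct(count, sample_size),
    })
    previous = count

for row in rows:
    print(row)
```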

Distribution of External Locations

Let’s examine how many external locations each preprint has:

Table 3: Distribution of External Locations per Preprint
Number of External Locations    Number of Preprints    Percentage    Cumulative %
0                               9,575                  96.3          96.3
1                               285                    2.9           99.2
2                               58                     0.6           99.8
3                               10                     0.1           99.9
4                               7                      0.1           100.0
6                               1                      0.0           100.0
7                               1                      0.0           100.0
8                               1                      0.0           100.0
11                              1                      0.0           100.0

Key Finding: The vast majority of successfully looked-up preprints (96.3%) have no external locations. The remaining 3.7% (364 preprints) have at least one, and most of those (285) have exactly one.
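As a consistency check, the counts in Table 3 can be summed to recover figures reported elsewhere in this report. The Python sketch below uses only the counts from the table; the variable names are illustrative.

```python
# Number of external locations -> number of preprints (from Table 3).
distribution = {0: 9575, 1: 285, 2: 58, 3: 10, 4: 7, 6: 1, 7: 1, 8: 1, 11: 1}

# Every successfully looked-up preprint appears in exactly one row.
total_preprints = sum(distribution.values())

# Preprints with at least one external location.
with_external = total_preprints - distribution[0]

# Total external locations across all preprints.
total_locations = sum(k * n for k, n in distribution.items())

# Rebuild the cumulative percentage column.
cumulative = []
running = 0
for k in sorted(distribution):
    running += distribution[k]
    cumulative.append(round(100 * running / total_preprints, 1))

print(total_preprints, with_external, total_locations)
print(cumulative)
```

The three totals line up with the 9,939 successful lookups, the 364 preprints in stage 2 of the funnel, and the sum of the location counts in Table 4.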

Types of External Locations

Let’s examine what types of external locations were found:

Table 4: Types of External Locations Found
Version Type        Access Status      Count    Percentage
submittedVersion    Not Open Access    211      43.0
submittedVersion    Open Access        166      33.8
NA                  Open Access        61       12.4
acceptedVersion     Not Open Access    18       3.7
NA                  Not Open Access    12       2.4
acceptedVersion     Open Access        10       2.0
publishedVersion    Open Access        7        1.4
publishedVersion    Not Open Access    6        1.2

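The Version Type column indicates whether the external location is a submitted version such as a preprint (submittedVersion), an accepted manuscript (acceptedVersion), or a published version (publishedVersion); NA means OpenAlex reported no version for that location. The Access Status column indicates whether the location is open access. Both fields come from the OpenAlex API response, and their definitions can be found in the OpenAlex documentation. Counts in Table 4 are locations rather than preprints: the 364 preprints with external locations account for 491 locations in total.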
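A cross-tabulation like Table 4 can be produced by counting (version, is_oa) pairs across all external locations. The Python sketch below uses toy location records rather than the real data, and the function name is hypothetical; version and is_oa are the OpenAlex location fields described above.

```python
from collections import Counter


def tally_locations(locations: list) -> Counter:
    """Count locations by (version, open access flag). Missing version -> None."""
    return Counter(
        (loc.get("version"), bool(loc.get("is_oa"))) for loc in locations
    )


# Toy records for illustration; these are not rows from the analysis.
toy_locations = [
    {"version": "submittedVersion", "is_oa": True},
    {"version": "submittedVersion", "is_oa": False},
    {"version": "submittedVersion", "is_oa": False},
    {"version": None, "is_oa": True},
    {"version": "publishedVersion", "is_oa": True},
]

tally = tally_locations(toy_locations)
total = sum(tally.values())

# Print rows from most to least common, with each row's share of all locations.
for (version, is_oa), count in tally.most_common():
    label = "Open Access" if is_oa else "Not Open Access"
    print(version or "NA", label, count, round(100 * count / total, 1))
```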

Key Findings Summary

Based on the analysis of 10,000 OSF preprint DOIs:

  1. High API Success Rate: 99.4% of DOIs were successfully looked up in OpenAlex

  2. Limited External Presence: Only 3.7% of successfully looked-up preprints (3.6% of the full sample) have at least one external location (non-OSF)

  3. Open Access Availability: Among preprints with external locations, 47.5% have at least one open access version available

  4. Overall External OA Rate: Just 1.7% of the total sample has an external open access version

Next Steps and Recommendations

Further Analyses

  1. Temporal Analysis
    • Examine whether the rate of external VOR linkages has changed over time
    • Analyze the time lag between preprint posting and external publication
    • Use the created date field from the preprint metadata
  2. Domain-Specific Analysis
    • Investigate whether certain research fields have higher rates of external VOR
    • Consider adding discipline/subject classifications from OpenAlex
    • Compare patterns across different OSF preprint servers
  3. Publisher and Venue Analysis
    • Identify which journals/publishers most commonly publish OSF preprints
    • Analyze patterns by venue type (journal vs. preprint server)
    • Examine the raw_type field in the detailed results
  4. Version Progression Analysis
    • Track the progression from preprint to accepted manuscript to published version
    • Analyze what percentage reach each stage of the publishing process
    • Use the version, is_accepted, and is_published fields
  5. Failed Lookup Investigation
    • Examine the 61 failed lookups to understand why they failed
    • Determine whether the failed lookups differ systematically from the successful ones
    • Consider alternative methods for these cases

Data Quality Considerations

  1. DOI Coverage: Investigate whether the preprints without external locations truly have no external versions, or if they’re simply not indexed in OpenAlex

  2. Manual Validation: Consider manually reviewing a random sample of cases to validate the automated findings

  3. Missing Data: Explore patterns in the 61 failed lookups to ensure they don’t represent a systematic bias

  4. OSF Filtering: Verify that the filtering logic correctly excludes only OSF locations and doesn’t inadvertently remove legitimate external locations

Potential Research Questions

  1. What factors predict whether an OSF preprint will have an external VOR?
    • Preprint quality indicators (downloads, citations)
    • Author characteristics (reputation, institution)
    • Field of study
    • Preprint server
  2. How does OSF compare to other preprint servers in terms of VOR linkages?
    • Compare OSF rates to arXiv, bioRxiv, medRxiv, etc.
    • Analyze whether OSF’s multidisciplinary nature affects VOR rates
  3. What is the relationship between preprint visibility and eventual publication?
    • Do preprints with more downloads/views have higher publication rates?
    • Does open access status of the preprint affect subsequent publication?

Implementation Recommendations

  1. Expand Sample Size: Consider running the analysis on the full dataset rather than just 10,000 preprints to get more robust estimates

  2. Longitudinal Tracking: Set up periodic re-runs of this analysis to track how VOR linkages evolve over time

  3. Integration with OSF: Consider ways to surface these external VOR linkages in the OSF interface to help users discover related versions
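To judge how much a larger sample would tighten the estimates (recommendation 1), it helps to put an interval around the current figure. The Python sketch below computes a 95% Wilson score interval for the share of looked-up preprints with an external location (364 of 9,939). The choice of the Wilson interval is an assumption made for this illustration; the report itself does not specify an interval method.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from Table 2.
low, high = wilson_interval(successes=364, n=9_939)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")
```

The same function can be applied to the open access share (173 of 364), where the smaller denominator gives a noticeably wider interval.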

Technical Notes

The analysis pipeline consists of:

  • Data extraction: SQL queries against OSF database tables
  • API queries: Parallel requests to OpenAlex API with throttling
  • Data processing: JSON parsing and filtering of location data
  • Output: Summary CSV files for further analysis
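The "parallel requests with throttling" step can be sketched as follows. This is a Python illustration, not the implementation in openalex.r: the Throttle class, the lookup_all function, the worker count, and the request interval are all hypothetical, and the interval shown is a placeholder rather than OpenAlex's actual rate limit. The fetch function is passed in, so the example runs offline with a stub.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space calls at least `min_interval` seconds apart across threads."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)


def lookup_all(dois, fetch, max_workers: int = 4, min_interval: float = 0.1):
    """Run fetch(doi) in parallel and record success or failure per DOI."""
    throttle = Throttle(min_interval)

    def one(doi):
        throttle.wait()
        try:
            return doi, "success", fetch(doi)
        except Exception as exc:  # log the failure instead of aborting the run
            return doi, "failed", str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, dois))  # map preserves input order


def fake_fetch(doi: str) -> dict:
    """Stub standing in for a real API call."""
    if doi.endswith("bad"):
        raise ValueError("not found")
    return {"doi": doi, "locations": []}


results = lookup_all(["10.0000/a", "10.0000/bad", "10.0000/c"],
                     fake_fetch, min_interval=0.01)
statuses = [status for _, status, _ in results]
```

Recording a status for every DOI, rather than raising on the first error, is what makes a success/failure log possible.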

All code is available in openalex.r, and the analysis can be reproduced using the same random seed (8675309).

Appendix: Data Files

The analysis generates the following output files:

  • data/doi_lookup_summary.csv: Summary statistics for each preprint
  • data/doi_lookup_results.csv: Detailed location data for external VORs
  • logs/doi_lookup_status.csv: API request success/failure log

The data files are not included in this report (or the codebase) but are available on Google Drive for further exploration.