OSF Preprints: External Version of Record Analysis
Summary of OpenAlex DOI lookups:
| Metric | Value |
|---|---|
| Total preprint DOIs attempted | 10,000 |
| Successful lookups | 9,939 (99.4%) |
| Failed lookups | 61 (0.6%) |
| Success rate | 99.4% |
Introduction
This report analyzes the extent to which preprints hosted on the Open Science Framework (OSF) link to external versions of record (VOR). Understanding these linkages is important for tracking the scholarly publishing pathway and assessing how preprints transition to formal publications. It will also help identify the proportion of OSF preprints that have a presence outside of the OSF ecosystem, which can inform strategies for improving discoverability, integration with other scholarly platforms, and navigating any potential plans to deprecate preprints on OSF.
Research Question
How many preprints on OSF point to an external version of record that lives somewhere other than OSF?
These external versions could be:
- Carbon copies (or near copies) hosted on other preprint servers
- Versions of the same preprint eventually published in academic journals
- Other forms of the work published in different venues
Methodology
Data Source
The analysis uses preprint data from the OSF database, specifically focusing on preprints that meet the following criteria:
- Public: Preprint is publicly accessible
- Not deleted: Preprint has not been marked as deleted
- Accepted: Machine state is “accepted”
- Published: Preprint is marked as published
- No existing article DOI: The preprint does not already have an article_doi field populated
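The criteria above can be sketched as a single predicate. This is a minimal Python illustration, not the report's actual query (which runs as SQL against the OSF database from openalex.r). The field names `is_public`, `deleted`, `machine_state`, `is_published`, and `article_doi` are assumptions based on the labels in the list, and the sample DOI is made up.

```python
from typing import Optional, TypedDict


class Preprint(TypedDict):
    # Hypothetical field names; the real OSF schema may differ.
    doi: str
    is_public: bool
    deleted: bool
    machine_state: str
    is_published: bool
    article_doi: Optional[str]


def is_eligible(p: Preprint) -> bool:
    """Return True when a preprint meets all five inclusion criteria."""
    return bool(
        p["is_public"]
        and not p["deleted"]
        and p["machine_state"] == "accepted"
        and p["is_published"]
        and not p["article_doi"]  # None or "" both count as "not populated"
    )


example: Preprint = {
    "doi": "10.31219/osf.io/abcde",  # made-up DOI for illustration
    "is_public": True,
    "deleted": False,
    "machine_state": "accepted",
    "is_published": True,
    "article_doi": None,
}
```

Treating an empty string the same as a missing `article_doi` is a design choice in this sketch; the report does not say how blank values are handled.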
Sample
A random sample of 10,000 preprints was selected from the OSF database using a fixed random seed (8675309) for reproducibility.
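A fixed seed makes the draw repeatable, which the following Python sketch illustrates. It is an assumption-laden stand-in: the report's sample was drawn in R, so seeding Python's generator with 8675309 will not reproduce the same 10,000 preprints, and the function name `draw_sample` is hypothetical.

```python
import random

SEED = 8675309
SAMPLE_SIZE = 10_000


def draw_sample(dois, k=SAMPLE_SIZE, seed=SEED):
    """Draw k DOIs without replacement, repeatably for a given seed."""
    # Sort first so the result depends only on the population and the seed,
    # not on the order in which rows happened to be returned.
    population = sorted(dois)
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(population, k)
```

Calling `draw_sample` twice on the same population returns the same list, which is the property the fixed seed is meant to provide.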
DOI Lookup Process
For each preprint DOI, we queried the OpenAlex API to retrieve:
- Location information (where the work appears)
- Open access status
- Version information (preprint, accepted manuscript, published version)
The lookup process involved:
- Extracting preprint DOIs from the OSF database
- Making parallel API requests to OpenAlex (throttled to respect rate limits)
- Saving successful and failed requests to log files
- Processing the JSON responses to extract location data
- Filtering out locations that point back to OSF itself
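The last two steps can be sketched in Python as follows. The OpenAlex works endpoint accepts a full DOI URL as the identifier, and each work carries a `locations` list whose entries include `landing_page_url`. The rule used here to recognise an OSF location (the string "osf.io" appearing in the landing page URL) is an assumption; the report does not state its exact filtering logic, and the function names are hypothetical.

```python
from urllib.parse import quote

OPENALEX_WORKS = "https://api.openalex.org/works/"


def openalex_url(doi: str) -> str:
    """Build the OpenAlex lookup URL for a bare DOI such as '10.1234/abc'."""
    return OPENALEX_WORKS + "https://doi.org/" + quote(doi, safe="/")


def is_osf_location(location: dict) -> bool:
    """Assumed rule: a location is OSF-hosted if its URL mentions osf.io."""
    url = location.get("landing_page_url") or ""
    return "osf.io" in url.lower()


def external_locations(work: dict) -> list:
    """Keep only the locations of a parsed OpenAlex work that are not OSF."""
    # Note: a location with no landing page URL is kept as external here.
    return [
        loc for loc in (work.get("locations") or [])
        if not is_osf_location(loc)
    ]
```

A substring match is deliberately broad, because OSF DOIs embed "osf.io" and a landing page may be given as a doi.org URL. That breadth is also why the filter deserves the verification suggested under Data Quality Considerations.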
Results
Sample Size and API Success Rate
The first step in the analysis is to establish how many DOI lookups succeeded. Of the 10,000 preprint DOIs attempted, 9,939 (99.4%) returned a record and 61 (0.6%) failed.
Key Finding: The OpenAlex API returned data for 99.4% of the queried DOIs, so nearly all of the sampled OSF preprints are indexed in OpenAlex.
External Location Funnel Analysis
This analysis tracks how many preprints have external locations (non-OSF) at each stage:
| Stage | Count | % of Previous Stage | % of Total Sample |
|---|---|---|---|
| 1. Successful OpenAlex lookups | 9,939 | NA | 99.4 |
| 2. With at least one external location (non-OSF) | 364 | 3.7 | 3.6 |
| 3. With at least one external open access version | 173 | 47.5 | 1.7 |
Key Findings:
- 3.7% of successfully looked-up preprints have at least one external location (non-OSF)
- Of those with external locations, 47.5% have at least one open access version
- Overall, 1.7% of the original sample has an external open access version
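The funnel percentages follow directly from the counts in the table. The short Python check below recomputes them; the variable names are illustrative, and the counts are the ones reported above.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


sample = 10_000      # DOIs attempted
looked_up = 9_939    # successful OpenAlex lookups
external = 364       # at least one non-OSF location
external_oa = 173    # at least one external open access version

funnel = [
    ("Successful OpenAlex lookups", None, pct(looked_up, sample)),
    ("External location", pct(external, looked_up), pct(external, sample)),
    ("External OA version", pct(external_oa, external), pct(external_oa, sample)),
]

for stage, of_previous, of_total in funnel:
    print(stage, of_previous, of_total)
```

Note that the two percentages for stage 2 differ (3.7 versus 3.6) only because the denominators differ: 9,939 successful lookups versus the full sample of 10,000.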
Distribution of External Locations
Let’s examine how many external locations each preprint has:
| Number of External Locations | Number of Preprints | Percentage | Cumulative % |
|---|---|---|---|
| 0 | 9575 | 96.3 | 96.3 |
| 1 | 285 | 2.9 | 99.2 |
| 2 | 58 | 0.6 | 99.8 |
| 3 | 10 | 0.1 | 99.9 |
| 4 | 7 | 0.1 | 100.0 |
| 6 | 1 | 0.0 | 100.0 |
| 7 | 1 | 0.0 | 100.0 |
| 8 | 1 | 0.0 | 100.0 |
| 11 | 1 | 0.0 | 100.0 |
Key Finding: The vast majority of preprints (96.3%) have no external locations, while 3.7% have at least one external location.
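The distribution table can be cross-checked against the funnel. The Python sketch below, using the counts reported above, recomputes the cumulative percentages and derives two totals: the number of preprints with at least one external location, and the total number of external locations.

```python
# Number of external locations -> number of preprints (from the table).
distribution = {0: 9575, 1: 285, 2: 58, 3: 10, 4: 7, 6: 1, 7: 1, 8: 1, 11: 1}

total_preprints = sum(distribution.values())
with_external = total_preprints - distribution[0]
total_locations = sum(n * count for n, count in distribution.items())

# Cumulative percentage of preprints with at most n external locations.
cumulative = {}
running = 0
for n in sorted(distribution):
    running += distribution[n]
    cumulative[n] = round(100 * running / total_preprints, 1)

print(total_preprints, with_external, total_locations)
print(cumulative)
```

The totals agree with the funnel: 9,939 preprints in all, of which 364 have at least one external location, accounting for 491 external locations between them.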
Types of External Locations
Let’s examine what types of external locations were found:
| Version Type | Access Status | Count | Percentage |
|---|---|---|---|
| submittedVersion | Not Open Access | 211 | 43.0 |
| submittedVersion | Open Access | 166 | 33.8 |
| NA | Open Access | 61 | 12.4 |
| acceptedVersion | Not Open Access | 18 | 3.7 |
| NA | Not Open Access | 12 | 2.4 |
| acceptedVersion | Open Access | 10 | 2.0 |
| publishedVersion | Open Access | 7 | 1.4 |
| publishedVersion | Not Open Access | 6 | 1.2 |
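The table counts external locations rather than preprints: the 364 preprints with external locations account for 491 locations in total, so a preprint with several locations appears more than once. The Version Type column indicates whether the external location is a preprint (submittedVersion), accepted manuscript (acceptedVersion), or published version (publishedVersion); NA marks locations with no version value. The Access Status column indicates whether the location is open access. Both fields come from the OpenAlex API response, and their definitions can be found in the OpenAlex documentation.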
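A tally of this kind can be sketched in Python from the `version` and `is_oa` fields that OpenAlex reports on each location. The function name `tally_locations` and the toy input are hypothetical; the report's own tabulation is done in R.

```python
from collections import Counter


def tally_locations(locations):
    """Count locations by (version, access status), most common first."""
    counts = Counter()
    for loc in locations:
        version = loc.get("version") or "NA"  # missing version shown as NA
        access = "Open Access" if loc.get("is_oa") else "Not Open Access"
        counts[(version, access)] += 1
    total = sum(counts.values())
    return [
        (version, access, n, round(100 * n / total, 1))
        for (version, access), n in counts.most_common()
    ]


# Toy input, not data from the report.
toy = [
    {"version": "submittedVersion", "is_oa": False},
    {"version": "submittedVersion", "is_oa": False},
    {"version": None, "is_oa": True},
    {"version": "publishedVersion", "is_oa": True},
]
rows = tally_locations(toy)
```

Each row holds the version type, the access status, the count, and the percentage of all locations, matching the four columns of the table.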
Key Findings Summary
Based on the analysis of 10,000 OSF preprint DOIs:
- High API Success Rate: 99.4% of DOIs were successfully looked up in OpenAlex
- Limited External Presence: Only 3.7% of successfully looked-up preprints (3.6% of the total sample) have at least one external location (non-OSF)
- Open Access Availability: Among preprints with external locations, 47.5% have at least one open access version available
- Overall External OA Rate: Just 1.7% of the total sample has an external open access version
Next Steps and Recommendations
Further Analyses
- Temporal Analysis
  - Examine whether the rate of external VOR linkages has changed over time
  - Analyze the time lag between preprint posting and external publication
  - Use the created date field from the preprint metadata
- Domain-Specific Analysis
  - Investigate whether certain research fields have higher rates of external VOR
  - Consider adding discipline/subject classifications from OpenAlex
  - Compare patterns across different OSF preprint servers
- Publisher and Venue Analysis
  - Identify which journals/publishers most commonly publish OSF preprints
  - Analyze patterns by venue type (journal vs. preprint server)
  - Examine the raw_type field in the detailed results
- Version Progression Analysis
  - Track the progression from preprint to accepted manuscript to published version
  - Analyze what percentage reach each stage of the publishing process
  - Use the version, is_accepted, and is_published fields
- Failed Lookup Investigation
  - Examine the 61 failed lookups to understand why they failed
  - Determine if failed lookups represent a biased sample
  - Consider alternative methods for these cases
Data Quality Considerations
- DOI Coverage: Investigate whether the preprints without external locations truly have no external versions, or if they’re simply not indexed in OpenAlex
- Manual Validation: Consider manually reviewing a random sample of cases to validate the automated findings
- Missing Data: Explore patterns in the 61 failed lookups to ensure they don’t represent a systematic bias
- OSF Filtering: Verify that the filtering logic correctly excludes only OSF locations and doesn’t inadvertently remove legitimate external locations
Potential Research Questions
- What factors predict whether an OSF preprint will have an external VOR?
  - Preprint quality indicators (downloads, citations)
  - Author characteristics (reputation, institution)
  - Field of study
  - Preprint server
- How does OSF compare to other preprint servers in terms of VOR linkages?
  - Compare OSF rates to arXiv, bioRxiv, medRxiv, etc.
  - Analyze whether OSF’s multidisciplinary nature affects VOR rates
- What is the relationship between preprint visibility and eventual publication?
  - Do preprints with more downloads/views have higher publication rates?
  - Does open access status of the preprint affect subsequent publication?
Implementation Recommendations
- Expand Sample Size: Consider running the analysis on the full dataset rather than just 10,000 preprints to get more robust estimates
- Longitudinal Tracking: Set up periodic re-runs of this analysis to track how VOR linkages evolve over time
- Integration with OSF: Consider ways to surface these external VOR linkages in the OSF interface to help users discover related versions
Technical Notes
The analysis pipeline consists of:
- Data extraction: SQL queries against OSF database tables
- API queries: Parallel requests to OpenAlex API with throttling
- Data processing: JSON parsing and filtering of location data
- Output: Summary CSV files for further analysis
All code is available in openalex.r, and the sample can be reproduced using the same random seed (8675309).
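The throttled parallel request step can be illustrated with a small Python sketch. It is not the implementation in openalex.r: the `Throttle` class, the `lookup_all` function, the default rate of 10 requests per second, and the worker count are all assumptions chosen for illustration. The `fetch` callable is injected so the sketch runs without network access.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space out call starts so that at most `rate` begin per second."""

    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside
        # it so other threads can reserve their own slots meanwhile.
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        if slot > now:
            time.sleep(slot - now)


def lookup_all(dois, fetch, rate=10.0, workers=4):
    """Call fetch(doi) for every DOI and return (doi, status, payload) rows."""
    throttle = Throttle(rate)

    def one(doi):
        throttle.wait()
        try:
            return doi, "success", fetch(doi)
        except Exception as exc:  # record the failure instead of aborting
            return doi, "failed", str(exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, dois))  # map preserves input order
```

Returning a status row for every DOI, including failures, mirrors the idea of logging both successful and failed requests so that failed lookups can be examined later.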
Appendix: Data Files
The analysis generates the following output files:
- data/doi_lookup_summary.csv: Summary statistics for each preprint
- data/doi_lookup_results.csv: Detailed location data for external VORs
- logs/doi_lookup_status.csv: API request success/failure log
The data files are not included in this report (or the codebase) but can be referenced on Google Drive for further exploration.