A description of Overton’s bulk data snapshot format
Data snapshots give you access to the Overton database in machine readable form so that you can import it into your own database or BI system.
Note that data snapshots are not publicly available – we have a list of customers who we generate them for. If you’re unsure if you should have access to them or not please contact the account holder at your organization. If your organization doesn’t yet have a license then please do reach out via our subscriptions page.
Depending on your license the snapshot may not contain all of the fields listed below. If you think you’re missing something please contact support.
Each snapshot is a tar.gz file containing a set of JSONL files with document metadata and a single text file with the citation graph (this is provided for convenience: the graph can also be rebuilt directly from the JSON metadata of each document).
Document metadata
The JSONL files are named overton_docs_xxxxxxxxx.json (where xxxxxxxxx is a zero-padded sequence number starting at 1), are UTF-8 encoded and contain one JSON record per line.
Other than the final JSONL file each one should contain 999 records.
Each JSON record contains the metadata for a different policy PDF indexed in Overton.
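As a rough sketch of how the files described above could be read, the Python below streams records straight out of the tar.gz archive without unpacking it first. The function names are made up for illustration, and it assumes that any directory prefix inside the archive can be ignored (the internal layout is not specified here).

```python
import json
import tarfile
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-blank line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_snapshot_records(archive_path: str) -> Iterator[dict]:
    """Stream records from every overton_docs_* member of a snapshot archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Assumption: ignore any directory prefix inside the archive.
            base_name = member.name.rsplit("/", 1)[-1]
            if not member.isfile() or not base_name.startswith("overton_docs_"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Decode as UTF-8; malformed bytes become U+FFFD instead of raising.
            text_lines = (raw.decode("utf-8", errors="replace") for raw in handle)
            yield from parse_jsonl(text_lines)
```

Because the function is a generator, records can be loaded into a database in batches without holding the whole snapshot in memory.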
IMPORTANT: a single policy document may have multiple PDFs associated with it, e.g. an executive summary, an appendix or different language versions. Overton indexes each one separately but aggregates them in the web interface. You can do this too, by grouping on the policy_document_id key.
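One way to do that grouping, sketched in Python. The helper name and the sample ID values are invented for illustration; real IDs may look different.

```python
from collections import defaultdict
from typing import Iterable


def group_pdfs_by_policy_document(records: Iterable[dict]) -> dict:
    """Map each policy_document_id to the list of PDF records that share it."""
    groups = defaultdict(list)
    for record in records:
        key = record.get("policy_document_id")
        if key is None:
            # Skip records without a parent ID rather than grouping them together.
            continue
        groups[key].append(record)
    return dict(groups)


# Hypothetical records: two PDFs (e.g. a report and its summary) share a parent.
sample_records = [
    {"pdf_document_id": "pdf-1", "policy_document_id": "doc-1"},
    {"pdf_document_id": "pdf-2", "policy_document_id": "doc-1"},
    {"pdf_document_id": "pdf-3", "policy_document_id": "doc-2"},
]
grouped = group_pdfs_by_policy_document(sample_records)
```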
Metadata schema
Be careful as it’s not guaranteed that every record contains every field, and empty fields may contain null or empty strings / arrays. Fields may appear in any order inside a record.
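In practice this means treating a missing key, a null, and an empty value as the same thing. The two helpers below are a minimal sketch of that idea; their names are hypothetical and not part of any Overton tooling.

```python
from typing import Optional


def get_text(record: dict, field: str) -> Optional[str]:
    """Return a non-empty string value, or None if absent, null or blank."""
    value = record.get(field)
    if isinstance(value, str) and value.strip():
        return value
    return None


def get_list(record: dict, field: str) -> list:
    """Return a list value, or an empty list if absent, null or not a list."""
    value = record.get(field)
    return value if isinstance(value, list) else []


# A hypothetical record exercising the three "empty" cases.
example = {"title": "", "snippet": None, "dois_cited": []}
```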
pdf_document_id
This is the primary key of the record.
policy_document_id
Every PDF belongs to exactly one policy document in Overton. Each policy document has a unique ID shown in this field.
title
snippet
authors
published_on
policy_document_url
These are the title, abstract (where available), authors, publication date and web address of the relevant parent policy document. The publication date uses YYYY-MM-DD format.
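Since the publication date is a YYYY-MM-DD string and may be missing or empty, a tolerant parser can be useful when loading it into a typed date column. The sketch below uses the standard library only; the function name is made up for illustration.

```python
from datetime import date
from typing import Optional


def parse_published_on(record: dict) -> Optional[date]:
    """Parse the published_on field, returning None when it is missing or invalid."""
    value = record.get("published_on")
    if not value:
        return None
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        # Not a string, or not a valid ISO date: treat it as unknown.
        return None
```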
translated_title
language
Overton tries to detect languages automatically, but falls back to English. The language codes are in ISO 639-2 format (three letter codes, “eng” is English etc.). Where the language isn’t English we provide a machine translated version of the title in the translated_title field.
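For example, an English-language display title could be chosen along these lines. This is only a sketch: the helper name is hypothetical, and treating a missing language as "eng" mirrors the fallback described above rather than any documented guarantee.

```python
def display_title(record: dict) -> str:
    """Prefer the machine-translated title for non-English documents."""
    language = record.get("language") or "eng"
    translated = record.get("translated_title")
    if language != "eng" and translated:
        return translated
    # English document, or no translation available: use the original title.
    return record.get("title") or ""
```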
policy_document_series
overton_policy_document_series
classifications
topics
entities
Often a policy source will group documents into a series (“Commodity Market Reports”, “Working Papers” etc.) – this is stored where available in the policy_document_series field.
Because the languages, names and spellings of these series types vary across and even within sources, Overton maps some common series types (working papers, blogs, transcripts and clinical guidelines) into a low-cardinality overton_policy_document_series field.
Classifications (subject areas), topics and entities are JSON arrays and are described in more detail here.
pdf_url
pdf_title
pdf_thumbnail
This is PDF-specific metadata – the URL where Overton found the PDF, its title (if available: usually this field is empty) and a thumbnail image (these can be provided in a separate snapshot file if required: please don’t hotlink thumbnails on overton.io).
policy_source_id
policy_source_title
policy_source_type
policy_source_region
policy_source_country
These fields contain more information about the source of the policy document – a unique ID (policy_source_id), its title, type and the country and region it is from.
policy_document_ids_cited
mentions_people
dois_cited
This is the citation graph.
policy_document_ids_cited is a JSON array containing the set of policy_document_id keys representing policy documents cited by this PDF (note: these are not pdf_document_id keys, they are the keys of the parent policy document).
dois_cited is a JSON array containing the DOIs that are cited by this PDF.
mentions_people is a JSON array containing any academic name mentions that we’ve found in the document. Note that this isn’t the same as entity extraction – you can read more about our name finding process on the relevant help page.
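A minimal sketch of turning these fields into citation edges is shown below. The function name is invented, and which ID should identify the citing side is an assumption: the sketch defaults to the parent policy_document_id but lets the caller pick pdf_document_id instead.

```python
from typing import Iterable, Set, Tuple


def citation_edges(
    records: Iterable[dict], citing_key: str = "policy_document_id"
) -> Set[Tuple[str, str, str]]:
    """Collect (citing ID, citation type, cited ID) tuples from metadata records."""
    edges = set()
    for record in records:
        citing = record.get(citing_key)
        if citing is None:
            continue
        # Either field may be missing, null or empty, so fall back to [].
        for cited in record.get("policy_document_ids_cited") or []:
            edges.add((citing, "doc", cited))
        for doi in record.get("dois_cited") or []:
            edges.add((citing, "doi", doi))
    return edges
```

Using a set means that several PDFs of the same policy document citing the same target produce a single edge.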
overton_document_url
This is the web address of the policy document on overton.io.
Citation graph
You can build the citation graph directly from the JSONL files but we also include a text file for convenience.
The text file is tab delimited and contains one citing document -> cited document pair per line.
The file has three columns:
Citing document ID
Citation type
Cited document ID
The citation type is either “doc” (for a policy document to policy document citation) or “doi” (for a policy document to scholarly document citation).
When the type is “doi” then the cited document ID is a DOI (note: these are not all Crossref DOIs, many will be DataCite or EU Publications Office minted).
See the data gotcha below about entries in the mapping not appearing in the snapshot files.
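The three-column layout can be parsed with a few lines of Python. The sketch below assumes there is no header row (any line whose type is not "doc" or "doi" is skipped anyway), and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def read_citation_graph(lines: Iterable[str]) -> dict:
    """Build {citing ID: {"doc": set of IDs, "doi": set of DOIs}} from tab-delimited lines."""
    graph = defaultdict(lambda: {"doc": set(), "doi": set()})
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue  # Malformed or blank line.
        citing, kind, cited = parts
        if kind not in ("doc", "doi"):
            continue  # Unknown citation type (or a header row).
        graph[citing][kind].add(cited)
    return dict(graph)
```

The function accepts any iterable of lines, so an open text file handle (opened with encoding="utf-8") can be passed in directly.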
Data gotchas
Odd character encoding & the UTF-8 replacement character in titles
Overton makes a best effort guess at policy document titles from the available metadata but sometimes has to fall back on parsing text from the body of a PDF or on OCR.
This works poorly on non-English language documents and where the document has already been OCRed at source by an older system (where the text produced is often good enough for searching, but not really suitable for humans).
As a result you may encounter UTF-8 encoding errors in strings.
We currently strip out the UTF-8 replacement character in the data dump to make parsing easier.
Missing document IDs in policy_document_ids_cited
Occasionally the policy_document_ids_cited field may contain policy document IDs that aren’t in the dump or the Overton web app.
This happens when there’s a solid citation – usually a link – to a document that we have in the database but whose metadata we don’t trust (usually because its title or date fail our data sanity checks, or because we couldn’t fetch it completely from the source website).
These “ghost” documents aren’t included in the database dumps and don’t appear in the web interface. We leave these citations in because they are valid; we just can’t show you the policy document being cited.
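If ghost IDs would break a foreign key constraint or a join in your own database, they can be detected by comparing the cited IDs against the IDs actually present in the snapshot. The following is a sketch with a made-up function name, not an official tool.

```python
from typing import Iterable, Set


def find_ghost_ids(records: Iterable[dict]) -> Set[str]:
    """Return cited policy document IDs that have no record in the snapshot."""
    known = set()
    cited = set()
    for record in records:
        doc_id = record.get("policy_document_id")
        if doc_id is not None:
            known.add(doc_id)
        # The field may be missing, null or empty, so fall back to [].
        cited.update(record.get("policy_document_ids_cited") or [])
    return cited - known
```

Whether to drop these edges or keep them as dangling references depends on how the graph will be used; the citations themselves are valid.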