Overton data snapshots

A description of Overton’s bulk data snapshot format

Data snapshots provide access to the Overton database in a machine-readable format, allowing you to import it into your own database or BI system.

Data snapshots are not publicly available. We generate them for a defined list of customers. If you are unsure whether you should have access, please contact your organization’s account holder. If your organization does not yet have a subscription to Overton Index, please reach out via our subscriptions page.

Each snapshot is a tar.gz file that contains a set of JSONL files with document metadata and a single text file with the citation graph. We provide the text file for convenience, but you can also rebuild the graph directly from the JSON metadata of each document.

Document metadata

The JSONL files are named overton_docs_xxxxxxxxx.json (where xxxxxxxxx is a zero-padded sequence number starting at 1). They are UTF-8 encoded and contain one JSON record per line.

Other than the final JSONL file, each one should contain 999 records.

Each JSON record contains the metadata for a different policy PDF indexed in Overton.
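As a minimal sketch of reading the metadata, the Python function below streams records straight out of the archive without unpacking it first. The function name iter_records and the snapshot path are hypothetical, and the sketch assumes the metadata files can be recognised by the overton_docs_ prefix described above, whatever directory they sit in inside the archive.

```python
import json
import tarfile
from typing import Iterator


def iter_records(snapshot_path: str) -> Iterator[dict]:
    """Yield one metadata record per non-empty JSONL line in the snapshot."""
    with tarfile.open(snapshot_path, mode="r:gz") as archive:
        for member in archive:
            # Compare on the base name, in case files sit inside a folder.
            base_name = member.name.rsplit("/", 1)[-1]
            if not (member.isfile() and base_name.startswith("overton_docs_")):
                continue  # e.g. the citation graph text file
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:
                line = raw_line.decode("utf-8").strip()
                if line:
                    yield json.loads(line)
```

Because the function is a generator, records can be loaded into a database in batches without holding the whole snapshot in memory.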

IMPORTANT: a single policy document may have multiple PDFs associated with it, e.g. an executive summary, an appendix or different language versions. Overton indexes each one separately but aggregates them in the web interface. You can do this too, by grouping on the policy_document_id key.
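The grouping described above could look like the following Python sketch. The function name group_by_policy_document is hypothetical; the only field it relies on is policy_document_id.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_policy_document(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Collect PDF records under the policy document they belong to."""
    grouped = defaultdict(list)
    for record in records:
        key = record.get("policy_document_id")
        if key:  # defensive: skip a record if the key is missing or empty
            grouped[key].append(record)
    return dict(grouped)
```

Each value in the returned dictionary is the list of PDFs (summary, appendix, translations and so on) for one policy document.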

Metadata schema

Not every record is guaranteed to contain every field, and an empty field may hold null, an empty string or an empty array. Fields may appear in any order inside a record.
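One way to cope with missing and empty fields is to read every value through a small helper. The Python sketch below is an illustration only; get_text and get_list are hypothetical names, not part of any Overton tooling.

```python
from typing import Any, List


def get_text(record: dict, field: str, default: str = "") -> Any:
    """Return the field value, or the default if it is absent, null or ''."""
    value = record.get(field)
    if value is None or value == "":
        return default
    return value


def get_list(record: dict, field: str) -> List[Any]:
    """Return the field as a list, treating absent, null or non-list as empty."""
    value = record.get(field)
    return value if isinstance(value, list) else []
```

With these helpers, code such as get_list(record, "topics") behaves the same whether the field is missing, null or an empty array.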

pdf_document_id

This is the primary key of the record.

policy_document_id

Every PDF belongs to exactly one policy document in Overton. Each policy document has a unique ID shown in this field.

title

snippet

authors

published_on

policy_document_url

These are the title, abstract (snippet, where available), authors, publication date and web address of the relevant parent policy document. The publication date uses YYYY-MM-DD format.
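Since published_on may be missing or empty like any other field, a tolerant parser is useful when loading dates into a typed column. This Python sketch assumes you would rather store None than fail on a malformed value; parse_published_on is a hypothetical name.

```python
from datetime import date, datetime
from typing import Optional


def parse_published_on(record: dict) -> Optional[date]:
    """Parse the YYYY-MM-DD publication date, or return None if unusable."""
    value = record.get("published_on")
    if not isinstance(value, str) or not value:
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        # Not a valid calendar date in the expected format.
        return None
```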

translated_title

language

Overton tries to detect each document’s language automatically and falls back to English when it cannot. Language codes are in ISO 639-2 format (three-letter codes, e.g. “eng” for English). Where the language isn’t English, we provide a machine-translated version of the title in the translated_title field.
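If you want a single English-readable title per record, one possible approach is sketched below in Python. The function name display_title is hypothetical, and treating a missing language value as “eng” is an assumption made here to mirror the fallback described above.

```python
def display_title(record: dict) -> str:
    """Prefer the machine-translated title for non-English documents."""
    # Assumption: a missing or empty language is treated as English.
    language = record.get("language") or "eng"
    translated = record.get("translated_title")
    if language != "eng" and translated:
        return translated
    return record.get("title") or ""
```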

policy_document_series

overton_policy_document_series

classifications

topics

entities

Often a policy source will group documents into a series (“Commodity Market Reports”, “Working Papers” etc.). Where available, the series name is stored in the policy_document_series field.

Because languages, names, and spellings of these series types vary across and even within sources, Overton maps common series types (working papers, blogs, transcripts, and clinical guidelines) to a low-cardinality overton_policy_document_series field.

Classifications (subject areas), topics and entities are JSON arrays and are described in more detail here.

pdf_url

pdf_title

pdf_thumbnail

This metadata is specific to PDFs. It includes the URL where Overton found the PDF, its title (if available—this field is usually empty), and a thumbnail image. We can provide thumbnails in a separate snapshot file if required, but please do not hotlink them from overton.io.

policy_source_id

policy_source_title

policy_source_type

policy_source_region

policy_source_country

These fields contain more information about the source of the policy document – a unique ID (policy_source_id), its title, type and the country and region it is from.

policy_document_ids_cited

mentions_people

dois_cited

This is the citation graph.

policy_document_ids_cited is a JSON array containing the set of policy_document_id keys representing policy documents cited by this PDF. Note that these are not pdf_document_id keys; they are the keys of the parent policy documents.

dois_cited is a JSON array containing the DOIs that are cited by this PDF.

mentions_people is a JSON array containing any academic name mentions that we’ve found in the document. Note that this isn’t the same as entity extraction – you can read more about our name finding process on the relevant help page.
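To rebuild the citation graph from the metadata rather than from the text file, you could walk these two arrays as in the Python sketch below. The function name citation_edges is hypothetical. Using policy_document_id as the citing ID is an assumption; pass citing_key="pdf_document_id" if you prefer PDF-level edges.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (citing ID, citation type, cited ID)


def citation_edges(
    records: Iterable[dict], citing_key: str = "policy_document_id"
) -> Set[Edge]:
    """Build a de-duplicated set of 'doc' and 'doi' citation edges."""
    edges: Set[Edge] = set()
    for record in records:
        citing = record.get(citing_key)
        if not citing:
            continue
        # 'or []' covers fields that are missing or null.
        for cited in record.get("policy_document_ids_cited") or []:
            edges.add((citing, "doc", cited))
        for doi in record.get("dois_cited") or []:
            edges.add((citing, "doi", doi))
    return edges
```

A set is used because several PDFs of the same policy document may cite the same target.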

overton_document_url

This is the web address of the policy document on overton.io.

Citation graph

You can build the citation graph directly from the JSONL files but we also include a text file for convenience.

The text file is tab delimited and contains one citing document -> cited document pair per line.

The file has three columns:

Citing document ID

Citation type

Cited document ID

The citation type is either ‘doc’ (a policy document citing another policy document) or ‘doi’ (a policy document citing a scholarly document).

When the type is ‘doi’, the cited document ID is a DOI. Note that these are not all Crossref DOIs; many come from DataCite or the EU Publications Office.

See the data gotcha below for details on entries in the mapping that do not appear in the snapshot files.
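The three-column layout described above can be parsed with a few lines of Python. This is a sketch only: the names Citation and parse_citation_lines are hypothetical, and skipping any line that does not have exactly three columns or a recognised type is a defensive assumption rather than documented behaviour.

```python
from typing import Iterable, Iterator, NamedTuple


class Citation(NamedTuple):
    citing_id: str
    citation_type: str  # 'doc' or 'doi'
    cited_id: str


def parse_citation_lines(lines: Iterable[str]) -> Iterator[Citation]:
    """Yield one Citation per well-formed tab-delimited line."""
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        citing_id, citation_type, cited_id = parts
        if citation_type not in ("doc", "doi"):
            continue
        yield Citation(citing_id, citation_type, cited_id)
```

The function accepts any iterable of strings, so an open text file (opened with encoding="utf-8") can be passed in directly.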

Data gotchas

Odd character encoding & the UTF-8 replacement character in titles

Overton makes a best-effort guess at policy document titles from the available metadata, but sometimes falls back on parsing text from the body of a PDF or using OCR.

This method works poorly on non-English documents and on files already OCRed at the source by an older system, where the text is often sufficient for searching but not suitable for humans.

As a result you may encounter UTF-8 encoding errors in strings.

We currently strip out the UTF-8 replacement character in the data dump to make parsing easier.
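If you still hit undecodable bytes when reading a file, one defensive option is to decode leniently and then drop the replacement character yourself. The Python sketch below is an assumption about how you might handle this on your side, not a description of Overton’s own pipeline; decode_line is a hypothetical name.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def decode_line(raw: bytes) -> str:
    """Decode bytes as UTF-8, discarding anything that cannot be decoded."""
    # errors="replace" turns invalid byte sequences into U+FFFD ...
    text = raw.decode("utf-8", errors="replace")
    # ... which is then removed, mirroring the stripping described above.
    return text.replace(REPLACEMENT_CHARACTER, "")
```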

Missing document IDs in policy_document_ids_cited

Occasionally the policy_document_ids_cited field may contain policy document IDs that aren’t in the dump or the Overton web app.

This occurs when a solid citation—usually a link—points to a document in our database whose metadata we do not trust. This often happens because its title or date fails our data sanity checks, or because we could not fully fetch it from the source website.

We do not include these ‘ghost’ documents in database dumps, and they do not appear in the web interface. We keep them because the citation is valid, but we cannot display the policy document being cited.
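If ghost IDs would break a foreign-key constraint in your own database, you can find them before loading. The Python sketch below compares cited IDs against the IDs actually present in the records; find_ghost_ids is a hypothetical name.

```python
from typing import Iterable, Set


def find_ghost_ids(records: Iterable[dict]) -> Set[str]:
    """Return cited policy document IDs that have no record in the dump."""
    known: Set[str] = set()
    cited: Set[str] = set()
    for record in records:
        document_id = record.get("policy_document_id")
        if document_id:
            known.add(document_id)
        cited.update(record.get("policy_document_ids_cited") or [])
    return cited - known
```

Whether to drop these edges or keep them as dangling references depends on your use case; the citations themselves are valid.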
