Document vs PDF metadata in the API

How “PDF” and “Policy document” objects in the snapshots and APIs are related and what metadata is shared between them

In the Overton web interface every Policy document object is made up of one or more PDF objects.

How Overton parses publication landing pages

Imagine a publication landing page representing a report on a government website.

Even though everything on it relates to the one report, that landing page may link out to a number of different documents:

An executive summary
The actual report
An appendix containing tables and figures

Upon scraping this website, Overton would create a Policy document (the entire report) and then three PDF objects, one for each of the three child documents.

Alternatively, one landing page may link out to the “same” document in different languages:

English
French

Here Overton would still create a single Policy document and two PDF objects, one for each language.

Despite their name, PDF objects can represent content in other formats too, e.g. Word or HTML.
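The two landing-page examples above can be pictured as a simple parent/child structure. The following is only an illustrative sketch: the class names, the label attribute and the example.org URLs are hypothetical and are not part of Overton's API or snapshot schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PdfObject:
    pdf_url: str
    label: str  # hypothetical descriptive label, not an Overton field


@dataclass
class PolicyDocument:
    title: str
    pdfs: List[PdfObject] = field(default_factory=list)


# One landing page linking to three child documents.
report = PolicyDocument(
    title="Example report",
    pdfs=[
        PdfObject("https://example.org/report/summary.pdf", "Executive summary"),
        PdfObject("https://example.org/report/report.pdf", "The actual report"),
        # A "PDF" object does not have to be a PDF file.
        PdfObject("https://example.org/report/appendix.docx", "Appendix"),
    ],
)

# One landing page linking to the same document in two languages.
bilingual = PolicyDocument(
    title="Example bilingual document",
    pdfs=[
        PdfObject("https://example.org/doc/en.pdf", "English"),
        PdfObject("https://example.org/doc/fr.html", "French"),
    ],
)
```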

Metadata shared between vs unique to individual PDFs

Usually the basic metadata for the Policy document – its title, publication date, snippet and so on – comes from the landing page itself. You can learn more about the ways we look for this data in the guidelines for publishers.

In some cases we derive this from the first PDF (for example when there’s no clear publication date or title in the landing page metadata).

Those basic metadata fields include:

title
translated_title
policy_document_url
policy_document_series
authors
snippet
published_on
policy_source information

Each PDF “inherits” the metadata from the parent Policy document.
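As a rough illustration of this inheritance, a flattened PDF record can be thought of as the parent's fields combined with the PDF's own. This is a minimal sketch, assuming both objects are plain dictionaries; the helper name, the identifier values and the date format are made up for the example.

```python
def inherit_metadata(policy_document, pdf):
    """Return a flat PDF record: parent fields plus the PDF's own fields."""
    return {**policy_document, **pdf}


# Hypothetical values; only the field names come from this article.
policy_document = {
    "policy_document_id": "example_source-0001",
    "title": "Example report",
    "published_on": "2024-01-15",
}
pdf = {
    "pdf_document_id": "example_source-0001-aaaa",
    "pdf_url": "https://example.org/report/summary.pdf",
}

record = inherit_metadata(policy_document, pdf)
```

The resulting record carries the parent's title and published_on alongside the PDF's own pdf_url.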

Each PDF also has some unique metadata, and more is added as it moves through Overton’s data processing pipeline:

pdf_url
pdf_title (where available)
pdf_thumbnail (where available)
Language
Topics, entities and subject areas
Outgoing references to scholarly or other policy documents
SDG classifications
Lists of people / institution pairs mentioned in the text

In turn, in the app each Policy document is considered to contain the union of all of its child PDFs’ subject areas, classifications, outgoing references etc.
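If you are working with PDF-level records and want the document-level view, the union can be computed along these lines. This is a sketch under assumptions: the "topics" key and the identifier values are hypothetical stand-ins for whichever list-valued field you are combining.

```python
def union_of_children(pdfs, field_name):
    """Collect distinct values of a list-valued field across child PDFs,
    keeping the order in which values are first seen."""
    seen = {}
    for pdf in pdfs:
        for value in pdf.get(field_name, []):
            seen.setdefault(value, None)
    return list(seen)


# Hypothetical child PDFs of one Policy document.
child_pdfs = [
    {"pdf_document_id": "example_source-0001-aaaa", "topics": ["health", "housing"]},
    {"pdf_document_id": "example_source-0001-bbbb", "topics": ["housing", "transport"]},
]

document_topics = union_of_children(child_pdfs, "topics")
```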

Overton’s processing pipeline works on PDFs, not Policy documents, and the snapshots reflect this: each item in the snapshot is a PDF.

However, the overton.io web application and API group PDFs by their inherited policy_document_id to make things friendlier for end users, so each item there is a Policy document. If the same query matched both the executive summary and the appendix of the example report described at the top of this page, the application and API would return only one result: the common parent Policy document.
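To reproduce this grouping on snapshot data, you can group PDF items by policy_document_id before counting results. The sketch below assumes each snapshot item has been loaded as a dictionary; the "text" key and the naive substring match are hypothetical simplifications and do not reflect how Overton's search works.

```python
from collections import defaultdict


def group_by_policy_document(snapshot_items):
    """Map each policy_document_id to the list of its child PDF items."""
    grouped = defaultdict(list)
    for item in snapshot_items:
        grouped[item["policy_document_id"]].append(item)
    return dict(grouped)


def search_policy_documents(snapshot_items, query):
    """Return one policy_document_id per matching document,
    however many of its child PDFs contain the query."""
    needle = query.lower()
    matches = []
    for policy_id, pdfs in group_by_policy_document(snapshot_items).items():
        if any(needle in pdf["text"].lower() for pdf in pdfs):
            matches.append(policy_id)
    return matches


# Hypothetical snapshot items: two PDFs of one report, one PDF of another.
items = [
    {"policy_document_id": "src-1", "pdf_document_id": "src-1-a", "text": "Summary on air quality"},
    {"policy_document_id": "src-1", "pdf_document_id": "src-1-b", "text": "Appendix: air quality tables"},
    {"policy_document_id": "src-2", "pdf_document_id": "src-2-a", "text": "Unrelated report"},
]

results = search_policy_documents(items, "air quality")
```

Here two PDFs match the query, but they share a parent, so only one Policy document is returned.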

Relationship and identifiers

Every Policy document object has at least one PDF child object. If no documents are linked from the landing page, Overton will either not create a Policy document or scrape the HTML of the landing page itself, as appropriate.

Policy documents are identified by their policy_document_id field in the API and snapshot data.

PDFs are identified by their pdf_document_id field.

In general, the pdf_document_id contains the parent policy_document_id. For example, the PDFs:

committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa
committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991

are both child PDFs of Policy document:

committee_house-2048b061d68144d56dd7bbb006b50f8b
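Based on the example identifiers above, the parent ID can be recovered by dropping the final hyphen-separated segment. This is only a sketch of that pattern, and the helper name is hypothetical. Since the relationship holds "in general" rather than always, prefer the policy_document_id field carried on each record where it is available.

```python
def parent_policy_document_id(pdf_document_id):
    """Strip the last hyphen-separated segment from a pdf_document_id.

    Assumes the ID follows the pattern shown in the examples above.
    """
    parent, separator, _suffix = pdf_document_id.rpartition("-")
    if not separator or not parent:
        raise ValueError(f"No parent segment found in {pdf_document_id!r}")
    return parent


pdf_ids = [
    "committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa",
    "committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991",
]

# Both child PDFs resolve to the same parent, so the set has one element.
parents = {parent_policy_document_id(pdf_id) for pdf_id in pdf_ids}
```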

Updated on February 8, 2024
