How “PDF” and “Policy document” objects in the snapshots and APIs are related and what metadata is shared between them
In the Overton web interface every Policy document object is made up of one or more PDF objects.
How Overton parses publication landing pages
Imagine a publication landing page representing a report on a government website.
Even though it all relates to one report, that landing page may link out to several different documents:
- An executive summary
- The actual report
- An appendix containing tables and figures
Upon scraping this website Overton would create one Policy document (the entire report) and three PDF objects, one for each of the three child documents.
Alternatively, one landing page may link out to the “same” document in different languages:
- English
- French
Here Overton would still create a single Policy document and two PDF objects, one for each language.
Despite their name, PDF objects can represent content in other formats too, e.g. Word or HTML.
Metadata shared between vs unique to individual PDFs
Usually the basic metadata for the Policy document (its title, publication date, snippet and so on) comes from the landing page itself. You can learn more about the ways we look for this data in the guidelines for publishers.
In some cases we derive this from the first PDF (for example when there’s no clear publication date or title in the landing page metadata).
Those basic metadata fields include:
- title
- translated_title
- policy_document_url
- policy_document_series
- authors
- snippet
- published_on
- policy_source information
Each PDF “inherits” the metadata from the parent Policy document.
Each PDF also has some unique metadata, and more is added as it moves through Overton’s data processing pipeline:
- pdf_url
- pdf_title (where available)
- pdf_thumbnail (where available)
- Language
- Topics, entities and subject areas
- Outgoing references to scholarly or other policy documents
- SDG classifications
- Lists of people / institution pairs mentioned in the text
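As a rough illustration of the shared versus unique split, the sketch below separates one PDF record into its inherited fields and its PDF-specific fields. It assumes a record is available as a plain Python dict keyed by the field names used in this article; the helper name split_record, the example IDs and all example values are made up for illustration.

```python
# Inherited fields named in this article. "policy_source information" spans
# several fields whose exact names are not listed here, so it is left out.
SHARED_FIELDS = frozenset({
    "policy_document_id",
    "title",
    "translated_title",
    "policy_document_url",
    "policy_document_series",
    "authors",
    "snippet",
    "published_on",
})


def split_record(pdf_record):
    """Return (shared, unique) dicts for one PDF record (hypothetical helper)."""
    shared = {k: v for k, v in pdf_record.items() if k in SHARED_FIELDS}
    unique = {k: v for k, v in pdf_record.items() if k not in SHARED_FIELDS}
    return shared, unique


# Hypothetical record with made-up values, for illustration only.
example = {
    "policy_document_id": "example_source-aaa",
    "pdf_document_id": "example_source-aaa-111",
    "title": "Annual report",
    "published_on": "2020-01-01",
    "pdf_url": "https://example.org/report.pdf",
    "pdf_title": "Executive summary",
}
shared, unique = split_record(example)
```

Two PDFs with the same parent should produce identical shared dicts and differ only in the unique one.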
In turn, in the app each Policy document is considered to contain the union of all of its child PDFs’ subject areas, classifications, outgoing references etc.
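The union described here can be sketched in a few lines. This is a minimal example, not Overton’s implementation: it assumes PDF records are plain dicts, and the subject_areas field name, the example IDs and the values are hypothetical stand-ins for whichever list-valued field you want to combine.

```python
from collections import defaultdict


def union_per_policy_document(pdf_records, field):
    """Combine a list-valued field across all PDFs sharing a parent."""
    merged = defaultdict(set)
    for record in pdf_records:
        # A missing or empty field simply contributes nothing to the union.
        merged[record["policy_document_id"]].update(record.get(field) or [])
    return {doc_id: sorted(values) for doc_id, values in merged.items()}


# Hypothetical records: two child PDFs of the same Policy document.
pdf_records = [
    {
        "policy_document_id": "example_source-aaa",
        "pdf_document_id": "example_source-aaa-111",
        "subject_areas": ["health", "economy"],
    },
    {
        "policy_document_id": "example_source-aaa",
        "pdf_document_id": "example_source-aaa-222",
        "subject_areas": ["health", "education"],
    },
]
merged_subjects = union_per_policy_document(pdf_records, "subject_areas")
```

Values are collected in a set so that a subject area appearing in several child PDFs is counted once for the parent.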
Overton’s processing pipeline works on PDFs, not Policy documents, and the snapshots reflect this: each item in the snapshot is a PDF.
However, in the overton.io web application & API we group PDFs by their inherited policy_document_id to make things friendlier for end users, so each item is a Policy document. If a query matched both the executive summary and the appendix of the example report at the top of this help page, the application & API would return only one result (the common parent Policy document).
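If you work from the snapshots and want the same one-result-per-Policy-document behaviour, a grouping step along these lines may help. It is only a sketch: it assumes each matching PDF is a dict carrying policy_document_id and pdf_document_id, and the function name and the example IDs are hypothetical.

```python
def group_hits_by_policy_document(pdf_hits):
    """Collapse PDF-level matches into one entry per parent Policy document."""
    grouped = {}
    for hit in pdf_hits:
        # dicts keep insertion order, so parents stay in first-seen order.
        grouped.setdefault(hit["policy_document_id"], []).append(
            hit["pdf_document_id"]
        )
    return grouped


# Hypothetical matches: the summary and appendix of one report, plus another document.
pdf_hits = [
    {"policy_document_id": "example_source-aaa",
     "pdf_document_id": "example_source-aaa-summary"},
    {"policy_document_id": "example_source-aaa",
     "pdf_document_id": "example_source-aaa-appendix"},
    {"policy_document_id": "example_source-bbb",
     "pdf_document_id": "example_source-bbb-main"},
]
results = group_hits_by_policy_document(pdf_hits)
```

Here three PDF-level matches become two results, and each result still records which child PDFs matched.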
Relationship and identifiers
Every Policy document object has at least one PDF child object. If there are no documents linked from the landing page, Overton will either not create a Policy document or scrape the HTML of the landing page itself, as appropriate.
Policy documents are identified by their policy_document_id field in the API and snapshot data.
PDFs are identified by their pdf_document_id field.
In general the pdf_document_id contains the parent policy_document_id, e.g. the PDFs:
- committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa
- committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991
are both child PDFs of Policy document:
- committee_house-2048b061d68144d56dd7bbb006b50f8b
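Going by the example above, the parent ID looks like the pdf_document_id with its final hyphen-separated segment removed. The sketch below relies on that assumption, which is inferred from this one example rather than from a documented rule. Since the relationship only holds "in general", prefer the policy_document_id field on the record whenever it is available. The function name is hypothetical.

```python
def parent_policy_document_id(pdf_document_id):
    """Guess the parent ID by dropping the last hyphen-separated segment.

    Assumption: the child suffix is everything after the final hyphen.
    Returns None when the ID contains no hyphen at all.
    """
    parent, separator, _suffix = pdf_document_id.rpartition("-")
    return parent if separator else None


pdf_ids = [
    "committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa",
    "committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991",
]
parents = {parent_policy_document_id(pdf_id) for pdf_id in pdf_ids}
```

Both example PDFs resolve to the same parent, so parents holds a single ID.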