How “PDF” and “Policy document” objects in the snapshots and APIs are related and what metadata is shared between them
In the Overton web interface every Policy document object is made up of one or more PDF objects.
How Overton parses publication landing pages
Imagine a publication landing page representing a report on a government website.
Even though it’s all related to the one report, that landing page may link out to a number of different documents:
- An executive summary
- The actual report
- An appendix containing tables and figures
Upon scraping this website, Overton would create a Policy document – the entire report – and then three PDF objects, one for each of the three child documents.
Alternatively, one landing page may link out to the “same” document in different languages:
- English
- French
Here Overton would still create a single Policy document and two PDF objects, one for each language.
Despite their name PDF objects can represent content in other formats too – e.g. Word or HTML.
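As a rough illustration of the two landing page examples above, the parent/child relationship can be sketched as a small data model. This is only a sketch: the class names, the `label` field and the example.org URLs are hypothetical and are not Overton's actual schema; only `title`, `policy_document_url` and `pdf_url` are field names taken from this page.

```python
from dataclasses import dataclass, field


@dataclass
class PdfObject:
    """One child document linked from the landing page (may be Word or HTML too)."""
    pdf_url: str
    label: str  # hypothetical field, used here only to describe the child


@dataclass
class PolicyDocument:
    """The landing page as a whole, with one or more child PDF objects."""
    title: str
    policy_document_url: str
    pdfs: list[PdfObject] = field(default_factory=list)


# Example 1: one report split into three child documents.
report = PolicyDocument(
    title="Example report",
    policy_document_url="https://example.org/report",
    pdfs=[
        PdfObject("https://example.org/report/summary.pdf", "Executive summary"),
        PdfObject("https://example.org/report/full.pdf", "The actual report"),
        PdfObject("https://example.org/report/appendix.pdf", "Appendix"),
    ],
)

# Example 2: the "same" document in two languages.
translated = PolicyDocument(
    title="Example bilingual report",
    policy_document_url="https://example.org/bilingual",
    pdfs=[
        PdfObject("https://example.org/bilingual/en.pdf", "English"),
        PdfObject("https://example.org/bilingual/fr.pdf", "French"),
    ],
)
```

In both cases there is exactly one Policy document; only the number of child PDF objects differs.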
Metadata shared between vs unique to individual PDFs
Usually the basic metadata for the Policy document – its title, publication date, snippet and so on – comes from the landing page itself. You can learn more about the ways we look for this data in the guidelines for publishers.
In some cases we derive this from the first PDF (for example when there’s no clear publication date or title in the landing page metadata).
Those basic metadata fields include:
- title
- translated_title
- policy_document_url
- policy_document_series
- authors
- snippet
- published_on
- policy_source information
Each PDF “inherits” the metadata from the parent Policy document.
Each PDF also has some unique metadata, and as it moves through Overton’s data processing pipeline more is added:
- pdf_url
- pdf_title (where available)
- pdf_thumbnail (where available)
- Language
- Topics, entities and subject areas
- Outgoing references to scholarly or other policy documents
- SDG classifications
- Lists of people / institution pairs mentioned in the text
In turn, in the app each Policy document is considered to contain the union of all of its child PDFs’ subject areas, classifications, outgoing references etc.
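The "union" behaviour described above could be reproduced along these lines when working with PDF-level records. This is a minimal sketch, not Overton's implementation: the `subject_areas` key and the sample values are hypothetical, and each PDF record is assumed to be a plain dictionary holding a list of values per field.

```python
def policy_document_values(child_pdfs, field_name):
    """Return the union of one list-valued field across all child PDFs."""
    values = set()
    for pdf in child_pdfs:
        # A PDF that lacks the field simply contributes nothing.
        values.update(pdf.get(field_name, []))
    return values


# Hypothetical child PDFs of a single Policy document.
child_pdfs = [
    {"pdf_title": "Executive summary", "subject_areas": ["health", "economics"]},
    {"pdf_title": "Appendix", "subject_areas": ["economics", "statistics"]},
    {"pdf_title": "Cover note"},  # no subject areas recorded
]

combined = policy_document_values(child_pdfs, "subject_areas")
```

Using a set means a value that appears in several child PDFs is only counted once at the Policy document level.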
Overton’s processing pipeline works on PDFs, not Policy documents, and the snapshots reflect this: each item in the snapshot is a PDF.
However, in the overton.io web application & API we group PDFs by their inherited policy_document_id to make things friendlier for end users, so each item is a Policy document. For example, if the same query matched both the executive summary and the appendix of the example report at the very top of this help page, the application & API would return only one result (the common parent Policy document).
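If you work with snapshot data and want results that look more like the application's, a similar grouping can be done yourself. The sketch below assumes each snapshot item has already been loaded into a dictionary carrying `pdf_document_id` and `policy_document_id`; the record values and the loading step are hypothetical and depend on how your snapshot is delivered.

```python
from collections import defaultdict


def group_by_policy_document(pdf_records):
    """Group PDF-level records under their parent policy_document_id."""
    grouped = defaultdict(list)
    for record in pdf_records:
        grouped[record["policy_document_id"]].append(record)
    return dict(grouped)


# Hypothetical PDF-level matches for one query against snapshot data.
matches = [
    {"pdf_document_id": "example_source-aaa-111", "policy_document_id": "example_source-aaa"},
    {"pdf_document_id": "example_source-aaa-222", "policy_document_id": "example_source-aaa"},
    {"pdf_document_id": "example_source-bbb-333", "policy_document_id": "example_source-bbb"},
]

by_policy_document = group_by_policy_document(matches)
# Three matching PDFs collapse into two Policy document level results.
result_count = len(by_policy_document)
```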
Relationship and identifiers
Every Policy document object has at least one PDF child object (if there are no documents linked from the landing page, Overton will either not create a Policy document or scrape the HTML of the landing page itself, as appropriate).
Policy documents are identified by their policy_document_id field in the API and snapshot data.
PDFs are identified by their pdf_document_id field.
In general the pdf_document_id contains the parent policy_document_id, e.g. the PDFs:
- committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa
- committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991
are both child PDFs of Policy document:
- committee_house-2048b061d68144d56dd7bbb006b50f8b
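The identifier pattern in the example above can be checked in code. The helper names below are hypothetical, and the assumption that the parent ID is everything before the final hyphen is inferred only from the example IDs on this page. Because the pattern holds "in general" rather than always, prefer the policy_document_id field on the record itself where it is available and treat the guess as a fallback.

```python
def is_child_pdf(pdf_document_id, policy_document_id):
    """True if the PDF ID starts with the Policy document ID plus a hyphen."""
    return pdf_document_id.startswith(policy_document_id + "-")


def guess_parent_id(pdf_document_id):
    """Guess the parent ID by dropping the final hyphen-separated segment.

    Returns None when the ID contains no hyphen. This is an assumption
    based on the example IDs, not a documented guarantee.
    """
    parent, _, _ = pdf_document_id.rpartition("-")
    return parent or None


policy_id = "committee_house-2048b061d68144d56dd7bbb006b50f8b"
pdf_ids = [
    "committee_house-2048b061d68144d56dd7bbb006b50f8b-1acf210163a36c9d273b9c703003f8fa",
    "committee_house-2048b061d68144d56dd7bbb006b50f8b-b0f68ff2a09f76d54daf94cce0dff991",
]

all_children = all(is_child_pdf(pdf_id, policy_id) for pdf_id in pdf_ids)
guessed_parents = {guess_parent_id(pdf_id) for pdf_id in pdf_ids}
```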