How “PDF” and “Policy document” objects in the snapshots and APIs are related and what metadata is shared between them
In the Overton web interface every Policy document object is made up of one or more PDF objects.
How Overton parses publication landing pages
Imagine a publication landing page representing a report on a government website.
Even though it’s all related to the one report, that landing page may link out to a number of different documents:
- An executive summary
- The actual report
- An appendix containing tables and figures
Upon scraping this website Overton would create a Policy document (the entire report) and then three PDF objects, one for each of the three different child documents.
Alternatively, one landing page may link out to the “same” document in different languages:
Here Overton would still create a single Policy document and two PDF objects, one for each language.
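The parent/child relationship described above can be sketched as a small data model. This is an illustrative sketch only, not Overton's actual schema: the class names, the `title` and `pdfs` fields, and the identifier values are assumptions made for the example. Only the `policy_document_id` and `pdf_document_id` field names come from this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PDF:
    pdf_document_id: str  # identifier value below is made up for illustration
    title: str            # hypothetical field


@dataclass
class PolicyDocument:
    policy_document_id: str  # identifier value below is made up for illustration
    title: str               # hypothetical field
    pdfs: List[PDF] = field(default_factory=list)  # hypothetical field holding the children


# One landing page becomes one Policy document with three child PDFs.
report = PolicyDocument(
    policy_document_id="example-report",
    title="Example report",
    pdfs=[
        PDF("example-report-1", "Executive summary"),
        PDF("example-report-2", "The actual report"),
        PDF("example-report-3", "Appendix: tables and figures"),
    ],
)

print(len(report.pdfs))
```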
Despite their name
Metadata shared between vs unique to individual PDFs
Usually the basic metadata for the Policy document (its title, publication date, snippet and so on) comes from the landing page itself. You can learn more about the ways we look for this data in the guidelines for publishers.
In some cases we derive this from the first PDF (for example when there’s no clear publication date or title in the landing page metadata).
Those basic metadata fields include the title, publication date and snippet. Each PDF “inherits” this metadata from the parent Policy document.
Other metadata is unique to each individual PDF:
- Topics, entities and subject areas
- Outgoing references to scholarly or other policy documents
- SDG classifications
- Lists of people / institution pairs mentioned in the text
In turn, in the app each Policy document is considered to contain the union of all of its child PDFs.
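One way to picture this union is to merge a per-PDF field across all children of a Policy document. The sketch below assumes each PDF is a dictionary with a `topics` list; that field name, the identifier values and the topic labels are hypothetical and used only for illustration.

```python
def policy_document_topics(child_pdfs):
    """Return the union of the topics of every child PDF, sorted for stable output."""
    topics = set()
    for pdf in child_pdfs:
        # "topics" is a hypothetical field name; a PDF without it contributes nothing.
        topics.update(pdf.get("topics", []))
    return sorted(topics)


child_pdfs = [
    {"pdf_document_id": "example-report-1", "topics": ["Energy policy", "Climate"]},
    {"pdf_document_id": "example-report-2", "topics": ["Climate", "Transport"]},
]

print(policy_document_topics(child_pdfs))
# ['Climate', 'Energy policy', 'Transport']
```

A topic that appears in several child PDFs is counted once at the Policy document level.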
Overton’s processing pipeline works on PDFs rather than Policy documents, and the snapshots reflect this: each item in the snapshot is a PDF.
However, in the overton.io web application & API we group PDFs by policy_document_id to make things friendlier for end users, and so each item is a Policy document: if the same query appeared in both the executive summary and the appendix of the example report at the very top of this help page, the application & API would return only one result (the common parent Policy document).
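If you work with PDF-level items and want results that resemble the grouped view, a minimal sketch might look like the following. It assumes each item is a dictionary carrying `pdf_document_id`, `policy_document_id` and a hypothetical `text` field; the matching logic is a plain substring test chosen for illustration, not a description of how Overton's search works.

```python
from collections import defaultdict


def group_by_policy_document(pdf_items):
    """Group PDF-level items under their parent policy_document_id."""
    grouped = defaultdict(list)
    for item in pdf_items:
        grouped[item["policy_document_id"]].append(item)
    return dict(grouped)


def matching_policy_documents(pdf_items, query):
    """Return each parent policy_document_id once, however many child PDFs match."""
    # "text" is a hypothetical field; a case-insensitive substring test stands in for search.
    matches = [item for item in pdf_items if query.lower() in item["text"].lower()]
    return list(group_by_policy_document(matches))


# Identifier values are made up for illustration.
pdf_items = [
    {"pdf_document_id": "example-report-1", "policy_document_id": "example-report",
     "text": "Executive summary: a carbon tax is proposed."},
    {"pdf_document_id": "example-report-2", "policy_document_id": "example-report",
     "text": "The full report."},
    {"pdf_document_id": "example-report-3", "policy_document_id": "example-report",
     "text": "Appendix: carbon tax tables and figures."},
]

print(matching_policy_documents(pdf_items, "carbon tax"))
# ['example-report']
```

Two child PDFs match the query here, but only the common parent is returned.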
Relationship and identifiers
Every Policy document object has at least one PDF child object (if there are no documents linked from the landing page, Overton will either not create a Policy document or scrape the HTML of the landing page itself, as appropriate).
Policy documents are identified by their policy_document_id field in the API and snapshot data.
PDFs are identified by their pdf_document_id field.
In general the pdf_document_id contains the parent policy_document_id e.g. the PDFs:
are both child PDFs of Policy document:
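The containment convention described in this section can be checked with a simple substring test. The identifiers in this sketch are invented purely for illustration and are not real Overton IDs, and because the convention only holds "in general", the check should be treated as a heuristic. Where an item carries an explicit policy_document_id field, that field is the safer thing to rely on.

```python
def looks_like_child_of(pdf_document_id: str, policy_document_id: str) -> bool:
    """Heuristic: True if the PDF identifier contains the parent identifier."""
    return policy_document_id in pdf_document_id


# Hypothetical identifiers, not taken from Overton data.
parent_id = "exampleorg-abc123"
print(looks_like_child_of("exampleorg-abc123-pdf1", parent_id))
print(looks_like_child_of("otherorg-zzz999-pdf1", parent_id))
```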