How we disambiguate policy documents

How Overton tries to avoid collecting the same document multiple times

Policy documents usually lack identifiers like ISBNs or DOIs that can be used to uniquely identify them, no matter where they are hosted.

This can pose a problem when government websites change and documents are moved to different web addresses, or when the same document appears in multiple places on one site.

To avoid collecting the same document multiple times, we regularly run a disambiguation process. For each individual policy source, we go through every document and check:

  • Its title
  • Its URL, and variations on that URL (https instead of http, with or without a “www.” at the front, with or without a trailing slash, etc.), for both the policy document landing page and any PDFs associated with it

… and compare them to all the other documents from the same source. If one or both fields match, the documents are potential duplicates, so we look deeper at:

  • Their publication dates
  • The “content hash” of the PDF files associated with the match
  • The “content hash” of the front cover (the first page of the first PDF associated with the match)

… to decide if it’s a real duplicate – in which case it is removed – or not.

Real examples of matches that turned out not to be duplicates include sources with multiple documents called “Annual Report” but with different publication years, and sets of documents all simply called “Memorandum”, some published on the same day, that have different contents.
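The two-stage check described above can be sketched roughly as follows. This is an illustration only, not Overton’s actual code: the Doc record, the function names, the use of SHA-256 and the rule that both the date and the PDF hash must agree are all assumptions made for the example. The front cover hash is left out because it would need a PDF library to extract the first page.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Doc:
    # Hypothetical record; the field names are illustrative only.
    title: str
    url: str
    published: str    # publication date, e.g. "2023-05-01"
    pdf_bytes: bytes  # raw contents of the associated PDF


def url_key(url: str) -> str:
    """Collapse http/https, a leading "www." and a trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def content_hash(data: bytes) -> str:
    # SHA-256 is an assumption; any stable digest would do for the sketch.
    return hashlib.sha256(data).hexdigest()


def is_candidate(a: Doc, b: Doc) -> bool:
    """Stage 1: cheap match on title or normalised URL."""
    same_title = a.title.strip().lower() == b.title.strip().lower()
    return same_title or url_key(a.url) == url_key(b.url)


def is_duplicate(a: Doc, b: Doc) -> bool:
    """Stage 2: confirm a candidate using date and content hash."""
    if not is_candidate(a, b):
        return False
    return (a.published == b.published
            and content_hash(a.pdf_bytes) == content_hash(b.pdf_bytes))


moved = Doc("Annual Report", "http://www.example.org/reports/annual/",
            "2023-05-01", b"report 2023")
same = Doc("Annual Report", "https://example.org/reports/annual",
           "2023-05-01", b"report 2023")
older = Doc("Annual Report", "https://example.org/reports/annual-2022",
            "2022-05-01", b"report 2022")

print(is_duplicate(moved, same))   # True: same URL variant, date and content
print(is_duplicate(moved, older))  # False: same title, different year and content
```

The third document shares a title with the first, so it is flagged as a candidate, but its date and contents differ, so it is kept.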

Same document, different language

Some policy sources publish in multiple languages – for example, UN agencies may produce a report in English, French, Spanish and Arabic.

Overton doesn’t automatically detect that two documents are the same, just translated. Instead it relies on cues from the original policy source.

If documents in different languages have the same landing page (all of the different language options are listed on a web page representing that document) then Overton will merge them and treat them as a single document.

If the different language versions have separate landing pages, or are separate entries in the source’s publication catalog, then Overton will treat them as separate publications.
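As a rough illustration of the landing page rule, the sketch below groups files by the page they were found on, so that every file sharing a landing page ends up in one merged document. The function name, the tuple layout and the example.org URLs are hypothetical; this is not how Overton’s pipeline is actually written.

```python
from collections import defaultdict


def group_by_landing_page(files):
    """Group (landing_page, language, file_url) tuples by landing page.

    Each landing page becomes one document holding its language versions.
    """
    documents = defaultdict(dict)
    for landing_page, language, file_url in files:
        documents[landing_page][language] = file_url
    return dict(documents)


files = [
    # Two language versions listed on one shared landing page.
    ("https://example.org/report-a", "en", "https://example.org/report-a/en.pdf"),
    ("https://example.org/report-a", "fr", "https://example.org/report-a/fr.pdf"),
    # Two language versions with a landing page each.
    ("https://example.org/report-b-en", "en", "https://example.org/report-b-en.pdf"),
    ("https://example.org/report-b-es", "es", "https://example.org/report-b-es.pdf"),
]

documents = group_by_landing_page(files)
print(len(documents))  # 3: report A is merged, report B stays as two entries
```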

This is both helpful and unhelpful depending on your use case! Some users are keen to see which language versions are collecting citations, while others would prefer to merge citations into one object.

We’re actively working on possible solutions to this issue.

Duplicates across policy sources

We don’t currently disambiguate across policy sources: the same document can appear twice if two different organizations host it in different places.

This is by design, as we often see, for example, think tanks commissioned to write reports for government departments (both groups then host a copy of the report), or documents authored by IGOs on government websites in developing countries. Leaving them in place makes browsing and reporting on a source-by-source basis easier.

By default any citations we see for duplicate documents are associated with the version hosted by whichever organization is mentioned or linked to in the citing reference.
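A minimal sketch of that default, assuming a simple text match: pick the hosted copy whose organization name or web domain appears in the citing reference. The function, the dictionary keys and the example organizations are hypothetical, and what happens when no host is mentioned is deliberately left open here (the function just returns None).

```python
from typing import Optional
from urllib.parse import urlsplit


def pick_cited_copy(reference_text: str, copies: list) -> Optional[dict]:
    """Return the copy whose organization or domain appears in the reference."""
    text = reference_text.lower()
    for copy in copies:
        host = urlsplit(copy["url"]).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        # Guard against an empty host, which would match any text.
        if copy["organization"].lower() in text or (host and host in text):
            return copy
    return None  # no host mentioned or linked; fallback not modelled here


copies = [
    {"organization": "Example Think Tank",
     "url": "https://thinktank.example.org/housing"},
    {"organization": "Example Ministry",
     "url": "https://ministry.example.gov/housing"},
]

by_name = pick_cited_copy("Example Ministry (2021). Housing report.", copies)
by_link = pick_cited_copy(
    "Housing report (2021). https://thinktank.example.org/housing", copies)
unmatched = pick_cited_copy("Housing report (2021).", copies)

print(by_name["organization"])  # Example Ministry
print(by_link["organization"])  # Example Think Tank
print(unmatched)                # None
```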

This does raise issues with some sources like the European Union Publications Office, PubMed Central and APO, which are primarily aggregators of content and rarely appear in the reference text: the citation numbers for documents on these sources are lower than they should be. We’re aware of this issue and plan to address it in a future update.

Updated on February 7, 2024
