How Overton tries to avoid collecting the same document multiple times
Policy documents usually lack identifiers like ISBNs or DOIs that can be used to uniquely identify them, no matter where they are hosted.
This can pose a problem when government websites change and documents are moved to different web addresses, or when the same document appears in multiple places on one site.
To avoid collecting the same document multiple times, we regularly run a disambiguation process: for each individual policy source, we go through each document and check:
- Their title
- Their URL, and variations on that URL (https instead of http, with or without a “www.” at the front, with or without a trailing slash, etc.), for both the policy document landing page and any PDFs associated with it
… and compare them to all the other documents from the same source. If we find matches on one or both fields, these are potential duplicates, so we look more closely at:
- Their publication date
- The “content hash” of the PDF files associated with the match
- The “content hash” of the front cover (the first page of the first PDF associated with the match)
… to decide if it’s a real duplicate – in which case it is removed – or not.
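The first pass described above (matching on title and on URL variations) could be sketched roughly as follows. This is a minimal illustration, not Overton's actual code: the function names, the document fields (`title`, `landing_url`, `pdf_urls`) and the choice to compare titles case-insensitively are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Collapse common URL variations into one comparable key.

    Assumed rules: ignore the scheme (http vs https), a leading "www.",
    any port number, and a trailing slash.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def candidate_keys(doc: dict) -> set:
    """Keys a document can match on: its title, landing page URL and PDF URLs."""
    keys = {("title", doc["title"].strip().casefold())}
    for url in [doc["landing_url"], *doc.get("pdf_urls", [])]:
        keys.add(("url", normalize_url(url)))
    return keys


def find_candidates(docs: list) -> list:
    """Return index pairs of documents from ONE source that share any key."""
    seen = {}      # key -> indexes of earlier documents carrying that key
    pairs = set()
    for i, doc in enumerate(docs):
        for key in candidate_keys(doc):
            for j in seen.get(key, []):
                pairs.add((j, i))
            seen.setdefault(key, []).append(i)
    return sorted(pairs)


# Hypothetical documents from a single policy source.
docs = [
    {"title": "Annual Report 2020",
     "landing_url": "http://www.example.gov/reports/2020/",
     "pdf_urls": ["http://www.example.gov/files/ar2020.pdf"]},
    {"title": "Annual report 2020 (archived)",
     "landing_url": "https://example.gov/reports/2020",
     "pdf_urls": []},
    {"title": "Memorandum",
     "landing_url": "https://example.gov/memos/17",
     "pdf_urls": []},
]
candidates = find_candidates(docs)
```

Here the first two documents are flagged as a potential duplicate pair because their landing page URLs differ only by scheme, “www.” and trailing slash. Being flagged is not the same as being removed: the pair still has to pass the deeper checks.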
Real examples of matches that turned out not to be duplicates include sources with multiple documents called “Annual Report” but with different publication years, and sets of documents all simply called “Memorandum”, some published on the same day, that have different contents.
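To illustrate how the deeper checks separate real duplicates from lookalikes such as these, here is a hedged sketch. How Overton actually weighs the three signals is not documented here. The rule below (same publication date, plus a matching PDF hash or front cover hash) and the use of SHA-256 are assumptions for the example, and the first page of each PDF is assumed to have been extracted to bytes already.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    title: str
    published: date
    pdf_bytes: bytes    # raw contents of the associated PDF
    cover_bytes: bytes  # first page of the first PDF, extracted elsewhere (assumption)


def content_hash(data: bytes) -> str:
    """Fingerprint of some content; SHA-256 is an assumed choice."""
    return hashlib.sha256(data).hexdigest()


def is_real_duplicate(a: Candidate, b: Candidate) -> bool:
    """Assumed decision rule, for illustration only."""
    if a.published != b.published:
        return False  # e.g. "Annual Report" from two different years
    same_pdf = content_hash(a.pdf_bytes) == content_hash(b.pdf_bytes)
    same_cover = content_hash(a.cover_bytes) == content_hash(b.cover_bytes)
    return same_pdf or same_cover


# Hypothetical candidates mirroring the examples in the text.
report_2019 = Candidate("Annual Report", date(2019, 6, 1), b"report 2019", b"cover 2019")
report_2020 = Candidate("Annual Report", date(2020, 6, 1), b"report 2020", b"cover 2020")
memo_a = Candidate("Memorandum", date(2021, 3, 5), b"memo on fisheries", b"page one A")
memo_b = Candidate("Memorandum", date(2021, 3, 5), b"memo on housing", b"page one B")
memo_a_moved = Candidate("Memorandum", date(2021, 3, 5), b"memo on fisheries", b"page one A")
```

Under this rule the two annual reports fail on publication date, the two memoranda share a date but fail on content, and only `memo_a` and `memo_a_moved` count as a real duplicate.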
Duplicates across policy sources
We don’t currently disambiguate across policy sources: the same document can appear twice as long as two different organizations host it in different places.
This is by design, as we often see, for example, think tanks commissioned to write reports for government departments (both groups then host a copy of the report), or documents authored by IGOs on government websites in developing countries. Leaving them in place makes browsing and reporting on a source-by-source basis easier.
By default any citations we see for duplicate documents are associated with the version hosted by whichever organization is mentioned or linked to in the citing reference.
This does raise issues with some sources like the European Union Publications Office, PubMed Central and APO, which are primarily aggregators of content and rarely appear in the reference text: the citation numbers for documents on these sources are lower than they should be. We’re aware of this issue and plan to address it in a future update.
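The attribution behaviour, and why aggregators end up under-counted, can be pictured with a small sketch. This is only an illustration: the matching logic, the field names (`source_id`, `org_name`, `domain`), the example organizations and the choice to return `None` when no host is mentioned are all assumptions, not a description of Overton's implementation.

```python
from typing import Optional


def attribute_citation(reference_text: str, copies: list) -> Optional[str]:
    """Pick the hosted copy whose organization name or domain appears in the reference.

    Returns the matching copy's source_id, or None when no host is mentioned
    (what happens in that case in practice is not covered here).
    """
    text = reference_text.casefold()
    for copy in copies:
        markers = (copy["org_name"], copy["domain"])
        if any(marker.casefold() in text for marker in markers):
            return copy["source_id"]
    return None


# Three hypothetical hosts of the same report.
copies = [
    {"source_id": "thinktank", "org_name": "Example Institute",
     "domain": "example-institute.org"},
    {"source_id": "department", "org_name": "Department of Examples",
     "domain": "examples.gov"},
    {"source_id": "aggregator", "org_name": "Example Aggregator",
     "domain": "aggregator.example"},
]

by_name = attribute_citation(
    "Example Institute (2020). A report on things.", copies)
by_link = attribute_citation(
    "A report on things (2020). https://examples.gov/reports/things", copies)
unmatched = attribute_citation("A report on things (2020).", copies)
```

In this sketch the aggregator copy is credited only if a reference names it or links to it, which rarely happens, so its citation count stays low even though the document itself is widely cited.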