A description of how Overton finds scholarly references in full text
Policy documents don’t always – or even often – have a clearly laid out references section or bibliography, and typically don’t stick to a single referencing style like a more academic work would.
This means that Overton has to be flexible in where citations might be found, and how they might be formatted.
We do this by breaking up the full text into chunks, typically paragraphs. Then, for each one (there's a simplified sketch of these steps after the list below):
- We build a set of “features” – characterizations of the text. Does it contain any italics? Does it look like it contains author names? Can we spot any journal names in it, or phrases common in reference strings (et al., op. cit. …)?
- The features are scored individually and then those scores are summed
- If the total score is higher than a specific threshold we try to identify and extract the different parts of the reference string – the source, the title, the year and so on
- We use these parts to search Crossref for scholarly works, or the Overton database itself for policy documents. We score the returned search results from either system by their similarity to the original paragraph, and if the similarity score is above a given threshold we consider it a match
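To make those steps concrete, here's a minimal sketch in Python of what such a chunk-scoring pipeline might look like. The feature list, weights, thresholds and the `search_crossref` and `similarity` helpers are all illustrative placeholders, not Overton's actual code or values.

```python
# Illustrative sketch only: the features, weights and thresholds below are
# invented for demonstration and are not Overton's actual values.
import re

FEATURE_WEIGHTS = {
    "has_year": 1.0,               # a plausible publication year, e.g. 2019
    "has_reference_phrase": 1.5,   # "et al.", "op. cit." and similar
    "has_author_like_names": 1.0,  # "Smith, A." style name patterns
    "has_journal_cue": 1.5,        # "Journal of ...", "Proceedings of ..."
}

PARAGRAPH_THRESHOLD = 2.0   # score needed before we try to parse the chunk
SIMILARITY_THRESHOLD = 0.8  # similarity needed to accept a search result


def score_paragraph(text: str) -> float:
    """Sum the weights of the reference-like features present in a chunk."""
    present = {
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", text)),
        "has_reference_phrase": bool(re.search(r"et al\.|op\. cit\.", text)),
        "has_author_like_names": bool(re.search(r"\b[A-Z][a-z]+, [A-Z]\.", text)),
        "has_journal_cue": bool(re.search(r"Journal of|Proceedings of", text)),
    }
    return sum(FEATURE_WEIGHTS[name] for name, found in present.items() if found)


def find_references(paragraphs, search_crossref, similarity):
    """Search each likely reference chunk and keep sufficiently similar hits.

    `search_crossref` and `similarity` stand in for the real search and
    similarity-scoring steps, which aren't shown here.
    """
    matches = []
    for text in paragraphs:
        if score_paragraph(text) < PARAGRAPH_THRESHOLD:
            continue  # probably not a reference string; skip it
        for candidate in search_crossref(text):
            if similarity(text, candidate) >= SIMILARITY_THRESHOLD:
                matches.append((text, candidate))
                break  # accept the first sufficiently similar hit
    return matches
```

In the real pipeline the extracted parts of the reference (source, title, year and so on) would drive the search query; the sketch passes the raw paragraph through for brevity.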
By adjusting the different score thresholds we're able to control the precision (how often our matches are correct, in aggregate) and recall (how many of the references actually present we manage to find) of the system.
There’s a delicate balance between these two characteristics. Typically we can either be extremely accurate but miss more references, or match all references at the expense of getting matches wrong more often. Overton errs on the side of accuracy, so we’re more likely to miss a reference than to get it wrong.
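To illustrate the tradeoff with a toy example: given a set of candidate matches with similarity scores and known correct/incorrect labels, raising the acceptance threshold generally raises precision and lowers recall. The numbers below are invented purely for illustration.

```python
# Toy illustration of the precision/recall tradeoff; the scores and labels
# are invented and don't come from Overton's data.
candidates = [
    # (similarity score, was this match actually correct?)
    (0.95, True), (0.91, True), (0.88, True), (0.86, True),
    (0.81, False), (0.74, True), (0.70, False), (0.63, False),
]
total_true = sum(1 for _, correct in candidates if correct)

for threshold in (0.6, 0.8, 0.9):
    accepted = [correct for score, correct in candidates if score >= threshold]
    precision = sum(accepted) / len(accepted)  # how many accepted matches were right
    recall = sum(accepted) / total_true        # how many true references we kept
    print(f"threshold {threshold:.1f}: precision {precision:.2f}, recall {recall:.2f}")

# Raising the threshold from 0.6 to 0.9 moves precision from ~0.63 towards
# 1.00, while recall falls from 1.00 to 0.40.
```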
We target an accuracy of at least 98% and a recall of at least 80% for scholarly documents across the entire database. In practice these numbers vary based on each source's citation style and norms, and the observed recall is much higher (at least 95%) for most English language policy sources citing journal articles.
In general Overton performs better than systems like Altmetric when references are less formal: for example when references don't list volume and issue numbers, the title or authors are misspelled, or the citation is part of a sentence (“see Alice Smith in the Journal of X”).
Some types of references pose issues:
- Scholarly papers not indexed by Crossref – these cannot currently be matched by Overton, unless they are indexed separately as policy documents (e.g. because they were authored by a think tank or IGO)
- Scholarly papers in languages other than English – these make up a relatively small minority of scholarly papers indexed by Crossref, which makes similarity scoring much harder for various reasons. To remain accurate we use higher thresholds for non-English reference paragraphs, but this means that we miss more papers.
- Papers belonging to a series – occasionally a series of policy documents will be published with the same title: an example might be “Quarterly Report”. Overton uses authors and publication years in these cases to differentiate possible matches (see the sketch after this list), but sometimes citations will accrue to the wrong version of the document.
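As a rough illustration of that kind of tie-breaking, the sketch below prefers the candidate whose publication year and authors best match the citation. The data structures and scoring are hypothetical, not Overton's actual logic.

```python
# Hypothetical sketch of breaking ties between same-titled series documents
# using authors and publication year; not Overton's actual implementation.
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    year: int
    authors: frozenset


def pick_series_match(cited_year, cited_authors, candidates):
    """Prefer the candidate whose year and author overlap best fit the citation."""
    def tie_break_score(candidate):
        year_score = 1.0 if candidate.year == cited_year else 0.0
        author_overlap = len(candidate.authors & cited_authors)
        return (year_score, author_overlap)

    best = max(candidates, key=tie_break_score)
    # If neither the year nor any author matches, we can't safely distinguish
    # the series members, and the citation may accrue to the wrong document.
    if tie_break_score(best) == (0.0, 0):
        return None
    return best


# Example: two "Quarterly Report" documents from the same publisher.
reports = [
    Candidate("Quarterly Report", 2021, frozenset({"A. Smith"})),
    Candidate("Quarterly Report", 2022, frozenset({"B. Jones"})),
]
print(pick_series_match(2022, frozenset({"B. Jones"}), reports))
```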