Overton is largely language agnostic
Overton is largely language agnostic (with two main caveats, described below): documents are indexed, analyzed and made available for search regardless of the language or script that they're written in.
To make browsing easier, titles are translated into English in the web application and API.
That said, there are two main caveats.
Quality of reference extraction when citing local policy sources
Overton breaks up documents into paragraphs and analyzes each in turn, using a heuristic-based approach to decide if it contains a valid reference.
There are dozens of different heuristics, and some of them rely on matching keywords, identifying dates or otherwise spotting common referencing conventions.
These heuristics work best on Western-style references, so it's possible that Overton will miss some references in non-Western documents if they are citing local sources (especially Chinese, Japanese and Arabic ones).
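To illustrate why this can happen, here is a minimal sketch of a paragraph-by-paragraph check. It is not Overton's actual implementation: the two cues (`YEAR`, `KEYWORDS`), the function names and the sample strings are all assumptions made up for this example. It only shows how heuristics tuned to Western conventions can pass over a reference that uses a local date format and no Latin-script keywords.

```python
import re

# Hypothetical cues for illustration only; not Overton's real heuristics.
YEAR = re.compile(r"\b(?:19|20)[0-9]{2}\b")  # Gregorian year in ASCII digits
KEYWORDS = re.compile(r"\b(?:et al\.|pp\.|vol\.|doi:|journal\b)", re.IGNORECASE)


def split_paragraphs(text):
    """Split a document into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def looks_like_reference(paragraph):
    """Flag a paragraph only if it shows both a year and a citation keyword."""
    return bool(YEAR.search(paragraph)) and bool(KEYWORDS.search(paragraph))


# Invented sample paragraphs (placeholder titles, not real publications).
western = ("Smith, J. et al. (2019). Example report title. "
           "Journal of Examples, vol. 12, pp. 3-19.")
narrative = "This chapter summarises the main findings."
# A reference dated with a Japanese era year and no Latin-script keywords.
japanese = "例示省(令和元年)『例示報告書』"

document = "\n\n".join([western, narrative, japanese])
flags = [looks_like_reference(p) for p in split_paragraphs(document)]
```

In this sketch the Western-style paragraph is flagged, while the Japanese one is treated the same as ordinary narrative text because neither cue fires.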
Topics and classification
Overton uses machine learning techniques to identify the key topics in each document and to assign documents to broad categories ("Health", "Education", "Crime, Law and Justice", etc.).
The algorithms we use support these languages: