Overton uses machine learning techniques to extract topics and entities from the full text of each policy document we index, and then tries to map them to a taxonomy to make browsing and analyzing them easier.
While Overton is generally language agnostic, these techniques are not, so only documents in a subset of languages (including English, French, Spanish, Chinese, Japanese and Russian) will have topics, entities and subject areas extracted.
We use the English names for topics, subject areas and entities regardless of the source document language.
Topics are the main themes of a document. We analyze the phrases and entities used in the document and compare them with data derived from Wikipedia pages to find which pages have the most in common with it.
The titles of Wikipedia pages whose language overlaps heavily with the policy document are chosen by Overton as topics, so the set of possible topics is very broad.
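To make the idea concrete, here is a minimal sketch of that kind of overlap matching. It is not Overton's actual implementation: the page texts are invented, and Jaccard similarity over bags of words stands in for whatever measure the real pipeline uses.

```python
# Toy sketch: score candidate Wikipedia pages by vocabulary overlap with a
# policy document, and keep the highest-scoring page titles as "topics".
# The pages dict and the Jaccard measure are illustrative assumptions.

def terms(text):
    """Rough bag-of-words: lowercase tokens with trailing punctuation stripped."""
    return {w.strip(".,") for w in text.lower().split() if w}

def top_topics(doc_text, wikipedia_pages, n=2):
    """Return the n page titles whose text overlaps most with the document."""
    doc = terms(doc_text)
    scored = []
    for title, page_text in wikipedia_pages.items():
        page = terms(page_text)
        jaccard = len(doc & page) / len(doc | page)
        scored.append((jaccard, title))
    return [title for _, title in sorted(scored, reverse=True)[:n]]

pages = {
    "Carbon tax": "a carbon tax is a levy on the carbon emissions of fuels",
    "Fisheries": "fisheries concern the catching and farming of fish",
}
doc = "The report proposes a levy on carbon emissions from transport fuels."
print(top_topics(doc, pages, n=1))  # → ['Carbon tax']
```

Because any Wikipedia page title can win this comparison, the topic vocabulary is effectively open-ended rather than a fixed list.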
We look for subject areas the same way we do topics, but documents are matched against examples from each category in the IPTC's MediaTopics controlled vocabulary instead of against Wikipedia pages.
MediaTopics are the categories used by many newspapers and magazines to organize their articles. You can see the whole taxonomy in clickable tree format here.
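Matching against a closed taxonomy can be sketched the same way: compare the document's vocabulary with example text for each category and pick the best match. The category names below are real MediaTopics top-level labels, but the example phrases and the scoring are made up for the demo.

```python
# Illustrative subject-area step: compare a document against example text
# for each IPTC MediaTopics category and return the closest category.
# Example phrases and the shared-word count are assumptions, not Overton's code.

def word_set(text):
    return {w.strip(".,") for w in text.lower().split() if w}

def best_subject_area(doc_text, category_examples):
    """Pick the category whose example text shares the most words with the document."""
    doc = word_set(doc_text)
    return max(category_examples,
               key=lambda cat: len(doc & word_set(category_examples[cat])))

examples = {
    "economy, business and finance": "markets trade taxation banking inflation",
    "environment": "pollution emissions climate conservation wildlife",
}
doc = "New rules will cut emissions and pollution near protected wildlife areas."
print(best_subject_area(doc, examples))  # → environment
```

The key difference from topics is that the output here is constrained to the MediaTopics vocabulary, which makes subject areas more consistent for filtering and analysis.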
Entities are the people, companies, countries and other proper nouns mentioned in a policy document.
We perform entity extraction on the full text of policy documents. The system Overton uses relies on a mix of part of speech tagging (identifying which parts of a sentence are verbs, which are nouns, and so on) and pattern matching with a large dictionary of entities pulled from Wikipedia.
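The dictionary-matching half of that approach can be sketched as a longest-match scan over the text. This is a toy version: the entity dictionary below is hand-written (Overton's is derived from Wikipedia), and real systems combine this with POS tags to resolve ambiguous mentions.

```python
# Toy dictionary-based entity matching: scan token runs in the text and keep
# those that appear in a known-entity dictionary. Longest spans are tried
# first so "European Union" is matched before the bare token "European".

def extract_entities(text, entity_dict, max_span=4):
    tokens = [t.strip(".,") for t in text.split()]
    found, i = [], 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, shrinking toward a single token.
        for j in range(min(len(tokens), i + max_span), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in entity_dict:
                found.append(candidate)
                i, matched = j, True
                break
        if not matched:
            i += 1
    return found

known = {"European Union", "World Health Organization", "France"}
text = "France and the European Union consulted the World Health Organization."
print(extract_entities(text, known))
# → ['France', 'European Union', 'World Health Organization']
```

Scanning longest-first is the standard way to keep multi-word names intact; a production system would add normalization (e.g. "EU" → "European Union") and use the POS tags mentioned above to filter false matches.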