About topics, entities, subject areas and COFOG

Information on the taxonomies that Overton uses to map topics, entities and subject areas to policy documents

Overton uses machine learning techniques to extract topics and entities from the full text of each policy document we index, and then tries to map them to a taxonomy to make browsing and analyzing them easier.

While Overton is generally language agnostic these techniques are not, so only documents in a subset of languages (including English, French, Spanish, Chinese, Japanese and Russian) will have topics, entities and subject areas extracted.

We use the English names for topics, subject areas and entities regardless of the source document language.

Topics

Topics are the main themes of a document. We analyze the phrases and entities used in the document and then compare them to data derived from pages on Wikipedia to find which ones have the most in common. 

The titles of Wikipedia pages that have a lot of overlap in language with the policy document are chosen by Overton as topics, so the set of possible topics is very broad.

Subject areas

We look for subject areas the same way we do topics, but documents are matched against examples from each category in the IPTC’s MediaTopics controlled vocabulary instead of against Wikipedia pages.

MediaTopics are the categories used by many newspapers and magazines to organize their articles. You can see the whole taxonomy in clickable tree format here.

Entities

Entities include the people, companies, countries, and other proper nouns mentioned in a policy document.

Overton performs entity extraction on the full text of policy documents. Our system uses a mix of speech tagging—which identifies which words are verbs, nouns, and so on—with pattern matching using a large dictionary of entities sourced from Wikipedia.

Classification of the Functions of Government (COFOG) topic

COFOG is a classification system to describe the broad objectives of government. We classify all policy documents to fall into at least one of the COFOG divisions. 

Our COFOG classifier uses an advanced multi-label approach that enables a single model to predict multiple categories simultaneously. It takes our AI-generated document descriptions as input, then applies ModernBERT—a powerful language model known for its contextual understanding of text—to organise the classification process hierarchically.

In order to ensure the accuracy of our classifier we are unable to apply the following 4 subcategories to any policy documents at this time: 

  • Basic research
  • Transfers of a general character between different levels of government
  • Economic affairs n.e.c.
  • R&D Housing and community amenities
Updated on July 30, 2025

Was this article helpful?

Related Articles