About topics, entities, subject areas and COFOG

Information on the taxonomies that Overton uses to map topics, entities and subject areas to policy documents

Overton uses machine learning techniques to extract topics and entities from the full text of each policy document we index, and then tries to map them to a taxonomy to make browsing and analyzing them easier.

While Overton is generally language agnostic, these techniques are not, so only documents in a subset of languages (including English, French, Spanish, Chinese, Japanese and Russian) will have topics, entities and subject areas extracted.

We use the English names for topics, subject areas and entities regardless of the source document language.

Topics

Topics are the main themes of a document. We analyze the phrases and entities used in the document and then compare them to data derived from pages on Wikipedia to find which ones have the most in common. 

The titles of Wikipedia pages that have a lot of overlap in language with the policy document are chosen by Overton as topics, so the set of possible topics is very broad. In total, Overton currently has over 650,000 topics connected to policy documents.

Topics intend to capture all possibilities/variants within a set of results (which can be very relevant for some types of analyses like topic clustering).

Browse Topics

Browse Topics by using the “Explore the Data” menu at the top of any Overton page and selecting “Browse all topics.”

You can browse by ‘Subject Area’ or only view ‘Overrepresented topics.’

The page also provides a Topic Map. This is a map of the topics that Overton thinks are most often associated with the documents in your results set. Topics with darker backgrounds appear more frequently than we’d expect.

Searching with Topics

You might use the Topics filter as a first step approach when running queries to discover a large set of policy documents on a given subject, and then apply additional filters if this is required for the type of analysis that needs to be run.

Due to the exhaustive and fluid nature of topics, we developed new filters that rely on defined taxonomies and allow us to filters assign policy documents to multiple categories. Users can also view all filter options in the web interface and assign multiple options as needed. These filters are:

  • the COFOG framework filter
  • the SDG filter (which offers SDG target data on policy document records which can then be exported in csv or excel)

For more tips, visit our Search using Topics page.

Subject areas

We look for subject areas the same way we do topics, but documents are matched against examples from each category in the IPTC’s MediaTopics controlled vocabulary instead of against Wikipedia pages.

MediaTopics are the categories used by many newspapers and magazines to organize their articles. You can see the whole taxonomy in clickable tree format here.

Entities

Entities include the people, organisations, countries, and other proper nouns mentioned in a policy document.

Overton performs entity extraction on the full text of policy documents. Our system uses a mix of speech tagging—which identifies which words are verbs, nouns, and so on—with pattern matching using a large dictionary of entities sourced from Wikipedia.

Classification of the Functions of Government (COFOG) topic

COFOG is a classification system to describe the broad objectives of government. We classify the majority of our policy documents to fall into at least one of the COFOG divisions. 

Our COFOG classifier uses an advanced multi-label approach that enables a single model to predict multiple categories simultaneously. It takes our AI-generated document descriptions as input, then applies ModernBERT—a powerful language model known for its contextual understanding of text—to organise the classification process hierarchically.

In order to ensure the accuracy of our classifier we are unable to apply the following 4 subcategories to any policy documents at this time: 

  • Basic research
  • Transfers of a general character between different levels of government
  • Economic affairs n.e.c.
  • R&D Housing and community amenities

Was this article helpful?

Related Articles

Leave a Comment