What are overrepresented topics and how are they found?

Overton automatically extracts topics from policy documents, and you can see these on individual document pages and in the API or Excel output.

But you might notice that when we list topics in the filter boxes on the left-hand side they aren’t ordered by document count: some topics appear lower down the listing even though they are associated with more documents.

Similarly, in the topics view you may have noticed that we refer to “unusually frequent” topics.

Both the filter box and the unusually frequent topics list show what we call “overrepresented topics”. These aren’t the same as the most popular topics in your result set: rather, they are the topics that appear more often in your set than we would expect from a random sample of documents.

We find these by comparing all of the topics in a given set of results (the “foreground” set) with the topics found across the database as a whole (the “background” set).
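As a rough illustration of the two sets, the sketch below counts how often each topic appears in a small background set and in a foreground set filtered from it. The function name `topic_percentages`, the shape of the input and the sample data are all made up for this example; this is not Overton's actual code or data.

```python
from collections import Counter


def topic_percentages(documents):
    """Return {topic: percentage of documents tagged with that topic}.

    `documents` is a list of topic lists, one per document (hypothetical shape).
    """
    if not documents:
        return {}
    counts = Counter()
    for topics in documents:
        counts.update(set(topics))  # count a topic at most once per document
    return {topic: 100.0 * n / len(documents) for topic, n in counts.items()}


# Made-up background set standing in for "the database as a whole".
background = [
    ["health", "economy"],
    ["health"],
    ["economy", "education"],
    ["health", "climate"],
    ["economy"],
    ["health", "climate", "energy"],
    ["education"],
    ["health"],
]

# Foreground: the documents matching a hypothetical search for "climate".
foreground = [topics for topics in background if "climate" in topics]

background_percent = topic_percentages(background)
foreground_percent = topic_percentages(foreground)

# Show each foreground topic next to its background percentage.
for topic in sorted(foreground_percent):
    print(topic, foreground_percent[topic], background_percent[topic])
```

In this toy data “health” is in every foreground document, but it is also common in the background, so it is less distinctive than its raw count suggests.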

Specifically, we use a scoring algorithm called JLH. In essence, JLH gives each topic a score based on this equation:

JLH score = (foregroundPercent - backgroundPercent) * (foregroundPercent / backgroundPercent)

Here foregroundPercent is the percentage of documents in the foreground set that are about the given topic, and backgroundPercent is the same percentage for the background set.
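To see why a topic with fewer documents can rank above a more common one, here is a minimal sketch of the equation applied to two topics. The function name `jlh_score`, the topic names and the percentages are invented for illustration, and the handling of a zero background percentage is an assumption rather than a description of what Overton does.

```python
def jlh_score(foreground_percent, background_percent):
    """Score a topic: (fg - bg) * (fg / bg), as in the equation above."""
    if background_percent <= 0:
        # Assumption: a topic absent from the background cannot be scored.
        raise ValueError("background_percent must be greater than zero")
    difference = foreground_percent - background_percent
    ratio = foreground_percent / background_percent
    return difference * ratio


# Hypothetical (foregroundPercent, backgroundPercent) pairs.
topics = {
    "common topic": (40.0, 30.0),
    "niche topic": (10.0, 1.0),
}

scores = {name: jlh_score(fg, bg) for name, (fg, bg) in topics.items()}

# Highest score first: this is the order an overrepresented list would use.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these numbers the niche topic scores 90 and the common topic about 13.3, so the niche topic is listed first even though it appears in fewer foreground documents.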

Updated on November 23, 2021
