Overton generates descriptions and themes for all documents within our database using a large language model (LLM).
We run these LLMs locally, on servers we own and host. Occasionally we may supplement our local models with cloud services; in those cases we ensure that any text we send to third parties is not retained or used to train new models.
The descriptions are generated through a three-step process:
1. Parsing and cleaning text: The text of the document is parsed and cleaned to remove unnecessary elements such as new lines, extra spaces, and lines containing only numbers or symbols. Documents with fewer than 500 characters (approximately 3-4 paragraphs of text) after cleaning are skipped, as they don’t contain enough information to be accurately summarized.
2. Getting a document description from the LLM: The cleaned text, capped at 15,000 characters, is then passed to a large language model for processing. A structured prompt guides the model to generate a focused summary in English, excluding non-informative sections like disclaimers, formatting notes, and editorial content. References are omitted, and the model is instructed to focus on the document’s theme and description.
Note that we’re telling the LLM to generate a description of the document and what it’s about, rather than to summarize key points & takeaways.
3. Quality checking: The generated description undergoes an automatic quality review to ensure it meets some minimum standards (is it a clearly readable paragraph? Are there any odd LLM-generated artifacts, like repeated phrases?). Descriptions that fail this check are discarded, keeping the output consistent and clear.
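The checks named above (is it a readable paragraph? are there repeated-phrase artifacts?) might be implemented with simple heuristics like the sketch below. The thresholds and the trigram-repetition test are assumptions for illustration; Overton's actual quality review may use different rules:

```python
from collections import Counter


def passes_quality_check(description: str) -> bool:
    """Heuristic check: readable paragraph, no repeated-phrase artifacts."""
    words = description.split()
    # A readable paragraph: a reasonable length, ending in sentence punctuation
    if len(words) < 20 or not description.strip().endswith((".", "!", "?")):
        return False
    # Repeated-phrase artifact: the same three-word sequence occurring
    # more than twice suggests the model got stuck in a loop
    trigrams = [" ".join(words[i : i + 3]).lower() for i in range(len(words) - 2)]
    if trigrams and Counter(trigrams).most_common(1)[0][1] > 2:
        return False
    return True
```

Descriptions failing either heuristic would be discarded rather than repaired, matching the policy described above.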