An important caveat – we’re still cleaning up the data
While the ARI database is live, we’re still actively curating the metadata for some questions. Wrangling data previously kept in Word files and PDFs by lots of different stakeholders is often messy; the ARI data is no exception!
- The topics and fields of research for each question were assigned using an automated classifier, and we are still reviewing each one manually with individual departments. We also haven’t yet curated any of the metadata for questions marked as being “archived” (e.g. older ARIs that have been superseded by newer questions). It has been difficult to establish a workflow that helps users assign the correct set of topics to questions in a consistent way, but we’re getting there.
- Adapting the questions to fit a consistent data model has sometimes been difficult – the phrasing is often important, so we cannot unilaterally edit any of the text. Some departments have reviewed their ARIs and tweaked text to be more informative or to read better, but many haven’t, so some questions are longer than others, some background information is more specific to the question than others and so on.
Accessing the ari.org.uk data
One of the things we’re responsible for is API and bulk data access. We’re happy to chat or to try and help out with any technical issues you might have with the dataset; please feel free to reach out to us at email@example.com.
You can always browse the data at the ari.org.uk website, but it is also available as a download or through a very simple API. All of the data is free and made available under the Open Government Licence.
Please do let us know if you find the database useful – it helps build the case for keeping the ARIs current within the UK government departments and agencies.
The data model
The API and download share the same data model.
Departments list a set of research priorities each year – each research priority is called a “question” in our system. Questions are typically grouped by theme or topic, and often each question group will have a paragraph or two of extra background information.
Each question in the ari.org.uk database looks a bit like this:
Here’s a breakdown of each field:
|A stable identifier for the question.
|The URL for this question’s page on ari.org.uk
|The text of the question itself. Note that it won’t always make sense without additional context from questionGroup and backgroundInformation.
|A boolean – if true then this question has been superseded by a more recent question and the department is no longer actively soliciting responses to it
|The name of the department or agency asking the question
|Departments typically group their questions by topic, and this is the name they’ve given to the group this question is in
|Question groups often have an associated paragraph or two of background information. Note that this can sometimes be very broad.
|The date of publication for this question, in YYYY-MM-DD format.
|Departments may sometimes specify an expiry date for a question, after which point it becomes archived automatically (see isArchived above)
|A free text field containing details of who to contact as a next step if you are interested in contributing answers or data to the question.
|An array of relevant topics from the IPTC’s MediaTopics taxonomy – see iptc.org for more detail.
This taxonomy is primarily used by newspapers and magazines; it covers lots of different areas but not in an in-depth way.
|An array of relevant academic subject areas from the Fields of Research taxonomy – see abs.gov.au for more detail.
An academic subject area is “relevant” if a researcher with that background would be well suited to answering the question.
|An array of tags – freeform strings that either highlight specific keywords from the question or add extra keywords to make them more searchable.
|An array of question IDs – these are questions from other departments that the system thinks are semantically similar to this one.
|An array of projects taken from the UKRI’s Gateway to Research database that the system thinks are relevant to this question.
To be relevant the project description has to be semantically similar to the question, and/or the description suggests that the project lead will have expertise relevant to the question and may be a good candidate to contribute to it.
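As a rough illustration of working with these fields, here’s a minimal Python sketch that skips archived questions and groups the rest by question group. Only the field names isArchived and questionGroup come from the descriptions above; the helper name and the sample records are made up for the example, so treat it as a starting point rather than a reference implementation.

```python
from collections import defaultdict


def active_questions_by_group(questions):
    """Group non-archived question records by their questionGroup value."""
    groups = defaultdict(list)
    for question in questions:
        # isArchived is true when a question has been superseded.
        if question.get("isArchived"):
            continue
        # Fall back to a placeholder label if no group name is present.
        group_name = question.get("questionGroup") or "(no group)"
        groups[group_name].append(question)
    return dict(groups)


# Invented placeholder records, not real ARI data.
sample = [
    {"isArchived": False, "questionGroup": "Group A"},
    {"isArchived": True, "questionGroup": "Group A"},
    {"isArchived": False, "questionGroup": "Group B"},
]

grouped = active_questions_by_group(sample)
for name, items in grouped.items():
    print(name, len(items))
```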
Download the dataset
You can download the current snapshot directly from us here: https://researchersnapshots.eu-central-1.linodeobjects.com/ari_questions_14092023.json
The dataset is one large JSON file. It contains a single collection in which each key is a question ID and each value is the metadata for that question.
It is licensed under the Open Government Licence. Where relevant please attribute it to:
The ARI Database (https://ari.org.uk) – retrieved Sep 14th, 2023
(just change the date to the date you downloaded the data on)
We’ll be copying the dataset to public repositories as soon as we can. In the meantime please don’t re-upload the dataset to figshare or Zenodo yourself! We want the data there too and we’re in the process of getting guidance about how to do this while keeping the government data license intact.
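If you want to load the downloaded snapshot in Python, something like the sketch below should work. It assumes you have already saved the file locally; the function names are our own invention, and because we don’t want to guess at the exact top-level shape, the helper accepts either a JSON object keyed by question ID or a plain JSON list.

```python
import json


def records_from_snapshot(raw):
    """Return a list of question metadata records from parsed snapshot JSON."""
    if isinstance(raw, dict):
        # Keys are question IDs; use raw.items() instead if you need them.
        return list(raw.values())
    if isinstance(raw, list):
        return list(raw)
    raise TypeError(f"Unexpected top-level JSON type: {type(raw).__name__}")


def load_snapshot(path):
    """Read a snapshot file from disk and return its question records."""
    with open(path, encoding="utf-8") as handle:
        return records_from_snapshot(json.load(handle))


# Example usage (the local filename is whatever you saved the download as):
# questions = load_snapshot("ari_questions_14092023.json")
# print(len(questions))
```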
Access via the API
The API simply returns a paginated set of ARIs (“questions”) in JSON format. You can access it here:
Move to the next page using the
The JSON result you’ll get has two sections, “data” and “meta”.
The meta section
This returns the total number of records in the dataset (for example, 1,863 on 14th September 2023), the page number you’re currently viewing, the total number of pages for your request and the URL to use to get to the next page (in meta -> pagination -> links -> next).
If the next page link field isn’t there then you have reached the end of the results set.
The data section
This returns a set of up to 250 questions – the format is detailed above in the “data model” section.
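Putting the two sections together, here’s a minimal Python sketch of walking through every page by following the next link until it disappears. The function names are hypothetical, the starting URL is left as a parameter for you to fill in with the API address, and the sketch assumes that “data” is a JSON list and that the next link sits at meta -> pagination -> links -> next as described above.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_all_questions(first_page_url, fetch_page=fetch_json):
    """Yield every question, following next-page links until none is left."""
    url = first_page_url
    seen = set()  # guards against a link that points back to an earlier page
    while url and url not in seen:
        seen.add(url)
        payload = fetch_page(url)
        # Assumption: "data" holds a list of up to 250 question records.
        yield from payload.get("data", [])
        links = payload.get("meta", {}).get("pagination", {}).get("links", {})
        # A missing "next" field means this was the last page.
        url = links.get("next")
```

Passing the fetcher in as an argument keeps the pagination logic easy to test without touching the network; for real use you would call iter_all_questions with the API address and leave fetch_page at its default.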