Dataset Curation

A detailed description of the dataset curation process

By CRISP Research

The corpus comprises questions and tasks from real-world exams, professional assessments, and domain-specific challenges. Because the data originates from institutional sources and was written by domain experts for public evaluations, it is expected to meet a high standard of quality and accuracy.

Source Data

Data Collection and Processing

The initial data was sourced from files in PDF, HTML, DOC, and other formats, published by the official bodies that announce each competitive public examination.
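As a rough illustration of a first step in handling such mixed-format sources, the sketch below groups a list of downloaded files by format so that each group can later be routed to a suitable text extractor. It is not the pipeline used for this dataset: the function name `group_by_format`, the extension mapping, and the example file names are all hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to a coarse format label.
# The extensions listed here are an assumption, not the authors' actual list.
FORMAT_BY_SUFFIX = {
    ".pdf": "pdf",
    ".html": "html",
    ".htm": "html",
    ".doc": "doc",
    ".docx": "doc",
}


def group_by_format(paths):
    """Group file paths by coarse source format.

    Files with an unrecognised extension are collected under "other".
    """
    groups = defaultdict(list)
    for path in paths:
        suffix = Path(path).suffix.lower()  # case-insensitive match
        groups[FORMAT_BY_SUFFIX.get(suffix, "other")].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative file names only.
    files = ["exam_2012.PDF", "notice.html", "quiz.docx", "answers.txt"]
    for fmt, members in group_by_format(files).items():
        print(fmt, members)
```

Grouping by format first keeps the format-specific extraction logic separate from the later cleaning and deduplication steps, whatever tools are used for those.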

Please consult the full paper for a detailed description of our curation process.

Who are the source data producers?

The dataset includes admission tests for the Carabinieri, Penitentiary Police, Italian Army, State Police, Forestry Corps, Firefighters, Air Force, Navy, Guardia di Finanza, and Italian ministries, as well as tests for teachers and principals at all levels of the Italian school system, nurses of the national health system, and managers in the public administration. The tests date from 2008 to 2024 and are freely available on the website of each institutional body.

Personal and Sensitive Information

The dataset does not contain confidential information. It is also free from content that could be considered offensive, insulting, threatening, or distressing. Since it solely comprises data from standardised tests and does not involve human subjects or personal data, an ethical review process was not required.

Bias, Risks, and Limitations

Potential risks of misuse include using the benchmark results to justify or argue against the need to develop native LLMs specifically tailored for the Italian language. This possibility should be considered to avoid misinterpretations or unintended consequences when leveraging the evaluation outcomes.

Maintenance

ITALIC is designed to be robust and fully operational upon release, with no need for routine maintenance. However, as language and cultural norms evolve, periodic updates will be required to ensure the benchmark remains relevant. A new dataset version will be created and made available in such cases.
