Tasks
How to participate
To participate, sign up at the CLEF website: https://clef-labs-registration.dipintra.it/.
All team members should join the SimpleText mailing list: https://groups.google.com/g/simpletext.
The data will be made available to all registered participants.
Task 1: Text Simplification: Simplify scientific text
The CLEF SimpleText track introduced the Cochrane-auto corpus, derived from biomedical literature abstracts and lay summaries from Cochrane systematic reviews. This corpus represents a significant expansion into the biomedical domain, building on methodologies used in datasets such as Wiki-auto and Newsela-auto.
Cochrane-auto provides authentic parallel data produced by the same authors, enabling true document-level simplification. It incorporates advanced simplification techniques such as sentence merging, reordering, and alignment with discourse structure. This approach contrasts with more standard simplification corpora in that it realigns data at the paragraph, sentence, and document levels.
We have crawled newly published Cochrane systematic reviews over the last year, and this new data will be the main test data for evaluation in 2026, focusing on two subtasks:
Task 1.1 - Sentence-level Scientific Text Simplification
The goal of this task is to simplify individual sentences extracted from the Cochrane-auto dataset.
Task 1.2 - Document-level Scientific Text Simplification
The goal of this task is to simplify whole documents extracted from the Cochrane-auto dataset.
Evaluation
To evaluate results, we will use standard automatic evaluation measures (SARI, BLEU, LENS, BERTScore, etc.) in combination with human assessment of samples of the submissions by translation students and professionals.
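As a rough illustration, the snippet below shows how some of these measures could be computed with the Hugging Face `evaluate` package (SARI, BLEU via sacreBLEU, and BERTScore; LENS is omitted). The example sentences are invented, and this is a sketch rather than the track's official evaluation pipeline.

```python
# Sketch only: assumes the `evaluate`, `sacrebleu`, and `bert-score` packages
# are installed; the sentences below are invented for illustration.
import evaluate

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")
bertscore = evaluate.load("bertscore")

sources = ["Epidural analgesia provides effective pain relief during labour."]
predictions = ["An epidural is an effective way to relieve pain during labour."]
references = [["Epidurals relieve pain during labour well."]]

print(sari.compute(sources=sources, predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references], lang="en"))
```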
Task 2: Controlled Creativity: Identify and Avoid Hallucination
Task 2 focuses on identifying and evaluating creative generation and information distortion in text simplification. We have annotated the CLEF 2025 submissions to the track for potential overgeneration and other generated content that cannot be attributed to tokens in the source data. Perhaps surprisingly, overgeneration is not uncommon: it ranges from a text-completion model that keeps generating after correctly simplifying a source sentence, to LLM comments or notes on how the text was simplified, and from obvious extraction noise, such as left-in prompts, to conversational remarks from the model used.
Task 2.1 - Identify Creative Generation at Document Level
This task aims to detect creative generation at the abstract or document level. The task is to detect which sentences are fully grounded in the source input (a) without and (b) with access to the source sentences, and thereby to label those that introduce significant new content. Task 2.1 is a post-hoc identification or explanation task.
For 2026, we focus on significant new content (typically an entire sentence) introduced in the prediction.
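To make the setting concrete, here is a minimal, purely illustrative baseline for variant (b): it flags prediction sentences that have little lexical overlap with the source document as candidate creative generation. The tokenization and threshold are our own assumptions, not part of the task definition.

```python
# Naive lexical-overlap baseline for flagging ungrounded sentences.
# Threshold and tokenization are illustrative assumptions.
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def flag_ungrounded(source: str, prediction_sentences: list[str],
                    threshold: float = 0.2) -> list[bool]:
    """Return True for prediction sentences with little lexical support
    in the source document (candidate creative generation)."""
    src_tokens = tokens(source)
    flags = []
    for sent in prediction_sentences:
        sent_tokens = tokens(sent)
        overlap = len(sent_tokens & src_tokens) / max(len(sent_tokens), 1)
        flags.append(overlap < threshold)
    return flags
```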
Task 2.2 - Detect and Classify Information Distortion Errors in Simplified Sentences
This task tries to “mimic” traditional manual human evaluation with LLMs. It focuses on detecting information distortion in simplified sentences and classifying the types of errors. As automatic evaluation measures for text generation (such as BLEU and SARI), which rely on text overlap, are notoriously imprecise, we need to conduct human evaluations of fluency, simplicity, and factuality. Usually, these annotations are not reusable. Here, however, we use them as training and test data for systems that automatically detect and annotate such information distortion in real CLEF submissions. The manual annotations of the CLEF 2025 Task 1 submissions can be reused as ground-truth data for information distortion classification in CLEF 2026 Task 2.
For 2026, we focus on identifying the type of overgeneration.
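As a sketch of how such annotations could be used, the following illustrative baseline concatenates each source and its simplification into one text and trains a TF-IDF classifier over error-type labels. The label names and example data are hypothetical, not the official annotation scheme.

```python
# Illustrative supervised baseline; labels and examples are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "src: Epidurals reduce pain. [SEP] simp: Epidurals remove all pain.",
    "src: The trial had 120 patients. [SEP] simp: The trial had 120 patients.",
]
train_labels = ["exaggeration", "no_distortion"]  # hypothetical label set

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["src: Risks were unclear. [SEP] simp: The drug is safe."]))
```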
Task 2.3 - Avoid Creative Generation and Perform Grounded Generation by Design
This task introduces a text alignment challenge, emphasizing grounded generation over creative generation. This task mirrors Task 1 on text simplification and requires submissions in paired runs, both with and without explicit source attribution.
Evaluation
- Task 2.1 is essentially a sentence labeling task, evaluated in the standard way (precision, recall, F1). For the token-level evaluation, we use the standard Jaccard index (see the sketch after this list).
- Task 2.2 is evaluated using standard automatic classification measures.
- Task 2.3 will be evaluated by both standard automatic measures and human evaluation, similar to Task 1 on Text Simplification above. The paired runs enable us to sample differences at the sentence and phrase levels and evaluate them efficiently, using tools like MT Unbabel.
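The sketch below illustrates the Task 2.1 scoring under the assumption of binary sentence labels (1 = not grounded) and gold/predicted token-index sets for the token-level Jaccard; the exact label encoding in the released data may differ.

```python
# Sentence-level P/R/F1 and token-level Jaccard; labels below are invented.
from sklearn.metrics import precision_recall_fscore_support

gold = [0, 1, 0, 1, 0]   # 1 = sentence not grounded in the source
pred = [0, 1, 1, 1, 0]
p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")

def jaccard(gold_tokens: set[int], pred_tokens: set[int]) -> float:
    """Jaccard similarity between gold and predicted token-index sets."""
    if not gold_tokens and not pred_tokens:
        return 1.0
    return len(gold_tokens & pred_tokens) / len(gold_tokens | pred_tokens)

print(jaccard({3, 4, 5}, {4, 5, 6}))
```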
Task 3: Research Area Classification
Task 3 is a new task that asks for the classification of scientific articles by research area.
The task promotes the development of automatic methods for classifying scientific articles using a given taxonomy. This greatly extends the scope of traditional subject classification and can accelerate the adoption of consistent metadata across collections (arXiv, OpenAIRE, MADOC), as well as facilitate consistent reporting and monitoring of research output and impact. In 2026, we focus on the DFG subject classification as the target classification.
Data
For the dataset, we will use scientific publications crawled from arXiv, which provides open access to scholarly articles across disciplines such as physics, computer science, mathematics, and economics. The collected dataset is further annotated with the DFG subject classification.
Evaluation
LLMs might produce “near misses” when, for instance, a research area like “Computer Science & Engineering” is predicted as just “Computer Science”. To handle such cases, we use several measures to determine matches:
- Exact Match (EM): a binary measure comparing the output of the model directly with the ground truth.
- String Distance (SD): a normalized Levenshtein distance between the prediction and reference.
- Embedding Distance (ED): a semantic similarity measure based on the BERT embeddings of the prediction and reference.
The evaluation will focus on EM accuracy as the main overall performance measure.
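For illustration, the snippet below sketches Exact Match and String Distance; Embedding Distance would additionally require a BERT-based sentence encoder (for instance, a sentence-transformers model) and is omitted here. The normalization choices are our own assumptions.

```python
# Illustrative matching criteria for Task 3; normalization is an assumption.
def exact_match(pred: str, ref: str) -> int:
    return int(pred.strip().lower() == ref.strip().lower())

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def string_distance(pred: str, ref: str) -> float:
    # Levenshtein distance normalized by the longer string (0 = identical).
    return levenshtein(pred, ref) / max(len(pred), len(ref), 1)

print(exact_match("Computer Science", "Computer Science & Engineering"))
print(round(string_distance("Computer Science", "Computer Science & Engineering"), 2))
```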
Task 4: SimpleText Revisited
We encourage revisiting tasks from earlier years of the CLEF SimpleText Track. The Codabench competitions for CLEF 2025 SimpleText Task 1 and Task 2 are still operational, and participants are invited to report experiments on any task from earlier years in their CLEF 2026 paper submissions.