
SimpleText@CLEF-2022

SimpleText is a new evaluation lab organised as part of the CLEF-2022 conference, within the CLEF initiative.

SimpleText@CLEF-2022 Tasks


SimpleText Task Guidelines

We invite you to submit both automatic and manual runs! Manual intervention should be reported.



Task 1: What is in (or out)? Select passages to include in a simplified summary, given a query.

The task aims at finding computer science references that could be inserted as citations in press articles aimed at a general audience, for illustration, fact-checking, or updating. For each selected reference, the most relevant sentences need to be extracted. These passages can be complex and may require further simplification, to be carried out in Tasks 2 and 3. Task 1 focuses on content retrieval.

Corpus: DBLP + abstracts

We use the Citation Network Dataset: DBLP+Citation, ACM Citation network. An ElasticSearch index, accessible through a GUI and an API, is provided to participants. A JSON dump of the index is also available for participants.
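For illustration, here is a minimal sketch of querying such an index over the standard ElasticSearch REST API from Python. The endpoint URL, index name, and field name are placeholders (assumptions), not the official values; substitute the endpoint and credentials supplied to participants.

    # Minimal sketch: querying an ElasticSearch index over its REST API.
    # ES_SEARCH_URL is a placeholder -- use the endpoint given to participants.
    import requests

    ES_SEARCH_URL = "https://<es-host>/<dblp-index>/_search"  # placeholder

    query = {
        "size": 10,  # top 10 candidate abstracts
        "query": {"match": {"abstract": "digital assistants privacy"}},  # field name assumed
    }

    resp = requests.post(ES_SEARCH_URL, json=query, timeout=30)
    resp.raise_for_status()

    # Each hit carries the DBLP document identifier (_id) needed for the run file.
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_id"], hit["_score"], hit["_source"].get("title", ""))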

Topics: Press articles

Topics are a selection of 40 press articles: 20 from a major international newspaper aimed at a general audience and 20 from Tech Xplore, enriched with queries manually extracted from the content of each article. We have checked that at least 5 relevant abstracts can be found for each query.

Output format:

Results should be provided in a TREC-style tabulated format (with a “.csv” extension). The following columns are required (include these as the first line):

  1. run_id: Run ID starting with teamid, followed by “task1” and run_name
  2. manual: Whether the run is manual {0,1}
  3. topic_id: Topic ID
  4. query_id: Query ID used to retrieve the document (if one of the queries provided for the topic was used; 0 otherwise)
  5. doc_id: ID of the retrieved document (to be extracted from the json output)
  6. passage: Text of the selected passage

For each topic, the maximum number of distinct DBLP references (_id JSON field) is 100, and the total length of the passages should not exceed 1000 tokens (a sketch of a limit-aware run writer follows the output example below).

Output example:

run_id manual topic_id query_id doc_id passage
ST1_task1_1 0 G01 G01.1 1564531496  A CDA is a mobile user device, similar to a Personal Digital Assistant (PDA). It supports the citizen when dealing with public authorities and proves his rights - if desired, even without revealing his identity.
ST1_task1_1 0 G01 G01.1 3000234933  People are becoming increasingly comfortable using Digital Assistants (DAs) to interact with services or connected objects
ST1_task1_1 0 G01 G01.2 1448624402 As extensive experimental research has shown individuals suffer from diverse biases in decision-making.
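To make the format and limits above concrete, here is a hypothetical sketch of a run writer that enforces the per-topic constraints (at most 100 distinct doc_ids and 1000 passage tokens per topic). The candidates input and the whitespace tokenisation are assumptions for illustration; the official evaluation may tokenise differently.

    # Hypothetical run writer enforcing the per-topic limits described above.
    # Tokens are approximated by whitespace splitting (an assumption).
    import csv

    def write_run(candidates, path, run_id="MyTeam_task1_run1", manual=0):
        # candidates: iterable of (topic_id, query_id, doc_id, passage) tuples,
        # assumed sorted by decreasing estimated relevance.
        seen_docs = {}    # topic_id -> distinct doc_ids already used
        used_tokens = {}  # topic_id -> total passage tokens so far
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerow(["run_id", "manual", "topic_id", "query_id", "doc_id", "passage"])
            for topic_id, query_id, doc_id, passage in candidates:
                docs = seen_docs.setdefault(topic_id, set())
                tokens = used_tokens.get(topic_id, 0) + len(passage.split())
                if doc_id not in docs and len(docs) >= 100:
                    continue  # would exceed 100 distinct DBLP references
                if tokens > 1000:
                    continue  # would exceed the 1000-token passage budget
                docs.add(doc_id)
                used_tokens[topic_id] = tokens
                writer.writerow([run_id, manual, topic_id, query_id, doc_id, passage])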

Output format checker

You can use the provided Python 3 script to check the output format. The script requires Python 3 and the pandas library: Download python output checker
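The downloadable script above remains the authoritative checker; the following is only a rough sketch of the kind of validation such a script performs, assuming tab-separated input with the header shown earlier.

    # Rough sketch of a run-format check (the official script is authoritative).
    import sys
    import pandas as pd

    REQUIRED = ["run_id", "manual", "topic_id", "query_id", "doc_id", "passage"]

    def check(path):
        df = pd.read_csv(path, sep="\t")
        missing = [c for c in REQUIRED if c not in df.columns]
        if missing:
            sys.exit(f"Missing required columns: {missing}")
        if not df["manual"].isin([0, 1]).all():
            sys.exit("Column 'manual' must contain only 0 or 1.")
        if (df.groupby("topic_id")["doc_id"].nunique() > 100).any():
            sys.exit("Some topic exceeds 100 distinct doc_ids.")
        print(f"OK: {len(df)} rows, {df['topic_id'].nunique()} topics.")

    if __name__ == "__main__":
        check(sys.argv[1])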

Disclaimer: By downloading and using these data, you agree to the terms of use. Any use of the data for any purpose other than academic research would be in violation of the intended use of these data.

Therefore, by downloading and using these data you give the following assurances with respect to the SimpleText data:

  1. You will not use, nor permit others to use, the data in the SimpleText datasets in any way except for classes and academic research.
  2. You will not, at any time, disclose, give, or transmit (in any manner or form or for any purpose) the data (or any portion thereof) to any location or person, including but not limited to making the data available on the Internet or copying the data onto any cloud-based storage system.
  3. You will not release, nor permit others to release, the dataset or any part of it to any person.

In case of violation of the conditions for access to the data for scientific purposes, this access may be withdrawn from the research entity and/or from the researcher. The research entity may also be liable to pay compensation for damages to third parties or be asked to take disciplinary action against the offending researcher.

Evaluation

Sentence pooling and automatic metrics will be used to evaluate these results. The relevance of the source document will be evaluated, as well as potential unresolved anaphora issues.

Result submission:

Participants should put their run results into the Documents folder created for their user and submit them by email to contact@simpletext-project.com.

The email subject has to be in the format [CLEF TASK 1] TEAM_ID.

Runs should be submitted as a single file in TSV format.

A confirmation email will be sent within 2 days after the submission deadline.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

Liana Ermakova, Eric SanJuan, Jaap Kamps, Stéphane Huet, Irina Ovchinnikova, Diana Nurbakova, 
Sílvia Araújo, Radia Hannachi, Elise Mathurin, and Patrice Bellot. 2022. 
Overview of the CLEF 2022 SimpleText Lab: Automatic Simplification of Scientific Texts. 
In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 13th International 
Conference of the CLEF Association, CLEF 2022, Bologna, Italy, September 5–8, 2022, Proceedings. 
Springer-Verlag, Berlin, Heidelberg, 470–494. https://doi.org/10.1007/978-3-031-13643-6_28

Paper

Download .BIB