Labs


CLEF promotes the systematic evaluation of information access systems, primarily through experimentation on shared tasks.

CLEF 2023 consists of a set of 13 Labs designed to test different aspects of multilingual and multimedia IR systems:

  1. BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering
  2. CheckThat!: Check-Worthiness, Subjectivity, Political Bias, Factuality, and Authority of News Articles and their Sources
  3. DocILE: Document Information Localization and Extraction
  4. eRisk: Early Risk Prediction on the Internet
  5. EXIST: sEXism Identification in Social neTworks
  6. iDPP: Intelligent Disease Progression Prediction
  7. ImageCLEF: Multimedia Retrieval Challenge
  8. JOKER: Automatic Wordplay Analysis
  9. LifeCLEF: Multimedia Retrieval in Nature
  10. LongEval: Longitudinal Evaluation of Model Performance
  11. PAN: Digital Text Forensics and Stylometry
  12. SimpleText: Automatic Simplification of Scientific Texts
  13. Touché: Argument and Causal Retrieval

Labs Publications:

  • Lab Overviews published in LNCS Proceedings
  • Labs Working Notes published in CEUR-WS Proceedings
  • Best of Lab Papers will be nominated for CLEF 2023 submission to LNCS proceedings

Labs Participation:


Important Dates:

  • Labs registration opens: 14 November 2022

BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering

The aim of the BioASQ Lab is to push the research frontier towards systems that use the diverse and voluminous information available online to respond directly to the information needs of biomedical scientists.
  • Task 1 - b: Biomedical Semantic Question Answering
    Benchmark datasets of biomedical questions, in English, along with gold standard (reference) answers constructed by a team of biomedical experts. The participants have to respond with relevant articles, and snippets from designated resources, as well as exact and "ideal" answers.
  • Task 2 - Synergy: Question Answering for developing problems
    Biomedical experts pose unanswered questions for developing problems, such as COVID-19, receive the responses provided by the participating systems, and provide feedback, together with updated questions in an iterative procedure that aims to facilitate the incremental understanding of developing problems in biomedicine and public health.
  • Task 3 - MedProcNER: Medical Procedure Text Mining and Indexing Shared Task
    Focuses on the recognition and indexing of medical procedures in clinical documents in Spanish, posing subtasks on (1) indexing medical documents with controlled terminologies, (2) automatic detection of the textual evidence for indexing, i.e. medical procedure entity mentions in text, and (3) normalization of these medical procedure mentions to terminologies.
  • http://www.bioasq.org/workshop2023
  • @BioASQ

CheckThat!: Check-Worthiness, Subjectivity, Political Bias, Factuality, and Authority of News Articles and their Sources

The CheckThat! Lab aims at producing technology to support the fight against misinformation and disinformation in social media, in political debates and in the news with a focus on check-worthiness, subjectivity, bias, factuality, and authority of the claim.
  • Task 1 - Check-worthiness in textual and multimodal content
    Determine whether an item, be it a text alone or a text plus an image, deserves the attention of a journalist to be fact-checked.
  • Task 2 - Subjectivity in News Articles
    Assess whether a text snippet within a news article is subjective or objective.
  • Task 3 - Political Bias of News Articles and News Media
    Identify the political leaning of an article or media source: left, centre or right.
  • Task 4 - Factuality of Reporting of News Media
    Determine the level of factuality of both a document and a medium.
  • Task 5 - Authority Finding in Twitter
    Identify authorities that should be trusted to verify a contended claim expressed in an Arabic tweet.
  • http://checkthat.gitlab.io

DocILE: Document Information Localization and Extraction

The DocILE 2023 lab runs the largest benchmark for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from business documents like invoices.
  • Task 1 - Key Information Localization and Extraction (KILE)
    The goal of Key Information Localization and Extraction is to localize fields of each pre-defined category and read out their values.
  • Task 2 - Line Item Recognition (LIR)
    The goal of Line Item Recognition is to find all line items, e.g., a billed item in a table, and localize their corresponding fields in the document as in Task 1.
  • https://docile.rossum.ai/

eRisk: Early Risk Prediction on the Internet

eRisk explores the evaluation methodology, effectiveness metrics, and practical applications of early risk detection on the Internet. Early detection technologies can be employed in different areas, particularly those related to health and safety. For instance, early alerts could be sent when a predator starts interacting with a child for sexual purposes, or when a potential offender starts publishing antisocial threats on a blog, forum or social network. Our main goal is to pioneer a new interdisciplinary research area that would be potentially applicable to a wide variety of situations and to many different personal profiles. Examples include potential paedophiles, stalkers, individuals who could fall into the hands of criminal organisations, people with suicidal inclinations, or people susceptible to depression.
  • Task 1 - Search for symptoms of depression
    The challenge consists of ranking sentences from a collection of user writings according to their relevance to a depression symptom. The participants will have to provide rankings for the 21 symptoms of depression from the BDI Questionnaire. A sentence will be deemed relevant to a BDI symptom when it conveys information about the user's state concerning the symptom. That is, it may be relevant even when it indicates that the user is OK with the symptom.
  • Task 2 - Early Detection of Signs of Pathological Gambling
    The challenge consists of sequentially processing pieces of evidence and detecting early traces of pathological gambling (also known as compulsive gambling or disordered gambling) as soon as possible. The task is mainly concerned with evaluating Text Mining solutions and thus concentrates on texts written in Social Media.
  • Task 3 - Measuring the severity of the signs of Eating Disorders
    The task consists of estimating the level of features associated with a diagnosis of eating disorders from a thread of user submissions. For each user, the participants will be given a history of postings and will have to fill in a standard eating disorder questionnaire (based on the evidence found in the history of postings).
  • https://erisk.irlab.org/
  • @earlyrisk

EXIST: sEXism Identification in Social neTworks

EXIST aims to capture and categorize sexism, from explicit misogyny to other subtle behaviors, in social networks. Participants will be asked to classify tweets in English and Spanish according to the type of sexism they contain and the intention of the person who writes them.
  • Task 1 - Sexism Identification
    The first task is a binary classification. The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).
  • Task 2 - Source Intention
    This task aims to categorize the sexist messages according to the intention of the author in one of the following categories: (i) direct sexist message, (ii) reported sexist message and (iii) judgemental message.
  • Task 3 - Sexism Categorization
    The third task is a multiclass task that aims to categorize the sexist messages according to the type or types of sexism they contain (according to the categorization proposed by experts and that takes into account the different facets of women that are undermined): (i) ideological and inequality, (ii) stereotyping and dominance, (iii) objectification, (iv) sexual violence and (v) misogyny and non-sexual violence.
  • http://nlp.uned.es/exist2023/
  • @JCAlbornozC

iDPP: Intelligent Disease Progression Prediction

Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) are chronic diseases characterized by progressive or alternate impairment of neurological functions (motor, sensory, visual, cognitive). Patients have to manage alternating periods in hospital with care at home, experiencing constant uncertainty regarding the timing of the acute phases of the disease and facing a considerable psychological and economic burden that also involves their caregivers. Clinicians, on the other hand, need tools able to support them in all the phases of the patient treatment, suggest personalized therapeutic decisions, and indicate urgently needed interventions. The goal of iDPP@CLEF is to design and develop an evaluation infrastructure for AI algorithms able to: 1) better describe disease mechanisms; 2) stratify patients according to their phenotype assessed throughout the disease evolution; 3) predict disease progression in a probabilistic, time-dependent fashion.
  • Task 1 – Predicting Risk of Disease Worsening (Multiple Sclerosis)
    It will focus on ranking subjects based on the risk of worsening, setting the problem as a survival analysis task. More specifically, the risk of worsening predicted by the algorithm should reflect how early a patient experiences the event "worsening". Worsening is defined based on the Expanded Disability Status Scale (EDSS), according to clinical standards.
  • Task 2 – Predicting Probability of Worsening (Multiple Sclerosis)
    It will refine Task 1 by asking participants to explicitly assign a probability of worsening at different time windows (e.g. between years 4 and 6, 6 and 8, 8 and 10, etc.).
  • Position Paper Task 3 – Impact of Exposition to Pollutants (Amyotrophic Lateral Sclerosis)
    We will evaluate proposals of different approaches to assess if exposure to different pollutants is a useful variable to predict time to Percutaneous Endoscopic Gastrostomy (PEG), Non-Invasive Ventilation (NIV) and death in ALS patients.
  • https://brainteaser.health/open-evaluation-challenges/idpp-2023/
  • @brainteaser2020
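
Task 1 above asks for a risk ranking that reflects how early each patient experiences worsening. One common way to score such a ranking in survival analysis is a concordance index. The sketch below is illustrative only: this page does not state which measure the lab uses, and the function name, argument names, and toy data are assumptions.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    times  -- observed time for each patient (event or censoring time)
    events -- 1 if worsening was observed, 0 if the patient was censored
    risks  -- predicted risk scores (higher means earlier worsening expected)
    All three names are hypothetical and not taken from the lab description.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the earlier observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the earlier
        # patient actually experienced the event (was not censored).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier worsening received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (invented): four patients, the third one is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.1]))
```

A value of 1.0 means every comparable pair is ordered correctly, while 0.5 corresponds to a random ranking.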

ImageCLEF: Multimedia Retrieval Challenge

ImageCLEF 2023 is set to promote the evaluation of technologies for annotation, indexing, classification and retrieval of multimodal data, with the objective of providing information access to large collections in various usage scenarios and domains.
  • Task 1 - ImageCLEFmedical
    This task continues the tradition of bringing together several initiatives for medical applications fostering cross-exchanges, namely: medical concept detection and caption prediction, synthetic medical images generated with GANs, Visual Question Answering and generation, and doctor-patient conversation summarization.
  • Task 2 - ImageCLEFaware
    The images available on social networks can be exploited in ways users are unaware of when initially shared, including situations that have serious consequences for the users’ real lives. The task addresses the development of algorithms which raise the users’ awareness about real-life impact of online image sharing.
  • Task 3 - ImageCLEFfusion
    Despite the current advances in knowledge discovery, single learners do not produce satisfactory performances when dealing with complex data, such as class imbalance, high-dimensionality, concept drift, noise, multimodality, subjective annotations, etc. This task aims to fill this gap by exploiting novel and innovative late fusion techniques for producing a powerful learner based on the expertise of a pool of classifiers.
  • Task 4 - ImageCLEFrecommendation
    This task focuses on content recommendation for cultural heritage content in 15 broad themes that have been curated by experts in the Europeana Platform. Despite current advances in content recommendation, there is limited understanding of how well such systems perform and how relevant their recommendations are for the final end-users.
  • https://www.imageclef.org/2023
  • @imageclef
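
The late fusion idea mentioned under ImageCLEFfusion can be illustrated with a minimal sketch that averages the class probabilities produced by a pool of classifiers. This is only one simple fusion rule, assumed here for illustration; the function names, weights, and probabilities are hypothetical and do not come from the task definition.

```python
def late_fusion(prob_lists, weights=None):
    """Weighted average of per-classifier class probabilities.

    prob_lists -- one list of class probabilities per classifier
    weights    -- optional per-classifier weights (default: equal weights)
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]


def predict(prob_lists, weights=None):
    """Index of the class with the highest fused probability."""
    fused = late_fusion(prob_lists, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Three hypothetical classifiers scoring one item over two classes.
pool = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
print(predict(pool))                      # equal weights
print(predict(pool, weights=[6, 1, 1]))   # trust the first classifier more
```

With equal weights the two classifiers favouring the second class outvote the first; giving the first classifier a much larger weight flips the decision, which shows how the choice of weights shapes the fused learner.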

JOKER: Automatic Wordplay Analysis

The goal of the JOKER track is to create reusable test collections for benchmarking and to explore new methods and evaluation metrics for the automatic processing of wordplay.
  • Task 1 - Pun detection
    Detection of puns in English, French, and Spanish.
  • Task 2 - Pun interpretation
    Interpretation of puns in English, French, and Spanish.
  • Task 3 - Pun translation
    Translation of puns from English to French and Spanish.
  • https://www.joker-project.com/
  • @joker_research

LifeCLEF: Multimedia Retrieval in Nature

LifeCLEF is dedicated to the large-scale evaluation of biodiversity identification and prediction methods based on artificial intelligence.
  • Task 1 - BirdCLEF
    Bird species recognition in audio soundscapes.
  • Task 2 - FungiCLEF
    Fungi recognition from images and metadata.
  • Task 3 - GeoLifeCLEF
    Remote sensing based prediction of species.
  • Task 4 - PlantCLEF
    Global-scale plant identification from images.
  • Task 5 - SnakeCLEF
    Snake species identification in medically important scenarios.
  • http://www.lifeclef.org/

LongEval: Longitudinal Evaluation of Model Performance

The LongEval shared task focuses on evaluating the temporal persistence of information retrieval systems and text classifiers. The goal is to develop temporal information retrieval systems and longitudinal text classifiers that survive through dynamic temporal text changes, introducing time as a new dimension for ranking model performance.
  • Task 1 - LongEval-Retrieval
    The goal of Task 1 is to propose a temporal information retrieval system which can handle changes over time. The proposed retrieval system should remain effective as Web documents change over time. This task will have 2 sub-tasks focusing on short-term and long-term persistence.
  • Task 2 - LongEval-Classification
    The goal of Task 2 is to propose a temporal persistence classifier which can mitigate performance drop over short and long periods of time compared to a test set from the same time frame as training. This task will have 2 sub-tasks focusing on short-term and long-term persistence.
  • https://clef-longeval.github.io/
  • @long_eval
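
Both LongEval tasks compare performance on later data against a test set from the same time frame as training. A minimal sketch of how such a drop could be quantified is given below. The relative-drop formula, the function name, and the scores are assumptions made for illustration; this page does not specify the lab's actual measure.

```python
def relative_drop(within_score, later_score):
    """Relative performance drop from the within-period test set to a later one.

    within_score -- score on a test set from the same time frame as training
    later_score  -- score of the same system on a later test set
    Returns a fraction: 0.0 means no drop, negative values mean improvement.
    """
    if within_score <= 0:
        raise ValueError("within_score must be positive")
    return (within_score - later_score) / within_score


# Hypothetical scores for one system on short-term and long-term test sets.
within = 0.80
for label, later in [("short-term", 0.76), ("long-term", 0.72)]:
    print(f"{label}: {relative_drop(within, later):.1%} drop")
```

A system with good temporal persistence keeps this value close to zero on both the short-term and the long-term test sets.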

PAN: Digital Text Forensics and Stylometry

PAN is a series of scientific events and shared tasks on digital text forensics and stylometry whose goal is to advance the state of the art and provide for an objective evaluation on newly developed benchmark datasets in those areas.
  • Task 1 - Cross-Discourse Type Authorship Verification
    It focuses on (cross-discourse type) authorship verification where both written (e.g., essays, emails) and oral language (e.g., interviews, speech transcriptions) are represented in the set of discourse types.
  • Task 2 - Profiling Cryptocurrency Influencers with Few-shot Learning
    It aims to profile cryptocurrency influencers in social media (Twitter) from a low-resource perspective.
  • Task 3 - Multi-Author Writing Style Analysis
    It addresses multi-authored documents whose authorship cannot be easily determined by exploiting topic changes alone.
  • Task 4 - Trigger Detection
    It addresses the task of assigning a single trigger warning label (violence) to narratives in a corpus of fanfiction.
  • https://pan.webis.de/
  • @webis_de

SimpleText: Automatic Simplification of Scientific Texts

The overall goal of SimpleText is to create a simplified summary of multiple scientific documents, based on a popular science query, which provides a user with an instant, accessible overview of this specific topic.
  • Task 1 - What is in, or out?
    Selecting passages to include in a simplified summary.
  • Task 2 - What is unclear?
    Difficult concept identification and explanation.
  • Task 3 - Rewrite this!
    Rewriting scientific text.
  • https://simpletext-project.com/
  • @SimpletextW

Touché: Argument and Causal Retrieval

The goal of Touché is to foster and support the development of technologies for argument and causal retrieval and analysis that includes argument quality estimation, stance detection, image retrieval, and causal evidence retrieval.
  • Task 1 - Argument Retrieval for Controversial Questions
    Given a controversial topic and a collection of web documents, the task is to retrieve and rank documents by relevance to the topic and by argument quality, and to detect the document stance.
  • Task 2 - Evidence Retrieval for Causal Questions
    Given a causality-related topic and a collection of web documents, the task is to retrieve and rank documents by relevance to the topic and detect the document "causal" stance (i.e., whether a causal relationship from the topic’s title holds).
  • Task 3 - Image Retrieval for Arguments
    Given a controversial topic, the task is to retrieve images (from web pages) for each stance (pro/con) that show support for that stance.
  • Task 4 - Intra-Multilingual Multi-Target Stance Classification
    Given a proposal on a socially important issue, its title, and topic in different languages, the task is to classify whether a comment is in favor, against, or neutral towards the proposal.
  • https://touche.webis.de/
  • @webis_de