Keynote Speakers

Keynote 1: Sina Zarrieß - Universität Bielefeld

NLP Evaluation in the Face of Deceptively Fluent Models

Abstract

Evaluation has long been one of the most contested challenges in NLG and NLP research. Over the decades, the field has developed a range of paradigms serving distinct purposes — from intrinsic, hypothesis-driven approaches to extrinsic, application-driven methods. The rise of LLMs, however, poses challenges that are more fundamental than those raised by earlier task-specific systems. Put simply: the texts produced by today’s models are extraordinarily fluent. This does not mean they are error-free or fit for every real-world purpose — but their surface polish makes it difficult to detect and diagnose underlying deficiencies, even for trained human evaluators.

In this talk, I argue that this situation demands a new evaluation paradigm, one that shifts the focus from text quality to interaction quality. Rather than asking how good a generated text is in isolation, we should ask whether and to what extent a system enables meaningful, reliable, and predictable interactions with users, bringing user intentions and human-model interaction dynamics into the focus of evaluation. I will present recent work from my group that illustrates what such an interaction-oriented paradigm can look like in practice, and discuss how LLM-as-a-judge methods could play a role in this paradigm.

Speaker Information

Sina Zarrieß has been a professor of Computational Linguistics at the University of Bielefeld since 2023. Prof. Zarrieß's research focuses on computational models of language use in text and dialogue, with applications in natural language generation, dialogue systems, and language and vision.

Keynote 2: Beatrice Alex - Heriot-Watt University

From Benchmark to Bedside: Lessons learned in Clinical Natural Language Processing

Abstract

Clinical Natural Language Processing (NLP) sits at the intersection of two communities with very different relationships to evaluation: NLP researchers, who measure their methods via benchmark performance, and clinicians, who need systems that are trustworthy and useful in practice for patients. Drawing on over a decade of research applied to brain health, processing NHS Scotland brain imaging reports and developing NLP tools to support research on disease prevention, prediction and cohort creation at population scale, this keynote reflects on what clinical NLP teaches us about evaluation. There are numerous challenges which make this type of work hard in practice: the lack of accessible datasets, the annotation bottleneck, the asymmetric effect of different error types and the disconnect between NLP performance and readiness for patients. Closing this gap requires more than better methods; it requires evaluation co-designed with clinical partners, transparent reporting on datasets and models and a commitment to measuring what actually matters for patients.

Speaker Information

Beatrice Alex is a professor and the chair of Artificial Intelligence at Heriot-Watt University. Prof. Alex’s research focuses on text mining and natural language processing to extract information from raw text. In particular, she has focused on developing tools that can assist in predicting multimorbidity and adverse drug events to improve care in later life. She is also leading the NLP work in the Warbler project, which aims to phenotype and analyse 1.7 million brain imaging reports of the Scottish population.

Keynote 3: Albert Gatt - Universiteit Utrecht

“It’s cheaper if you don’t involve people”

Abstract

A long-standing argument for automatic evaluation, especially in tasks involving text generation, goes as follows: to the extent that an automatic evaluation method is valid and reliable, it is preferable to human evaluation because it is cheaper, more efficient, and less susceptible to inter-evaluator variation and intra-evaluator inconsistency. Such arguments were often made for reference-based metrics, including model-based evaluation metrics. A similar rationale underlies the more recent turn towards using LLMs as judges, but with an important difference: such evaluations often involve tasks, such as judging some qualitative dimension of a text, that LLMs seem able to perform much as humans would.

In this talk, I will survey the state of play with LLM-as-Judge evaluations, with reference to recent research on: (i) the compatibility between LLM judges and humans, including experts; (ii) the problems of calibration and bias in LLMs; and (iii) the extent to which LLMs capture sufficient linguistic diversity to warrant their use as stand-ins for entire samples of human evaluators. My goal is to bring to the foreground some of the assumptions that are often left implicit in this form of automatic evaluation.

Finally, I will bridge this discussion to a question that Ehud Reiter has been instrumental in bringing to the attention of our community over the course of his career, namely: to what extent does a proxy evaluation of this sort allow us to make assumptions about the utility and impact of deployed systems?

Speaker Information

Albert Gatt is a professor of Natural Language Generation and leads the Natural Language Processing group in the Department of Information and Computing Sciences at Universiteit Utrecht. Prof. Gatt’s research focuses mostly on the automatic generation of language from non-linguistic information (a.k.a. Natural Language Generation). One important aspect of this is how systems, whether artificial or human, learn meaningful relationships between language and the non-linguistic, especially the perceptual, world.