MULTITuDE: A New Benchmark for Multilingual Machine-Generated Text Detection

A new paper authored by the Kempelen Institute of Intelligent Technologies (KInIT), a member of the VIGILANT consortium, introduces a pioneering benchmark dataset for evaluating machine-generated text detection across multiple languages.

Recent advancements in large language models (LLMs) have opened new frontiers in text generation, yet there is a notable gap in understanding their efficacy in non-English languages. This gap extends to the detection of machine-generated text in multilingual settings: existing benchmarks are limited, focusing mainly on English texts and older generations of models. Addressing this need, the authors introduce MULTITuDE, a benchmark dataset designed to evaluate machine-generated text detection across multiple languages.

MULTITuDE stands out with its comprehensive collection of 74,081 authentic and machine-generated texts spanning 11 languages: Arabic, Catalan, Czech, German, English, Spanish, Dutch, Portuguese, Russian, Ukrainian, and Chinese. The machine-generated texts were produced by 8 different multilingual LLMs, providing a diverse and challenging testing ground for detection models. KInIT's benchmark assesses zero-shot detectors (both statistical and black-box models) as well as fine-tuned detectors, offering insights into their performance and adaptability across varied linguistic contexts.
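To give a flavour of what a statistical zero-shot detector does, here is a minimal sketch of one common approach: scoring a text by its average token log-likelihood under a pretrained language model and flagging unusually probable texts as machine-generated. This is illustrative rather than the paper's exact setup; the scoring model and the threshold below are assumptions and would need calibration on labelled data.

```python
# Minimal sketch of a statistical zero-shot detector (illustrative, not the
# paper's exact method): machine-generated text tends to be more probable
# under a language model than human-written text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM works; a multilingual LM fits MULTITuDE better
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the scoring LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per token
    return -out.loss.item()

def is_machine_generated(text: str, threshold: float = -3.0) -> bool:
    # The threshold is a placeholder; in practice it is calibrated
    # on a labelled development set per language and per scoring model.
    return avg_log_likelihood(text) > threshold
```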

A key aspect of the evaluation is how these detectors perform on languages they have not encountered before, whether linguistically similar to or dissimilar from the training language, as well as on texts from unseen LLMs. The authors also explore whether training detectors on multiple languages improves their overall performance, as sketched below.
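A simplified sketch of this cross-lingual evaluation protocol follows: train a detector on one language, then measure how well it transfers to languages it never saw during training. A TF-IDF plus logistic-regression classifier stands in for the fine-tuned transformer detectors evaluated in the paper, and `records`, a list of (text, label, language) tuples, is a hypothetical representation of the MULTITuDE data.

```python
# Sketch of cross-lingual transfer evaluation: fit on one language,
# score detection quality (AUC ROC) on every held-out language.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

def cross_lingual_auc(records, train_lang="en"):
    """Train on `train_lang`; return AUC ROC per unseen language."""
    train = [(t, y) for t, y, lang in records if lang == train_lang]
    texts, labels = zip(*train)
    # Stand-in detector; the paper fine-tunes multilingual transformers instead.
    detector = make_pipeline(TfidfVectorizer(max_features=50_000),
                             LogisticRegression(max_iter=1000))
    detector.fit(texts, labels)

    results = {}
    for lang in {lang for _, _, lang in records} - {train_lang}:
        test = [(t, y) for t, y, l in records if l == lang]
        X, y = zip(*test)
        scores = detector.predict_proba(X)[:, 1]
        results[lang] = roc_auc_score(y, scores)  # needs both classes present
    return results
```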

This comprehensive analysis aims to bridge the research gap in multilingual text generation and detection, offering valuable insights for future developments in this field.

For a deeper dive into the findings and methodology, read the preprint of the paper, which has been accepted to the EMNLP 2023 conference, here.

The data used for the publication is available on GitHub in line with Open Access principles.