PrismAI

We introduce PrismAI, an environment for the automatic detection of AI-generated text. Our contributions are threefold. First, we release the largest AI-detection dataset to date, comprising 537,588 human-written and AI-generated documents in both English and German across seven domains, including scientific writing, weblogs, parliamentary speeches, legal court cases, classic literature, news articles, and student essays, with the AI-generated documents synthesized using state-of-the-art models. Second, we introduce Luminar, a CNN-based model for the automatic detection of AI-generated texts. Our experiments show that by leveraging the hidden states of an LLM to derive intermediate likelihoods, our model, despite its small footprint, significantly outperforms other likelihood-backed baselines while generalizing well in out-of-domain and out-of-language scenarios. Third, we unify existing datasets into a common corpus called AIGT-World and make it accessible through a publicly available web-based corpus explorer, which supports searching, reading, visualizing, and interacting with the underlying data. With these contributions, we aim to advance research in this area, expand the field to include non-English texts, propose new models, and unify existing efforts toward a common dataset and objective.
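The phrase "leveraging the hidden states of an LLM to derive intermediate likelihoods" can be illustrated with a toy sketch. This is not Luminar's implementation. It assumes one plausible reading, in which each layer's hidden state is projected onto the vocabulary and the probability of the token that actually follows is read off. The function names, the tiny hand-written hidden states, and the unembedding matrix are all hypothetical, and the sketch uses only the Python standard library.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, unembed):
    # Map one hidden vector to vocabulary logits; unembed has shape vocab x dim.
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def intermediate_likelihoods(hidden_states, unembed, token_ids):
    # hidden_states[layer][pos] is the hidden vector at that layer and position.
    # Returns features[layer][pos]: the probability that layer assigns to the
    # token that actually follows position pos (an assumed feature definition).
    features = []
    for layer in hidden_states:
        row = []
        for pos in range(len(token_ids) - 1):
            probs = softmax(project(layer[pos], unembed))
            row.append(probs[token_ids[pos + 1]])
        features.append(row)
    return features

# Toy input: 2 layers, 3 tokens, hidden size 2, vocabulary size 3 (all invented).
hidden_states = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
]
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_ids = [0, 1, 2]

features = intermediate_likelihoods(hidden_states, unembed, token_ids)
for layer_index, row in enumerate(features):
    print(layer_index, [round(p, 4) for p in row])
```

Under this reading, the result is a layer-by-position grid of likelihoods, which is the kind of two-dimensional input a small CNN could take as features.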

Corpora

PrismAI Dataset

Text Technology Lab

The PrismAI dataset was created as a playground for testing models designed to detect AI-generated texts. The dataset includes various domains, such as news articles, political speeches, legal court texts, student essays, and more, and provides texts in both German and English. For every scraped human text, there are at least two AI-generated counterparts.

The dataset is also available for download.
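The pairing described above, at least two AI-generated counterparts for every human text, is easy to check once the data is loaded. The sketch below assumes a hypothetical record layout: the field names `doc_id`, `source_id`, `label`, `language`, and `domain` are illustrative and are not taken from the released files.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: str
    source_id: str  # hypothetical: id of the human text this document derives from
    label: str      # hypothetical values: "human" or "ai"
    language: str
    domain: str

def under_covered(docs, minimum=2):
    # Return ids of human texts with fewer than `minimum` AI-generated counterparts.
    ai_counts = Counter(d.source_id for d in docs if d.label == "ai")
    return [
        d.doc_id
        for d in docs
        if d.label == "human" and ai_counts[d.doc_id] < minimum
    ]

# Invented sample records for illustration only.
docs = [
    Document("h1", "h1", "human", "en", "news"),
    Document("a1", "h1", "ai", "en", "news"),
    Document("a2", "h1", "ai", "en", "news"),
    Document("h2", "h2", "human", "de", "essays"),
    Document("a3", "h2", "ai", "de", "essays"),
]

missing = under_covered(docs)
print(missing)
```

In this sample, `h2` has only one AI-generated counterpart, so it is the only id reported.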

AIGT-World

Research Community

The AIGT-World dataset is a unified corpus comprising various datasets previously created for the automatic detection of AI-generated content. It integrates multiple past research efforts within the community and includes our newly introduced PrismAI dataset. In total, the corpus consists of 1.2 million documents.

The dataset is also available for download.

Search.

You want to search? Use the UCE-Portal!

Team

The team behind UCE and PrismAI is part of the Text Technology Lab at Goethe University Frankfurt.

Prof. Alexander Mehler

(Supervisor)

Supervisor of the lab and of its projects.

Robert-Mayer-Straße 10
60325 Frankfurt am Main

Kevin Bönisch

(Developer)

Responsible for developing and maintaining UCE.

Robert-Mayer-Straße 10
60325 Frankfurt am Main

Manuel Schaaf

(Research Assistant)

Responsible for the PrismAI research.

Robert-Mayer-Straße 10
60325 Frankfurt am Main
