The Dutch EEG Speech Register Corpus

Bentum, M.

Bosch, L.F.M. ten (Louis)

van den Bosch, A.

Ernestus, M.T.C.

The Dutch EEG Speech Register Corpus contains 207 hours of EEG recordings from 48 participants listening to natural connected speech. The speech recordings were sampled from spontaneous dialogues, news broadcasts and read-aloud stories, and contain 50,277 word tokens per participant, time-locked to the EEG recordings. We cleaned the data with a novel approach: we trained a convolutional neural network artefact classifier on EEG recordings with manually labeled artefacts, applied the classifier to all EEG recordings, and manually checked every automatically identified artefact to ensure high data quality. Eye-related activity was removed with independent component analysis. The EEG recordings (raw and cleaned) contain 1.5 million word epochs, are freely available (license: CC BY-NC 4.0), and offer opportunities to investigate the neural correlates of natural connected speech processing.
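
As an illustration of what "word epochs time-locked to the EEG recordings" can mean in practice, the following is a minimal sketch of cutting fixed-length windows around word onsets. It is not the corpus authors' pipeline: the function name `extract_word_epochs`, the sampling rate, and the epoch window (`tmin`, `tmax`) are hypothetical choices made only for this example, and the data are synthetic.

```python
import numpy as np


def extract_word_epochs(eeg, sfreq, word_onsets_s, tmin=-0.3, tmax=1.0):
    """Cut fixed-length epochs around word onsets (illustrative sketch).

    eeg           : array of shape (n_channels, n_samples)
    sfreq         : sampling rate in Hz (assumed value, not from the corpus)
    word_onsets_s : word onset times in seconds, relative to recording start
    tmin, tmax    : epoch window in seconds relative to onset (assumed values)

    Returns (epochs, kept), where epochs has shape
    (n_epochs, n_channels, n_times) and kept lists the indices of the
    onsets whose window fits entirely inside the recording.
    """
    start_off = int(round(tmin * sfreq))
    stop_off = int(round(tmax * sfreq))
    n_times = stop_off - start_off

    epochs, kept = [], []
    for i, onset in enumerate(word_onsets_s):
        centre = int(round(onset * sfreq))
        start, stop = centre + start_off, centre + stop_off
        # Skip words whose window would run past either end of the recording.
        if start < 0 or stop > eeg.shape[1]:
            continue
        epochs.append(eeg[:, start:stop])
        kept.append(i)

    if not epochs:
        return np.empty((0, eeg.shape[0], n_times)), kept
    return np.stack(epochs), kept


# Synthetic example: 2 channels, 10 s of data at an assumed 100 Hz.
sfreq = 100
eeg = np.arange(2 * 1000, dtype=float).reshape(2, 1000)
onsets = [0.1, 2.0, 9.9]  # first and last windows fall outside the recording
epochs, kept = extract_word_epochs(eeg, sfreq, onsets)
```

In this sketch, words too close to the start or end of a recording are dropped rather than zero-padded, so every returned epoch has the same length. Epochs overlapping artefact-labeled segments would need an additional rejection step, which is not shown here.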