Data Specifications

This project provides openly accessible multimodal EEG/EMG recordings for brain-computer interface research. All datasets are hosted on OpenNeuro and associated with peer-reviewed publications.

Audio

Microphone recordings synchronized with EEG and downsampled to 16 kHz. They enable audio-EEG cross-modal analysis and support ground-truth transcription.
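To illustrate what synchronization across different sampling rates implies in practice, the sketch below maps an EEG sample index to the nearest sample index in the 16 kHz audio stream. It assumes that both streams share a common start time and do not drift; the function name `eeg_to_audio_index` and the 256 Hz EEG rate in the example are illustrative choices, not part of any published tooling for these datasets.

```python
def eeg_to_audio_index(eeg_index: int, eeg_fs: float, audio_fs: float = 16_000.0) -> int:
    """Map an EEG sample index to the nearest audio sample index.

    Assumption: both streams start at the same instant and do not drift.
    """
    if eeg_fs <= 0 or audio_fs <= 0:
        raise ValueError("sampling rates must be positive")
    seconds = eeg_index / eeg_fs      # time since recording start
    return round(seconds * audio_fs)  # nearest audio sample


# One second of EEG at 256 Hz lines up with audio sample 16000.
print(eeg_to_audio_index(256, eeg_fs=256.0))
```

If the actual recordings carry explicit trigger or timestamp channels, those should take precedence over this fixed-ratio mapping.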

EMG (Electromyography)

Bipolar facial EMG (orbicularis oris upper/lower, EOG) synchronized with EEG. Used for myoelectric artifact removal and joint EEG/EMG decoding models.
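One common way to use a synchronized reference channel for artifact reduction is to regress it out of each EEG channel. The sketch below shows that idea for a single EEG channel and a single EMG reference using ordinary least squares. This is a textbook technique chosen for illustration, not necessarily the pipeline used for these datasets, and the function name `regress_out` is hypothetical.

```python
def regress_out(eeg: list[float], emg: list[float]) -> list[float]:
    """Subtract the part of `eeg` that is linearly predicted by `emg`.

    Illustrative single-reference least-squares regression; real pipelines
    typically use several reference channels and filtered signals.
    """
    if len(eeg) != len(emg) or not eeg:
        raise ValueError("signals must be non-empty and equally long")
    n = len(eeg)
    eeg_mean = sum(eeg) / n
    emg_mean = sum(emg) / n
    cov = sum((x - eeg_mean) * (y - emg_mean) for x, y in zip(eeg, emg))
    var = sum((y - emg_mean) ** 2 for y in emg)
    if var == 0:
        return list(eeg)  # flat reference: nothing to remove
    beta = cov / var      # least-squares weight of the EMG reference
    return [x - beta * (y - emg_mean) for x, y in zip(eeg, emg)]


# Toy example: a "brain" signal contaminated by half of the EMG reference.
emg = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
brain = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
contaminated = [b + 0.5 * m for b, m in zip(brain, emg)]
cleaned = regress_out(contaminated, emg)
```

In this toy case the reference is uncorrelated with the underlying signal, so the regression recovers it exactly; with real data, some neural activity correlated with the EMG would be removed as well.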

Dataset	Task	Device	EEG channels	Duration	Related publication
uhd-speech	Overt / Min-overt / Covert	g.Pangolin	128 ch	~113 h	Sasai et al. bioRxiv 2024
silent-speech-word	Silent speech (word)	eego sports	63 ch	16.1 h	Sasai et al. bioRxiv 2024
silent-speech-sentence	Silent speech (sentence)	eego sports	63 ch	13.6 h	Sasai et al. bioRxiv 2024
vocalized-eego	Vocalized speech	eego sports	63 ch	49.2 h	Sasai et al. bioRxiv 2024
vocalized-scarabeo	Vocalized speech	g.SCARABEO	62 ch	27.7 h	Sasai et al. bioRxiv 2024

DATASET 01:

5-Word Isolated Speech EEG-EMG Dataset

3 Participants, 5 Words

Participants spoke the same word five times consecutively following a 3-count auditory cue, across three speech production modes.

Protocol

Item	Value
Participants	3 healthy adults (male, age 29–36)
Speech conditions	Overt / Minimally overt / Covert (imagined)
Vocabulary	5 color words (Japanese)
Task structure	5 repetitions per word following a 3-count cue, 1.25 sec/repetition
EEG device	g.Pangolin (g.tec, Austria)
EEG channels	128 (8 sheets × 16 electrodes)
Sampling rate	256 Hz (online)
Electrode placement	Left-hemisphere language areas
EMG	EOG + upper/lower orbicularis oris (bipolar)
Ethics	Shiba Palace Clinic Ethics Review Committee; Declaration of Helsinki
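As a worked example of the timing in the table above: at 256 Hz, one 1.25 s repetition spans 256 × 1.25 = 320 samples, so a block of 5 repetitions covers 1600 samples. The sketch below turns that arithmetic into sample windows. It assumes the repetitions are back-to-back and that `onset_sample` marks the start of the first repetition (after the 3-count cue); the function name and this onset convention are assumptions for illustration and should be checked against the actual event markers in the data.

```python
def repetition_windows(
    onset_sample: int,
    fs: float = 256.0,
    n_reps: int = 5,
    rep_duration_s: float = 1.25,
) -> list[tuple[int, int]]:
    """Return (start, stop) sample indices for each repetition.

    Assumption: repetitions follow each other without gaps, starting at
    `onset_sample`. `stop` is exclusive, as in Python slicing.
    """
    samples_per_rep = round(fs * rep_duration_s)  # 256 Hz * 1.25 s = 320
    return [
        (onset_sample + k * samples_per_rep,
         onset_sample + (k + 1) * samples_per_rep)
        for k in range(n_reps)
    ]


windows = repetition_windows(onset_sample=1000)
print(windows[0], windows[-1])
```

Each window can then be used to slice a channel-by-sample array into fixed-length epochs for a classifier such as EEGNet.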

Offline Decoding Performance
(EEGNet, for reference)

Condition	Avg. accuracy (3 subjects)
Overt	94.6%
Minimally overt	94.9%
Covert	91.1%

Real-world Application

Using this dataset, we demonstrated a real-time EEG-based Gmail control interface: participants navigated their inbox and generated AI-assisted email replies by decoding the five color words (green, magenta, orange, violet, yellow).

Access

View on GitHub

View on Data Repository

DATASET 02:

Open-Vocabulary Sentence Reading EEG-EMG Dataset

3 (Long-term) Participants, Open Vocabulary

Participants read aloud sentences drawn from novels, text-based video games, and the JSUT corpus at a natural speaking rate, producing long-form continuous speech.

Protocol

Item	Value
Participants	3 healthy adults (male, age 22–44)
Speech conditions	Overt
Vocabulary	Open (Japanese)
Task structure	Continuous speech (read sentences from on-screen texts)
EEG device	g.Pangolin, g.SCARABEO (g.tec, Austria), eego sports (ANT Neuro)
EEG channels	128 (g.Pangolin), 64 (g.SCARABEO), 63 (eego sports)
EMG channels	5 ch (healthy: cheek, upper/lower mouth, chin, throat; patient: above/below eye)
Sampling rate	1200 Hz (g.Pangolin, g.SCARABEO), 1024 Hz (eego sports)
Electrode placement	Left hemisphere (g.Pangolin), whole brain (g.SCARABEO, eego sports)
EMG	EOG + upper/lower orbicularis oris (bipolar)
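Because the devices in this dataset record at different rates (1200 Hz and 1024 Hz), joint analysis usually requires bringing them to a common rate first. The sketch below computes the integer up/down factors that a polyphase resampler (for example `scipy.signal.resample_poly(x, up, down)`) would need. The 256 Hz target is an assumption borrowed from the online rate of Dataset 01, and `resample_factors` is a hypothetical helper name.

```python
from fractions import Fraction


def resample_factors(source_fs: int, target_fs: int) -> tuple[int, int]:
    """Return (up, down) such that source_fs * up / down == target_fs."""
    if source_fs <= 0 or target_fs <= 0:
        raise ValueError("sampling rates must be positive integers")
    ratio = Fraction(target_fs, source_fs)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


TARGET_FS = 256  # assumed common rate, not prescribed by the dataset

for device, fs in {"g.Pangolin / g.SCARABEO": 1200, "eego sports": 1024}.items():
    up, down = resample_factors(fs, TARGET_FS)
    print(f"{device}: {fs} Hz -> {TARGET_FS} Hz, up={up}, down={down}")
```

Under these assumptions, 1200 Hz maps to 256 Hz with up=16 and down=75, while 1024 Hz only needs decimation by 4.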

Access

View on GitHub

View on Data Repository

DATASET 03:

64-Class Word/Sentence Reading EEG-EMG Dataset

(TBD)
