Data Specifications

This project provides openly accessible multimodal EEG/EMG recordings for brain-computer interface research. All datasets are hosted on OpenNeuro and associated with peer-reviewed publications.

General Overview

Summary of the entire project data structure, supported modalities, and publication mappings.

Audio

Microphone recordings synchronized with EEG, downsampled to 16 kHz. Enables audio-EEG cross-modal analysis and ground truth transcription.
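Because the audio and EEG streams run at different rates, a window chosen in EEG samples has to be converted before it can be cut from the audio. The sketch below shows one way to do that conversion. It assumes both streams start at the same instant and have no clock drift. The function name and the 256 Hz EEG rate in the usage line are illustrative, not part of the released tooling.

```python
AUDIO_RATE_HZ = 16_000  # audio is downsampled to 16 kHz


def audio_window_for_eeg(eeg_start, eeg_stop, eeg_rate_hz, audio_rate_hz=AUDIO_RATE_HZ):
    """Map an EEG sample window [eeg_start, eeg_stop) to audio sample indices.

    Assumption: both recordings share t = 0 and do not drift relative to each other.
    """
    start_s = eeg_start / eeg_rate_hz
    stop_s = eeg_stop / eeg_rate_hz
    return round(start_s * audio_rate_hz), round(stop_s * audio_rate_hz)


# A 1.25 s window starting 2.0 s into a 256 Hz EEG recording.
audio_start, audio_stop = audio_window_for_eeg(512, 832, eeg_rate_hz=256)
print(audio_start, audio_stop)  # 32000 52000
```

If the recordings carry shared trigger markers, aligning on those markers is likely to be more robust than relying on a common start time.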

EMG (Electromyography)

Bipolar facial EMG (orbicularis oris upper/lower, EOG) synchronized with EEG. Used for myoelectric artifact removal and joint EEG/EMG decoding models.
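One simple way to use a synchronized EMG channel for artifact removal is linear regression: estimate how strongly the EMG reference leaks into an EEG channel and subtract that contribution. The sketch below illustrates the idea for a single EEG channel and a single EMG reference. It is a generic textbook approach chosen for illustration, not necessarily the method used in the related publications, and `regress_out` is a hypothetical helper name.

```python
def regress_out(eeg, emg):
    """Subtract the least-squares projection of one EMG reference from one EEG channel.

    Both inputs are equal-length sequences of samples. Returns the mean-centered,
    cleaned EEG channel as a list of floats.
    """
    if len(eeg) != len(emg):
        raise ValueError("eeg and emg must have the same number of samples")
    n = len(eeg)
    mean_eeg = sum(eeg) / n
    mean_emg = sum(emg) / n
    eeg_c = [x - mean_eeg for x in eeg]
    emg_c = [x - mean_emg for x in emg]
    emg_power = sum(m * m for m in emg_c)
    if emg_power == 0.0:
        return eeg_c  # flat reference: nothing to remove
    # Regression coefficient: how much of the EMG reference appears in the EEG.
    beta = sum(e * m for e, m in zip(eeg_c, emg_c)) / emg_power
    return [e - beta * m for e, m in zip(eeg_c, emg_c)]


# Toy example: a "neural" signal contaminated by half of the EMG reference.
neural = [1.0, 1.0, -1.0, -1.0]
emg = [1.0, -1.0, 1.0, -1.0]
contaminated = [s + 0.5 * m for s, m in zip(neural, emg)]
print(regress_out(contaminated, emg))  # [1.0, 1.0, -1.0, -1.0]
```

Regression of this kind also removes any genuine neural activity that happens to correlate with the EMG reference, which is one reason joint EEG/EMG decoding models are an alternative to cleaning.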

Dataset | Task | Device | EEG channels | Hours | Related publication
uhd-speech | Overt / Minimally overt / Covert | g.Pangolin | 128 | ~113 | Sasai et al., bioRxiv 2024
silent-speech-word | Silent speech (word) | eego sports | 63 | 16.1 | Sasai et al., bioRxiv 2024
silent-speech-sentence | Silent speech (sentence) | eego sports | 63 | 13.6 | Sasai et al., bioRxiv 2024
vocalized-eego | Vocalized speech | eego sports | 63 | 49.2 | Sasai et al., bioRxiv 2024
vocalized-scarabeo | Vocalized speech | g.SCARABEO | 62 | 27.7 | Sasai et al., bioRxiv 2024
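For scripting against the collection, the table can be held as a small in-memory catalog. The sketch below copies the rows listed above and totals the recording hours per device. The tuple layout and variable names are illustrative, and the uhd-speech figure is approximate in the source.

```python
from collections import defaultdict

# (dataset, device, EEG channels, hours), copied from the dataset table.
DATASETS = [
    ("uhd-speech", "g.Pangolin", 128, 113.0),  # listed as ~113 h (approximate)
    ("silent-speech-word", "eego sports", 63, 16.1),
    ("silent-speech-sentence", "eego sports", 63, 13.6),
    ("vocalized-eego", "eego sports", 63, 49.2),
    ("vocalized-scarabeo", "g.SCARABEO", 62, 27.7),
]

hours_by_device = defaultdict(float)
for _name, device, _channels, hours in DATASETS:
    hours_by_device[device] += hours

total_hours = sum(hours_by_device.values())
print(f"total: ~{total_hours:.1f} h")
for device, hours in hours_by_device.items():
    print(f"{device}: {hours:.1f} h")
```

With the listed values, the total comes to roughly 219.6 hours, of which about 78.9 hours were recorded with eego sports.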

DATASET 01:

5-Word Isolated Speech EEG-EMG Dataset

3 participants, 5 words

Participants spoke the same word five times consecutively following a 3-count auditory cue, across three speech production modes.

Protocol

Item | Value
Participants | 3 healthy adults (male, age 29–36)
Speech conditions | Overt / Minimally overt / Covert (imagined)
Vocabulary | 5 color words (Japanese)
Task structure | 5 repetitions per word following a 3-count cue, 1.25 s per repetition
EEG device | g.Pangolin (g.tec, Austria)
EEG channels | 128 (8 sheets × 16 electrodes)
Sampling rate | 256 Hz (online)
Electrode placement | Left-hemisphere language areas
EMG | EOG + upper/lower orbicularis oris (bipolar)
Ethics | Shiba Palace Clinic Ethics Review Committee; Declaration of Helsinki
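The task timing translates directly into sample windows. At 256 Hz, one 1.25 s repetition spans 320 samples, so a block of five repetitions covers 1600 samples. The sketch below computes the per-repetition windows. It assumes the repetitions are back to back and that the onset of the first one is known (for example from a trigger marker), neither of which is specified in the table. The function name is illustrative.

```python
SAMPLING_RATE_HZ = 256   # online sampling rate from the protocol table
REPETITION_S = 1.25      # duration of one repetition
N_REPETITIONS = 5        # repetitions per word


def repetition_windows(first_onset_sample):
    """Return [start, stop) sample windows for each repetition in one block.

    Assumption: repetitions follow each other without gaps, starting at
    first_onset_sample.
    """
    samples_per_rep = round(REPETITION_S * SAMPLING_RATE_HZ)  # 320
    return [
        (first_onset_sample + i * samples_per_rep,
         first_onset_sample + (i + 1) * samples_per_rep)
        for i in range(N_REPETITIONS)
    ]


print(repetition_windows(0))
# [(0, 320), (320, 640), (640, 960), (960, 1280), (1280, 1600)]
```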

Offline Decoding Performance
(EEGNet, for reference)

Condition | Average accuracy (3 subjects)
Overt | 94.6%
Minimally overt | 94.9%
Covert | 91.1%
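For context, a five-class task has a chance level of 20% if the classes are balanced. The short calculation below puts the reported accuracies next to that baseline. The balanced-class assumption is ours and is not stated in the table.

```python
N_CLASSES = 5  # five color words
ACCURACY_PCT = {"Overt": 94.6, "Minimally overt": 94.9, "Covert": 91.1}

# Chance level under the assumption that all classes are equally frequent.
chance_pct = 100.0 / N_CLASSES
mean_pct = sum(ACCURACY_PCT.values()) / len(ACCURACY_PCT)

for condition, acc in ACCURACY_PCT.items():
    print(f"{condition}: {acc:.1f}% ({acc - chance_pct:.1f} points above chance)")
print(f"mean across conditions: {mean_pct:.1f}%")
```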

Real-world Application

Using this dataset, we demonstrated a real-time EEG-based Gmail control interface: participants navigated their inbox and generated AI-assisted email replies by having 5 color words decoded from their EEG (green, magenta, orange, violet, yellow).

DATASET 02:

Open-Vocabulary Sentence Reading EEG-EMG Dataset

3 (long-term) participants, open vocabulary

Participants read sentences aloud at natural speed from novels, text-based video games, and the JSUT corpus, producing long-form speech.

Protocol

Item | Value
Participants | 3 healthy adults (male, age 22–44)
Speech conditions | Overt
Vocabulary | Open (Japanese)
Task structure | Continuous speech (reading sentences from on-screen text)
EEG device | g.Pangolin, g.SCARABEO (g.tec, Austria), eego sports (ANT Neuro)
EEG channels | 128 (g.Pangolin), 64 (g.SCARABEO), 63 (eego sports)
EMG channels | 5 ch (healthy: cheek, upper/lower mouth, chin, throat; patient: above/below eye)
Sampling rate | 1200 Hz (g.Pangolin, g.SCARABEO), 1024 Hz (eego sports)
Electrode placement | Left hemisphere (g.Pangolin), whole brain (g.SCARABEO, eego sports)
EMG | EOG + upper/lower orbicularis oris (bipolar)
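Since the devices record at different rates, pooling their data usually means resampling to a common rate first. The sketch below computes the integer up/down factors needed for rational resampling. The 256 Hz target is an assumption chosen only for illustration, and `resample_factors` is a hypothetical helper.

```python
from fractions import Fraction

# Native sampling rates from the protocol table.
DEVICE_RATE_HZ = {"g.Pangolin": 1200, "g.SCARABEO": 1200, "eego sports": 1024}
TARGET_RATE_HZ = 256  # assumed common rate, not prescribed by the dataset


def resample_factors(source_hz, target_hz):
    """Return (up, down) such that source_hz * up / down == target_hz."""
    ratio = Fraction(target_hz, source_hz)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


for device, rate in DEVICE_RATE_HZ.items():
    up, down = resample_factors(rate, TARGET_RATE_HZ)
    print(f"{device}: {rate} Hz -> {TARGET_RATE_HZ} Hz with up={up}, down={down}")
```

The resulting factors (16/75 for 1200 Hz, 1/4 for 1024 Hz) are in the form expected by polyphase resamplers such as `scipy.signal.resample_poly(x, up, down)`.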

DATASET 03:

64-Class Word/Sentence Reading EEG-EMG Dataset

(TBD)