BCI DATASET PROVENANCE SCORECARD
================================

If you claim to hear the brain, publish the score.

I built this as a practical rubric for EEG, ear-EEG, MEG, fNIRS, EMG-assisted BCI, and hybrid neurotech releases. The standard is simple: no more vibes-based cognition claims, no more empty repositories, no more "trust us" telemetry.

QUICK RUBRIC
------------

Score each dimension 0, 1, or 2.

1) Dataset identity
   0 = no clear version / DOI / URL
   1 = some identifiers
   2 = versioned, cited, stable DOI / URL

2) License
   0 = no license
   1 = custom or ambiguous terms
   2 = clear OSI-compatible or explicit research terms

3) Integrity
   0 = no checksums
   1 = partial hashes
   2 = full SHA256.manifest for all artifacts

4) Raw signals
   0 = no raw release
   1 = raw subset only
   2 = raw data released in structured layout

5) Preprocessing
   0 = hand-wavy methods
   1 = some scripts / settings
   2 = exact pipeline with scripts and parameters

6) Timing / event map
   0 = no trigger map
   1 = incomplete events
   2 = trigger schema, timing diagram, synced annotations

7) Hardware chain
   0 = device unclear
   1 = partial hardware info
   2 = sensor model, montage, reference, gain, sampling, firmware

8) Electrode / fit quality
   0 = no impedance / contact info
   1 = one-time snapshot
   2 = per-session impedance / contact quality logs

9) Noise accounting
   0 = no controls
   1 = mentions artifact rejection
   2 = explicit controls for jaw, blink, motion, heartbeat, room vibration

10) Code provenance
   0 = no code commit / container
   1 = repo only
   2 = commit hash plus lockfile or container digest

11) Subject / task metadata
   0 = thin prose only
   1 = partial protocol
   2 = full task protocol, timing, conditions, exclusions

12) Governance
   0 = no consent / de-ID note
   1 = minimal ethics note
   2 = consent version, de-ID method, usage boundaries

13) Null artifacts
   0 = missing files vanish silently
   1 = missingness noted informally
   2 = missing artifacts logged explicitly as missing

14) Reproducibility
   0 = no rerun path
   1 = manual recreation possible
   2 = seeds, environment, scripts, deterministic or bounded replay

SCORE INTERPRETATION
--------------------

0-10  = marketing fog
11-18 = interesting, but not trustworthy enough
19-24 = usable with caution
25-28 = serious scientific release

MINIMUM ARTIFACT BUNDLE
-----------------------

A release is not complete unless it includes, at minimum:

- README describing task, protocol, exclusions, and intended use
- LICENSE
- SHA256.manifest
- raw signal files in BIDS or a clearly documented equivalent
- derivatives or preprocessed outputs with exact generation steps
- event annotations and trigger mapping
- hardware manifest
- environment manifest
- code commit hash and dependency lockfile
- consent / de-identification statement
- explicit record of anything missing

HARDWARE MANIFEST: WHAT TO PUBLISH
----------------------------------

If the acquisition chain is vague, the result is theater. Publish:

- device manufacturer and exact model
- firmware version
- channel count
- sampling rate
- reference scheme
- electrode positions / montage
- earbud fit method if ear-EEG
- amplifier gain and filters
- impedance / contact quality logs
- auxiliary channels present or absent: EMG, ECG or PPG, accelerometer, gyro, audio, respiration

MECHANOSENSITIVE NOISE QUARANTINE
---------------------------------

Before calling a signal cognitive, release controls for:

- jaw clench
- swallowing
- speech / subvocalization
- blink / saccade
- head movement
- posture shift
- heartbeat leakage
- ambient audio
- room vibration / desk coupling
- cable motion / connector noise
- impedance drift across the session

If you do not rule these out, you may be publishing a beautifully branded microphone for the skull.

MISSINGNESS MUST BE EXPLICIT
----------------------------

Silence is not consent, and absence is not proof of cleanliness.

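One way to make absence explicit is to audit the release directory against its SHA256.manifest and emit a record for every artifact that is absent or altered. The sketch below is a minimal Python illustration, not a reference tool. It assumes the manifest uses `sha256sum`-style lines (`<hex digest>  <relative path>`), which this scorecard does not mandate. The function names, the "corrupt" status, and the reason strings are hypothetical. The record fields mirror the example record in this section.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def parse_manifest(text):
    """Parse '<sha256-hex>  <relative/path>' lines into {path: digest}.

    Assumed format: sha256sum-style output. Blank lines and '#' comments
    are skipped; a leading '*' (binary-mode marker) is dropped from paths.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, path = line.split(maxsplit=1)
        entries[path.lstrip("*")] = digest.lower()
    return entries


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large recordings are not loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_release(root, manifest_text, now=None):
    """Return one explicit record per artifact that is absent or altered."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    records = []
    for rel_path, expected in sorted(parse_manifest(manifest_text).items()):
        target = Path(root) / rel_path
        if not target.is_file():
            # Listed in the manifest but not on disk.
            status, reason = "missing", "file not found in release"
        elif sha256_of(target) != expected:
            # Hypothetical extra status: present, but the hash disagrees.
            status, reason = "corrupt", "sha256 mismatch against manifest"
        else:
            continue
        records.append({
            "artifact": rel_path,
            "status": status,
            "reason": reason,
            "discovered_at": stamp,
            # Default only; a maintainer should set this deliberately.
            "replacement_expected": False,
        })
    return records
```

Calling `audit_release("path/to/release", manifest_text)` returns a list of dicts that can be written out with `json.dump` and shipped alongside the data. An empty list means every listed artifact is present and matches its hash. It says nothing about files that were never listed in the manifest.
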

Example missing-artifact record:

{
  "artifact": "raw/session_07_eeg.edf",
  "status": "missing",
  "reason": "corrupt upload",
  "discovered_at": "2026-03-05T20:00:00Z",
  "replacement_expected": false
}

MINIMAL METADATA SCHEMA
-----------------------

{
  "dataset_id": "example_bci_release_v1",
  "version": "1.0.0",
  "doi_or_url": "",
  "license": "",
  "modality": ["EEG"],
  "task": "motor imagery / auditory imagery / preference classification / etc",
  "subjects": 0,
  "sessions": 0,
  "sampling_rate_hz": 0,
  "hardware": {
    "manufacturer": "",
    "model": "",
    "firmware": "",
    "reference_scheme": "",
    "montage": "",
    "impedance_log_present": false
  },
  "aux_channels": {
    "emg": false,
    "ecg_or_ppg": false,
    "accelerometer": false,
    "audio": false,
    "respiration": false
  },
  "artifacts": {
    "raw_data_present": false,
    "derivatives_present": false,
    "preprocessing_scripts_present": false,
    "event_map_present": false,
    "sha256_manifest_present": false,
    "code_commit_present": false,
    "environment_lockfile_present": false
  },
  "noise_controls": {
    "jaw_control": false,
    "blink_control": false,
    "motion_control": false,
    "heartbeat_control": false,
    "room_vibration_control": false
  },
  "governance": {
    "consent_version": "",
    "deidentification_method": "",
    "usage_restrictions": "",
    "missing_artifacts_logged": false
  }
}

MY BIAS, STATED OPENLY
----------------------

I care less about whether a dataset is fashionable than whether it is inspectable. A mediocre open dataset with timing maps, impedance logs, hashes, and honest caveats is more valuable than a glamorous closed demo claiming to decode desire from jaw tremor.

If the future of BCI is going to sing, it needs tuning forks, not incense.
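
A practical footnote: the rubric tally is mechanical enough to script, which keeps the arithmetic honest when several reviewers score the same release. The Python sketch below assumes scores arrive as a dict keyed by dimension. The key names are illustrative shorthand for the fourteen rubric dimensions, not an established schema, and `score_release` is a hypothetical helper. The bands are the ones listed under SCORE INTERPRETATION.

```python
# Illustrative keys for the 14 rubric dimensions, in rubric order.
DIMENSIONS = [
    "dataset_identity", "license", "integrity", "raw_signals",
    "preprocessing", "timing_event_map", "hardware_chain",
    "electrode_fit_quality", "noise_accounting", "code_provenance",
    "subject_task_metadata", "governance", "null_artifacts",
    "reproducibility",
]

# (inclusive upper bound, label) pairs from the interpretation table.
BANDS = [
    (10, "marketing fog"),
    (18, "interesting, but not trustworthy enough"),
    (24, "usable with caution"),
    (28, "serious scientific release"),
]


def score_release(scores):
    """Validate per-dimension scores and return (total, interpretation)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        # An unscored dimension is itself a provenance gap; refuse to guess.
        raise ValueError(f"unscored dimensions: {missing}")
    unknown = [k for k in scores if k not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")
    for name in DIMENSIONS:
        value = scores[name]
        if type(value) is not int or value not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")
    total = sum(scores[d] for d in DIMENSIONS)
    label = next(text for upper, text in BANDS if total <= upper)
    return total, label
```

For example, a release scored 2 on every dimension except a 0 for governance totals 26 and lands in the top band. The total alone hides that gap, so publish the per-dimension scores next to it.
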