August 31, 2022
Every year, more than 69 million people around the world suffer traumatic brain injury, which leaves many of them unable to communicate through speech, typing, or gestures. These people’s lives could dramatically improve if researchers developed a technology to decode language directly from noninvasive brain recordings. Today, we’re sharing research that takes a step toward this goal. We’ve developed an AI model that can decode speech from noninvasive recordings of brain activity.
Our results show that, from three seconds of brain activity, our model can decode the corresponding speech segments with up to 73 percent top-10 accuracy from a vocabulary of 793 words, i.e., a large portion of the words we typically use on a day-to-day basis.
Decoding speech from brain activity has been a long-standing goal of neuroscientists and clinicians, but most of the progress has relied on invasive brain-recording techniques, such as stereotactic electroencephalography and electrocorticography. These devices provide clearer signals than noninvasive methods but require neurosurgical interventions. While results from that work suggest that decoding speech from recordings of brain activity is feasible, decoding speech with noninvasive approaches would provide a safer, more scalable solution that could ultimately benefit many more people. This is very challenging, however, since noninvasive recordings are notoriously noisy and can greatly vary across recording sessions and individuals for a variety of reasons, including differences in each person’s brain and where the sensors are placed.
In our work, we address these challenges by creating a deep learning model trained with contrastive learning, which we then use to maximally align noninvasive brain recordings and speech sounds. To do this, we use wav2vec 2.0, an open source, self-supervised learning model developed by our FAIR team in 2020. We then use this model to identify the complex representations of speech in the brains of volunteers listening to audiobooks.
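As a rough illustration of what "deep representations of speech" means in practice, the sketch below extracts wav2vec 2.0 features from an audio clip using the pretrained checkpoint shipped with torchaudio. The specific checkpoint, file name, and layer choice are illustrative assumptions, not the exact configuration used in our study.

```python
# Minimal sketch: extracting self-supervised speech representations with a
# pretrained wav2vec 2.0 model via torchaudio. Checkpoint, file name, and
# layer choice are illustrative assumptions, not the paper's exact setup.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE            # pretrained wav2vec 2.0
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("audiobook_clip.wav")  # hypothetical file
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # One (batch, frames, features) tensor per transformer layer; any layer
    # (or an average of layers) can serve as the speech representation.
    features, _ = model.extract_features(waveform)

print(features[-1].shape)  # e.g. (1, n_frames, 768) for the last layer
```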
We focused on two noninvasive technologies: electroencephalography and magnetoencephalography (EEG and MEG, for short), which measure the fluctuations of electric and magnetic fields elicited by neuronal activity, respectively. In practice, both systems can take approximately 1,000 snapshots of macroscopic brain activity every second, using hundreds of sensors.
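Concretely, each recording arrives as a sensors-by-time array. The sketch below loads one such recording with the open source MNE-Python library; the library choice and file name are our assumptions for illustration, since any I/O tool that yields such an array would do.

```python
# What a noninvasive recording looks like as data: hundreds of sensors
# sampled roughly 1,000 times per second.
import mne

raw = mne.io.read_raw_fif("meg_session.fif", preload=True)  # hypothetical MEG file
data = raw.get_data()             # shape: (n_sensors, n_samples)

print(raw.info["sfreq"])          # sampling rate, typically ~1,000 Hz
print(data.shape)                 # a few hundred sensors x many samples

# A three-second window of brain activity, the unit our decoder works with:
sfreq = int(raw.info["sfreq"])
window = data[:, : 3 * sfreq]     # (n_sensors, 3 * sfreq)
```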
We leveraged four open source EEG and MEG datasets from academic institutions, capitalizing on more than 150 hours of recordings of 169 healthy volunteers listening to audiobooks and isolated sentences in English and Dutch.
We then input those EEG and MEG recordings into a “brain” model, which consists of a standard deep convolutional network with residual connections. EEG and MEG recordings are known to vary extensively across individuals because of individual brain anatomy, differences in the location and timing of neural functions across brain regions, and the position of the sensors during a recording session. In practice, this means that analyzing brain data generally requires a complex engineering pipeline crafted to realign brain signals on a template brain. In previous studies, brain decoders were trained on a small number of recordings to predict a limited set of speech features, such as part-of-speech categories or words from a small vocabulary. For our research, we designed a new subject-embedding layer, which is trained end-to-end to align all brain recordings in a common space.
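The sketch below shows one way such a brain module could look: a 1D convolutional network with residual connections, plus a learned subject embedding that conditions the network on who is being recorded so that every participant's data lands in a common space. Layer sizes, the number of blocks, and how the embedding is injected are our assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch of a "brain" model: residual 1D convolutions plus a
# per-participant embedding. Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ResidualConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.GELU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        return x + self.conv(x)                # residual connection


class BrainModel(nn.Module):
    def __init__(self, n_sensors: int, n_subjects: int, dim: int = 256, n_blocks: int = 5):
        super().__init__()
        self.spatial = nn.Conv1d(n_sensors, dim, kernel_size=1)   # mix sensors
        self.subject_embedding = nn.Embedding(n_subjects, dim)    # one vector per participant
        self.blocks = nn.Sequential(*[ResidualConvBlock(dim) for _ in range(n_blocks)])
        self.head = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, recording, subject_id):  # recording: (batch, n_sensors, time)
        x = self.spatial(recording)
        # Condition on the participant so all recordings share one space.
        x = x + self.subject_embedding(subject_id).unsqueeze(-1)
        x = self.blocks(x)
        return self.head(x)                    # (batch, dim, time)
```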
Finally, our architecture learns to align the output of this brain model to the deep representations of the speech sounds that were presented to the participants. In our previous work, we used wav2vec 2.0 to show that this speech algorithm automatically learns to generate representations of speech that align with those of the brain. The emergence of “brainlike” representations of speech in wav2vec 2.0 made it a natural choice to build our decoder, because it helps to know which representations we should try to extract from brain signals.
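The alignment itself can be sketched as a contrastive (InfoNCE-style) objective: the brain embedding for a given window of listening should be most similar to the wav2vec 2.0 embedding of the sound actually presented in that window, and dissimilar to the sounds presented in other windows. The time pooling, projection to a shared dimension, and temperature below are illustrative assumptions.

```python
# A hedged sketch of contrastive alignment between brain and speech embeddings.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(brain_emb, speech_emb, temperature: float = 0.1):
    """brain_emb, speech_emb: (batch, dim), one row per window of listening."""
    brain_emb = F.normalize(brain_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = brain_emb @ speech_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each brain window should be most similar to its own speech window.
    return F.cross_entropy(logits, targets)


# Example with time-pooled embeddings projected to a shared dimension (assumed):
brain_emb = torch.randn(32, 256)    # from the brain model, mean-pooled over time
speech_emb = torch.randn(32, 256)   # wav2vec 2.0 features projected to the same dim
loss = contrastive_alignment_loss(brain_emb, speech_emb)
```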
After training, our system performs what’s known as zero-shot classification: Given a snippet of brain activity, it can determine from a large pool of new audio clips which one the person actually heard. From there, the algorithm infers the words the person has most likely heard. This is an exciting step because it shows AI can successfully learn to decode noisy and variable noninvasive recordings of brain activity when speech is perceived. The next step is to see whether we can extend this model to directly decode speech from brain activity without needing the pool of audio clips, i.e., to move toward a safe and versatile speech decoder.
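In code, this zero-shot step amounts to ranking a pool of candidate clips by their similarity to the brain embedding and keeping the most likely ones; a trial counts toward top-10 accuracy if the clip the participant actually heard appears among the first 10. Pool size and helper names below are hypothetical.

```python
# Sketch of zero-shot decoding: score one brain window against a pool of
# unseen speech clips and return the top-10 candidates.
import torch
import torch.nn.functional as F


def rank_candidates(brain_emb, candidate_speech_embs, k: int = 10):
    """brain_emb: (dim,); candidate_speech_embs: (n_candidates, dim)."""
    brain_emb = F.normalize(brain_emb, dim=-1)
    candidates = F.normalize(candidate_speech_embs, dim=-1)
    scores = candidates @ brain_emb              # cosine similarity per clip
    return torch.topk(scores, k=k).indices       # indices of the top-k clips


pool = torch.randn(500, 256)     # embeddings of 500 unseen audio clips (assumed)
query = torch.randn(256)         # embedding of one three-second brain window
top10 = rank_candidates(query, pool)
```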
Our analyses further show that several components of our algorithm, including the use of wav2vec 2.0 and the subject layer, were beneficial to decoding performance. Furthermore, we show that our algorithm improves with the number of EEG and MEG recordings. Practically, this means that our approach benefits from pooling large amounts of heterogeneous data and could, in principle, help improve the decoding of small datasets. This is important because, in many cases, it can be hard to collect a lot of data for a given participant. For example, it isn’t practical to require patients to spend dozens of hours in a scanner to check whether the system works for them. Instead, algorithms could be pretrained on large datasets spanning many individuals and conditions, and then support the decoding of brain activity for a new patient with little data.
The results of our research are encouraging because they show that self-supervised trained AI can successfully decode perceived speech from noninvasive recordings of brain activity, despite the noise and variability inherent in those data. These results are only a first step, however. In this work, we focused on decoding speech perception, but the ultimate goal of enabling patient communication will require extending this work to speech production. This line of research could even reach beyond assisting patients to potentially include enabling new ways of interacting with computers.
More generally, our work is a part of the broader effort by the scientific community to use AI to better understand the human brain. We’re sharing this research openly to accelerate progress on the challenges still ahead. We look forward to working together and contributing to the research community in this area.
Learn more by reading our paper about decoding speech from noninvasive brain recordings.
We’d like to acknowledge contributions to this research from Alexandre Défossez, Charlotte Caucheteux, Ori Kabeli, and Jérémy Rapin.
Data were provided (in part) by the Donders Institute for Brain, Cognition and Behaviour; New York University; the University of Michigan; Trinity College Dublin; and the University of Rochester.