July 11, 2020
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities than either one alone. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%).
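The core idea of attention-based fusion described above can be sketched as a convex combination of per-class scores from the two modalities, where the mixing weights are themselves predicted from the modality outputs. The sketch below is a minimal, hypothetical illustration in NumPy (the projection matrices `W_a`, `W_v` and the exact parameterization are assumptions for illustration, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(audio_scores, visual_scores, W_a, W_v):
    """Fuse per-class scores from two modalities with attention weights.

    audio_scores, visual_scores: arrays of shape (num_classes,)
    W_a, W_v: learned projections of shape (num_classes, num_classes)
              (hypothetical parameterization for this sketch)
    Returns fused scores of shape (num_classes,).
    """
    # Stack modality scores: shape (2, num_classes)
    scores = np.stack([audio_scores, visual_scores])
    # Attention logits computed from each modality's own scores
    att_logits = np.stack([W_a @ audio_scores, W_v @ visual_scores])
    # Normalize across the modality axis so weights per class sum to 1
    att = softmax(att_logits, axis=0)
    # Weighted sum: per class, a convex combination of the two modalities
    return (att * scores).sum(axis=0)
```

Because the attention weights are a per-class softmax over the two modalities, the fused score for each class always lies between the audio and visual scores, letting the model lean on whichever modality is more reliable for that sound class.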
Written by
Haytham M. Fayek
Anurag Kumar
Publisher
International Joint Conference on Artificial Intelligence (IJCAI)