An Overview of Audio-Visual Source Separation Using Deep Learning

How to Cite

An Overview of Audio-Visual Source Separation Using Deep Learning. (2023). Al-Khwarizmi Engineering Journal, 19(4), 42-55. https://doi.org/10.22153/kej.2023.06.003

Abstract

This article presents a general overview of deep learning-based audio-visual source separation (AVSS) systems. AVSS has achieved exceptional results in several areas, including reducing noise levels, boosting speech recognition, and improving audio quality. The review surveys a range of recent AVSS experiments and discusses the advantages and disadvantages of each deep learning model. It also summarizes useful datasets for evaluating AVSS systems, such as the TCD-TIMIT dataset (high-quality audio and video recordings created specifically for speech recognition tasks) and the VoxCeleb dataset (a large collection of short audio-visual clips of human speech). Overall, this review aims to highlight the growing importance of AVSS in improving the quality of audio signals.
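Many of the systems cited in the reference list share a common mask-based recipe: an audio encoder processes the mixture spectrogram, a visual encoder processes face or lip features of the target speaker, and a fused representation predicts a time-frequency mask that isolates that speaker. The short PyTorch sketch below is purely illustrative of this general idea; the module names, dimensions, and layer choices are assumptions and do not reproduce any particular model from the surveyed papers.

```python
# Illustrative sketch of a mask-based audio-visual separator (assumed architecture,
# not taken from any specific paper in the reference list).
import torch
import torch.nn as nn

class SimpleAVSeparator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # Audio branch: encode magnitude spectrogram frames of the mixture.
        self.audio_enc = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        # Visual branch: project per-frame face/lip embeddings (e.g., from a pretrained CNN).
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Fusion + mask head: predict a time-frequency mask for the target speaker.
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq),
            nn.Sigmoid(),
        )

    def forward(self, mix_spec, visual_feats):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        # visual_feats: (batch, time, visual_dim) visual embeddings aligned to audio frames
        a, _ = self.audio_enc(mix_spec)
        v = self.visual_proj(visual_feats)
        mask = self.mask_head(torch.cat([a, v], dim=-1))
        return mask * mix_spec  # estimated target-speaker spectrogram

# Usage: batch of 4 clips, 100 frames, 257 frequency bins, 512-dim visual embeddings.
model = SimpleAVSeparator()
est = model(torch.rand(4, 100, 257), torch.rand(4, 100, 512))
print(est.shape)  # torch.Size([4, 100, 257])
```

In practice, the masked spectrogram is combined with the mixture phase and inverted with an inverse STFT; time-domain and beamforming variants in the references below replace the mask head with other front ends, but the audio-visual fusion step is conceptually similar.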

References

A. Al-Tmeme, W. L. Woo, S. S. Dlay and B. Gao, "Underdetermined Convolutive Source Separation Using GEM-MU with Variational Approximated Optimum Model Order NMF2D," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 35-49, Jan. 2017. http://dx.doi.org/10.1109/TASLP.2016.2620600.

Woo, W.L.; Dlay, S.S.; Al-Tmeme, A.; Gao, B. "Reverberant signal separation using optimized complex sparse nonnegative tensor deconvolution on spectral covariance matrix". Digit. Signal Process. 2018, 83, 9–23. http://dx.doi.org/10.1016/j.dsp.2018.07.018

Al-Tmeme, A.; Woo, W.L.; Dlay, S.; Gao, B. "Single channel informed signal separation using artificial-stereophonic mixtures and exemplar-guided matrix factor deconvolution". Int. J. Adapt. Control. Signal Process. 2018, 32, 1259–1281. http://dx.doi.org/10.1002/acs.2912.

Ahmed Al-Tmeme, W.L. Woo, S.S. Dlay, and B. Gao, "Underdetermined reverberant acoustic source separation using weighted full-rank nonnegative tensor models," J. Acoust. Soc. Am, 138, 3411, 2015. http://dx.doi.org/10.1121/1.4923156.

Amer, R., and Al Tmeme, A. "Hybrid deep learning model for singing voice separation". Mendel 27, 2 (2021), 44–50. http://dx.doi.org/10.13164/mendel.2021.2.044.

I. N. Mahmood and H. S. Abdullah, "Telecom Churn Prediction Based on Deep Learning Approach," Iraqi Journal of Science, vol. 63, no. 6, 2022. http://dx.doi.org/10.24996/ijs.2022.63.6.32.

H. K. Jameel and B. N. Dhannoon, "Gait Recognition Based on Deep Learning," Iraqi Journal of Science, vol. 63, no. 1, 2022. http://dx.doi.org/10.24996/ijs.2022.63.1.36.

R. M. J. Al-Akkam and M. S. M. Altaei, "Plants Leaf Diseases Detection Using Deep Learning," Iraqi Journal of Science, vol. 63, no. 2, 2022. http://dx.doi.org/10.24996/ijs.2022.63.2.34.

N. A. K. Hussein and B. Al-Sarray, "Deep Learning and Machine Learning via a Genetic Algorithm to Classify Breast Cancer DNA Data," Iraqi Journal of Science, vol. 63, no. 7, 2022. http://dx.doi.org/10.24996/ijs.2022.63.7.36.

N. Takahashi, M. K. Singh, S. Basak, P. Sudarsanam, S. Ganapathy and Y. Mitsufuji, "Improving Voice Separation by Incorporating End-To-End Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 41-45, http://dx.doi.org/10.1109/ICASSP40776.2020.9053845.

S.-W. Chung, S. Choe, J. S. Chung, and H.-G. Kang, "FaceFilter: Audio-visual speech separation using still images," Proc. of Interspeech, 2020. http://dx.doi.org/10.21437/Interspeech.2020-1065.

R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, "Multi-modal multi-channel target speech separation," IEEE Journal of Selected Topics in Signal Processing, 2020. http://dx.doi.org/10.1109/JSTSP.2020.2980956.

Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, "ADL-MVDR: All deep learning MVDR beamformer for target speech separation," ICASSP, pp. 6089–6093, 2021. http://dx.doi.org/10.1109/ICASSP39728.2021.9413594.

G. Li, J. Yu, J. Deng, X. Liu and H. Meng, "Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 6042-6046, http://dx.doi.org/10.1109/ICASSP43922.2022.9747237.

J. Ong, B. T. Vo, S. Nordholm, B. -N. Vo, D. Moratuwage and C. Shim, "Audio-Visual Based Online Multi-Source Separation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1219-1234, 2022. http://dx.doi.org/10.1109/TASLP.2022.3156758.

T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proc. of Interspeech, 2018. http://dx.doi.org/10.21437/Interspeech.2018-1400.

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audiovisual model for speech separation," ACM Trans. Graph., pp. 112:1–112:11, 2018. http://dx.doi.org/10.1145/3197517.3201357.

Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018. http://dx.doi.org/10.1109/TETCI.2017.2784878.

R. Lu, Z. Duan, and C. Zhang, "Listen and look: audio–visual matching assisted speech source separation," IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1315–1319,2018. http://dx.doi.org/10.1109/LSP.2018.2853566.

H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, "The sound of pixels," in Proc. of ECCV, 2018. http://dx.doi.org/10.1007/978-3-030-01246-5_35.

A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, "Seeing through noise: Visually driven speaker separation and enhancement," in Proc. of ICASSP, 2018. http://dx.doi.org/10.1109/ICASSP.2018.8462527.

M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, "DNN driven speaker independent audio-visual mask estimation for speech separation," in Proc. of Interspeech, 2018. http://dx.doi.org/10.21437/Interspeech.2018-2516.

G. Morrone, S. Bergamaschi, L. Pasa, L. Fadiga, V. Tikhanoff, and L. Badino, "Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6900–6904. http://dx.doi.org/10.1109/ICASSP.2019.8682061.

J. Wu, Y. Xu, S. Zhang, L. Chen, M. Yu, L. Xie, and D. Yu, "Time domain audio visual speech separation," in Proc. IEEE Autom. Speech Recognit. Understanding Workshop, 2019, pp. 667–673. http://dx.doi.org/10.1109/ASRU46091.2019.9003983.

M. Gogate et al., "Deep Neural Network Driven Binaural Audio Visual Speech Separation," in International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–7. http://dx.doi.org/10.1109/IJCNN48605.2020.9207517.

Q. Nguyen, J. Richter, M. Lauri, T. Gerkmann and S. Frintrop, "Improving mix-and-separate training in audio-visual sound source separation with an object prior," in Proc. International Conference on Pattern Recognition (ICPR), 2020. http://dx.doi.org/10.1109/ICPR48806.2021.9412174.

C. Li and Y. Qian, "Deep Audio-Visual Speech Separation with Attention Mechanism," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7314-7318, http://dx.doi.org/10.1109/ICASSP40776.2020.9054180.

C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba. "Music gesture for visual sound separation". In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10475–10484, 2020. http://dx.doi.org/10.1109/CVPR42600.2020.01049.

L. Zhu and E. Rahtu, "Visually guided sound source separation using cascaded opponent filter network," in Proc. of ACCV, 2020. http://dx.doi.org/10.1007/978-3-030-69544-6_25.

K. Tan, Y. Xu, S.-X. Zhang, M. Yu, and D. Yu, "Audio-visual speech separation and dereverberation with a two-stage multimodal network," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, pp. 542–553, 2020. http://dx.doi.org/10.1109/JSTSP.2020.2987209.

L. Qu, C. Weber, and S. Wermter, "Multimodal target speech separation with voice and face references," Proc. of Interspeech, 2020. http://dx.doi.org/10.21437/Interspeech.2020-1697.

T. Rahman and L. Sigal, "Weakly-Supervised Audio-Visual Sound Source Detection and Separation," IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 2021, pp. 1-6. http://dx.doi.org/10.1109/ICME51207.2021.9428196.

R. Gao and K. Grauman, "VisualVoice: Audio-visual speech separation with cross-modal consistency," in Proc. CVPR, 2021. http://dx.doi.org/10.1109/CVPR46437.2021.01524.

Majumder, S., Al-Halah, Z., Grauman, K.: "Move2Hear: Active audio-visual source separation". In: ICCV (2021). http://dx.doi.org/10.1109/ICCV48922.2021.00034.

Y. Liu and Y. Wei, "Multi-Modal Speech Separation Based on Two-Stage Feature Fusion," IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 2021, pp. 800-805, http://dx.doi.org/10.1109/ICSIP52628.2021.9688674.

Y. Tian, D. Hu and C. Xu, "Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 2744-2753. http://dx.doi.org/10.1109/CVPR46437.2021.00277.

N. Makishima, M. Ihori, A. Takashima, T. Tanaka, S. Orihashi, and R. Masumura, "Audio-visual speech separation using cross-modal correspondence loss," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6673–6677. http://dx.doi.org/10.1109/ICASSP39728.2021.9413491.

V.-N. Nguyen, M. Sadeghi, E. Ricci and X. Alameda-Pineda, "Deep Variational Generative Models for Audio-Visual Speech Separation," IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), Gold Coast, Australia, 2021, pp. 1–6. http://dx.doi.org/10.1109/MLSP52302.2021.9596406.

Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, and Kwanghoon Sohn, "Looking into your speech: Learning cross-modal affinity for audio-visual speech separation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1336–1345. http://dx.doi.org/10.1109/CVPR46437.2021.00139.

Lingyu Zhu and Esa Rahtu. "Leveraging category information for single-frame visual sound source separation," In IEEE 2021 9th European Workshop on Visual Information Processing (EUVIP), pages 1–6. http://dx.doi.org/10.1109/EUVIP50544.2021.9484036.

R. Gu, S. -X. Zhang, Y. Zou and D. Yu, "Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 849-862, 2023. http://dx.doi.org/10.1109/TASLP.2022.3229261.

T. Oya, S. Iwase and S. Morishima, "The Sound of Bounding-Boxes," 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 2022, pp. 9-15. http://dx.doi.org/10.1109/ICPR56361.2022.9956384.

Lingyu Zhu and Esa Rahtu. "Visually guided sound source separation and localization using self-supervised motion representations" In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1289–1299, 2022. http://dx.doi.org/10.1109/WACV51458.2022.00223.

D. -H. Pham, Q. -A. Do, T. T. -H. Duong, T. -L. Le and P. -L. Nguyen, "End-to-end Visual-guided Audio Source Separation with Enhanced Losses," Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 2022, pp. 2022-2028. http://dx.doi.org/10.23919/APSIPAASC55919.2022.9980162.

Xudong Xu, Bo Dai, and Dahua Lin, "Recursive visual sound separation using minus-plus net," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 882–891. http://dx.doi.org/10.1109/ICCV.2019.00097.

Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba, "The sound of motions," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1735–1744. http://dx.doi.org/10.1109/ICCV.2019.00182.

R. Lu, Z. Duan, and C. Zhang, "Audio-visual deep clustering for speech separation," IEEE ACM Trans. Audio Speech Lang. Process., vol. 27, no. 11, pp. 1697–1712, 2019. http://dx.doi.org/10.1109/TASLP.2019.2928140.

Ruohan Gao and Kristen Grauman, "Co-separating sounds of visual objects," In Proc. ICCV, 2019. http://dx.doi.org/10.1109/ICCV.2019.00398.

M. Cooke, J. Barker, S. Cunningham, X. Shao. "An audio-visual corpus for speech perception and automatic speech recognition" The Journal of the Acoustical Society of America, vol.120, no.5, pp.2421–2424, 2006. http://dx.doi.org/10.1121/1.2229005.

N. Alghamdi, S. Maddock, R. Marxer, J. Barker, G. J. Brown. "A corpus of audio-visual Lombard speech with frontal and profile views" The Journal of the Acoustical Society of America, vol.143, no.6, pp.EL523–EL529, 2018. http://dx.doi.org/10.1121/1.5042758.

N. Harte, E. Gillen. "TCD-TIMIT: An audio-visual corpus of continuous speech". IEEE Transactions on Multimedia, vol.17, no.5, pp.603–615, 2015. http://dx.doi.org/10.1109/TMM.2015.2407694.

G. Y. Zhao, M. Barnard, M. Pietikainen. "Lipreading with local spatiotemporal descriptors". IEEE Transactions on Multimedia, vol.11, no.7, pp.1254–1265, 2009. http://dx.doi.org/10.1109/TMM.2009.2030637.

I. Anina, Z. H. Zhou, G. Y. Zhao, M. Pietikäinen. "OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis". In Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, IEEE, Ljubljana, Slovenia, pp. 1–5, 2015. http://dx.doi.org/10.1109/FG.2015.7163155.

A. Nagrani, J. S. Chung, A. Zisserman. "VoxCeleb: A large-scale speaker identification dataset". In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp.2616−2620, 2017. http://dx.doi.org/10.21437/Interspeech.2017-950.

J. S. Chung, A. Nagrani, A. Zisserman. "VoxCeleb2: Deep speaker recognition". In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, pp. 1086–1090, 2018.

J. S. Chung, A. Zisserman. "Lip reading in the wild". In Proceedings of the 13th Asian Conference on Computer Vision, Springer, Taipei, China, pp.87−103, 2017. http://dx.doi.org/10.1007/978-3-319-54184-6_6.

J. S. Chung, A. Senior, O. Vinyals, A. Zisserman. "Lip reading sentences in the wild". In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp.3444−3453, 2017. http://dx.doi.org/10.1109/CVPR.2017.367.

J. S. Chung, A. Zisserman. "Lip reading in profile". In Proceedings of British Machine Vision Conference 2017, BMVA Press, London, UK, 2017. http://dx.doi.org/10.5244/C.31.155.

J. Roth et al., "Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 4492-4496 http://dx.doi.org/10.1109/ICASSP40776.2020.9053900.
