Abstract
This review paper presents an overview of audio-visual source separation (AVSS) systems based on deep learning techniques. It discusses the importance of audio-visual source separation in various domains, including speech recognition, noise reduction, and speech intelligibility enhancement. The review highlights several datasets commonly used to evaluate audio-visual source separation algorithms, such as the GRID dataset, which contains audio and video recordings of speakers reading sentences, and the AVSpeech dataset, which comprises speech video clips with no interfering background noise. The paper also discusses the advantages and limitations of deep-learning-based audio-visual source separation techniques and their potential for real-world applications. Overall, the review emphasizes the growing importance of AVSS as an approach for improving the quality of audio signals.