An Overview of Audio-Visual Source Separation Using Deep Learning

Authors

  • Noorulhuda Mudhafar Sulaiman, Department of Information and Communications Engineering, Al-Khwarizmi College of Engineering, University of Baghdad, Baghdad, Iraq
  • Ahmed Al Tmeme, Department of Information and Communications Engineering, Al-Khwarizmi College of Engineering, University of Baghdad, Baghdad, Iraq
  • Mohammed Najah Mahdi, ADAPT Centre, School of Computing, Dublin City University, Dublin D09 DXA0, Ireland

DOI:

https://doi.org/10.22153/kej.2023.06.003

Abstract

This review paper provides an overview of audio-visual source separation (AVSS) systems based on deep learning techniques. It discusses the importance of audio-visual source separation in various areas, including speech recognition, noise reduction, and speech intelligibility enhancement. The review highlights several datasets commonly used to evaluate audio-visual source separation algorithms, such as the Grid dataset, which contains audio and video recordings of speakers reading sentences, and the AVSpeech dataset, which comprises speech video clips without interfering background noise. The paper also discusses the advantages and limitations of deep-learning-based audio-visual source separation techniques and their potential for real-world applications. Overall, the review emphasizes the growing importance of AVSS as a technique for improving the quality of audio signals.
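To make the class of systems being reviewed concrete, the following is a minimal, illustrative sketch (in PyTorch) of a mask-based audio-visual separation network: it fuses a mixture magnitude spectrogram with per-frame visual features and predicts a time-frequency mask for the target speaker. All layer sizes and the overall layout are hypothetical placeholders, not the architecture of any specific paper surveyed here.

    # Minimal audio-visual mask-estimation sketch (hypothetical sizes, not a surveyed model).
    import torch
    import torch.nn as nn

    class SimpleAVSeparator(nn.Module):
        def __init__(self, n_freq=257, visual_dim=512, hidden=256):
            super().__init__()
            # Audio stream: encode each frame of the noisy mixture spectrogram.
            self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
            # Visual stream: encode per-frame visual (lip/face) embeddings.
            self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
            # Fusion: model temporal context over the concatenated audio-visual features.
            self.fusion = nn.LSTM(2 * hidden, hidden, num_layers=2,
                                  batch_first=True, bidirectional=True)
            # Mask head: sigmoid mask applied to the mixture magnitude spectrogram.
            self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

        def forward(self, mix_mag, visual_feats):
            # mix_mag: (batch, time, n_freq); visual_feats: (batch, time, visual_dim),
            # with the visual features assumed upsampled to the audio frame rate.
            a = self.audio_enc(mix_mag)
            v = self.visual_enc(visual_feats)
            h, _ = self.fusion(torch.cat([a, v], dim=-1))
            return self.mask(h) * mix_mag  # estimated target magnitude spectrogram

    # Usage with random tensors standing in for real Grid/AVSpeech features.
    model = SimpleAVSeparator()
    est = model(torch.randn(2, 100, 257).abs(), torch.randn(2, 100, 512))
    print(est.shape)  # torch.Size([2, 100, 257])

In a complete system the visual embeddings would typically come from a pretrained face or lip-reading network, and the estimated magnitude would be combined with the mixture phase and inverted back to a waveform.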

References

A. Al-Tmeme, W. L. Woo, S. S. Dlay and B. Gao, "Underdetermined Convolutive Source Separation Using GEM-MU with Variational Approximated Optimum Model Order NMF2D," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 35-49, Jan. 2017. http://dx.doi.org/10.1109/TASLP.2016.2620600.

Woo, W.L.; Dlay, S.S.; Al-Tmeme, A.; Gao, B. "Reverberant signal separation using optimized complex sparse nonnegative tensor deconvolution on spectral covariance matrix". Digit. Signal Process. 2018, 83, 9–23. http://dx.doi.org/10.1016/j.dsp.2018.07.018

Al-Tmeme, A.; Woo, W.L.; Dlay, S.; Gao, B. "Single channel informed signal separation using artificial-stereophonic mixtures and exemplar-guided matrix factor deconvolution". Int. J. Adapt. Control. Signal Process. 2018, 32, 1259–1281. http://dx.doi.org/10.1002/acs.2912.

Ahmed Al-Tmeme, W.L. Woo, S.S. Dlay, and B. Gao, "Underdetermined reverberant acoustic source separation using weighted full-rank nonnegative tensor models," J. Acoust. Soc. Am, 138, 3411, 2015. http://dx.doi.org/10.1121/1.4923156.

Amer, R., and Al Tmeme, A. "Hybrid deep learning model for singing voice separation". Mendel 27, 2 (2021), 44–50. http://dx.doi.org/10.13164/mendel.2021.2.044.

Mahmood, Israa N. and Hasanen S. Abdullah, "Telecom Churn Prediction Based on Deep Learning Approach" (2022) 63(6) Iraqi Journal of Science. http://dx.doi.org/10.24996/ijs.2022.63.6.32.

Jameel, Humam Khaled and Ban Nadeem Dhannoon, "Gait Recognition Based on Deep Learning" (2022) 63(1) Iraqi Journal of Science. http://dx.doi.org/10.24996/ijs.2022.63.1.36.

Al-Akkam, Reem Mohammed Jasim and Mohammed Sahib Mahdi Altaei, "Plants Leaf Diseases Detection Using Deep Learning" (2022) 63(2) Iraqi Journal of Science. http://dx.doi.org/10.24996/ijs.2022.63.2.34.

Hussein, Noor Alhuda Khalid and Basad Al-Sarray, "Deep Learning and Machine Learning via a Genetic Algorithm to Classify Breast Cancer DNA Data" (2022) 63(7) Iraqi Journal of Science. http://dx.doi.org/10.24996/ijs.2022.63.7.36.

N. Takahashi, M. K. Singh, S. Basak, P. Sudarsanam, S. Ganapathy and Y. Mitsufuji, "Improving Voice Separation by Incorporating End-To-End Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 41-45, http://dx.doi.org/10.1109/ICASSP40776.2020.9053845.

S.-W. Chung, S. Choe, J. S. Chung, and H.-G. Kang, "FaceFilter: Audio-visual speech separation using still images," Proc. of Interspeech, 2020. http://dx.doi.org/10.21437/Interspeech.2020-1065.

R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, "Multi-modal multi-channel target speech separation," IEEE Journal of Selected Topics in Signal Processing, 2020. http://dx.doi.org/10.1109/JSTSP.2020.2980956.

Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, "ADL-MVDR: All deep learning MVDR beamformer for target speech separation," ICASSP, pp. 6089–6093, 2021. http://dx.doi.org/10.1109/ICASSP39728.2021.9413594.

G. Li, J. Yu, J. Deng, X. Liu and H. Meng, "Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022, pp. 6042-6046, http://dx.doi.org/10.1109/ICASSP43922.2022.9747237.

J. Ong, B. T. Vo, S. Nordholm, B. -N. Vo, D. Moratuwage and C. Shim, "Audio-Visual Based Online Multi-Source Separation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1219-1234, 2022. http://dx.doi.org/10.1109/TASLP.2022.3156758.

T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proc. of Interspeech, 2018. http://dx.doi.org/10.21437/Interspeech.2018-1400.

A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audiovisual model for speech separation," ACM Trans. Graph., pp. 112:1–112:11, 2018. http://dx.doi.org/10.1145/3197517.3201357.

Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018. http://dx.doi.org/10.1109/TETCI.2017.2784878.

R. Lu, Z. Duan, and C. Zhang, "Listen and look: audio–visual matching assisted speech source separation," IEEE Signal Processing Letters, vol. 25, no. 9, pp. 1315–1319, 2018. http://dx.doi.org/10.1109/LSP.2018.2853566.

H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, "The sound of pixels," in Proc. of ECCV, 2018. http://dx.doi.org/10.1007/978-3-030-01246-5_35.

A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, "Seeing through noise: Visually driven speaker separation and enhancement," in Proc. of ICASSP, 2018. http://dx.doi.org/10.1109/ICASSP.2018.8462527.

M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, "DNN driven speaker independent audio-visual mask estimation for speech separation," in Proc. of Interspeech, 2018. http://dx.doi.org/10.21437/Interspeech.2018-2516.

G. Morrone, S. Bergamaschi, L. Pasa, L. Fadiga, V. Tikhanoff, and L. Badino, "Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6900–6904. http://dx.doi.org/10.1109/ICASSP.2019.8682061.

J. Wu, Y. Xu, S. Zhang, L. Chen, M. Yu, L. Xie, and D. Yu, "Time domain audio visual speech separation," in Proc. IEEE Autom. Speech Recognit. Understanding Workshop, 2019, pp. 667–673. http://dx.doi.org/10.1109/ASRU46091.2019.9003983.

Mandar Gogate et al. "Deep Neural Network Driven Binaural Audio Visual Speech Separation". In: International Joint Conference on Neural Networks (IJCNN). IEEE. 2020, pp. 1–7. http://dx.doi.org/10.1109/IJCNN48605.2020.9207517.

Q. Nguyen, J. Richter, M. Lauri, T. Gerkmann and S. Frintrop, "Improving mix-and-separate training in audio-visual sound source separation with an object prior," in International Conference on Pattern Recognition (ICPR), 2020. http://dx.doi.org/10.1109/ICPR48806.2021.9412174.

C. Li and Y. Qian, "Deep Audio-Visual Speech Separation with Attention Mechanism," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7314-7318, http://dx.doi.org/10.1109/ICASSP40776.2020.9054180.

R. Gu et al., "Multi-modal multi-channel target speech separation," IEEE J-STSP, 2020. http://dx.doi.org/10.1109/JSTSP.2020.2980956.

C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba. "Music gesture for visual sound separation". In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10475–10484, 2020. http://dx.doi.org/10.1109/CVPR42600.2020.01049.

Lingyu Zhu and Esa Rahtu, "Visually guided sound source separation using cascaded opponent filter network," in Proc. of ACCV, 2020. http://dx.doi.org/10.1007/978-3-030-69544-6_25.

K. Tan, Y. Xu, S.-X. Zhang, M. Yu, and D. Yu, "Audio-visual speech separation and dereverberation with a two-stage multimodal network," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 3, pp. 542–553, 2020. http://dx.doi.org/10.1109/JSTSP.2020.2987209.

L. Qu, C. Weber, and S. Wermter, "Multimodal target speech separation with voice and face references," Proc. of Interspeech, 2020. http://dx.doi.org/10.21437/Interspeech.2020-1697.

T. Rahman and L. Sigal, "Weakly-Supervised Audio-Visual Sound Source Detection and Separation," IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 2021, pp. 1-6. http://dx.doi.org/10.1109/ICME51207.2021.9428196.

R. Gao and K. Grauman, "VisualVoice: Audio-visual speech separation with cross-modal consistency," in CVPR, 2021. http://dx.doi.org/10.1109/CVPR46437.2021.01524.

Majumder, S., Al-Halah, Z., Grauman, K.: "Move2Hear: Active audio-visual source separation". In: ICCV (2021). http://dx.doi.org/10.1109/ICCV48922.2021.00034.

Y. Liu and Y. Wei, "Multi-Modal Speech Separation Based on Two-Stage Feature Fusion," IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 2021, pp. 800-805, http://dx.doi.org/10.1109/ICSIP52628.2021.9688674.

Y. Tian, D. Hu and C. Xu, "Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 2744-2753. http://dx.doi.org/10.1109/CVPR46437.2021.00277.

Makishima, N., Ihori, M., Takashima, A., Tanaka, T., Orihashi, S., Masumura, R.: "Audio-visual speech separation using cross-modal correspondence loss," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6673–6677. IEEE (2021). http://dx.doi.org/10.1109/ICASSP39728.2021.9413491.

V. -N. Nguyen, M. Sadeghi, E. Ricci and X. Alameda-Pineda, "Deep Variational Generative Models for Audio-Visual Speech Separation," IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), Gold Coast, Australia, 2021, pp. 1-6. http://dx.doi.org/10.1109/MLSP52302.2021.9596406.

Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, and Kwanghoon Sohn, "Looking into your speech: Learning cross-modal affinity for audio-visual speech separation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1336–1345. http://dx.doi.org/10.1109/CVPR46437.2021.00139.

Lingyu Zhu and Esa Rahtu. "Leveraging category information for single-frame visual sound source separation," In IEEE 2021 9th European Workshop on Visual Information Processing (EUVIP), pages 1–6. http://dx.doi.org/10.1109/EUVIP50544.2021.9484036.

R. Gu, S. -X. Zhang, Y. Zou and D. Yu, "Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 849-862, 2023. http://dx.doi.org/10.1109/TASLP.2022.3229261.

T. Oya, S. Iwase and S. Morishima, "The Sound of Bounding-Boxes," 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 2022, pp. 9-15. http://dx.doi.org/10.1109/ICPR56361.2022.9956384.

Lingyu Zhu and Esa Rahtu. "Visually guided sound source separation and localization using self-supervised motion representations" In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1289–1299, 2022. http://dx.doi.org/10.1109/WACV51458.2022.00223.

D. -H. Pham, Q. -A. Do, T. T. -H. Duong, T. -L. Le and P. -L. Nguyen, "End-to-end Visual-guided Audio Source Separation with Enhanced Losses," Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 2022, pp. 2022-2028. http://dx.doi.org/10.23919/APSIPAASC55919.2022.9980162.

Xudong Xu, Bo Dai, and Dahua Lin, "Recursive visual sound separation using minus-plus net," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 882–891. http://dx.doi.org/10.1109/ICCV.2019.00097.

Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba, "The sound of motions," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1735–1744. http://dx.doi.org/10.1109/ICCV.2019.00182.

R. Lu, Z. Duan, and C. Zhang, "Audio-visual deep clustering for speech separation," IEEE ACM Trans. Audio Speech Lang. Process., vol. 27, no. 11, pp. 1697–1712, 2019. http://dx.doi.org/10.1109/TASLP.2019.2928140.

Ruohan Gao and Kristen Grauman, "Co-separating sounds of visual objects," In Proc. ICCV, 2019. http://dx.doi.org/10.1109/ICCV.2019.00398.

M. Cooke, J. Barker, S. Cunningham, X. Shao. "An audio-visual corpus for speech perception and automatic speech recognition" The Journal of the Acoustical Society of America, vol.120, no.5, pp.2421–2424, 2006. http://dx.doi.org/10.1121/1.2229005.

N. Alghamdi, S. Maddock, R. Marxer, J. Barker, G. J. Brown. "A corpus of audio-visual Lombard speech with frontal and profile views" The Journal of the Acoustical Society of America, vol.143, no.6, pp.EL523–EL529, 2018. http://dx.doi.org/10.1121/1.5042758.

N. Harte, E. Gillen. "TCD-TIMIT: An audio-visual corpus of continuous speech". IEEE Transactions on Multimedia, vol.17, no.5, pp.603–615, 2015. http://dx.doi.org/10.1109/TMM.2015.2407694.

G. Y. Zhao, M. Barnard, M. Pietikainen. "Lipreading with local spatiotemporal descriptors". IEEE Transactions on Multimedia, vol.11, no.7, pp.1254–1265, 2009. http://dx.doi.org/10.1109/TMM.2009.2030637.

I. Anina, Z. H. Zhou, G. Y. Zhao, M. Pietikäinen. "OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis". In Proceedings of the 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, IEEE, Ljubljana, Slovenia, pp.1−5, 2015. http://dx.doi.org/10.1109/FG.2015.7163155.

A. Nagrani, J. S. Chung, A. Zisserman. "VoxCeleb: A large-scale speaker identification dataset". In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp.2616−2620, 2017. http://dx.doi.org/10.21437/Interspeech.2017-950.

J. S. Chung, A. Nagrani, A. Zisserman. "VoxCeleb2: Deep speaker recognition". In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, pp.1086−1090, 2018.

J. S. Chung, A. Zisserman. "Lip reading in the wild". In Proceedings of the 13th Asian Conference on Computer Vision, Springer, Taipei, China, pp.87−103, 2017. http://dx.doi.org/10.1007/978-3-319-54184-6_6.

J. S. Chung, A. Senior, O. Vinyals, A. Zisserman. "Lip reading sentences in the wild". In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp.3444−3453, 2017. http://dx.doi.org/10.1109/CVPR.2017.367.

J. S. Chung, A. Zisserman. "Lip reading in profile". In Proceedings of British Machine Vision Conference 2017, BMVA Press, London, UK, 2017. http://dx.doi.org/10.5244/C.31.155.

J. Roth et al., "Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 4492-4496. http://dx.doi.org/10.1109/ICASSP40776.2020.9053900.

Published

2023-12-01

How to Cite

An Overview of Audio-Visual Source Separation Using Deep Learning. (2023). Al-Khwarizmi Engineering Journal, 19(4), 42-55. https://doi.org/10.22153/kej.2023.06.003
