The research approach combines visual-audio synchronisation with speech processing. By matching features between stereoscopic images, the image processor extracts a 3D point cloud. A time-of-flight (TOF) depth camera is included as a complementary sensor so that the system can adapt to different interaction scenarios. The beamformer steers toward the mouth direction and optimises the array pattern for the target coordinates. The main challenge in developing this technology is aligning the beamformer filter coefficients with the corresponding image frame to improve voice processing. Therefore, a compilation algorithm will be developed to achieve real-time visual-audio synchronisation.
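As a minimal sketch of the steering step, the snippet below shows a time-domain delay-and-sum beamformer aimed at a mouth position estimated from the 3D point cloud. The array geometry, sample rate, and function names are illustrative assumptions (the source does not specify them), and the camera frame is assumed to be registered to the microphone-array frame.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
FS = 16_000             # sample rate in Hz (assumed)

# Hypothetical 4-element linear microphone array along the x-axis (metres).
MIC_POSITIONS = np.array([[-0.15, 0.0, 0.0],
                          [-0.05, 0.0, 0.0],
                          [ 0.05, 0.0, 0.0],
                          [ 0.15, 0.0, 0.0]])

def steering_delays(target_xyz):
    """Per-microphone delays (in samples) that align a source at target_xyz.

    target_xyz is the mouth position estimated from the 3D point cloud.
    """
    dists = np.linalg.norm(MIC_POSITIONS - target_xyz, axis=1)
    # Delay each channel relative to the farthest mic so all delays are >= 0.
    return (dists.max() - dists) / SPEED_OF_SOUND * FS

def delay_and_sum(frames, target_xyz):
    """Time-domain delay-and-sum beamformer.

    frames: (n_mics, n_samples) array of synchronised audio frames.
    Returns one enhanced mono frame steered toward target_xyz.
    """
    n_mics, n_samples = frames.shape
    delays = steering_delays(target_xyz)
    out = np.zeros(n_samples)
    for ch in range(n_mics):
        d = int(round(delays[ch]))
        # Shift each channel so the target's wavefront lines up, then average.
        out[d:] += frames[ch, :n_samples - d]
    return out / n_mics
```

In a full system the `target_xyz` input would be refreshed at the video frame rate, which is where the alignment between image frames and filter coefficients described above becomes critical.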
In the future, this technology could be applied in service robots to enhance their audio-processing functions, enabling the robots to respond better to users' commands.