AI system learns concepts shared across video, audio, and text

May 04, 2022

"A machine-learning model can identify the action in a video clip and label it, without the help of humans," reports Adam Zewe of the MIT News Office.

Alexander Liu, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of a paper tackling this problem, and his collaborators "developed an artificial intelligence technique that learns to represent data in a way that captures concepts which are shared between visual and audio modalities. For instance, their method can learn that the action of a baby crying in a video is related to the spoken word “crying” in an audio clip.

Using this knowledge, their machine-learning model can identify where a certain action is taking place in a video and label it.

It performs better than other machine-learning methods at cross-modal retrieval tasks, which involve finding a piece of data, like a video, that matches a user’s query given in another form, like spoken language. Their model also makes it easier for users to see why the machine thinks the video it retrieved matches their query."
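To make the idea of cross-modal retrieval concrete, here is a minimal sketch of how a query in one form can be matched against items in another once both are represented as vectors in a shared space. This is not the researchers' model: the three-dimensional vectors, the file names, and the `cosine` and `retrieve` helpers are all made up for illustration, and ranking by cosine similarity is a generic technique assumed here, not one the article describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, candidates):
    """Return candidate names ordered from best to worst match."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )


# Hypothetical video embeddings; a real system would learn these from data.
video_embeddings = {
    "baby_crying.mp4": [0.9, 0.1, 0.0],
    "dog_barking.mp4": [0.1, 0.9, 0.1],
    "person_juggling.mp4": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the spoken word "crying" in the same space.
spoken_query = [0.8, 0.2, 0.1]

ranking = retrieve(spoken_query, video_embeddings)
print(ranking[0])  # baby_crying.mp4
```

In this toy setup the spoken query lands closest to the crying-baby clip because their vectors point in nearly the same direction; learning a shared representation in which that holds for real video and audio is the hard part the research addresses.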

"This research was supported, in part, by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, and by the MIT Lincoln Laboratory."

Artificial intelligence system learns concepts shared across video, audio, and text by MIT News