With generative AI being a key feature of all its new software and hardware projects, it should be no surprise [[link]] that Microsoft has been developing its own machine learning models. VASA-1 is one such example: given a single image of a person and an audio track, it generates a convincing video clip [[link]] of that person speaking the recording.
Just a few years ago, anything created via generative AI was instantly identifiable by several telltale signs. With still images, it would be things like the wrong number of fingers on a person's hand, or even a failure to give someone the correct number of legs. AI-generated video was even worse, but at least it was very meme-worthy.
However, a research report from Microsoft shows that these obvious tells are rapidly disappearing. VASA-1 is a machine learning model that turns a single static image of a person's face into a short, realistic video, driven by a speech audio track. The model analyzes the audio's changes in tone and pace, then generates a sequence of new frames in which the face moves to match the speech.
I'm not doing it justice with that description, because some of the examples posted by Microsoft are startlingly good. Others aren't so hot, though, and it's clear that the researchers selected the best examples to showcase what they've achieved. In particular, a short video demonstrating the model running in real time shows that it still has a long way to go before it becomes impossible to distinguish real footage from computer-generated footage.
Microsoft's researchers, for their part, state the intent plainly: "It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection."
This is probably why Microsoft's research remains behind closed doors for now. That said, I can't imagine [[link]] it will be long before someone manages not only to replicate the work but to improve on it, and potentially use it for some nefarious purpose. On the other hand, if VASA-1 can be used to detect deepfakes and implemented as a simple desktop application, that would be a big step forward—or rather, a step away from a world where AI dooms us all. Yay!
