This AI Makes The Mona Lisa Speak…And More!

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. In an earlier episode, we covered a paper by the name of Everybody Dance Now. In this stunning work, we could take a video of a professional dancer, then record a video of our own, let's be diplomatic, less beautiful moves, and then transfer the dancer's performance onto our own body in the video. We called this process motion transfer.

Now, look at this new, also learning-based technique that does something similar: in goes a description of a pose and just one image of the target person, and on the other side, out comes a proper animation of this character according to our prescribed motions. Now, before you think that this means we would need to draw and animate stick figures to use it, I will stress that this is not the case.

There are many techniques that perform pose estimation, where we just insert a photo, or even a video, and they create all these stick figures for us that represent the poses people are taking in these videos. This means that we can even have a video of someone dancing and just one image of the target person, and the rest is history. Insanity.
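To get a feel for how accessible pose estimation has become, here is a minimal sketch using Google's MediaPipe library. This is just one of many off-the-shelf estimators, not the one used in the paper, and the file names are placeholders of my own choosing:

```python
# Minimal pose-estimation sketch with MediaPipe (an assumption: the paper
# does not prescribe this library). "dancer.jpg" is a placeholder path.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

image = cv2.imread("dancer.jpg")  # OpenCV loads images as BGR

# static_image_mode=True treats the input as a single photo, not a video stream.
with mp_pose.Pose(static_image_mode=True) as pose:
    # MediaPipe expects RGB input, so convert from OpenCV's BGR.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # Draw the "stick figure": the detected joints plus skeleton connections.
    mp_drawing.draw_landmarks(image, results.pose_landmarks,
                              mp_pose.POSE_CONNECTIONS)
    cv2.imwrite("dancer_pose.jpg", image)
```

Running the same detector frame by frame over a dance video yields exactly the kind of stick-figure sequence that drives techniques like this one.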
That is already amazing and very convenient, but this paper works with a video-to-video problem formulation, a concept that is more general than just generating movement. Way more. For instance, we can also provide an input video of ourselves, then add one, or at most a few, images of the target subject, and make them speak and behave using our gestures.
This is already absolutely amazing; however, the more creative minds out there are already thinking that if we are talking about images, a painting would work as well, right? Yes, indeed, we can make the Mona Lisa speak with it too.

It can also take a labeled image, which is what you see here, where the colored and animated patches show the object boundaries for the different object classes; then we add an input photo of a street scene, and out comes photorealistic footage with all the cars, buildings, and vegetation.
Now, make no mistake, some of these applications were possible before, and we showcased many of them in previous videos, some of which you can see here. What is new and interesting is that we have just one architecture that can handle many of these tasks. Beyond that, this architecture requires much less data than previous techniques, as it often needs just one, or at most a few, images of the target subject to do all this magic.
The paper is ample in comparisons to these other methods. For instance, the FID score measures the quality and the diversity of the generated output images, where lower values are better, and you will see that this method is miles ahead of the previous works.
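Since the FID comes up so often in these comparisons, here is a minimal sketch of how it is computed, assuming we already have Inception feature vectors for the real and generated images. The feature-extraction step is omitted, and the function and variable names are mine:

```python
# Fréchet Inception Distance (FID) between two sets of image features.
# It compares Gaussians fitted to real vs. generated features; lower is better.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each feature set.
    mu_r, cov_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_f, cov_f = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # FID = ||mu_r - mu_f||^2 + Tr(cov_r + cov_f - 2 * sqrt(cov_r @ cov_f))
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

In practice, the features typically come from a pretrained Inception-v3 network, and both sets need plenty of samples for the covariance estimates to be reliable.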
Some limitations also apply: if the inputs stray too far from the topics the neural networks were trained on, we shouldn't expect results of this quality, and we also depend on proper inputs for the poses and segmentation maps for it to work well.

The pace of progress in machine learning research is absolutely incredible, and we are getting very close to producing tools that can be actively used to empower artists working in the industry. What a time to be alive!

Thanks for watching and for your generous support, and I'll see you next time!
