All right. Thanks, Mark. Hi, I'm K Ka. I'm with Adobe. I work on AI tools in Premiere Pro and core workflows, and I also host the Hollywood webinar series. I'm excited about the three great presentations we have. We're officially starting the 10 a.m. session, and we're going to be talking about generative AI's impact on video production, a novel method for constructing knowledge graphs from unstructured text to improve news data management, and then we'll welcome Eric back to talk about AI and animation.

So first we're going to start with our first speaker. He works for AWS and has worked in the field of AI at Amazon for almost 10 years. He currently manages a team of AI and data specialists at AWS, and the team's focus is on media and entertainment workloads. He has previously served as a section officer and is a technical editor of AI books for a variety of publishers. So please welcome him.

All right, thanks everyone. I appreciate you coming here today. I'm going to talk about a very specific topic: applying generative AI to media and entertainment workloads. I'll mainly address video generation models, which are one of the most promising tools to apply in this space. I'm going to start off with what I think is one of the most successful examples of the use of these tools, a film called "Air Head," and talk about the reality of the work that went into making it look professional. Then I'll talk about the current state of the art, where we're at with these models and their usefulness. Then I'll cover the key technique for getting the most out of these models, or any generative AI model, which is prompt engineering. Then I'll talk about how you combine these models in post-production; you saw a little of that in Roger's talk, and search is a very important capability there. And finally, I'll dive a little bit into how these models are trained and how they work.

So you may have seen this. You won't be able to hear the audio, but I just want you to get a sense of the visuals, or if you haven't seen it, it came out earlier this year. So I want you to pay attention to the visuals; I'll just play a few seconds of it for you. This is a film about a character whose head is a balloon. It's very well done, in my opinion, and it took a team of three people about two weeks to make. There you can see the central concept of the character, and that's what the film is based around.

All right, I think that gives you a pretty good flavor. You can see it's quite professional looking, very well done. But behind the scenes there were quite a few issues. One of the key issues was the fact that for every 300 clips that were produced, only one was usable. So of course that's a huge amount of waste, and behind the scenes there's GPU hardware running these models, so that's a key limitation. The models have improved to a point where there's a lot of consistency of objects within a scene, but between different clips or shots you'll notice the character looks a little bit different, so that's another limitation you currently see with these models. These models also have an inherent randomness; they're stochastic, so they won't produce the same result for the same prompt every single time. And what happens is that, typically in post, you have to do a huge amount of editing.
So for example, even though the producers wanted the character to have a balloon with no face on it, sometimes the model, with a mind of its own, just slapped a face on it, which they had to edit out in post. So it's quite a bit of work. And again, the key limitation the producers stated was the lack of sufficient creative control over the model output. This is a situation that's improving, and I expect over the next year there will be substantial evolution, but that leads me to my next topic: what are the feasible use cases now?

I think there are actually quite a few, whether you're an indie filmmaker or a producer of commercials. Ads are one of the key initial applications we already see, so I'll play this video for you now. This is actually probably the first large-scale production use case of video generation: Amazon Ads allows advertisers to create their own videos using AI. This is what it looks like: basically you supply a product image and the AI animates it into a short video. You can see it's a very straightforward application. It's not super dynamic, but it's definitely more engaging than just a static image, right?

Besides these ads, I think we'll see quite a few applications in commercials, and some of you may have seen the infamous Toys R Us commercial that came out earlier this year as well. There's a lot of potential use in pre-production; we've already seen some of that with animatics and other visualizations that don't necessarily require very high resolution, just to give a director a previsualization of what something might look like. Then, in terms of production itself, there's B-roll and second unit work: if you want an establishing shot of a castle in Spain, you can certainly ask the model to produce that and get good results, rather than pulling stock footage or sending a crew over there. And then post-production editing, as we saw earlier, is one of the key use cases that is practical now, more or less, and visual effects as well.

A key consideration when you're choosing one of these models, and there are several commercially available right now with many more on the way, is what kind of input modality is available. The first video generation models were text based: you would supply a text prompt describing the scene and it would produce an output. But now the trend is towards multimodal, so you can supply not only text but also potentially an image or even another video, and eventually it will be basically anything to anything. An example would be supplying a text prompt and getting out a video with audio synced to it. We'll see that much more in the future, I would say next year.

So I'm going to give you a concrete example here. I'm going to start the top video for you. This is just a kind of standard drone shot of a Hawaii jungle, right? Not very dramatic, a pretty short clip. And then what you can do is also supply a text prompt to modify it, so this is basically an editing example. You can say, hey, I would like to have kayaks in the water there. And so in the bottom video you can see the kayaks have been added, and you can perhaps see that they're bobbing up and down on the water. So that's a very straightforward application of multimodal input for one of these models, both text and video. All right.
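To make that text-plus-video idea concrete, here is a minimal sketch of what such a request could look like from code. The endpoint URL, field names, and response shape are purely illustrative; every model provider defines its own API, so treat this as an assumed, hypothetical interface rather than any specific product.

# Minimal sketch of a text-plus-video edit request to a hypothetical
# video generation endpoint. URL, field names, and response handling
# are illustrative only -- each provider defines its own API.
import requests

ENDPOINT = "https://example.com/v1/video/generate"   # hypothetical

def edit_clip(video_path: str, prompt: str) -> bytes:
    """Send a source clip plus an edit instruction, return the new clip bytes."""
    with open(video_path, "rb") as f:
        resp = requests.post(
            ENDPOINT,
            data={"prompt": prompt},        # e.g. "add kayaks bobbing on the water"
            files={"reference_video": f},   # the drone shot to modify
            timeout=600,                    # generation can take minutes
        )
    resp.raise_for_status()
    return resp.content                     # bytes of the generated video

if __name__ == "__main__":
    clip = edit_clip("hawaii_drone.mp4", "add kayaks bobbing gently on the water")
    with open("hawaii_drone_with_kayaks.mp4", "wb") as out:
        out.write(clip)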
So text plus image, if you're mathematically minded, is like the inverse operation of the video input. With an image input, you basically supply the aesthetic in the input image and you animate it, whereas with a video input you're supplying the animation and then potentially changing the aesthetic. In this example, I had a prompt about a live music festival, and I supplied the resulting image to a video generation model with an additional prompt specifying the action. I'll play that lower video for you, and I think you'll see it did a pretty good job of capturing the additional prompt language: the dancing, the flashing stage lights. Let me play that again for you. So the models work pretty well for these types of applications, but the range of animation at this point is still quite limited. It's not really going to go outside the frame of the input image, or extrapolate to anything beyond the frame, or do any exceptionally dynamic motion.

You've seen a few prompts already, and the prompts are really what allow you to apply creative control. There's a big difference here between video generation models and other types of models you've seen, such as for text generation, and it comes down to the requirement of specialized knowledge if you want to produce a professional-quality video. With just a text model, you can ask, you know, what team won the Super Bowl in 1983, or summarize this document for me, and you don't really need any specialized knowledge, though of course with any kind of model, prompt engineering is necessary to get the most optimal output. But if you're trying to create a professional video, you have to have a wider base of knowledge; you can't possibly produce a professional video, something that reflects your creative vision, unless you have that.

What I've seen over the last year is a trend towards enabling users to apply the language of cinematography when using these models, and I'll show you some examples of that. But first, in terms of what those elements of cinematography are: different experts would have a different list, perhaps, but I think most of us could agree that the elements you see here are the key ones. For example, camera placement and movement, shot composition and size, and focus and depth of field are really critical as storytelling techniques, and lighting, of course, has been a crucial storytelling technique in film and entertainment since the very beginning. And those are only six of them; other people would add the type of film stock and all sorts of other parameters and elements. Just think of all the different combinations of those elements you have to consider when constructing a clip. And besides these, there are all the rules of cinematography, regarding the motion of the characters, for example. So it's just a lot of background knowledge to take into consideration.

So again, I'm going to show you another example of what I mean here. This is a relatively simple example, basically a tracking shot, so let me play this video for you. And you can see, let me play that one more time, you can see that the model did a pretty good job of capturing what the prompt was asking for. There are a couple of oddities there; for example, the trees on the left side of the street are bare. I don't know why. I mean, I'm not an East Coast person, I'm a West Coast person, so maybe that's just the way the East Coast looks.
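To make those cinematography elements concrete, here is a minimal sketch of one way to assemble them into a single prompt string. The grouping, field names, and example values are my own assumptions for illustration; no particular model requires the prompt to be organized this way.

# Minimal sketch: assembling cinematography elements into one prompt string.
# The element names, order, and sample values are illustrative only.
def build_prompt(camera: str, shot: str, subject: str,
                 lighting: str, depth_of_field: str, extra: str = "") -> str:
    parts = [
        camera,           # camera placement and movement
        shot,             # shot composition and size
        subject,          # who or what is in frame, and where
        lighting,         # the lighting style
        depth_of_field,   # focus and depth of field
        extra,            # any additional details
    ]
    return ". ".join(p for p in parts if p) + "."

prompt = build_prompt(
    camera="Slow tracking shot moving down a tree-lined city street",
    shot="Wide shot at eye level",
    subject="Pedestrians crossing in the middle ground",
    lighting="Late-afternoon sunlight with long shadows",
    depth_of_field="Deep focus so foreground and background both stay sharp",
)
print(prompt)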
But that's, again, something to keep in mind; that's why you get those ratios of 300 to one unusable to usable output. All right, here's another example using a couple of other prompt elements, the field of view and the shot size. I'm going to go ahead and play that for you: a futuristic Tokyo street. I think that, overall, the model does a pretty good job of capturing what the prompt was asking for. So I think you can see that, so far, these models have reached a point where they're doing a pretty okay job of capturing some of these basic elements of cinematography, but there's still a lot of work to do. And that's not to mention the fact that if you're really trying to put together the story of a professional video, there's a lot more you need to consider.

Currently there are some models that allow prompts as long as 1,000 words, so we're starting to get to the point where people are going to be writing novels as prompts. Perhaps that will be the direction: you just input Great Expectations and a model spits out the whole movie for you. But we're still a very long way away from that. Still, even with 1,000 words, it's a lot to consider. So how do you organize your prompt and think about it? One possible template is what I have at the bottom of this slide: camera movement, establishing scene, and additional details. That can help you organize your thoughts until you gain more experience and confidence writing prompts. This particular prompt is one I wrote to be somewhat more complex: I specify where the main character, a toy poodle, is, in the middle ground, and I specify the motion as left to right. Why do I specify that? Because one of the rules of cinematography is that, for some reason, left-to-right motion evokes a more positive emotional response from the viewer than right-to-left motion. Again, all these rules can be broken, and are broken all the time, but more often than not, following them will produce better results.

All right, I'm going to start the top video here so you can see what the model did. Overall, I think the model did a pretty good job of capturing that prompt, but let's try another clip here. All right, so the bottom clip was produced by the same prompt and the same model, but as you can see, the results are quite different; in particular, the left-to-right motion is not as pronounced as in the first clip. So again, keep in mind that if you're using these models, there's going to be a lot of trial and error as you work through it. I'm going to talk about some more advanced techniques you can use to manage that.

But first, another key difference with video generation models is that they involve time. The other models you might be familiar with, for text generation or image generation, are not time dependent: you ask for a summary of a document and it's the same five minutes from now as it is a day from now, right? With video generation, the time dimension introduces a lot of complexity. So how do we deal with this? Right now, different model providers have different techniques. One of them you see on the screen here; it's called prompt travel, where basically you specify a frame number or point in time at which the scene should change.
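Here is a minimal sketch of that frame-indexed idea, with made-up prompt text and frame numbers; the exact encoding differs from provider to provider, so this is an assumed structure rather than any particular model's format.

# Minimal sketch of "prompt travel": a head and tail that stay fixed,
# and a body keyed by frame number that changes over time.
PROMPT_HEAD = "Cinematic, 24 fps, shallow depth of field"
PROMPT_TAIL = "warm color grade"
PROMPT_BODY = {                 # frame number -> what the scene should show
    0:  "a quiet beach at dawn, empty lifeguard tower",
    48: "surfers paddling out as the sun climbs",
    96: "a crowded beach at midday, umbrellas and kites",
}

def prompt_at(frame: int) -> str:
    """Compose the full prompt in effect at a given frame."""
    keys = sorted(k for k in PROMPT_BODY if k <= frame)
    body = PROMPT_BODY[keys[-1]] if keys else next(iter(PROMPT_BODY.values()))
    return f"{PROMPT_HEAD}, {body}, {PROMPT_TAIL}"

for f in (0, 60, 120):
    print(f, "->", prompt_at(f))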
The way this particular provider did it was to have a prompt body that changes over time, and a prompt head and tail that stay the same or change more slowly during the course of the clip. So that's one way. Another way is what I'd call a prompt chain, where you just have one long prompt that specifies how the scene changes over time. You don't specify a frame number or a time span; you just lay out in the prompt how to make the transition. That's another common approach. And of course, the prompts get longer and longer, so 1,000-word prompts might start to seem very short when we have to deal with those kinds of applications.

Coming back to the fact that these models can be wasteful and require a lot of trial and error: one way to manage that is a preview technique. This basically allows you to request a set of preview clips that are shorter than the planned length, a couple of seconds instead of, say, five, just so you can get an idea of what the prompt might produce. This is important not only because of the time spent sorting through potentially hundreds of clips generated by a model, but also because of the cost; again, you're running on GPUs or some other type of accelerated hardware, so it can get quite costly. So this is another technique that makes video generation more practical and feasible.

Up to now I've talked about how you might use these models to generate what you might call raw footage that you use in the main production itself. But there are also many post-production use cases for these models, and for this stage of production it's not only video generation models; many other types of generative AI models can be applicable. Editing we've mentioned before. Visual effects: I've talked to a number of visual effects houses about how they're currently experimenting with generative AI. Reuse of content is something that's happening now and is a huge potential workload. And localization is also huge as a current workload; there are a number of companies offering localization with generative AI right now.

So, editing. Probably the most prominent example of this is Adobe, and if you attended a show this year, this was probably featured there. Using the Firefly video model, there are a number of features like Generative Extend and object addition and removal, as well as a marketplace of third-party video generation models, so you can pick from different model providers. I'm just going to play a few seconds of this to give you a quick idea; again, there's no audio, and all of this is on YouTube, so you can find it if you want to take a closer look. I'm going to speed it up a little bit so you can see that object functionality. Again, this is really useful functionality, and it's available now or in the near future. So it's practical, and you can certainly build your own, even if you have a very straightforward use case where you have a short clip and you just want to use a video generation model that accepts a video clip as input.

Reuse of content is another really important workload for generative AI. The goal, really, is to monetize existing content that you may have archived.
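Jumping ahead for a second to the search piece of making that reuse practical, here is a minimal sketch of indexing clip embeddings and querying them with natural language. The embed() function is a hypothetical stand-in for a real multimodal embedding model, and the in-memory list stands in for a proper vector database; both are assumptions for illustration only.

# Minimal sketch of video search: embed short video segments, keep the
# vectors in a toy in-memory index, and query with natural language.
# embed() is a placeholder for a real multimodal embedding model.
import numpy as np

def embed(item) -> np.ndarray:
    """Placeholder: return a unit-length feature vector for a clip or a query."""
    rng = np.random.default_rng(abs(hash(str(item))) % (2**32))
    vec = rng.standard_normal(512)
    return vec / np.linalg.norm(vec)

index = []                                   # (clip_id, vector) pairs
for clip_id in ["game1_000-005s.mp4", "game1_005-010s.mp4"]:
    index.append((clip_id, embed(clip_id)))  # one vector per short segment

def search(query: str, top_k: int = 3):
    q = embed(query)                         # embed the natural-language query
    scored = [(float(q @ v), clip_id) for clip_id, v in index]
    return sorted(scored, reverse=True)[:top_k]

print(search("touchdown celebration in the end zone"))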
You know, examples include social media influencers who have a huge amount of footage and want to reduce it to a short travelogue or highlight reel. And to make this practical, you have to combine two different components, video search and video summarization, both powered by generative AI.

Video search, again building on what Roger said earlier, basically involves indexing what we call embeddings in a vector database, and there are a number of different vector databases out there. The vectors are numerical representations of short segments of video, along with the textual information we associate with them. So basically you sample a video, you convert it to feature vectors, which are just arrays of numbers, and then you load them into the database, where they're indexed so they can be searched. Once the vectors are in the vector database, you can just search it with natural language. I've seen a huge number of these use cases; in fact, the NFL and other sports organizations just came out with their own solutions, rolling their own on top of AWS, and there are many different products as well; if you walk around a show like NAB, there are solutions along these lines in many different booths.

Video summarization is another really important component of content reuse. For this one, you're leveraging a number of different generative AI models, not just a single model. And you potentially also want to summarize to a short-form video, so you're not just creating a text description of the video; you're creating a short video, the short-form content itself. Again, there are many different ways to do this, and there are a number of different vendors who package this all together for you. At AWS we publish solution guidance for it, but again, there are quite a few vendors in this area as well. So you combine those two.

And then localization is another area where there are a ton of different vendors already operating, with different kinds of solutions. Unfortunately, you can't hear the audio here, but in the top clip Jeff is speaking English, and in the lower one he's speaking Spanish.

I just want to quickly wrap up with how video generation models work. Basically, they work by shaping random noise into images, which are then extended in the time dimension. At the center of these you see an image generation model; it could be a famous one like Stable Diffusion. And then you have various components surrounding it to smooth out the action, ensure content consistency, and make sure the frames aren't overwhelming the GPU memory of the hardware.

Finally, in terms of how they're trained: to produce professional-quality video, you want to train on professionally produced content. You've seen some examples already, such as the Lionsgate deal that just happened with Runway AI, and it's a very challenging thing to do for many reasons. There may be many petabytes of archived data to migrate to the cloud, if you're going to do this in the cloud; you have to prepare a gigantic data preparation pipeline for transcoding, reformatting, and removing unlicensed elements; and it can take many, many weeks for this to run. So I think that does it. Thank you so much. We have time for one short question.