The VideoVerse

TVV EP 07 - Using AI and ML to scale automated video creation with Wu-Hsi Li of Firework

Visionular Season 1 Episode 7


On average, a 30-second video costs around $2,000 to produce. Trying to create thousands of videos at that rate can get quite costly. Wu-Hsi of Firework asked the question: what if we treated a video more like a script that could be easily edited, modified, and templatized? This approach led them to develop a scalable solution that lets you input your content and data and let AI spit out thousands of videos for the same cost, dramatically affecting the ROI of each video.



Wu-Hsi: I'm Wu-Hsi. Mark, thanks for having me here. I'm the R&D Lead here at Firework, leading the creation infrastructure and our creation tool, Firework Studio. Before Firework, I earned a PhD from the MIT Media Lab, where I was a researcher in computer music technology.

Mark: Wu-Hsi, it's great to have you on inside The VideoVerse. When you and I were talking about doing this interview, I learned about your work at MIT as an electronic musician and a technologist. And even though I did not go to MIT and I do not have a PhD, I am an electronic musician. So we found a lot of common ground there.

Wu-Hsi: I remember growing up, I was a very passionate pianist, but I didn't want to just play the piano. I felt there was something more important behind the sheet music. It's about the expression, because music is all about telling my story to the world. And I've been wondering: how do we empower people to express themselves without having to suffer through the frustration of instrument practice and reading sheet music and all that? It's about feeling something and telling that story, that emotional state. I think from that, I see a lot of overlap in our stories, and that comes into today's conversation.

Mark: Tell us what exactly you have built at Firework. And maybe you can also tell us how the product developed from when you first founded the company to now.

(Chpt 1 Dynamic Vision 2:08)

Wu-Hsi: So we were thinking, one thing we are very sure of is dynamic vision. What is dynamic vision? You walk into a store and there are so many products in front of you; which one is going to catch the customer's attention? In this digital world, we go to many websites, and you can imagine that is just the same as walking into a physical store.

Right now you see many images and text, but among all that, there is something moving: a short video. And the human brain cannot help but give all its attention to that dynamic visual source, that video.

So we thought, okay, even though Firework as a social media platform didn't take off, that's okay. We have built all the infrastructure for short-form video, so how about we use this short-form video infrastructure to power business needs? We are taking the short-form video service to the B2B scene to see what we can do from there.

That began a journey where not only are we able to offer many website and app owners short-form video, but we are also going into live stream and shoppable video. Right now we are the provider of a whole series of decentralized video-related software as a service, giving the power back to all those big companies.

Before Firework, they needed to go to Facebook, to Instagram, to Google, so many of the big companies that dominate the ad world. At the end of the day, though, they realized, "Hey, that's my customer. I want to own the data for my customer." So we are trying to leverage this tension with those companies: brands want the traffic, but they also want to own the data and own the customer.

And so Firework chose a different approach. We said, okay, we have a series of weapons; we can empower the customer so they can compete with their competitors instead of worrying about where their money should go, TikTok or Facebook.

Mark: Sure. So you have built the platform, and initially you were intending, as you said, to build a social network, basically to compete with TikTok. Is Firework then kind of an e-commerce social network, or are you focused more on the video creation tools that someone would use to publish videos on TikTok, for example, or Facebook or other platforms?

Wu-Hsi: In the video world, we need to have content and consumers at the same time. And the content comes from creators. At one point the message was that everyone can be a creator, but at the end of the day, the bar for producing a very eye-catching, compelling video is extremely high. So how do we really make people believe, "Ah, this is something I can do, I can make a video with a few touches on the screen," so that this doesn't become a place where only the elite can make high-quality video? It is one of the biggest challenges.

Mark: So you actually created these tools out of need rather than some business objective, right? Did I hear you say that correctly?

Wu-Hsi: Exactly. Whether it is the social-media-facing short-form video world or the e-commerce-facing short-form video world, one number is the same: the production cost for a 30-second video on average is 2,000 U.S. dollars.

Think about it, that's just 30 seconds. If they like this 30 seconds, you'd better have the next 30 seconds, and the next. This is a deep hole for a small business owner to fill, right? About 82% of the companies, website owners, and brands that we talked to told us that they have a content issue. A content issue is always about how you trade resources for content. It's not that it's impossible to make content, but you need a pipeline that always gives you something new every day, every week, every month. This is going to be a problem we are looking at for at least a few more years.

Mark: Now, you talked about the pivots over the years, and you pivoted and pivoted and pivoted. I know you recently introduced Firework Studio.

Wu-Hsi: So we were thinking, okay, the best way to unlock video creation is to keep the video in a fluid state. By fluid, we mean thinking about a video like a script rather than a file. A script is a text file. You can say, I want to localize this video for a client from Japan; how should I do that? I want to target this video to a more senior group, so maybe I will replace the model I use in this video, maybe I will revise the message, maybe I will increase the font size. There are many things that should be just a few clicks of your mouse, but were impossible to accomplish only because the video format is frozen.

So Firework Studio treats a video as a script and a video file at the same time. Whenever you want to make changes, you can, because we give you that room for adjustment by preserving the script behind the video.
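The "video as a script" idea can be sketched in a few lines of Python. This is a hypothetical illustration, not Firework's actual format: the field names and the `localize` helper are invented, but they show why a video kept as plain data can be retargeted with a small transformation instead of a re-shoot.

```python
# Hypothetical "video as a script" representation -- field names are
# illustrative and are not Firework's actual format.
import copy

script = {
    "duration_sec": 30,
    "locale": "en-US",
    "scenes": [
        {"type": "clip", "asset": "intro.mp4", "start": 0, "end": 5},
        {"type": "text", "content": "Summer Sale", "font_size": 36,
         "start": 5, "end": 10},
    ],
}

def localize(script, locale, translations):
    """Retarget a script to another locale by translating its text
    overlays -- a data transformation instead of a re-shoot."""
    out = copy.deepcopy(script)
    out["locale"] = locale
    for scene in out["scenes"]:
        if scene["type"] == "text":
            scene["content"] = translations.get(scene["content"],
                                                scene["content"])
    return out

ja = localize(script, "ja-JP", {"Summer Sale": "サマーセール"})
```

Because the video stays fluid as data, the same trick would work for swapping the model's footage or bumping a font size: edit the script, then re-render.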

Mark: Architecturally, how is your video pipeline built? For example, starting with, did you decide to utilize your own data centers? Are you on public cloud? Are you using multiple public clouds? Can you describe for us what the video pipeline looks like?

(Chpt 2 Video Pipeline 9:58)

Wu-Hsi: Right. So this is also a big question, because right now Firework is serving many business clients. Some of them prefer Amazon AWS, some prefer Google, and we even see that in the future there may be legal requirements from clients on different continents saying the data has to stay within the borders of their country.

So on and so forth. That is part one. Part two is, obviously, the making of the video: how do you produce the video, edit it, make the cuts, handle audio, video, and images? And when you don't have enough assets, can you find a third-party media library to support that?

And then all the way into transcoding territory. When you have one polished video, in order to serve a user on the web and a user on their phone, there are different formats we need to support. Our client doesn't need to know any of this; they only need to tell us one thing: that the load time feels faster and users are seeing beautiful videos. That's what we want to hear. Behind this simplicity, there are many details we need to take care of.

Mark: Are you using any commercial solutions, either commercial encoders, or other technologies? And I know that the listeners would be really curious to know, what was your evaluation process? How did you even decide to use commercial and then which commercial, or to use open source and maybe which project? Because there's a lot of options.

Wu-Hsi: Yeah. So there's the video quality part and the video creation part, from creation, to compilation, to transcoding. It's no secret that most companies need to look at FFmpeg, but FFmpeg is a framework, so just saying you use FFmpeg doesn't mean anything by itself.
On top of that, obviously, you can apply a codec just by Googling "FFmpeg command." But we want to think about how we spend the bits where they matter.
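The point that FFmpeg is a framework rather than a solution shows up as soon as one polished video has to become several renditions. Here is a minimal sketch that builds standard FFmpeg command lines in Python; the ladder of heights and bitrates is made up for illustration and is not Firework's actual encoding ladder.

```python
# Build per-rendition FFmpeg commands from one source file.
# The rendition ladder below is illustrative, not a real encoding ladder.

def transcode_cmd(src, height, video_kbps, out):
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",          # resize, keep aspect ratio
        "-c:v", "libx264", "-b:v", f"{video_kbps}k",
        "-c:a", "aac", "-b:a", "128k",
        out,
    ]

ladder = [(1080, 4500), (720, 2500), (540, 1200)]
cmds = [transcode_cmd("master.mp4", h, kbps, f"out_{h}p.mp4")
        for h, kbps in ladder]
```

Choosing how to spend those bits, the perceptual tuning Wu-Hsi alludes to, is exactly the part a plain `ffmpeg` command line does not give you.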

In that way, we are using some video optimization, such as Visionular; their technology spends the bits on what matters to human eyes. That's one example. The other example is, when we were building our creation tool from the ground up, we looked at white-labeling some existing tools. But as we mentioned, our most important philosophy is that we want to build our own video script. We want to build that infrastructure; it's something bigger than just a tool that makes a video.

So in that case, we worked with an external SDK provider, but we chose not to just use their entire front end and creation tool. We said we want to build it our way. There are decisions we made along the way; we explored Weibo, InVideo, many of those companies, and then we decided we could do better. And judging by the voice of many of our clients, the fact that we can leverage the consumption side and all that, and the fact that we now have a batch editing option, some clients think we are indeed doing a better job right now.

Mark: Clearly, quality is important to you as you design solutions and build your product. And it's also important to the customer. I'm curious about both, what you've experienced.

Wu-Hsi: I would not only look at quality. First, let's define what quality is. Quality to us is about how well a video retains a user to finish watching it. That can be one measurement.

The second measurement is: in the process of watching the video, how many watchers proceed to something behind this video? It can be a product, it can be a subscription to a channel. And so on.

Mark: So they take action.

Wu-Hsi: Yeah. So this quality question goes to the cost side. We know that if I hire a professional video production company, they can make this video for $2,000 per 30 seconds. Okay. Then some small business owner says, ah, let's find a student at a school, someone who is still learning.

Mark: $200. Maybe.

(Chpt 3 Quality and Scale 16:05)


Wu-Hsi: Yeah, maybe $200. But we cannot only look at quality without thinking about scale, because if you can make one video, how can you make a thousand videos? When you make a thousand videos, you are hoping the cost is not $2,000 times a thousand.

You are hoping that because I make more, I not only get a discount, but I'm also getting good at building this process. So there is something more than a volume discount, right? In this sense, Firework Studio kicks in and thinks about, okay, how can we be the best scalable solution? By scalable solution, we mean you have a spreadsheet, and this spreadsheet talks to your database directly. Imagine we are providing a video solution for customized, personalized medical services.

Or it can be something like, I am a real estate company. Every seller who comes to my website provides a list of pictures describing the appearance and amenities of that property, and they also go through a form to fill in all the supporting information. What we need to do is make one really good-looking showcase video for one property. Then, by hooking up the spreadsheet of data that talks to the database through something we call a template, you only need to make one video and plug an adapter into your database.

And then we are talking about thousands of videos made at the same time. On the Firework Studio side, we provide further variation, because you don't necessarily want these thousand videos to look exactly the same. So how do we make them look somewhat different, at almost the same cost? Let's say you spend $2,000 making that one template that ends up being duplicated into thousands of videos; you'll feel like that $2,000 was the best money you ever spent. So we do not look only at quality. We need to look at quality and scale at the same time.
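The spreadsheet-to-template flow Wu-Hsi describes is essentially a data merge: one hand-made template plus one row per listing yields one video script per listing. The sketch below is hypothetical; the CSV columns and placeholder names are invented for illustration.

```python
# One template + one data row per listing -> one video script per listing.
# CSV columns and placeholder names are hypothetical.
import csv, io

TEMPLATE = {
    "scenes": [
        {"type": "image", "asset": "{photo}"},
        {"type": "text", "content": "{address} - {price}"},
    ],
}

def render(template, row):
    """Fill every placeholder in the template from one data row."""
    scenes = []
    for scene in template["scenes"]:
        filled = dict(scene)
        for key in ("asset", "content"):
            if key in filled:
                filled[key] = filled[key].format(**row)
        scenes.append(filled)
    return {"scenes": scenes}

data = io.StringIO(
    "photo,address,price\n"
    "house1.jpg,12 Oak St,$450k\n"
    "house2.jpg,9 Elm Ave,$610k\n"
)
scripts = [render(TEMPLATE, row) for row in csv.DictReader(data)]
```

Each resulting script can then be rendered to video, so the cost of the template amortizes over every row in the database.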

And that's why Firework not only boosts consumption; on the creation side, we have something that empowers the client to do more than just one video.

Mark: What AI or ML models are you using or did you develop them? I assume you're using some modeling there.

Wu-Hsi: Yeah, this is a good question. In the video world, there are many fundamental problems we need to solve, and one of the biggest is how to connect the TV world and the mobile phone world. TV is a horizontal format, but your phone is vertical. Just three to five years ago, these two worlds seemed to hate each other. If you went to YouTube at that time and someone had shot a video vertically and you consumed it on your computer, it felt like such a poor experience. And underneath that video, you would see people commenting, "Only people with no sense record video vertically."

Just five years later, every video is recorded vertically. If you still record horizontally on your phone, that means you are showing your age. That part I disagree with, but the mainstream is being redefined. So in terms of AI, the source can be horizontal, and we need to help crop the video to your phone's format. But you need to know how to crop the video just right, so that you are not missing anything.

And the other big question is how to summarize a long video into a very short one. It can be a 10-minute video, but no one is going to watch 10 minutes unless you show them something great in the first 20 seconds, or maybe 2 seconds. So how do you find the highlight of the video? Those are areas where we need to keep exploring the models out there, pick up those technologies, and connect them to our data stream, so that when we provide a company with a live stream event, right after the event we already know which part was the highlight of the session, and we can create the highlight right off the bat.
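One common shape for the highlight problem is a sliding-window maximum over per-second "interest" scores. The scores themselves are an assumed input here; in practice they might come from chat activity, audio energy, or a model, none of which the transcript specifies.

```python
# Find the contiguous N-second window with the highest total interest.
# Per-second scores are an assumed input (e.g. chat activity, audio energy).

def best_highlight(scores, window_sec):
    """Return (start, end) seconds of the best-scoring window."""
    best_start = 0
    best_sum = cur = sum(scores[:window_sec])
    for start in range(1, len(scores) - window_sec + 1):
        # Slide the window: add the entering second, drop the leaving one.
        cur += scores[start + window_sec - 1] - scores[start - 1]
        if cur > best_sum:
            best_start, best_sum = start, cur
    return best_start, best_start + window_sec

# A 12-second stream with a spike around seconds 5-9.
scores = [1, 1, 2, 1, 1, 3, 9, 8, 7, 2, 1, 1]
clip = best_highlight(scores, 4)   # -> (5, 9)
```

A real pipeline would of course combine several signals and respect scene boundaries, but the core selection step can be this simple.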

Mark: Well, you showed me a demo, and I hope we can link to it in the show notes. In the video I saw, I think you started out vertical and then rotated the phone horizontal, or maybe it was vice versa. Anyway, it was just super amazing, because everything stayed in frame, the video didn't glitch, it didn't buffer, it didn't pause. It was almost like the video was shot in both formats and you could just seamlessly rotate. How are you achieving that? Can you tell us, structurally, what components you're using to achieve that? Because it's very impressive.

(Chpt 4 Zoom in and Zoom out the video 24:06)

Wu-Hsi: Yeah, glad you brought this up. Firework has a signature immersive video watch mode called "The Reveal." With the Reveal, you can watch a video in immersive mode. And when I say immersive mode, this is not about wearing VR gear, and it's not 360 video. It's that you can flip your phone, not 360 degrees relative to the viewer, but from horizontal to vertical, and the transition of the video not only fits the frame but also makes you feel like, wow, I'm in there. I can choose a unique viewing angle to consume the video, whether that's a sports video where you want to be there, or VR art where you feel like, oh, when I tilt my phone I can zoom in and out as if I am moving closer to and farther from the video world.

So this is a video experience that is very difficult to describe in words. If some of you are just listening to this podcast, I encourage you to check out the video I'm going to send to Mark. Check it out.

Mark: We'll put it up on the website for sure. But what are the technologies you used? Did you develop all of this from the ground up, or were you able to leverage some libraries? I know our listeners really want to know how you did it.

Wu-Hsi: Yeah. Some of you can't see it right now, but I'm showing this on my phone. On the surface it looks like I'm just rotating the video so that it's fixed to the phone's orientation, so that it sticks with gravity. But if this is a bike video, then you want to tilt the phone as the bikers ride along the path.

So what we do is, first we run an analysis of where the object is, and then we find a reference frame for the vertical position versus the horizontal position. Then we think about, okay, when we rotate the view, you don't want the view to change so fast that it makes the viewer feel uncomfortable.

There are many details we just redo over and over until we feel, wow, this finally looks smooth. So this is not just AI identifying the horizontal and vertical boundaries, but also how we apply a smooth transition so that you stay within the boundary. And while fitting the boundary, you also want to zoom at the right time.

Zooming is a very important part in this context, because the viewport can be vertical or horizontal, but the source can be any aspect ratio as well. It can be a square video, it can be vertical or horizontal. When you are fitting a vertical viewport to a horizontal video, you are viewing just a very small portion, so you want to make sure you are locked onto the right object in the video.
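The "don't change the view too fast" constraint is, at its simplest, a smoothing filter on the detected subject position. Here is a sketch using exponential smoothing; the per-frame detections are an assumed input from some object tracker, and the filter itself is a stand-in for whatever Firework actually uses.

```python
# Smooth per-frame subject x-centers so the crop window glides instead
# of jerking when the detector's output jumps. Detections are assumed input.

def smooth_centers(centers, alpha=0.2):
    """Exponential smoothing: smaller alpha = calmer virtual camera."""
    smoothed = [float(centers[0])]
    for c in centers[1:]:
        smoothed.append(smoothed[-1] + alpha * (c - smoothed[-1]))
    return smoothed

# The detector jumps from x=100 to x=300; the crop center moves gradually.
raw = [100, 100, 300, 300, 300]
track = smooth_centers(raw)
```

The crop window is then centered on the smoothed track, so an abrupt detector jump becomes a gentle pan rather than a cut.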

Mark: That's right. Wow. Amazing. So the video pipeline has a process where you're performing all of this analysis and intelligently altering the viewport. After you assemble the video stream, do you then send that to the video encoder, or is there some other step before it goes to encoding?

Wu-Hsi: This is a great question, because many people feel like, wow, how do you even produce that video? They think it needs to be recorded in a special mode. But the beautiful thing about this technology is that we can make all existing video just work, because after all, it can be one video. We have a special project where it can be multiple video streams combined into one viewing experience, but in the normal mode we apply a special transcoding to the video, so that pixels you will never get to watch in this immersive mode can be masked out to save bandwidth. And because at certain angles you will zoom the video up to 300%, we do need to upsample, so that at every angle you consume this video, it gives you a satisfactory viewing experience.

Mark: So I can imagine that with some videos you would almost run out of resolution. Let's say, for the sake of discussion, a 720p vertical video; 720p is of course usually measured as lines, but let's stay with that. So I've got 1280 by 720 pixels, and because it's vertical, I have 720 pixels wide. If I'm thinking about it correctly, I now rotate the phone horizontally; well, now I have to stretch 720 pixels across the width of the display. Are you doing some sort of scaling function? Because on some really high-resolution screens, like an iPhone 13 Pro, for example, there are a lot more pixels there. So what are you doing?

Wu-Hsi: So Mark, you are not only a good musician but also a good mathematician; you are spot on. Normally, resolution is just a number. There is adaptive streaming that we are also using, especially on live streams: because users have different data connections, we will serve different bandwidths. That is already a given.

But in this context, normally a 540p-optimized video can provide a reasonable floor for watching a vertical video source in a vertical viewport. In this special mode, where you are fitting a horizontal video canvas into a vertical viewport, you need at least double that, because you are zooming in at 200 to 300%.

So if you're zooming 300%, you'd better have 1080p, because otherwise, after the zoom, your 540p becomes far less than half. But we cannot just serve 1080p to every phone.
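The arithmetic behind this can be made explicit. Assuming an N% zoom displays 100/N of the source's scanlines (a back-of-envelope model, not Firework's actual ladder logic), the effective resolution after zooming is a simple division:

```python
# Back-of-envelope: how many source scanlines survive an N% zoom,
# assuming an N% zoom displays 100/N of the source.

def effective_lines(source_height, zoom_pct):
    return int(source_height * 100 / zoom_pct)

assert effective_lines(540, 300) == 180    # 540p at 300% zoom: 180 lines left
assert effective_lines(1080, 200) == 540   # 1080p at 200% zoom: 540p-equivalent
```

So a 300% zoom on a 540p source leaves far fewer lines than a phone display wants, which is why the source has to be over-provisioned and then masked so the extra pixels don't blow up the bitrate.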

So there will be a mask where we crop out, so that some parts of the video are blacked out. That is an important way for us to encode this video and keep the bitrate in a reasonable range, while the user sees the pixels that matter.

Mark: Yeah. Amazing. And you're dealing right now with files, not live streams, or can you do this with live?

Wu-Hsi: Oh yeah, absolutely. You can do that for a live stream. The only difference is what I mentioned, the pre-transcoded pixel removal; some of that may not be performed in time for a live stream, but we have been testing that.

So this technology can be applied on top of any video format. It can be pre-recorded, pre-transcoded video, where we can do a bit more than with live video.
But for a live event, we are even thinking about having two video streams, and we can still apply this immersive mode in real time.

Mark: Amazing. Wow. By the way, what codecs do you support? Does your platform stream only H.264, or are you supporting HEVC? On the ingest side, I imagine you have to support HEVC, and of course H.264.

(Chpt 5 Mainstream for mobile is HEVC 33:19)

Wu-Hsi: Not just those. I think any video format out there in the industry, we will be asked for by a client. Right now, the mainstream for web is H.264 and the mainstream for mobile is HEVC. And as we were saying, most cell phones now are capable of recording HDR video and all that. As well, we need to generate GIFs, animated images.

So that, say, when our clients want to send out email, they can attach them, not only in Gmail but in other email clients. Any video format we know of is something we are using right now.

Mark: Incredible. Yeah. Incredible. Wow. So you obviously support files, which is where you started, and you now support live. Is this all HTTP-based live streaming? Are you doing any WebRTC? What are your plans there?

Wu-Hsi: Yeah, so those are details where many platforms provide options, like IVS Player or Agora, and we are exploring some of them. Sometimes we feel a platform can give us a little more, but then it requires more engineering cost to adopt. So for now, we are still exploring what the good long-term solution is for us. Right now we begin with our own HLS transcoder, but we are also using some options out there, like IVS Player, and we are looking at Agora. All those options, I think, are not a secret to the industry.

Mark: Yeah, exactly. And really, here is where my question is ultimately headed. The reason I ask about WebRTC is that I'm curious how important it is to get to ultra-low latency, below a couple hundred milliseconds end to end, for example, which obviously you can only do with WebRTC. Is that something that is important, does it open up any opportunities for you, or does it not really make a difference if it's delayed 15 seconds?

Wu-Hsi: We are not streaming music events or sports events; in those two contexts, the client obviously wants very low latency. However, there is a special use case where we want low latency. Think about connecting the online and offline experience. This is a metaverse imagination.


There is a physical store, and inside the physical store you have a live stream shopping showcase event, right there. A user sees the stream and wants to walk into the store to see what is going on. So there can be some audience online and some audience offline at the same time. When these two timelines are off by a huge latency, we cannot talk about how to connect the two timelines together.

So that's an aspect we are thinking about. In the end, the metaverse is not just VR, or connecting the digital world and the real world. It's about thinking of anything, including a website, as having a timeline and a space, and asking how we navigate it and how we design a great user experience, whether you are walking in the physical world or the digital world.

Whether it is video or music, one of the beautiful things is that it has a timeline. The timeline drives the music or the video, and this intersection of time and space is why we think Firework is eventually also a metaverse company, thinking about the connection between the digital world and the real world. My past research at MIT was all about augmented reality and virtual reality.

And my first project at Firework was that immersive mode. Now I feel like, wow, I actually connected a VR experience to a video watching experience that was considered very passive, a user watching video like a zombie, only moving their finger. I converted that watch mode into something a little more active, where you are flipping your phone while you watch. But I think my vision for the next step is more aggressive: thinking about continuity between the digital shopping world and the offline shopping world.

Mark: Well, Wu-Hsi, thank you very much for coming on inside The VideoVerse and sharing all that you're building. It's amazing. And I know the listeners really got a lot out of our interview. So thank you.

Wu-Hsi: Thank you, Mark. Yeah. Thank you for having me here.