The VideoVerse

TVV EP 26 - Debargha Mukherjee: A Peek Into The AV2 Future

Visionular Season 1 Episode 26

Back for a second time on TheVideoVerse, Debargha Mukherjee, Principal Engineer at Google, discusses the upcoming AV2 project, touted as the successor to the popular and powerful AV1 codec. In this podcast, Debargha talks about the advent of the AV2 project and its goals. He delves into the specialized tools newly introduced in AV2, the enhancements to existing AV1 tools, and the resultant performance improvements. As a bonus, Debargha also gives us a glimpse into the use of AI in video compression.

Zoe:  Hi, everyone. Thanks for coming back to our "The Video Verse" podcast. And so, I'm Zoe from Visionular. And then for this episode, I actually have my colleague, Krishna, joining me from Bangalore as a co-host. Hi, Krishna.

Krishna: Hi, Zoe. Nice to be here.

Zoe: Yeah. Okay. Thank you for joining me. And so for this episode, we actually invited Debargha Mukherjee from Google back to our "The Video Verse" series. Actually, Debargha talked with us for the very first episode of our podcast series, and we are very, very happy and excited to have him back. Hi, Debargha, do you want to say something? I still want you to make an intro, even though maybe quite a few of our audience are already familiar with you, because your episode, at this moment, stays as the most viewed out of all our 23, maybe at this moment already 26, episodes that we have published. So would you give an intro to our new audience? We are really excited to have you back. Yeah?

Debargha: Yeah, thank you, Zoe and Krishna, for inviting me to this podcast for the second time. Yeah, so just for everybody's information, so I'm a principal engineer in Google and I have been with Google for about 14 years.

Zoe: Oh, that long? Okay.

Debargha: Yeah. And most of my time in Google, I have been working on video compression. I started off in YouTube for the YouTube transcoding pipeline and then went on to work on VP9, then AV1. And currently, we are looking at the next-gen codec AVM from AOM. Yeah, which probably will release sometime in the next few years, yeah. So I think that's what we have been doing.

Zoe: So you mentioned that the project name is AVM, right? Can you just -- what's the full name for that?

Debargha: AVM is the name of the software that we actually have been looking at for our next generation codec. So AOM Video Model, that's what it stands for. AOMedia is the Alliance for Open Media, that's the industry consortium that was set up in 2015 to work on video codecs. And if you know, the first video codec from AOM was called AV1 and that was released in 2018. And from 2020 or 2021 onwards, we have been looking at the next-gen AOM codec, and we are very much in the middle of that project and it's pretty much a work in progress. There's no timeline that has been decided yet but-

Zoe: Okay.

Debargha: We are making progress towards the next-gen codec. So AV1, as you remember, the full name is actually Alliance for Open Media Video 1.

Zoe: Okay, got it. So definitely, I believe that we're going to talk about it, because you mentioned that and then, so let's talk a little bit about AVM because AV1 was finalized in 2018. So right now it has been almost five years and-

Debargha: Six.

Zoe: To many of us it still feels like a new standard, so it's still being deployed, and we all notice that more and more platforms are starting to support it. For example, one of the biggest announcements from Apple last year was that the iPhone 15 already has a hardware AV1 decoder in place in the new phone models. So, since this episode is mostly about AVM, what is the purpose of AVM and what is the overall picture? How will it be different from AV1?

Debargha: Well AVM, whenever AVM matures, so we don't know the name of the new codec yet. I mean, as you know in the video area, codecs evolve every 7 to 10 years. So that's what our plan is. Like, let's say if AV1 was the first codec from AOM, released in 2018, it is about time we started looking towards the next-gen codec, which will probably take at least another one or two years or even more depending on the target.

And the goal currently is just to get a next generation level of compression, so to get more compression. But we are also very careful about complexity on the decoder side because we know that, at least for the generation that we are thinking of, video codecs would still be dependent on ASICs, and those ASICs will have to be supported on all mobile chipsets with a pretty limited silicon area and all that. So I think with those constraints, we cannot have very complex tools on the decoder side.

So with that, and that is one thing that we have been focusing on a lot, unlike some of the other efforts that are also looking at next-gen codecs. So, we have been looking at codecs or decoders that can actually be deployed very easily on different mobile chipsets and a whole bunch of devices. So I think that's what makes our work in AVM kind of challenging. I think if we could use much more complex decoder-side tools, we could have had more gains easily, but then that would have been much harder to deploy in practice.

And I think in the consortium, we have been working very closely with many other companies, most of whom have a very big presence in online video delivery. And I think all of them actually buy that concept. So we need to have a very simple decoder, like the decoder should not be increasing too much beyond what AV1 already has, but we want to make it much more compact for sure, and we also need to support more features and actually cover a whole range of applications. I think most of the companies are not too worried about a five to 10% difference in coding efficiency, but more about features and decoder-side complexity.

Krishna:  So to just catch the point that you made, when you talk about applications, typically when we've seen the past 15, 20 years of codec evolution, everybody talks about higher resolutions, right? AVC was able to do 1080p well, and then you had HEVC, which would say it's 4K and then 8K, and AV1's going in this direction. But when we look at the industry, at least from a B2B point of view, the majority of the consumption is actually taking place on mobile phones, vertical videos, social media, people uploading from their mobile phones. A lot of encoding and decoding is taking place on probably lower end devices. So when you're designing a new codec, when you're designing tools for a new codec, how do you kind of cater to this entire spectrum? Yes, there is the world of the very large screen TVs, but there's also a huge, huge industry which is probably constrained to a mobile phone.

Debargha: That's a good question. I think it really boils down to the common test conditions and the test sets that are created at the beginning of the process. So at the beginning of the AVM development process, I think we went through almost a year of figuring out what test conditions to use and what videos and how many videos for each resolution and all of that.

I think that kind of sets the stage for what the codec is going to be. So let's say in AV2, I think we have quite a few videos that are 1080p, some videos that are 4K, but much fewer at lower than 1080p, let's say 720p, 360p, 480p, those are much fewer. So automatically when a tool is proposed, I think we would be picking as adopted candidates the tools which do better at, let's say, 1080p or 4K resolution, because you have more videos there and the overall stats are kind of biased more towards those videos.

So unlike the AV1 development effort where we had all different resolutions in different mixes, I think in the AV2 case, the focus definitely is more on higher resolutions. It's not that we have 8K videos in the test set, we don't, but we have quite a few 4K and a lot of 1080p. Now, one problem of course is when a codec is being developed, at that time, the encoder is often not optimized very well.

It actually takes a very long time to even do day-to-day tests on 1080p or 4K videos. That's a necessary evil, I would say. Just now we have started actually focusing more on reducing the encoder complexity also, because it has become really hard to run tests as we develop tools.

Krishna: I can imagine.

Debargha: But I think, but in our previous experience, I think the encoder side complexity usually drops pretty fast when people start looking at ways of optimizing the algorithms. Yeah, so we are not too worried about that, but I think the decoder side complexity is more of an issue because we have strict limits on how much decoder complexity we would afford in the new codec.

Krishna: This is very interesting.

Zoe:  So you talked about the test set, and Krishna just mentioned the different resolutions. From another perspective, we wonder about the test set you chose, because that actually represents the applications the new standard will target. So what kinds of content, like gaming, outdoor sports? We'd like to have an idea of the distribution of the different content categories in the test set.

Debargha: So I think we have the 4K set, which we call our A1 set, which has 4K videos that are really high resolution, high quality.

Then in the 1080p area, we have two sets. One set is natural content. The other set, which we call V1, is basically a mix of partly graphics, not really screen content, but graphics content, gaming content and all of that, I think.

And some of those videos are actually pretty hard to compress and they take a very long time to compress. So in addition to that, we have two HD sets. One is for natural content, there are about 19, 20 videos, and then the V1 set, which is this graphics content, has about 12 videos, I think.

And then there's also another set which is only for screen content, that is called the V2 set. And that is really like information captured from the screen, let's say presentations or just windows moving and things like that. So screen content and remote graphics is actually one of the applications that we are targeting a lot for AV2. So I think that's why we have all of these sets that we have to run our tests on for every tool that is proposed, yeah.

Zoe: I see. So basically, for those different kinds of content, we believe different tools are being developed. For example, you did address quite some graphics content, right, for 4K and 1080p in different categories, and there's also a special category addressing screen content. We all know that AV1 actually featured screen content coding tools. So what is new to be expected in AVM, regarding graphics and screen content?

Debargha:  So all of those tools have been improved, I think. So first for screen content, let's say, there has been quite a bit of effort in actually improving the identity transform from AV1.

AV1 has identity transform, which is also there in AV2 but the way we do the entropy coding of identity transform has actually been improved quite a bit and that actually helps screen content a lot.

Then we have several other tools like BAWP and all of that. Those have been tuned in a way that actually does much better on the V2 set. So if you look overall at the screen content set, I think the coding efficiency improvement has actually been quite a bit more than for natural content so far.

I think in the natural content right now, maybe we are maybe 24, 25% better than AV1. I think for screen content, let's say V2 set is probably much higher than 30% already.

I think there's quite a bit of improvement that has happened in screen content. It's not just completely new tools, but a combination of improving the entropy coding of the identity transform, improving some of the loop filtering tools, and adding tools like BAWP that seem to work much better for screen content.

Intra block copy is also another area that has been improved. Like we can do intra block copy now for inter frames, which we didn't do before. And I think it's still a work in progress. I think there is still more room for improving that.

Zoe: Okay, so just like you mentioned, IBC is basically intra block copy, and you also mentioned another acronym which is... B...

Debargha:  Oh, BAWP, block-adaptive weighted prediction, yeah.
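As a rough illustration of the general idea behind a block-adaptive weighted prediction tool: the motion-compensated prediction for a block is scaled and offset, with the weight and offset adapted per block. The sketch below is hypothetical; it assumes the parameters are derived by a least-squares fit between reconstructed samples neighboring the current block and the corresponding samples neighboring the reference block, and it does not reproduce the actual AVM derivation, signaling, or clipping rules.

```python
import numpy as np

def bawp_predict(mc_pred, ref_template, recon_template):
    """Toy sketch of a block-adaptive weighted prediction.

    Hypothetical illustration only: the per-block weight `w` and offset `o`
    are fit by least squares between the reconstructed samples neighboring
    the current block (recon_template) and the corresponding samples
    neighboring the reference block (ref_template), then applied to the
    motion-compensated prediction `mc_pred`.
    """
    x = ref_template.astype(np.float64).ravel()
    y = recon_template.astype(np.float64).ravel()
    w, o = np.polyfit(x, y, 1)          # fit y ~ w * x + o over the template
    p = w * mc_pred.astype(np.float64) + o
    return np.clip(np.rint(p), 0, 255).astype(np.uint8)
```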

Zoe:  Okay. Got it. So basically, I was really impressed by the gains already, especially for screen content. And do you actually expect or observe similar gains on graphics content as opposed to screen content? Is there actually some correlation between these kinds of computer generated content?

Debargha: There is some. So as I mentioned, V1 and V2 are two sets with very different characteristics. V1 is more like graphics content, like gaming content, which is synthetic content but not really screen content. And then there's the V2 set, which is screen content.

So the V2 set has good gains. The V1 set is actually roughly similar to natural content. Yeah, so the V1 set, which is the synthetic content, its characteristics are not exactly the same as natural content, but the gains that we have so far are similar. Yeah.

Zoe: That means these are two quite different kinds of content, and then the gains, even for the natural content you mentioned, right? They achieve about 24-

Debargha: About 24, 25%.

Zoe: And then, we actually expect that eventually a new standard can be announced as a successor to its predecessor. And 25%, we believe, is already significant. So then what further gains are expected before it's actually finally released?

Debargha: Yeah, so among the companies that are working on AVM, I think there is a wide range of expectations for the next product. Like from the Google side, I think we want at least 40% improvement perceptually which may actually translate to somewhat lower, let's say 30, 35% in PSNR terms.

But our main goal, our objective, is to have at least a 40% improvement in bit rate, compared to AV1, in perceptual terms. Now other companies may have different goals. I know for sure Apple's goal for compression is actually lower.

So I think they'll be happy with 25 to 30%, let's say. And different companies have different goals, but I think, we still have some work to do for at least maybe one, a little bit more than a year to see if we can agree on a timeline for release of the new codec.

Krishna: So along the same lines, Debargha, you talked about screen content coding. I think the other significant tool in AV1 was the film grain compression. So are you also seeing similar goals or similar improvements on that side?

Debargha: So the film grain one is a little bit controversial, I would say, for AV1, because it was a non-conformant point. And I think a lot of people think that was not the right decision for AV1, because a lot of the implementations are not really tested.

And it's not clear whether they have been implemented correctly or not, because it's kind of out of loop and outside this non-conformant point.

I think that is being rethought.

And I think in the same vein, in AVM we have also been looking at debanding techniques. And again, nothing has been decided yet, but there is quite a bit of discussion and argument that happens in those committees as to whether we need to have something in the codec or it's enough to depend on the display pipeline outside the codec that automatically does debanding and things like that.

And the same argument comes up for film grain also, whether it's needed. Is it actually useful to have something in the codec or not?

I think right now we are at this phase in AVM development where we still need to get coding gains, maybe a good five to 10% in PSNR terms. But there is also some effort that has started, other than hardware simplification, on the high-level syntax and-

Krishna: Okay.

Debargha: And how this film grain or debanding, how those things would be, would come into the picture, if at all. So I think some discussions have started on that. They're still quite preliminary. I mean, AVM is a work in progress and we hope that once we are ready to release a new codec, I think everybody will be happy with the solution for all of them.

Zoe: All right, so just now you mentioned screen content, and there are several coding tools being developed for natural content to achieve that 24, 25%. Can you roughly mention, especially for someone who has some background knowledge about codec standards and coding tools, which kinds of coding tools, for example, bring in how much percentage out of that 24, 25%?

Debargha: Well, to be honest, because we have these very stringent decoder hardware requirements, I think lately all tools that are being adopted have pretty small gains.

But if I look at the history of how this 25% came about, a large part is from, let's say, the inter coding side, where we have a lot of tools that actually improve the compound modes. The compound modes are modes where we predict a frame from two reference frames, yeah.
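To make the compound-mode idea concrete, here is a minimal sketch: the prediction for a block is a blend of two motion-compensated blocks, one fetched from each reference frame. The function below only shows the simplest weighted average; actual AV1/AVM compound modes include distance-weighted, wedge, and difference-weighted blends that are not reproduced here.

```python
import numpy as np

def compound_predict(ref0_block, ref1_block, w0=0.5):
    """Minimal sketch of compound (two-reference) prediction.

    ref0_block and ref1_block are the motion-compensated blocks taken from
    the two reference frames for the current block; the prediction is just
    their weighted average here.
    """
    p = w0 * ref0_block.astype(np.float64) + (1.0 - w0) * ref1_block.astype(np.float64)
    return np.clip(np.rint(p), 0, 255).astype(np.uint8)
```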

For compound modes, we have something called temporal interpolated picture (TIP) modes. So for any frame that you're coding, you may try to interpolate the motion fields from the two neighboring reference frames and create sort of like an intermediate frame. And that intermediate frame is then used as a predictor for the blocks in the current frame. So that mode is one tool that has quite good gains; it was worked on originally by Apple and then many other companies, and it's a tool that has good gains.

Another tool that has good gains, and by good, I mean about 2% or so, is optical flow and different variations of optical flow. So optical flow is sort of like compound but it uses optical flow constraints, which the decoder also applies, to kind of refine the motion field, okay. And that has shown pretty good gains and there have been different variations of that.

Some of the decoder-side refinement tools, that includes optical flow as well as some kind of motion search on the decoder side, a combination of those two have pretty good gains. And then a third item that I want to mention is warp modes. I think AV1 had some basic warp modes. Those have been improved substantially in AVM so far. And with the combination of warp tools, warp modes are now more mainstream. They're not an afterthought as they were in AV1. It's still the case in AVM that most of the modes are using translation, but warp modes are more mainstream and they're used more often. So those have also helped a lot. These are the main gains I can think of on the inter coding side.

One thing I mentioned is this BAWP, block-adaptive weighted prediction, then improvements in CFL, or chroma from luma tools; I think all of those have also added up. Together they may have more than 1% or so gains. Then on the intra side, there have been quite a bit of improvements in intra secondary transforms and all of that. So secondary transforms are being used very often for intra modes. For inter modes, they're also being used now, but there are fewer gains for inter modes. But again, these are all a work in progress.

Then, one area that I should mention is partitioning. So block partitioning is a big area which has very good gains. The block partition scheme has become much more flexible compared to AV1, and we have pretty good gains from that for intra frames as well as inter frames. In addition to that, we also have this concept of semi-decoupled partitioning, which means that the luma and the chroma trees for intra frames, for all-intra frames, can actually diverge after a certain point. And these seem to have good gains for all-intra coding. Recently, there has been work going on on extending those concepts to inter frames, also, yeah.
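As a rough sketch of the semi-decoupled partitioning idea described above: the chroma tree mirrors the luma split decisions for large blocks, and below a certain size the two trees are allowed to diverge. The block-size threshold, the node representation, and the divergence rule here are all illustrative assumptions, not the actual AVM rules.

```python
def build_chroma_tree(luma_node, decouple_size=32):
    """Toy sketch of semi-decoupled luma/chroma partitioning (all-intra).

    A luma node is assumed to be a dict such as
      {"size": 64, "split": [c0, c1, c2, c3]}  or  {"size": 16, "split": None}.
    Chroma follows the luma splits above `decouple_size`; at or below it,
    chroma stops following luma and would code its own partition decisions
    (marked here with "independent": True).
    """
    if luma_node["size"] > decouple_size and luma_node["split"] is not None:
        # Large blocks: chroma follows the luma tree exactly.
        return {"size": luma_node["size"],
                "split": [build_chroma_tree(c, decouple_size)
                          for c in luma_node["split"]]}
    # Below the decoupling point the chroma tree is free to diverge.
    return {"size": luma_node["size"], "split": None, "independent": True}
```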

So all these areas on partitioning: partitioning is one area where it seems like there's always gain to be had, and it's one area where the decoder side isn't much of an issue, but more of an issue is the encoder-side decisions, because it can take a lot of computation to figure out what's the best partition to use, yeah. So we are trying to use ML-based partitioning determination tools on the encoder now, but I think partitioning is one area where we have quite good gains and we expect to have even more gains in the next year or so.

Krishna: Fantastic. Debargha, I have one question about the last word that you used, ML. So here we are doing a lot of work which involves AI, ML, convolutional neural networks. So, just a generic, almost philosophical question: with more companies, more encoder companies looking at using AI and ML, and actually trying to deploy on GPUs, how do you take all of this into consideration during the tool selection process when building the codec?

Debargha: So in the age of AI that we are in, it's hard to ignore AI and ML in codecs. And early on, we had quite a bit of discussion to see exactly how much ML we can have in a mainstream codec. So I think in academia there has been lots of work on ML-based coding tools and all that.

For image coding, I think it has been fairly established that with full ML, learning-based image codecs, you can get substantially better coding efficiency than, let's say, a standard image codec. Even compared with the latest codecs like VVC, you can get better than that. For the video case, I think it's less clear whether it's going to be substantially better than, let's say, AVM or VVC, or not. But one area where I think it is easiest to get gains is this hybrid case where you have a standard video codec but then you just replace, let's say, the in-loop filtering pipeline with a neural network model, let's say a CNN-based model and all that.

And it also has been shown that the larger that model, the more gains you can have. So you can get eight to 10% easily if you have a very large model. But one thing that we have to remember is we are thinking of a codec that is going to be released in the next one to two years. And within that timeframe, it's hard for us, and I think it has been discussed between Nvidia, Apple and Google and other companies also, that it's unlikely that we'll be able to use, let's say, an existing GPU on a device to offload some of the tasks. So for the next generation, the decoder still has to be a self-contained ASIC.

Now if you have that constraint in mind, then you have to think of what's the silicon area for that ASIC, where the inference may happen. And it turns out that even a very small ML model needs a silicon area comparable to the entire decoder.

So let's say AV1, for instance: the entire AV1 decoder probably requires eight to 10 million gates. Now if you have a simple in-loop filtering model, not a very complex one, a simple one that may give two, 3% gain, let's say three, 4% gain maybe, then immediately that takes up almost as many gates as you had for the whole of AV1. So you double your silicon area but you get only 3% or 4% gain.
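A back-of-the-envelope version of that trade-off, using only the rough numbers quoted in the conversation (they are illustrative, not measured figures):

```python
# Illustrative numbers from the discussion above.
av1_decoder_gates = 9e6      # entire AV1 decoder: roughly 8-10 million gates
small_ml_filter_gates = 9e6  # even a small CNN loop filter needs a comparable area
ml_filter_gain = 0.035       # ~3-4% coding gain from such a filter

avm_gate_budget = 2 * av1_decoder_gates  # AVM target: stay within ~2x of AV1...
avm_target_gain = 0.30                   # ...while delivering ~30% gain overall

# Gain delivered per extra gate, for the ML filter vs. what the target requires.
print(ml_filter_gain / small_ml_filter_gates)                   # ~3.9e-09
print(avm_target_gain / (avm_gate_budget - av1_decoder_gates))  # ~3.3e-08
# The small ML filter would consume the entire extra-silicon budget for only
# 3-4% gain, which is why it doesn't fit this generation's constraints.
```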

Now for AVM, our goal is to keep the decoder area to only 2X of AV1, but to get, let's say, 30% gain. So the math doesn't work out. So I think that's the problem that everybody agrees on, at least for the next-gen codec. Now if you're talking about a codec that evolves five years or 10 years down the road, then the situation may be different.

But for the time horizon that we are looking at, I think ML seems increasingly unlikely, but we are looking at different tools in AVM for ML also, again focusing on in-loop filtering. We're looking at very small networks, let's say 600 MACs per pixel. MACs means multiply-accumulate operations. So 600 MACs per pixel is a very, very tiny ML model. Most other works in the literature may be 600K MACs per pixel, and that will give you maybe seven, 8% gain. But we are looking at 600 MACs per pixel and trying to see if we can get 1.5 to 2% gain from that tool; then it becomes a similar trade-off as a conventional tool and then it is possible to implement. So we are looking at that, but it's not clear if we'll be successful, yeah.
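For a sense of scale, here is a hypothetical network that fits roughly within a 600 MACs-per-pixel budget, together with the MAC arithmetic. This is not the network studied in the AVM focus group, just an illustration of how little a few hundred multiply-accumulates per pixel buys: a handful of narrow 3x3 convolutions predicting a correction that is added back to the reconstructed frame.

```python
import torch
import torch.nn as nn

class TinyLoopFilter(nn.Module):
    """Hypothetical ~600 MAC/pixel residual loop-filter CNN (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 6, 3, padding=1),   # 1*6*9  =  54 MACs/pixel
            nn.ReLU(inplace=True),
            nn.Conv2d(6, 8, 3, padding=1),   # 6*8*9  = 432 MACs/pixel
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, 3, padding=1),   # 8*1*9  =  72 MACs/pixel
        )                                    # total  ~ 558 MACs/pixel

    def forward(self, recon):
        # Predict a small correction and add it to the reconstructed frame.
        return recon + self.body(recon)

def conv_macs_per_pixel(channels, kernel=3):
    """MACs/pixel for a chain of 2D convs with the given channel widths."""
    return sum(cin * cout * kernel * kernel
               for cin, cout in zip(channels[:-1], channels[1:]))

print(conv_macs_per_pixel([1, 6, 8, 1]))  # 558, versus ~600000 for the large models
```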

Zoe: Well, in summary of what you just mentioned: first, the compression standard still follows the traditional two-dimensional transform plus motion compensation framework, because we can't at this moment afford an end-to-end neural network, since the model is too big. But even within the traditional framework, as you just mentioned, even a very small model that gains two to 3% takes a lot of decoder-side ASIC area, and that's a constraint, right. So you mentioned right now you can only afford a self-contained ASIC for supporting the decoder. And then under this constraint, right now, I just want to confirm.

Right now, there are no machine learning tools that have been accepted yet by AVM, even though you are actually looking into the post filter using a much smaller model; that is still being investigated or maybe under proposal, and has not been accepted at this moment.

Debargha: So at this time, there is a focus group in AVM; focus groups are sort of like an area where some interested companies actually work together to develop a technology. So we are working in a focus group on these tools. Yeah, so I think Google and Amazon have some work that we are doing together to make a practical ML tool possible. But yeah, nothing has been adopted as a candidate for AVM yet.

But it's still ongoing.

Zoe: Now hopefully this new filter using a much smaller model could potentially, as you mentioned, gain 1.5 to 2%, and it could potentially be accepted.

And then once it's accepted, it would become the first machine learning tool.

Debargha: Yes, there is a marketing element definitely. I think if the AVM timeline happens in one or two years, if we're able to release that codec, then it may very well be the first video codec that has a proper ML tool, yeah.

Zoe:  Got it. So just now you mentioned all these inter and intra coding tools, and a few of them contribute to the current gains. There are two specific things we'd like our audience to get an idea of, because the intra tools eventually will basically lead to a new image codec standard.

So we want to learn, for example, compared to AV1 intra, which is actually the foundation for AVIF, the image coding format: what kind of gains, at this moment, does AVM have for the intra-only mode compared to AV1? This is the first thing. The second one is the low-delay case. We all know that low delay means you only use forward prediction without any backward prediction, addressing extremely low latency, which nowadays a lot of people are talking about: low-delay live streaming and real-time communication interactions, like what we are doing right now. So we'd like to have an idea, out of that 24, 25%, what are the gains for intra, and what are the gains for just the low-delay case with only I-frames and P-frames included?

Debargha: So for the all-intra case, I think the gain is, I don't remember the number off the top of my head, but it's lower, a little less than 20%. I think 17 to 20%, if I remember correctly. So that's the all-intra number right now, and for low delay, it's actually a little less than 25, because in the regular random access case, as I mentioned, the compound modes have been improved a lot, and in low delay, we cannot use those compound modes.

I think the gain is in the low twenties, maybe 20, 21, around that, yeah. But I think random access is still the biggest gain, 24, 25, as I mentioned. And then all-intra is a little less. But one thing about all-intra I would like to be a little bit careful about: for the all-intra case, the encoder-side design actually matters a lot, and you need to have these encoder-side bells and whistles to really make the image look good. Like you need to have adaptive quantization and things like that done in a much better way. So just using the CTC test conditions for all-intra is actually further away from what one would normally deploy in practice.

Now for the all-intra case, there is another focus group on the all-intra mode for AV2. And there is quite a bit of push on supporting higher bit depths, like up to 16-bit depth, alpha channel and things like that. So I think this is something that is going to happen sometime later this year. So in terms of coding efficiency, it's around 17, 18 percent, but I think people are more interested in having more functionality in AV2 for image codecs.

Zoe: You mentioned that for low delay, the compound mode cannot be utilized. In our understanding, the compound mode usually involves a combination of two reference frames, because you also mentioned a few tools where you can actually create a virtual interpolated frame as a reference. We just wonder whether just using two forward frames can also be used to create a compound mode.

Debargha: So there, we have compound modes with same-side references which are both in the past, but then the effectiveness is much lower. If you have both future and past frames in the reference buffers, then many of these tools work much better. Let's say optical flow or decoder-side motion refinement, or even the TIP mode, which cannot be used at all in that case; I think it only applies when you have two references, one in the past and one in the future. So because you cannot use many of these high-performance tools very effectively in the low-delay case, the gains are lower.

Zoe: Got it. So basically for low delay, it's mainly that the compound modes don't help much.

Debargha: Yes, yeah. Yeah.

Zoe: And you also mentioned that because of all these constraints, especially the stricter constraint on the decoder side, quite some complicated tools cannot be included in the new standard, at least for a standard that's anticipated to be released in the near future, let's say one to two years, depending on the final target. As Krishna also mentioned, people talk about AI, and on the other side the platforms, even the devices involved, may also evolve at a faster pace in the near future. So we'd like to learn, for those very complicated tools that have not been able to make it into this standard, what kind of technologies do you see that have quite a lot of potential once the complexity constraints can be relaxed? For example, you did mention that machine learning with a big model would be able to give maybe eight to 10%. So what other technology-

Debargha: So ML is definitely one. ML filtering is, I think, the biggest possible gain that we can see. ML also comes in prediction modes, like the intra prediction modes can use ML. There has been some work that we had done in the past on ML-based inter and intra prediction modes. Those can also contribute maybe one, 2%, but the biggest chunk for ML-based tools would come from in-loop filtering. Now in terms of other tools which are too complex, there's a whole bunch of tools that people can develop using templates, template matching, and there are many such tools that are actually being explored in MPEG, for instance, which in total maybe give four or 5% gain, different kinds of template matching.

But we have stayed away from those kinds of tools because it's pretty hard for decoders. I mean, none of the decoder hardware people actually like those tools. So we have stayed away from that and tried to focus more on tools that are actually possible to implement in practice. Then, talking of inter prediction tools, parsing dependency is very, very important for hardware. If you are reconstructing a motion vector using a complex process and then that motion vector is used as a reference for decoding the next block, that is an example of a parsing dependency.

Now if you ignored the parsing dependencies, then you could get maybe two, 3% gain in various ways. But because of decoder constraints, we don't allow that. So there are a whole bunch of inter coding pipeline issues on the decoder that you have to be aware of, and because of that we cannot use certain tools that we could have otherwise used, yeah. So these are some examples. Another one is overall bandwidth: when you do motion compensation, the overall memory bandwidth is something that hardware is usually very sensitive to.

If you have to access a lot of blocks to get the prediction for one block, that is not really acceptable. So you have to be careful as to how many blocks, how many motion compensations you have to do to get the prediction for one block. So there could be some very complicated OBMC kind of techniques, which require you to access a lot of blocks to get the prediction for one block, but those would not be something that hardware would be happy about, yeah. So these are some examples that I can think of.
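To see why small blocks and compound prediction worry hardware designers, here is the standard reference-bandwidth arithmetic, assuming an 8-tap separable interpolation filter as in AV1 (the block sizes and numbers are just illustrative):

```python
def ref_samples_per_pixel(w, h, taps=8, refs=1):
    """Reference samples fetched per predicted pixel for motion compensation.

    Assumes a separable `taps`-tap interpolation filter, so a w x h
    prediction needs (w + taps - 1) x (h + taps - 1) reference samples per
    reference frame; `refs=2` models a compound (two-reference) block.
    """
    return refs * (w + taps - 1) * (h + taps - 1) / (w * h)

for w, h in [(4, 4), (8, 8), (16, 16), (64, 64)]:
    print(f"{w}x{h}: single {ref_samples_per_pixel(w, h):.1f}, "
          f"compound {ref_samples_per_pixel(w, h, refs=2):.1f} samples/pixel")
# 4x4:   single 7.6, compound 15.1  -> small compound blocks are bandwidth-hungry
# 64x64: single 1.2, compound  2.5
```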

Krishna: Along the same lines of complexity, do you see any new parallelism topics coming out? Like going beyond frame level or slices and such, any new innovations on that side?

Debargha: Well, frame parallelism was there in AV1 and it probably will stay in AV2. In terms of within-block parallelism, not so much, but there's a hardware subgroup now that looks at every tool to see how much they can do in parallel. So at that level, all tools are being scrutinized, but I don't see any innovation in that sense. I mean, we have not done anything that actually makes things much more parallelizable than before, yeah.

Zoe: Basically, out of the discussions, we wonder, for example, what the high-level syntax of this new standard may consider. Let's say Apple has also rolled out the Vision Pro, and I actually put the box right there. And people have started to talk about multi-view HEVC, even though HEVC has been there for quite a while, since 2013, more than 10 years. And we have learned that AV1 may have an appendix to support multi-view.

For this kind of high-level syntax, we're beyond the traditional 2D videos. So what other kinds of features? Because you mentioned that, for the new codec, one side is definitely still focused on coding efficiency, while the other side is actually considering new applications, new use cases. We'd like to know, within the scope of this new codec standard, what specific other new use cases will be considered and covered?

Debargha: I think you already nailed the main one, multi-view, and I think that's because Apple is very involved in this process. But the high-level syntax work in AVM is just about starting. So if you ask me six months from today, I'll have a better answer for that.

Zoe: Okay, so we'll invite you back to our episode again for the third time.

Debargha:  Yeah, so there have been some proposals made in AVM from Apple on supporting multi-view and different frame packing modes for stereo and all of that. Some of those will become part of the AV1 and AV2 spec, but it's too early to say anything at this point because there is a lot of push and pull, and some people think you do not need to have those in the elementary stream and all of that. So yeah, it's still too early to say, but definitely there is a motivation to support multi-view and Vision Pro kind of applications. Yeah.

Zoe: I see. So what others, for example, there could be some other use cases where the-

Debargha: Alpha channel is one I mentioned, alpha channel for a video. Higher bit depth could be another; again, no decision has been made, but there is an argument to support higher bit depth at least for the all-intra mode. Now, alpha channel is actually probably going to happen no matter what. For higher bit depth, it's not clear yet. Then there's also been talk of a better lossless coding mode for video.

Zoe: Okay.

Debargha: So that's another one, for medical images or for very high quality video archival. So these are some of the techniques being talked about, some of the tools that we may be able to support in AVM, and it's mostly a matter of playing with the high-level syntax. But again, if you ask me maybe later this year, I'll have a better understanding as to what actually is happening and what is going to be adopted.

Zoe: Okay, I want to still ask one little bit along this line, because Krishna mentioned the GPU and the parallelism, and we still wonder, because this time the test set also includes more special content, right, screen content, computer generated content. Along that line, will this new standard, AVM, will AV2 ever consider combining traditional compression with some synthesis on the decoder side?

Debargha: Combining traditional compression with what, sorry?

Zoe: Some synthesis, for example, you just provide just like the film grain.

So you only provide a model that's specified by the standard, and then the encoder only needs to estimate and transmit a limited set of parameters to the decoder side, and then the decoder synthesizes and reconstructs some portion of the signal and adds it back to get the final reconstructed video.

Debargha: So I think the debanding, as I mentioned, and film grain, those are examples of cases where you're using the bitstream to transport some information about the source, but to make it kind of optional; the decoder may or may not be able to actually use it. So to make it optional, you have to have that outside the conformance point.

It's sort of like: if the decoder is able to use the information, it can add film grain using the information in the bitstream, but if it cannot, it's not the end of the world. So that kind of model has been talked about even for debanding. Banding is a big problem in all online videos, especially when the bit depth drops.

Now, to remove banding, all TVs have complicated pipelines to handle banding anyway. The question is whether there is any additional benefit one can have by signaling information about the source in the bitstream. There's a focus group on debanding that was operational for a while in AVM, and there are some proponents who wanted to have some debanding tools, but then other people thought that they need to actually demonstrate that the existing pipelines are not sufficient for this purpose. So there is still some wrangling going on.

There is no decision that has been made. Of course, the hardware teams don't want to do these things because they already have other stages in the pipeline after the decoder that handle some of this. Film grain is actually a much stronger case because there's an artistic intent in film grain, and if the codec doesn't remove it, then you have to spend quite a lot of bits. But if you do remove it, then the artistic intent is gone.

There is a need to put the same characteristics of film grain back in at the decoder, and that was there in AV1, which was a good starting point. But for AV2, it is my opinion that it can be a lot better, but we have to make sure that it's actually represented the right way so it is easier for people to implement, and then there have to be conformance points established after the film grain has been put back in, so that we know that all hardware implementations are actually doing the right thing and the results are predictable, yeah.
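As a toy illustration of the out-of-loop model being discussed: the bitstream carries a few grain parameters, and a decoder that supports the step adds synthesized grain after decoding, while one that does not can still display the decoded frame. The sketch below is deliberately simplified; AV1's actual film grain synthesis signals an autoregressive grain pattern with piecewise-linear, intensity-dependent scaling, none of which is reproduced here.

```python
import numpy as np

def apply_film_grain(decoded_luma, strength, seed=0):
    """Toy out-of-loop film grain synthesis (illustrative only).

    `strength` and `seed` stand in for parameters signaled in the bitstream.
    A decoder that cannot run this step simply displays `decoded_luma`
    unchanged, which is what keeps the tool outside the conformance point.
    """
    rng = np.random.default_rng(seed)
    grain = rng.standard_normal(decoded_luma.shape) * strength
    out = decoded_luma.astype(np.float64) + grain
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```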

Zoe:  I see. Actually, I was thinking really wildly, out of the box, because right now on devices, again, not only GPUs but NPUs, there could be more things available out of the box. So we're just wondering whether the standard would ever consider having some parameters transmitted, maybe as SEIs or something at the high level, to guide some image or video content to be recreated and added back to the compressed version, yeah?

Debargha:  Yeah, it could happen. I think AVM is a work in progress. I think some of these ML tools, I think it is possible to use the NPUs that already exist and most of the time they may be idle. But like as of now, I haven't seen anything concrete in AVM yet. I think, Yeah.

Zoe:  Got it. I just want to be aware of the time, but then again back to one question, because I was really surprised to realize that you have been with Google for 14 years. And I know that prior to that, we were colleagues back at HP. It seems like your whole career so far has been between these two companies.

Debargha:  That's right.

Zoe:  And I'm just wondering, and I think a lot of the audience would also ask, you have been working in compression for this many years, and you're still working on it because this new standard is coming up. And by now, everybody should know that you have already been elected an IEEE Fellow. So you have been doing quite some research, on standardization but also beyond. So we'd like to have an idea, also combining your own professional pursuits: how do you anticipate not just AVM but the future of video compression? Because that also relates to your career; if there is still a lot that can be done, I believe you will stay in this area for quite some years down the road.

Debargha:   I think that's a good question. It seems like, if I stay in compression, or if we stay in the area of compression, the next generation beyond AVM will probably use some form of ML no matter what. It doesn't have to be full end-to-end ML-based video coding; I think it'll be some kind of a hybrid combination. So if you think of a codec that evolves in 2030 or 2032 or around that time, then it's very likely it's going to have a big ML component in it, but it may also have some conventional components.

It's basically sort of like a hybrid conventional-ML tool. And I think that's where the industry is going to go. But on the other hand, in the last 15, 20 years, people have been increasing resolutions, and with increasing resolutions, the applications have increased. But if you look at it, 1080p or HD to 4K was a big jump and people were usually happy, but let's say 4K to 8K, perceptually, is less of an impact. And when you think of 8K to 16K and things like that, there is a certain limit on how much humans can perceive. So there is something that tells me that, well, compression may actually not be as important as it was maybe 10 years back.

If, let's say, AVM has 25, 30% better compression than AV1, then the next generation maybe gets another 30%. But it also comes at a cost of complexity on the encoder side and decoder side and all that. So the equation may not work out, and perceptually, it may not give you that much of a gain. So it's not clear to me whether the run that we have had in video compression since internet video took off, like 10, 15 years back, is going to continue for another 10 or 20 years. So I think we probably need some better applications, bigger applications, which require a lot of data to be pushed through limited bandwidth. And then I think compression will get a new life.

Zoe:  I believe, as we have discussed, there are actually two sides: on one side, new applications drive the need for the technologies; on the other side, technologies like ML and big models create the possibility for new applications to emerge, and these two things will mutually influence each other.

Debargha:  Yeah.

Zoe:  Alright. We are really happy to have you here. The most exciting thing for me, actually for us, is that there's the potential that we'll definitely invite you back, say in six months or down the road, when this new codec standard is close to its finalization. And then I think everybody will be excited to see how it happens. Of course, at the same time, on the MPEG side, there are also ongoing codec standardization activities.

And the new applications will be there and the technology will keep proceeding. So it's always an exciting time for the video world, because video will still take the large majority of traffic, in terms of the percentage of formats in communications.

And thanks very much for coming back, Debargha, and we're looking forward to having you again for the third time.

Debargha:  Yeah, thank you, Zoe and Krishna. Yeah, thanks for inviting me and it was a pleasure to be here and present what I know about AVM and where it's going.

Zoe:  Yeah, you shared with us a lot.

Debargha:  This was a great chat, yep. It was a great chat.

Zoe:  Yes, I learned a lot actually, especially those details, and I was actually surprised and impressed by, for example, the special categories; as I understand it, more than 30% has already been gained for screen content. So there's a lot we can all look forward to together. And thank you both again, and thanks to everyone listening to this episode. We'll see you for the next one. Thank you.

Debargha:  Thank you.
