Improved chroma prediction

Modern video coding standards such as Thor form predictions for the luma channel (Y) and the chroma channels (U and V), which are encoded separately (in that order). The prediction for each channel has spatial or temporal dependencies only within its own channel. Most of the perceived information in a video is found in the luma channel, but correlations remain between the luma and chroma channels. For instance, the same shape of an object can often be seen in all three channels, and if this correlation is not exploited, some structural information will be transmitted three times. Thor attempts to improve the chroma prediction by finding a linear relationship between each of the initial chroma predictions and the luma prediction and, if certain criteria are satisfied, using that relationship to form a new prediction based on the reconstructed luma samples.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The improved predictions are derived from the reconstructed luma samples using a mapping. The underlying assumption is that the colours can be identified by their luminosities. Informally, a new chroma prediction is formed from the reconstructed luma block painted with the colours of the initial chroma prediction. There is often a linear correlation between the luma and chroma channels, so that a chroma sample c can be expressed by the linear function

   c = a*y + b

where y is the corresponding luma sample. This observation has previously been used in techniques to convert YUV 4:2:0 and YUV 4:2:2 images to YUV 4:4:4, and in a (rejected) proposal for HEVC as a special intra mode. Thor, however, generalises the prediction, so it does not depend on the coding mode (i.e. whether inter or intra, or the kind of inter/intra mode). Since it would be too costly to transmit the values a and b in the linear mapping, and since both the encoder and decoder must be able to compute identical predictions, a and b are derived from data available to both using linear regression.
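As an informal illustration of the idea, the following Python sketch fits a and b by ordinary least squares on a handful of predicted samples and then applies the mapping to reconstructed luma samples. It uses floating point for readability and is not the normative integer procedure; the function name fit_line and the sample values are made up for this example.

```python
def fit_line(luma, chroma):
    """Ordinary least-squares fit of chroma = a * luma + b over paired samples."""
    n = len(luma)
    mean_y = sum(luma) / n
    mean_c = sum(chroma) / n
    ss_yy = sum((y - mean_y) ** 2 for y in luma)
    ss_yc = sum((y - mean_y) * (c - mean_c) for y, c in zip(luma, chroma))
    if ss_yy == 0:
        # Flat luma carries no information about chroma; fall back to the mean.
        return 0.0, mean_c
    a = ss_yc / ss_yy
    b = mean_c - a * mean_y
    return a, b


# Hypothetical initial predictions that happen to satisfy c = -0.5*y + 178.
pred_luma = [50, 100, 150, 200]
pred_chroma = [153, 128, 103, 78]
a, b = fit_line(pred_luma, pred_chroma)

# The mapping is applied to the reconstructed luma, not to the predicted luma.
recon_luma = [60, 90, 160, 210]
new_chroma = [round(a * y + b) for y in recon_luma]
```

Because a and b are computed only from the predictions, which the decoder also has, nothing extra needs to be signalled.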

Since the assumption that the correlation is the same in the predicted block and in the reconstructed block is not always true, the new prediction from luma might not be better even when there is a very good correlation in the predicted block. Therefore, we can only expect an improvement if the initial prediction is bad, and the luma residual is used as an estimate for this. The initial chroma prediction is kept unless the average squared difference between the reconstructed luma samples yr and the predicted luma samples y for an N*N prediction block is above 64:

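   SUM(i=0..N-1, j=0..N-1) (yr(i, j) - y(i, j))^2  >  64 * N*N

The encoder and decoder must compute a and b using the same least-squares fit for an N*N prediction block, where y and c denote the luma and chroma samples in the initial prediction and every sum runs over i, j = 0..N-1:

   Ysum  = SUM y(i, j)
   Csum  = SUM c(i, j)
   YYsum = SUM y(i, j) * y(i, j)
   CCsum = SUM c(i, j) * c(i, j)
   YCsum = SUM y(i, j) * c(i, j)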
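The luma residual test and the accumulations needed for the fit might be sketched in Python as follows. This is only an illustration: the function names are hypothetical, blocks are assumed to be square lists of rows holding integer samples, and the five accumulators follow the names Ysum, Csum, YYsum, CCsum and YCsum used in this document.

```python
def luma_residual_is_large(y_pred, y_rec, threshold=64):
    """Return True when the mean squared luma prediction error exceeds threshold.

    The sum of squared differences is compared with threshold * N*N,
    which avoids a division.
    """
    n = len(y_pred)
    ssd = sum((y_rec[i][j] - y_pred[i][j]) ** 2
              for i in range(n) for j in range(n))
    return ssd > threshold * n * n


def regression_sums(y_pred, c_pred):
    """Accumulate Ysum, Csum, YYsum, CCsum and YCsum over an N*N block."""
    n = len(y_pred)
    ysum = csum = yysum = ccsum = ycsum = 0
    for i in range(n):
        for j in range(n):
            y, c = y_pred[i][j], c_pred[i][j]
            ysum += y
            csum += c
            yysum += y * y
            ccsum += c * c
            ycsum += y * c
    return ysum, csum, yysum, ccsum, ycsum


# Small made-up 2x2 blocks, for illustration only.
y_pred = [[10, 20], [30, 40]]
c_pred = [[1, 2], [3, 4]]
bad_rec = [[20, 30], [40, 50]]    # every sample is off by 10
good_rec = [[12, 22], [32, 42]]   # every sample is off by 2
sums = regression_sums(y_pred, c_pred)
```

With these inputs the first reconstruction passes the threshold test (400 > 256) and the second does not (16 <= 256), so only the first block would go on to the regression step.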

These sums will all be contained within a 32 bit signed integer. Then the following must be computed using 64 bit arithmetic:

   SSyy = YYsum - ((Ysum * Ysum) >> 2*log2(N))
   SScc = CCsum - ((Csum * Csum) >> 2*log2(N))
   SSyc = YCsum - ((Ysum * Csum) >> 2*log2(N))

Still using 64 bit arithmetic, if

   SSyy > 0 /\ 2 * SSyc * SSyc > SSyy * SScc

then it is assumed that the correlation is reasonably good (that is, the squared correlation coefficient SSyc^2 / (SSyy * SScc) exceeds 1/2) and a new prediction will be computed and used. Otherwise, the initial prediction will be kept. First, a and b must be computed:

   a = (SSyc << 16) / SSyy
   b = ((Csum << 16) - a * Ysum) >> 2*log2(N)

The final operations are performed with 32 bit arithmetic, so a must be clipped to [-2^23, 2^23] and b must be clipped to [-2^31, 2^31-1]. Then a new chroma prediction c' is computed using the reconstructed luma samples yr, a and b, and a clipping function saturating the results to an 8 bit value:

   c'(i, j) = clip((a*yr(i, j) + b) >> 16)

The above assumes 4:4:4 format. For the 4:2:0 format the predicted luma block must be subsampled first:

   y'(i, j) = (y(2*i, 2*j) + y(2*i+1, 2*j) +
               y(2*i, 2*j+1) + y(2*i+1, 2*j+1) + 2) >> 2

The resulting new chroma prediction must also be subsampled. The clipping is performed before the subsampling.

   c'(i, j) = (clip((a*yr(2*i, 2*j) + b) >> 16) +
               clip((a*yr(2*i+1, 2*j) + b) >> 16) +
               clip((a*yr(2*i, 2*j+1) + b) >> 16) +
               clip((a*yr(2*i+1, 2*j+1) + b) >> 16) + 2) >> 2

In intra mode the chroma prediction improvement must be performed right after each transform, since the new chroma reconstruction will be used to predict the next block.
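The integer procedure as a whole might be sketched in Python as below. This is a non-normative illustration under several assumptions: the cross term is taken as YCsum - ((Ysum * Csum) >> 2*log2(N)) and the correlation test as 2 * SSyc^2 > SSyy * SScc, as a least-squares derivation suggests; a is taken as (SSyc << 16) / SSyy with division truncating toward zero; right shifts are arithmetic; and a and b are both computed before either is clipped. All function names are hypothetical.

```python
def clip8(v):
    """Saturate to the 8 bit sample range."""
    return max(0, min(255, v))


def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def fit_fixed_point(ysum, csum, yysum, ccsum, ycsum, log2n):
    """Return (a, b) with 16 fractional bits, or None to keep the initial prediction."""
    shift = 2 * log2n
    ssyy = yysum - ((ysum * ysum) >> shift)
    sscc = ccsum - ((csum * csum) >> shift)
    ssyc = ycsum - ((ysum * csum) >> shift)   # assumed form of the cross term
    if not (ssyy > 0 and 2 * ssyc * ssyc > ssyy * sscc):
        return None
    num = ssyc << 16
    # Assumption: truncating division as in C (Python's // would floor instead).
    a = abs(num) // ssyy if num >= 0 else -(abs(num) // ssyy)
    b = ((csum << 16) - a * ysum) >> shift
    return clamp(a, -(1 << 23), 1 << 23), clamp(b, -(1 << 31), (1 << 31) - 1)


def predict_444(y_rec, a, b):
    """c'(i, j) = clip((a*yr(i, j) + b) >> 16) for every reconstructed luma sample."""
    return [[clip8((a * y + b) >> 16) for y in row] for row in y_rec]


def downsample_2x2(block):
    """Rounded average of each 2x2 group; usable for predicted luma or for c'."""
    half = len(block) // 2
    return [[(block[2 * i][2 * j] + block[2 * i + 1][2 * j] +
              block[2 * i][2 * j + 1] + block[2 * i + 1][2 * j + 1] + 2) >> 2
             for j in range(half)] for i in range(half)]


def predict_420(y_rec, a, b):
    """Clip at full resolution first, then subsample the clipped prediction."""
    return downsample_2x2(predict_444(y_rec, a, b))


# Sums of a made-up 2x2 block (N = 2) with y = 50, 100, 150, 200 and c = -0.5*y + 178.
params = fit_fixed_point(500, 462, 75000, 56486, 51500, log2n=1)
y_rec = [[60, 90], [160, 210]]
c444 = predict_444(y_rec, *params)
c420 = predict_420(y_rec, *params)

# A constant chroma prediction has SScc = SSyc = 0, so the initial prediction is kept.
flat = fit_fixed_point(500, 512, 75000, 65536, 64000, log2n=1)
```

For 4:2:0 input, downsample_2x2 would also be applied to the predicted luma block before the sums are accumulated, so that luma and chroma samples pair up one to one.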

The improved chroma prediction may significantly improve the compression efficiency for images or video containing high correlations between the channels. It is particularly useful for encoding screen content, 4:4:4 content, high frequency content and "difficult" content where traditional prediction techniques perform poorly. Little quality change is seen for content not in these categories, but there is a general small increase in chroma PSNR. An encoder configured for low delay and medium complexity was used for the following results. The numbers have been computed using the Bjontegaard Delta Rate (BDR). The rates for Y, U and V are shown separately.

This document has no IANA considerations yet. TBD

This document has no security considerations yet. TBD

The author would like to thank Arild Fuldseth and Mo Zanaty for reviewing this document and design.