<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced. 
An alternate method (rfc include) is described in the references. -->

<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY RFC5226 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5226.xml">
<!ENTITY I-D.valin-netvc-pvq PUBLIC '' "http://xml2rfc.ietf.org/public/rfc/bibxml3/reference.I-D.valin-netvc-pvq.xml"> ]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="info" docName="draft-cho-netvc-applypvq-00" ipr="trust200902">
<!-- category values: std, bcp, info, exp, and historic
ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
or pre5378Trust200902
you can add the attributes updates="NNNN" and obsoletes="NNNN" 
they will automatically be output with "(if approved)" -->


<!-- ***** FRONT MATTER ***** -->
<!-- ..................................................................... -->
<front>
<!-- The abbreviated title is used in the page header - it is only necessary if the 
full title is longer than 39 characters -->

<title>Applying PVQ Outside Daala</title>

<!-- add 'role="editor"' below for the editors if appropriate -->


<author fullname="Yushin Cho" initials="Y.C." surname="Cho">
<organization>Mozilla Corporation</organization>
<address>
<postal>
<street>331 E. Evelyn Avenue</street>
<!-- Reorder these if your country does things differently -->
<city>Mountain View</city>
<region>CA</region>
<code>94041</code>
<country>USA</country>
</postal>
<phone>+1 650 903 0800</phone>
<email>ycho@mozilla.com</email>
<!-- uri and facsimile elements may also be added -->
</address>
</author>

<date day="31" month="Oct" year="2016"/>

<!-- If the month and year are both specified and are the current ones, xml2rfc will fill 
in the current day for you. If only the current year is specified, xml2rfc will fill 
in the current day and month for you. If the year is not the current one, it is 
necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the 
purpose of calculating the expiry date).  With drafts it is normally sufficient to 
specify just the year. -->

<!-- Meta-data Declarations -->

<area>ART</area>

<workgroup>NETVC (Internet Video Codec)</workgroup>

<!-- WG name at the upperleft corner of the doc,
IETF is fine for individual submissions.  
If this element is not present, the default is "Network Working Group",
which is used by the RFC Editor as a nod to the history of the IETF. -->

<keyword>PVQ</keyword>
<keyword>Daala</keyword>

<!-- Keywords will be incorporated into HTML output
files in a meta tag but they have no effect on text or nroff
output. If you submit your draft to the RFC Editor, the
keywords will be used for the search engine. -->

<abstract>
<t>This document describes the use of Perceptual Vector Quantization (PVQ) 
outside of the Daala video codec, where PVQ was originally developed.
It discusses the issues that arise when integrating PVQ into a more traditional
video codec, AV1.</t>
</abstract>
</front>

<!-- ..................................................................... -->
<middle>

<!-- ..................................................................... -->
<section anchor="background" title="Background">
<t>Perceptual Vector Quantization (PVQ)&nbsp;<xref target="I-D.valin-netvc-pvq"/> 
has been proposed
as a quantization and coefficient coding tool for an internet video codec.
PVQ was originally developed for the Daala video codec <eref target="https://xiph.org/daala/"/>,
which performs gain-shape coding
of transform coefficients instead of the more traditional scalar quantization.
(The abbreviation PVQ originally stood for "Pyramid Vector Quantizer", as in
<xref target="I-D.valin-netvc-pvq"/>, but is now commonly expanded as "Perceptual Vector Quantization".)</t>

<t>The most distinctive feature of PVQ is the way it references a predictor.
With PVQ, we do not subtract the predictor from the input to produce a residual
that is then transformed and coded.
Instead, both the predictor and the input are transformed into the frequency domain.
PVQ then applies a reflection to both the predictor and the input such that 
the prediction vector lies on one of the coordinate axes, and codes the angle between them.
By not subtracting the predictor from the input, the gain of the predictor can be preserved
and explicitly coded,
which is one of the benefits of PVQ.
Since DC is not quantized by PVQ, the gain can be viewed as the amount of contrast in an image,
which is an important perceptual parameter.
</t>
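<t>The gain-shape decomposition and the reflection described above can be sketched as
follows. This is an illustrative sketch using a standard Householder reflection; the
function names and the sign convention are our own, not Daala's actual implementation.</t>
<figure><artwork><![CDATA[
```python
import numpy as np

def gain_shape(x):
    """Split a coefficient vector into gain (L2 norm) and shape (unit vector)."""
    g = np.linalg.norm(x)
    return g, (x / g if g > 0 else x)

def householder(r):
    """Reflection vector v such that H = I - 2*v*v^T/(v^T*v) maps the
    predictor r onto the axis of its largest-magnitude component."""
    m = int(np.argmax(np.abs(r)))
    s = -1.0 if r[m] >= 0 else 1.0   # sign chosen to avoid cancellation
    v = r.copy()
    v[m] -= s * np.linalg.norm(r)    # v = r - s*||r||*e_m
    return v, m, s

def reflect(v, x):
    """Apply the Householder reflection defined by v to x."""
    return x - (2.0 * np.dot(v, x) / np.dot(v, v)) * v
```
]]></artwork></figure>
<t>After reflecting both the transformed predictor and the transformed input with the
same v, the predictor has a single nonzero component, so the angle between input and
predictor can be coded compactly.</t>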

<t>Also, an input block of transform coefficients is split into frequency bands 
based on their spatial orientation and scale,
and each band is quantized by PVQ separately.
The 'gain' of a band, which indicates the amount of contrast in the corresponding
orientation and scale, is simply the L2 norm of the band.
The gain is non-linearly companded
and then scalar quantized and coded. 
The remaining information in the band, the 'shape',
is then defined as a point on the surface of a unit hypersphere.</t>
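<t>As a sketch, the per-band gain/shape split and the gain companding might look like
the following. The band layout, the companding exponent beta, and the step size are
illustrative placeholders, not Daala's tuned values.</t>
<figure><artwork><![CDATA[
```python
import numpy as np

def split_bands(coeffs, band_sizes):
    """Split a scanned 1-D coefficient vector into consecutive bands."""
    out, pos = [], 0
    for n in band_sizes:
        out.append(coeffs[pos:pos + n])
        pos += n
    return out

def quantize_gain(band, step, beta=1.5):
    """Gain = L2 norm; compand with a power law, then scalar-quantize.
    beta and step are illustrative, not Daala's actual parameters."""
    g = np.linalg.norm(band)
    g_hat = int(round(g ** (1.0 / beta) / step))
    shape = band / g if g > 0 else band   # point on the unit hypersphere
    return g_hat, shape
```
]]></artwork></figure>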

<t>Another benefit of PVQ is activity masking based on the gain, 
which automatically controls the quantization resolution based on the image contrast
without any signaling.
For example, in a smooth image area (i.e., low contrast and thus low gain),
the resolution of quantization increases, so fewer quantization errors are visible.
A succinct summary of the benefits of PVQ can be found in Section 2.4 of
<xref target="Terriberry_16"/>.
</t>
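<t>Activity masking can be thought of as deriving the quantization step from the
already-coded gain, so no extra signaling is needed. A minimal sketch, with a made-up
exponent alpha (Daala's actual model and parameters differ):</t>
<figure><artwork><![CDATA[
```python
def masked_step(base_step, gain, alpha=0.5):
    """Coarser quantization for high-gain (busy) bands, finer for
    low-gain (smooth) bands; alpha is an illustrative exponent only."""
    return base_step * max(gain, 1e-6) ** alpha
```
]]></artwork></figure>
<t>Because the decoder knows the coded gain, it can derive the same step size, which is
why no bits are spent signaling the resolution change.</t>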

<t>Since PVQ has so far only been used in the Daala video codec, which contains many non-traditional
design elements, there has been no opportunity to evaluate the relative coding performance of
PVQ against scalar quantization in a more traditional codec design.
We have therefore applied PVQ to the AV1 video codec, which is currently being developed 
by the Alliance for Open Media (AOM) as an open-source and royalty-free video codec.
While most of the benefits of PVQ arise from improvements in the subjective quality of video,
compression results with activity masking enabled are not yet available in this draft
because the required parameters, which were tuned for Daala, have not yet been adjusted for AV1.
The results presented here were obtained by optimizing solely for PSNR.</t>

</section>

<!-- ..................................................................... -->
<section anchor="integration" title="Integration of PVQ into a Non-Daala Codec, AV1">
<t>Adopting PVQ in AV1 requires replacing both the scalar quantization step and 
the coefficient coding of AV1 with those of PVQ.
In terms of the inputs to PVQ and the use of transforms,
as shown in <xref target="traditional_arch"/> and 
<xref target="av1_pvq"/>,
the biggest conceptual changes required in a traditional coding system, such as AV1, are:
<list style="symbols">
<t>Introduction of a transformed predictor in both the encoder and the decoder.
For this, we apply a forward transform to the predictors,
both intra-predicted pixels and inter-predicted (i.e., motion-compensated) pixels.
This is because PVQ references the predictor in the transform domain,
instead of using a pixel-domain residual as in traditional scalar quantization.</t>
<t>Absence of a difference signal (i.e., residual) defined as "input source - predictor".
Hence, AV1 with PVQ does not perform any subtraction to make the input reference the predictor; 
instead, PVQ references the predictor directly 
in the transform domain.</t>
</list>
</t>

<figure align="center" anchor="traditional_arch" title="Traditional architecture containing Quantization and Transforms">
<artwork align="center"><![CDATA[
  input X --> +-------------+                 +-------------+
              | Subtraction | --> residue --> | Transform T |
predictor --> +-------------+     signal R    +-------------+
        P                                            |
     |                                               v
     v                                              T(R)
    [+]--> decoded X                                 |
     ^                                               |
     |                                               v
     |       +-----------+    +-----------+     +-----------+
decoded  <-- | Inverse   | <--| Inverse   | <-- | Scalar    |
      R      | Transform |    | Quantizer |  |  | Quantizer |
             +-----------+    +-----------+  |  +-----------+
                                             v 
                                       +-------------+
                         bitstream  <--| Coefficient |
                         of coded T(R) |       Coder |
                                       +-------------+
]]></artwork>
</figure>


<figure align="center" anchor="av1_pvq" title="AV1 with PVQ">
<!-- <preamble>Preamble text - can be omitted or empty.</preamble> -->

<artwork align="center"><![CDATA[
            +-------------+            +-----------+ 
  input X-->| Transform T |--> T(X)--> | PVQ       | 
            |_____________|            | Quantizer |  +-------------+
                                +----> +-----------+  | PVQ         |
            +-------------+     |            |------> | Coefficient |
predictor-->| Transform T |--> T(P)          v        | Coder       |
        P   |_____________|     |      +-----------+  +-------------+
                                |      | PVQ       |        |
                                +----> | Inverse   |        v
                                       | Quantizer |    bitstream
                                       +-----------+    of coded T(X)
                                              |
              +-----------+                   v
 decoded X <--| Inverse   | <--------- dequantized T(X)
              | Transform |
              +-----------+
]]></artwork>
</figure>
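<t>The difference between the two figures can be expressed in a few lines of code.
Below is an illustrative sketch with an orthonormal 1-D DCT standing in for the block
transform T; the function names and the pass-through quantizers are hypothetical.</t>
<figure><artwork><![CDATA[
```python
import numpy as np

def dct_mat(n):
    """Orthonormal DCT-II matrix, standing in for the transform T."""
    k, j = np.arange(n)[:, None], np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * j + 1) * k / (2.0 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def decode_traditional(x, p, quant):
    """Traditional path: transform and quantize the residual X - P."""
    T = dct_mat(len(x))
    return p + T.T @ quant(T @ (x - p))

def decode_pvq(x, p, pvq_quant):
    """PVQ path: transform X and P separately; the quantizer references
    T(P) directly, with no pixel-domain subtraction."""
    T = dct_mat(len(x))
    return T.T @ pvq_quant(T @ x, T @ p)
```
]]></artwork></figure>
<t>With lossless (identity) quantizers, both paths reconstruct X exactly; the paths
differ only in where the predictor is referenced, which is precisely the change that the
integration requires.</t>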


<!--<t>List styles: 'empty', 'symbols', 'letters', 'numbers', 'hanging',
'format'.</t> -->

<section anchor="skip" title="Signaling Skip for Partition and Transform Blocks">
<t>In AV1, the skip flag of a partition block is true if all of the quantized coefficients
in the partition are zero.
The signaling of the prediction mode for a partition cannot be skipped.
If the skip flag is true with PVQ, the predicted pixels are the final decoded pixels 
(aside from frame-wise in-loop filtering such as deblocking), as in AV1, so a forward transform of the predictor
is not required.
</t>
<t>While AV1 currently defines only one 'skip' flag for each 'partition'
(a unit where prediction is done), PVQ introduces another kind of 'skip' flag,
called 'ac_dc_coded', which is defined for each transform block
(and thus for each Y'CbCr plane as well).
AV1 allows the transform size to be smaller than the partition size, so
a partition can contain multiple transform blocks.
The ac_dc_coded flag signals whether the DC and/or all of the AC coefficients are coded by PVQ
(though PVQ does not quantize the DC coefficient itself).
</t>
</section>

<!-- .......................................... -->
<section anchor="issues" title="Issues">

<t>
<list style="symbols">
<t>PVQ has its own rate-distortion optimization (RDO) that differs from
that of traditional scalar quantization.
This causes the balance of quality between luma and chroma to differ from 
that of scalar quantization.
When AV1 performs scalar quantization on a block of coefficients,
RDO, such as trellis coding, can optionally be performed.
The second pass of 2-pass encoding in AV1 currently uses trellis coding.
When doing so, it appears that a different scaling factor is applied
for each of the Y'CbCr channels.</t>

<t>In AV1, to optimize speed, there are inverse transforms that can skip 
applying certain 1D basis functions based on the distribution of quantized coefficients.
However, this is mostly not possible with PVQ, since the inverse transform is applied directly to
the dequantized input, instead of to a dequantized difference (i.e., input source - predictor) 
as in a traditional video codec. This is true for both the encoder and the decoder.</t>

<t>PVQ was originally designed for the 2D DCT,
while AV1 also uses hybrid 2D transforms consisting of 
a 1D DCT and a 1D ADST. This requires PVQ to have new coefficient scanning orders 
for the two new 2D transforms, DCT-ADST and ADST-DCT
(ADST-ADST uses the same scan order as DCT-DCT).
These new scan orders have been produced based on those of AV1,
for each PVQ-defined band of the new 2D transforms.</t>

</list>
</t>
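<t>The last item can be sketched as follows: given a full-block scan order for one of
the new 2D transforms, a per-band scan can be derived by restricting the full scan to
each band's coefficient positions while preserving its relative order. This is
illustrative only; AV1's actual scan tables are precomputed constants.</t>
<figure><artwork><![CDATA[
```python
def band_scans(full_scan, bands):
    """For each band (a collection of coefficient positions), keep the
    full-block scan's relative order restricted to that band."""
    return [[pos for pos in full_scan if pos in band]
            for band in (set(b) for b in bands)]
```
]]></artwork></figure>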


</section>

</section>

<!-- ..................................................................... -->
<section anchor="performance" title="Performance of PVQ in AV1">
<!-- ........................................... -->
<section anchor="coding_gain" title="Coding Gain">
<t>With the encoding options specified by both NETVC 
(<eref target="https://tools.ietf.org/html/draft-ietf-netvc-testing-03"/>) and
AOM testing for the high-latency case,
PVQ gives coding efficiency similar to that of AV1, as measured in PSNR BD-rate.
Again, PVQ's activity masking was not enabled for this testing.
Also, scalar quantization has matured over decades, 
while video coding with PVQ is much more recent.</t>

<t>We compare the coding efficiency on one of the IETF test sequence sets,
"objective-1-fast", defined in <eref target="https://tools.ietf.org/html/draft-ietf-netvc-testing-03"/>,
which consists of sixteen 1080p, seven 720p, and seven 640x360 sequences
of various types of content, including slow/high motion of people and objects, 
animation, computer games, and screen casting.
The encoding is done for the first 30 frames of each sequence.
The encoding options used are: 
"-end-usage=q -cq-level=x --passes=2 --good --cpu-used=0 --auto-alt-ref=2 --lag-in-frames=25 --limit=30",
which is the official IETF and AOM test condition for high-latency encoding, except for the 30-frame limit.</t>

<t>For a fair comparison, some of the lambda values used in RDO are adjusted
to match the luma/chroma quality balance of the PVQ-enabled AV1 to that of
the current AV1:

<list style="symbols">
<t>Use half the value of lambda during intra prediction for the chroma channels.</t>
<t>Scale PVQ's lambda by 0.8 for the chroma channels.</t>
<t>Do not do RDO of DC for the chroma channels.</t>
</list>
</t>

<t>
The results are shown in <xref target="gain_table"/> 
as the BD-rate change for several image quality metrics.
(The encoders used to generate these results are available from the author's git repository
<eref target="https://github.com/ycho/aom/commit/2478029a9b6d02ee2ccc9dbafe7809b5ef345814"/> and
AOM's repository <eref target="https://aomedia.googlesource.com/aom/+/59848c5c797ddb6051e88b283353c7562d3a2c24"/>.)
<!--Full comparison result is also available at
<eref target="https://arewecompressedyet.com/?r%5B%5D=
av1_pvq_p2_30f_chroma_L_0.8_2016-10-19T15-48-55.328Z
&r%5B%5D=av1_master_30f_p2_2016-10-13T15-53-37.656Z&s=objective-1-fast"/-->
</t>

<texttable anchor="gain_table" title="Coding Gain by PVQ in AV1">
<!--preamble>Comparison between AV1 and AV1 + PVQ.</preamble-->

<ttcol align="center">Metric</ttcol>
<ttcol align="center">AV1 --> AV1 + PVQ</ttcol>
<c>PSNR</c>
<c>0.10%</c>
<c>PSNR-HVS</c>
<c>0.53%</c>
<c>SSIM</c>
<c>1.27%</c>
<c>MS-SSIM</c>
<c>0.42%</c>
<c>CIEDE2000</c>
<c>-0.94%</c>
</texttable>

</section>

<!-- ........................................... -->
<section anchor="speed" title="Speed">

<t>Total encoding time increases roughly 20-fold or more when intensive RDO options,
such as "--passes=2 --good --cpu-used=0 --auto-alt-ref=2 --lag-in-frames=25", are turned on.
The biggest reason for the significant increase in encoding time is 
the additional computation required by PVQ.
PVQ tries to find asymptotically-optimal codepoints (in the RD optimization sense)
on a hypersphere with a greedy search, which has a time complexity close to O(n*n) 
for n coefficients, while scalar quantization has a time complexity of O(n).</t>
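<t>The quadratic cost can be seen in a simplified version of the search: each of K unit
pulses is placed by scanning all n positions for the best correlation with the input
shape. This is a sketch of the idea only; the actual pvq_search_rdo_double() also
accounts for rate and uses further optimizations.</t>
<figure><artwork><![CDATA[
```python
import numpy as np

def pvq_search_greedy(shape, k):
    """Place k unit pulses so that y/||y|| approximates `shape`
    (assumed nonnegative here for brevity). Each pulse scans all n
    positions: O(n*k) work, approaching O(n^2) when k grows with n."""
    n = len(shape)
    y = np.zeros(n)
    for _ in range(k):
        best_i, best_cos = 0, -1.0
        for i in range(n):
            y[i] += 1.0                                  # try a pulse here
            c = np.dot(y, shape) / np.linalg.norm(y)     # cosine with shape
            if c > best_cos:
                best_i, best_cos = i, c
            y[i] -= 1.0                                  # undo the trial
        y[best_i] += 1.0                                 # commit best pulse
    return y
```
]]></artwork></figure>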

<t>Compared to Daala, the search space for an RDO decision in AV1 is
far larger, because AV1 considers ten intra prediction modes 
and four different transforms (for the transform block sizes 4x4, 8x8, and 16x16 only),
and the transform block size can be smaller than the prediction block size. 
Since the largest transform and prediction sizes in AV1 are currently 32x32 and 64x64, respectively, 
PVQ can be called 
<!-- 10 x 4 x ((4 + 16) + 4) + // for partition sizes 4x4, 8x8, and 16x16 
  10 x ((4 + 16 + 64)  + 4 x (4 + 16 + 64)  ) --> 
approximately 5,160 times more often in AV1 than in Daala.
Also, AV1 applies the transform and quantization to each RDO candidate.</t>
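<t>The 5,160x figure comes from the arithmetic in the XML comment above; spelled out
(the grouping below reflects our reading of that comment):</t>
<figure><artwork><![CDATA[
```python
# 4x4/8x8/16x16 partitions: 10 intra modes x 4 transform types x
# the possible transform splits within the partition.
modes = 10
small_partitions = modes * 4 * ((4 + 16) + 4)
# 32x32/64x64 partitions: 10 modes x (one path plus 4 transform
# types over the smaller split sizes).
large_partitions = modes * ((4 + 16 + 64) + 4 * (4 + 16 + 64))
total = small_partitions + large_partitions
print(total)  # 5160
```
]]></artwork></figure>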

<t>As an example, AV1 calls the PVQ function 632,520 times to encode grandma_qcif (176x144)
in intra-frame mode (with an actual quantizer value of 38), 
while Daala calls it only 3,843 times; thus, PVQ is called roughly 165 times more often in AV1 than in Daala.
<xref target="speed_table"/> shows the frequency of PVQ function calls in AV1 at each speed level (mode = good).
The first column indicates the speed level,
the second column shows the number of calls to PVQ's search for each band 
(function pvq_search_rdo_double() in 
<eref target="https://github.com/ycho/aom/blob/14981eebb4a08f74182cea3c17f7361bc79cf04f/av1/encoder/pvq_encoder.c#L84"/>),
and the third column shows the number of calls to PVQ's encoding of a whole transform block 
(function od_pvq_encode() in 
<eref target="https://github.com/ycho/aom/blob/14981eebb4a08f74182cea3c17f7361bc79cf04f/av1/encoder/pvq_encoder.c#L763"/>).
A smaller speed level gives slower encoding but better quality at the same rate
by doing more RDO optimization.
The major difference between speed levels 4 and 3 is that level 3 allows a transform block size 
smaller than the prediction (i.e., partition) block size.</t>


<texttable anchor="speed_table" title="Number of Calls to PVQ in AV1">
<ttcol align="center">Speed Level</ttcol>
<ttcol align="center"># of calls to PVQ search for a band</ttcol>
<ttcol align="center"># of calls to PVQ encode</ttcol>
<c>5</c>
<c>365,913</c>
<c>26,786</c>
<c>4</c>
<c>472,222</c>
<c>56,980</c>
<c>3</c>
<c>3,680,366</c>
<c>564,724</c>
<c>2</c>
<c>3,680,366</c>
<c>564,724</c>
<c>1</c>
<c>3,990,327</c>
<c>580,566</c>
<c>0</c>
<c>4,109,113</c>
<c>632,520</c>
</texttable>

</section>
</section>

<!-- ..................................................................... -->
<section anchor="future_work" title="Future Work">
<t>Possible future work includes:

<list style="symbols">
<t>Enabling activity masking, which also requires an HVS-tuned quantization matrix (band-wise QP scalers).</t>
<t>Adjusting the balance between luma and chroma quality, probably in a perceptually driven way.</t>
<t>Optimizing the speed of the PVQ code, including adding SIMD.</t>
<t>Performing RDO with more model-driven decision making, instead of a full transform + quantization.</t>
</list>
</t>
</section>


<!-- ..................................................................... -->
<section anchor="repository" title="Development Repository">
<t>The ongoing work of integrating PVQ into the AV1 video codec is located in 
the git repository <eref target="https://github.com/ycho/aom/tree/av1_pvq"/>.</t>

</section>

<!-- ..................................................................... -->
<section anchor="Acknowledgements" title="Acknowledgements">
<t>Thanks to Tim Terriberry for his proofreading and valuable comments.
Also thanks to Guillaume Martres for his contributions to integrating PVQ into AV1 
during his internship at Mozilla, and to Thomas Daede for providing and maintaining 
the testing infrastructure via the www.arewecompressedyet.com (AWCY) web site
<eref target="https://arewecompressedyet.com/"/>.</t>

</section>

<!-- Possibly a 'Contributors' section ... -->

<section anchor="IANA" title="IANA Considerations">
<t>This memo includes no request to IANA.</t>
</section>

</middle>


<!--  *****BACK MATTER ***** -->
<!-- ..................................................................... -->
<back>
<!-- References split into informative and normative -->

<!-- There are 2 ways to insert reference entries from the citation libraries:
1. define an ENTITY at the top, and use "ampersand character"RFC2629; here (as shown)
2. simply use a PI "less than character"?rfc include="reference.RFC.2119.xml"?> here
(for I-Ds: include="reference.I-D.narten-iana-considerations-rfc2434bis.xml")

Both are cited textually in the same manner: by using xref elements.
If you use the PI option, xml2rfc will, by default, try to find included files in the same
directory as the including file. You can also define the XML_LIBRARY environment variable
with a value containing a set of directories to search.  These can be either in the local
filing system or remote ones accessed by http (http://domain/dir/... ).-->

<!-- ..................................................................... -->
<references title="Informative References">
<!-- Here we use entities that we defined at the beginning. -->
&I-D.valin-netvc-pvq;

<reference anchor="Perceptual-VQ" target="https://arxiv.org/pdf/1602.05209v1.pdf">
<front>
<title>Perceptual Vector Quantization for Video Coding</title>
<author initials="JM." surname="Valin" fullname=""><organization/></author>
<author initials="TB." surname="Terriberry" fullname=""><organization/></author>
<date month="February" year="2015" />
</front>
<seriesInfo name="Proceedings of SPIE Visual Information Processing and Communication" value=""/>
</reference>

<reference anchor="PVQ-demo" target="https://people.xiph.org/~jm/daala/pvq_demo/">
<front>
<title>Daala: Perceptual Vector Quantization (PVQ)</title>
<author initials="JM." surname="Valin" fullname=""><organization/></author>
<date month="November" year="2014" />
</front>
</reference>

<reference anchor="Terriberry_16" target="https://arxiv.org/pdf/1610.02488.pdf">
<front>
<title>Perceptually-Driven Video Coding with the Daala Video Codec</title>
<author initials="TB." surname="Terriberry" fullname="Tim Terriberry"><organization/></author>
<date month="September" year="2016" />
</front>
<seriesInfo name="Proceedings SPIE Volume 9971, Applications of Digital Image Processing XXXIX" value=""/>
</reference>

</references>

<!-- ..................................................................... -->
<!--
<section anchor="app-additional" title="Additional Stuff">
<t>This becomes an Appendix.</t>
</section>
-->
</back>
</rfc>
