Internet Engineering Task Force                 Audio-Video Transport WG
INTERNET-DRAFT                                 C. Bormann / Univ. Bremen
                                                        L. Cline / Intel
                                                      G. Deisher / Intel
                                                       T. Gardos / Intel
                                                     C. Maciocco / Intel
                                                       D. Newell / Intel
                                                   J. Ott / Univ. Bremen
                                                   S. Wenger / TU Berlin
                                                          C. Zhu / Intel


               RTP Payload Format for the 1998 Version of
                    ITU-T Rec. H.263 Video (H.263+)


Status of This Memo

This document is an Internet-Draft.  Internet-Drafts are working 
documents of the Internet Engineering Task Force (IETF), its areas, and 
its working groups.  Note that other groups may also distribute working 
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months 
and may be updated, replaced, or made obsolete by other documents at any 
time.  It is inappropriate to use Internet-Drafts as reference material 
or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow 
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), 
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or 
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.


1. Introduction

This document specifies an RTP payload header format applicable to the 
transport of video streams generated according to the 1998 version of 
ITU-T Recommendation H.263.

The 1998 version of ITU-T Recommendation H.263 added numerous coding 
options to improve codec performance over the 1996 version.  The 1998 
version is referred to as H.263+ in this document.  Among the new 
options, the ones with the biggest impact on the RTP payload are the 
slice structured mode (SS), independent segment decoding mode (ISD), and 
the scalability mode.  This section summarizes the impact of these new 
coding options on packetization.  Refer to [4] for more information on 
coding options.

Slice structure was added to H.263+ for three purposes: to provide 
enhanced error resilience capability, to make the bitstream more 
amenable to use with an underlying packet transport such as RTP, and to 
minimize video delay.  The slice structured mode supports fragmentation 
at macroblock boundaries.

When the independent segment decoding option is employed, a video 
picture frame is broken into segments and encoded in such a way that 
each segment is independently decodable.  Utilizing ISD in a lossy 
network environment helps prevent the propagation of errors from one 
segment of the picture to others.

H.263+ also includes bitstream scalability as an optional coding mode.  
Three kinds of scalability are defined: temporal, signal-to-noise ratio 
(SNR), and spatial scalability.  Temporal scalability is achieved via 
the disposable nature of bi-directionally predicted frames, or B-frames.   
SNR scalability permits refinement of encoded video frames, thereby 
improving the quality (or SNR).  Spatial scalability is similar to SNR 
scalability except the refinement layer is twice the size of the base 
layer in the horizontal dimension, vertical dimension, or both.


2. Usage of RTP

When transmitting H.263+ video streams over the Internet, the output of 
the encoder can be packetized directly.  All bits of the bitstream, 
including both the fixed length codes and the variable length codes, 
are included in the packet.

For H.263+ bitstreams coded with temporal, spatial, or SNR scalability, 
each layer may be transported to a different network address.  More 
specifically, each layer may use a unique IP address and port 
combination.  In addition, temporal relations between layers shall be 
expressed using the RTP timestamp so that they can be synchronized at 
the receiving ends in multicast or unicast applications.

The H.263+ video streams will be carried as payload data within RTP 
packets.  A new payload header, the H.263+ payload header, is defined 
in section 4.  This section defines the usage of the RTP fixed header 
and the H.263+ video packet structure.


2.1 RTP Header Usage

Each RTP packet starts with a fixed RTP header.  The following fields of 
the RTP fixed header are used for H.263+ video streams:

Marker bit (M bit): The Marker bit of the RTP header is set to 1 when 
the current packet carries the end of the current frame, and is 0 
otherwise.

Payload Type (PT): The Payload Type shall specify the H.263+ video 
payload format.  A dynamic payload type can be used initially until a 
static payload type is assigned.

Timestamp: The RTP Timestamp encodes the sampling instant of the first 
video frame contained in the RTP data packet.  The RTP timestamp may be 
the same on successive packets if a video frame occupies more than one 
packet.  In a multilayer scenario, all pictures corresponding to the 
same temporal reference should carry the same timestamp.  If temporal 
scalability is used and B-frames are present, the timestamp may not be 
monotonically increasing in the video stream.  If B-frames are 
transmitted on a separate layer and address, they must be synchronized 
properly with the reference frames.  Please refer to the 1998 ITU 
Recommendation for H.263 [4] for information on required transmission 
order to a decoder.  For an H.263+ video stream, the RTP timestamp is 
based on a 90 kHz clock, the same as that of the RTP payload for H.261 
streams [5].
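
As an illustration only, the following Python sketch derives an RTP 
timestamp from a frame index using the 90 kHz clock.  The function name, 
the frame_rate parameter, and the base offset are assumptions made for 
this example; the wrap at 32 bits reflects the width of the RTP 
timestamp field defined in [1].

```python
from fractions import Fraction

RTP_CLOCK_HZ = 90_000  # clock rate stated above for H.263+ video


def rtp_timestamp(frame_index: int, frame_rate: Fraction, base: int = 0) -> int:
    """Return the 32-bit RTP timestamp for a frame (hypothetical helper).

    frame_index: number of frame intervals since the first sampled frame.
    frame_rate:  frames per second, as an exact fraction.
    base:        initial timestamp offset chosen by the sender (assumed).
    """
    ticks = round(frame_index * RTP_CLOCK_HZ / frame_rate)
    return (base + ticks) % (1 << 32)
```

With a frame rate of 30000/1001 Hz, successive frames are 3003 ticks 
apart.  All packets of one frame would carry the same value.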


2.2 Video Packet Structure

An H.263+ compressed bitstream is carried as a payload within each RTP 
packet.  For each RTP packet, the RTP header is followed by an H.263+ 
payload header, which is followed by a standard H.263+ compressed 
bitstream.  The size of the H.263+ payload header is variable depending 
on the payload involved as detailed in section 4.  The layout of the 
RTP H.263+ video packet is shown as:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |    RTP Header                                               ...
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |    H.263+ Payload Header                                    ...
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |    H.263+ Compressed Data Stream                            ...
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
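
As an illustration only, the following Python sketch splits the RTP 
payload (the bytes that follow the RTP header) into the H.263+ payload 
header and the compressed data.  It relies on the field layout given in 
section 4, where PLEN occupies the upper five bits of the second byte.  
The function name split_payload is hypothetical.

```python
from typing import Tuple


def split_payload(payload: bytes) -> Tuple[bytes, bytes]:
    """Split an RTP payload into (payload header, compressed data).

    Hypothetical helper: the payload header is one 32-bit word plus
    PLEN bytes of picture header copy, per the layout in section 4.
    """
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit payload header")
    plen = payload[1] >> 3          # PLEN: upper 5 bits of the second byte
    header_len = 4 + plen           # fixed word + picture header copy
    if len(payload) < header_len:
        raise ValueError("payload shorter than PLEN indicates")
    return payload[:header_len], payload[header_len:]
```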


3. Design Considerations

The goal of this payload format is to specify an efficient way of 
encapsulating an H.263+ standard compliant bitstream and enhance the 
resiliency towards packet losses.  Due to the large number of different 
possible coding schemes in H.263+, a copy of the picture header with 
configuration information is inserted into the payload header when 
appropriate.

There are a few assumptions and constraints associated with this H.263+ 
payload header design.  The purpose of this section is to point out 
various design issues and also discuss several coding options provided 
by H.263+ that may impact the performance of network video.

. It is reasonable to assume that no single macroblock will be too large 
  to fit in a packet.

. The optional slice structured mode described in annex K of H.263+ [4]  
  enables more flexibility for packetization.  Furthermore, packets 
  based on a slice structure are also inherently more loss resilient.  
  Similar to a picture segment that begins with a GOB header, the 
  motion vector predictors in a slice are restricted to reside within 
  its boundaries.  For these reasons, the use of the slice structured 
  mode is strongly recommended for network applications.

. In non-rectangular slice structured mode, only complete slices should 
  be included in a packet.  In other words, slices should not be 
  fragmented across packets.  Optimally, a packet will contain only one 
  slice.

. When the slice structure is not applied, the insertion of a GOB header 
  in every GOB is recommended to reduce the dependency on motion vector 
  prediction across GOBs.  See section 3.3 of [6] for more information.
 
. The independent segment decoding (ISD) mode described in annex R of 
  [4] does not allow any data dependency across slice or GOB boundaries 
  in the reference picture.  It can be utilized to further improve 
  resiliency in high loss conditions.

. If ISD is used in conjunction with the slice structure, the 
  rectangular slice submode shall be enabled and the dimensions and 
  quantity of the slices present in a frame shall remain the same 
  between two intra-coded frames (I-frames).  The ISD segments may be 
  entirely intra coded from time to time to realize quick error 
  recovery without adding the latency associated with sending complete 
  I-frames.

. For resiliency, sending a full picture header for every frame is 
  recommended.  In other words, the sender should always set the 
  subfield UFEP in PLUSPTYPE to '001' in the video bitstream.

. In a multi-layer scenario, each layer can be transmitted to a 
  different network address.  The configuration of each layer such as 
  the enhancement layer number (ELNUM), reference layer number (RLNUM), 
  and scalability type should be determined at the start of the session 
  and should not change during the course of the session.


4. H.263+ Payload Header

For H.263+ video streams, each RTP packet carries only one H.263+ video 
packet.  The H.263+ payload header is always present for each H.263+ 
video packet.  The payload header has variable length.  If a picture 
header is included in the payload header, the length of the picture 
header in number of bytes is specified by PLEN.  The minimum length of 
the payload header is 32 bits, corresponding to PLEN equal to 0.

The H.263+ payload header is structured as follows:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |V=0|SBIT |EBIT |  PLEN   |PEBIT| TID | Trun  |       RR        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |1 0 0 0 0 0| picture header starting with TR, PTYPE, ...       .
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  V: 2 bits
  Version number.  Set to '00' for this payload format.
  [Ed. Note: The version control will not take effect until a draft has 
  been formally submitted to the IETF.]

  SBIT: 3 bits
  Start bit position specifies the number of bits that should be 
  ignored in the first data byte of the payload.

  EBIT: 3 bits
  End bit position indicates the number of bits that should be ignored 
  in the last data byte of the payload.

  PLEN: 5 bits
  Picture header length in number of bytes.

  PEBIT: 3 bits
  Picture header end bit position.  Indicates the number of bits that 
  should be ignored in the last byte of the picture header.

  TID: 3 bits
  Thread id.  Used only in optional video redundancy coding mode (VRC).  
  See annex N of [4].  All three bits must be set to 0 unless VRC mode 
  is applied.

  Trun: 4 bits
  Cyclic packet number.  Used only in optional VRC mode.  These bits 
  must be set to 0 unless VRC mode is applied.

  RR: 9 bits
  Reserved bits.

Notice that the TID and Trun fields are associated only with the video 
redundancy coding usage scenario derived from the reference picture 
selection mode specified in annex N of [4].  The TID and Trun bits must 
be set to 0 if VRC is not used.  The use of VRC shall be negotiated by 
external means.


4.1 Encapsulating Packet that Begins with PSC

Any packet that begins with a picture start code (PSC), i.e. the first 
packet of a picture frame, shall be encapsulated using only the first 
32-bit word of the payload header since a picture header is already 
included in the data bitstream.  In this case, PLEN shall be 0.

Here is an example of encapsulating the first packet in a frame:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |0 0|SBIT |EBIT |0 0 0 0 0|0 0 0| TID | Trun  |       RR        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | bitstream data starts with complete picture header ...        .
  +---------------------------------------------------------------+


4.2 Encapsulating Packet that Begins with GBSC or SSC

Any packet that begins with either a GOB start code (GBSC) or a slice 
start code (SSC) shall include a copy of the picture header in the 
payload header for resiliency.  PLEN shall be set to specify the length 
of the included picture header in bytes.  Hence, PLEN > 0.  The end bit 
position corresponding to the last byte of the picture header data is 
indicated by PEBIT.  Actual bitstream data shall begin on an 8-bit byte 
boundary following the payload header.

Notice that only the last six bits of the picture start code, '100000', 
are included in the payload header.  A complete H.263+ picture header 
with byte aligned picture start code can be conveniently assembled if 
needed on the receiving end by prepending the sixteen leading '0' bits.
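
As an illustration only, the following Python sketch reassembles a 
byte aligned picture header on the receiving end.  Because the copy 
starts with the last six bits of the PSC, prepending sixteen '0' bits 
amounts to prepending two zero bytes.  The function name is 
hypothetical, and the sketch assumes that payload starts at the first 
byte of the H.263+ payload header.

```python
def rebuild_picture_header(payload: bytes, plen: int) -> bytes:
    """Return the picture header copy with a complete, byte aligned PSC.

    Hypothetical helper.  The last PEBIT bits of the returned data are
    padding and should still be ignored by the caller.
    """
    if plen <= 0:
        raise ValueError("no picture header copy present (PLEN is 0)")
    copy = payload[4:4 + plen]      # bytes after the first 32-bit word
    if len(copy) < plen:
        raise ValueError("packet shorter than PLEN indicates")
    if copy[0] >> 2 != 0b100000:
        raise ValueError("copy does not start with the PSC tail '100000'")
    return b"\x00\x00" + copy       # sixteen leading '0' bits
```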

Assuming a PLEN of 9, below is an example of a packet that begins with a 
GBSC or an SSC:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |0 0|SBIT |EBIT |0 1 0 0 1|PEBIT| TID | Trun  |       RR        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |1 0 0 0 0 0| picture header starting with TR, PTYPE, ...       |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | ...                                                           |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | ...           | bitstream data begins with GBSC/SSC ...       .
  +-+-+-+-+-+-+-+-+-----------------------------------------------+


4.3 Encapsulating Follow-On Packet

When the slice structured coding option is not applied, some GOBs in 
the bitstream may be larger than the size of one packet.  Similarly, 
when the ISD option is applied, a picture segment may be larger than the 
required packet size.  The remaining fragment of a picture segment 
larger than the required packet size is termed a "follow-on" packet in 
this document.

These follow-on packets with data fragmented at the macroblock 
boundaries are not independently recoverable.  In this case, the payload 
header includes only the first 32-bit word and PLEN shall be set to 0.  
A receiver should discard any follow-on packet it receives if the 
preceding packet containing the segment header information has been 
lost.

Here is an example of a follow-on packet:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |0 0|SBIT |EBIT |0 0 0 0 0|0 0 0| TID | Trun  |       RR        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | sub-segment bitstream data ...                                .
  +---------------------------------------------------------------+

Even though they may have identical payload headers, a follow-on packet 
can be differentiated from the first packet in a frame since the data in 
a follow-on packet does not begin with a PSC.
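
As an illustration only, the following Python sketch distinguishes the 
three encapsulation cases of sections 4.1 through 4.3 on the receiving 
side.  The function names and the returned labels are hypothetical.  
The sketch assumes that payload starts at the first byte of the H.263+ 
payload header and that the 22-bit PSC consists of sixteen '0' bits 
followed by '100000'.

```python
PSC = 0b0000000000000000100000  # 22-bit picture start code
PSC_BITS = 22


def read_bits(data: bytes, bit_offset: int, count: int) -> int:
    """Read count bits starting bit_offset bits into data, MSB first."""
    end = bit_offset + count
    if end > len(data) * 8:
        raise ValueError("not enough data")
    chunk = int.from_bytes(data[bit_offset // 8:(end + 7) // 8], "big")
    drop = (-end) % 8               # unused low-order bits of the last byte
    return (chunk >> drop) & ((1 << count) - 1)


def classify(payload: bytes) -> str:
    """Label a packet as 'picture-start', 'segment-start' or 'follow-on'."""
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit payload header")
    sbit = (payload[0] >> 3) & 0x7  # bits to skip in the first data byte
    plen = payload[1] >> 3          # picture header copy length in bytes
    if plen > 0:
        return "segment-start"      # begins with GBSC or SSC (section 4.2)
    data = payload[4:]
    if len(data) * 8 >= sbit + PSC_BITS and read_bits(data, sbit, PSC_BITS) == PSC:
        return "picture-start"      # begins with PSC (section 4.1)
    return "follow-on"              # section 4.3
```

A receiver built along these lines would still need to track packet 
loss, since a follow-on packet is only usable when the preceding packet 
of the same segment has arrived.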


5. Security Considerations

RTP packets using the payload format defined in this specification are
subject to the security considerations discussed in the RTP
specification [1], and any appropriate RTP profile (for example [3]).
This implies that confidentiality of the media streams is achieved by
encryption.  Because the data compression used with this payload format
is applied end-to-end, encryption may be performed after compression so
there is no conflict between the two operations.

A potential denial-of-service threat exists for data encodings using
compression techniques that have non-uniform receiver-end computational
load.  The attacker can inject pathological datagrams into the stream
which are complex to decode and cause the receiver to be overloaded.
However, this encoding does not exhibit any significant non-uniformity.

As with any IP-based protocol, in some circumstances a receiver may be
overloaded simply by the receipt of too many packets, either desired or
undesired.  Network-layer authentication may be used to discard packets
from undesired sources, but the processing cost of the authentication
itself may be too high.  In a multicast environment, pruning of specific
sources may be implemented in future versions of IGMP and in
multicast routing protocols to allow a receiver to select which sources
are allowed to reach it.

A security review of this payload format found no additional
considerations beyond those in the RTP specification.


6. References

[1] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A 
    Transport Protocol for Real-Time Applications", RFC 1889.

[2] "Video Codec for Audiovisual Services at px64 kbits/s", ITU-T 
    Recommendation H.261, 1993.

[3] "RTP Profile for Audio and Video Conference with Minimal Control", 
    RFC 1890.

[4] "Video Coding for Low Bitrate Communication", Draft ITU-T 
    Recommendation H.263, Draft 20, September 1997.

[5] T. Turletti, C. Huitema, "RTP Payload Format for H.261 Video 
    Streams", RFC 2032.

[6] C. Zhu, "RTP Payload Format for H.263 Video Streams", RFC 2190.