Internet Engineering Task Force                C. Bormann / Univ. Bremen
Audio-Video Transport WG                                L. Cline / Intel
INTERNET-DRAFT                                        G. Deisher / Intel
                                                       T. Gardos / Intel
                                                     C. Maciocco / Intel
                                                       D. Newell / Intel
                                                   J. Ott / Univ. Bremen
                                                   S. Wenger / TU Berlin
                                                          C. Zhu / Intel

               RTP Payload Format for the 1998 Version of
                    ITU-T Rec. H.263 Video (H.263+)

Status of This Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or made obsolete by other
   documents at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

1. Introduction

   This document specifies an RTP payload header format applicable to
   the transportation of video streams generated based on the 1998
   version of ITU-T Recommendation H.263. The 1998 version of ITU-T
   Recommendation H.263 added numerous coding options to improve codec
   performance over the 1996 version. The 1998 version is referred to
   as H.263+ in this document. Among the new options, the ones with the
   biggest impact on the RTP payload are the slice structured mode
   (SS), independent segment decoding mode (ISD), and the scalability
   mode. This section summarizes the impact of these new coding options
   on packetization. Refer to [4] for more information on coding
   options.

   Slice structure was added to H.263+ for three purposes: to provide
   enhanced error resilience capability, to make the bitstream more
   amenable to use with an underlying packet transport such as RTP, and
   to minimize video delay. The slice structured mode supports
   fragmentation at macroblock boundaries.

   When the independent segment decoding option is employed, a video
   picture frame is broken into segments and encoded in such a way that
   each segment is independently decodable. Utilizing ISD in a lossy
   network environment helps prevent the propagation of errors from one
   segment of the picture to others.

   H.263+ also includes bitstream scalability as an optional coding
   mode. Three kinds of scalability are defined: temporal,
   signal-to-noise ratio (SNR), and spatial scalability. Temporal
   scalability is achieved via the disposable nature of
   bi-directionally predicted frames, or B-frames. SNR scalability
   permits refinement of encoded video frames, thereby improving the
   quality (or SNR). Spatial scalability is similar to SNR scalability
   except the refinement layer is twice the size of the base layer in
   the horizontal dimension, vertical dimension, or both.

2. Usage of RTP

   When transmitting H.263+ video streams over the Internet, the output
   of the encoder can be packetized directly. All the bits of the
   bitstream, including the fixed length codes and variable length
   codes, will be included in the packet.

   For H.263+ bitstreams coded with temporal, spatial, or SNR
   scalability, each layer may be transported to a different network
   address. More specifically, each layer may use a unique IP address
   and port combination. In addition, temporal relations between layers
   shall be expressed using the RTP timestamp so that they can be
   synchronized at the receiving ends in multicast or unicast
   applications.

   The H.263+ video streams will be carried as payload data within RTP
   packets. A new H.263+ payload header is defined in section 4.
   This section defines the usage of the RTP fixed header and the
   H.263+ video packet structure.

2.1 RTP Header Usage

   Each RTP packet starts with a fixed RTP header. The following fields
   of the RTP fixed header are used for H.263+ video streams:

   Marker bit (M bit): The Marker bit of the RTP header is set to 1
   when the current packet carries the end of the current frame, and is
   0 otherwise.

   Payload Type (PT): The Payload Type shall specify the H.263+ video
   payload format. A dynamic payload type can be used initially until a
   static payload type is assigned.

   Timestamp: The RTP Timestamp encodes the sampling instant of the
   first video frame contained in the RTP data packet. The RTP
   timestamp may be the same on successive packets if a video frame
   occupies more than one packet. In a multilayer scenario, all
   pictures corresponding to the same temporal reference should carry
   the same timestamp. If temporal scalability is used and B-frames are
   present, the timestamp may not be monotonically increasing in the
   video stream. If B-frames are transmitted on a separate layer and
   address, they must be synchronized properly with the reference
   frames. Please refer to the 1998 ITU Recommendation for H.263 [4]
   for information on the required transmission order to a decoder.
   For an H.263+ video stream, the RTP timestamp is based on a 90 kHz
   clock, the same as that of the RTP payload for H.261 streams [5].

2.2 Video Packet Structure

   An H.263+ compressed bitstream is carried as a payload within each
   RTP packet. For each RTP packet, the RTP header is followed by an
   H.263+ payload header, which is followed by a standard H.263+
   compressed bitstream. The size of the H.263+ payload header is
   variable depending on the payload involved, as detailed in
   section 4. The layout of the RTP H.263+ video packet is shown as:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | RTP Header                                                  ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | H.263+ Payload Header                                       ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | H.263+ Compressed Data Stream                               ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3. Design Considerations

   The goal of this payload format is to specify an efficient way of
   encapsulating an H.263+ standard compliant bitstream and to enhance
   the resiliency towards packet losses. Due to the large number of
   different possible coding schemes in H.263+, a copy of the picture
   header with configuration information is inserted into the payload
   header when appropriate.

   There are a few assumptions and constraints associated with this
   H.263+ payload header design. The purpose of this section is to
   point out various design issues and also to discuss several coding
   options provided by H.263+ that may impact the performance of
   network video.

   . It is reasonable to assume that no single macroblock will be too
     large to fit in a packet.

   . The optional slice structured mode described in annex K of H.263+
     [4] enables more flexibility for packetization. Furthermore,
     packets based on a slice structure are also inherently more loss
     resilient. Similar to a picture segment that begins with a GOB
     header, the motion vector predictors in a slice are restricted to
     reside within its boundaries. For these reasons, the use of the
     slice structured mode is strongly recommended for network
     applications.

   . In non-rectangular slice structured mode, only complete slices
     should be included in a packet. In other words, slices should not
     be fragmented across packets. Optimally, a packet will contain
     only one slice.

   . When the slice structure is not applied, the insertion of a GOB
     header in every GOB is recommended to reduce the dependency on
     motion vector prediction across GOBs. See section 3.3 of [6] for
     more information.

   . The independent segment decoding (ISD) mode described in annex R
     of [4] does not allow any data dependency across slice or GOB
     boundaries in the reference picture. It can be utilized to further
     improve resiliency in high loss conditions.

   . If ISD is used in conjunction with the slice structure, the
     rectangular slice submode shall be enabled and the dimensions and
     quantity of the slices present in a frame shall remain the same
     between two intra-coded frames (I-frames). The ISD segments may be
     entirely intra coded from time to time to realize quick error
     recovery without adding the latency associated with sending
     complete I-frames.

   . For resiliency, sending a full picture header for every frame is
     recommended. In other words, the sender should always set the
     subfield UFEP in PLUSPTYPE to '001' in the video bitstream.

   . In a multi-layer scenario, each layer can be transmitted to a
     different network address. The configuration of each layer, such
     as the enhancement layer number (ELNUM), reference layer number
     (RLNUM), and scalability type, should be determined at the start
     of the session and should not change during the course of the
     session.

4. H.263+ Payload Header

   For H.263+ video streams, each RTP packet carries only one H.263+
   video packet. The H.263+ payload header is always present for each
   H.263+ video packet. The payload header has variable length. If a
   picture header is included in the payload header, the length of the
   picture header in number of bytes is specified by PLEN. The minimum
   length of the payload header is 32 bits, corresponding to PLEN equal
   to 0.

   The H.263+ payload header is structured as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=0|SBIT |EBIT |  PLEN   |PEBIT| TID | Trun  |       RR        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1 0 0 0 0 0| picture header starting with TR, PTYPE, ...       .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   V: 2 bits
      Version number. Set to '00' for this payload format. [Ed. Note:
      The version control will not take effect until a draft has been
      formally submitted to the IETF.]

   SBIT: 3 bits
      Start bit position. Specifies the number of bits that should be
      ignored in the first data byte of the payload.

   EBIT: 3 bits
      End bit position. Indicates the number of bits that should be
      ignored in the last data byte of the payload.

   PLEN: 5 bits
      Picture header length in number of bytes.

   PEBIT: 3 bits
      Picture header end bit position. Indicates the number of bits
      that should be ignored in the last byte of the picture header.

   TID: 3 bits
      Thread id. Used only in optional video redundancy coding mode
      (VRC). See annex N of [4]. All three bits must be set to 0 unless
      VRC mode is applied.

   Trun: 4 bits
      Cyclic packet number. Used only in optional VRC mode. These bits
      must be set to 0 unless VRC mode is applied.

   RR: 9 bits
      Reserved bits.

   Notice that the TID and Trun fields are associated only with the
   video redundancy coding usage scenario derived from the reference
   picture selection mode specified in annex N of [4]. The TID and Trun
   bits must be set to 0 if VRC is not used. The use of VRC shall be
   negotiated by external means.

4.1 Encapsulating Packet that Begins with PSC

   Any packet that begins with a picture start code (PSC), i.e. the
   first packet of a picture frame, shall be encapsulated using only
   the first 32-bit word of the payload header, since a picture header
   is already included in the data bitstream. In this case, PLEN shall
   be 0.

   Here is an example of encapsulating the first packet in a frame:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0 0|SBIT |EBIT |0 0 0 0 0|0 0 0| TID | Trun  |       RR        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | bitstream data starts with complete picture header ...        .
   +---------------------------------------------------------------+

4.2 Encapsulating Packet that Begins with GBSC or SSC

   Any packet that begins with either a GOB start code (GBSC) or a
   slice start code (SSC) shall include a copy of the picture header in
   the payload header for resiliency. PLEN shall be set to specify the
   length of the included picture header in bytes. Hence, PLEN > 0. The
   end bit position corresponding to the last byte of the picture
   header data is indicated by PEBIT. Actual bitstream data shall begin
   on an 8-bit byte boundary following the payload header.

   Notice that only the last six bits of the picture start code,
   '100000', are included in the payload header. A complete H.263+
   picture header with a byte aligned picture start code can be
   conveniently assembled on the receiving end, if needed, by
   prepending the sixteen leading '0' bits.

   Assuming a PLEN of 9, below is an example of a packet that begins
   with a GBSC or an SSC:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0 0|SBIT |EBIT |0 1 0 0 1|PEBIT| TID | Trun  |       RR        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1 0 0 0 0 0| picture header starting with TR, PTYPE, ...       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ...                                                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ...           | bitstream data begins with GBSC/SSC ...       .
   +-+-+-+-+-+-+-+-+-----------------------------------------------+

4.3 Encapsulating Follow-On Packet

   When the slice structure coding option is not applied, some GOBs in
   the bitstream may be larger than the size of one packet. Similarly,
   when the ISD option is applied, a picture segment may be larger than
   the required packet size. The remaining fragment of a picture
   segment larger than the required packet size is termed a "follow-on"
   packet in this document.
   These follow-on packets with data fragmented at the macroblock
   boundaries are not independently recoverable. In this case, the
   payload header includes only the first 32-bit word and PLEN shall be
   set to 0. A receiver should discard any follow-on packet it receives
   if the preceding packet containing the segment header information
   has been lost.

   Here is an example of a follow-on packet:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0 0|SBIT |EBIT |0 0 0 0 0|0 0 0| TID | Trun  |       RR        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | sub-segment bitstream data ...                                .
   +---------------------------------------------------------------+

   Even though they may have identical payload headers, a follow-on
   packet can be differentiated from the first packet in a frame since
   the data in a follow-on packet does not begin with a PSC.

5. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [1], and any appropriate RTP profile (for example
   [3]). This implies that confidentiality of the media streams is
   achieved by encryption. Because the data compression used with this
   payload format is applied end-to-end, encryption may be performed
   after compression so there is no conflict between the two
   operations.

   A potential denial-of-service threat exists for data encodings using
   compression techniques that have non-uniform receiver-end
   computational load. The attacker can inject pathological datagrams
   into the stream which are complex to decode and cause the receiver
   to be overloaded. However, this encoding does not exhibit any
   significant non-uniformity.

   As with any IP-based protocol, in some circumstances a receiver may
   be overloaded simply by the receipt of too many packets, either
   desired or undesired.
   Network-layer authentication may be used to discard packets from
   undesired sources, but the processing cost of the authentication
   itself may be too high. In a multicast environment, pruning of
   specific sources may be implemented in future versions of IGMP [5]
   and in multicast routing protocols to allow a receiver to select
   which sources are allowed to reach it.

   A security review of this payload format found no additional
   considerations beyond those in the RTP specification.

6. References

   [1] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A
       Transport Protocol for Real-Time Applications", RFC 1889.

   [2] "Video Codec for Audiovisual Services at p x 64 kbit/s", ITU-T
       Recommendation H.261, 1993.

   [3] "RTP Profile for Audio and Video Conferences with Minimal
       Control", RFC 1890.

   [4] "Video Coding for Low Bitrate Communication", Draft ITU-T
       Recommendation H.263, Draft 20, September 1997.

   [5] T. Turletti, C. Huitema, "RTP Payload Format for H.261 Video
       Streams", RFC 2032.

   [6] C. Zhu, "RTP Payload Format for H.263 Video Streams", RFC 2190.
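
   As an informal illustration of the payload header layout in
   section 4, the following Python sketch packs and parses the first
   32-bit word and splits a payload into header, picture header copy,
   and bitstream data. It is a sketch under stated assumptions, not a
   normative implementation: the names H263PlusHeader, pack_header,
   parse_header, and split_payload are hypothetical, and it assumes
   that PLEN counts every byte of the picture header copy, including
   the byte holding the '100000' bits, as the PLEN = 9 example in
   section 4.2 suggests.

```python
import struct
from typing import NamedTuple, Tuple


class H263PlusHeader(NamedTuple):
    """First 32-bit word of the payload header (hypothetical name)."""
    v: int      # 2 bits, version, '00' for this format
    sbit: int   # 3 bits, bits to ignore in first data byte
    ebit: int   # 3 bits, bits to ignore in last data byte
    plen: int   # 5 bits, picture header length in bytes
    pebit: int  # 3 bits, bits to ignore in last picture header byte
    tid: int    # 3 bits, thread id (VRC only, else 0)
    trun: int   # 4 bits, cyclic packet number (VRC only, else 0)
    rr: int     # 9 bits, reserved


def pack_header(h: H263PlusHeader) -> bytes:
    """Pack the fields MSB-first into 4 bytes in network byte order."""
    word = (h.v << 30 | h.sbit << 27 | h.ebit << 24 | h.plen << 19 |
            h.pebit << 16 | h.tid << 13 | h.trun << 9 | h.rr)
    return struct.pack("!I", word)


def parse_header(payload: bytes) -> H263PlusHeader:
    """Extract the fields from the first 4 bytes of an RTP payload."""
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit header")
    (word,) = struct.unpack("!I", payload[:4])
    return H263PlusHeader(
        v=word >> 30,
        sbit=(word >> 27) & 0x7,
        ebit=(word >> 24) & 0x7,
        plen=(word >> 19) & 0x1F,
        pebit=(word >> 16) & 0x7,
        tid=(word >> 13) & 0x7,
        trun=(word >> 9) & 0xF,
        rr=word & 0x1FF,
    )


def split_payload(payload: bytes) -> Tuple[H263PlusHeader, bytes, bytes]:
    """Return (header, picture header with full PSC, bitstream data).

    Assumption: the picture header copy occupies PLEN bytes right
    after the first word and the bitstream data starts on the next
    byte boundary.
    """
    h = parse_header(payload)
    if len(payload) < 4 + h.plen:
        raise ValueError("payload shorter than PLEN indicates")
    picture_header = payload[4:4 + h.plen]
    if h.plen:
        # Prepend the sixteen leading '0' bits of the PSC (section 4.2).
        picture_header = b"\x00\x00" + picture_header
    return h, picture_header, payload[4 + h.plen:]
```

   For a packet that begins with a PSC or for a follow-on packet, PLEN
   is 0, so split_payload returns an empty picture header and the data
   starts immediately after the first word. The PEBIT, SBIT, and EBIT
   values are returned unchanged; trimming the ignored bits is left to
   the caller.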