Internet Engineering Task Force              Audio-Video Transport WG
INTERNET-DRAFT                               T. Turletti / C. Huitema
draft-ietf-avt-h261-00.txt                                      INRIA
                                                         June 9, 1995
                                                     Expires: 12/1/95

RTP payload format for H.261 video streams

1. Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

2. Abstract

This draft describes a scheme to packetize an H.261 video stream for transport using the Real-time Transport Protocol, RTP, with any of the underlying protocols that carry RTP.

This specification is a product of the Audio/Video Transport working group within the Internet Engineering Task Force. Comments are solicited and should be addressed to the working group's mailing list at rem-conf@es.net and/or the authors.

3. Purpose of this document

The ITU-T recommendation H.261 [7] specifies the encodings used by ITU-T compliant video-conference codecs.
Although these encodings were originally specified for fixed data rate ISDN circuits, experiments [4],[9] have shown that they can also be used over packet-switched networks such as the Internet.

The purpose of this memo is to specify the RTP payload format for encapsulating H.261 video streams in RTP [1].

3.1. Main changes

- The title has been changed in order to establish the three-word prefix ``RTP payload format'' for consistency across RTP-related documents. The I-D name is now ``draft-ietf-avt-h261-XX.txt'' instead of ``draft-ietf-avt-video-packet-XX.txt''.

- In a previous version of the draft, we specified a "fragment offset" field set to the byte offset of the current packet into the frame. This field was used to estimate how much memory should be left for the late packets. After attempting implementation, it appeared that this offset did not serve its intended purpose (which was a single copy reassembler). So, in this new draft, we replaced it by new fields used to encode some macro block (MB) state information. These new fields allow each packet to be processed independently and hence reduce the effect of packet loss.

- The GOB-start pattern is now encoded in the packets.

- A new NACK packet is defined.

4. Structure of the packet stream

4.1. Overview of the ITU-T recommendation H.261

The H.261 coding is organized as a hierarchy of groupings. The video stream is composed of a sequence of images, or frames, which are themselves organized as a set of Groups of Blocks (GOB). Each GOB holds a set of 3 lines of 11 macro blocks (MB). Each MB carries information on a group of 16x16 pixels: luminance information is specified for 4 blocks of 8x8 pixels, while chrominance information is given by two "red" and "blue" color difference components at a resolution of only 8x8 pixels.
These components and the codes representing their sampled values are as defined in the ITU-R Recommendation 601 [8].

This grouping is used to specify information at each level of the hierarchy:

- At the frame level, one specifies information such as the delay from the previous frame, the image format, and various indicators.

- At the GOB level, one specifies the GOB number and the default quantizer that will be used for the MBs.

- At the MB level, one specifies which blocks are present and which did not change, and optionally a quantizer and motion vectors.

Blocks which have changed are encoded by computing the discrete cosine transform (DCT) of their coefficients, which are then quantized and Huffman encoded (Variable Length Codes).

The H.261 Huffman encoding includes a special "GOB start" pattern, composed of 15 zeroes followed by a single 1, that cannot be imitated by any other code words. This pattern is included at the beginning of each GOB header (and also at the beginning of each picture header) to mark the separation between two GOBs, and is in fact used as an indicator that the current GOB is terminated. The encoding also includes a stuffing pattern, composed of seven zeroes followed by four ones; that stuffing pattern can only be inserted between the encoding of MBs, or just before the GOB separator.

4.2. Considerations for packetization

H.261 codecs designed for operation over ISDN circuits produce a bit stream composed of several levels of encoding specified by H.261 and companion recommendations. The bits resulting from the Huffman encoding are arranged in 512-bit frames, containing 2 bits of synchronization, 492 bits of data and 18 bits of error correcting code. The 512-bit frames are then interlaced with an audio stream and transmitted over p x 64 kbps circuits according to specification H.221 [6].
When transmitting over the Internet, we will directly consider the output of the Huffman encoding. We will not carry the 512-bit frames, as protection against bit errors can be obtained by other means. Similarly, we will not attempt to multiplex audio and video signals in the same packets, as UDP and RTP provide a much more efficient way to achieve multiplexing.

Directly transmitting the result of the Huffman encoding over an unreliable stream of UDP datagrams would, however, have very poor error resistance characteristics. Because of the hierarchical structure of the H.261 bit stream, one needs to receive the information present in the frame header to decode the GOBs, as well as the information present in the GOB header to decode the MBs. Without precautions, this would mean that one has to receive all the packets that carry an image in order to properly decode its components.

If each image could be carried in a single packet, this requirement would not create a problem. However, a video image or even one GOB by itself can sometimes be too large to fit in a single packet. Therefore, the MB is taken as the unit of fragmentation. Packets must start and end on a MB boundary, i.e. a MB cannot be split across multiple packets. Multiple MBs may be carried in a single packet when they fit within the maximum packet size allowed. This practice is recommended to reduce the packet send rate and packet overhead.

To allow each packet to be processed independently for efficient resynchronization in case of packet losses, some state information from the frame header and GOB header must be carried with each packet to allow the MBs in that packet to be decoded. This state information includes the current GOB number, the current MB number encoded, the last quantizer used in the current GOB (GQUANT or MQUANT) and the current motion vector components.
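
As an illustration, the state listed above fits in the single 32-bit H.261 header laid out in section 5.1. The following Python sketch packs and unpacks that header using the field widths given there. It is not part of this specification: the class and field names are hypothetical, and the motion vector fields are handled as raw 5-bit values, since their signed interpretation is not spelled out in this memo.

```python
from dataclasses import dataclass

GOB_MODULUS = 13  # MBAGOB = 13 * MBA + GOB (section 5.1)


@dataclass
class H261Header:
    """Hypothetical container for the 32-bit H.261 header fields."""
    sbit: int     # 3 bits: bits to ignore in the first data octet
    ebit: int     # 3 bits: bits to ignore in the last data octet
    intra: int    # 1 bit:  I flag
    mv_flag: int  # 1 bit:  V flag
    mba: int      # MBA predictor, folded into MBAGOB
    gob: int      # GOB number, folded into MBAGOB
    quant: int    # 5 bits: last quantizer, 0 at the start of a GOB
    hmvd: int     # 5 bits: raw horizontal motion vector data
    vmvd: int     # 5 bits: raw vertical motion vector data

    def pack(self) -> bytes:
        """Return the four header octets in network byte order."""
        mbagob = GOB_MODULUS * self.mba + self.gob
        word = (
            (self.sbit & 0x7) << 29
            | (self.ebit & 0x7) << 26
            | (self.intra & 0x1) << 25
            | (self.mv_flag & 0x1) << 24
            | (mbagob & 0x1FF) << 15
            | (self.quant & 0x1F) << 10
            | (self.hmvd & 0x1F) << 5
            | (self.vmvd & 0x1F)
        )
        return word.to_bytes(4, "big")

    @classmethod
    def unpack(cls, data: bytes) -> "H261Header":
        """Decode the first four octets of an RTP payload."""
        word = int.from_bytes(data[:4], "big")
        mbagob = (word >> 15) & 0x1FF
        return cls(
            sbit=(word >> 29) & 0x7,
            ebit=(word >> 26) & 0x7,
            intra=(word >> 25) & 0x1,
            mv_flag=(word >> 24) & 0x1,
            mba=mbagob // GOB_MODULUS,
            gob=mbagob % GOB_MODULUS,
            quant=(word >> 10) & 0x1F,
            hmvd=(word >> 5) & 0x1F,
            vmvd=word & 0x1F,
        )
```

For instance, a header with SBIT=3, EBIT=5, I=0, V=1, MBA=7, GOB=4, QUANT=12 and null motion vectors packs to the octets 75 2F B0 00, and unpacking those octets returns the same field values.
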
Moreover, since the compressed MB may not fill an integer number of octets, the data header contains two three-bit integers, SBIT and EBIT, to indicate the number of unused bits in the first and last octets of the H.261 data, respectively.

5. Specification of the packetization scheme

5.1. Usage of RTP

The H.261 information is carried as payload data within the RTP protocol. The following fields of the RTP header are specified:

- The payload type should specify H.261 payload format (see the companion RTP profile document RFC TBD).

- The RTP timestamp encodes the sampling instant of the first video image contained in the RTP data packet. The RTP timestamp may be the same on successive packets if a video image occupies more than one packet. For H.261 video streams, the timestamp is generated from a 65536 Hz clock. This is the same tick rate as the middle 32 bits of a 64-bit NTP timestamp, as defined in RFC 1305 [2]. However, this is not really an NTP timestamp because an NTP timestamp implies some synchronization with real time that is not implied here. Furthermore, the initial value of the timestamp is random (unpredictable) to make known-plaintext attacks on encryption more difficult, see RTP [1].

- The marker "M" bit of the RTP header can be used as a flag to trigger the display of the new image on the screen. This marker has a value of one in the last packet of a video frame. Thus it is not necessary to wait for a following packet to detect that the image should be displayed.

The H.261 data will follow the RTP header, as in:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                          RTP header                           .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          H.261 header                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          H.261 stream ...                     .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The H.261 header is defined as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |SBIT |EBIT |I|V|     MBAGOB      |  QUANT  |  HMVD   |  VMVD   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields in the H.261 header have the following meanings:

Start bit position (SBIT): 3 bits
   Number of bits that should be ignored in the first data octet.

End bit position (EBIT): 3 bits
   Number of bits that should be ignored in the last data octet.

INTRA-frame encoded data (I): 1 bit
   Set to 1 if this packet contains only INTRA-frame coded blocks. Set to 0 if this packet may or may not contain INTRA-frame coded blocks.

Motion Vector flag (V): 1 bit
   Set to 0 if motion vectors are not used in this packet. Set to 1 if motion vectors may or may not be used in this packet.

MBA and GOB (MBAGOB): 9 bits
   Encodes the current GOB number and the value of the MBA predictor (last MB number encoded in the current GOB) jointly in 9 bits, using MBAGOB = 13 * MBA + GOB. The receiver recovers the two values as MBA = MBAGOB / 13 (integer division) and GOB = MBAGOB % 13.

Quantizer (QUANT): 5 bits
   Last quantizer used (MQUANT or GQUANT). Set to 0 at the start of a GOB.

Current horizontal movement vector data (HMVD): 5 bits
   Set to 0 if the V flag is 0.

Current vertical movement vector data (VMVD): 5 bits
   Set to 0 if the V flag is 0.

Note that the I and V flags are hint flags, i.e. they can be inferred from the bit stream. They are included to allow decoders to make optimizations that would not be possible if these hints were not provided before the video image was processed. Therefore, these bits should be used consistently for the duration of the stream. A conformant implementation can always set V=1 and I=0.

5.2. Recommendations for operation with hardware codecs

Packetizers for hardware codecs can trivially locate GOB boundaries using the GOB-start pattern included in the H.261 data. (Note that software encoders already know the boundaries.) The cheapest packetization implementation is to packetize at the GOB level all the GOBs that fit in a packet. But when a GOB is too large, the packetizer has to parse it to do MB fragmentation. (Note that only the Huffman encoding must be parsed and that it is not necessary to fully decompress the stream, so this requires relatively little processing; example implementations can be found in some public H.261 codecs such as IVS [5] and VIC [10].) It is recommended to do MB level fragmentation whenever feasible in order to obtain more bandwidth efficient packetization.

At the receiver, the data stream can be depacketized and directed to a hardware codec's input. If the hardware decoder operates at a fixed bit rate, synchronization may be maintained by inserting the stuffing pattern between MBs (i.e., between packets) when the packet arrival rate is slower than the bit rate.

6. Packet loss issues

On the Internet, most packet losses are due to network congestion rather than transmission errors. Using UDP, no mechanism is available at the sender to know if a packet has been successfully received. It is up to the application, i.e. coder and decoder, to handle the packet loss. Each RTP packet includes in its header a sequence number field which can be used to detect packet loss.

H.261 uses the temporal redundancy of video to perform compression. This differential coding (or INTER-frame coding) is very sensitive to packet loss. After a packet loss, parts of the image may remain corrupted until all corresponding MBs have been encoded in INTRA-frame mode, which is absolute encoding without relation to the previous frame.
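
As an illustration of loss detection from the sequence number, the following Python sketch lists the sequence numbers skipped between two consecutively received packets. It assumes the 16-bit sequence number of the RTP specification [1]. The function name and the max_gap threshold (used to disregard duplicates and late, reordered packets) are hypothetical choices and not part of this memo.

```python
SEQ_MODULUS = 1 << 16  # RTP sequence numbers are 16 bits wide and wrap around


def missing_sequence_numbers(last_seq, new_seq, max_gap=1000):
    """Return the sequence numbers skipped between last_seq and new_seq.

    max_gap is an illustrative threshold: a forward distance larger than
    this is treated as a late or reordered packet rather than as a loss.
    """
    gap = (new_seq - last_seq) % SEQ_MODULUS
    if gap == 0 or gap > max_gap:
        return []  # duplicate, reordered, or implausibly large jump
    # Every number strictly between the two received packets was lost.
    return [(last_seq + i) % SEQ_MODULUS for i in range(1, gap)]
```

A receiver would call this for each arriving packet and treat a non-empty result as the trigger for one of the recovery methods described next. The modulo arithmetic keeps the check valid when the sequence number wraps from 65535 to 0.
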
There are several ways to mitigate packet loss:

(1) One way is to use only INTRA-frame encoding mode. This mode of operation can be selected in some hardware codecs using a sending option.

(2) Another way is to adjust the INTRA-frame encoding refreshment rate according to the packet loss observed by the receivers. The H.261 recommendation specifies that a MB is INTRA-frame encoded at least every 132 times it is transmitted. However, the INTRA-frame refreshment rate can be raised in order to speed the recovery when the measured loss rate is significant.

(3) The fastest way to repair a corrupted image is to request an INTRA-frame coded image refreshment after a packet loss is detected. One means to accomplish this is for the decoder to send to the coder a list of packets lost. The coder can decide to encode every MB of every GOB of the following video frame in INTRA-frame mode (i.e. full INTRA-frame encoding), or if the coder can deduce from the packet sequence numbers which MBs were affected by the loss, it can save bandwidth by sending only those MBs in INTRA-frame mode. This mode is particularly efficient in point-to-point connections or when the number of decoders is low. The next section specifies how the refresh function may be implemented.

Note that method (1) is currently implemented in the VIC videoconferencing software [10]. Methods (2) and (3) are currently implemented in the IVS videoconferencing software [5].

6.1. Use of optional H.261-specific control packets

This specification defines two H.261-specific RTCP control packets, "Full INTRA-frame Request" and "Negative Acknowledgement", described in the next section. Their purpose is to speed up refreshment of the video in those situations where their use is feasible.
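
The formats of both packets are given in section 6.2. As a rough sketch, and not a normative encoding, they could be built in Python as follows. The function names are hypothetical. Two points are assumptions: the length field is taken to count 32-bit words minus one, as for other RTCP packets in the RTP specification [1], and bit i of BLP (counting from the least significant bit) is taken to stand for packet FSN + i + 1, which generalizes the single BLP example given in section 6.2.2.

```python
import struct
from typing import Sequence

RTCP_FIR = 192   # packet type constants from section 6.2
RTCP_NACK = 193
FIRST_OCTET = 0x80  # V=2, P=0, MBZ=0


def build_fir(ssrc: int) -> bytes:
    """Build a Full INTRA-frame Request: common header plus SSRC."""
    # Two 32-bit words in total, so length = 1 (assumed words-minus-one).
    return struct.pack("!BBHI", FIRST_OCTET, RTCP_FIR, 1, ssrc)


def build_nack(ssrc: int, lost: Sequence[int]) -> bytes:
    """Build a NACK for `lost`, given in the order the packets were sent.

    Losses more than 16 packets after the first one do not fit in BLP
    and would need a further NACK packet.
    """
    fsn = lost[0]
    blp = 0
    for seq in lost[1:]:
        offset = (seq - fsn) % 65536  # distance from FSN, wrap-safe
        if 1 <= offset <= 16:
            blp |= 1 << (offset - 1)  # assumed: LSB stands for FSN + 1
    # Three 32-bit words in total, so length = 2 (assumed words-minus-one).
    return struct.pack("!BBHIHH", FIRST_OCTET, RTCP_NACK, 2, ssrc, fsn, blp)
```

With these assumptions, reporting packets 100 and 101 as lost gives FSN = 100 and BLP = 0x0001, which matches the example in section 6.2.2.
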
Support of these H.261-specific control packets by the H.261 sender is optional; in particular, early experiments have shown that the usage of this feature could have very negative effects when the number of sites is very large. Thus, these control packets should be used with caution.

The H.261-specific control packets differ from normal RTCP packets in that they are not transmitted to the normal RTCP destination transport address for the RTP session (which is often a multicast address). Instead, these control packets are sent directly via unicast from the decoder to the coder. The destination port for these control packets is the same port that the coder uses as a source port for transmitting RTP (data) packets. Therefore, these packets may be considered "reverse" control packets.

As a consequence, these control packets may only be used when no RTP mixers or translators intervene in the path from the coder to the decoder. If such intermediate systems do intervene, the address of the coder would no longer be present as the network-level source address in packets received by the decoder, and in fact, it might not be possible for the decoder to send packets directly to the coder.

Some reliable multicast protocols use similar NACK control packets transmitted over the normal multicast distribution channel, but they typically use random delays to prevent a NACK implosion problem [3]. The goal of such protocols is to provide reliable multicast packet delivery at the expense of delay, which is appropriate for applications such as a shared whiteboard.

On the other hand, interactive video transmission is more sensitive to delay and does not require full reliability. For video applications it is more effective to send the NACK control packets as soon as possible, i.e. as soon as a loss is detected, without adding any random delays.
In this case, multicasting the NACK control packets would generate useless traffic between receivers since only the coder will use them. But this method is only effective when the number of receivers is small. For example, in IVS [5] the H.261-specific control packets are used only in point-to-point connections or in point-to-multipoint connections when there are fewer than 10 participants in the conference.

6.2. H.261 control packets definition

6.2.1. Full INTRA-frame Request (FIR) packet

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|   MBZ   |  PT=RTCP_FIR  |            length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              SSRC                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This packet indicates that a receiver requires a full encoded image in order to either start decoding with an entire image or to refresh its image and speed the recovery after a burst of lost packets. The receiver requests the source to force the next image in full "INTRA-frame" coding mode, i.e. without using differential coding. The various fields are defined in the RTP specification [1]. SSRC is the synchronization source identifier for the sender of this packet. The value of the packet type (PT) identifier is the constant RTCP_FIR (192).

6.2.2. Negative ACKnowledgements (NACK) packet

The format of the NACK packet is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|   MBZ   | PT=RTCP_NACK  |            length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              SSRC                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              FSN              |              BLP              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The various fields V, P, PT, length and SSRC are defined in the RTP specification [1]. The value of the packet type (PT) identifier is the constant RTCP_NACK (193). SSRC is the synchronization source identifier for the sender of this packet.

The two remaining fields have the following meanings:

First Sequence Number (FSN): 16 bits
   Identifies the first sequence number lost.

Bitmask of following lost packets (BLP): 16 bits
   A bit is set to 1 if the corresponding packet has been lost, and set to 0 otherwise. BLP is set to 0 only if no packet other than that being NACKed (using the FSN field) has been lost. BLP is set to 0x0001 if the packet corresponding to the FSN and the following packet have been lost, etc.

7. Security Considerations

Security concerns are not discussed in this memo.

Addresses of Authors

Thierry Turletti
INRIA Sophia Antipolis
2004 route des Lucioles
BP 93, 06902 Sophia Antipolis
FRANCE
electronic mail: turletti@sophia.inria.fr

Christian Huitema
INRIA Sophia Antipolis
2004 route des Lucioles
BP 93, 06902 Sophia Antipolis
FRANCE
electronic mail: huitema@sophia.inria.fr

Acknowledgements

This draft is based on discussion within the AVT working group chaired by Stephen Casner. Steve McCanne, Stephen Casner, Ronan Flood, Mark Handley, Van Jacobson and Henning G.
Schulzrinne provided valuable comments. Stephen Casner also helped greatly with getting this document into readable form.

References

[1] Henning Schulzrinne, Stephen Casner, Ron Frederick, Van Jacobson, RTP: A Transport Protocol for Real-Time Applications, INTERNET-DRAFT, March 21, 1995.

[2] D.L. Mills, ``Network time protocol (version 3) -- specification, implementation and analysis,'' Network Working Group Request for Comments RFC 1305, University of Delaware, Mar. 1992.

[3] Sridhar Pingali, Don Towsley and James F. Kurose, A comparison of sender-initiated and receiver-initiated reliable multicast protocols, IEEE GLOBECOM '94.

[4] Thierry Turletti, H.261 software codec for videoconferencing over the Internet, INRIA Research Report no 1834, January 1993.

[5] Thierry Turletti, INRIA Videoconferencing tool (IVS), available by anonymous ftp from zenon.inria.fr in the "rodeo/ivs/last_version" directory. See also URL .

[6] Frame structure for Audiovisual Services for a 64 to 1920 kbps Channel in Audiovisual Services, ITU-T (International Telecommunication Union - Telecommunication Standardisation Sector) Recommendation H.221, 1990.

[7] Video codec for audiovisual services at p x 64 kbit/s, ITU-T (International Telecommunication Union - Telecommunication Standardisation Sector) Recommendation H.261, 1993.

[8] Digital Methods of Transmitting Television Information, ITU-R (International Telecommunication Union - Radiocommunication Standardisation Sector) Recommendation 601, 1986.

[9] M.A. Sasse, U. Bilting, C-D Schulz, T. Turletti, Remote Seminars through MultiMedia Conferencing: Experiences from the MICE project, Proc. INET'94/JENC5, Prague, June 1994, pp. 251/1-251/8.

[10] Steve McCanne, Van Jacobson, VIC Videoconferencing tool, available by anonymous ftp from ee.lbl.gov in the "conferencing/vic" directory.

Table of Contents

1 Status of this Memo
2 Abstract
3 Purpose of this document
   3.1 Main changes
4 Structure of the packet stream
   4.1 Overview of the ITU-T recommendation H.261
   4.2 Considerations for packetization
5 Specification of the packetization scheme
   5.1 Usage of RTP
   5.2 Recommendations for operation with hardware codecs
6 Packet loss issues
   6.1 Use of optional H.261-specific control packets
   6.2 H.261 control packets definition
      6.2.1 Full INTRA-frame Request (FIR) packet
      6.2.2 Negative ACKnowledgements (NACK) packet
7 Security Considerations
Addresses of Authors
Acknowledgements
References