Internet Engineering Task Force            Audio-Video Transport Working Group
Internet Draft                                                  H. Schulzrinne
ietf-avt-profile-04.txt                                               GMD Fokus
                                                                 March 24, 1995
Expires: 9/1/95


       RTP Profile for Audio and Video Conferences with Minimal Control


STATUS OF THIS MEMO

This document is an Internet-Draft. Internet-Drafts are working documents of
the Internet Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may
be updated, replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to cite them
other than as "work in progress".

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).

Distribution of this document is unlimited.

ABSTRACT

This note describes a profile for the use of the real-time transport protocol
(RTP) and the associated control protocol, RTCP, within audio and video
multiparticipant conferences with minimal control. It provides
interpretations of generic fields within the RTP specification suitable for
audio and video conferences. In particular, this document defines a set of
default mappings from payload type numbers to encodings.

The document also describes how audio and video data may be carried within
RTP. It defines a set of standard encodings and their names when used within
RTP. However, the definitions are independent of the particular transport
mechanism used. The descriptions provide pointers to reference
implementations and the detailed standards. This document is meant as an aid
for implementors of audio, video and other real-time multimedia applications.

1. Introduction

This profile defines aspects of RTP left unspecified in the RTP protocol
definition (RFC TBD). This profile is intended for use within audio and video
conferences with minimal session control. In particular, no support for the
negotiation of parameters or membership control is provided. Other profiles
may make different choices for the items specified here.

The profile specifies the use of RTP over unicast and multicast UDP. (This
does not preclude the use of these definitions when RTP is carried by other
lower-layer protocols.)

(Ed.: How to indicate usage of the profile? Port numbers are not likely to be
well-defined.)

2. RTP and RTCP Packet Forms and Protocol Behavior

This profile follows the default and/or recommended aspects of the RTP
specification for these items:

(Ed.: Maybe the main spec should number these items, so that they can be
easily aligned between spec and profile?)

o The standard format of the fixed RTP data header is used (one marker bit);
  an illustrative sketch of this header appears after this list.

o No additional fixed fields are appended to the RTP data header.

o The suggested constants are to be used for the RTCP report interval
  calculation.

o No extension section is defined for the RTCP SR or RR packet.

o No additional RTCP packet types are defined by this profile specification.

o The RTP default security services are also the default under this profile.

o The standard mapping of RTP and RTCP to transport-level addresses is used.

o No encapsulation of RTP packets is specified.

o No RTP header extensions are defined, but applications operating under this
  profile may use such extensions. Thus, applications should not assume that
  the RTP header X bit is always zero and should be prepared to ignore the
  header extension. Extensions should register the content of the first 16
  bits with IANA. (Ed.: Yet another IANA space? Other ideas?)

o Applications may use any of the SDES items described in the RTP
  specification.
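As a convenience for implementors, the following is a minimal sketch of the
fixed RTP data header referred to in the first item above, assuming the field
layout given in the RTP specification (RFC TBD). The structure and function
names are illustrative only; parsing from network octets is shown to avoid
compiler-dependent bitfield layouts.

   #include <stdint.h>

   /* Sketch of the fixed RTP data header used under this profile.  The
    * field layout follows the RTP specification (RFC TBD); only the
    * marker bit and the payload type are interpreted by this profile. */
   struct rtp_header {
       unsigned version;       /* 2 bits */
       unsigned padding;       /* 1 bit */
       unsigned extension;     /* 1 bit; may be set, see the list above */
       unsigned csrc_count;    /* 4 bits */
       unsigned marker;        /* 1 bit: the single marker bit */
       unsigned payload_type;  /* 7 bits: a value from Table 2 or the
                                  dynamic range 96--127 */
       uint16_t sequence;      /* sequence number */
       uint32_t timestamp;     /* in units of the encoding's clock rate */
       uint32_t ssrc;          /* synchronization source identifier */
   };

   static void parse_rtp_header(const uint8_t *p, struct rtp_header *h)
   {
       h->version      =  p[0] >> 6;
       h->padding      = (p[0] >> 5) & 1;
       h->extension    = (p[0] >> 4) & 1;
       h->csrc_count   =  p[0] & 0x0f;
       h->marker       =  p[1] >> 7;
       h->payload_type =  p[1] & 0x7f;
       h->sequence     = (uint16_t)((p[2] << 8) | p[3]);
       h->timestamp    = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16) |
                         ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
       h->ssrc         = ((uint32_t)p[8] << 24) | ((uint32_t)p[9] << 16) |
                         ((uint32_t)p[10] << 8) |  (uint32_t)p[11];
       /* CSRC identifiers, if present, follow at p + 12. */
   }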
New encodings are to be registered with the Internet Assigned Numbers
Authority. When registering a new encoding, the following information should
be provided:

o name and description of the encoding, in particular the RTP timestamp
  clock rate;

o indication of who has change control over the encoding (for example,
  CCITT/ITU, other international standardization bodies, a consortium or a
  particular company or group of companies);

o any operating parameters;

o a reference to a further description, if available, for example (in order
  of preference) an RFC, a published paper, a patent filing, a technical
  report or a computer manual;

o for proprietary encodings, contact information (postal and email address);

o the payload type value for this profile.

3. Audio

3.1. Encoding-independent recommendations

The following recommendations are default operating parameters. Applications
should be prepared to handle other values. The ranges given are meant to give
guidance to application writers, allowing a set of applications conforming to
these guidelines to interoperate without additional negotiation. These
guidelines are not intended to restrict operating parameters for applications
that can negotiate a set of interoperable parameters, e.g., through a
conference control protocol.

For packetized audio, the default packetization interval should have a
duration of 20 ms, unless otherwise noted when describing the encoding. The
packetization interval determines the minimum end-to-end delay; longer
packets introduce less header overhead but higher delay and make packet loss
more noticeable. For non-interactive applications such as lectures, or for
links with severe bandwidth constraints, a higher packetization delay may be
appropriate.

For N-channel encodings, each sampling period (say, 1/8000 of a second)
generates N samples. (This terminology is standard, but somewhat confusing,
as the total number of samples generated per second is then the sampling rate
times the channel count.)

If multiple audio channels are used, channels are numbered left-to-right,
starting at one. In RTP audio packets, information from lower-numbered
channels precedes that from higher-numbered channels. For more than two
channels, the convention of the AIFF-C audio interchange format should be
followed [1]. For two-channel stereo, the numbering sequence is left, right;
for three channels, left, right, center; for quadraphonic systems, front
left, front right, rear left, rear right; for four-channel systems, left,
center, right, and surround sound; for six-channel systems, left, left
center, center, right, right center and surround sound.

All channels belonging to a single sampling instance must be within the same
packet.

The sampling frequency should be drawn from the set: 8000, 11025, 16000,
22050, 44100 and 48000 Hz. (The Apple Macintosh computers have native sample
rates of 22254.54 and 11127.27 Hz, which can be converted to 22050 and 11025
Hz with acceptable quality by dropping 4 or 2 samples in a 20 ms frame.)

A receiver should accept packets representing between 0 and 200 ms of audio
data. (This restriction allows reasonable buffer sizing for the receiver.)
Receivers should be prepared to accept multi-channel audio, but may choose to
only play a single channel.
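As a worked example of the guidance above, the following sketch computes the
payload size implied by a given packetization interval for a sample-based
encoding. The function name and parameters are illustrative, not part of this
profile.

   #include <stddef.h>

   /* Payload octets per packet for a sample-based encoding, given the
    * sampling rate, channel count, bits per sample and packetization
    * interval.  Assumes the product works out to an integral octet
    * count, as required for sample-based encodings (Section 3.2). */
   static size_t payload_octets(unsigned rate_hz, unsigned channels,
                                unsigned bits_per_sample,
                                unsigned interval_ms)
   {
       size_t samples_per_channel = (size_t)rate_hz * interval_ms / 1000;
       return samples_per_channel * channels * bits_per_sample / 8;
   }

   /* For the default 20 ms interval, PCMU (8000 Hz, 1 channel, 8 bits
    * per sample) yields 160 octets, while L16 stereo at 44100 Hz
    * yields 3528 octets. */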
3.2. Guidelines for Sample-Based Audio Encodings

In sample-based encodings, each audio sample is represented by a fixed number
of bits. Within the compressed audio data, codes for individual samples may
span octet boundaries. An RTP audio packet may contain any number of audio
samples, subject to the constraint that the number of bits per sample times
the number of samples per packet yields an integral octet count. Fractional
encodings produce less than one octet per sample.

For sample-based encodings producing one or more octets per sample, samples
from different channels belonging to the same sampling instant are
consecutive. For example, for a two-channel encoding, the octet sequence is
(left channel, first sample), (right channel, first sample), (left channel,
second sample), (right channel, second sample), .... For multi-octet
encodings, octets are transmitted in network byte order (i.e., most
significant octet first).

The packing order for fractional encodings is that described for the IMA Wave
types [2]. For audio encodings yielding four bits per sample, eight such
compressed samples from channel 1 are packed into one 32-bit word, followed
by eight compressed samples from channel 2, until all channels have been
accommodated and the packing resumes at channel 1. For audio encodings
yielding three bits per sample, 32 such compressed samples at three bits each
from channel 1 are packed into 12 octets, followed by 32 samples from channel
2, etc.
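The following sketch illustrates the four-bit packing order just described.
The nibble order within each octet is assumed to be low nibble first, as
commonly used for the IMA Wave types; implementors should confirm this
against [2]. The function name and the requirement that the sample count be a
multiple of eight are assumptions of this sketch.

   #include <stddef.h>
   #include <stdint.h>

   /* Pack 4-bit sample codes into an RTP payload: eight codes from
    * channel 1 fill one 32-bit word, then eight codes from channel 2,
    * and so on, resuming at channel 1 (Section 3.2).  codes[ch][i] is
    * the i-th 4-bit code (0-15) for channel ch; samples_per_channel is
    * assumed to be a multiple of eight.  Returns the octet count. */
   static size_t pack_4bit(const uint8_t *const *codes, unsigned channels,
                           size_t samples_per_channel, uint8_t *out)
   {
       size_t o = 0;
       for (size_t s = 0; s < samples_per_channel; s += 8) {
           for (unsigned ch = 0; ch < channels; ch++) {
               for (unsigned k = 0; k < 8; k += 2) {
                   uint8_t lo = codes[ch][s + k]     & 0x0f;
                   uint8_t hi = codes[ch][s + k + 1] & 0x0f;
                   out[o++] = (uint8_t)(lo | (hi << 4)); /* low nibble first */
               }
           }
       }
       return o;
   }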
3.3. Guidelines for Frame-Based Audio Encodings

Frame-based encodings encode a fixed-length block of audio into another block
of compressed data, typically also of fixed length. For frame-based
encodings, the sender may choose to combine several such frames into a single
message. The receiver can tell the number of frames contained in a message
since the frame duration is defined as part of the encoding.

For frame-based codecs, the channel order is defined for the whole block.
That is, for two-channel audio, left and right samples are coded
independently, with the encoded frame for the left channel preceding that for
the right channel.

All frame-oriented audio codecs should be able to encode and decode several
consecutive frames within a single packet. Since the frame size for the
frame-oriented codecs is given, there is no need to use a separate
designation for the same encoding with a different number of frames per
packet.

3.4. Audio Encodings

   encoding   sample/frame   bits/sample   ms/frame
   ________________________________________________
   1016       frame          N/A           30
   G721       sample         4
   G723       sample         3
   GSM        frame          N/A           20
   IDVI       sample         4
   LPC        frame          N/A           20
   L8         sample         8
   L16        sample         16
   MPA        frame          N/A
   PCMU       sample         8
   PCMA       sample         8

            Table 1: Properties of Audio Encodings

1016: Encoding 1016 is a frame-based encoding using code-excited linear
   prediction (CELP) and is specified in Federal Standard FED-STD 1016
   [3,4,5,6]. The U.S. DoD's Federal-Standard-1016 based 4800 bps code
   excited linear prediction voice coder version 3.2 (CELP 3.2) Fortran and C
   simulation source codes are available for worldwide distribution at no
   charge (on DOS diskettes, but configured to compile on Sun SPARC stations)
   from: Bob Fenichel, National Communications System, Washington, D.C.
   20305, phone +1-703-692-2124, fax +1-703-746-4960, and from
   ftp://ftp.super.org/pub/speech/celp_3.2a.tar.Z

G721: G721 is specified in ITU recommendation G.721. Reference
   implementations for G.721 and G.723 are available as part of the
   CCITT/ITU-T Software Tool Library (STL) from the ITU General Secretariat,
   Sales Service, Place des Nations, CH-1211 Geneve 20, Switzerland. The
   library is covered by a license and is available at
   ftp://gaia.cs.umass.edu/pub/hgschulz/ccitt/ccitt_tools.tar.Z

G723: G723 is specified in ITU recommendation G.723. See G721 for information
   about a reference implementation.

GSM: GSM (Groupe Special Mobile) denotes the European GSM 06.10 provisional
   standard for full-rate speech transcoding, prI-ETS 300 036, which is based
   on RPE/LTP (residual pulse excitation/long term prediction) coding at a
   rate of 13 kb/s. A reference implementation was written by Carsten Bormann
   and Jutta Degener (TU Berlin, Germany) and is available at
   ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/

IDVI: IDVI is specified, with a reference implementation, in [2]. Each packet
   contains a single DVI block. The "header" word for each channel has the
   following structure:

      int16  valpred;  /* previous predicted value, network byte order */
      u_int8 index;    /* index into stepsize table */

   Header words for all channels precede the compressed data. Note that the
   first 16 bits differ in definition from the IMA and Microsoft DVI ADPCM
   Wave type [7]. There, the first 16 bits contain the first (uncompressed)
   sample. (Ed.: This discrepancy is unfortunate, creating all kinds of
   problems with hardware-based codecs common with PCs.)

L8: L8 denotes linear audio data, using 8 bits of precision with an offset of
   128, that is, the most negative signal is encoded as 0.

L16: L16 denotes uncompressed audio data, using 16-bit signed representation
   with 65535 equally divided steps between minimum and maximum signal level,
   ranging from -32768 to 32767. The value is represented in two's complement
   notation and network byte order.

MPA: MPA denotes MPEG-I or MPEG-II audio encapsulated as elementary streams.
   The encoding is defined in ISO standards ISO/IEC 11172-3 and 13818-3. The
   encapsulation is specified in RFC TBD, Section 4. Sampling rate and
   channel count are contained in the payload.

PCMU: PCMU is specified in CCITT/ITU-T recommendation G.711. Audio data is
   encoded as eight bits per sample, after mu-law companding. Code to convert
   between linear and mu-law companded data is available in [2].

PCMA: PCMA is specified in CCITT/ITU-T recommendation G.711. Audio data is
   encoded as eight bits per sample, after A-law companding. Code to convert
   between linear and A-law companded data is available in [2].
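Since PCMU is among the payload types every audio application should support
(see Section 5), the following sketch shows one common formulation of the
16-bit linear to mu-law conversion. It is illustrative only and is not the
reference code referred to in [2]; consult recommendation G.711 for the
normative definition.

   #include <stdint.h>

   /* 16-bit linear to mu-law (PCMU) conversion, common formulation. */
   static uint8_t linear_to_ulaw(int16_t pcm)
   {
       const int BIAS = 0x84;     /* 132, added before the segment search */
       const int CLIP = 32635;
       int sign = 0;
       int s = pcm;

       if (s < 0) {               /* work on the magnitude */
           sign = 0x80;
           s = -s;
       }
       if (s > CLIP)
           s = CLIP;
       s += BIAS;

       /* Segment (exponent): position of the highest set bit among
        * bits 14..7 of the biased magnitude. */
       int exponent = 7;
       for (int mask = 0x4000; (s & mask) == 0 && exponent > 0; mask >>= 1)
           exponent--;

       int mantissa = (s >> (exponent + 3)) & 0x0f;

       /* G.711 transmits the octet bit-inverted. */
       return (uint8_t)~(sign | (exponent << 4) | mantissa);
   }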
LPC: LPC designates an experimental linear predictive encoding written by Ron
   Frederick, Xerox PARC, available from
   ftp://parcftp.xerox.com/pub/net-research/lpc.tar.Z

VDVI: VDVI is a variable-rate version of IDVI, yielding speech bit rates
   between 10 and 25 kbps. It is specified for single-channel operation only.
   It uses the following encoding:

      IDVI codeword   VDVI bit pattern
           0          00
           1          010
           2          1100
           3          11100
           4          111100
           5          1111100
           6          11111100
           7          11111110
           8          10
           9          011
          10          1101
          11          11101
          12          111101
          13          1111101
          14          11111101
          15          11111111

TSP0: TSP0 designates the proprietary variable-rate, frame-based encoding
   called True Speech. The encoding is defined for a sampling rate of 7200 Hz
   and has an average data rate of 7200 bits per second. Further information
   is available by contacting VocalTec (see the VSC encoding) or the address:

      DSP Group, Inc.
      email: tsplayer@dsgp.com

VSC: VSC designates the proprietary variable-rate encoding called VocalTec
   Software Compression. The encoding is defined for a sampling rate of 5500
   Hz and has an average data rate of 963 bytes per second. Further
   information is available by contacting:

      Alon Cohen
      VocalTec Ltd.
      Maskit 1, Herzliya
      Israel
      phone: +972-9-5612121
      email: alon@vocaltec.com

The standard audio encodings and their payload types are listed in Table 2.

4. Video

The following video encodings are currently defined, with their abbreviated
names used for identification:

CelB: The CELL-B encoding is a proprietary encoding proposed by Sun
   Microsystems. The byte stream format is described in RFC TBD.

CPV: This proprietary encoding, "Compressed Packet Video", is implemented by
   Concept, Bolter, and ViewPoint Systems video codecs. For further
   information, contact:

      Glenn Norem, President
      ViewPoint Systems, Inc.
      2247 Wisconsin Street, Suite 110
      Dallas, TX 75229-2037
      United States
      Phone: +1-214-243-0634

JPEG: The encoding is specified in ISO Standards 10918-1 and 10918-2. The RTP
   payload format is as specified in RFC TBD.

H261: The encoding is specified in CCITT/ITU-T standard H.261. The
   packetization and RTP-specific properties are described in RFC TBD.

HDCC: The HDCC encoding is a proprietary encoding used by Silicon Graphics.
   [TBD: Need contact information.]

MPV: MPV designates the use of MPEG-I and MPEG-II video encoding elementary
   streams as specified in ISO Standards ISO/IEC 11172 and 13818-2,
   respectively. The RTP payload format is as specified in RFC TBD,
   Section 4.

MP2T: MP2T designates the use of MPEG-II transport streams, for either audio
   or video. The encapsulation is described in RFC TBD, Section 3.

nv: The encoding is implemented in the program 'nv' developed at Xerox PARC
   by Ron Frederick.

CUSM: The encoding is implemented in the program CU-SeeMe developed at
   Cornell University by Dick Cogger, Scott Brim, Tim Dorcey and John Lynn.

PicW: The encoding is implemented in the program PictureWindow developed at
   Bolt, Beranek and Newman (BBN).

RGB8: 8-bit encoding of RGB values, sequenced TBD. Each pixel can assume
   values from 0 to 255. Each frame is prefixed by a header containing TBD.

5. Payload Type Definitions

Table 2 defines the static payload type values to be carried in the PT field
of the RTP data header when this profile is in use. Additional static payload
type values, marked 'unassigned' in the table, may be defined by RTP Payload
Format specifications and registered with IANA. In addition, payload type
values in the range 96--127 may be defined dynamically through a conference
control protocol, which is beyond the scope of this document.

Note that the single payload type name space does not imply in any sense that
switching between all such encodings is useful. In particular, a single RTP
session is likely to carry either video or audio, but not both. It is not
permissible to use distinct payload types to multiplex several media
concurrently onto a single RTP session (e.g., to concurrently send PCMU audio
and CelB video over the same RTP session). Some payload types may designate a
combination of both audio and video, either within the same packet or
differentiated by information within the payload. Currently, the MPEG
Transport encapsulation is the only such payload type. The payload type range
marked 'reserved' has been set aside so that RTCP and RTP packets can be
reliably distinguished (see Section XXX of the RTP protocol specification).

Audio applications operating under this profile should at minimum be able to
send and receive payload types 0 and 5. This allows interoperability without
format negotiation as well as successful negotiation with a conference
control protocol. (Ed.: Is this helpful? It does give guidance to application
writers and reflects current practice of widest-use encodings. Should the
same be done for video? It would be nice if stating that application FOO is
compliant with RTP and profile RFC TBD implied that it can interoperate with
other compliant applications. This seems similar to requiring certain minimum
IPv6 security mechanisms.)

If there is no strong technical reason to the contrary, video encodings
typically use a timestamp frequency of 65536 Hz. The standard video encodings
and their payload types are listed in Table 2.

   PT       encoding     audio/video   clock rate   channels
            name         (A/V)         (Hz)         (audio)
   __________________________________________________________
   0        PCMU         A             8000         1
   1        1016         A             8000         1
   2        G721         A             8000         1
   3        GSM          A             8000         1
   4        G723         A             8000         1
   5        IDVI         A             8000         1
   6        IDVI         A             16000        1
   7        LPC          A             8000         1
   8        unassigned   A
   9        unassigned   A
   10       L16          A             44100        2
   11       L16          A             44100        1
   12       TSP0         A             7200         1
   13       VSC          A             5500         1
   14       MPA          A             90000        (see text)
   15--22   unassigned   A
   23       RGB8         V             65536        N/A
   24       HDCC         V             65536        N/A
   25       CelB         V             65536        N/A
   26       JPEG         V             65536        N/A
   27       CUSM         V             65536        N/A
   28       nv           V             65536        N/A
   29       PicW         V             65536        N/A
   30       CPV          V             65536        N/A
   31       H261         V             65536        N/A
   32       MPV          V             90000        N/A
   33       MP2T         A/V           90000        N/A
   34--71   unassigned   V             65536        N/A
   72--76   reserved     N/A           N/A          N/A
   77--95   unassigned   ?
   96--127  dynamic      ?                          N/A

   Table 2: Payload types (PT) for standard audio and video encodings
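As an implementation aid, the following sketch transcribes part of Table 2
into a static lookup table such as an application might use to select the RTP
timestamp clock rate and channel count for a received payload type. The
structure and entries shown are illustrative; the table above remains
authoritative, and the entries not shown here follow it.

   #include <stdint.h>

   struct pt_entry {
       uint8_t     pt;          /* static payload type value */
       const char *encoding;    /* encoding name */
       char        media;       /* 'A' audio, 'V' video */
       uint32_t    clock_rate;  /* RTP timestamp clock rate, Hz */
       uint8_t     channels;    /* audio channels; 0 = not applicable */
   };

   static const struct pt_entry static_payload_types[] = {
       {  0, "PCMU", 'A',  8000, 1 },
       {  3, "GSM",  'A',  8000, 1 },
       {  5, "IDVI", 'A',  8000, 1 },
       {  6, "IDVI", 'A', 16000, 1 },
       {  7, "LPC",  'A',  8000, 1 },
       { 10, "L16",  'A', 44100, 2 },
       { 11, "L16",  'A', 44100, 1 },
       { 14, "MPA",  'A', 90000, 0 },  /* channel count carried in payload */
       { 26, "JPEG", 'V', 65536, 0 },
       { 31, "H261", 'V', 65536, 0 },
       { 32, "MPV",  'V', 90000, 0 },
       { 33, "MP2T", 'V', 90000, 0 },  /* MPEG-II transport: audio and/or video */
       /* ... remaining static assignments as in Table 2 ... */
   };

   /* Payload types 72--76 are reserved; 96--127 are assigned dynamically. */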
6. Port Assignment

As specified in the RTP protocol definition, RTP data is to be carried on an
even UDP port number and the corresponding RTCP packets are to be carried on
the next higher (odd) port number. Applications operating under this profile
may use any such UDP port pair. For example, the port pair may be allocated
randomly by a session management program. A single fixed port number pair
cannot be required because multiple applications using this profile are
likely to run on the same host, and there are some operating systems that do
not allow multiple processes to use the same UDP port with different
multicast addresses.

However, port numbers 5004 and 5005 have been registered for use with this
profile for those applications that choose to use them as the default pair.
Applications that operate under multiple profiles may use this port pair as
an indication to select this profile, if they are not subject to the
constraint of the previous paragraph. Applications need not have a default
and may require that the port pair be explicitly specified.

The particular port numbers were chosen to lie in the range above 5000 to
accommodate port number allocation practice within the Unix operating system,
where port numbers below 1024 can only be used by privileged processes and
port numbers between 1024 and 5000 are automatically assigned by the
operating system.
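The even/odd pairing rule above can be captured in a few lines; the helper
below is an illustrative sketch only (the name and structure are not defined
by this profile).

   #include <stdint.h>

   struct rtp_ports {
       uint16_t rtp;    /* even port for RTP data */
       uint16_t rtcp;   /* rtp + 1, for RTCP */
   };

   /* Derive an RTP/RTCP port pair from a requested base port by
    * forcing the data port to be even and placing RTCP on the next
    * higher (odd) port.  make_port_pair(5004) yields the registered
    * default pair {5004, 5005}. */
   static struct rtp_ports make_port_pair(uint16_t base)
   {
       struct rtp_ports p;
       p.rtp  = (uint16_t)(base & ~1u);
       p.rtcp = (uint16_t)(p.rtp + 1);
       return p;
   }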
7. Address of Author

   Henning Schulzrinne
   GMD Fokus
   Hardenbergplatz 2
   D-10623 Berlin
   Germany
   electronic mail: hgs@fokus.gmd.de