Negotiating Human Language in Real-Time Communications

A mutually comprehensible language is helpful for human communication. This document addresses the negotiation of human (natural) language and media modality (spoken, signed, written) in real-time communications. A companion document addresses language selection in email.

Unless the caller and callee know each other or there is contextual or out-of-band information from which the language(s) and media modalities can be determined, spoken, signed, or written languages need to be negotiated based on the caller's needs and the callee's capabilities. This need applies to both emergency and non-emergency calls. For example, it is helpful for a caller to a company call center or a Public Safety Answering Point (PSAP) to be able to indicate preferred signed, written, and/or spoken languages, and for the callee to be able to indicate its capabilities in this area, allowing the call to proceed using the language(s) and media forms supported by both.

For various reasons, including the ability to establish multiple streams using different media (e.g., voice, text, video), it makes sense to use a per-stream negotiation mechanism known as the Session Description Protocol (SDP). Using SDP enables the solution described in this document to be applied to all interactive communications negotiated using SDP, in emergency as well as non-emergency scenarios.

By treating language as another SDP attribute that is negotiated along with other aspects of a media stream, it becomes possible to accommodate a range of users' needs and called party facilities. For example, some users may be able to speak several languages, but have a preference. Some called parties may support some of those languages internally but require the use of a translation service for others, or may have a limited number of call takers able to use certain languages.
Another example would be a user who is able to speak but is deaf or hard-of-hearing and desires a voice stream to send spoken language plus a text stream to receive written language. Making language a media attribute allows the standard session negotiation mechanism to handle this by providing the information and mechanism for the endpoints to make appropriate decisions.

The term "negotiation" is used here rather than "indication" because human language (spoken/written/signed) can be negotiated in the same manner as media (audio/text/video) and codecs. For example, consider a user calling an airline reservation center: the user may have a set of languages he or she speaks, with perhaps preferences for one or a few, while the airline reservation center will support a fixed set of languages. Negotiation should select the user's most preferred language that is supported by the call center. Both sides should be aware of which language was negotiated.

In the offer/answer model used here, the offer contains a set of languages per media (and direction) that the offerer is capable of using, and the answer contains one language per media (and direction) that the answerer will support. Supporting languages and/or modalities can require taking extra steps, such as having a call handled by an agent who speaks a requested language and/or has the ability to use a requested modality, or bridging external translation or relay resources into the call. The answer indicates the media and languages that the answerer is committing to support (possibly after additional steps have been taken). This model also ensures that both ends know what has been negotiated. Note that additional steps required to support the indicated languages or modalities may or may not be in place in time for any early media.
Since this is a protocol mechanism, the user equipment (UE client) needs to know the user's preferred languages. This document does not address how clients determine this; reasonable techniques could include a configuration mechanism that defaults to the language of the user interface. In some cases, a UE could tie language and media preferences together, such as a preference for a video stream using a signed language and/or a text or audio stream using a written/spoken language.

Within this document, it is assumed that the negotiating endpoints have already been determined, so that a per-stream negotiation based on the Session Description Protocol (SDP) can proceed. When setting up interactive communications sessions, it is necessary to route signaling messages to the appropriate endpoint(s). This document does not address the problem of language-based routing.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 RFC 8174 when, and only when, they appear in all capitals, as shown here.

The desired solution is a media attribute (preferably per direction) that may be used within an offer to indicate the preferred language(s) of each (direction of a) media stream, and within an answer to indicate the accepted language. The semantics of including multiple languages for a media stream within an offer is that the languages are listed in order of preference. (Negotiating multiple simultaneous languages within a media stream is out of scope of this document.)

RFC 4566 specifies an attribute 'lang' which appears similar to what is needed here, but is not sufficiently specific or flexible for the needs of this document. In addition, 'lang' is not mentioned in and there are no known implementations in SIP. Further, it is useful to be able to specify language per direction (sending and receiving). This document therefore defines two new attributes.

An SDP attribute (per direction) seems the natural choice to negotiate human (natural) language of an interactive media stream, using the language tags of BCP 47.

This document defines two media-level attributes starting with 'hlang' (short for "human interactive language") to negotiate which human language is selected for use in each interactive media stream. (Note that not all streams will necessarily be used.) There are two attributes, one ending in "-send" and the other in "-recv"; this document requests their registration with IANA. Each can appear in offers and answers for media streams.

In an offer, the 'hlang-send' value is a list of one or more language(s) the offerer is willing to use when sending using the media, and the 'hlang-recv' value is a list of one or more language(s) the offerer is willing to use when receiving using the media. The list of languages is in preference order (first is most preferred). When a media stream is intended for interactive communication using a language in one direction only (e.g., a user with difficulty speaking but able to hear who indicates a desire to send using text and receive using audio), either 'hlang-send' or 'hlang-recv' MAY be omitted. When a media stream is not primarily intended for language (for example, a video or audio stream intended for background only) both SHOULD be omitted. Otherwise, both SHOULD have the same value. Note that specifying different languages for each direction (as opposed to the same or essentially the same language in different modalities) can make it difficult to complete the call (e.g., specifying a desire to send audio in Hungarian and receive audio in Portuguese).

In an answer, 'hlang-send' is the language the answerer will send if using the media for language (which in most cases is one of the languages in the offer's 'hlang-recv'), and 'hlang-recv' is the language the answerer expects to receive if using the media for language (which in most cases is one of the languages in the offer's 'hlang-send').

In an offer, each value MUST be a list of one or more language tags per BCP 47, separated by white space. In an answer, each value MUST be one language tag per BCP 47.
BCP 47 describes mechanisms for matching language tags. Note that Section 4.1 of BCP 47 advises to "tag content wisely" and not include unnecessary subtags.

When placing an emergency call, and in any other case where the language cannot be inferred from context, in an offer each media stream primarily intended for human language communication SHOULD specify both (or for asymmetrical language use, one of) the 'hlang-send' and 'hlang-recv' attributes.

Clients acting on behalf of end users are expected to set one or both of the 'hlang-send' and 'hlang-recv' attributes on each media stream primarily intended for human communication in an offer when placing an outgoing session, and either ignore or take into consideration the attributes when receiving incoming calls, based on local configuration and capabilities. Systems acting on behalf of call centers and PSAPs are expected to take into account the attributes when processing inbound calls.

Note that media and language negotiation might result in more media streams being accepted than are needed by the users (e.g., if more preferred and less preferred combinations of media and language are all accepted). This is not a problem.
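
The offer/answer selection described above can be illustrated with a minimal, non-normative sketch. The function names, the dictionary representation of a media stream's attributes, and the use of exact case-insensitive tag comparison are all assumptions made for illustration; a real implementation would more likely apply the matching mechanisms that BCP 47 describes.

```python
from typing import Dict, Iterable, List, Optional


def pick_language(offered: List[str], supported: Iterable[str]) -> Optional[str]:
    """Return the most preferred offered tag that the answerer supports.

    Assumption: tags match only when equal ignoring case. BCP 47
    describes richer matching schemes that are not implemented here.
    """
    supported_lower = {tag.lower() for tag in supported}
    for tag in offered:  # offered list is in preference order
        if tag.lower() in supported_lower:
            return tag
    return None


def build_answer(offer: Dict[str, List[str]],
                 supported: Iterable[str]) -> Dict[str, str]:
    """Build the answer's hlang values for one media stream (hypothetical helper)."""
    supported = list(supported)
    answer: Dict[str, str] = {}
    # The answerer sends in a language the offerer is willing to receive.
    if "hlang-recv" in offer:
        choice = pick_language(offer["hlang-recv"], supported)
        if choice is not None:
            answer["hlang-send"] = choice
    # The answerer receives a language the offerer is willing to send.
    if "hlang-send" in offer:
        choice = pick_language(offer["hlang-send"], supported)
        if choice is not None:
            answer["hlang-recv"] = choice
    return answer


offer = {"hlang-send": ["es", "eu", "en"], "hlang-recv": ["es", "eu", "en"]}
answer = build_answer(offer, supported=["en", "es"])
```

In this sketch an attribute is simply left out of the answer when nothing matches; whether to proceed with the call in that case is a local policy decision that the sketch does not make.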

One consideration with the ability to negotiate language is whether the call proceeds or fails if the callee does not support any of the languages requested by the caller. This document does not mandate either behavior.

If the call is rejected due to lack of any languages in common, it is suggested to use SIP response code 488 (Not Acceptable Here) or 606 (Not Acceptable) and include a Warning header field in the SIP response. The Warning header field contains a warning code of [TBD: IANA VALUE, e.g., 308] and a warning text indicating that there are no mutually supported languages; the text SHOULD also contain the supported languages and media.

Example:

   [TBD: IANA VALUE, e.g., 308] proxy.example.com "Incompatible language specification: Requested languages not supported. Supported languages are: es, en; supported media are: audio, text."
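
As a non-normative illustration, the warning value in the example above could be assembled as follows. The function name and parameters are hypothetical, and the warning code is passed in by the caller because the actual value is still to be assigned by IANA (308 appears here only because the example above uses it as a placeholder).

```python
from typing import Iterable


def language_warning_value(warn_code: int, warn_agent: str,
                           languages: Iterable[str],
                           media: Iterable[str]) -> str:
    """Format a warning value: code, agent, then quoted warning text."""
    text = (
        "Incompatible language specification: "
        "Requested languages not supported. "
        f"Supported languages are: {', '.join(languages)}; "
        f"supported media are: {', '.join(media)}."
    )
    # The warning text is carried as a quoted string, so escape
    # backslashes and double quotes before wrapping it in quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{warn_code} {warn_agent} "{escaped}"'


# 308 is only the placeholder used in the example above, not an assigned value.
value = language_warning_value(308, "proxy.example.com",
                               ["es", "en"], ["audio", "text"])
```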

A sign-language tag with a video media stream is interpreted as an indication for sign language in the video stream. A non-sign-language tag with a text media stream is interpreted as an indication for written language in the text stream. A non-sign-language tag with an audio media stream is interpreted as an indication for spoken language in the audio stream. This document does not define any other use for language tags in video media (such as how to indicate visible captions in the video stream).

In the IANA registry of language subtags per BCP 47, a language subtag with a Type field "extlang" combined with a Prefix field value "sgn" indicates a sign-language tag. The absence of such a "sgn" prefix indicates a non-sign-language tag.

This document does not define the use of sign-language tags in text or audio media. This document does not define the use of language tags in media other than interactive streams of audio, video, and text (such as "message" or "application"). Such use could be supported by future work or by application agreement.
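
The interpretation rules above can be sketched as a small lookup. This is a non-normative illustration: the function names are hypothetical, and SIGN_LANGUAGE_SUBTAGS is a tiny hand-picked subset (the two sign-language subtags used in this document's examples) standing in for a real check against the IANA language subtag registry.

```python
from typing import Optional, Set

# Assumption: illustrative subset only. A real implementation would
# consult the IANA registry for extlang subtags with Prefix "sgn".
SIGN_LANGUAGE_SUBTAGS: Set[str] = {"ase", "aed"}


def is_sign_language(tag: str,
                     sign_subtags: Set[str] = SIGN_LANGUAGE_SUBTAGS) -> bool:
    """Return True if the tag names a sign language known to sign_subtags."""
    parts = tag.lower().split("-")
    if parts[0] in sign_subtags:  # e.g. "ase"
        return True
    # Extended form with the "sgn" prefix, e.g. "sgn-ase"
    return parts[0] == "sgn" and len(parts) > 1 and parts[1] in sign_subtags


def interpret(media: str, tag: str) -> Optional[str]:
    """Map (media type, language tag) to a modality, or None if undefined."""
    signed = is_sign_language(tag)
    if media == "video" and signed:
        return "signed"
    if media == "text" and not signed:
        return "written"
    if media == "audio" and not signed:
        return "spoken"
    return None  # combination not defined by this document
```

Returning None for the undefined combinations (for example, a sign-language tag on an audio stream) leaves the handling of such cases to the caller.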

Some examples are shown below. For clarity, only the most directly relevant portions of the SDP block are shown.

An offer or answer indicating spoken English both ways:

   m=audio 49170 RTP/AVP 0
   a=hlang-send:en
   a=hlang-recv:en

An offer indicating American Sign Language both ways:

   m=video 51372 RTP/AVP 31 32
   a=hlang-send:ase
   a=hlang-recv:ase

An offer requesting spoken Spanish both ways (most preferred), spoken Basque both ways (second preference), or spoken English both ways (third preference):

   m=audio 49250 RTP/AVP 20
   a=hlang-send:es eu en
   a=hlang-recv:es eu en

An answer to the above offer indicating spoken Spanish both ways:

   m=audio 49250 RTP/AVP 20
   a=hlang-send:es
   a=hlang-recv:es

An alternative answer to the above offer indicating spoken Italian both ways (as the callee does not support any of the requested languages but chose to proceed with the call):

   m=audio 49250 RTP/AVP 20
   a=hlang-send:it
   a=hlang-recv:it

An offer or answer indicating written Greek both ways:

   m=text 45020 RTP/AVP 103 104
   a=hlang-send:el
   a=hlang-recv:el

An offer requesting the following media streams: video for the caller to send using Argentine Sign Language, text for the caller to send using written Spanish (most preferred) or written Portuguese, audio for the caller to receive spoken Spanish (most preferred) or spoken Portuguese:

   m=video 51372 RTP/AVP 31 32
   a=hlang-send:aed

   m=text 45020 RTP/AVP 103 104
   a=hlang-send:es pt

   m=audio 49250 RTP/AVP 20
   a=hlang-recv:es pt

An answer for the above offer, indicating text in which the callee will receive written Spanish, and audio in which the callee will send spoken Spanish. The answering party had no video capability:

   m=video 0 RTP/AVP 31 32

   m=text 45020 RTP/AVP 103 104
   a=hlang-recv:es

   m=audio 49250 RTP/AVP 20
   a=hlang-send:es

An offer requesting the following media streams: text for the caller to send using written English (most preferred) or written Spanish, audio for the caller to receive spoken English (most preferred) or spoken Spanish, supplemental video:

   m=text 45020 RTP/AVP 103 104
   a=hlang-send:en es

   m=audio 49250 RTP/AVP 20
   a=hlang-recv:en es

   m=video 51372 RTP/AVP 31 32

An answer for the above offer, indicating text in which the callee will receive written Spanish, audio in which the callee will send spoken Spanish, and supplemental video:

   m=text 45020 RTP/AVP 103 104
   a=hlang-recv:es

   m=audio 49250 RTP/AVP 20
   a=hlang-send:es

   m=video 51372 RTP/AVP 31 32

Note that, even though the examples show the same (or essentially the same) language being used in both directions (even when the modality differs), there is no requirement that this be the case. However, in practice, doing so is likely to increase the chances of successful matching.
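
To illustrate how an endpoint might read these attributes out of an SDP body such as the examples above, here is a minimal, non-normative sketch. The function name and the returned structure are hypothetical, and the parser deliberately ignores everything except "m=" lines and the two attributes defined in this document.

```python
from typing import Dict, List


def parse_hlang(sdp: str) -> List[Dict[str, object]]:
    """Collect hlang-send / hlang-recv values for each media stream.

    Returns one dict per "m=" line, holding the media type and, where
    present, the attribute values as lists of language tags in the
    order given (preference order in an offer).
    """
    streams: List[Dict[str, object]] = []
    current = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # First token after "m=" is the media type (audio, video, text)
            current = {"media": line[2:].split()[0]}
            streams.append(current)
        elif current is not None and line.startswith("a=hlang-"):
            name, _, value = line[2:].partition(":")
            if name in ("hlang-send", "hlang-recv"):
                current[name] = value.split()
    return streams


offer_sdp = (
    "m=audio 49250 RTP/AVP 20\r\n"
    "a=hlang-send:es eu en\r\n"
    "a=hlang-recv:es eu en\r\n"
)
streams = parse_hlang(offer_sdp)
```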

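IANA is kindly requested to add two entries to the 'att-field (media level only)' table of the SDP parameters registry:

The first entry is for hlang-recv:

   Attribute Name:  hlang-recv
   Contact Name:  Randall Gellens
   Contact Email Address:  rg+ietf@coretechnologyconsulting.com
   Attribute Value:  hlang-value
   Attribute Syntax:
      hlang-value = hlang-offv / hlang-ansv
                    ; hlang-offv used in offers
                    ; hlang-ansv used in answers
      hlang-offv  = Language-Tag *( SP Language-Tag )
                    ; Language-Tag as defined in BCP 47
      SP          = 1*" " ; one or more space (%x20) characters
      hlang-ansv  = Language-Tag
   Attribute Semantics:  Described in TBD: THIS DOCUMENT
   Usage Level:  media
   Mux Category:  NORMAL
   Charset Dependent:  No
   Purpose:  See TBD: THIS DOCUMENT
   O/A Procedures:  See TBD: THIS DOCUMENT
   Reference:  TBD: THIS DOCUMENT

The second entry is for hlang-send:

   Attribute Name:  hlang-send
   Contact Name:  Randall Gellens
   Contact Email Address:  rg+ietf@coretechnologyconsulting.com
   Attribute Value:  hlang-value
   Attribute Syntax:
      hlang-value = hlang-offv / hlang-ansv
   Attribute Semantics:  Described in TBD: THIS DOCUMENT
   Usage Level:  media
   Mux Category:  NORMAL
   Charset Dependent:  No
   Purpose:  See TBD: THIS DOCUMENT
   O/A Procedures:  See TBD: THIS DOCUMENT
   Reference:  TBD: THIS DOCUMENT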
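
As a non-normative aid, the hlang-offv and hlang-ansv syntax registered above can be approximated with regular expressions. The helper names are hypothetical, and the tag pattern is a deliberate simplification of the BCP 47 Language-Tag production: it accepts well-formed-looking tags only and does not check subtags against the registry.

```python
import re

# Assumption: simplified stand-in for the BCP 47 Language-Tag grammar
# (a 2-8 letter primary subtag followed by optional 1-8 character subtags).
_TAG = r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*"

# Offer value: one or more tags separated by one or more space characters
_OFFER_RE = re.compile(rf"{_TAG}(?: +{_TAG})*")
# Answer value: exactly one tag
_ANSWER_RE = re.compile(_TAG)


def valid_offer_value(value: str) -> bool:
    """Check an offer's hlang-send / hlang-recv value."""
    return _OFFER_RE.fullmatch(value) is not None


def valid_answer_value(value: str) -> bool:
    """Check an answer's hlang-send / hlang-recv value."""
    return _ANSWER_RE.fullmatch(value) is not None
```

For instance, "es eu en" passes as an offer value but not as an answer value, since an answer carries a single language tag.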

IANA is requested to add a new value in the warn-codes sub-registry of SIP parameters in the 300 through 329 range that is allocated for indicating problems with keywords in the session description. The reference is to this document. The warn text is "Incompatible language specification: Requested languages not supported. Supported languages and media are: [list of supported languages and media]."

The Security Considerations of BCP 47 apply here. In addition, if the 'hlang-send' or 'hlang-recv' values are altered or deleted en route, the session could fail or languages incomprehensible to the caller could be selected; however, this is also a risk if any SDP parameters are modified en route.

Language and media information can suggest a user's nationality, background, abilities, disabilities, etc.

RFC EDITOR: Please remove this section prior to publication.

Deleted Section 3 ("Expected Use")
Reworded modalities in Introduction from "voice, video, text" to "spoken, signed, written"
Reworded text about "increasingly fine-grained distinctions" to instead merely point to BCP 47 Section 4.1's advice to "tag content wisely" and not include unnecessary subtags
Changed IANA registration of new SDP attributes to follow RFC 4566 template with extra fields suggested in 4566-bis (expired draft)
Deleted "(known as voice carry over)"
Changed textual instances of RFC 5646 to BCP 47, although actual reference remains RFC due to xml2rfc limitations

Added Examples
Added Privacy Considerations section
Other editorial changes for clarity

Deleted most of and replaced with a very short summary
Replaced "wishes to" with "is willing to" in
Reworded description of attribute usage to clarify when to set both, only one, or neither
Deleted all uses of "IMS"
Other editorial changes for clarity

Editorial changes to wording in Section 5.

Updated title to reflect WG adoption

Removed Use Cases section, per face-to-face discussion at IETF 93
Removed discussion of routing, per face-to-face discussion at IETF 93

Updated NENA usage mention
Removed background text reference to draft-saintandre-sip-xmpp-chat-04 since that draft expired

Revision to keep draft from expiring

Changed name from -mmusic- to -slim- to reflect proposed WG name
As a result of the face-to-face discussion in Toronto, the SDP vs SIP issue was resolved by going back to SDP, taking out the SIP hint, and converting what had been a set of alternate proposals for various ways of doing it within SIP into an informative annex section which includes background on why SDP is the proposal
Added mention that enabling a mutually comprehensible language is a general problem of which this document addresses the real-time side, with reference to which addresses the non-real-time side.

Added clarifying text on leaving attributes unset for media not primarily intended for human language communication (e.g., background audio or video).
Added new section ("Alternative Proposal: Caller-prefs") discussing use of SIP-level Caller-prefs instead of SDP-level.

Relaxed language on setting -send and -receive to same values; added text on leaving one empty to indicate asymmetric usage.
Added text that clients on behalf of end users are expected to set the attributes on outgoing calls and ignore on incoming calls while systems on behalf of call centers and PSAPs are expected to take the attributes into account when processing incoming calls.

Updated text to refer to RFC 5646 rather than the IANA language subtags registry directly.
Moved discussion of existing 'lang' attribute out of "Proposed Solution" section and into own section now that it is not part of proposal.
Updated text about existing 'lang' attribute.
Added example use cases.
Replaced proposed single 'hlang' attribute with 'hlang-send' and 'hlang-recv' per Harald's request/information that it was a misuse of SDP to use the same attribute for sending and receiving.
Added section describing usage being advisory vs required and text in attribute section.
Added section on SIP "hint" header (not yet nailed down between new and existing header).
Added text discussing usage in policy-based routing function or use of SIP header "hint" if unable to do so.
Added SHOULD that the value of the parameters stick to the largest granularity of language tags.
Added text to Introduction to try to be more clear about purpose of document and problem being solved.
Many wording improvements and clarifications throughout the document.
Filled in Security Considerations.
Filled in IANA Considerations.
Added to Acknowledgments those who participated in the Orlando ad-hoc discussion as well as those who participated in email discussion and side one-on-one discussions.

Updated text for (possible) new attribute "hlang" to reference RFC 5646
Added clarifying text for (possible) re-use of existing 'lang' attribute saying that the registration would be updated to reflect different semantics for multiple values for interactive versus non-interactive media.
Added clarifying text for (possible) new attribute "hlang" to attempt to better describe the role of language tags in media in an offer and an answer.

Changed name of (possible) new attribute from "humlang" to "hlang"
Added discussion of silly state (language not appropriate for media type)
Added Voice Carry Over example
Added mention of multilingual people and multiple languages
Minor text clarifications

Gunnar Hellstrom deserves special mention for his reviews and assistance.

Many thanks to Bernard Aboba, Harald Alvestrand, Flemming Andreasen, Francois Audet, Eric Burger, Keith Drage, Doug Ewell, Christian Groves, Andrew Hutton, Hadriel Kaplan, Ari Keranen, John Klensin, Mirja Kuhlewind, Paul Kyzivat, John Levine, Alexey Melnikov, Addison Phillips, James Polk, Eric Rescorla, Pete Resnick, Alvaro Retana, Natasha Rooney, Brian Rosen, Peter Saint-Andre, and Dale Worley for reviews, corrections, suggestions, and participating in in-person and email discussions.