MMUSIC                                                         M. Taylor
Internet Draft                                                 N. Larkin
Intended status: Informational                       Metaswitch Networks
Expires: February 28, 2017                               August 31, 2016

                 RTP media failover: problem statement
            draft-taylor-mmusic-rtp-failover-problem-01.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on February 28, 2017.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Taylor & Larkin         Expires February 28, 2017              [Page 1]
Internet-Draft   RTP media failover: problem statement     August 2016

Abstract

Network-based functions that terminate large numbers of RTP media streams and that offer high availability, such as session border controllers or conference bridges, typically preserve the same IP address towards sources of RTP media across a failover event because it is impractical to signal a change of IP address towards large numbers of RTP sources sufficiently rapidly to keep media interruption intervals within acceptable limits.

The need to preserve the IP address of RTP media terminating functions across a failover event imposes architectural requirements that can be difficult or costly to meet, particularly in network function virtualization environments.

This document describes the problem, outlines the key requirements for a solution, and discusses the merits and shortcomings of various existing approaches to solving the problem, before arguing that a new solution is needed.

Table of Contents

1. Introduction
2. Problem Space
   2.1. Geographic Redundancy
   2.2. Resource Efficiency
   2.3. Evolution to Cloud-Centric Virtualized Network Functions
   2.4. Absence of Layer 2 Connectivity
3. Requirements for Improved Failover of RTP Media Streams
   3.1. Upper Limit on Media Interruption Time
   3.2. Geographic Redundancy
   3.3. Resource Efficiency
   3.4. Network Compatibility
   3.5. Backwards Compatibility
   3.6. Compatibility with Hosted NAT Traversal
4. Available Solutions and Their Limitations
   4.1. Use of SIP Re-INVITE or UPDATE to Update SDP
   4.2. Restriction of Size of Fault Zone
   4.3. Re-Routing at the IP Layer Using BGP
   4.4. Re-routing at the IP Layer Using Link-State Protocols
   4.5. Anycast
   4.6. RTP Proxy / Load Balancer
   4.7. Multipath RTP
5. Proposed New Approach to RTP Media Failover
6. References
   6.1. Normative References
   6.2. Informative References
7. Change Log
   7.1. Changes in draft-taylor-mmusic-rtp-failover-problem-01

1. Introduction

Session Description Protocol (SDP) [RFC4566], typically conveyed via Session Initiation Protocol (SIP) [RFC3261] requests, provides a means for Real-time Transport Protocol (RTP) [RFC3550] endpoints to negotiate, via the Offer/Answer Model described in RFC 3264 [RFC3264], the details of media sessions to be established between them. An endpoint conveys the specific IP address and port number on which it wishes to receive a given media stream via the c= (connection information) and m= (media description) lines defined by SDP. An endpoint that wishes to change the IP address and port number on which it is to receive a given media stream needs to send updated SDP to the transmitter of that media stream.

Some services that make use of SIP and SDP to negotiate the establishment of media sessions for voice, video or real-time streaming purposes employ RTP media relay functions in the network, for example associated with a SIP back-to-back user agent in the form of a session border controller.
A single such RTP media relay instance may support the relaying of tens of thousands of concurrent media streams. Likewise, a large-scale conference bridge may support many thousands of concurrent RTP sessions.

With network functions that terminate such large numbers of RTP sessions (referred to in the remainder of this document as "RTP-terminating network functions"), it is desirable to provide some means to protect against hardware or software failures in a manner that preserves the RTP sessions, such that failover can be accomplished with minimal transient impairment to the audio or video streams as perceived by users of the service.

This may be accomplished by deploying a second, identical instance of the network function to act as a backup. The two instances work together as a pair, with one instance actively performing RTP session termination and the other instance standing by, ready to take over if the active instance fails. Some means is provided to enable the backup instance to detect failure of the active instance, for example by means of a heartbeat protocol between the two instances. On detecting failure of the active instance, the backup instance becomes active and can take over the processing of all the media streams that were previously handled by the active instance.

The handover between active and standby network function instances is typically handled in a manner that is transparent to the RTP endpoints that are currently sending media to the active instance. This may be accomplished by assigning a virtual IP address that is shared between the active and standby instances of the network function. It is this IP address that is conveyed over SDP to the set of RTP endpoints served by the network function as the destination to which they should send their media streams.
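As a minimal illustration of how the receive address reaches an RTP endpoint, the following Python sketch extracts the connection address and media port from the c= and m= lines of an SDP body. The SDP text, the documentation-range address 192.0.2.10, the port 49170 and the helper name receive_targets are assumptions made for this example only; the sketch ignores details such as port ranges ("49170/2") that a real SDP parser would need to handle.

```python
def receive_targets(sdp: str) -> list[tuple[str, str, int]]:
    """Return (media type, address, port) for each m= line in an SDP body."""
    session_addr = None  # address taken from a session-level c= line, if any
    targets = []
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            addr = line[2:].split()[2]
            if targets:
                # A c= line after an m= line applies to that media stream only.
                media, _, port = targets[-1]
                targets[-1] = (media, addr, port)
            else:
                session_addr = addr
        elif line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            media, port = line[2:].split()[:2]
            targets.append((media, session_addr, int(port)))
    return targets


# Hypothetical SDP in which the virtual IP address is advertised to the endpoint.
EXAMPLE_SDP = """v=0
o=- 0 0 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
"""

print(receive_targets(EXAMPLE_SDP))
```

In this sketch the endpoint learns a single destination per media stream, which is why the address advertised in SDP has to keep working after a failover unless new SDP is sent.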
Under normal operating conditions, the virtual IP address is associated with the currently active member of the RTP-terminating network function pair. When the standby member of the network function pair detects a failure of the active member, it becomes active and claims the virtual IP address, for example by issuing a gratuitous Address Resolution Protocol (ARP) [RFC826] message. By this means, all the media streams that are currently being transmitted to the formerly active member of the network function pair may be re-directed to the newly active member without any of the transmitting endpoints being aware of the change.

Fault tolerance schemes that take advantage of IP address swapping in the manner described above are widely employed by network functions that terminate large numbers of RTP streams and are often embodied in physical appliances such as session border controllers.

2. Problem Space

The following points describe problematic aspects of highly available network functions that terminate large numbers of RTP media streams, for which an improved solution (or solutions) is sought.

A common theme among these problems is the fact that failure recovery of the network function needs to be transparent to the sources of the RTP media streams handled by a failed RTP-terminating network function instance, in the sense that such sources are not aware that failover of the RTP-terminating network function instance serving them has taken place, except to the extent that they may experience some momentary interruption of received media. In particular, RTP endpoints continue to send media to the same IP address before and after an RTP-terminating network function failover event.

2.1. Geographic Redundancy

A pair of RTP-terminating network function instances deployed in the same physical location in active-standby mode and sharing the same virtual IP address can provide protection against equipment failure such as the failure of the active instance itself or the failure of network connectivity to the active instance. However, this arrangement does not protect against failure of the site at which the RTP-terminating network function is deployed, or failure of network connectivity to the site as a whole.

Network operators typically protect against site failure or site connectivity failure by implementing some form of geographic redundancy. This usually involves replicating the equipment needed to support a given service on at least two sites, such that in the event of the failure of one site, the service can continue to be supported by making use of the equipment on one of the other sites.

Since redundant equipment is deployed within each site to protect against equipment failure, protection against site failure requires yet more equipment to be deployed, which has obvious cost implications. Note that site failure is considered to be a far less frequent event than equipment failure, and typically no effort is made to preserve active real-time media sessions across a site failover, unlike the case of equipment failover.

Network operators can potentially reduce the cost of meeting service availability targets by protecting against both equipment failure and site failure with a single common failure recovery mechanism. For example, a pair of RTP-terminating network function instances could be deployed with one member of the pair located in one site and the other member located in another site.
If the network operator determines that real-time media sessions must be preserved across equipment failure, then we need to be able to switch all of the media streams addressed to the failed RTP-terminating network function instance to the standby instance located in the backup site sufficiently quickly that users experience no more than a brief transient interruption to their incoming media streams. This can be accomplished quite efficiently at Layer 2 (by swapping a virtual IP address from one member of the network function pair to the other), but this approach requires the establishment of a Layer 2 connection between sites, which can be complex and inconvenient to accomplish. Other methods for preserving real-time media streams across geographic failover are discussed below in Section 4.

2.2. Resource Efficiency

While it is common practice to deploy RTP-terminating network functions as active-standby pairs to provide high availability, this arrangement is relatively wasteful of hardware resources because, at any one time, only half the hardware supporting the RTP-terminating network function is doing useful work. The cost of hardware to support RTP processing can be relatively high, particularly if the function is required to perform compute-intensive work on media streams such as encryption/decryption of Secure Real-time Transport Protocol (SRTP) [RFC3711], or audio or video transcoding.

The amount of hardware resources required to support any given capacity of RTP-terminating network function could be very considerably reduced if it were possible to provide protection against hardware or software failure by means of a pooling arrangement.
This could be in the form of a group of RTP-terminating network function instances, all of which are active all of the time, and where their total aggregate capacity exceeds the maximum expected load by a sufficient margin that the load carried by any given instance can be successfully load-balanced across the remaining instances in the event of the failure of this instance.

An alternative approach would be to deploy a small number of standby instances to protect a much larger number of active instances, and to switch all of the RTP sessions carried by a failed active instance over to one of the standby instances. This type of high availability scheme is often known as N+k redundancy.

While the latter example above of N+k redundancy (N x active, k x standby) is compatible with the swapping of virtual IP addresses, the former example (active-active load-balanced) is not. Most network operators express a strong preference for active-active N+k schemes, regardless of any consideration as to whether active-active N+k can actually be shown to deliver higher availability than active-standby N+k.

2.3. Evolution to Cloud-Centric Virtualized Network Functions

Many network operators are embracing network functions virtualization (NFV), whereby network functions that would previously have been embodied as physical appliances are now embodied as software components deployed in a virtualized cloud computing environment.

With the move to NFV, network operators are expressing a strong preference for cloud-centric approaches to network function design. This tends to imply the deployment of relatively large numbers of relatively small instances of network functions, where all instances are active, and protection against failures at any level from individual instance through physical host up to a complete site is provided by means of active-active N+k redundant pools of virtualized network function instances.
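To make the resource argument of Section 2.2 concrete, the following Python sketch compares the number of instances needed under active-standby pairing with the number needed under N+k redundancy. The load figures (100,000 peak sessions, 10,000 sessions per instance, k = 2) and the function name instances_needed are purely hypothetical and chosen only to show the arithmetic.

```python
import math


def instances_needed(peak_sessions: int, per_instance: int, k: int) -> dict:
    """Compare instance counts for active-standby pairs and N+k redundancy."""
    # N: instances required to carry the peak load with no redundancy at all.
    n = math.ceil(peak_sessions / per_instance)
    return {
        "active_standby_pairs": 2 * n,  # each active instance has a dedicated standby
        "n_plus_k": n + k,              # k spare instances shared by the whole pool
        # In the active-active form of N+k, all N+k instances carry load,
        # so each runs at roughly N / (N + k) of its capacity at peak.
        "pool_utilization": round(n / (n + k), 2),
    }


print(instances_needed(peak_sessions=100_000, per_instance=10_000, k=2))
```

With these assumed figures, active-standby pairing needs 20 instances while N+k needs 12, which is the kind of saving the pooling arrangement is intended to deliver when k is much smaller than N.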
It is difficult in practice to architect highly available solutions for RTP-terminating network functions based on active-active N+k redundancy that meet the requirement that failover must be transparent to sources of RTP media. Possible solutions and their limitations are discussed later in this document.

2.4. Absence of Layer 2 Connectivity

Widely used active-standby techniques for RTP-terminating network functions that involve the sharing and swapping of a virtual IP address typically require that the active and standby members in a high availability arrangement are directly connected via a Layer 2 network segment.

As discussed in Section 2.1 above, this can be problematic if the active and standby RTP-terminating network function instances are located in different geographic sites, although this problem is soluble, for example with the aid of a Layer 2 Virtual Private Network.

A more intractable problem arises when a network operator chooses to design a network functions virtualization infrastructure with a Layer 3 centric fabric that does not provide L2 connectivity between virtualized workloads. While this is not yet a common approach to cloud network design, scaling issues with L2-centric fabrics are expected to drive increasing popularity of L3-centric approaches in the future. In L3-centric cloud network fabrics, failover of RTP-terminating network functions based on virtual IP address swapping cannot be supported with the usual approach based on gratuitous ARP [RFC826].

Approaches based on Network Address Translation (NAT) [RFC3022] such as OpenStack's Floating IP Address concept could potentially address this need, but the insertion of additional network elements into the RTP path to perform NAT introduces additional failure scenarios that need to be protected against.
Also, such approaches require that the infrastructure management plane is capable of responding very quickly to a NAT re-configuration request, such that the interruption in incoming media streams experienced by users is perceived as no more than momentary. Practical experience suggests that this cannot currently be achieved with real-world cloud infrastructure solutions.

3. Requirements for Improved Failover of RTP Media Streams

For the reasons described in Section 2 above, it is considered desirable to specify new behaviors of RTP endpoints so as to provide an improved method for failover of RTP media streams that supports high availability of RTP-terminating functions in the network.

When considering any new solution for failing over large numbers of RTP media streams, the following requirements should be met.

3.1. Upper Limit on Media Interruption Time

A new solution designed to preserve RTP media in the face of failure of an RTP-terminating network function instance MUST successfully re-establish a viable RTP media path for each and every flow that was previously handled by the failed instance within a maximum elapsed time of two seconds, and SHOULD re-establish all media flows within 500 milliseconds.

3.2. Geographic Redundancy

A new solution for failover of RTP media streams MUST be capable of preserving media sessions across the failure of a physical site or the failure of network connectivity to a physical site, even when the two sites are separated by hundreds of miles.

3.3. Resource Efficiency

A new solution for failover of RTP media streams MUST support N+k redundancy of RTP-terminating network functions, where k << N.

3.4. Network Compatibility

A new solution for failover of RTP media streams MUST NOT assume the existence of Layer 2 connectivity between RTP-terminating network function instances that are protecting each other, and MUST NOT assume the existence of any network capabilities beyond basic IP unicast connectivity.

3.5. Backwards Compatibility

It will take time to upgrade the installed base of RTP endpoints to embody any new behaviors required to support a new solution for RTP media failover. RTP-terminating network functions that embody a new solution for failover of RTP streams MUST remain compatible with RTP endpoints that do not support the new behaviors.

RTP-terminating network functions that support a new solution for failover of RTP media streams MAY continue to support legacy methods for failover of RTP media streams, but are not required to do so.

3.6. Compatibility with Hosted NAT Traversal

A new solution for failover of RTP media streams MUST be compatible with the method of Hosted NAT Traversal described in RFC 7362 [RFC7362]. If the solution requires that, following failover, the RTP endpoint is to transmit RTP media streams to an RTP-terminating network function at an IP address and port number different from those used prior to failover, the RTP endpoint MUST commence transmission of RTP packets towards the new IP address and port number without waiting to receive RTP media packets from the new IP address and port number.

4. Available Solutions and Their Limitations

In this section, we discuss alternative ways of supporting high availability of RTP-terminating network functions without any change to the existing behavior of SIP- and SDP-signaled RTP endpoints. It will be seen that none of these methods meets the full set of requirements identified in Section 3 above.

4.1. Use of SIP Re-INVITE or UPDATE to Update SDP

A SIP User Agent in an active session state associated with a currently active RTP transmitter can be instructed to transmit RTP to a different destination IP address and port number by sending it an in-dialog re-INVITE or UPDATE request that includes SDP with the new connection details.

This use of a re-INVITE or UPDATE request to update SDP within an active session may be leveraged to manage failover of an RTP-terminating network function instance in the network. The SIP User Agent instance that is associated with the RTP-terminating network function instance, upon detecting the failure of said instance, could send a re-INVITE or UPDATE request to each and every SIP UA that is in an active session and sending RTP media to the failed RTP-terminating network function instance, with an SDP body that directs each RTP endpoint to send RTP media to a different RTP-terminating network function instance.

In practice, it is found that the processing resources required to transmit the required number of re-INVITE or UPDATE requests and process all of the responses so as to achieve resumption of all active RTP media flows within an acceptable elapsed time far exceed the processing resources that would normally be required to support the SIP signaling load associated with that number of concurrent sessions. It is therefore very costly to support RTP media failover by means of this technique.

One use case for RTP-terminating network functions is in peering arrangements for the connection of large numbers of concurrent RTP sessions between different networks.
In this situation, if a SIP UA associated with an RTP-terminating network function were to send large numbers of in-dialog re-INVITE or UPDATE requests in a short elapsed time to its peer SIP UA in the other network so as to request that a large number of incoming RTP streams be sent to a different IP address and port number, the receiving SIP UA might easily be overwhelmed by the incoming load of SIP message traffic. This could have the doubly deleterious effect of failing to achieve the failover of many of the RTP streams in a timely fashion, and failing to complete requests for the establishment of new sessions while the signaling overload condition persists.

4.2. Restriction of Size of Fault Zone

In a network functions virtualization environment, it is possible to terminate large numbers of RTP sessions by deploying large numbers of small-scale RTP-terminating network function instances. These instances could be deployed without any form of redundancy, such that the failure of any instance causes the complete loss of all RTP media sessions currently being handled by it.

With this type of arrangement it could be argued that, if the maximum number of sessions that are handled by a single RTP-terminating network function instance is low enough, then the failure of one instance and the consequent loss of all the media sessions that it is currently handling represents a relatively minor impact to the service as a whole. Some network operators may take the view that this approach meets their criteria for an acceptable quality of service.

However, it should be pointed out that, with a reasonably efficient implementation of the RTP-terminating function, a minimally-sized instance occupying just a single virtual CPU could be handling several hundred concurrent sessions.
For most network operators, the loss of several hundred concurrent media sessions arising from the failure of an unprotected network element would be unacceptable.

It is also worth pointing out that deploying large numbers of small instances of a network function may restrict the size of the fault zone as it relates to failure of small-scale resources such as virtual machines, hypervisors or compute nodes, but it does not restrict the size of the fault zone as it relates to failure of large-scale resources such as an availability zone, an entire cloud instance or an entire site. Protection is still required in the event of these resources failing.

4.3. Re-Routing at the IP Layer Using BGP

It is possible to cause IP packets to be delivered to a different host system by means of appropriate interaction with the routing protocols of the IP network control plane. This capability can be exploited to support a highly available RTP-terminating network function.

In an IP network that employs Internal Border Gateway Protocol (IBGP) [RFC4271], one way to accomplish this is to add a BGP speaker function to the RTP-terminating network function. The RTP-terminating network function uses BGP to advertise a route to the RTP service address via its own host address. The IP infrastructure to which the RTP-terminating network function instance is connected effectively treats the host address of this instance as the next hop towards the RTP service address, and routes IP packets addressed to the RTP service address towards that RTP-terminating network function instance.
In the event of the failure of such an RTP-terminating network function instance, another RTP-terminating network function instance that is providing protection for the failed instance issues a BGP message that withdraws the original RTP service route via the host address of the failed instance, and advertises a new route via its own host address. The IP infrastructure will now route all IP packets addressed to the RTP service address towards the protecting RTP-terminating network function instance.

This approach places a number of demands on the IP routing infrastructure to which the active and standby RTP-terminating network function instances are connected that may be difficult to meet in practice. In particular, the routing infrastructure must be able to respond to the withdrawal of a route and the advertisement of a new route to the RTP service address sufficiently rapidly to meet the requirement described in Section 3.1 on the upper limit for media interruption time. It also requires that the routing policy prevailing in the infrastructure allows for individual host routes (e.g. IPv4 /32 or IPv6 /128 routes) to be installed in routing tables. In many cases it may not be practicable or even possible to meet these demands.

4.4. Re-routing at the IP Layer Using Link-State Protocols

In IP networks that employ Interior Gateway Protocols other than IBGP, for example OSPF [RFC2328] or IS-IS [RFC1142], it may be possible to re-route RTP media at the IP layer using methods conceptually similar to that described in Section 4.3. However, link-state protocols rely on the detection of a link failure to initiate re-routing of IP traffic, and it is unlikely that the failure of an RTP-terminating network function instance could always be detected as a link failure by neighboring routers sufficiently quickly to meet the requirement on the upper limit for media interruption time described in Section 3.1.
4.5. Anycast

Anycast [RFC4786] is a routing scheme whereby multiple host systems share a single address, and IP packets destined for that address are routed to the host that is "nearest" the sender.

Anycast techniques can be employed to implement a scheme that is conceptually similar to that described in Section 4.3 above, but which relies on the active and standby members of an RTP-terminating network function pair to advertise different route weights such that IP traffic is routed to the active member. Failover requires that the advertised route weights are adjusted to ensure that IP traffic is routed to the standby member.

Anycast techniques can also be employed to support a form of load-balancing. If multiple RTP-terminating network function instances are advertised to be reachable at the same address and with equal distance, the IP routing infrastructure can distribute load across the instances using Equal Cost Multi Path (ECMP) routing. Furthermore, if some means is provided for the detection of the failure of any given RTP-terminating network function instance and subsequent transmission of a BGP message withdrawing the route to that instance, then ECMP should act to re-distribute the load across the remaining instances.

This use of Anycast appears to address the N+k active-active use case very effectively, although it should be noted that, in the case of an RTP-terminating network function that is acting as a media relay, for example as a component of a session border controller, it is not generally possible to ensure that the two streams that make up a bi-directional RTP session are handled by the same media relay function instance. This may well add considerably to the complexity of the design of the media relay function.
A more serious problem with using Anycast in this way is that, in a virtualized environment, it becomes extremely challenging to manage the placement of the RTP-terminating network function instances. These challenges arise because, at each router supporting ECMP that sees multiple available routes to the Anycast address with the same distance, the router splits the traffic evenly between all these routes. If there is more than one router between the source of the traffic and the set of RTP-terminating network function instances that are the destination of the traffic, these instances must be arranged so as to create a symmetrical routing tree in order to ensure that each instance receives a similar share of the overall traffic load.

To illustrate this, consider the following scenario, described in the diagram below. All RTP media traffic from a given set of RTP endpoints transits via Router A (which might be, for example, an end-of-rack L3 switch), and then via either Router B or Router C (which might be, for example, top-of-rack L3 switches) to RTP-terminating network function instances M1 through M5. The routes to instances M1 and M2 are via Router B, while the routes to instances M3, M4 and M5 are via Router C. All RTP-terminating network function instances are advertising the same RTP service address.

                                 +--------+
                                 |        |
                                 |        +---> M1
                                 | Router |
                   +--------+    |   B    |
                   |        +--->|        +---> M2
                   |        |    |        |
   RTP flows ----->| Router |    +--------+
                   |   A    |    +--------+
                   |        |    |        |
                   |        +--->|        +---> M3
                   +--------+    |        |
                                 | Router +---> M4
                                 |   C    |
                                 |        +---> M5
                                 |        |
                                 +--------+

From the point of view of Router A, there are two possible routes to the RTP service address, via Router B and Router C respectively. It therefore sends half of the RTP flows to Router B, and half to Router C.
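The effect of this hop-by-hop even splitting can be worked through with a short calculation. The Python sketch below encodes the Router A / B / C topology of the scenario and computes the fraction of all RTP flows that reaches each instance, assuming (as the text does) that every router divides traffic equally among its equal-cost next hops. The names TOPOLOGY and ecmp_shares are illustrative only.

```python
from fractions import Fraction

# Topology from the scenario: each router maps to its equal-cost next hops.
TOPOLOGY = {
    "A": ["B", "C"],
    "B": ["M1", "M2"],
    "C": ["M3", "M4", "M5"],
}


def ecmp_shares(node: str, share: Fraction = Fraction(1)) -> dict[str, Fraction]:
    """Split `share` evenly across next hops until instances are reached."""
    next_hops = TOPOLOGY.get(node)
    if next_hops is None:
        # Not a router: this is an RTP-terminating instance.
        return {node: share}
    result = {}
    for hop in next_hops:
        result.update(ecmp_shares(hop, share / len(next_hops)))
    return result


for instance, share in ecmp_shares("A").items():
    print(instance, share)
```

Under these assumptions M1 and M2 each receive 1/4 of the flows, while M3, M4 and M5 each receive 1/6, so the two instances behind Router B carry 50% more load than each of the three behind Router C.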
Router B will distribute half of the RTP flows that it receives from Router A to each of M1 and M2, while Router C will distribute one third of the flows it receives from Router A to each of M3, M4 and M5. It can be seen that the load is not evenly balanced over the population of RTP-terminating network function instances.

In the general case, placing the instances of RTP-terminating network functions so as to form a symmetrical routing tree presents an extremely difficult problem for the workload scheduling algorithm in a virtualized environment, particularly if the intention is to spread the load between RTP-terminating network function instances on two or more separate sites. Topology-aware scheduling is not a capability offered by current generations of cloud orchestration software, and even if it were, dynamically scaling the population of RTP-terminating network function instances while maintaining a symmetric routing tree would be cumbersome and inflexible.

4.6. RTP Proxy / Load Balancer

It is possible to imagine a solution based on an RTP proxy or load balancer which sits between RTP-terminating network functions and a population of RTP endpoints that are sending RTP media towards those RTP-terminating network functions. The RTP proxy or load balancer presents a single IP address towards the population of SIP UAs.

In the event that an instance of an RTP-terminating network function fails, the RTP proxy or load balancer can detect the failure of the instance, and re-direct incoming RTP media to a different instance of an RTP-terminating network function which has been configured so as to receive and correctly process the incoming RTP media streams that were previously being sent to the failed instance.
The problem with this approach is that the RTP proxy or load balancer itself represents a single point of failure that must be protected by some means in order to provide a high availability service. All that is achieved in deploying an RTP proxy or load balancer is that the RTP failover problem is moved from the RTP-terminating network functions to an RTP-proxying function. The fundamental problem remains the same: a population of RTP endpoints expects to be able to transmit RTP media streams to the IP address and port number that were negotiated when the session was set up, and this address must be preserved across a failover of the RTP proxy or load balancer in order to ensure session continuity.

4.7. Multipath RTP

Multipath RTP [I-D.ietf-avtcore-mprtp] (MPRTP) is a proposed extension to RTP which splits a single RTP stream into multiple sub-flows that are transmitted over different network paths. It is primarily intended to leverage pooling of the resource capacity of multiple network paths to improve user experience by enabling higher bit-rate and higher quality codecs to be used.

It is possible to imagine using MPRTP to support failover of individual RTP streams, by defining two MPRTP sub-flows at session establishment time and then sending all media over one of the sub-flows. If an RTP-terminating network function involved in such an MPRTP session were to fail, media could then be transmitted and received via the other sub-flow.

There are a number of concerns about the use of MPRTP to support the simple case of failover. MPRTP is primarily concerned with the support of multiple simultaneous sub-flows that must be merged by the receiver. This needs additional RTP header information which would require extensive enhancements to the RTP stack in each endpoint.
This additional RTP header information would not be required for the simple failover case. Furthermore, MPRTP mandates that endpoints keep alive sub-flows on which no media is being sent. This would result in the unnecessary consumption of resources in RTP-terminating network functions. Finally, MPRTP does not support any mechanism for signaling to a transmitting RTP endpoint that it should stop sending media on one sub-flow and start sending it on another. Thus any solution for RTP failover based on the use of MPRTP would require further protocol extensions to address this requirement.

5. Proposed New Approach to RTP Media Failover

This document has argued that currently available solutions for RTP media failover are inadequate because they are inefficient from a hardware resources standpoint and not well suited to the evolving environment of network functions virtualization. It has also pointed out that many of the challenges faced by RTP media failover solutions arise from the need to preserve the destination IP address of the RTP-terminating network function across a failover event.

The need for robust and flexible high availability solutions for SIP User Agents is addressed by existing standards by permitting SIP UAs to establish multiple flows over which SIP signaling messages can be sent and received [RFC5626]. This document proposes that an analogous scheme be defined for RTP endpoints. The details of such a proposed scheme will be described in another Internet-Draft.

6. References

6.1. Normative References

[RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session Description Protocol", RFC 4566, July 2006.

[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

[RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with Session Description Protocol (SDP)", RFC 3264, June 2002.

[RFC826] Plummer, D., "Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, November 1982.

[RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC 3711, March 2004.

[RFC3022] Srisuresh, P. and K. Egevang, "Traditional IP Network Address Translator (Traditional NAT)", RFC 3022, January 2001.

[RFC7362] Ivov, E., Kaplan, H., and D. Wing, "Latching: Hosted NAT Traversal (HNT) for Media in Real-Time Communication", RFC 7362, September 2014.

[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

[RFC1142] Oran, D., Ed., "OSI IS-IS Intra-domain Routing Protocol", RFC 1142, February 1990.

[RFC4786] Abley, J. and K. Lindqvist, "Operation of Anycast Services", BCP 126, RFC 4786, December 2006.

[RFC5626] Jennings, C., Ed., Mahy, R., Ed., and F. Audet, Ed., "Managing Client-Initiated Connections in the Session Initiation Protocol (SIP)", RFC 5626, October 2009.

6.2. Informative References

[I-D.ietf-avtcore-mprtp] Singh, V., Ott, J., Karkkainen, T., Ahsan, S., and L. Eggert, "Multipath RTP (MPRTP)", draft-ietf-avtcore-mprtp-03 (work in progress), July 2016.

7. Change Log

7.1. Changes in draft-taylor-mmusic-rtp-failover-problem-01

Corrected missing section header "Re-Routing at the IP Layer Using BGP"

Added new section 4.7 on MPRTP

Authors' Addresses

Martin Taylor
Metaswitch Networks
100 Church St
Enfield EN2 6BQ
UK

Email: martin.taylor@metaswitch.com

Nic Larkin
Metaswitch Networks
100 Church St
Enfield EN2 6BQ
UK

Email: nic.larkin@metaswitch.com