Monday, 24 February 2014

Understanding how Lync establishes audio/video paths using ICE

The real-time aspects of Lync require a different approach from SIP signalling to ensure quality of service. SIP signalling can be delayed without causing too many issues, but for audio/video timely delivery is much more important, and to achieve this Lync uses Interactive Connectivity Establishment (ICE). ICE is the overall process that discovers and exchanges candidates to find the most optimal media path.

Definitions

  • SIP signalling - allows clients to send invites to other parties.
  • Interactive Connectivity Establishment (ICE) - Process used to discover and exchange candidates in order to find the most optimal media path.
  • Candidates – A list of possible IP addresses that could be used to establish a media path.
  • Reflective/Session Traversal Utilities for NAT (STUN) - STUN reflects or returns the public NAT address to the Lync client, e.g. a home-based user sends a packet to the Edge server, which discovers the public IP address (a candidate) and returns it to the client.
  • Relay/Traversal Using Relays around NAT (TURN) - TURN allows the media traffic to be relayed/proxied by the Edge server to the client by providing the client with a relay address to send media to.
  • ICE endpoints – An ICE endpoint is anything that is involved in media, e.g. Lync Clients, Lync Web App, Lync Phone, FE Server (App Sharing MCU, RGS, Call Park, A/V Conf etc), Mediation Server, SBA, Exch UM. Session Border Controllers and the Director role would not be considered ICE endpoints. The Edge server performs STUN and TURN but is not an ICE endpoint; it is more of an ICE server.

The 5 Phases of Media Path Establishment

1. TURN provisioning and credentials (MRAS)
The Lync client does an SRV lookup to find an Edge server to register against and then performs a SIP REGISTER. The server responds with a 200 OK that includes in-band provisioning details, including MRAS (Media Relay Authentication Service), which tells the client that an Edge service is deployed. The client then sends a SIP SERVICE request to the Front End, which includes the client’s location (internal or external). Because the Edge is not on the domain it can’t authenticate the client directly, so the Front End server requests the credentials on behalf of the client. The AV Edge service creates the credentials using the AV Edge certificate and returns them to the Front End, which sends a 200 OK back to the client containing the Edge server it should connect to, the ports, and a username and password. Credentials are valid for 8 hours, and for this period the client can go straight to the Edge server. In a conferencing scenario the same thing happens; however, because users can join anonymously, the Front End first checks that the meeting exists, and then obtains the credentials and passes them to the meeting participant.



Tip - Always make sure you use the same external certificate for all Edge servers. The certificate is used to create the credentials the client connects with. If an Edge server goes down and the client tries to connect to another Edge server that uses a different certificate, that server will not be able to validate the credentials and authentication will fail.

Search for “MRAS” in Snooper to find authentication messages. There should be 3 messages per request. MRAS uses port 5062.

2. Address Discovery (Allocation)
Address discovery is the process the client goes through to determine what IP addresses it might be reached on. These IP addresses are the client’s candidates.

Audio/Video
  • Discover local UDP candidate for every network card (peer to peer so UDP is best)
  • Connect to media relay (Edge server) to discover reflexive address (the address the Edge server sees the client connect from) and allocate a candidate on the media relay for UDP then TCP
File Transfer and Desktop Sharing (RDP over RTP) - Both require TCP
  • Discovers local TCP candidate
  • Media Relay TCP only
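
The discovery rules above can be sketched as a small planning function. This is only an illustration of the lists in this post, not how the Lync client is actually implemented; the names plan_candidates and PlannedCandidate are hypothetical, and treating the TCP host candidate as "one per interface" is my own assumption.

```python
from typing import List, NamedTuple


class PlannedCandidate(NamedTuple):
    kind: str       # "host", "reflexive" or "relay"
    transport: str  # "UDP" or "TCP"
    source: str     # local interface address, or "edge" when the Edge provides it


def plan_candidates(modality: str, local_ips: List[str],
                    edge_available: bool = True) -> List[PlannedCandidate]:
    """Return the candidates a client would try to gather for a modality."""
    if modality == "audio_video":
        # One local UDP candidate for every network card.
        plan = [PlannedCandidate("host", "UDP", ip) for ip in local_ips]
        if edge_available:
            # Reflexive address plus relay allocations, UDP first and then TCP.
            plan += [
                PlannedCandidate("reflexive", "UDP", "edge"),
                PlannedCandidate("relay", "UDP", "edge"),
                PlannedCandidate("relay", "TCP", "edge"),
            ]
    elif modality in ("file_transfer", "desktop_sharing"):
        # Both require TCP: local TCP candidates and a TCP-only relay.
        plan = [PlannedCandidate("host", "TCP", ip) for ip in local_ips]
        if edge_available:
            plan.append(PlannedCandidate("relay", "TCP", "edge"))
    else:
        raise ValueError(f"unknown modality: {modality}")
    return plan
```

Calling plan_candidates("audio_video", ["10.0.0.5"], edge_available=False) returns only the local UDP candidate, which matches the case where no Edge server is deployed.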

3. Address Exchange (SIP Invite/200OK)
Address exchange is the process of sharing candidates with other endpoints that will be part of the call (peers). This is achieved by sending a SIP invite to the peer, who in turn will discover their own candidates, and send them back as part of a SIP 183 Session Progress.

4. Connectivity Checks
This is the process of taking the provided candidates and determining a possible media path. The Lync client validates the list of candidates by opening connections to all entries in the list simultaneously. The first to respond is used to establish the “Early Media” connection; however, the media path may change during the call through a process called candidate promotion. When the called party picks up, it sends its candidates to the caller again, but this time as part of a 200 OK.

  • Connect directly (peer to peer)
  • Connect to reflective address
  • Connect via media relay by connecting to the Edge and asking it to contact a candidate and establish a connection on its behalf.
** If there is no Edge server, the client only checks local candidates.
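
As a rough sketch of the "try everything, first to respond wins" idea, the snippet below pairs local and remote candidates and picks the pair that answered fastest. It is a simplification under my own assumptions: the function names are hypothetical, pairing is done purely on matching transport, and the response times are supplied by the caller rather than measured on the network.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Candidate = Tuple[str, str, str]   # (kind, transport, "ip:port")
Pair = Tuple[Candidate, Candidate]  # (local candidate, remote candidate)


def build_pairs(local: List[Candidate], remote: List[Candidate]) -> List[Pair]:
    """Pair every local candidate with every remote one on the same transport."""
    return [(l, r) for l, r in product(local, remote) if l[1] == r[1]]


def first_to_respond(pairs: List[Pair],
                     response_ms: Dict[Pair, float]) -> Optional[Pair]:
    """Return the pair with the fastest response, or None if nothing answered.

    Pairs missing from response_ms are treated as having failed the check.
    """
    answered = [p for p in pairs if p in response_ms]
    return min(answered, key=lambda p: response_ms[p]) if answered else None
```

The pair returned here corresponds to the early media path only; the final path is decided later by candidate promotion.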

5. Candidate Promotion
This is the process of determining the best possible candidate for the session. If a better path is found then the media path can change during the call.
  • Host/Local Candidate (UDP) – The most preferred candidate is always a local candidate and is the reason that peer media sessions between clients on the same network will never use the Edge server.
  • Reflexive/STUN Candidate (UDP) – The next preferred option is to use the server reflexive candidate which is provided by the Edge Server using STUN. This scenario involves attempting to connect to the reflexive IP addresses for each externally connected user. The reflexive IP address is the public IP address of the external user e.g. a home router.
  • Relay/TURN Candidate (UDP) – In the event that STUN fails then the final option is to utilise the Edge Server as a media relay. The calling client will establish a media session directly with the A/V Edge Server as will the receiving client. This connectivity is relayed through the public IP address of the Audio/Video Edge service.
  • Relay/TURN Candidate (TCP) – Used when connectivity is not available on UDP. TCP relay is a last resort.
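
The preference order above can be expressed as a simple ranking. This is a minimal sketch assuming the four candidate types listed in this post are the only ones in play; PREFERENCE_ORDER and promote are names I have made up for illustration, and real clients compare numeric priorities rather than a fixed table.

```python
from typing import List, Optional, Tuple

TypedCandidate = Tuple[str, str]  # (kind, transport)

# Most preferred first, following the order described in this post.
PREFERENCE_ORDER: List[TypedCandidate] = [
    ("host", "UDP"),
    ("reflexive", "UDP"),
    ("relay", "UDP"),
    ("relay", "TCP"),
]


def rank(candidate: TypedCandidate) -> int:
    """Lower is better; unknown combinations sort after everything else."""
    if candidate in PREFERENCE_ORDER:
        return PREFERENCE_ORDER.index(candidate)
    return len(PREFERENCE_ORDER)


def promote(validated: List[TypedCandidate]) -> Optional[TypedCandidate]:
    """Pick the most preferred candidate among those that passed checks."""
    return min(validated, key=rank) if validated else None
```

For example, if only the reflexive UDP and both relay candidates passed connectivity checks, promote returns the reflexive UDP candidate.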


SIP Messages in Media Path Establishment


  • Out INVITE (SDP, Session Description Protocol – tells the other party what I can do, e.g. codecs). The first set of candidates is ICE v6 (ms-proxy-2007fallback); the second set is ICE v19. OCS 2007 R2 and later use v19, and both sets are included for backward compatibility. Candidates come in pairs - one for RTP and one for RTCP.

    a=candidate 1 1 Protocol (UDP / TCP Passive – candidate that I expect to send traffic to / TCP Active – candidate that sends me traffic) priority (higher is better) IPAddress Port typ (host/relay/server reflective)

    a=candidate 1 2 UDP priority (higher is better) IPAddress Port typ (host/relay/server reflective).

  • In SIP 183 Peer sends its candidates. You may see multiple – one for each end point.
  • In SIP 200 OK Peer picks up the call. This still includes a full candidate set as the best have not been negotiated yet.
  • Out INVITE Re-invite which will include the 1 chosen candidate pair as decided in the earlier process.
  • In SIP 200 OK Includes other party’s final candidates.
NOTE: The Edge server is used in the discovery process, but not necessarily once the media path has been established. This is why it can be important for internal clients to be able to access the internal NIC of the Edge. If the candidate list doesn’t include UDP and TCP reflective candidates then the client probably can’t talk to the Edge server. If you see only UDP or only TCP then a firewall might be blocking ports.
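
When reading candidate lists in a trace it can help to split each a=candidate line into its fields. The parser below is a hedged sketch: it assumes the RFC 5245-style layout (foundation, component, transport, priority, address, port, then "typ" and the type) and type tokens such as host, srflx and relay. Lines in any other layout simply return None. The names Candidate and parse_candidate are hypothetical, and the addresses in the usage note are made up.

```python
import re
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    foundation: str
    component: str  # "RTP" for component 1, "RTCP" for component 2
    transport: str  # e.g. "UDP", "TCP-PASS", "TCP-ACT"
    priority: int   # higher is better
    ip: str
    port: int
    kind: str       # e.g. "host", "srflx", "relay"


_PATTERN = re.compile(
    r"^a=candidate:(\S+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\d+)\s+typ\s+(\S+)",
    re.IGNORECASE,
)

_COMPONENTS = {1: "RTP", 2: "RTCP"}


def parse_candidate(line: str) -> Optional[Candidate]:
    """Parse one a=candidate line; return None if it does not match."""
    match = _PATTERN.match(line.strip())
    if match is None:
        return None
    foundation, comp, transport, priority, ip, port, kind = match.groups()
    return Candidate(
        foundation,
        _COMPONENTS.get(int(comp), comp),
        transport.upper(),
        int(priority),
        ip,
        int(port),
        kind.lower(),
    )
```

Feeding it a line such as "a=candidate:1 1 UDP 2130706431 10.0.0.5 50000 typ host" yields a Candidate whose component is "RTP" and whose kind is "host"; any trailing attributes after the type are ignored.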

Call Scenarios and Connections Options

Inside <-> Inside
  • Peer to Peer
Inside <-> Outside
  • Peer to peer will not work
  • Outside connects to reflective candidate UDP or TCP
  • Outside connects to own edge server (relay) which hairpins traffic to internal user


Outside <-> Outside
  • Peer to peer might work if clients are on the same network
  • Reflective candidate UDP or TCP
  • Relay via Edge server


Federation OCS 2007
  • Edge servers connect to each other on the 50k port range directly and relay the call. Ports need to be open in both directions.
Federation 2007 R2 (tunnel mode introduced)
  • The Edge server sends a special packet to UDP port 3478 on the other Edge to find out if it is OCS 2007 R2 or above. If it is then tunnel mode can be used, and all UDP traffic can be sent on these ports. Candidate data still includes the 50k ports, but the Edge server just contacts the other Edge server to share this information and connect.
  • TCP is very similar, but because a given combination of source IP/port and destination IP/port can only be used by one connection at a time, the Edge server allocates a port in the 50k range as a source port, and then opens a connection to the other Edge server on port 443. This gets around having to have the 50k ports open, which is required for OCS 2007.

While the 50k port range is not required for OCS 2007 R2 and above, there are still benefits to opening it. In a situation where 2 Edge servers would normally be involved in relaying media, opening the range allows both clients to connect to the same Edge server. The initiating client connects to its home Edge server, gets candidates and passes those to the other party. The other party then attempts to connect to the 50k range directly on the initiator's home Edge server. Without these ports open this would not work, and the client would need to involve its own Edge server and ask it to connect to the initiating Edge to relay on its behalf. This introduces a longer media path.

Troubleshooting Media Connectivity

  • Get client logs from a fresh sign-in – is there MRAS? If not, the client can’t talk to the Edge.
  • Check that the FE can telnet to the FQDN of the internal Edge
  • Check logs for STUN and TURN candidates. If there are none then there is an issue between the client and the Edge
  • Use PortQry to test UDP
  • When the Edge sends candidates in a NAT situation, it uses the external IP configured in the topology and sends this to the client – make sure it’s correct.
  • Search Snooper logs for MRAS to find authentication messages
  • Search for a=candidate to see the candidates
  • Search for a=remote-candidate to find the final candidates that were chosen
  • After call pickup it can take several seconds before final candidates are chosen, and media path might change. Final re-invite will include this but the result may not be in logs for a few seconds after connection.
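
The searches in this checklist can also be run over a saved log as text. The helper below is a minimal sketch and not a replacement for Snooper: summarise_log is a hypothetical name, it treats the log as plain text, and it assumes Edge-provided candidates show up with the type tokens "srflx" or "relay".

```python
from typing import Dict, List, Union


def summarise_log(log_text: str) -> Dict[str, Union[bool, List[str]]]:
    """Collect the lines the checklist above asks you to search for."""
    candidates: List[str] = []
    remote_candidates: List[str] = []
    for raw in log_text.splitlines():
        line = raw.strip()
        # Check the longer prefix first so chosen candidates are kept apart.
        if line.startswith("a=remote-candidate"):
            remote_candidates.append(line)
        elif line.startswith("a=candidate"):
            candidates.append(line)
    return {
        # No MRAS at all suggests the client never got Edge credentials.
        "has_mras": "mras" in log_text.lower(),
        "candidates": candidates,
        "remote_candidates": remote_candidates,
        # No reflexive or relay candidates points at client-to-Edge trouble.
        "edge_candidates_seen": any(
            " typ srflx" in c or " typ relay" in c for c in candidates
        ),
    }
```

If edge_candidates_seen comes back False for an external client, that lines up with the "issue between client and edge" case in the list above. Remember that the final candidates may only appear a few seconds after the call is answered.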

Thanks to Thomas Binder for this excellent deep dive as well as Jeff Schertz for his summary.


4 comments:

  1. Good summary. I've been trying to get a handle on the entire process as we have deployed Lync 2013 in our environment (without enterprise voice), One question I can't seem to find an answer on is, what path does an external client that is using Lync Web App take when connecting to a conference for desktop sharing. The behavior I see in our environment is that when a guest is connected from external using the Lync Web App connected either to someone external using the full client or someone internal using the full client, we have a difficult time establishing a desktop sharing session or maintaining one. Looking at our firewall logs, I see dropped communication between our front end server ip's and the external edge nic ip's. I see the external edge ip trying to talk to the front end server ip on 3478 stun and occasionally other 50,000 or higher ports, as well as the front end server trying to talk to the external edge ip in the 50,000 range. This traffic will all drop because I was under the impression that communication with the front end servers or internal clients should go through the edge internal interface due to our persistent routes on edge.

    I started exploring the log files and candidate lists. With an internal full lync client user and an external lwa user scenario, I see the internal user candidate list look correct.. it sends an invitation and has its local ip of typ host on tcp-act and tcp-pass. I also see the relay ip of the external edge ip in the list. What seems odd is that, the internal client then receives a SIP 200 OK with a candidate list, but the list is the local ip of the front end server's nics (both the ip of the default nic for communication and the ip of the nic used to connect to some back end storage for the lync file share). It also shows the relay address of the edge's external ip.

    Looking at the logs on the lwa client, I see a candidate list that looks correct for its local ip information, but I also see candidate lists show up which list the information for our front end server as well.

  2. That's an interesting problem and one that I have not seen before. The Lync Web App is still an ICE client and will attempt to establish the media path in the same way. Signalling will occur via the FE web services. The FE server is also an ICE client and this may be why you are seeing its IP addresses in some logs; however, I would not expect this to be in the client invites. It may be that the FE forwards the request to the internal client, acting as an ICE proxy, but that's just me speculating.

    What are you using to publish web services externally? Make sure that the time-out is set at 200 seconds or more, I normally configure 3600 seconds.

  3. We are publishing web services externally using an F5 Big IP as a reverse proxy which is a supported device. I thought perhaps this was a routing issue on the Edge server but I have double checked the persistent routes.

  4. Shouldn't be a routing issue if your other client types are working externally. Did you check what your time out is set at?

    Is the issue only affecting content sharing? Does the audio and video also drop?

    Are you using 1 or 3 IP's on your edge server? Are you using NAT?
