Communications Technology: Archive

November 2003 Issue

VoIP Availability: Will Your Net Thrive Under Pressure?

By Navin Thadani, Cisco Systems

Cable operators are planning to roll out commercial deployments of voice-over-Internet protocol (VoIP) services, and indeed, some have already started to do so. To be successful at voice and compete effectively with the incumbent telecommunications providers, cable companies need to operate highly available and reliable networks. But how does one assess the availability and performance of a network and evaluate design modifications to bring the existing cable IP network up to par and perhaps even exceed the public switched telephone network (PSTN) in terms of availability?

In addition to examining network-wide availability or downtime, it is also important to focus on "end-user service availability." That is, what is the level of service availability that customers will experience? For example, voice service availability is calculated by calls dropped and ineffective call attempts, in addition to network downtime. Figure 1 shows the three primary components of end-user service availability, encompassing end-to-end network-availability and service-availability metrics.

What is service availability?

The first component of service availability includes the end-to-end products already in place and their redundant parts, such as redundant route processors, line cards and power supplies, as well as product-level hardware features that have been implemented.

The second is the end-to-end architecture. The availability of a product cannot be viewed in isolation. Rather, it is tied intrinsically to the architecture in which it is deployed. For example, consider a product that consists of an ingress line card, a route processor and an egress wide area network (WAN) line card that connects it to the rest of the network. Suppose the WAN card fails. In one case, assume the architecture allows voice traffic to start flowing again in 50-100 milliseconds (ms), and in another case, the traffic starts flowing again in 45 seconds to 1 minute. That difference in recovery time affects the product availability. It is, therefore, necessary to evaluate in detail the underlying architecture (Layer 2 paths, routing, Layer 3 architecture, etc.) and how that affects product availability.

The third aspect, operations, includes important factors such as whether a facility has tested spares or is staffed 24 hours a day. For example, a product in a remote hub without 24-hour staffing or tested spares may take up to four hours to repair, whereas the same product in a facility with 24-hour staffing and tested spares may be repaired in less than 10 minutes. Operations, therefore, play a critical role in end-user service availability. Other operational aspects such as operator training, change management, diagnostics and upgrade policies also impact availability.

Availability requirements and PSTN myths

The PSTN always has been thought of as the benchmark for availability. Before offering a competitive voice service, cable operators have to make sure that their networks are capable of providing the same (or higher) level of service availability as the PSTN. Reliability has been the legacy of the PSTN, but there exist several myths and misconceptions regarding its true availability. Some of these myths include:

The PSTN provides 99.999 percent (or "five-nines") availability end to end.
Operators need five-nines availability on every platform to achieve PSTN equivalence.
Every failure in the network is recovered in less than 50 ms.

PacketCable's report "VoIP Availability and Reliability Model for the PacketCable Architecture" (PKT-TR-VoIPAR-V01-001128) does an excellent job of dispelling some of these myths. It notes that the idea of PSTN reliability being equivalent to 99.999 percent is incorrect (Editor's note: Bellcore originally defined four nines, or 99.99 percent, as the telco network availability goal. See "What's in a Number? Defining Availability is Tricky," December 1999 Communications Technology; broadband-pbimedia.com/ct/archives/1299/hranac.htm). The PacketCable report clearly breaks down the different subsections of the PSTN network and draws a direct analogy to an equivalent IP network. Per these requirements, the end-to-end availability of a VoIP network should be greater than 99.94 percent to achieve equivalence with the PSTN.

PacketCable also specifies service-availability metrics as the number of calls dropped and the number of ineffective attempts. Figure 2 shows the PacketCable availability reference model for an on-net to off-net call.

Calls dropped

Per the PacketCable report, there should not be more than one in 8,000 calls dropped, and no more than five in 10,000 ineffective attempts. These numbers are exactly the same as the PSTN requirements for availability and service availability as set forth in Bellcore GR series specifications.

Dropped calls arise from failures in the bearer path of the voice call. At the two endpoints of the bearer path, namely the cable modem termination system (CMTS) facing the customer and the PSTN gateway facing the PSTN in the case of an on-net to off-net call, a dropped call may occur due to a failure on a line card. It also may occur due to a failure to copy call or state information to the standby line card (if there is redundancy).

However, in the rest of the network, there is no concept of call state. For example, if a core router in the network fails and it takes 40 seconds to reroute traffic to the alternative path, one can imagine an end user becoming frustrated and hanging up the phone after a certain period of time. This should also be considered a dropped call.

Therefore, in the realm of IP, a dropped call could occur for two reasons: inability to maintain call state at the endpoints in the event of a failure, or inability to recover traffic within a certain call-dropped threshold in the event of a failure at the endpoints or in the core of the network.

In reality, the call-dropped threshold is user-dependent, but most IP providers agree on three seconds as that threshold. That is, if a failure in the network leaves the user with "dead air" for more than three seconds and the user hangs up, that call is considered a dropped call. Calls dropped also can be measured in defects per million (DPM). For example, one in 8,000 calls dropped can be referred to as 125 DPM(calls dropped) or DPM(CD).

Ineffective attempts

Ineffective attempts arise because of failures in the signaling path of a voice call. Per the PacketCable definition, "an ineffective attempt occurs when any valid bid for service does not complete because of a fault condition (for example, hardware or software failure)." That means if a user is trying to make a call but cannot because of a signaling-path failure, it is counted as an ineffective attempt.

However, here again a threshold is necessary. The ineffective-attempts threshold commonly used in the industry today is 30 seconds. For example, if a user tries to make a call and it doesn't go through, and the user tries again and the call is completed the second time, it is not counted as an ineffective attempt as long as the whole process takes less than 30 seconds. Ineffective attempts also can be expressed in DPM(ineffective attempts) or DPM(IA). So five in 10,000 ineffective attempts could be stated as 500 DPM(IA).
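The conversion from an "N in M" target to DPM is simple arithmetic. The following minimal sketch reproduces the two figures quoted in the text; the function name `ratio_to_dpm` is an illustrative choice, not something defined by PacketCable.

```python
def ratio_to_dpm(defects: int, opportunities: int) -> float:
    """Convert 'defects in opportunities' into defects per million (DPM)."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    return defects * 1_000_000 / opportunities


# Targets quoted in the article.
dpm_cd = ratio_to_dpm(1, 8_000)    # one in 8,000 calls dropped
dpm_ia = ratio_to_dpm(5, 10_000)   # five in 10,000 ineffective attempts

print(f"DPM(CD) target: {dpm_cd:.0f}")
print(f"DPM(IA) target: {dpm_ia:.0f}")
```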

50-ms recovery and VoIP

The practical outage thresholds for calls dropped and ineffective attempts indicate that there is no physical requirement for a 50-ms switchover. In fact, failures in the PSTN have to be recovered in less than 50 ms because of the nature of the time division multiplexing (TDM) signaling architecture, which is based on bit robbing. If failures are not recovered within that time, signaling errors occur in the T1 circuits and erroneous signals could propagate throughout the network, possibly overloading the processors on the Class 5 switches (because the processors directly interpret these bits on the TDM network).

IP equipment and networks are designed differently and do not have the same signaling-protocol dependencies. All signaling in IP is message-based. While 50 ms failover is possible for VoIP, it raises the cost of the end-to-end solution without any clear benefit (especially because there is no physical requirement or signaling protocol constraint). It is important, however, to recover a failure within the user-perceived calls-dropped threshold, or within three seconds.
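As a rough illustration of how the two practical thresholds compare with the 50-ms figure, the sketch below classifies an outage by its duration alone. This is a simplification: it assumes the outage sits on the relevant path (bearer path for dropped calls, signaling path for ineffective attempts), and the function and key names are hypothetical.

```python
CALLS_DROPPED_THRESHOLD_S = 3.0           # "dead air" limit cited in the article
INEFFECTIVE_ATTEMPTS_THRESHOLD_S = 30.0   # retry window cited in the article
TDM_SWITCHOVER_S = 0.050                  # 50-ms PSTN/TDM recovery figure


def outage_impact(outage_s: float) -> dict:
    """Classify an outage of the given duration against each threshold."""
    if outage_s < 0:
        raise ValueError("outage duration cannot be negative")
    return {
        "within_50_ms": outage_s <= TDM_SWITCHOVER_S,
        "may_drop_calls": outage_s > CALLS_DROPPED_THRESHOLD_S,
        "may_cause_ineffective_attempts": outage_s > INEFFECTIVE_ATTEMPTS_THRESHOLD_S,
    }


# Durations taken from examples in the article: a 100-ms recovery
# and a 40-second core-router reroute.
for seconds in (0.1, 40.0):
    print(seconds, outage_impact(seconds))
```

A 100-ms recovery misses the 50-ms mark yet stays well inside both service thresholds, which is the point the section makes.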

Calculating availability in complex systems

Availability: Availability is defined as the probability that a customer will find the network in an operating condition. It is commonly expressed as the relationship between mean time between failures (MTBF) and mean time to repair (MTTR), or MTBF/(MTBF+MTTR).
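The MTBF/(MTBF+MTTR) relationship can be sketched in a few lines. The MTBF value below is a made-up figure chosen only for illustration; the two repair times (four hours and 10 minutes) are the ones used in the operations example earlier in the article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single device: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - avail) * MINUTES_PER_YEAR


HYPOTHETICAL_MTBF_HOURS = 100_000.0  # assumption, not a figure from the article

unstaffed = availability(HYPOTHETICAL_MTBF_HOURS, 4.0)       # four-hour repair
staffed = availability(HYPOTHETICAL_MTBF_HOURS, 10.0 / 60)   # 10-minute repair

print(f"unstaffed hub: {unstaffed:.6%}")
print(f"staffed site:  {staffed:.6%}")
print(f"99.94%  -> {downtime_minutes_per_year(0.9994):.1f} min/year")
print(f"99.999% -> {downtime_minutes_per_year(0.99999):.1f} min/year")
```

For scale, 99.94 percent corresponds to roughly 315 minutes of downtime per year, while 99.999 percent corresponds to a little over five minutes.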

Such a definition for availability is good for a simple system comprising one device. However, in a network that consists of numerous trunks and routers, most failures are partial failures.

As a result of a partial failure, some customers will not receive service, while others have uninterrupted service. Also, even within a router or a switch, only one line card may go down, and users connected to other line cards may not see any disruption in service. Thus, availability is defined with respect to a customer of the network. To compute availability, it is only necessary to consider the components along the path needed to provide service to a single customer. This is then considered to be the average experience for all users.

Also consider redundancy. For example, certain components such as line cards may be configured in terms of 1:N active standby or 1:N load sharing.

To correctly calculate the availability of a single part such as a line card or route processor, it is necessary to take into account several factors such as:

Switchover time: The amount of time it takes to switch over from the active component to the standby component.
Active coverage factor: The probability that a failure is successfully detected and switched over.
Standby coverage factor: The probability that the standby device is in working condition and can successfully take over.

A Markov State definition can be used for each component, such as a route processor or line card, within a router or a switch. Figure 3 illustrates 1:N redundancy and a combined effective coverage factor C. Similar state diagrams can model load-sharing redundancy.
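To see why switchover time and coverage matter, consider the rough first-order sketch below. It is not the Markov model of Figure 3: it assumes failures are rare, ignores a second failure during repair, and folds the active and standby coverage factors into a single combined factor C. All names and numeric inputs are illustrative assumptions.

```python
def redundant_part_unavailability(
    mtbf_hours: float,
    mttr_hours: float,
    switchover_s: float,
    coverage: float,
) -> float:
    """First-order unavailability of a part protected by a standby.

    With probability `coverage` a failure is detected and the standby takes
    over, so the outage lasts only the switchover time. Otherwise the outage
    lasts the full repair time. Double failures are ignored (assumption).
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    failure_rate_per_hour = 1.0 / mtbf_hours
    switchover_hours = switchover_s / 3600.0
    expected_outage_hours = (
        coverage * switchover_hours + (1.0 - coverage) * mttr_hours
    )
    return failure_rate_per_hour * expected_outage_hours


# Hypothetical line card: 100,000-hour MTBF, 4-hour repair, 3-second switchover.
for c in (0.0, 0.9, 0.99, 1.0):
    u = redundant_part_unavailability(100_000.0, 4.0, 3.0, c)
    print(f"C = {c:.2f}: availability ~ {1.0 - u:.7%}")
```

Under these assumptions, even a small shortfall in coverage dominates the result, because an uncovered failure costs hours while a covered one costs seconds.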

For a given type of redundant part, the combined or series-equivalent part availability, series-equivalent part MTBF, and series-equivalent part MTTR are calculated as follows:

Based on the preceding equations, once availability of each and every component along the path of a voice call is calculated, a product of these individual availability numbers is used to generate the availability of the overall system or network.
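The product rule for components in series can be sketched as follows. The component names and availability values are placeholders chosen for illustration, not measurements or figures from the PacketCable report.

```python
import math


def path_availability(component_availabilities) -> float:
    """Availability of a path whose components must all work (series model)."""
    values = list(component_availabilities)
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("each availability must be between 0 and 1")
    return math.prod(values)


# Hypothetical components along one customer's voice path.
path = {
    "CMTS line card": 0.99995,
    "CMTS route processor": 0.99999,
    "aggregation router": 0.99998,
    "core router": 0.99999,
    "PSTN gateway": 0.99995,
}

end_to_end = path_availability(path.values())
print(f"end-to-end availability: {end_to_end:.5%}")
print("meets 99.94% target:", end_to_end > 0.9994)

# Ten five-nines components in series no longer give five nines end to end.
print(f"ten five-nines parts: {path_availability([0.99999] * 10):.5%}")
```

Because every factor is below one, the end-to-end figure is always lower than that of the weakest component on the path.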

Service availability

It is important to note that this provides only the availability and downtime of the system or network and does not provide any insight into whether the service (in this case, voice) is available or not. For that information, two other service-availability metrics, calls dropped and ineffective attempts, are used.

Calls dropped: Based on the definition of dropped calls, this metric is a function of the MTBF. The calls-dropped contribution by each component (line cards, route processor, chassis, power supplies, etc.) along the path of a single user needs to be calculated. For each component, calculate the DPM(CD) as follows:

The calculation counts each failure for which the switchover time (in case of a redundant part) or the repair time (for a nonredundant part) is greater than the calls-dropped threshold of three seconds. In the preceding formula, the failure rate is the inverse of the series-equivalent MTBF.

Ineffective attempts: A similar process helps calculate the ineffective-attempts contribution per component along the path of a single user:

Again, the calculation counts each failure for which the switchover time or repair time is greater than the ineffective-attempts threshold of 30 seconds.
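The sketch below is one plausible way to estimate per-component contributions; it should not be read as the article's own formulas. It assumes failures arrive at a constant rate equal to 1/MTBF, that a call in progress is dropped if a qualifying failure occurs during its hold time, and that a call attempt is ineffective if it arrives while a qualifying outage is in progress. The mean hold time and all numeric inputs are assumptions.

```python
CD_THRESHOLD_S = 3.0    # calls-dropped threshold from the article
IA_THRESHOLD_S = 30.0   # ineffective-attempts threshold from the article


def dpm_calls_dropped(mtbf_hours: float, outage_s: float,
                      mean_hold_time_s: float) -> float:
    """Approximate DPM(CD) for one component (assumed model).

    Probability that a failure hits a call in progress is roughly
    failure rate x mean call hold time, valid when that product is small.
    """
    if outage_s <= CD_THRESHOLD_S:
        return 0.0  # recovered quickly enough that the call survives
    failures_per_second = 1.0 / (mtbf_hours * 3600.0)
    return failures_per_second * mean_hold_time_s * 1_000_000


def dpm_ineffective_attempts(mtbf_hours: float, outage_s: float) -> float:
    """Approximate DPM(IA) for one component (assumed model).

    Fraction of attempts arriving during an outage is roughly
    failure rate x outage duration.
    """
    if outage_s <= IA_THRESHOLD_S:
        return 0.0  # a retry succeeds within the threshold
    failures_per_second = 1.0 / (mtbf_hours * 3600.0)
    return failures_per_second * outage_s * 1_000_000


# Hypothetical component: 100,000-hour MTBF, 45-second recovery,
# 180-second mean call hold time.
print(dpm_calls_dropped(100_000.0, 45.0, 180.0))
print(dpm_ineffective_attempts(100_000.0, 45.0))
```

Summing such contributions over every component on a single user's path would give end-to-end figures to compare against the 125 DPM(CD) and 500 DPM(IA) targets.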

Planned downtime

DPM(CD) and DPM(IA) are defined for a certain call rate. In the previous section where unplanned DPM were calculated, the calculation assumed a mean call rate over a 24-hour period.

However, when a company upgrades equipment or performs any kind of planned maintenance, there also is a certain amount of downtime and associated loss of service in terms of calls dropped and ineffective attempts. To calculate the effect of scheduled outages on service availability, it is possible to use a method similar to the preceding one with the exception that, instead of a summation across numerous random failures, this method considers only calls dropped or ineffective attempts caused by the outage time during the upgrade (for example, twice a year).

Scheduled maintenance usually is completed late at night to minimize the effects of downtime on service availability. Typically the incoming call rate at that time is significantly lower than the mean call rate. This means it is important to factor down the DPM(CD) and DPM(IA) by the ratio of the late-night call rate to the mean call rate, to arrive at terms that are comparable and additive. Figure 4 assumes a 3-a.m. scheduled maintenance.
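The scaling step described above amounts to a single multiplication. In this minimal sketch the call rates and the starting DPM value are invented for illustration and are not taken from Figure 4.

```python
def scale_planned_dpm(dpm_at_mean_rate: float,
                      late_night_call_rate: float,
                      mean_call_rate: float) -> float:
    """Scale a planned-outage DPM by the late-night to mean call-rate ratio."""
    if mean_call_rate <= 0:
        raise ValueError("mean call rate must be positive")
    if late_night_call_rate < 0:
        raise ValueError("late-night call rate cannot be negative")
    return dpm_at_mean_rate * (late_night_call_rate / mean_call_rate)


# Hypothetical inputs: a maintenance window that would cost 40 DPM at the
# mean call rate, performed when traffic is one-tenth of the mean.
planned_dpm = scale_planned_dpm(40.0, late_night_call_rate=10.0,
                                mean_call_rate=100.0)
unplanned_dpm = 60.0  # hypothetical figure from the unplanned analysis

print(f"planned contribution: {planned_dpm:.1f} DPM")
print(f"total: {planned_dpm + unplanned_dpm:.1f} DPM")
```

Once scaled this way, the planned and unplanned contributions are expressed against the same mean call rate and can be added.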

Conclusion

There is much more to availability than the percentage of uptime on a platform or device. Cable operators must consider end-to-end network availability and, more importantly, service availability.

Considering the findings in the PacketCable report, it is clear that:

The PSTN does not offer 99.999-percent availability end to end. Although certain components in the PSTN network may offer five-nines availability (as is also the case with IP equipment), the end-to-end network achieves greater than 99.94 percent.
Not every device in the network has to offer five-nines availability; rather, the end-to-end network availability should be greater than 99.94 percent.
Not every failure has to be recovered in less than 50 ms. Failures should be recovered within the dropped-calls and ineffective-attempts thresholds, and the end-to-end network should cause no more than one in 8,000 dropped calls and no more than five in 10,000 ineffective attempts. The industry-accepted practical threshold for dropped calls is three seconds and that for ineffective attempts is 30 seconds.

Part 2 of this article, which will be published in a future issue of Communications Technology, outlines a methodology to estimate the network and service availability of an end-to-end cable IP network and gives an example of a network that meets the specifications set forth by PacketCable.

Navin Thadani is the manager for cable industry development at Cisco Systems.

The author would like to thank John Chapman, Jim Forster, Madhav Marathe, Henry Zhu and Jim Huang from Cisco Systems for their contribution to this article.


Bottom Line

When considering availability in a VoIP network, be sure to reevaluate some common misconceptions and consider the following:

The PSTN does not offer 99.999-percent availability end to end. Although certain components in the PSTN network may offer five-nines availability, the end-to-end network achieves greater than 99.94 percent.
Not every device in the network has to offer five-nines availability; rather, the end-to-end network availability should be greater than 99.94 percent.
Not every failure has to be recovered in less than 50 ms. Failures should be recovered within the dropped-calls and ineffective-attempts thresholds, and the end-to-end network should cause no more than one in 8,000 dropped calls and no more than five in 10,000 ineffective attempts. The industry-accepted practical threshold for dropped calls is three seconds and that for ineffective attempts is 30 seconds.

Figure 1: End User Service Availability

Figure 2: PacketCable Availability Model

Figure 3: Markov State Diagram for 1:N Redundancy

Figure 4: Calculating Impact of Scheduled Maintenance