Terravision zetel

DEFINITION OF HA
High Availability in the mobile network is similar to that in other systems: it stands for the overall service availability time, considering both planned and unplanned disruption. As an example, Three-Nines means 99.9% availability, or approximately 8.8 hours of downtime per year, while Five-Nines (99.999%) allows only about 5.3 minutes.
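The relationship between an availability target and its yearly downtime budget can be sketched as a simple calculation. This is a minimal illustration that assumes a 365-day year and counts all downtime (planned and unplanned) against the budget; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, target in [("Three-Nines", 99.9),
                      ("Four-Nines", 99.99),
                      ("Five-Nines", 99.999)]:
    minutes = downtime_minutes_per_year(target)
    print(f"{label}: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Under these assumptions, Three-Nines works out to about 525.6 minutes (8.76 hours) per year and Five-Nines to about 5.26 minutes per year.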
Beyond a certain point, the cost of implementing higher levels of availability rises steeply. For example, an improvement from Four-Nines to Five-Nines can require 5 to 10 times more investment, or even more.
For a service presented to the end user, both infrastructure and application availability need to be considered, and the boundary between the two is not always fixed. (See Appendix)
Only when ALL elements in the broadly defined service are inherently highly available, or are designed to achieve high availability through extra modules and methods, will the overall service be highly available. It's crucial to understand this context.
It's also important to understand that the HA target can differ from industry to industry, even if all underlying computing and networking infrastructure is the same. (See Appendix)
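Why every element matters can be shown with a short calculation. The sketch below assumes that the elements fail independently and that the service needs all of them at once, so the availabilities multiply. The element names and figures are hypothetical and chosen only for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


# Hypothetical figures for illustration only, not measured values.
chain = {"RAN": 0.9999, "backhaul": 0.9995, "EPC": 0.9999, "application": 0.999}
overall = serial_availability(chain.values())
print(f"End-to-end availability: {overall:.4%}")
```

Under this model the end-to-end figure is always lower than that of the weakest element, which is why a single non-HA tier can cap the whole service.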
Redundancy
Redundancy should not be confused with High Availability. The two concepts are related, but not the same, nor are they intended to address the same business problem.
Redundancy is one way to improve availability. Under certain circumstances, it may also be regarded as Service Continuity, with either full or reduced service capacity and capability. HA additionally covers inherently robust design, fault tolerance, self-remediation and ultra-fast self-recovery.
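How much redundancy alone can contribute is easy to estimate in an idealized model. The sketch below assumes independent failures, instant failover, and that any single copy can carry the service. Real deployments fall short of this because of failover time and common-mode failures, which is part of why redundancy is not the same as HA. The function name is hypothetical.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of N independent redundant copies where one is enough."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single_availability) ** copies


for n in (1, 2, 3):
    print(f"{n} copy/copies of a 99% element: {redundant_availability(0.99, n):.6f}")
```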
LTE & 5G network service components
It's neither feasible nor effective to enumerate every aspect in this White Paper, so the architecture view is based on Zetel's implementation, extended as far as possible to cover other potential options.
Architecture layers and building blocks - from the HA View
Any LTE network implementation may deploy some or all of these HA blocks in order to achieve the desired SLO based on business requirement, budget and other considerations.
S1-Flex and Geo-Redundancy
S1-Flex enables many-to-many connections between the eNodeBs and the Core Network (MME and SGW), avoiding a single point of failure and allowing load sharing.
The MME and SGW (Geo-Redundancy) Pool maintains not only service continuity, but also the more challenging session continuity, e.g. on loss of S1 or S11 links during a complete site failure.
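The load-sharing side of S1-Flex can be pictured as each eNodeB choosing an MME from the pool for a newly attaching UE. The following is a minimal sketch, assuming each MME advertises a capacity weight and that the eNodeB knows which S1 links are up. The dictionary keys, MME names and weights are hypothetical and do not reflect any particular vendor's data model.

```python
import random


def select_mme(pool, rng=random):
    """Pick one reachable MME, with probability proportional to its weight.

    pool: list of dicts with hypothetical keys 'name', 'weight', 'reachable'.
    """
    candidates = [m for m in pool if m["reachable"] and m["weight"] > 0]
    if not candidates:
        raise RuntimeError("no reachable MME in the pool")
    weights = [m["weight"] for m in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]["name"]


# Hypothetical pool: mme-2 has lost its S1 link and is skipped.
example_pool = [
    {"name": "mme-1", "weight": 100, "reachable": True},
    {"name": "mme-2", "weight": 100, "reachable": False},
    {"name": "mme-3", "weight": 50, "reachable": True},
]
print(select_mme(example_pool))
```

Because unreachable MMEs are filtered out before the weighted draw, the loss of one MME or one S1 link shifts new attaches to the remaining pool members instead of failing them.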
Session continuity is made possible by mirroring user data (contexts) between MMEs in a Pool during normal operation with stateful replication (including VoLTE calls and SIP sessions).
Without session continuity, another potentially service-disruptive factor is that, after a rare failure event, all UEs could attempt to re-attach to the network simultaneously through the same entry point, creating an attach signaling storm that could cause significant service degradation.
Non-disruptive session migration also saves maintenance effort, virtually eliminating planned downtime.
Zetel MME HA Agent and S6a-Flex
These are Zetel features:
The MME HA Agent addresses the situation where S1-Flex isn't available, e.g. because it is not fully supported by the eNodeBs. It acts as a Floating Agent and Load Balancer in front of the MME pool, and manages SCTP connections to operative MMEs only, much like S1-Flex.
However, it does not provide Session Continuity.
The S6a-Flex feature is similar to S1-Flex as well. It maintains connections from multiple MMEs to multiple HSSs in order to increase overall availability.
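The "operative MMEs only" behaviour can be sketched as a health-gated round-robin dispatcher. This is an illustrative model, not Zetel's implementation: the class and method names are hypothetical, health status is assumed to come from some external check, and no SCTP handling is shown.

```python
class MmeHaAgent:
    """Round-robin dispatcher that only hands out operative MMEs."""

    def __init__(self, mme_names):
        self._mmes = list(mme_names)
        self._operative = set(self._mmes)  # all MMEs start as operative
        self._next = 0

    def mark_down(self, name):
        self._operative.discard(name)

    def mark_up(self, name):
        if name in self._mmes:
            self._operative.add(name)

    def pick(self):
        """Return the next operative MME, skipping any that are marked down."""
        for _ in range(len(self._mmes)):
            candidate = self._mmes[self._next]
            self._next = (self._next + 1) % len(self._mmes)
            if candidate in self._operative:
                return candidate
        raise RuntimeError("no operative MME available")


agent = MmeHaAgent(["mme-1", "mme-2", "mme-3"])
agent.mark_down("mme-2")
print([agent.pick() for _ in range(4)])
```

Note that the dispatcher keeps no per-session state, which matches the limitation stated above: new connections are steered to healthy MMEs, but existing sessions on a failed MME are not preserved.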
Layer-2/3 Networking HA
Provided by networking switches and routers, using common technologies such as multi-pathing, link redundancy, VLAN trunking, floating IP, DNS, etc.
Host Clustering
Provided by ESXi Clustering with Linux VMs, or Linux Clustering, etc. using common techniques like storage mirroring or replication, floating IP, heart-beating, voting, I/O fencing, etc.
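The voting technique mentioned above usually comes down to a majority rule: after a network partition, only the side that still sees a strict majority of the cluster's votes keeps running, which avoids split-brain. The sketch below is a generic illustration with a hypothetical function name, not the logic of any specific clustering product.

```python
def has_quorum(votes_seen: int, cluster_size: int) -> bool:
    """True when a partition holds a strict majority of the cluster's votes."""
    if cluster_size < 1 or not 0 <= votes_seen <= cluster_size:
        raise ValueError("invalid vote counts")
    return votes_seen > cluster_size // 2


# A 3-node cluster split 2/1: only the 2-node side may continue.
print(has_quorum(2, 3), has_quorum(1, 3))
# A 4-node cluster split 2/2: neither side has a majority.
print(has_quorum(2, 4))
```

The even split in the last case is one reason clusters are often built with an odd number of voters or an additional tie-breaker.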
MySQL Geo-Replicated Clusters
Provided by MySQL Cluster and Geo-Replication.
LTE UE Only Hypothetical
This case is only hypothetical: it applies when an individual UE (CPE) itself has an HA requirement in order to provide guaranteed service to upstream applications.
It’s generally recommended to have multiple such CPEs to provide a connection pool and use Layer-3/4 networking redundancy for the application.
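A connection pool of CPEs can be reduced to a simple failover rule at the application side. The sketch below assumes an ordered list of CPEs and a caller-supplied health probe. Both the CPE names and the probe are hypothetical, and a real deployment would more likely rely on routing or link-monitoring features than on application code.

```python
def choose_uplink(cpes, is_healthy):
    """Return the first healthy CPE in priority order, or None if all are down."""
    for cpe in cpes:
        if is_healthy(cpe):
            return cpe
    return None


# Hypothetical health state: the primary CPE has lost its LTE attachment.
status = {"cpe-primary": False, "cpe-backup": True}
print(choose_uplink(["cpe-primary", "cpe-backup"], lambda name: status[name]))
```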
Appendix
To present a service consumed by the end user, both infrastructure and application availability need to be considered, and the boundary between the two is not always fixed.
For example, in the LTE network, the EPC (the Core Network) is perceived as infrastructure by mobile applications, but within the EPC, the MME, HSS and SGW modules are all applications relative to the underlying network infrastructure, computing units and operating systems (proprietary or non-proprietary). There are similar cases in the RAN (the Radio Access Network), within the implementation of the eNodeB and Radio Backhaul.
Another example is NMS/EMS and BOSS, which form part of the mobile network infrastructure from the point of view of end users and applications, and some of which do impact overall service availability; to the RAN and Core Network themselves, however, they are all applications.
The HA design needs to consider ALL tiers of application and infrastructure in order to make the whole service presented as a robust and resilient network infrastructure.
Implication of Availability - Business Dependent
It’s important to understand that the HA target can differ from industry to industry, even if all underlying computing and networking infrastructure is the same.
The target may vary with application requirements, regulatory requirements and industry differences.
For example, in a typical mine production scenario, Data Integrity ranks above Service Continuity, while in the Telco case it’s usually the opposite. This affects how the overall HA is designed, even though the underlying technologies may be the same, e.g. an Oracle database, a Linux Operating System, or a Layer-3 cluster.
PCRF
Large carriers require standalone PCRF modules in order to process complex and dynamic policy rules. All subscriber transactions depend on the PCRF, so it requires its own HA facility as well as local redundancy back to the PGW.
In the compact case, on the other hand, we build the PCRF embedded within the PGW, so it is inherently as available as the PGW itself, and it does not need the complexity that standalone PCRFs provide.