5G networks are becoming a reality across the globe, giving rise to new business models and innovative applications across several industries. This next generation of cellular networks is not simply an upgrade from previous ones but rather a structural evolution of contemporary telco networks, underpinned by emerging technologies.
One of these fundamental technologies is Network Functions Virtualization (NFV), which helps Communication Service Providers (CSPs) reduce their total cost of ownership and increase their service scalability and agility. These benefits are achieved by allowing functions from the 5G mobile core and radio to run as lightweight, cloud-native software on standardized commodity hardware.
Despite its advantages, NFV is confronted with a challenge that can easily break its promise of greater flexibility and cost efficiency. Making optimal use of compute resources, in line with the runtime demand of network functions, is a hard task that becomes even harder as mobile networks expand from the core to the edge. Evidently, efficient resource management is a key prerequisite for NFV to deliver on its promises.
One of the key industry priorities since the early days of NFV was, and still remains, to guarantee carrier-grade performance for certain Virtual Network Functions (VNFs). In many cases this leads to hardware overprovisioning. For example, a common practice among CSPs is to deploy performance-critical VNFs in isolation, reserving upfront a large portion of server resources for their exclusive use. In this way, the VNFs have sufficient resources even during demand peaks and are protected from contention with other workloads that could affect their Service Level Objectives (SLOs). The drawback, however, is that a significant amount of server resources remains unutilized, thus resulting in considerable OpEx and CapEx attributed to the NFV infrastructure.
To make things even more complicated, edge servers are expected to consolidate an unusually disparate mixture of mobile core functions, RAN components, over-the-top services and new types of applications requiring proximity to end users. This increase in software diversity and density will introduce uncertainty about how workloads on a single server will interfere with each other, which, in turn, gives rise to the challenge of optimally distributing shared compute resources to meet SLOs with a minimal resource footprint.
In this respect, CSPs can no longer depend on conventional approaches to manage resources efficiently, in real time and at scale. Techniques ranging from rule-based automation to decision making based on analytics and historical data would be inadequate, as they fail to capture the highly dynamic nature of 5G deployments. Instead, an ideal solution must be able:
Such qualities suggest not simply an intelligent resource management solution but rather one that is able to autonomously learn and adapt to its environment.
Such an autonomous resource management solution demands advanced AI-powered cognitive agents, able to learn by themselves how to make optimal resource decisions in a live telco network, starting from zero prior knowledge. The learned knowledge can then potentially be disseminated among agents across edge locations, accelerating optimal decision making across the entire network.
The only technology that currently meets these ambitious requirements is Reinforcement Learning (RL). In RL, an agent is trained in a way that rewards desired behaviors and/or punishes undesired ones. RL has two distinct phases: a training (or exploration) phase and an inference (or exploitation) phase. During training, the agent takes random actions in its environment, exploring which ones lead to higher rewards. During inference, the agent has already learned the best action for each state and leverages this knowledge to maximize the reward it gathers. Recent advances in Deep Learning have led to a new approach, namely Deep RL, which leverages the power of Deep Neural Networks to tackle much more complex environments than "pure" RL methods.
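To make the two phases more concrete, the following minimal sketch trains a toy tabular Q-learning agent through random exploration and then exploits the learned values greedily. It is purely illustrative: the environment, state and action spaces are hypothetical and unrelated to any real VNF workload or to SLRA itself.

```python
# Minimal sketch of the two RL phases on a toy problem (not SLRA itself):
# a tabular Q-learning agent first explores with random actions (training),
# then exploits the learned Q-table greedily (inference).
import random

N_STATES, N_ACTIONS = 5, 3
ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount factor

# Hypothetical environment: the reward is highest when the action index
# matches the state index modulo the number of actions.
def step(state, action):
    reward = 1.0 if action == state % N_ACTIONS else 0.0
    next_state = random.randrange(N_STATES)
    return next_state, reward

q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

# Training (exploration) phase: random actions, Q-table updated from rewards.
state = 0
for _ in range(5000):
    action = random.randrange(N_ACTIONS)
    next_state, reward = step(state, action)
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (reward + GAMMA * best_next - q_table[state][action])
    state = next_state

# Inference (exploitation) phase: always pick the action with the highest Q-value.
state, total_reward = 0, 0.0
for _ in range(100):
    action = max(range(N_ACTIONS), key=lambda a: q_table[state][a])
    state, reward = step(state, action)
    total_reward += reward
print(f"reward collected during exploitation: {total_reward:.0f}")
```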
In the NFV context, RL can be used to determine the ideal distribution of shared resources among VNFs colocated on a server, so that:
Adopting RL in NFV signifies a promising step forward over other approaches for the following reasons:
Combined, these factors lead to an explosion of the decision space into an extreme number of state-action pairs. On top of that, decisions must be taken in real time. These two imperatives make Deep RL the only realistic solution.
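The sketch below hints at why a neural network becomes necessary once the state is a vector of continuous server metrics rather than a small enumerable set: a tiny Q-network scores a handful of candidate core allocations for a VNF. The metrics, actions and network are hypothetical and the weights are left untrained here; in a real Deep RL agent they would be learned from (state, action, reward) experience.

```python
# Illustrative sketch (not SLRA's actual model): a small neural network
# approximates Q(state, action), so the agent can handle continuous metric
# vectors that would be impossible to enumerate in a lookup table.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical state: per-VNF CPU utilization, cache misses, packet rate, latency.
STATE_DIM = 4
# Hypothetical discrete actions: give the VNF 2, 4, 6 or 8 dedicated cores.
ACTIONS = [2, 4, 6, 8]

# Randomly initialized two-layer Q-network; in practice its weights would be
# trained from experience, e.g. with DQN-style updates.
W1 = rng.normal(scale=0.1, size=(STATE_DIM, 16))
W2 = rng.normal(scale=0.1, size=(16, len(ACTIONS)))

def q_values(state: np.ndarray) -> np.ndarray:
    """Forward pass: state vector -> estimated value of each resource action."""
    hidden = np.maximum(0.0, state @ W1)   # ReLU hidden layer
    return hidden @ W2

# At inference time the agent reads live metrics and picks the action with the
# highest predicted value, i.e. the allocation expected to best balance
# SLO compliance against resource footprint.
observed_metrics = np.array([0.72, 0.15, 0.88, 0.40])   # hypothetical, normalized
best = int(np.argmax(q_values(observed_metrics)))
print(f"allocate {ACTIONS[best]} cores to the VNF")
```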
Despite its great potential, RL comes with certain challenges in production environments:
RL must observe and act randomly upon a certain state many times in order to determine which action is optimal for each individual state. As an indication, fully optimizing an application may require as many as 50,000 observations. Trying such a large number of random actions in a live system could lead to serious service degradation or even failure.
In a production server, taking samples of performance metrics at sub-second frequency is usually impractical and of questionable value. Instead, it is very common for samples to be collected at multi-second granularity (e.g., 10 to 60 seconds), which means an RL agent would need between six days and a month (!) to get fully trained.
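This range follows directly from the sampling arithmetic; the short calculation below reproduces it for the 50,000 observations mentioned above.

```python
# Back-of-the-envelope check of the training-time figures above.
OBSERVATIONS = 50_000                      # samples needed for full optimization
for interval_s in (10, 60):                # multi-second sampling granularity
    days = OBSERVATIONS * interval_s / 86_400
    print(f"{interval_s:>2} s sampling -> ~{days:.1f} days of training")
# 10 s -> ~5.8 days, 60 s -> ~34.7 days (roughly a month)
```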
If the environment in which the RL agent is trained changes, there is a strong possibility that the VNFs will no longer meet their SLOs. Such changes may involve the amount or type of hardware resources, the application's incoming load, or the performance constraints set by the user. When they occur, the agent has to be retrained, which is a potentially very difficult task in the context of a large data center.
To overcome these challenges, Intracom Telecom created Self Learning Resource Allocation (SLRA), an advanced Deep RL system capable of optimizing the resource allocation of any application.
SLRA introduces three key innovations:
SLRA is a new feature of the NFV-RI™ solution and aspires to become the state-of-the-art paradigm for autonomous resource decisions in modern NFV deployments.
The potential of Deep RL has been demonstrated through a series of use cases and workloads that are representative of modern telco networks. In these, SLRA managed to minimize resource usage under varying load conditions without incurring any SLO violations.
Specifically:
These results indicate that Deep RL has great potential for dynamically optimizing various types of applications and can generalize across resources of different types and semantics, rendering it a more effective alternative to existing approaches.
At a larger scale, Intracom Telecom is evaluating SLRA in a PoC project of the ETSI Experiential Networked Intelligence Industry Specification Group. In the PoC, SLRA is challenged to fine-tune resources for multiple colocated VNFs implementing differentiated 5G slices, in order to meet latency SLOs fully automatically and always in line with incoming traffic.
To find out more on the NFV Resource Intelligence (NFV-RI™) solution and the SLRA application, please visit www.intracom-telecom.com/nfvri.