Addressing Big Data Monitoring Challenges
March 2022
Konstantinos Anastasakos, Product Manager, Telco & Enterprise Software


Today, one of the key success indicators across all industries is the degree of digital transformation they have achieved. This transformation originates from, and is empowered by, the unprecedented evolution of Data Analytics and Big Data toward larger and more complex structures. The soaring demand for faster processing power and larger storage space has led to new and innovative technologies with a variety of sophisticated tools. In practice, multiple tools often need to be combined to meet hardware and software requirements, but controlling and monitoring such systems then becomes increasingly difficult due to their complexity and diversity.

The Challenges

Monitoring large data structures presents numerous challenges, from the collection of traces and log files, through the analysis phase, to the final resolution of an incident. It is equally hard to provide observability into individual applications and hardware units, since the volume of potentially displayed information is enormous and processing time must be kept to a minimum.

At the collection stage, the more diverse the gathered traces, the more access points must be maintained for the monitored resources, which are scattered across the Big Data ecosystem. Some applications or hardware units may completely lack monitoring capabilities and generate no log files at all, while others may be only partially supervised by the overall monitoring system.

At the analysis stage, during an incident investigation, the troubleshooter is usually confronted with massive volumes of generated log files and alerts, creating an overwhelming and chaotic situation. Furthermore, the processed log files might be unsynchronized and mutually incompatible, since they were generated at different times by different systems and applications. Lineage and historical data retention, which could contribute significantly to root cause exploration, might not be supported by all systems.

Significant manual effort and labor cost accompany the troubleshooting process, proportional to the number of faults and the complexity of the environment. As a result, the efficiency of monitoring mechanisms for large data analytics structures is rather poor, hard to audit and difficult to improve.

Traditional Approaches Are Insufficient

Typical solutions for monitoring Big Data clusters involve a combination of numerous monitoring tools, each dedicated to individual resources and targeting specific metrics. Some common tools integrated within enterprise data environments, such as Cloudera Manager for Cloudera distributions, are limited to basic monitoring functions. Other commercial tools, such as Datadog and Splunk, may offer a high degree of monitoring capability and customization, but their cost escalates in proportion to the volume of data stored and the functions served.

Alternative methods using open-source tools include Elastic Stack for log files and tracing, and Nagios for checking the status of servers, hosts and networks. Hardware structures, racks and power units are often monitored via their own specialized software, which typically lacks any graphical user interface.

There is a clear need for an innovative monitoring solution capable of handling metrics from a diverse range of resources and applications, correlating those metrics, and visualizing the results through a common tool with a user-friendly interface.

“The 4 Golden Signals”

Any proper monitoring system focuses its metrics on the "4 Golden Signals": Latency, Traffic, Errors and Saturation. "Latency" measures the time needed to send a request and receive a response, whether successful or unsuccessful, and is highly impacted by the number of distributed servers within the cluster. "Traffic" measures the total number of requests served by the system. "Errors" is the number of failed requests, while "Saturation" depicts the utilization of the system, with emphasis on its most critical resources.
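As an illustration, the four signals can be derived from a window of request records. The sketch below is a minimal Python example under simplifying assumptions: `capacity_rps` is a hypothetical nominal request capacity used as a simple proxy for saturation, and latency is summarized as a 95th percentile; none of these names come from a specific product.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float   # time from request sent to response received
    ok: bool            # whether the request succeeded

def golden_signals(window: list[Request], capacity_rps: float, window_s: float) -> dict:
    """Compute the four golden signals over a window of observed requests.

    capacity_rps is an assumed nominal capacity: saturation is expressed
    as observed traffic relative to that capacity.
    """
    traffic = len(window) / window_s                      # requests per second
    errors = sum(1 for r in window if not r.ok)           # failed requests
    # Latency counts successful AND failed requests, per the definition above.
    latencies = sorted(r.latency_ms for r in window)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    saturation = traffic / capacity_rps                   # utilization proxy
    return {"latency_p95_ms": p95, "traffic_rps": traffic,
            "errors": errors, "saturation": saturation}
```

For example, four requests observed over a two-second window, one of them failed, yield a traffic of 2 requests per second and an error count of 1.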

A Solution Derived from Intracom Telecom's Rich Experience

Intracom Telecom has more than a decade of experience in the installation, configuration and consulting of Big Data ecosystems across various industries and business domains. From this knowledge and expertise emerges "BigStreamer™ Monitoring", a solution capable of collecting, analyzing and visualizing critical information generated by Big Data components, applications, data sources and hardware units.

BigStreamer™ Monitoring is capable of ingesting metrics either directly from data sources, or indirectly through metrics already processed by other monitoring tools (see Figure 1).

Figure 1: Correlate metrics from existing monitoring tools and various Big Data sources
Screen of a dashboard
Figure 2: Display critical metrics via customized widgets on central dashboard

Direct data sources include, but are not limited to, application servers, hardware structures, web servers, database servers and specific applications. Indirect data includes ready-made metrics generated by other commonly used monitoring tools such as Cloudera Manager, Nagios, Graphite, Elastic Stack and various hardware supervision tools.

Collected metrics are stored as time series and then further analyzed to create meaningful graphs and visualizations. Access and interaction with the tool are provided through an intuitive, web-based graphical user interface.

BigStreamer™ Monitoring offers a rich and interactive User Experience that supports a great number of visualizations, charts and graphs, while being highly customizable. Standard features may be extended through plug-ins to support additional functions if and when required (see Figure 2).

BigStreamer™ Monitoring Architecture

BigStreamer™ Monitoring consists of a main "Manager" module, which provides the central starting dashboard, and optional autonomous modules serving specialized monitoring functions (see Figure 3), including:

  • Advanced Statistics: A group of detailed performance metrics on flows and storage areas in a Big Data cluster.
  • User Statistics: A unique mechanism that allows monitoring of volume changes in critical data.
  • Alerting: A holistic alerting module with configured rule-based Alerts, Thresholds and KPIs per data entity.
  • Streaming Data: A mechanism to monitor real-time data processing tools, such as Kafka and StreamSets pipelines.
  • Docker & Kubernetes: A mechanism that allows monitoring of Docker containers and Kubernetes clusters.
  • API Integrator: A module that provides easy integration and communication with any new application. Northbound APIs allow communication, data transfer and the triggering of alerts towards any 3rd-party systems. Southbound APIs allow the smooth integration of new instruments, applications or data flows.
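The rule-based alerting idea from the module list above can be sketched as follows. The `AlertRule` structure, the `evaluate` function and the entity names are illustrative assumptions for this article, not part of the product's actual API.

```python
from dataclasses import dataclass
import operator

# Supported comparison operators for threshold rules.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

@dataclass
class AlertRule:
    entity: str       # data entity the rule applies to, e.g. "hdfs_usage"
    op: str           # how the metric is compared against the threshold
    threshold: float
    severity: str     # e.g. "warning" or "critical"

def evaluate(rules: list[AlertRule], metrics: dict[str, float]) -> list[dict]:
    """Return an alert record for every rule whose condition holds
    against the latest metric value of its entity."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.entity)
        if value is not None and OPS[rule.op](value, rule.threshold):
            alerts.append({"entity": rule.entity, "value": value,
                           "threshold": rule.threshold,
                           "severity": rule.severity})
    return alerts
```

An evaluator like this would run periodically; matching alert records could then be displayed on the dashboard or pushed northbound to a 3rd-party system.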
Key Benefits and Takeaways
  • The major advantage of BigStreamer™ Monitoring is that it provides a single access point for monitoring multiple applications and services.
  • It introduces the "5-second rule": a glance at the overview dashboard is enough to check the most critical information of the monitored systems.
  • When needed, it allows drilling down into more detailed dashboards and graphs that provide enhanced information.
  • Metrics can be collected and correlated from a range of different resource types and existing monitoring frameworks such as Cloudera Manager, Elastic Stack, etc., or directly from database servers, applications and hardware units.
  • Data metrics can also be collected from multiple networks and Big Data clusters.
  • Retrieved metrics can be further analyzed, and the cluster's performance can be audited.
  • All metrics can be synchronized via common tools and interfaces, allowing easy and fast access for maintenance, troubleshooting, dimensioning and resource planning purposes.
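Correlating metrics gathered by different tools typically requires aligning them on a common time grid, since each source samples on its own schedule. The sketch below shows one simple way to do that (last-observation-carried-forward resampling); the source names are illustrative, not actual integration identifiers.

```python
def align(series: dict[str, list[tuple[float, float]]], step: float) -> list[dict]:
    """Resample several metric series onto a common time grid so they can
    be compared side by side. Each series is a sorted list of (ts, value);
    at each grid point, the last sample at or before it carries forward."""
    start = min(s[0][0] for s in series.values())
    end = max(s[-1][0] for s in series.values())
    grid, t = [], start
    while t <= end:
        row = {"ts": t}
        for name, samples in series.items():
            current = None
            for ts, value in samples:
                if ts <= t:
                    current = value  # last observation carried forward
                else:
                    break
            row[name] = current      # None if the series starts after t
        grid.append(row)
        t += step
    return grid
```

Aligned rows like these are what make cross-source graphs possible, e.g. plotting a Cloudera Manager CPU metric against a Nagios host-load metric on the same time axis.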
Diagram of BigStreamer™ Monitoring Modules
Figure 3: BigStreamer™ Monitoring Modules

To find out more about BigStreamer™ Monitoring, please visit: