Observability - Keeping Business-Critical Applications Healthy

Monitoring, Logging & Tracing

Your applications generate vast amounts of metrics and analyses. Managing this data deluge is crucial to ensuring the health and performance of your business-critical systems. At ConSol, we emphasize observability. We take a comprehensive approach to monitoring your IT applications, shielding you from unexpected outages that can cost you time, money and, in the worst case, your reputation. Through Open Source Observability, we assist you with the design, tool selection, implementation, and integration of tracing, log aggregation, error management, and other observability practices.

All applications and the underlying infrastructure produce metrics, logs and, where useful, traces. Proven open-source tools such as Prometheus (metrics), Loki (logs) and Jaeger (traces) collect and process this data, which is then visualized centrally in Grafana dashboards. There, users get an overview of the applications and infrastructure components relevant to them. For long-term storage, additional databases such as InfluxDB can be used.

Observability – Chasing "Mister X"

Observability essentially consists of three components: monitoring, logging, and tracing. Monitoring reports when a defined service level or quality criterion is no longer met. To make this possible, application developers define suitable metrics, which the application itself exposes. The logs contain the error messages of each individual software component and point to the place in the various services where an error occurs. Tracing lets us reconstruct the path a call took between services before it resulted in a problem. Using correlation IDs, we can view all of this information together in a central dashboard. This way we keep an overview even of complex applications and quickly track down the source of an error.
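To illustrate how a correlation ID ties these signals together, here is a minimal sketch in Python. The event structure, field names and sample values are assumptions made for this example; in practice the correlation happens inside the dashboard tooling, not in hand-written code.

```python
from collections import defaultdict

# Hypothetical sample data; the field names are illustrative assumptions.
log_events = [
    {"correlation_id": "req-1", "service": "checkout", "message": "payment failed"},
    {"correlation_id": "req-2", "service": "catalog", "message": "ok"},
]
trace_spans = [
    {"correlation_id": "req-1", "service": "gateway", "duration_ms": 120},
    {"correlation_id": "req-1", "service": "checkout", "duration_ms": 95},
    {"correlation_id": "req-2", "service": "catalog", "duration_ms": 12},
]


def correlate(logs, spans):
    """Group log events and trace spans that share a correlation ID."""
    by_id = defaultdict(lambda: {"logs": [], "spans": []})
    for event in logs:
        by_id[event["correlation_id"]]["logs"].append(event)
    for span in spans:
        by_id[span["correlation_id"]]["spans"].append(span)
    return dict(by_id)


view = correlate(log_events, trace_spans)
# Services the failing request passed through, in recorded order.
services_on_path = [span["service"] for span in view["req-1"]["spans"]]
```

Looking up `view["req-1"]` then yields both the error message and the services the request passed through.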

More than 200 customers trust ConSol for their IT & Software


Observability Tools

For observability, we mostly favor open-source solutions. In our experience they are at no disadvantage compared with commercial products, and they offer a truly remarkable range of functions. We have used them for many years, both with our clients and in our own production environments.

Prometheus is the de facto standard for cloud-native monitoring and alerting. Configuring where and how metrics are collected is simple. Most applications can expose metrics for Prometheus, and there is also great support in all common programming languages for doing the same in self-written applications.

Loki makes it simple to ingest and index logs. Its configuration is modeled on that of Prometheus and is geared towards quickly finding logs that match certain criteria. For this purpose, only a very small index is written. Because queries are heavily parallelized, they execute quickly even on large amounts of data.
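The idea of a small index combined with parallel scanning can be sketched roughly as follows. This is a conceptual toy model in Python, not Loki's actual implementation or API: the in-memory `index`, the `chunks` list and the `query` function are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory store: only the label sets are indexed and point to
# chunk IDs, while the log lines themselves are stored unindexed in chunks.
index = {
    frozenset({("app", "shop"), ("level", "error")}): [0, 1],
    frozenset({("app", "shop"), ("level", "info")}): [2],
}
chunks = [
    ["payment timeout", "stock check failed"],
    ["payment declined"],
    ["order placed"],
]


def query(selector, needle):
    """Select chunks via the label index, then scan their lines in parallel."""
    wanted = set(selector.items())
    chunk_ids = sorted(
        chunk_id
        for labels, ids in index.items()
        if wanted <= labels
        for chunk_id in ids
    )

    def scan(chunk_id):
        return [line for line in chunks[chunk_id] if needle in line]

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(scan, chunk_ids))
    return [line for part in results for line in part]


hits = query({"app": "shop", "level": "error"}, "payment")
```

The index stays small because it grows with the number of label combinations rather than with the number of log lines.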

Grafana is used to visualize metrics. It integrates very well with Prometheus, Loki and Jaeger and can display both metrics and traces in its graphs. It is also possible to jump directly to individual traces and to display the logs that belong to certain metrics. Besides a large selection of predefined dashboards for various metrics, users can also create their own dashboards.

Jaeger supports the OpenTracing standard, which makes it easy to integrate applications with Jaeger. As with Prometheus, there is broad support across programming languages and frameworks for self-written applications. Besides its widespread use, Jaeger's advantages include simple installation and the ability to scale to larger amounts of data.

Elasticsearch (and its open-source fork, OpenSearch) is utilized in a multitude of our log aggregation and tracing systems. Its diverse capabilities in terms of operation, data flow control, high availability, and data security make it the right solution in many environments. We provide support for system planning, setup, configuration, and integration.

The VictoriaMetrics framework is an efficient alternative to Prometheus for more complex infrastructures. It is available both as a free solution and as an enterprise version with advanced features, such as anomaly detection powered by AI technology.

Observability: Important Terms & Notes

Logging records notable events as well as problematic or faulty situations, so that an error can be understood after the fact. How informative these records are is up to the developers. For most programming languages there are logging frameworks that provide standardized log formats. This becomes important when logs are to be collected centrally or searched by certain criteria. Since local log files are lost when a container restarts, centralized log collection is mandatory, especially for short-lived containers.
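As a minimal sketch of such a standardized format, the following example uses Python's built-in `logging` module to emit one JSON object per log line, which a central collector can parse easily. The logger name `orders`, the chosen fields and the `JsonFormatter` class are assumptions for illustration only.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# StringIO stands in for stdout, which a log collector would typically read.
stream = io.StringIO()
log = build_logger(stream)
log.error("could not reserve stock for order %s", 42)
parsed = json.loads(stream.getvalue())
```

Writing to standard output rather than to a local file fits the container case described above, since nothing is lost when the container is replaced.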

In today's distributed systems, and especially in microservice architectures, simple logging is no longer sufficient. A process has to be traceable across various services or methods, since it is often the interaction between microservices that causes problems or performance bottlenecks. This requires that, in addition to the end-user calls, information on service-to-service calls is recorded and stored in special tracing log events. Moreover, these tracing logs also have to be stored centrally for all services involved so that the call hierarchy can be displayed. External libraries or services that are used have to meet these additional requirements as well.

With OpenTracing, frameworks are available for many programming languages that make it easy to follow an end-user call across various services by means of so-called spans or correlation IDs. Many open-source libraries already support this standard.
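The span concept can be sketched in a few lines of Python. This is a simplified model and not the OpenTracing API: the `Span` class and the helper functions `start_trace`, `child_of` and `call_path` are hypothetical names chosen for this example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def start_trace(name):
    """Root span for an incoming end-user call; creates a new trace ID."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


def child_of(parent, name):
    """Span for a downstream call; inherits the trace ID of its parent."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


def call_path(span, spans):
    """Walk the parent links to reconstruct the call hierarchy."""
    by_id = {s.span_id: s for s in spans}
    path = [span.name]
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        path.append(span.name)
    return list(reversed(path))


root = start_trace("GET /checkout")
payment = child_of(root, "payment-service.charge")
db = child_of(payment, "db.insert")
path = call_path(db, [root, payment, db])
```

Because every span carries the same trace ID, a central store can collect spans from all services and rebuild the path of a single end-user call.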

Metrics are numerical representations of states (e.g. the number of open connections) or throughput (e.g. the volume written to a hard drive since a specific point in time, or the number of calls to a specific function). In this they differ from logs and tracing data, which relate to individual events.

Metrics can be queried from standard applications (e.g. NGINX, databases or objects in Kubernetes) via so-called exporters or metrics endpoints. Customer-specific applications should be instrumented so that SLAs can be measured and further usage information (e.g. number and response time of critical calls) is available for detailed performance analysis.
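Instrumenting a critical call can be as simple as counting invocations and summing up their duration. The sketch below uses only the Python standard library; the decorator `instrumented`, the dictionaries and the function `place_order` are assumptions for illustration, and a real application would typically use a metrics client library instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metric stores, keyed by function name.
call_count = defaultdict(int)
call_seconds = defaultdict(float)


def instrumented(func):
    """Count calls to func and accumulate the time spent in them."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_count[func.__name__] += 1
            call_seconds[func.__name__] += time.perf_counter() - start
    return wrapper


@instrumented
def place_order(item):  # hypothetical business-critical call
    return f"ordered {item}"


place_order("book")
place_order("pen")
average_seconds = call_seconds["place_order"] / call_count["place_order"]
```

Exposing the call count and the total duration separately lets the monitoring system compute average response times over any time window.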

The metrics format in wide use today was introduced by Prometheus and standardized as OpenMetrics. A metric data point is composed as follows:

Metric name: describes what is represented, e.g. server_open_connection_count
Labels: distinguish the individual measured instances, e.g. instance=127.0.0.1:8080
Timestamp: the time at which the value was valid
Value: the numeric value
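Put together, these four parts form one line in the Prometheus text exposition format, where the value comes before the optional timestamp (given there in milliseconds since the epoch). The helper `render_sample` below is a hypothetical function written for this illustration; it deliberately skips the escaping of special characters in label values.

```python
def render_sample(name, labels, value, timestamp_ms=None):
    """Render one sample as a line of Prometheus text exposition format.

    Simplification: label values are not escaped, so they must not contain
    backslashes, double quotes or newlines.
    """
    label_text = ",".join(
        f'{key}="{val}"' for key, val in sorted(labels.items())
    )
    line = f"{name}{{{label_text}}} {value}" if label_text else f"{name} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


line = render_sample(
    "server_open_connection_count",
    {"instance": "127.0.0.1:8080"},
    42,
    1700000000000,  # arbitrary example timestamp
)
```

In practice the timestamp is usually omitted, because the monitoring system assigns the scrape time itself.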

With this format, performance and, where appropriate, the number of errors or specific states can be represented compactly and visualized in Grafana.

Based on these numeric values, rules can be defined that raise an alert when the system exceeds limit values, e.g. if more than 90 % of the available connections have been in use for more than 10 minutes, or if on average more than 2 % of the queries resulted in errors over a 5-minute period. A metrics monitoring tool such as Prometheus stores the metrics, checks these conditions, and can then notify those responsible.
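The two example rules can be sketched in plain Python as follows. This only illustrates the logic and is not how Prometheus evaluates alerting rules internally; the function names, the one-sample-per-minute series and all sample values are assumptions.

```python
def breached_for(samples, threshold, duration_s):
    """Check a 'value above threshold for more than duration_s' condition.

    samples is a time-ordered list of (timestamp_s, value) pairs. The
    condition holds if the value has been above the threshold continuously
    up to the latest sample for longer than duration_s.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # the condition was interrupted
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > duration_s


def error_ratio(errors, requests):
    """Share of failed queries across all intervals of a time window."""
    total = sum(requests)
    return sum(errors) / total if total else 0.0


# One sample per minute: connection usage stays at 95 % for 11 minutes.
steady = [(minute * 60, 0.95) for minute in range(12)]
# Same series, but usage briefly drops to 50 % at minute 2.
interrupted = [(t, 0.5 if t == 120 else v) for t, v in steady]

connections_alert = breached_for(steady, threshold=0.90, duration_s=600)
no_alert = breached_for(interrupted, threshold=0.90, duration_s=600)
# Five one-minute intervals with 100 queries each.
errors_alert = error_ratio([1, 0, 5, 2, 3], [100] * 5) > 0.02
```

Requiring the condition to hold for a minimum duration avoids notifications for short spikes that resolve on their own.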

Monitoring refers to the supervision of applications and infrastructure. In the event of faulty states or performance bottlenecks, the responsible operations teams are notified, ideally before the application’s users become aware of major problems.

State-of-the-art monitoring systems like Prometheus are metrics-based. In other words, they detect problematic states from the metrics and then trigger alerts. They also store the metrics over an extended period of time, so that visualization tools like Grafana can later use them to analyze problematic situations.

A relatively new discipline within observability, error management focuses on the structured tracking of error messages occurring in the monitored software. Detected errors, such as those found in logs, are recorded and correlated with similar errors (so that a recurring error appearing 10,000 times generates only a single entry). SREs and software engineers are automatically alerted to new errors, enabling them to investigate and resolve issues far more efficiently than by manually sifting through massive application logs.
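One simple way to correlate similar errors is to reduce each message to a fingerprint in which the variable parts are masked. The following Python sketch shows the idea; the masking rules, the sample messages and the function names are assumptions, and real error management tools use more elaborate grouping (for example based on stack traces).

```python
import re
from collections import Counter


def fingerprint(message):
    """Mask variable parts so that similar errors share one fingerprint."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    message = re.sub(r"\d+", "<n>", message)
    return message


known_fingerprints = set()


def is_new_error(message):
    """Return True only the first time a kind of error is seen."""
    fp = fingerprint(message)
    if fp in known_fingerprints:
        return False
    known_fingerprints.add(fp)
    return True


# Hypothetical error messages as they might appear in application logs.
messages = [
    "Timeout after 3000 ms calling order 17",
    "Timeout after 3000 ms calling order 23",
    "Null reference at 0x7ffde2",
    "Timeout after 5000 ms calling order 99",
]
groups = Counter(fingerprint(m) for m in messages)
alerts = [m for m in messages if is_new_error(m)]
```

Four log lines collapse into two entries here, and only the first occurrence of each kind triggers an alert.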

Amid the myriad individual values in application metrics, crucial indicators of potential issues can be hidden—but how do you find that needle in the haystack? This is where anomaly detection comes into play, a fascinating use case for Artificial Intelligence (AI). In essence, an anomaly detection system automatically learns what observability data looks like when a system is functioning "normally."

Based on this understanding, the system can then automatically detect deviations from this normal operation (e.g., dramatically increased traffic, filling buffers, etc.) and issue warnings if necessary, even though it was never explicitly programmed to monitor specific metrics. This approach is a highly valuable complement to manually configured alerts, helping to identify "blind spots" in alert configurations and detect unforeseen problems in general.
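The principle of learning a "normal" baseline and flagging deviations can be illustrated with a simple statistical stand-in. The sketch below uses a z-score over a fixed baseline window; it is not an AI model and not the method of any particular product, and the function name, threshold and traffic values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size, z_threshold=3.0):
    """Flag values that deviate strongly from a learned baseline.

    The first baseline_size values define what 'normal' looks like; every
    later value more than z_threshold standard deviations away from the
    baseline mean is reported by its index.
    """
    baseline = values[:baseline_size]
    mu, sigma = mean(baseline), stdev(baseline)
    anomalies = []
    for index in range(baseline_size, len(values)):
        # A perfectly flat baseline gives sigma == 0; skip scoring then.
        z_score = abs(values[index] - mu) / sigma if sigma else 0.0
        if z_score > z_threshold:
            anomalies.append(index)
    return anomalies


# Hypothetical requests-per-second series with one sudden traffic spike.
traffic = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]
anomalous = find_anomalies(traffic, baseline_size=8)
```

No threshold for this specific metric was configured by hand; the limit follows from the observed baseline, which is what makes the approach a useful complement to manually defined alerts.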

Complex requirements - ConSol as a strong partner

Questions about Observability for Business-Critical Applications?

Let's talk!

Marc Mühlhoff

# IT Ops

# Observability

# Cloud Services

+49-211-339903-74
