SRE: Site Reliability Engineering

Software Engineering for IT Operations

Site Reliability Engineering (SRE) is a DevOps approach developed by Google. It regards IT operations as a software task to be solved with software engineering. This results in optimized processes and systems that take the risk of errors into account and know how to handle them. Continuous delivery is key: a regular rollout of many small releases reduces the risk of every single development step. In addition, SRE tasks are designed to include time for improvements and for automating recurring tasks.

Site Reliability Engineering is firmly established in ConSol's business processes: experienced software engineers work hand in hand with the IT Operations team. At the same time, our cloud and monitoring experts proactively contribute their specialist knowledge. After all, the main goal of SRE is also ours: the more business topics we can cover with software tools and automation, the lower the workload in continuous development and operation.

Site Reliability Engineering in a Few Points

What Does a Site Reliability Engineering Team Do?

An SRE team consists of software engineers and takes care of running services in production.

Why Software Engineers Instead of Sysadmins?

In classical operation, the workload increases linearly with the number or size of the services. Especially in modern microservice architectures, this approach is no longer practicable. Site Reliability Engineering therefore solves operational tasks with software, not manually. The more software solutions, the less workload.

How Is an SRE Team Organized?

There are several ways to organize these teams. Google relies on three pillars: The amount of time Site Reliability Engineers spend on manual tasks is limited – giving them the capacity to develop SRE tools. On-call services are professionally organized so that there is sufficient time for a thorough post-mortem analysis in case an error occurs. When it comes to risk assessment for the go-live of new features, error budgets ensure that the service developers and the SRE team pull together.
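The error budget mentioned above follows from simple arithmetic: the budget is the share of a time window in which the service is allowed to miss its availability target. The following is a minimal sketch of that calculation; the class name ErrorBudget, the 99.9 % target, and the 30-day window are assumptions chosen for illustration and are not taken from a specific project.

```java
import java.time.Duration;

/** Error-budget arithmetic for an availability SLO (illustrative sketch). */
final class ErrorBudget {
    private final double sloTarget; // assumed example: 0.999 for "99.9 % available"
    private final Duration window;  // assumed example: a 30-day window

    ErrorBudget(double sloTarget, Duration window) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("SLO target must be between 0 and 1");
        }
        this.sloTarget = sloTarget;
        this.window = window;
    }

    /** Total downtime the SLO tolerates within the window. */
    Duration total() {
        long millis = Math.round(window.toMillis() * (1.0 - sloTarget));
        return Duration.ofMillis(millis);
    }

    /** Budget left after the downtime already consumed; never negative. */
    Duration remaining(Duration consumed) {
        Duration left = total().minus(consumed);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** A risky go-live is only advisable while some budget is left. */
    boolean allowsRelease(Duration consumed) {
        return !remaining(consumed).isZero();
    }

    public static void main(String[] args) {
        ErrorBudget budget = new ErrorBudget(0.999, Duration.ofDays(30));
        System.out.println("Total budget: " + budget.total());
        System.out.println("Release allowed after 10 min downtime: "
                + budget.allowsRelease(Duration.ofMinutes(10)));
    }
}
```

With these assumed numbers, the budget amounts to 43.2 minutes of downtime per 30 days. As long as part of it is left, developers and the SRE team can agree to roll out new features; once it is used up, stability work takes priority.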

What Are the Team's Responsibilities?

SRE teams are responsible for service availability, latency, performance, efficiency, deployment, monitoring, emergency response and capacity planning.

More than 200 customers trust ConSol for their IT & software

Flexibility, automation, and the collaborative efforts of development, operations, cloud, and monitoring experts are crucial in our digital world. This approach enables new developments to reach market maturity quickly and with minimal risk, while reducing the effort required for their ongoing development and operation. That's why we at ConSol embrace Site Reliability Engineering.

Oliver Weise
Head of Platform Engineering

Our SRE Know-How

SRE Practical Tips

Endpoints for liveness and readiness probes are often implemented very simply: they respond with 200 OK as soon as the application has started. In a number of projects, we have found that this may not be enough. We have therefore started to test the accessibility of all adjacent systems and message queues as part of the health checks. This enables us to detect problems during deployment and, ideally, resolve them automatically.
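A health check that also tests adjacent systems could be structured roughly as follows. This is a minimal sketch using only the Java standard library; the class name ReadinessCheck, the plain TCP connect as a probe, and the choice of 503 for a failed check are assumptions for illustration. A real project would typically use the health check facilities of its framework and protocol-level checks instead of a bare TCP connect.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Readiness check that also probes adjacent systems (illustrative sketch). */
final class ReadinessCheck {
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    /** Registers a named probe, e.g. for a database or a message queue. */
    ReadinessCheck add(String name, BooleanSupplier probe) {
        probes.put(name, probe);
        return this;
    }

    /** Runs every probe; a probe that throws counts as failed. */
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        probes.forEach((name, probe) -> {
            boolean ok;
            try {
                ok = probe.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            results.put(name, ok);
        });
        return results;
    }

    /** Status for the probe endpoint: 200 only if all dependencies respond. */
    int httpStatus() {
        return run().containsValue(false) ? 503 : 200;
    }

    /** Simple TCP reachability probe with a connect timeout (an assumption, not a protocol check). */
    static BooleanSupplier tcp(String host, int port, int timeoutMillis) {
        return () -> {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMillis);
                return true;
            } catch (IOException e) {
                return false;
            }
        };
    }
}
```

A probe would be registered with a call such as add("database", ReadinessCheck.tcp("db.example.internal", 5432, 500)), where host name and port are hypothetical. Keeping the per-dependency results in run() makes it possible to show in the response which adjacent system is unreachable.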

During one project, a problem with EJB Timer Services caused the transaction to roll back after each run. As long as one of the following runs succeeds, this is unproblematic in itself. To find out whether we were dealing with expected rollbacks or with real bugs, we implemented a metric that measures the time passed since the last successful run. This allowed us to distinguish between temporary and permanent errors.
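The idea behind such a metric can be sketched in a few lines. The class name LastSuccessGauge and the threshold-based isStale check are assumptions for illustration; the original project's implementation and monitoring integration are not described here. The clock is injected so that the behavior can be tested without waiting.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Tracks the time passed since the last successful run of a job (illustrative sketch). */
final class LastSuccessGauge {
    private final Clock clock;
    private final AtomicReference<Instant> lastSuccess;

    LastSuccessGauge(Clock clock) {
        this.clock = clock;
        // Start counting at creation so that a job that never succeeds is noticed as well.
        this.lastSuccess = new AtomicReference<>(clock.instant());
    }

    /** Call at the end of every run that committed successfully. */
    void markSuccess() {
        lastSuccess.set(clock.instant());
    }

    /** Value to export as a gauge to the monitoring system. */
    Duration sinceLastSuccess() {
        return Duration.between(lastSuccess.get(), clock.instant());
    }

    /** Occasional rollbacks stay below the threshold; a permanent error exceeds it. */
    boolean isStale(Duration threshold) {
        return sinceLastSuccess().compareTo(threshold) > 0;
    }

    public static void main(String[] args) {
        LastSuccessGauge gauge = new LastSuccessGauge(Clock.systemUTC());
        gauge.markSuccess();
        System.out.println("Stale after 5 minutes? " + gauge.isStale(Duration.ofMinutes(5)));
    }
}
```

An alert on this value only fires when no run has succeeded for longer than the chosen threshold, which is what separates a temporary rollback from a permanent error.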

With Java applications, it is worth taking thread dumps regularly. Thread dumps help with post-mortem analysis and profiling. Comparing successive thread dumps quickly reveals, for example, when an external system is blocked and new threads with blocked calls are constantly being started. In particular, we recommend taking two or three thread dumps in the stop script so that the application's state can still be analyzed after a restart.
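Thread dumps can also be taken from inside the application, for example on a schedule. The following minimal sketch uses the standard ThreadMXBean API; the class name ThreadDumps and the output format are assumptions for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Captures a text dump of all live threads (illustrative sketch). */
final class ThreadDumps {
    static String capture() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Lock details are only requested where the JVM supports them.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();
        StringBuilder dump = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            dump.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            // Print every frame; ThreadInfo.toString() would truncate the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("\tat ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return dump.toString();
    }

    public static void main(String[] args) {
        System.out.println(capture());
    }
}
```

In a stop script, the JDK tools jstack or jcmd (Thread.print) are the more common choice, since they work from outside the process; taking two or three dumps a few seconds apart shows whether threads are making progress or are stuck.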

With Mapped Diagnostic Context (MDC), logging frameworks make it possible to log information such as user names by default, so that log lines belonging together can be traced during log analysis. However, MDC data is not always available, for example when the user has not yet been determined. It is therefore worthwhile to also include the thread in the log format. The thread provides a reliable and easy way to track which log lines belong to the same request.
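In Logback, the thread is included by adding %thread to the layout pattern, next to %X{...} for MDC values. To keep the example free of external dependencies, the following sketch shows the same idea with java.util.logging instead of an MDC-capable framework; the class name ThreadAwareFormatter and the line layout are assumptions for illustration, and Java 9 or later is assumed for getInstant().

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

/** Log format that always includes the thread name (illustrative sketch). */
final class ThreadAwareFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        // Assumption: handlers are synchronous, so the formatter runs on the
        // thread that issued the log call and its name identifies the request.
        String thread = Thread.currentThread().getName();
        return String.format("%s [%s] %s %s - %s%n",
                record.getInstant(),
                thread,
                record.getLevel(),
                record.getLoggerName(),
                formatMessage(record));
    }
}
```

The formatter is activated with handler.setFormatter(new ThreadAwareFormatter()). During log analysis, filtering for the bracketed thread name then yields the lines of one request, even for lines written before any MDC value was set.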

During a project in the telecommunications industry, we faced the challenge of many microservices calling the same endpoint, while the total number of calls was not allowed to exceed a certain threshold per second. We solved this by using ZooKeeper to coordinate the calls. The advantage: we were able to avoid a central coordination system as a single point of failure.

Manual steps during build and deployment are a common source of errors. It is worth automating everything. Modern CI/CD pipelines not only reduce the risk of errors but also relieve SREs of tedious recurring tasks.

The biggest challenge in load testing is generating realistic test data. This concerns more than the content of the data: in a large migration project, we found that the way data is fragmented in a database can have a significant impact on performance. Discovering this as early as the load test phase was decisive for the project's success.

The more measuring points an application has, the better. This helps not only in operation. Load testing, for example, is much more valuable when it shows not only whether a service adheres to its SLOs, but also where potential bottlenecks are.

Software engineers like to use modern design patterns such as circuit breakers. To avoid overload, however, you should not lose sight of classic configuration options. Pool sizes in Java application servers, for example, should be dimensioned so that an unexpected peak load is stopped as early as possible and downstream components are not overloaded.
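Application servers configure their pools through their own settings, which differ from product to product. The underlying principle can be sketched with a plain ThreadPoolExecutor: a fixed number of workers, a small bounded queue, and immediate rejection of everything beyond that. The class name BoundedPool and the sizes used below are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Bounded worker pool that rejects excess load early (illustrative sketch). */
final class BoundedPool {
    /** Fixed number of workers plus a bounded queue; excess work is rejected at once. */
    static ThreadPoolExecutor create(int workers, int queueCapacity) {
        return new ThreadPoolExecutor(
                workers, workers,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Returns false instead of queueing when the pool is saturated. */
    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            // The caller can answer right away, e.g. with "503 Service Unavailable".
            return false;
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(4, 8); // assumed example sizes
        System.out.println("Accepted: " + trySubmit(pool, () -> { }));
        pool.shutdown();
    }
}
```

With an unbounded queue, a peak load would pile up inside the service and be passed on to databases and backends with delay. The bounded variant fails fast at the entry point, so downstream components only ever see as many concurrent calls as there are workers.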

Site Reliability Engineers must familiarize themselves with the normal behavior of their services and regularly check their logs. Otherwise, in case of an error, a lot of time is lost in investigating oddities that have nothing to do with the acute fault.

Site Reliability Engineering: Technologies & Competencies

Thruk
OMD
Grafana
Prometheus
coshsh
Jolokia
Mod-Gearman

Any More Questions About SRE for Optimized Processes and Systems?

Let's talk!

Marc Mühlhoff

# IT Ops
# Observability
# Cloud Services
+49-211-339903-74