Site Reliability Engineering & Observability

Design and Operation of Resilient Systems

Increasingly complex IT system architectures require a deep understanding of how services can be operated reliably, problems detected early and failures prevented. In the fast-moving, agile world, however, it is almost impossible to operate error-free systems.

The approach is therefore to make systems fault-tolerant, that is, resilient, instead. Site reliability engineering is the discipline that puts basic resilience concepts into practice using appropriate tools.

An efficient monitoring solution is at least as important as the design of the system itself.

Course Contents

  • Microservices and Containerization
  • Resilience
  • Chaos Engineering
  • Monitoring, Alerting and Logging
  • Litmus
  • Prometheus
  • Victoria
  • Kubernetes
  • Docker
  • Netflix as a Reference
  • Postmortem
  • Incident

The detailed digital documentation package, consisting of an e-book and PDF, is included in the price of the course.

Premium Course Documents

In addition to the digital documentation package, the exclusive Premium Print Package is also available to you.

  • High-quality color prints of the ExperTeach documentation
  • Exclusive folder in an elegant design
  • Document pouch in backpack shape
  • Elegant LAMY ballpoint pen
  • Practical notepad
The Premium Print Package can be added during the ordering process for € 175,- plus VAT (only for classroom participation).

Target Group

This course is aimed at people who are involved in the development and operation of agile solutions or who plan to move to an agile way of working.

Knowledge Prerequisites

Basic programming and command line skills are desirable.

Course Objective

In this course you will learn how to design fault-tolerant systems and how to detect and analyze faults in these systems. This knowledge is reinforced in hands-on lab exercises using common tools.

Complementary and Follow-up Courses

For details on the IaC tools presented, please see our courses on Ansible and Terraform:

Ansible – Automation of Applications and Infrastructure 
Ansible Advanced – Orchestration in Detail
Terraform & OpenTofu – Automated Provisioning of Infrastructure

If you would like to learn more about containerized applications, the Docker Fundamentals course is a good place to start.

You can get a comprehensive overview of orchestration and automated management of containerized applications in the Kubernetes and Kubernetes Advanced courses.

The basics of automation from source code to deployment—known as CI/CD—are taught in our GitLab Advanced course.

You can gain a deeper insight into message queue systems in our course on Kafka and RabbitMQ.

For a comprehensive overview of logging and monitoring, we recommend the courses Elasticsearch – Overview and Use, Elastic Stack – Implementation and Operation, and Modern Monitoring Solutions.

Classroom training

Do you prefer the classic training method: a course in one of our Training Centers, with a competent trainer and direct exchange among all course participants? Then book one of our classroom training dates!

Online training

Would you like to attend a course online? We offer online dates for this course topic. To attend these seminars, you need a PC with Internet access (minimum data rate 1 Mbps), a headset if you are working via VoIP, and optionally a camera. For further information and technical recommendations, please refer to.

Tailor-made courses

Do you need a special course for your team? In addition to our standard offering, we will help you create customized courses that precisely meet your individual requirements. We will be glad to advise you and prepare an individual offer for you.
Request in-house training now
You can find the complete description of this course, with dates and prices, ready for download as a PDF.
