
Chaos Testing Guide


Introduction

Users often make a number of false assumptions when operating and running their applications in distributed systems:

The network is reliable. There is zero latency. Bandwidth is infinite. The network is secure. Topology never changes. The network is homogeneous. Resource usage is consistent, with no spikes. All shared resources are available from all places.

These assumptions have led to a number of outages in production environments. The affected services performed poorly or were inaccessible to customers, resulting in missed Service Level Agreement uptime commitments, revenue loss, and a decline in the perceived reliability of those services.

How can we best prevent this from happening? This is where chaos testing can add value.

Test Strategies and Methodology

Failures in production are costly. To help mitigate risk to service health, consider the following strategies and approaches to service testing:

Best Practices

Now that we understand the test methodology, let us take a look at the best practices for an OpenShift cluster. On that platform there are user applications and cluster workloads that need to be designed for stability and to provide the best user experience possible:

Tooling

Now that we have looked at the best practices, this section goes through how Kraken, a chaos testing framework, can help test the resilience of OpenShift and make sure the applications and services follow those best practices.

Workflow

Let us start by understanding the workflow of kraken. The user runs kraken against a specific OpenShift cluster, identified by a kubeconfig, so that kraken can talk to the cluster and to the platform on which the cluster is hosted. It does this through either the oc/kubectl API or the cloud API. Based on its configuration, kraken injects the specified chaos scenarios as shown below. It talks to Cerberus to get the go/no-go signal representing the overall health of the cluster (optional, can be turned off). It scrapes metrics from the in-cluster Prometheus according to a metrics profile containing PromQL queries and stores them long term in the configured Elasticsearch instance (optional, can be turned off). It evaluates the PromQL expressions specified in the alerts profile (optional, can be turned off). Finally, it aggregates all of these results to set the pass/fail outcome, i.e. it exits with 0 or 1. More about metrics collection, Cerberus and metrics evaluation can be found in the next section.
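The final aggregation step can be pictured with a small shell sketch. This is not kraken's actual implementation: the function name aggregate_result and the pass/fail/skip labels are hypothetical, chosen only to illustrate how a failing check could turn the whole run into a non-zero exit while disabled (optional) checks are ignored.

```shell
# Hypothetical helper, not part of kraken: fold individual check results
# into one exit status. Each argument is "pass", "fail", or "skip"
# ("skip" stands for an optional check that was turned off).
aggregate_result() {
  for check in "$@"; do
    case "$check" in
      pass|skip) ;;       # passing or disabled checks do not change the outcome
      fail) return 1 ;;   # any failing check fails the whole run
      *) echo "unknown check result: $check" >&2; return 2 ;;
    esac
  done
  return 0
}

# Example: scenario passed, Cerberus check disabled, alert evaluation failed.
if aggregate_result pass skip fail; then
  echo "run passed"
else
  echo "run failed"
fi
```

A CI job would typically rely only on the final exit status, which is why a single 0 or 1 is a convenient contract between the chaos run and the pipeline.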

Kraken workflow

Cluster recovery checks, metrics evaluation and pass/fail criteria

Scenarios

Let us take a look at how to run the chaos scenarios on your OpenShift clusters using Kraken-hub, a lightweight wrapper around Kraken. It eases the runs by letting you launch a scenario simply by running a container image with podman, with parameters set as environment variables. This eliminates the need to carry around and edit configuration files and makes integration with any CI framework easy. Here are the scenarios supported:
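To make the "container image plus environment variables" idea concrete, here is a minimal dry-run sketch that only builds and prints a podman command line. The helper name build_chaos_cmd, the image reference, the NAMESPACE variable and the in-container kubeconfig path are all assumptions for illustration; consult the Kraken-hub documentation for the real image names and the parameters each scenario accepts.

```shell
# Hypothetical helper: assemble a podman command for a containerized chaos
# scenario and print it instead of running it (dry run).
# Usage: build_chaos_cmd IMAGE KUBECONFIG_PATH [KEY=VALUE ...]
build_chaos_cmd() {
  image="$1"
  kubeconfig="$2"
  shift 2
  # Mount the kubeconfig into the container; the target path is an assumption.
  cmd="podman run --rm --net=host -v ${kubeconfig}:/root/.kube/config:Z"
  # Every remaining argument becomes an environment variable for the scenario.
  for kv in "$@"; do
    cmd="$cmd -e $kv"
  done
  # Note: values containing spaces would need extra quoting before execution.
  printf '%s\n' "$cmd $image"
}

# Placeholder image and parameter, shown only to illustrate the shape of a run.
build_chaos_cmd "example.registry/kraken-hub:pod-scenarios" "$HOME/.kube/config" NAMESPACE=my-app
```

Printing the command first makes it easy to review in a CI log before actually executing it.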

Test Environment Recommendations - how and where to run chaos tests

Let us take a look at a few recommendations on how and where to run the chaos tests:

Chaos testing in Practice within the OpenShift Organization

Within the OpenShift organization we use kraken to perform chaos testing throughout a release before the code is available to customers.

1. We execute kraken during our regression test suite.

    i. We cover each of the chaos scenarios across different clouds.

        a. Our testing is predominantly done on AWS, Azure and GCP.

2. We run the chaos scenarios during a long-running reliability test.

    i. During this test we perform different types of tasks by different users on the cluster.

    ii. We run kraken at certain points throughout the long-running test and monitor the health of the cluster.

    iii. This test can be seen here: https://github.com/openshift/svt/tree/master/reliability-v2

3. We are starting to add test cases that perform chaos testing during an upgrade (not many iterations of this have been completed yet).

Using kraken as part of a tekton pipeline

The kraken-scenario tekton-task, available on artifacthub.io, can be used to start a kraken chaos scenario as part of a chaos pipeline.

To use this task, you must have:

You can create these resources using the following sequence:

oc project default
oc adm policy add-scc-to-user privileged -z pipeline
oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/common.yaml

Then you must change the content of the kraken-aws-creds secret and of the kraken-kubeconfig and kraken-config-example configMaps to reflect your cluster configuration. Refer to the kraken configuration and configuration examples for details on how to configure these resources.

Start as a single taskrun

oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/taskrun.yaml

Start as a pipelinerun

oc apply -f https://github.com/startxfr/tekton-catalog/raw/stable/task/kraken-scenario/0.1/samples/pipelinerun.yaml