How to deploy a highly available PostgreSQL cluster environment in Kubernetes

This article describes how to use Stolon to deploy highly available PostgreSQL on Kubernetes. It starts with Stolon's architecture and then works step by step from installation to a final failover test, offering one approach to running PostgreSQL on Kubernetes.

Creating a highly available PostgreSQL cluster has always been tricky, and doing it in a cloud environment is especially difficult. I have found at least three projects that provide highly available PostgreSQL solutions on Kubernetes.


Patroni is a template for building your own customized high-availability solution using Python, with a distributed configuration store such as ZooKeeper, etcd, or Consul for maximum accessibility. It is aimed at DBAs, DevOps engineers, and SREs who want to quickly deploy highly available PostgreSQL in the data center or anywhere else.


The Crunchy Container Suite provides Docker containers for quickly deploying PostgreSQL, along with management and monitoring tools. It supports several styles of PostgreSQL cluster deployment.


Stolon is a cloud native manager for PostgreSQL high availability. It is cloud native because it lets you keep a highly available PostgreSQL inside your containers (Kubernetes integration), and it also works on every other kind of infrastructure (cloud IaaS, old-style infrastructure, and so on).

Nice diagrams, plus mentions from some users, persuaded me to try the Crunchy containers first. But after a while, I changed my mind.

I do not want to say anything bad about its design or anything else. But it felt as if I were installing PostgreSQL manually inside a container, and it did not feel cloud native.

So I gave stolon a try. After several rounds of installing and uninstalling, I ran its statefulset example and created a helm chart for it.

If you want to know more about stolon, you can refer to the author's introduction.

Below I will walk through the installation and demonstrate failover in the cluster. We assume the installation is done with the helm chart.

Stolon Architecture Chart

Excerpt from Stolon's presentation.

Stolon is made up of three parts:

  • Keeper: it manages a PostgreSQL instance, converging it to the clusterview provided by the sentinel(s).
  • Sentinel: it discovers and monitors the keepers and calculates the optimal clusterview.
  • Proxy: the client's access point. It enforces connections to the right PostgreSQL master and forcibly closes connections to unelected masters.

Stolon uses etcd or Consul as the primary cluster state store.


  $ git clone <stolon-chart repository URL>
  $ cd stolon-chart
  $ helm install ./stolon

You can also install directly from my repository:

  helm repo add lwolf-charts <repository URL>
  helm install lwolf-charts/stolon

The installation process does the following:

First, it creates 3 etcd nodes using a statefulset. Stolon-proxy and stolon-sentinel are deployed as well. A one-time job pauses the installation of the cluster until the etcd nodes become available.
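
The wait-for-etcd step can be pictured as a plain retry loop. The following is a minimal sketch, not the chart's actual job: `wait_until` is a hypothetical helper, and the commented health check assumes the etcd v2 `etcdctl cluster-health` command is available in the job's container.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <sleep_seconds> <command...>
wait_until() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  # Re-run the command until it exits with status 0.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (assumption: etcdctl v2 is on PATH and pointed at the etcd service):
# wait_until 60 5 etcdctl cluster-health
```

The function returns 0 as soon as the command succeeds, so later installation steps can be chained after it with `&&`.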

The chart also creates two services:

  • stolon-proxy: this service comes from the official example. It always points to the current master, which accepts writes.
  • stolon-keeper: Stolon itself does not provide any load balancing for read operations, but a Kubernetes service can. So for the user, reads through stolon-keeper are load-balanced at the pod level.

When all components are in the RUNNING state, we can try to connect to them.

A simple way to connect is to expose the services with NodePort. Use two terminals, one connected to the master service and one to the slave service. For the rest of this post, we assume that the stolon-proxy service (RW) is exposed on port 30543 and the stolon-keeper service (RO) on port 30544.
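
To check which ports the two services actually received, you can read the nodePort field from each service. This is a minimal sketch, assuming the services are named stolon-proxy and stolon-keeper (the chart may prefix them with the release name) and that kubectl points at the right cluster and namespace. `node_port` is a hypothetical helper.

```shell
# Hypothetical helper: print the NodePort of the first port of a service.
# Usage: node_port <service-name>
node_port() {
  kubectl get service "$1" -o jsonpath='{.spec.ports[0].nodePort}'
}

# Example (service names are assumptions; adjust to your release):
# node_port stolon-proxy
# node_port stolon-keeper
```

If a service is not of type NodePort, the field is empty and nothing is printed.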

Connect to the master and create a test table

  psql --host <IP> --port 30543 postgres -U stolon -W
  postgres=# create table test (id int primary key not null,
  postgres(# value text not null);
  postgres=# insert into test values (1, 'value1');
  postgres=# select * from test;
   id | value
  ----+--------
    1 | value1
  (1 row)

Connect to the slave and check the data. You can also try to write something to confirm that the request is being handled by the slave.

  psql --host <IP> --port 30544 postgres -U stolon -W
  postgres=# select * from test;
   id | value
  ----+--------
    1 | value1
  (1 row)
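
Another way to confirm which role served a connection, without attempting a write, is PostgreSQL's `pg_is_in_recovery()` function, which returns true on a standby. The sketch below is only an illustration: `is_standby` is a hypothetical helper, and it assumes the password is supplied through PGPASSWORD or ~/.pgpass rather than the -W prompt.

```shell
# Hypothetical helper: succeed if the server reached via host/port is a standby.
# Usage: is_standby <host> <port>
is_standby() {
  local answer
  # -A (unaligned) and -t (tuples only) make psql print just "t" or "f".
  answer=$(psql --host "$1" --port "$2" -U stolon -At \
    -c 'select pg_is_in_recovery();' postgres) || return 2
  [ "$answer" = "t" ]
}

# Example (ports are the ones assumed in this post):
# is_standby <IP> 30544 && echo "served by a standby"
# is_standby <IP> 30543 || echo "served by the master"
```

Exit status 2 distinguishes a failed connection from a server that answered as the master.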

With that test passed, let's try the failover feature.

Test failover

This case follows the statefulset example in the official code base. In short, to simulate the master going down, we first delete the keeper statefulset and then delete the master pod.

  kubectl delete statefulset stolon-keeper --cascade=false
  kubectl delete pod stolon-keeper-0

Then, in the sentinel log, we can see that a new master has been elected.

  no keeper info available db=cb96f42d keeper=keeper0
  no keeper info available db=cb96f42d keeper=keeper0
  master db is failed db=cb96f42d keeper=keeper0
  trying to find a standby to replace failed master
  electing db as the new master db=087ce88a keeper=keeper1
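
When the sentinel log is long, the election line can be picked out mechanically. This is a minimal sketch that assumes the log format shown above. `new_master_from_log` is a hypothetical helper, and the sentinel pod name in the example is a placeholder.

```shell
# Hypothetical helper: read sentinel log lines on stdin and print the keeper
# named in the most recent "electing db as the new master" line.
new_master_from_log() {
  grep -i 'electing db as the new master' \
    | tail -n 1 \
    | sed -n 's/.*keeper *= *\([^ ]*\).*/\1/p'
}

# Example (replace the placeholder with your sentinel pod's name):
# kubectl logs <sentinel-pod> | new_master_from_log
```

If no election has been logged, the function prints nothing.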

Now, if we repeat the last command in each of the two terminals, we see the following output.

  postgres=# select * from test;
  server closed the connection unexpectedly
          This probably means the server terminated abnormally
          before or while processing the request.
  The connection to the server was lost. Attempting reset:
  postgres=# select * from test;
   id | value
  ----+--------
    1 | value1
  (1 row)

The Kubernetes service removes the unavailable pod and forwards requests to the available ones, so new read connections are routed to healthy pods.

Finally, we need to re-create the statefulset. The easiest way to do this is to upgrade the deployed helm chart.

  helm ls
  factual-crocodile 1 Sat Feb 18 15:42:50 2017 DEPLOYED stolon-0.1.0 default
  helm upgrade factual-crocodile

Use chaoskube to simulate random pod failures

Another good way to test cluster resilience is to use chaoskube. Chaoskube is a small service that periodically kills random pods in the cluster. It can also be deployed with a helm chart.

  helm install --set labels="release=factual-crocodile,component!=factual-crocodile-etcd" \
    --set interval=5m stable/chaoskube

This command runs chaoskube, which deletes one pod every 5 minutes. It selects pods labeled release=factual-crocodile but ignores the etcd pods.
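
Before letting chaoskube loose, it can help to preview which pods a selector matches. kubectl accepts label selectors through its -l flag. The sketch below assumes chaoskube interprets its labels value as a standard Kubernetes label selector. `chaos_targets` is a hypothetical helper, and the release name is the one from this post.

```shell
# Hypothetical helper: list the pods that a label selector would match.
# Usage: chaos_targets <label-selector>
chaos_targets() {
  kubectl get pods -l "$1" -o name
}

# Example (selector taken from the chaoskube command above):
# chaos_targets 'release=factual-crocodile,component!=factual-crocodile-etcd'
```

If any etcd pod shows up in the output, the selector needs adjusting before chaoskube is started.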

After a few hours of testing, my cluster was still consistent and working very well.

Conclusion

I still run stolon on my development servers, and so far I am satisfied with it. It really feels native to the environment, with good resilience and automated failover.

If you are interested, you can check out the official repository or my chart.

Source: How to deploy HA PostgreSQL cluster on Kubernetes (Translated by: Wang Xiaoxuan)
