September 2019

Fault-Tolerant, Highly-Available Privileged Access Management

Andy Harris

We’re often asked about high availability and fault tolerance for Osirium PAM. This is understandable: when fully deployed, Privileged Access Management is the default route to all administrative interfaces. If it is down, you have a problem.

Here’s a video of Osirium PAM (previously known as "PxM Platform") running on a VMware cluster built on modest systems:

Osirium PAM Options

Mesh

One option with PAM is “Mesh”, where multiple PAM installations send their configurations to each other, and one instance can take over the configuration of another if it becomes unavailable. While this works for multiple sites, many customers have just one data centre and would like a high-availability solution.

VMware

Our build systems deliver Osirium PAM in many formats: Azure, AWS, Hyper-V, VMware and even a kit form. The VMware version complies with all the prerequisites for vMotion and VMware clustering. Every year the VMware offerings improve, and vMotion and clustering have become cheaper.

Recently we built a test VMware cluster to see the changes for ourselves. In particular, we wanted to see how fail-over and fail-back worked. We found that fail-over in VMware 6.7 is pretty much seamless. In our test cluster, we could pull the power cable on either system and PAM would continue to run, and all sessions would continue after a very short delay.

To stress the cluster, we ran active streaming SSH sessions through PAM whilst pulling the power cable on the active ESXi host in the cluster. The result was that no session was lost, no data was lost, and there was a barely perceptible 0.3-second break in data transmission.
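A break like this can be quantified by timestamping each chunk of output as it arrives on the client and looking for the largest gap between consecutive arrivals. The following is a minimal sketch of that calculation only, not the tooling used in our test; the function name longest_gap and the sample arrival times are hypothetical, chosen purely for illustration.

```python
from typing import Iterable, Tuple


def longest_gap(timestamps: Iterable[float]) -> Tuple[float, int]:
    """Return (longest gap in seconds, index of the sample that ended it).

    Returns (0.0, -1) if fewer than two timestamps are supplied.
    """
    previous = None
    worst_gap = 0.0
    worst_index = -1
    for index, current in enumerate(timestamps):
        if previous is not None and current - previous > worst_gap:
            worst_gap = current - previous
            worst_index = index
        previous = current
    return worst_gap, worst_index


# Hypothetical arrival times (in seconds) of streamed output on the client.
# These are made-up values, not measurements from the test cluster.
arrivals = [0.00, 0.05, 0.10, 0.15, 0.45, 0.50, 0.55]

gap, index = longest_gap(arrivals)
print(f"longest break: {gap:.2f}s, ending at sample {index}")
```

In practice the timestamps would come from wherever the session output is received, for example by recording time.monotonic() each time a chunk is read.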

To recover the cluster, we just returned the power cable and allowed the ESXi system to boot and rejoin the cluster. This was all that was needed. On the vCenter display we could see when the ESXi host rejoined and when the Fault Tolerance status changed.

In previous versions of VMware it was necessary to ‘fail-back’ the virtual machines to the primary ESXi system. The current version has a more balanced approach, where ESXi systems can be part of a cluster but also host ordinary virtual machines.

The alternative solution – database replication – is not so friendly

It’s worth comparing this approach with the ‘always on’ databases used by some other PAM tools. With those, the loss of any worker system means that sessions are dropped and users have to restart them on another worker. Perhaps of more concern is the behaviour of the database in a network partition scenario. Typically, always-on databases switch to read-only mode when their cluster becomes inquorate. This means that credential data remains available but cannot be updated, which is far from ideal. For example, in this state, password cycling may not be available and history will need to be saved elsewhere.
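To make the quorum behaviour concrete, here is a minimal sketch of a generic strict-majority rule. It is an illustration of the general idea only and does not describe any particular database or PAM product; the function names has_quorum and database_mode are hypothetical.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if a strict majority of the cluster's voting members is reachable."""
    return reachable > cluster_size // 2


def database_mode(reachable: int, cluster_size: int) -> str:
    """Illustrative rule: writes are allowed only while the node has quorum."""
    return "read-write" if has_quorum(reachable, cluster_size) else "read-only"


# A three-node cluster split by a network partition into a side of two
# nodes and a side of one node (each count includes the node itself).
for side in (2, 1):
    print(f"side with {side} of 3 nodes: {database_mode(side, 3)}")
```

Under this rule the minority side of the partition can still serve reads but cannot accept updates, and an even split (for example, two nodes on each side of a four-node cluster) leaves both sides read-only.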

Global Scale

Using the PAM Mesh function, the PAM instances on either side of a network partition can keep their own history and credential lifecycle management; they will re-mesh configurations once the partition is healed. Besides this “belt & braces” protection, users also benefit from a performance boost by working with local PAM systems rather than traversing global internet connections.

It’s an interesting thought that our larger customers could use VMware clusters and Mesh together to form collections of fault-tolerant, highly-available Privileged Access Management services. A win-win for availability, resilience and user productivity.

As always – if you’d like to know more, please get in touch.
