Disclaimer: this is just a Proof of Concept.

If you deploy Azure Kubernetes Service clusters across availability zones, you'll probably need a highly available storage solution.

In such a situation you can use Azure Files as an external storage solution. But what if you need something that performs better? Or something that runs inside your cluster?

Well, let me introduce you to Rook

“Rook turns distributed storage systems into self-managing, self-scaling, self-healing storage services. It automates the tasks of a storage administrator: deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management.”

and Ceph.

“Ceph is an open-source software-defined storage platform that implements object storage on a single distributed computer cluster and provides 3-in-1 interfaces for object-, block- and file-level storage.”

Combining these two technologies, Rook and Ceph, we can create a highly available storage solution using Kubernetes tools such as Helm and primitives such as Persistent Volume Claims (PVCs).

Let me show you how to deploy Rook and Ceph on Azure Kubernetes Service:

Deploy the cluster with Rook and Ceph using Terraform

Create variables.tf with the following contents:

# Location of the services
variable "location" {
  default = "west europe"
}

# Resource Group Name
variable "resource_group" {
  default = "aks-rook"
}

# Name of the AKS cluster
variable "aks_name" {
  default = "aks-rook"
}

Create provider.tf with the following contents:

terraform {
  required_version = "> 0.14"
  required_providers {
    azurerm = {
      version = "= 2.57.0"
    }
    kubernetes = {
      version = "= 2.1.0"
    }
    helm = {
      version = "= 2.1.2"
    }
  }
}

provider "azurerm" {
  features {}
}

# Configuring the kubernetes provider
# AKS resource name is aks: azurerm_kubernetes_cluster.aks
provider "kubernetes" {
  host                   = azurerm_kubernetes_cluster.aks.kube_config.0.host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)
}

# Configuring the helm provider
# AKS resource name is aks: azurerm_kubernetes_cluster.aks
provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.aks.kube_config.0.host
    client_certificate     = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)
  }
}

Create main.tf with the following contents:

# Create Resource Group
resource "azurerm_resource_group" "rg" {
  name     = var.resource_group
  location = var.location
}

# Create VNET for AKS
resource "azurerm_virtual_network" "vnet" {
  name                = "rook-network"
  address_space       = ["10.0.0.0/8"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

# Create the Subnet for AKS.
resource "azurerm_subnet" "aks" {
  name                 = "aks"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.240.0.0/16"]
}

# Create the AKS cluster.
# Because this is a test, the node count is kept small.
resource "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  dns_prefix          = var.aks_name
  kubernetes_version  = "1.21.2"

  default_node_pool {
    name               = "default"
    node_count         = 3
    vm_size            = "Standard_D2s_v3"
    os_disk_size_gb    = 30
    os_disk_type       = "Ephemeral"
    vnet_subnet_id     = azurerm_subnet.aks.id
    availability_zones = ["1", "2", "3"]
  }

  # Using Managed Identity
  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  role_based_access_control {
    enabled = true
  }

  addon_profile {
    kube_dashboard {
      enabled = false
    }
  }
}

# Second node pool dedicated to the Ceph storage, tainted so only storage workloads land on it
resource "azurerm_kubernetes_cluster_node_pool" "npceph" {
  name                  = "npceph"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3
  node_taints           = ["storage-node=true:NoSchedule"]
  availability_zones    = ["1", "2", "3"]
  vnet_subnet_id        = azurerm_subnet.aks.id
}

data "azurerm_resource_group" "node_resource_group" {
  name = azurerm_kubernetes_cluster.aks.node_resource_group
}

# Give the kubelet identity Contributor rights on the node resource group
resource "azurerm_role_assignment" "kubelet_contributor" {
  scope                = data.azurerm_resource_group.node_resource_group.id
  role_definition_name = "Contributor" #"Virtual Machine Contributor"?
  principal_id         = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
}

# Give the cluster identity Network Contributor rights on the VNET
resource "azurerm_role_assignment" "identity_network_contributor" {
  scope                = azurerm_virtual_network.vnet.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_kubernetes_cluster.aks.identity[0].principal_id
}

Please note the following:

  • The AKS cluster is availability zone aware.
  • A second node pool (npceph) is created for the Ceph storage. This pool is also availability zone aware.
  • The npceph node pool is tainted with storage-node=true:NoSchedule so that only storage workloads are scheduled on it (you can verify this once the cluster is up, as shown below).
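
Once you have applied the Terraform (see below), you can quickly check that the npceph nodes carry the taint and are spread across the three zones. A small sketch, relying on the standard AKS agentpool node label:

# list the ceph nodes together with their availability zone label
kubectl get nodes -l agentpool=npceph -L topology.kubernetes.io/zone
# confirm the storage-node taint is in place
kubectl describe nodes -l agentpool=npceph | grep Taints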

Create rook-ceph-operator-values.yaml with the following contents:

crds:
  enabled: true
csi:
  provisionerTolerations:
    - effect: NoSchedule
      key: storage-node
      operator: Exists
  pluginTolerations:
    - effect: NoSchedule
      key: storage-node
      operator: Exists
agent:
  # AKS: https://rook.github.io/docs/rook/v1.7/flexvolume.html#azure-aks
  flexVolumeDirPath: "/etc/kubernetes/volumeplugins"

This is the helm configuration for the rook-ceph-operator. As you can see, both the provisioner and the plugin tolerate the storage-node taint, so the CSI components can also be scheduled on the tainted ceph nodes.
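
Once the operator is installed (later on, via Terraform), you can check that the CSI provisioner and plugin pods are actually running, including on the tainted npceph nodes. A quick check, assuming the rook-ceph namespace used below:

kubectl get pods -n rook-ceph -o wide | grep csi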

Create rook-ceph-cluster-values.yaml with the following contents:

operatorNamespace: rook-ceph
toolbox:
  enabled: true
cephBlockPools: []
cephObjectStores: []
cephClusterSpec:
  mon:
    volumeClaimTemplate:
      spec:
        storageClassName: managed-premium
        resources:
          requests:
            storage: 10Gi
  storage:
    storageClassDeviceSets:
      - name: set1
        # The number of OSDs to create from this device set
        count: 3
        # IMPORTANT: If volumes specified by the storageClassName are not portable across nodes
        # this needs to be set to false. For example, if using the local storage provisioner
        # this should be false.
        portable: false
        # Since the OSDs could end up on any node, an effort needs to be made to spread the OSDs
        # across nodes as much as possible. Unfortunately the pod anti-affinity breaks down
        # as soon as you have more than one OSD per node. The topology spread constraints will
        # give us an even spread on K8s 1.18 or newer.
        placement:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd
          tolerations:
            - key: storage-node
              operator: Exists
        preparePlacement:
          tolerations:
            - key: storage-node
              operator: Exists
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: agentpool
                      operator: In
                      values:
                        - npceph
          topologySpreadConstraints:
            - maxSkew: 1
              # IMPORTANT: If you don't have zone labels, change this to another key such as kubernetes.io/hostname
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd-prepare
        resources:
          limits:
            cpu: "500m"
            memory: "4Gi"
          requests:
            cpu: "500m"
            memory: "2Gi"
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              resources:
                requests:
                  storage: 100Gi
              storageClassName: managed-premium
              volumeMode: Block
              accessModes:
                - ReadWriteOnce

This is the configuration for the rook-ceph-cluster. Note that:

  • The configuration deploys 3 OSDs, each backed by a 100Gi managed-premium disk.
  • The storage-node taint is tolerated by both the OSD and the OSD-prepare pods.
  • topologySpreadConstraints are used to spread the OSDs across nodes and availability zones.
  • The toolbox is enabled (used below to run ceph commands against the cluster).
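
Once the cluster is up, the toolbox comes in handy to inspect it. For example, assuming the toolbox deployment keeps its default name (rook-ceph-tools), you can confirm the cluster is healthy and that the three OSDs landed on different nodes and zones:

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree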

Create rook-ceph.tf with the following contents:

# Install rook-ceph using the helm chart
resource "helm_release" "rook-ceph" {
  name             = "rook-ceph"
  chart            = "rook-ceph"
  namespace        = "rook-ceph"
  version          = "1.7.3"
  repository       = "https://charts.rook.io/release/"
  create_namespace = true

  values = [
    "${file("./rook-ceph-operator-values.yaml")}"
  ]

  depends_on = [
    azurerm_kubernetes_cluster_node_pool.npceph
  ]
}

# Install the rook-ceph-cluster using the helm chart
resource "helm_release" "rook-ceph-cluster" {
  name       = "rook-ceph-cluster"
  chart      = "rook-ceph-cluster"
  namespace  = "rook-ceph"
  version    = "1.7.3"
  repository = "https://charts.rook.io/release/"

  values = [
    "${file("./rook-ceph-cluster-values.yaml")}"
  ]

  depends_on = [
    azurerm_kubernetes_cluster_node_pool.npceph,
    helm_release.rook-ceph
  ]
}
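
By the way, if you would rather install the charts with the helm CLI instead of the Terraform helm provider, the equivalent would look roughly like this (same repository, chart versions and values files):

helm repo add rook-release https://charts.rook.io/release
helm install rook-ceph rook-release/rook-ceph --namespace rook-ceph --create-namespace \
  --version 1.7.3 -f rook-ceph-operator-values.yaml
helm install rook-ceph-cluster rook-release/rook-ceph-cluster --namespace rook-ceph \
  --version 1.7.3 -f rook-ceph-cluster-values.yaml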

From the terraform folder run:

terraform init
terraform apply

Once the cluster is deployed, it will take a few minutes for the rook-ceph-cluster to become ready.
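
You can follow the progress by watching the CephCluster resource until it reports a Ready phase and HEALTH_OK. A quick check, assuming the chart's default cluster name (rook-ceph):

kubectl -n rook-ceph get cephcluster rook-ceph -w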

Check that the OSDs are running:

az aks get-credentials --resource-group <resource group name> --name <aks name>
kubectl get pods -n rook-ceph
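
The rook-ceph-cluster chart also creates the storage classes. Since cephBlockPools and cephObjectStores are empty in our values file, you should only get the filesystem-backed class (named ceph-filesystem by default), which is the one the test application below uses:

kubectl get storageclass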

Deploy a Test Application

Create ceph-filesystem-pvc.yaml with the following contents:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-filesystem-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-filesystem

With this PVC you are requesting 1Gi of ReadWriteMany storage from the Ceph cluster through the ceph-filesystem storage class.

Create busybox-deployment.yaml with the following contents:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: busy
  name: busy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busy
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: busy
    spec:
      containers:
        - image: busybox
          imagePullPolicy: Always
          name: busy-rook
          command:
            - sh
            - -c
            - test -f /ceph-file-store/important.file || echo "yada yada yada" >> /ceph-file-store/important.file && sleep 3600
          volumeMounts:
            - mountPath: "/ceph-file-store"
              name: ceph-volume
          resources: {}
      volumes:
        - name: ceph-volume
          persistentVolumeClaim:
            claimName: ceph-filesystem-pvc
            readOnly: false

The busybox deployment mounts the ceph-filesystem-pvc and writes a file (important.file) to it.

Deploy the test application using the following command:

kubectl apply -f ceph-filesystem-pvc.yaml
kubectl apply -f busybox-deployment.yaml
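
Since the busybox pod does not tolerate the storage-node taint, it should land on the default node pool while its volume is served by Ceph running on the npceph nodes. You can check where it was scheduled with:

kubectl get pods -l app=busy -o wide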

Check that everything is running as expected:

Check the pvc status:

kubectl get pvc

The output should look like this (Note the status is “Bound”):

NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
ceph-filesystem-pvc   Bound    pvc-344e2517-b421-4a81-98e2-5bc6a991d93d   1Gi        RWX            ceph-filesystem   21h

Check that the important.file exists:

kubectl exec $(kubectl get po -l app=busy -o jsonpath='{.items[0].metadata.name}') -it -- cat /ceph-file-store/important.file

You should get the contents of important.file:

yada yada yada
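
Because the PVC is ReadWriteMany, several replicas on different nodes can mount the same volume. An optional way to see that in action is to scale the deployment and read the file from another pod:

kubectl scale deployment busy --replicas=3
kubectl get pods -l app=busy -o wide
kubectl exec $(kubectl get po -l app=busy -o jsonpath='{.items[1].metadata.name}') -it -- cat /ceph-file-store/important.file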

Performance Tests

You can run some performance tests using kubestr.

Test the performance of the Ceph storage:

Run:

.\kubestr.exe fio -z 20Gi -s ceph-filesystem

You should get some results:

PVC created kubestr-fio-pvc-6hvqr
Pod created kubestr-fio-pod-nmbmr
Running FIO test (default-fio) on StorageClass (ceph-filesystem) with a PVC of Size (20Gi)
Elapsed time- 2m35.1500188s
FIO test results:

FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=1583.366333 BW(KiB/s)=6350
  iops: min=1006 max=2280 avg=1599.766724
  bw(KiB/s): min=4024 max=9120 avg=6399.233398

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=223.526337 BW(KiB/s)=910
  iops: min=124 max=305 avg=224.199997
  bw(KiB/s): min=496 max=1221 avg=897.133362

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=1565.778198 BW(KiB/s)=200950
  iops: min=968 max=2214 avg=1583.266724
  bw(KiB/s): min=123904 max=283392 avg=202674.265625

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=225.524933 BW(KiB/s)=29396
  iops: min=124 max=308 avg=227.033340
  bw(KiB/s): min=15872 max=39424 avg=29077.132812

Disk stats (read/write):
  -  OK

Test the performance of Azure Files Premium:

Run:

.\kubestr.exe fio -z 20Gi -s azurefile-csi-premium

You should get some results:

PVC created kubestr-fio-pvc-mvf9v
Pod created kubestr-fio-pod-qntnw
Running FIO test (default-fio) on StorageClass (azurefile-csi-premium) with a PVC of Size (20Gi)
Elapsed time- 59.3141476s
FIO test results:

FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=557.804260 BW(KiB/s)=2247
  iops: min=260 max=1294 avg=644.807678
  bw(KiB/s): min=1040 max=5176 avg=2579.384521

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=255.239807 BW(KiB/s)=1037
  iops: min=6 max=428 avg=292.037048
  bw(KiB/s): min=24 max=1712 avg=1168.333374

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=537.072571 BW(KiB/s)=69278
  iops: min=260 max=1358 avg=622.115356
  bw(KiB/s): min=33280 max=173824 avg=79648.304688

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=295.383789 BW(KiB/s)=38343
  iops: min=144 max=872 avg=340.846161
  bw(KiB/s): min=18432 max=111616 avg=43637.308594

Disk stats (read/write):
  -  OK

Note that these are just the results of the FIO tests I ran; I will not jump to any conclusions, since I am not a performance expert.

Simulate a node crash

Let’s simulate a VM crash by deallocating one of the nodes:

$resourceGroupName="aks-rook"
$aksName="aks-rook"
$resourceGroup=$(az aks show --resource-group $resourceGroupName --name $aksName --query "nodeResourceGroup" --output tsv)
$cephScaleSet=$(az vmss list --resource-group $resourceGroup --query "[].{name:name}[? contains(name,'npceph')] | [0].name" --output tsv)
az vmss deallocate --resource-group $resourceGroup --name $cephScaleSet --instance-ids 0
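
After a minute or two the deallocated VM should show up as NotReady, and the OSD that was running on it should go down:

kubectl get nodes -l agentpool=npceph
kubectl get pods -n rook-ceph -l app=rook-ceph-osd -o wide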

Check that you still have access to the important.file:

kubectl exec $(kubectl get po -l app=busy -o jsonpath='{.items[0].metadata.name}') -it -- cat /ceph-file-store/important.file
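
You can also look at the Ceph health from the toolbox; with one node deallocated you should see a degraded (but still serving) cluster with one OSD down. When you are done, bring the instance back (reusing the variables from the script above):

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
az vmss start --resource-group $resourceGroup --name $cephScaleSet --instance-ids 0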

Hope it helps!!!

Please find the complete sample here

References: