Disclaimer: this is just a Proof of Concept.
If you deploy Azure Kubernetes Service clusters across availability zones, you’ll probably need a highly available storage solution.
In that situation you can use Azure Files as an external storage solution. But what if you need something that performs better, or something that runs inside your cluster?
Well, let me introduce you to Rook:
“Rook turns distributed storage systems into self-managing, self-scaling, self-healing storage services. It automates the tasks of a storage administrator: deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management.”
and Ceph:
“Ceph is an open-source, software-defined storage platform that implements object storage on a single distributed computer cluster and provides 3-in-1 interfaces for object-, block- and file-level storage.”
Combining these two technologies, Rook and Ceph, we can create a highly available storage solution using Kubernetes tools such as Helm and primitives such as PVCs.
Let me show you how to deploy Rook and Ceph on Azure Kubernetes Service:
Deploy a cluster with Rook and Ceph using Terraform
Create variables.tf with the following contents:
# Location of the services
variable "location" {
  default = "west europe"
}

# Resource Group Name
variable "resource_group" {
  default = "aks-rook"
}

# Name of the AKS cluster
variable "aks_name" {
  default = "aks-rook"
}
Create provider.tf with the following contents:
terraform {
  required_version = "> 0.14"
  required_providers {
    azurerm = {
      version = "= 2.57.0"
    }
    kubernetes = {
      version = "= 2.1.0"
    }
    helm = {
      version = "= 2.1.2"
    }
  }
}

provider "azurerm" {
  features {}
}

# Configuring the kubernetes provider
# AKS resource name is aks: azurerm_kubernetes_cluster.aks
provider "kubernetes" {
  host                   = azurerm_kubernetes_cluster.aks.kube_config.0.host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)
}

# Configuring the helm provider
# AKS resource name is aks: azurerm_kubernetes_cluster.aks
provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.aks.kube_config.0.host
    client_certificate     = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)
  }
}
Create main.tf with the following contents:
# Create Resource Group
resource "azurerm_resource_group" "rg" {
  name     = var.resource_group
  location = var.location
}

# Create VNET for AKS
resource "azurerm_virtual_network" "vnet" {
  name                = "rook-network"
  address_space       = ["10.0.0.0/8"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

# Create the Subnet for AKS.
resource "azurerm_subnet" "aks" {
  name                 = "aks"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.240.0.0/16"]
}

# Create the AKS cluster.
# Because this is a test, the node count is kept small
resource "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  dns_prefix          = var.aks_name
  kubernetes_version  = "1.21.2"

  default_node_pool {
    name               = "default"
    node_count         = 3
    vm_size            = "Standard_D2s_v3"
    os_disk_size_gb    = 30
    os_disk_type       = "Ephemeral"
    vnet_subnet_id     = azurerm_subnet.aks.id
    availability_zones = ["1", "2", "3"]
  }

  # Using Managed Identity
  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  role_based_access_control {
    enabled = true
  }

  addon_profile {
    kube_dashboard {
      enabled = false
    }
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "npceph" {
  name                  = "npceph"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3
  node_taints           = ["storage-node=true:NoSchedule"]
  availability_zones    = ["1", "2", "3"]
  vnet_subnet_id        = azurerm_subnet.aks.id
}

data "azurerm_resource_group" "node_resource_group" {
  name = azurerm_kubernetes_cluster.aks.node_resource_group
}

resource "azurerm_role_assignment" "kubelet_contributor" {
  scope                = data.azurerm_resource_group.node_resource_group.id
  role_definition_name = "Contributor" #"Virtual Machine Contributor"?
  principal_id         = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
}

resource "azurerm_role_assignment" "identity_network_contributor" {
  scope                = azurerm_virtual_network.vnet.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_kubernetes_cluster.aks.identity[0].principal_id
}
Please note the following:
- The AKS cluster is availability zone aware.
- A second node pool (npceph) is created for the Ceph storage. This pool is also availability zone aware.
- The node pool (npceph) is configured with the storage-node taint (you can verify it with the command shown right after this list).
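Once the cluster is up (after the terraform apply step further below), you can confirm that the taint actually landed on the Ceph nodes. A minimal sketch, assuming the pool is named npceph as in the Terraform above and relying on the agentpool node label that AKS sets (the same key the cluster values later use for node affinity):
kubectl get nodes -l agentpool=npceph -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
Each npceph node should list the storage-node=true:NoSchedule taint.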
Create rook-ceph-operator-values.yaml with the following contents:
crds:
  enabled: true
csi:
  provisionerTolerations:
    - effect: NoSchedule
      key: storage-node
      operator: Exists
  pluginTolerations:
    - effect: NoSchedule
      key: storage-node
      operator: Exists
agent:
  # AKS: https://rook.github.io/docs/rook/v1.7/flexvolume.html#azure-aks
  flexVolumeDirPath: "/etc/kubernetes/volumeplugins"
This is the Helm configuration for the rook-ceph-operator. As you can see, both the provisioner and the plugin tolerations use the storage-node taint.
Create rook-ceph-cluster-values.yaml with the following contents:
operatorNamespace: rook-ceph
toolbox:
  enabled: true
cephBlockPools: []
cephObjectStores: []
cephClusterSpec:
  mon:
    volumeClaimTemplate:
      spec:
        storageClassName: managed-premium
        resources:
          requests:
            storage: 10Gi
  storage:
    storageClassDeviceSets:
      - name: set1
        # The number of OSDs to create from this device set
        count: 3
        # IMPORTANT: If volumes specified by the storageClassName are not portable across nodes
        # this needs to be set to false. For example, if using the local storage provisioner
        # this should be false.
        portable: false
        # Since the OSDs could end up on any node, an effort needs to be made to spread the OSDs
        # across nodes as much as possible. Unfortunately the pod anti-affinity breaks down
        # as soon as you have more than one OSD per node. The topology spread constraints will
        # give us an even spread on K8s 1.18 or newer.
        placement:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd
          tolerations:
            - key: storage-node
              operator: Exists
        preparePlacement:
          tolerations:
            - key: storage-node
              operator: Exists
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: agentpool
                      operator: In
                      values:
                        - npceph
          topologySpreadConstraints:
            - maxSkew: 1
              # IMPORTANT: If you don't have zone labels, change this to another key such as kubernetes.io/hostname
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd-prepare
        resources:
          limits:
            cpu: "500m"
            memory: "4Gi"
          requests:
            cpu: "500m"
            memory: "2Gi"
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              resources:
                requests:
                  storage: 100Gi
              storageClassName: managed-premium
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
This is the configuration for the rook-ceph-cluster. Note that:
- The configuration deploys 3 OSDs.
- The storage-node taint must be tolerated.
- topologySpreadConstraints are used to spread the OSDs across nodes and zones (see the check right after this list).
- The toolbox is enabled.
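Because the OSD pods carry the app=rook-ceph-osd label used in the selector above, a quick way to see the spread once everything is deployed could be:
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
kubectl get nodes -l agentpool=npceph -L topology.kubernetes.io/zone
The first command shows which node each OSD landed on; the second lists the Ceph nodes with their zone label, so you can map OSDs to zones.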
Create rook-ceph.tf with the following contents:
# Install rook-ceph using the helm chart
resource "helm_release" "rook-ceph" {
  name             = "rook-ceph"
  chart            = "rook-ceph"
  namespace        = "rook-ceph"
  version          = "1.7.3"
  repository       = "https://charts.rook.io/release/"
  create_namespace = true

  values = [
    "${file("./rook-ceph-operator-values.yaml")}"
  ]

  depends_on = [
    azurerm_kubernetes_cluster_node_pool.npceph
  ]
}

resource "helm_release" "rook-ceph-cluster" {
  name       = "rook-ceph-cluster"
  chart      = "rook-ceph-cluster"
  namespace  = "rook-ceph"
  version    = "1.7.3"
  repository = "https://charts.rook.io/release/"

  values = [
    "${file("./rook-ceph-cluster-values.yaml")}"
  ]

  depends_on = [
    azurerm_kubernetes_cluster_node_pool.npceph,
    helm_release.rook-ceph
  ]
}
From the Terraform folder, run:
terraform init
terraform apply
Once the cluster is deployed, it will take a few minutes for the rook-ceph-cluster to become ready.
Check that the OSDs are running:
az aks get-credentials --resource-group <resource group name> --name <aks name>
kubectl get pods -n rook-ceph
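Besides the pod list, you can query the CephCluster resource and, because the toolbox is enabled in the cluster values, ask Ceph itself for its health. This is a sketch that assumes the toolbox deployment keeps its default name, rook-ceph-tools:
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
The CephCluster resource should eventually report a Ready phase, and ceph status should show HEALTH_OK with three OSDs up and in.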
Deploy a Test Application
Create ceph-filesystem-pvc.yaml with the following contents:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-filesystem-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-filesystem
With this PVC you are asking for 1Gi of storage from the Ceph cluster.
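The ceph-filesystem storage class comes from the rook-ceph-cluster chart: the values above only emptied cephBlockPools and cephObjectStores, so the chart's default CephFS filesystem and its storage class should still be created. Before creating the PVC you can check that the class is present:
kubectl get storageclass
You should see ceph-filesystem listed next to the built-in Azure classes such as managed-premium.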
Create busybox-deployment.yaml with the following contents:
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: busy
  name: busy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busy
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: busy
    spec:
      containers:
        - image: busybox
          imagePullPolicy: Always
          name: busy-rook
          command:
            - sh
            - -c
            - test -f /ceph-file-store/important.file || echo "yada yada yada" >> /ceph-file-store/important.file && sleep 3600
          volumeMounts:
            - mountPath: "/ceph-file-store"
              name: ceph-volume
          resources: {}
      volumes:
        - name: ceph-volume
          persistentVolumeClaim:
            claimName: ceph-filesystem-pvc
            readOnly: false
The busybox deployment mounts the ceph-filesystem-pvc and writes important.file to it (only if the file does not already exist).
Deploy the test application using the following command:
kubectl apply -f ceph-filesystem-pvc.yaml
kubectl apply -f busybox-deployment.yaml
Check that everything is running as expected:
Check the pvc status:
kubectl get pvc
The output should look like this (note that the status is “Bound”):
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
ceph-filesystem-pvc   Bound    pvc-344e2517-b421-4a81-98e2-5bc6a991d93d   1Gi        RWX            ceph-filesystem   21h
Check that important.file exists:
kubectl exec $(kubectl get po -l app=busy -o jsonpath='{.items[0].metadata.name}') -it -- cat /ceph-file-store/important.file
You should get the contents of important.file:
yada yada yada
Performance Tests
You can run some performance tests using kubestr.
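kubestr is a single binary you run from your workstation. As far as I recall, running it without arguments identifies the storage provisioners available in the cluster, which makes for a handy sanity check before the FIO runs:
.\kubestr.exe
Then point the fio subcommand at the storage class you want to benchmark, as shown below.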
Test the performance of the Ceph storage:
Run:
.\kubestr.exe fio -z 20Gi -s ceph-filesystem
You should get some results:
PVC created kubestr-fio-pvc-6hvqr
Pod created kubestr-fio-pod-nmbmr
Running FIO test (default-fio) on StorageClass (ceph-filesystem) with a PVC of Size (20Gi)
Elapsed time- 2m35.1500188s
FIO test results:

FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=1583.366333 BW(KiB/s)=6350
  iops: min=1006 max=2280 avg=1599.766724
  bw(KiB/s): min=4024 max=9120 avg=6399.233398

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=223.526337 BW(KiB/s)=910
  iops: min=124 max=305 avg=224.199997
  bw(KiB/s): min=496 max=1221 avg=897.133362

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=1565.778198 BW(KiB/s)=200950
  iops: min=968 max=2214 avg=1583.266724
  bw(KiB/s): min=123904 max=283392 avg=202674.265625

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=225.524933 BW(KiB/s)=29396
  iops: min=124 max=308 avg=227.033340
  bw(KiB/s): min=15872 max=39424 avg=29077.132812

Disk stats (read/write):
  -  OK
Test the performance of Azure Files Premium:
Run:
.\kubestr.exe fio -z 20Gi -s azurefile-csi-premium
You should get some results:
PVC created kubestr-fio-pvc-mvf9v
Pod created kubestr-fio-pod-qntnw
Running FIO test (default-fio) on StorageClass (azurefile-csi-premium) with a PVC of Size (20Gi)
Elapsed time- 59.3141476s
FIO test results:

FIO version - fio-3.20
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=557.804260 BW(KiB/s)=2247
  iops: min=260 max=1294 avg=644.807678
  bw(KiB/s): min=1040 max=5176 avg=2579.384521

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=255.239807 BW(KiB/s)=1037
  iops: min=6 max=428 avg=292.037048
  bw(KiB/s): min=24 max=1712 avg=1168.333374

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=537.072571 BW(KiB/s)=69278
  iops: min=260 max=1358 avg=622.115356
  bw(KiB/s): min=33280 max=173824 avg=79648.304688

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=295.383789 BW(KiB/s)=38343
  iops: min=144 max=872 avg=340.846161
  bw(KiB/s): min=18432 max=111616 avg=43637.308594

Disk stats (read/write):
  -  OK
Note that these are the results of the FIO tests I ran, but I will not jump to any conclusions since I am not a performance expert.
Simulate a node crash
Let’s simulate a VM crash by deallocating one of the Ceph nodes:
$resourceGroupName="aks-rook"
$aksName="aks-rook"
$resourceGroup=$(az aks show --resource-group $resourceGroupName --name $aksName --query "nodeResourceGroup" --output tsv)
$cephScaleSet=$(az vmss list --resource-group $resourceGroup --query "[].{name:name}[? contains(name,'npceph')] | [0].name" --output tsv)
az vmss deallocate --resource-group $resourceGroup --name $cephScaleSet --instance-ids 0
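After the deallocation you can watch how the cluster reacts. A rough check, again assuming the toolbox deployment keeps its default rook-ceph-tools name:
kubectl get nodes
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
One node should show up as NotReady, the OSD on it should be reported as down, and the remaining OSDs should keep serving the filesystem.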
Check that you still have access to important.file:
kubectl exec $(kubectl get po -l app=busy -o jsonpath='{.items[0].metadata.name}') -it -- cat /ceph-file-store/important.file
Hope it helps!!!
Please find the complete sample here