
Creating and Managing a Highly Available AKS Kubernetes Cluster in Azure with Terraform



Learn how to use Terraform to manage a highly-available Azure AKS Kubernetes cluster with Azure AD integration and Calico network policies enabled.


    Article originally published at Coder Society here.

    What is Azure Kubernetes Service (AKS)?

    Azure Kubernetes Service (AKS) is a managed Kubernetes offering in Azure that lets you quickly deploy a production-ready Kubernetes cluster. It allows customers to focus on application development and deployment rather than the details of Kubernetes cluster management. The cluster control plane is deployed and managed by Microsoft, while the nodes and node pools where the applications run are handled by the customer.

    The AKS cluster deployment can be fully automated using Terraform. Terraform enables you to safely and predictably create, change, and improve infrastructure. It also supports advanced AKS configurations, such as availability zones, Azure AD integration, and network policies for Kubernetes.

    Let’s take a look at the key AKS features we’ll be covering in this article.

    AKS deployment across multiple availability zones

    Ensuring high availability of deployments is a must for enterprise workloads. Azure availability zones protect resources from data center-level failures by distributing them across physically separate zones within an Azure region.

    AKS clusters can also be deployed in availability zones, in which the nodes are deployed across different zones in a region. In case of a data center failure, the workloads deployed in the cluster would continue to run from nodes in a different zone, thereby protecting them from such incidents.

    Figure: Overview of availability zones for AKS clusters

    Azure Active Directory integration

    With identity considered the new security perimeter, customers are now opting to use Azure AD for authentication and authorization of cloud-native deployments. AKS clusters can be integrated with Azure Active Directory so that users can be granted access to namespaces in the cluster or cluster-level resources using their existing Azure AD credentials. This eliminates the need for multiple credentials when deploying and managing workloads in an AKS cluster.

    This is of even greater benefit in hybrid cloud deployments, in which on-premises AD credentials are synced to Azure AD. It delivers a consistent, unified experience for authentication and authorization. Figure 1 below shows this high-level AKS authentication flow when integrated with Azure Active Directory.

    Figure 1: High-level AKS authentication flow integrated with Azure AD

    Pod traffic control through network policy implementation

    By default, all pods in an AKS cluster can communicate with each other without any restrictions. However, in production, customers would want to restrict this traffic for security reasons. This can be achieved by implementing network policies in a Kubernetes cluster. Network policies can be used to define a set of rules that allow or deny traffic between pods based on matching labels.

    AKS supports two types of network implementations: Kubenet (basic networking) and Azure CNI (advanced networking). Customers can also choose between two types of network policies: Azure (native) or Calico network policies (open source). While Azure network policies are supported only in Azure CNI, Calico is supported in both Kubenet- and Azure CNI-based network implementations.

    Deployment prerequisites

    Following are the prerequisites for the deployment of the AKS cluster:

    • Azure subscription access: It is recommended that users with contributor rights run the Terraform scripts. During deployment, an additional resource group is created for the AKS nodes. Restricted permissions may lead to deployment failures.
    • Azure AD server and client application: OpenID Connect is used to integrate Azure Active Directory with the AKS cluster. Two Azure AD applications are required to enable this: a server application and a client application. The server application serves as the endpoint for identity requests, while the client application is used for authentication when users try to access the AKS cluster via the kubectl command. Microsoft offers a step-by-step guide for creating these Azure AD applications.
    • Terraform usage from Cloud Shell: Azure Cloud Shell has Terraform installed by default in the Bash environment. You can use your favorite text editor, such as vim, or the code editor in Azure Cloud Shell to write the Terraform templates. Refer to Microsoft’s guide to get started with Terraform in Azure Cloud Shell.

    Creating a Terraform template

    To create the templates, Terraform uses the HashiCorp Configuration Language (HCL), which is designed to be both machine-friendly and human-readable. For a more in-depth understanding of Terraform syntax, refer to the Terraform documentation. The values that change across deployments can be defined as variables and are either provided through a variables file or at runtime when the Terraform templates are applied.
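
    As an illustration, the variables referenced throughout this article could be declared along the following lines. This is a sketch only; the exact declarations live in the repository's variables file and may differ:

    variable "prefix" {
      description = "Prefix used to name and differentiate the deployed resources"
      type        = string
    }

    variable "client_app_id" {
      description = "Application ID of the Azure AD client application"
      type        = string
    }

    variable "server_app_id" {
      description = "Application ID of the Azure AD server application"
      type        = string
    }

    variable "server_app_secret" {
      description = "Secret created for the Azure AD server application"
      type        = string
    }

    variable "tenant_id" {
      description = "Azure AD tenant ID of the subscription"
      type        = string
    }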

    In this section, we’ll describe the relevant modules of the Terraform template to be used to create the cluster.

    Note: The Terraform template as well as the variable and output files for this deployment are all available in the GitHub repository.

    Network setup

    The following block of Terraform code should be used to create the Azure VNet and subnet, which are required for the Azure CNI network implementation:

    resource "azurerm_virtual_network" "demo" {
      name                = "${var.prefix}-network"
      location            = azurerm_resource_group.demo.location
      resource_group_name = azurerm_resource_group.demo.name
      address_space       = ["10.1.0.0/16"]
    }
    
    resource "azurerm_subnet" "demo" {
      name                 = "${var.prefix}-akssubnet"
      virtual_network_name = azurerm_virtual_network.demo.name
      resource_group_name  = azurerm_resource_group.demo.name
      address_prefixes     = ["10.1.0.0/22"]
    }

    var.prefix: A prefix defined in the Terraform variables file, used to differentiate the deployment.

    demo: This is the local name which is used by Terraform to reference the defined resources (e.g. Azure VNet and subnet). It can be renamed to suit your use case.

    address_space and address_prefixes: This refers to the address space for the VNet and subnet. You can replace the values with your preferred private IP blocks.
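
    Note that both resources reference azurerm_resource_group.demo, which is defined elsewhere in the template together with the azurerm provider. A minimal version could look like the following; the resource group name and location are assumptions based on the plan output shown later in this article:

    terraform {
      required_providers {
        azurerm = {
          source  = "hashicorp/azurerm"
          version = "~> 2.0"
        }
      }
    }

    provider "azurerm" {
      features {}
    }

    # Resource group referenced by the VNet and subnet above
    resource "azurerm_resource_group" "demo" {
      name     = "${var.prefix}-rg"
      location = "westeurope"
    }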

    Azure AD integration

    To enable the Azure AD integration we need to provide the server application, client application, and Azure AD tenant details. The following code block should be used in the AKS cluster definition to enable RBAC for the AKS cluster and to use Azure AD for RBAC authentication.

    role_based_access_control {
        azure_active_directory {
          client_app_id     = var.client_app_id
          server_app_id     = var.server_app_id
          server_app_secret = var.server_app_secret
          tenant_id         = var.tenant_id
        }    
        enabled = true
      }

    var.client_app_id: This variable refers to the client app ID of the Azure AD client application which was mentioned in the prerequisites section.

    var.server_app_id: This variable refers to the server app ID of the Azure AD server application which was mentioned in the prerequisites section.

    var.server_app_secret: This variable refers to the secret created for the Azure AD server application.

    var.tenant_id: This variable refers to the Azure AD tenant ID associated with the subscription where the cluster will be deployed. This value can be obtained from the Azure portal or through the Azure CLI.
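
    For reference, the tenant ID can be looked up with the Azure CLI:

    az account show --query tenantId -o tsv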

    Network policy configuration

    The following Terraform code will be used in the AKS cluster definition to enable Calico network policies. Note that this can be configured only during cluster deployment and any changes will require a recreation of the cluster.

    network_profile {
        network_plugin     = "azure"
        load_balancer_sku  = "standard"
        network_policy     = "calico"
      }

    network_plugin: The value should be set to azure to use CNI networking.

    load_balancer_sku: The value should be set to standard, as we will be using virtual machine scale sets.

    network_policy: The value should be set to calico since we'll be using Calico network policies.

    Node pool and availability zone configuration

    The following code will be used to configure the node pools and availability zone.

    default_node_pool {
        name                = "default"
        node_count          = 2
        vm_size             = "Standard_D2_v2"
        type                = "VirtualMachineScaleSets"
        availability_zones  = ["1", "2"]
        enable_auto_scaling = true
        min_count           = 2
        max_count           = 4
    
        # Required for advanced networking
        vnet_subnet_id = azurerm_subnet.demo.id
      }

    node_count: This refers to the initial number of nodes to be deployed in the node pool.

    vm_size: Standard_D2_v2 is used in this sample; it can be replaced with your preferred SKU.

    type: This should be set to VirtualMachineScaleSets so that the VMs can be distributed across availability zones.

    availability_zones: Lists the availability zones to be used by the node pool.

    enable_auto_scaling: This should be set to true to enable autoscaling.

    min_count and max_count: These define the minimum and maximum node count within the node pool; both values must be between 1 and 100.
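
    For context, the fragments shown above (role_based_access_control, network_profile, and default_node_pool) are all nested inside a single azurerm_kubernetes_cluster resource. A condensed sketch of how they fit together, inferred from the plan output later in this article, looks roughly like this (refer to the repository for the complete definition):

    resource "azurerm_kubernetes_cluster" "demo" {
      name                = "${var.prefix}-aks"
      location            = azurerm_resource_group.demo.location
      resource_group_name = azurerm_resource_group.demo.name
      dns_prefix          = "${var.prefix}-aks"

      default_node_pool {
        # ... node pool and availability zone settings shown above
      }

      identity {
        type = "SystemAssigned"
      }

      network_profile {
        # ... Calico network policy settings shown above
      }

      role_based_access_control {
        # ... Azure AD integration settings shown above
      }

      tags = {
        Environment = "Development"
      }
    }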

    Deploying the HA AKS cluster

    Download the Terraform files from the GitHub repository to your Cloud Shell session and edit the configuration parameters in accordance with your AKS cluster deployment requirements. The guidance provided in the previous section can be used to update these values.

    1 / Run the following command to clone the GitHub repository in Cloud Shell:

    git clone https://github.com/coder-society/terraform-aks-azure.git
    Cloning into 'terraform-aks-azure'...
    remote: Enumerating objects: 12, done.
    remote: Counting objects: 100% (12/12), done.
    remote: Compressing objects: 100% (10/10), done.
    remote: Total 12 (delta 1), reused 12 (delta 1), pack-reused 0
    Unpacking objects: 100% (12/12), done.
    Checking connectivity... done.

    2 / Go into the terraform directory and run the terraform init command to initialize Terraform:

    terraform init
    Initializing the backend...
    Initializing provider plugins...
    - Finding hashicorp/azurerm versions matching "~> 2.0"...
    - Installing hashicorp/azurerm v2.28.0...
    - Installed hashicorp/azurerm v2.28.0 (signed by HashiCorp)
    Terraform has been successfully initialized!
    You may now begin working with Terraform. Try running "terraform plan" to see
    any changes that are required for your infrastructure. All Terraform commands
    should now work.
    If you ever set or change modules or backend configuration for Terraform,
    rerun this command to reinitialize your working directory. If you forget, other
    commands will detect it and remind you to do so if necessary.

    3 / Export the Terraform variables to be used at runtime, replacing the placeholders with environment-specific values. Alternatively, you can define the values in the variables file.

    export TF_VAR_prefix=<Environment prefix>
    export TF_VAR_client_app_id=<The client app ID of the AKS client application>  
    export TF_VAR_server_app_id=<The server app ID of the AKS server application>
    export TF_VAR_server_app_secret=<The secret created for AKS server application>
    export TF_VAR_tenant_id=<The Azure AD tenant id>

    4 / Create the Terraform plan by executing terraform plan -out out.plan.

    terraform plan -out out.plan
    Refreshing Terraform state in-memory prior to plan...
    The refreshed state will be used to calculate this plan, but will not be
    persisted to local or remote state storage.
    ------------------------------------------------------------------------
    An execution plan has been generated and is shown below.
    Resource actions are indicated with the following symbols:
      + create
    Terraform will perform the following actions:
      # azurerm_kubernetes_cluster.demo will be created
      + resource "azurerm_kubernetes_cluster" "demo" {
          + dns_prefix              = "cs-aks"
          + fqdn                    = (known after apply)
          + id                      = (known after apply)
          + kube_admin_config       = (known after apply)
          + kube_admin_config_raw   = (sensitive value)
          + kube_config             = (known after apply)
          + kube_config_raw         = (sensitive value)
          + kubelet_identity        = (known after apply)
          + kubernetes_version      = (known after apply)
          + location                = "westeurope"
          + name                    = "cs-aks"
          + node_resource_group     = (known after apply)
          + private_cluster_enabled = (known after apply)
          + private_fqdn            = (known after apply)
          + private_link_enabled    = (known after apply)
          + resource_group_name     = "cs-rg"
          + sku_tier                = "Free"
          + tags                    = {
              + "Environment" = "Development"
            }
          + addon_profile {
              + aci_connector_linux {
                  + enabled     = (known after apply)
                  + subnet_name = (known after apply)
                }
              + azure_policy {
                  + enabled = (known after apply)
                }
              + http_application_routing {
                  + enabled                            = (known after apply)
                  + http_application_routing_zone_name = (known after apply)
                }
              + kube_dashboard {
                  + enabled = (known after apply)
                }
              + oms_agent {
                  + enabled                    = (known after apply)
                  + log_analytics_workspace_id = (known after apply)
                  + oms_agent_identity         = (known after apply)
                }
            }
          + auto_scaler_profile {
              + balance_similar_node_groups      = (known after apply)
              + max_graceful_termination_sec     = (known after apply)
              + scale_down_delay_after_add       = (known after apply)
              + scale_down_delay_after_delete    = (known after apply)
              + scale_down_delay_after_failure   = (known after apply)
              + scale_down_unneeded              = (known after apply)
              + scale_down_unready               = (known after apply)
              + scale_down_utilization_threshold = (known after apply)
              + scan_interval                    = (known after apply)
            }
          + default_node_pool {
              + availability_zones   = [
                  + "1",
                  + "2",
                ]
              + enable_auto_scaling  = true
              + max_count            = 4
              + max_pods             = (known after apply)
              + min_count            = 2
              + name                 = "default"
              + node_count           = 2
              + orchestrator_version = (known after apply)
              + os_disk_size_gb      = (known after apply)
              + type                 = "VirtualMachineScaleSets"
              + vm_size              = "Standard_D2_v2"
              + vnet_subnet_id       = (known after apply)
            }
          + identity {
              + principal_id = (known after apply)
              + tenant_id    = (known after apply)
              + type         = "SystemAssigned"
            }
          + network_profile {
              + dns_service_ip     = (known after apply)
              + docker_bridge_cidr = (known after apply)
              + load_balancer_sku  = "standard"
              + network_plugin     = "azure"
              + network_policy     = "calico"
              + outbound_type      = "loadBalancer"
              + pod_cidr           = (known after apply)
              + service_cidr       = (known after apply)
              + load_balancer_profile {
                  + effective_outbound_ips    = (known after apply)
                  + idle_timeout_in_minutes   = (known after apply)
                  + managed_outbound_ip_count = (known after apply)
                  + outbound_ip_address_ids   = (known after apply)
                  + outbound_ip_prefix_ids    = (known after apply)
                  + outbound_ports_allocated  = (known after apply)
                }
            }
          + role_based_access_control {
              + enabled = true
              + azure_active_directory {
                  + client_app_id     = "f9bf8772-aaba-4773-a815-784b31f9ab8b"
                  + server_app_id     = "fa7775b3-ea31-4e99-92f5-8ed0bac3e6a8"
                  + server_app_secret = (sensitive value)
                  + tenant_id         = "8f55a88a-7752-4e10-9bbb-e847ae93911d"
                }
            }
          + windows_profile {
              + admin_password = (sensitive value)
              + admin_username = (known after apply)
            }
        }
      # azurerm_resource_group.demo will be created
      + resource "azurerm_resource_group" "demo" {
          + id       = (known after apply)
          + location = "westeurope"
          + name     = "cs-rg"
        }
      # azurerm_subnet.demo will be created
      + resource "azurerm_subnet" "demo" {
          + address_prefix                                 = (known after apply)
          + address_prefixes                               = [
              + "10.1.0.0/22",
            ]
          + enforce_private_link_endpoint_network_policies = false
          + enforce_private_link_service_network_policies  = false
          + id                                             = (known after apply)
          + name                                           = "cs-subnet"
          + resource_group_name                            = "cs-rg"
          + virtual_network_name                           = "cs-network"
        }
      # azurerm_virtual_network.demo will be created
      + resource "azurerm_virtual_network" "demo" {
          + address_space       = [
              + "10.1.0.0/16",
            ]
          + guid                = (known after apply)
          + id                  = (known after apply)
          + location            = "westeurope"
          + name                = "cs-network"
          + resource_group_name = "cs-rg"
          + subnet              = (known after apply)
        }
    Plan: 4 to add, 0 to change, 0 to destroy.
    ------------------------------------------------------------------------
    This plan was saved to: out.plan
    To perform exactly these actions, run the following command to apply:
        terraform apply "out.plan"

    5 / Use the terraform apply out.plan command to apply the plan.

    Once successfully deployed, the details of the cluster, network, etc. will be shown in the command line.

    6 / Browse to the resource group in the Azure portal to view the cluster and the network created by the deployment.

    7 / Retrieve the admin kubeconfig using the Azure CLI ($prefix refers to the environment prefix used during deployment):

    az aks get-credentials --resource-group $prefix-rg --name $prefix-aks --admin --overwrite-existing

    8 / Run the following command to list the nodes and availability zone configuration:

    kubectl describe nodes | grep -e "Name:" -e "failure-domain.beta.kubernetes.io/zone"
    Name:               aks-default-36042037-vmss000000
                        failure-domain.beta.kubernetes.io/zone=westeurope-1
    Name:               aks-default-36042037-vmss000001
                        failure-domain.beta.kubernetes.io/zone=westeurope-2

    failure-domain.beta.kubernetes.io/zone is a label associated with Kubernetes nodes that indicates the availability zone in which each node is deployed. The output shows that the nodes are deployed across two availability zones in the West Europe region.
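
    Alternatively, the zone label can be shown as a column using kubectl's -L flag:

    kubectl get nodes -L failure-domain.beta.kubernetes.io/zone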

    Configure the Azure Active Directory integration

    1 / Create a group in Azure AD:

    GROUP_ID=$(az ad group create --display-name dev --mail-nickname dev --query objectId -o tsv)

    2 / Retrieve the resource ID of the AKS cluster

    AKS_ID=$(az aks show \
        --resource-group $prefix-rg \
        --name $prefix-aks \
        --query id -o tsv)

    3 / Create an Azure role assignment so that any member of the dev group can use kubectl to interact with the Kubernetes cluster.

    az role assignment create \
      --assignee $GROUP_ID \
      --role "Azure Kubernetes Service Cluster User Role" \
      --scope $AKS_ID

    4 / Add yourself to the dev AD group.

    USER_ID=$(az ad signed-in-user show --query objectId -o tsv)
    az ad group member add --group dev --member-id $USER_ID

    5 / With the admin kubeconfig, create a development and a production Kubernetes namespace:

    kubectl create namespace development
    kubectl create namespace production

    6 / Replace groupObjectId with the object ID of the previously created group and apply the rolebinding.yaml file:

    sed -i '' "s/groupObjectId/$GROUP_ID/g" rolebinding.yaml
    kubectl apply -f rolebinding.yaml
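
    The rolebinding.yaml manifest itself is not reproduced in this article. Conceptually, it contains a Role and a RoleBinding that grant the Azure AD group access to the development namespace, roughly along the following lines; the groupObjectId placeholder is what the sed command above replaces, and the actual manifest in the repository may differ:

    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: dev-user-full-access
      namespace: development
    rules:
    - apiGroups: ["", "extensions", "apps"]
      resources: ["*"]
      verbs: ["*"]
    ---
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: dev-user-access
      namespace: development
    subjects:
    - kind: Group
      apiGroup: rbac.authorization.k8s.io
      name: groupObjectId   # replaced with the Azure AD group object ID by sed
    roleRef:
      kind: Role
      name: dev-user-full-access
      apiGroup: rbac.authorization.k8s.io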

    7 / Run the following command to get the cluster credentials before testing Azure AD integration.

    az aks get-credentials --resource-group $prefix-rg --name $prefix-aks --overwrite-existing

    8 / Run the following kubectl command to see the Azure AD integration in action:

    kubectl get pods --namespace development
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code DSFV9H98W to authenticate.
    No resources found in development namespace.

    Enter the code in the device login page followed by your Azure AD login credentials.

    Note that only users in the dev group will be able to log in through this process.

    9 / Try to access resources in the production namespace:

    kubectl get pods --namespace production
    Error from server (Forbidden): pods is forbidden: User "kentaro@codersociety.com" cannot list resource "pods" in API group "" in the namespace "production"
    kubectl get pods --namespace development
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code DSFV9H98W to authenticate.
    No resources found in development namespace.

    Configure network policies with Calico

    1 / To test the Calico network policy, create an httpbin service and deployment in the development namespace using the k8s/httpbin.yaml manifest:

    kubectl apply -f httpbin.yaml --namespace development
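
    The k8s/httpbin.yaml manifest is not shown here. A minimal equivalent would be a Deployment plus a Service named httpbin exposing port 8000, roughly like the following; the image and labels are assumptions, so refer to the repository for the actual file:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: httpbin
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: httpbin
      template:
        metadata:
          labels:
            app: httpbin
        spec:
          containers:
          - name: httpbin
            image: docker.io/kennethreitz/httpbin   # assumed image; serves HTTP on port 80
            ports:
            - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: httpbin
    spec:
      selector:
        app: httpbin
      ports:
      - port: 8000        # service port used by the tests below
        targetPort: 80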

    2 / Create a network policy which restricts all inbound access to the deployment using k8s/networkpolicy.yaml. We only allow network access from pods with the label app: webapp.

    kubectl apply -f networkpolicy.yaml --namespace development
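
    Again, the manifest lives in the repository. Conceptually, it is a NetworkPolicy that selects the httpbin pods and only admits ingress from pods labeled app: webapp, roughly as follows (the httpbin pod label is an assumption carried over from the sketch above):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: httpbin-allow-webapp
    spec:
      podSelector:
        matchLabels:
          app: httpbin          # assumed label on the httpbin pods
      ingress:
      - from:
        - podSelector:
            matchLabels:
              app: webapp       # only pods carrying this label may connect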

    3 / Create a new pod and test access to the httpbin service. From the pod's command prompt, try to access the httpbin service over port 8000. The access will time out. You can type "exit" to leave and delete the pod after testing.

    kubectl run --rm -it --image=alpine frontend --namespace development
    If you don't see a command prompt, try pressing enter.
    / # wget --timeout=2 http://httpbin:8000
    Connecting to httpbin:8000 (10.0.233.179:8000)
    wget: download timed out
    / # exit

    4 / Create a new test pod, but this time with labels matching the ingress rules. Then run the wget command to check access to the httpbin service over port 8000.

    kubectl run --rm -it --image=alpine frontend --labels app=webapp --namespace development
    If you don't see a command prompt, try pressing enter.
    / # wget --timeout=2 http://httpbin:8000
    Connecting to httpbin:8000 (10.0.233.179:8000)
    saving to 'index.html'
    index.html           100% |************************************************************************************|  9593  0:00:00 ETA
    'index.html' saved

    You can see that it is now possible to retrieve index.html, which shows that the pod can access the httpbin service, since the pod's labels match the ingress policy.

    Remove demo resources

    Go into the terraform directory and run terraform destroy. You will be asked whether you really want to delete the resources; confirm by entering yes.

    Summary

    Availability zones, Azure AD integration, and Calico network policies all help to achieve high availability, seamless identity management, and advanced network traffic management for applications deployed in AKS.

    Availability zones help protect your workloads from Azure data center failures and ensure production system resiliency. Azure AD integration is crucial for unifying the identity management of the cluster, as customers can continue to leverage their investments in Azure AD for managing AKS workloads as well. Calico network policies help enhance the security posture of line-of-business applications deployed in AKS by ensuring that only legitimate traffic reaches your workloads.

    These features are key for ensuring the production readiness of your AKS cluster.

    As a next step, the automated deployment of the AKS cluster covered in this article can also be integrated with your existing infrastructure-as-code DevOps pipelines for production-scale deployments.

    Originally published at https://codersociety.com.


    Kentaro Wakayama, Founder @ Coder Society

    With his in-depth knowledge of software development and cloud technologies, Kentaro often takes on the lead engineer's role. His analytical, organized, and people-oriented nature makes him an apt advisor on software projects and flexible staffing.
