Google Kubernetes Engine: Advanced Production Guide

Tyler Maginnis | February 01, 2024

Google Cloud, GKE, Kubernetes, containers, DevOps

Google Kubernetes Engine (GKE) is Google Cloud's managed Kubernetes service, offering enterprise-grade container orchestration with the simplicity of managed infrastructure. This guide covers advanced GKE features for production deployments.

Why GKE for Production?

GKE provides:

  • Fully Managed Control Plane: Automatic upgrades and patches
  • Autopilot Mode: Hands-off cluster management
  • Advanced Security: Binary Authorization, Workload Identity
  • Auto-scaling: Cluster, node, and pod-level scaling
  • Multi-cluster Management: Anthos for hybrid deployments

Advanced Cluster Setup

Creating Production-Ready Clusters

# Create a regional cluster for high availability
gcloud container clusters create production-cluster \
    --region us-central1 \
    --num-nodes 2 \
    --enable-autoscaling \
    --min-nodes 2 \
    --max-nodes 10 \
    --enable-autorepair \
    --enable-autoupgrade \
    --release-channel stable \
    --enable-ip-alias \
    --network custom-vpc \
    --subnetwork k8s-subnet \
    --cluster-secondary-range-name pods \
    --services-secondary-range-name services \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM \
    --maintenance-window-start 2020-01-01T00:00:00Z \
    --maintenance-window-end 2020-01-01T04:00:00Z \
    --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA" \
    --addons HorizontalPodAutoscaling,HttpLoadBalancing,CloudRun \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --enable-shielded-nodes \
    --shielded-secure-boot \
    --shielded-integrity-monitoring

# Create Autopilot cluster for simplified management
gcloud container clusters create-auto autopilot-cluster \
    --region us-central1 \
    --release-channel stable \
    --network custom-vpc \
    --subnetwork k8s-subnet \
    --enable-private-nodes \
    --enable-private-endpoint \
    --master-ipv4-cidr 172.16.0.0/28
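
After creation, fetch credentials and confirm the control plane and nodes are healthy:

# Fetch kubeconfig credentials and verify the cluster is reachable
gcloud container clusters get-credentials production-cluster \
    --region us-central1

kubectl get nodes -o wide
kubectl cluster-info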

Node Pool Configuration

# Create specialized node pools
gcloud container node-pools create high-memory-pool \
    --cluster=production-cluster \
    --region=us-central1 \
    --machine-type=n2-highmem-4 \
    --num-nodes=1 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=5 \
    --node-labels=workload=memory-intensive \
    --node-taints=memory-intensive=true:NoSchedule

# GPU node pool
gcloud container node-pools create gpu-pool \
    --cluster=production-cluster \
    --region=us-central1 \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --num-nodes=0 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=3 \
    --node-labels=workload=gpu \
    --node-taints=nvidia.com/gpu=present:NoSchedule

# Spot instance node pool for cost optimization
gcloud container node-pools create spot-pool \
    --cluster=production-cluster \
    --region=us-central1 \
    --spot \
    --machine-type=e2-standard-4 \
    --num-nodes=2 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=20 \
    --node-labels=workload=batch,node-type=spot \
    --node-taints=spot=true:NoSchedule
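
Pods only land on these pools if they tolerate the matching taint. A minimal sketch of a batch Job targeting the spot pool (the image name is a placeholder):

# Inline Job that tolerates the spot taint and selects spot nodes
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
  namespace: production
spec:
  backoffLimit: 2
  template:
    spec:
      nodeSelector:
        node-type: spot
      tolerations:
      - key: spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: worker
        image: gcr.io/PROJECT_ID/batch-worker:latest  # placeholder image
      restartPolicy: Never
EOF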

Workload Identity and Security

Setting Up Workload Identity

# Enable Workload Identity on existing cluster
gcloud container clusters update production-cluster \
    --workload-pool=PROJECT_ID.svc.id.goog

# Create Google Service Account
gcloud iam service-accounts create gke-workload-sa \
    --display-name="GKE Workload Service Account"

# Grant necessary permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

# Create the namespace (if needed) and the Kubernetes Service Account
kubectl create namespace production

kubectl create serviceaccount workload-sa \
    --namespace production

# Bind Kubernetes SA to Google SA
gcloud iam service-accounts add-iam-policy-binding \
    gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[production/workload-sa]"

# Annotate Kubernetes SA
kubectl annotate serviceaccount workload-sa \
    --namespace production \
    iam.gke.io/gcp-service-account=gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com
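
To verify the binding, run a short-lived pod as the Kubernetes service account and check which identity the metadata server returns; a sketch, using the cloud-sdk image purely as a convenient test container:

# Should print gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com
kubectl run wi-test -it --rm --restart=Never \
    --namespace production \
    --image=google/cloud-sdk:slim \
    --overrides='{"apiVersion":"v1","spec":{"serviceAccountName":"workload-sa"}}' \
    -- gcloud auth list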

Binary Authorization

# binary-authorization-policy.yaml
apiVersion: binaryauthorization.grafeas.io/v1beta1
kind: Policy
metadata:
  name: binary-authorization-policy
spec:
  globalPolicyEvaluationMode: ENABLE
  admissionWhitelistPatterns:
  - namePattern: gcr.io/my-project/*
  defaultAdmissionRule:
    evaluationMode: REQUIRE_ATTESTATION
    enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
    requireAttestationsBy:
    - projects/PROJECT_ID/attestors/prod-attestor
  clusterAdmissionRules:
    us-central1.production-cluster:
      evaluationMode: REQUIRE_ATTESTATION
      enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
      requireAttestationsBy:
      - projects/PROJECT_ID/attestors/prod-attestor

# Create the attestor first, since the policy references it
gcloud container binauthz attestors create prod-attestor \
    --attestation-authority-note=prod-attestor-note \
    --attestation-authority-note-project=PROJECT_ID

# Import the Binary Authorization policy
gcloud container binauthz policy import binary-authorization-policy.yaml

# Create attestation
gcloud container binauthz attestations sign-and-create \
    --artifact-url="gcr.io/PROJECT_ID/app:v1.0" \
    --attestor="prod-attestor" \
    --attestor-project="PROJECT_ID" \
    --keyversion-project="PROJECT_ID" \
    --keyversion-location="global" \
    --keyversion-keyring="binauthz" \
    --keyversion-key="attestor-key" \
    --keyversion="1"
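
Confirm the attestation is recorded before relying on the policy:

# List attestations for the image as seen by the attestor
gcloud container binauthz attestations list \
    --attestor=prod-attestor \
    --attestor-project=PROJECT_ID \
    --artifact-url="gcr.io/PROJECT_ID/app:v1.0"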

Advanced Networking

Service Mesh with Istio

# The managed Istio add-on is deprecated on current GKE versions; prefer
# Anthos Service Mesh, or install open-source Istio with istioctl.
# On older clusters, the add-on could be enabled with:
gcloud container clusters update production-cluster \
    --update-addons=Istio=ENABLED

# Apply a custom IstioOperator configuration (requires the Istio operator;
# alternatively, save the file and run: istioctl install -f istio-config.yaml)
kubectl apply -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
spec:
  profile: default
  values:
    pilot:
      resources:
        requests:
          cpu: 1000m
          memory: 1024Mi
    global:
      proxy:
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
  components:
    egressGateways:
    - name: istio-egressgateway
      enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2
          maxReplicas: 5
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        hpaSpec:
          minReplicas: 3
          maxReplicas: 10
        service:
          type: LoadBalancer
          loadBalancerIP: STATIC_IP
EOF
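
With the control plane installed, opt namespaces into automatic sidecar injection; existing pods pick up the Envoy proxy on their next restart:

# Enable sidecar injection for the production namespace
kubectl label namespace production istio-injection=enabled

# Recreate pods so they come back with the sidecar
kubectl rollout restart deployment -n production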

Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: production
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: production
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
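
After applying the policy, traffic that matches no rule should time out, assuming the cluster has network policy enforcement enabled (Dataplane V2 or the Calico add-on). A quick probe from an unlabeled namespace, expected to be blocked; the api Service name is an assumption based on the pod labels above:

kubectl apply -f network-policy.yaml

# From the default namespace, this request should hang and time out
kubectl run np-test -it --rm --restart=Never \
    --namespace default \
    --image=busybox \
    -- wget -qO- --timeout=2 http://api.production:8080/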

Auto-scaling Strategies

Horizontal Pod Autoscaling (HPA)

# hpa-advanced.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  - type: External
    external:
      metric:
        name: pubsub_queue_depth
        selector:
          matchLabels:
            queue: work-queue
      target:
        type: Value
        value: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 10
        periodSeconds: 60
      selectPolicy: Max
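
The Pods and External metrics above only resolve if an adapter (such as the Prometheus adapter configured later in this guide, or the Stackdriver custom-metrics adapter) is serving the custom and external metrics APIs. Watch scaling decisions with:

# Observe current targets and scaling events
kubectl -n production get hpa api-hpa --watch
kubectl -n production describe hpa api-hpa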

Vertical Pod Autoscaling (VPA)

# vpa-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
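
On GKE, Vertical Pod Autoscaling must be enabled at the cluster level before VerticalPodAutoscaler objects take effect. Also avoid updateMode "Auto" on a Deployment whose HPA already scales on CPU or memory, since the two controllers will fight over resource settings:

# Enable Vertical Pod Autoscaling on the cluster
gcloud container clusters update production-cluster \
    --region us-central1 \
    --enable-vertical-pod-autoscaling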

Cluster Autoscaling Configuration

# GKE runs the cluster autoscaler as part of the managed control plane,
# so it is tuned through gcloud rather than autoscaler flags or ConfigMaps.

# Choose an autoscaling profile: balanced (default) or optimize-utilization,
# which scales down idle nodes more aggressively
gcloud container clusters update production-cluster \
    --region us-central1 \
    --autoscaling-profile optimize-utilization

# Adjust autoscaling limits on an existing node pool
gcloud container clusters update production-cluster \
    --region us-central1 \
    --node-pool default-pool \
    --enable-autoscaling \
    --min-nodes 3 \
    --max-nodes 100

Production Deployments

Blue-Green Deployment

# blue-green-deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    app: api
    version: blue  # Switch between blue and green
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
      - name: api
        image: gcr.io/PROJECT_ID/api:v1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green
    spec:
      containers:
      - name: api
        image: gcr.io/PROJECT_ID/api:v2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
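
Cutting over is a one-line change to the Service selector, and rollback is the same change in reverse:

# Switch live traffic from blue to green
kubectl patch service api-service -n production \
    -p '{"spec":{"selector":{"app":"api","version":"green"}}}'

# Roll back by pointing the selector at blue again
kubectl patch service api-service -n production \
    -p '{"spec":{"selector":{"app":"api","version":"blue"}}}'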

Canary Deployment with Flagger

# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-canary
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  progressDeadlineSeconds: 300
  service:
    port: 80
    targetPort: 8080
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    hosts:
    - api.example.com
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 30s
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://api.production:80/"
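
Flagger shifts traffic in stepWeight increments while the analysis passes and rolls back automatically on threshold breaches. Follow a rollout with:

# Watch the canary advance through its analysis steps
kubectl -n production get canary api-canary --watch

# Inspect events if a rollout halts or rolls back
kubectl -n production describe canary api-canary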

Monitoring and Observability

Custom Metrics with Prometheus

# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      seriesFilters: []
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
    - seriesQuery: 'pubsub_queue_depth{topic!=""}'
      resources:
        template: <<.Resource>>
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>})'
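
Once the adapter is running, confirm the custom metrics API is serving the translated series (jq is used here only for readability):

# List metrics exposed through the custom metrics API
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name'

# Query the per-pod request rate the HPA consumes
kubectl get --raw \
    "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .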

Application Performance Monitoring

# apm-deployment.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: apm-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: apm-agent
  template:
    metadata:
      labels:
        app: apm-agent
    spec:
      serviceAccountName: apm-agent
      containers:
      - name: agent
        image: gcr.io/PROJECT_ID/apm-agent:latest
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/key.json
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: google-cloud-key
          mountPath: /var/secrets/google
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: google-cloud-key
        secret:
          secretName: apm-key
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
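
The DaemonSet assumes an apm-key secret already exists in the monitoring namespace; with Workload Identity enabled you could drop the key file entirely and annotate the agent's service account instead. A sketch of the secret creation, where key.json is a downloaded service account key:

# Create the credentials secret the DaemonSet mounts
kubectl create secret generic apm-key \
    --namespace monitoring \
    --from-file=key.json=./key.json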

Disaster Recovery and Backup

Velero Backup Configuration

# Install Velero for GKE backups
velero install \
    --provider gcp \
    --plugins velero/velero-plugin-for-gcp:v1.5.0 \
    --bucket velero-backups \
    --secret-file ./credentials-velero \
    --backup-location-config serviceAccount=velero@PROJECT_ID.iam.gserviceaccount.com \
    --snapshot-location-config project=PROJECT_ID,snapshotLocation=us-central1

# Create backup schedule
velero schedule create daily-backup \
    --schedule="0 2 * * *" \
    --include-namespaces production,staging \
    --exclude-resources events,events.events.k8s.io \
    --ttl 720h0m0s

# Create pre-backup hooks
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: database-backup
  namespace: production
  annotations:
    pre.hook.backup.velero.io/container: database-backup
    pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "pg_dump -h $DB_HOST -U $DB_USER -d $DB_NAME > /backup/dump.sql"]'
spec:
  containers:
  - name: database-backup
    image: postgres:13
    env:
    - name: DB_HOST
      value: postgres-service
    - name: DB_USER
      value: postgres   # placeholder; set to your database user
    - name: DB_NAME
      value: appdb      # placeholder; set to your database name
    volumeMounts:
    - name: backup
      mountPath: /backup
  volumes:
  - name: backup
    persistentVolumeClaim:
      claimName: backup-pvc
EOF
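
Backups are only as good as the last restore you tested. A periodic drill might look like this (the backup name is illustrative; list real names with velero backup get):

# List available backups
velero backup get

# Restore from a backup (non-destructive by default: existing resources are kept)
velero restore create --from-backup daily-backup-20240201020000

# Check restore status and warnings
velero restore get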

Multi-Region Failover

# multi-region-ingress.yaml
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: api-multicluster-ingress
  namespace: production
spec:
  template:
    spec:
      backend:
        serviceName: api-multicluster-service
        servicePort: 80
      rules:
      - host: api.example.com
        http:
          paths:
          - path: /*
            backend:
              serviceName: api-multicluster-service
              servicePort: 80
---
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: api-multicluster-service
  namespace: production
spec:
  template:
    spec:
      selector:
        app: api
      ports:
      - port: 80
        targetPort: 8080
  clusters:
  - link: "us-central1/production-cluster"
  - link: "europe-west1/production-cluster-eu"
  - link: "asia-southeast1/production-cluster-asia"
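
MultiClusterIngress requires the clusters to be registered to a fleet, with the feature enabled and one cluster designated as the config cluster; a sketch for the clusters above:

# Register the primary cluster to the fleet
gcloud container fleet memberships register production-cluster \
    --gke-cluster=us-central1/production-cluster \
    --enable-workload-identity

# Enable Multi Cluster Ingress, using the primary cluster as config cluster
gcloud container fleet ingress enable \
    --config-membership=production-cluster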

Cost Optimization

Pod Disruption Budgets

PDBs cap voluntary disruptions (node drains, autoscaler scale-down) so the cluster autoscaler can consolidate nodes, and save money, without dropping below your availability floor.

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2   # a PDB may set minAvailable or maxUnavailable, but not both
  selector:
    matchLabels:
      app: api

Resource Quotas

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "1000"
    requests.memory: "1000Gi"
    limits.cpu: "2000"
    limits.memory: "2000Gi"
    persistentvolumeclaims: "20"
    services.loadbalancers: "5"
    pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "2"
      memory: "4Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "200m"
      memory: "256Mi"
    type: Container
  - max:
      storage: "10Gi"
    type: PersistentVolumeClaim
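
Review consumption against the quota regularly so limits can be raised before they block deployments:

# Show current usage versus the configured hard limits
kubectl describe resourcequota production-quota -n production
kubectl describe limitrange production-limits -n production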

CI/CD Integration

GitOps with Config Sync

# config-sync.yaml
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  clusterName: production-cluster
  git:
    syncRepo: https://github.com/company/k8s-config
    syncBranch: main
    secretType: ssh
    policyDir: "clusters/production"
  policyController:
    enabled: true
    templateLibraryInstalled: true
    referentialRulesEnabled: true
    logDeniesEnabled: true
    mutationEnabled: true
  hierarchyController:
    enabled: true
    enablePodTreeLabels: true
  sourceFormat: unstructured
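
Check sync health with the nomos CLI that ships alongside Config Sync:

# Verify each cluster is synced to the expected commit
nomos status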

Cloud Build Integration

# cloudbuild.yaml
steps:
# Build Docker image
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/api:$SHORT_SHA', '.']

# Run tests (override the image entrypoint so npm runs directly)
- name: 'gcr.io/$PROJECT_ID/api:$SHORT_SHA'
  entrypoint: 'npm'
  args: ['test']

# Push image
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/api:$SHORT_SHA']

# Deploy to GKE
- name: 'gcr.io/cloud-builders/gke-deploy'
  args:
  - run
  - --filename=k8s/
  - --image=gcr.io/$PROJECT_ID/api:$SHORT_SHA
  - --cluster=production-cluster
  - --location=us-central1
  - --namespace=production

# Run smoke tests
- name: 'gcr.io/cloud-builders/gcloud'
  args:
  - 'builds'
  - 'submit'
  - '--config=smoke-tests/cloudbuild.yaml'
  - '--substitutions=_ENDPOINT=${_ENDPOINT}'

options:
  machineType: 'N1_HIGHCPU_8'
  substitutionOption: 'ALLOW_LOOSE'
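
Wire the pipeline to your repository with a build trigger (the repo owner and name here are placeholders):

# Run the pipeline on every push to main
gcloud builds triggers create github \
    --repo-owner=company \
    --repo-name=api \
    --branch-pattern='^main$' \
    --build-config=cloudbuild.yaml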

Security Hardening

Pod Security Standards

PodSecurityPolicy, shown below for legacy clusters, was deprecated in Kubernetes 1.21 and removed in 1.25; on current GKE versions, enforce equivalent restrictions with Pod Security Admission namespace labels (see the example after the manifest) or Policy Controller.

# pod-security-policy.yaml (legacy clusters only)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: true
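
On Kubernetes 1.23 and later, the equivalent guardrails come from Pod Security Admission labels on each namespace:

# Enforce the "restricted" Pod Security Standard on the production namespace
kubectl label namespace production \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/warn=restricted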

Security Scanning

# Scan an image on demand with the On-Demand Scanning API
gcloud artifacts docker images scan IMAGE_URL

# Configure admission controller for security policies
kubectl apply -f - <<EOF
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: security-webhook
webhooks:
- name: validate.security.company.com
  clientConfig:
    service:
      name: security-webhook
      namespace: kube-system
      path: "/validate"
    caBundle: $(cat ca.crt | base64 | tr -d '\n')
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: ["apps", ""]
    apiVersions: ["v1"]
    resources: ["deployments", "pods"]
  admissionReviewVersions: ["v1", "v1beta1"]
  sideEffects: None
  failurePolicy: Fail
  namespaceSelector:
    matchLabels:
      security-scanning: "enabled"
EOF

Conclusion

Google Kubernetes Engine provides a robust platform for running production workloads at scale. By leveraging advanced features like Workload Identity, auto-scaling, and comprehensive monitoring, you can build resilient and efficient container orchestration systems.

Best Practices Summary

  1. Use Regional Clusters: For high availability in production
  2. Enable Workload Identity: For secure pod authentication
  3. Implement Auto-scaling: At cluster, node, and pod levels
  4. Use Binary Authorization: For supply chain security
  5. Monitor Everything: Leverage GCP's observability stack
  6. Plan for Disaster Recovery: Regular backups and multi-region setup
  7. Optimize Costs: Use spot instances and resource quotas
  8. Secure by Default: Pod Security Standards and network policies

Next Steps

  • Explore Anthos for multi-cloud Kubernetes management
  • Implement service mesh with Istio or Anthos Service Mesh
  • Study advanced GKE security features
  • Get certified as a Kubernetes Administrator (CKA)
  • Learn about GKE Autopilot for simplified operations

Remember: GKE's strength is in providing enterprise-grade Kubernetes with Google's infrastructure expertise. Use managed features to reduce operational overhead while maintaining flexibility.