Google Kubernetes Engine: Advanced Production Guide
Google Kubernetes Engine (GKE) is Google Cloud's managed Kubernetes service, offering enterprise-grade container orchestration with the simplicity of managed infrastructure. This guide covers advanced GKE features for production deployments.
Why GKE for Production?
GKE provides:
- Fully Managed Control Plane: Automatic upgrades and patches
- Autopilot Mode: Hands-off cluster management
- Advanced Security: Binary Authorization, Workload Identity
- Auto-scaling: Cluster, node, and pod-level scaling
- Multi-cluster Management: Anthos for hybrid deployments
Advanced Cluster Setup
Creating Production-Ready Clusters
# Create a regional cluster for high availability
gcloud container clusters create production-cluster \
--region us-central1 \
--num-nodes 2 \
--enable-autoscaling \
--min-nodes 2 \
--max-nodes 10 \
--enable-autorepair \
--enable-autoupgrade \
--release-channel stable \
--enable-ip-alias \
--network custom-vpc \
--subnetwork k8s-subnet \
--cluster-secondary-range-name pods \
--services-secondary-range-name services \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--maintenance-window-start 2020-01-01T00:00:00Z \
--maintenance-window-end 2020-01-01T04:00:00Z \
--maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA" \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--workload-pool=PROJECT_ID.svc.id.goog \
--enable-shielded-nodes \
--shielded-secure-boot \
--shielded-integrity-monitoring
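Once the cluster is up, fetch credentials and confirm the nodes register as Ready:
# Configure kubectl and sanity-check the cluster
gcloud container clusters get-credentials production-cluster \
--region us-central1
kubectl get nodes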
# Create Autopilot cluster for simplified management
gcloud container clusters create-auto autopilot-cluster \
--region us-central1 \
--release-channel stable \
--network custom-vpc \
--subnetwork k8s-subnet \
--enable-private-nodes \
--enable-private-endpoint \
--master-ipv4-cidr 172.16.0.0/28
Node Pool Configuration
# Create specialized node pools
gcloud container node-pools create high-memory-pool \
--cluster=production-cluster \
--region=us-central1 \
--machine-type=n2-highmem-4 \
--num-nodes=1 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=5 \
--node-labels=workload=memory-intensive \
--node-taints=memory-intensive=true:NoSchedule
# GPU node pool
gcloud container node-pools create gpu-pool \
--cluster=production-cluster \
--region=us-central1 \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--num-nodes=0 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=3 \
--node-labels=workload=gpu \
--node-taints=nvidia.com/gpu=present:NoSchedule
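Creating the pool does not by itself make GPUs schedulable: unless you opt into automatic driver installation (the gpu-driver-version option in the accelerator flag on newer gcloud releases), deploy Google's NVIDIA driver installer DaemonSet first:
# Install NVIDIA drivers on COS GPU nodes
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml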
# Spot instance node pool for cost optimization
gcloud container node-pools create spot-pool \
--cluster=production-cluster \
--region=us-central1 \
--spot \
--machine-type=e2-standard-4 \
--num-nodes=2 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=20 \
--node-labels=workload=batch,node-type=spot \
--node-taints=spot=true:NoSchedule
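Because the spot pool is tainted, workloads must opt in explicitly. A minimal sketch of a batch pod that tolerates the taint and targets the pool (the image and names are illustrative):
# spot-batch-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
  namespace: production
spec:
  nodeSelector:
    node-type: spot
  tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: worker
    image: gcr.io/PROJECT_ID/batch-worker:latest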
Workload Identity and Security
Setting Up Workload Identity
# Enable Workload Identity on existing cluster
gcloud container clusters update production-cluster \
--workload-pool=PROJECT_ID.svc.id.goog
# Create Google Service Account
gcloud iam service-accounts create gke-workload-sa \
--display-name="GKE Workload Service Account"
# Grant necessary permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
# Create Kubernetes Service Account
kubectl create serviceaccount workload-sa \
--namespace production
# Bind Kubernetes SA to Google SA
gcloud iam service-accounts add-iam-policy-binding \
gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:PROJECT_ID.svc.id.goog[production/workload-sa]"
# Annotate Kubernetes SA
kubectl annotate serviceaccount workload-sa \
--namespace production \
iam.gke.io/gcp-service-account=gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com
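With the IAM binding and annotation in place, any pod that runs as workload-sa authenticates as the Google service account without key files. A minimal sketch to verify (the image choice is illustrative; anything with the gcloud CLI works):
# wi-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-test
  namespace: production
spec:
  serviceAccountName: workload-sa
  containers:
  - name: test
    image: google/cloud-sdk:slim
    command: ["gcloud", "auth", "list"]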
Binary Authorization
# binary-authorization-policy.yaml
# Note: Binary Authorization policies imported with gcloud use a flat YAML
# format, not a Kubernetes-style apiVersion/kind wrapper.
globalPolicyEvaluationMode: ENABLE
admissionWhitelistPatterns:
- namePattern: gcr.io/PROJECT_ID/*
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
  - projects/PROJECT_ID/attestors/prod-attestor
clusterAdmissionRules:
  us-central1.production-cluster:
    evaluationMode: REQUIRE_ATTESTATION
    enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
    requireAttestationsBy:
    - projects/PROJECT_ID/attestors/prod-attestor
# Enable Binary Authorization
gcloud container binauthz policy import binary-authorization-policy.yaml
# Create attestor
gcloud container binauthz attestors create prod-attestor \
--attestation-authority-note=prod-attestor-note \
--attestation-authority-note-project=PROJECT_ID
# Create attestation
gcloud container binauthz attestations sign-and-create \
--artifact-url="gcr.io/PROJECT_ID/app:v1.0" \
--attestor="prod-attestor" \
--attestor-project="PROJECT_ID" \
--keyversion-project="PROJECT_ID" \
--keyversion-location="global" \
--keyversion-keyring="binauthz" \
--keyversion-key="attestor-key" \
--keyversion="1"
Advanced Networking
Service Mesh with Istio
# The Istio add-on for GKE is deprecated; on current clusters, use managed
# Anthos Service Mesh or install open-source Istio with istioctl:
istioctl install -f istio-control-plane.yaml

# istio-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
  namespace: istio-system
spec:
  # "production" is not a built-in profile; start from "default" and tune it
  profile: default
values:
pilot:
resources:
requests:
cpu: 1000m
memory: 1024Mi
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
components:
egressGateways:
- name: istio-egressgateway
enabled: true
k8s:
hpaSpec:
minReplicas: 2
maxReplicas: 5
ingressGateways:
- name: istio-ingressgateway
enabled: true
k8s:
hpaSpec:
minReplicas: 3
maxReplicas: 10
service:
type: LoadBalancer
loadBalancerIP: STATIC_IP
Network Policies
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: production
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: production
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
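Note that GKE enforces NetworkPolicy objects only when enforcement is enabled on the cluster (--enable-network-policy, or GKE Dataplane V2). Policies like the one above are typically paired with a default-deny baseline so that anything not explicitly allowed is blocked:
# default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress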
Auto-scaling Strategies
Horizontal Pod Autoscaling (HPA)
# hpa-advanced.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-deployment
minReplicas: 3
maxReplicas: 100
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
- type: External
external:
metric:
name: pubsub_queue_depth
selector:
matchLabels:
queue: work-queue
target:
type: Value
value: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 5
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max
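Note that the Pods and External metrics above do not resolve out of the box: they require a metrics adapter such as the Custom Metrics Stackdriver Adapter or the Prometheus adapter configured in the Monitoring section below.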
Vertical Pod Autoscaling (VPA)
# vpa-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-deployment
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2
memory: 4Gi
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
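A VerticalPodAutoscaler object only takes effect once VPA is enabled at the cluster level, and in "Auto" mode it should not manage the same CPU and memory signals an HPA is already scaling on, or the two controllers will fight:
# Enable Vertical Pod Autoscaling on the cluster
gcloud container clusters update production-cluster \
--region us-central1 \
--enable-vertical-pod-autoscaling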
Cluster Autoscaling Configuration
# On GKE the cluster autoscaler is fully managed; the cluster-autoscaler-status
# ConfigMap in kube-system is read-only status, not configuration. Tune the
# autoscaler per node pool and through the cluster-wide autoscaling profile.

# Favor bin-packing over headroom to reduce cost
gcloud container clusters update production-cluster \
--region us-central1 \
--autoscaling-profile optimize-utilization

# Adjust autoscaling limits on an existing node pool
gcloud container clusters update production-cluster \
--region us-central1 \
--node-pool spot-pool \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 20
Production Deployments
Blue-Green Deployment
# blue-green-deployment.yaml
apiVersion: v1
kind: Service
metadata:
name: api-service
namespace: production
spec:
selector:
app: api
version: blue # Switch between blue and green
ports:
- port: 80
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-blue
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: api
version: blue
template:
metadata:
labels:
app: api
version: blue
spec:
containers:
- name: api
image: gcr.io/PROJECT_ID/api:v1.0
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-green
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: api
version: green
template:
metadata:
labels:
app: api
version: green
spec:
containers:
- name: api
image: gcr.io/PROJECT_ID/api:v2.0
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
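Cutting over is a single selector change on the Service, so traffic shifts atomically from blue to green (run the same patch with version: blue to roll back):
# Shift traffic to the green deployment
kubectl patch service api-service -n production \
-p '{"spec":{"selector":{"app":"api","version":"green"}}}'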
Canary Deployment with Flagger
# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-canary
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
progressDeadlineSeconds: 300
service:
port: 80
targetPort: 8080
gateways:
- public-gateway.istio-system.svc.cluster.local
hosts:
- api.example.com
analysis:
interval: 30s
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 30s
- name: request-duration
thresholdRange:
max: 500
interval: 30s
webhooks:
- name: load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://api.production:80/"
Monitoring and Observability
Custom Metrics with Prometheus
# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
seriesFilters: []
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total$"
as: "${1}_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
- seriesQuery: 'pubsub_queue_depth{topic!=""}'
resources:
template: <<.Resource>>
name:
matches: "^(.*)$"
as: "${1}"
metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>})'
Application Performance Monitoring
# apm-deployment.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: apm-agent
namespace: monitoring
spec:
selector:
matchLabels:
app: apm-agent
template:
metadata:
labels:
app: apm-agent
spec:
serviceAccountName: apm-agent
containers:
- name: agent
image: gcr.io/PROJECT_ID/apm-agent:latest
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /var/secrets/google/key.json
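        # NOTE: a mounted key file is shown here for portability; on GKE,
        # Workload Identity (configured earlier) is the preferred way to
        # authenticate and removes the need for this key volume.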
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
limits:
memory: 500Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: google-cloud-key
mountPath: /var/secrets/google
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: google-cloud-key
secret:
secretName: apm-key
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Disaster Recovery and Backup
Velero Backup Configuration
# Install Velero for GKE backups
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.5.0 \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config serviceAccount=velero@PROJECT_ID.iam.gserviceaccount.com \
--snapshot-location-config project=PROJECT_ID,snapshotLocation=us-central1
# Create backup schedule
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces production,staging \
--exclude-resources events,events.events.k8s.io \
--ttl 720h0m0s
# Create pre-backup hooks
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: database-backup
namespace: production
annotations:
pre.hook.backup.velero.io/container: database-backup
pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "pg_dump -h $DB_HOST -U $DB_USER -d $DB_NAME > /backup/dump.sql"]'
spec:
containers:
- name: database-backup
image: postgres:13
  env:
  - name: DB_HOST
    value: postgres-service
  # DB_USER and DB_NAME are referenced by the pg_dump hook above
  # (example values; in practice source them from a Secret)
  - name: DB_USER
    value: postgres
  - name: DB_NAME
    value: app
volumeMounts:
- name: backup
mountPath: /backup
volumes:
- name: backup
persistentVolumeClaim:
claimName: backup-pvc
EOF
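A backup schedule is only proven once a restore works, so it is worth exercising restores regularly:
# List backups, then restore from the latest backup of the daily schedule
velero backup get
velero restore create --from-schedule daily-backup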
Multi-Region Failover
# multi-region-ingress.yaml
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
name: api-multicluster-ingress
namespace: production
spec:
template:
spec:
backend:
serviceName: api-multicluster-service
servicePort: 80
rules:
- host: api.example.com
http:
paths:
- path: /*
backend:
serviceName: api-multicluster-service
servicePort: 80
---
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
name: api-multicluster-service
namespace: production
spec:
template:
spec:
selector:
app: api
ports:
- port: 80
targetPort: 8080
clusters:
- link: "us-central1/production-cluster"
- link: "europe-west1/production-cluster-eu"
- link: "asia-southeast1/production-cluster-asia"
Cost Optimization
Pod Disruption Budgets
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
namespace: production
spec:
minAvailable: 2
selector:
matchLabels:
app: api
Resource Quotas
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "1000"
requests.memory: "1000Gi"
limits.cpu: "2000"
limits.memory: "2000Gi"
persistentvolumeclaims: "20"
services.loadbalancers: "5"
pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
name: production-limits
namespace: production
spec:
limits:
- max:
cpu: "2"
memory: "4Gi"
min:
cpu: "100m"
memory: "128Mi"
default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "200m"
memory: "256Mi"
type: Container
- max:
storage: "10Gi"
type: PersistentVolumeClaim
CI/CD Integration
GitOps with Config Sync
# config-sync.yaml
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
name: config-management
spec:
clusterName: production-cluster
git:
    syncRepo: git@github.com:company/k8s-config
syncBranch: main
secretType: ssh
policyDir: "clusters/production"
policyController:
enabled: true
templateLibraryInstalled: true
referentialRulesEnabled: true
logDeniesEnabled: true
mutationEnabled: true
hierarchyController:
enabled: true
enablePodTreeLabels: true
sourceFormat: unstructured
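Once Config Sync is running, the nomos CLI gives a quick fleet-wide view of sync health:
# Check sync status across registered clusters
nomos status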
Cloud Build Integration
# cloudbuild.yaml
steps:
# Build Docker image
- name: 'gcr.io/cloud-builders/docker'
args: ['build', '-t', 'gcr.io/$PROJECT_ID/api:$SHORT_SHA', '.']
# Run tests
- name: 'gcr.io/$PROJECT_ID/api:$SHORT_SHA'
args: ['npm', 'test']
# Push image
- name: 'gcr.io/cloud-builders/docker'
args: ['push', 'gcr.io/$PROJECT_ID/api:$SHORT_SHA']
# Deploy to GKE
- name: 'gcr.io/cloud-builders/gke-deploy'
args:
- run
- --filename=k8s/
- --image=gcr.io/$PROJECT_ID/api:$SHORT_SHA
- --cluster=production-cluster
- --location=us-central1
- --namespace=production
# Run smoke tests
- name: 'gcr.io/cloud-builders/gcloud'
args:
- 'builds'
- 'submit'
- '--config=smoke-tests/cloudbuild.yaml'
- '--substitutions=_ENDPOINT=${_ENDPOINT}'
options:
machineType: 'N1_HIGHCPU_8'
substitutionOption: 'ALLOW_LOOSE'
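To run this pipeline on every push, wire it to a trigger; the repository owner and name below are illustrative:
# Create a GitHub trigger for the main branch
gcloud builds triggers create github \
--repo-owner=company \
--repo-name=api \
--branch-pattern='^main$' \
--build-config=cloudbuild.yaml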
Security Hardening
Pod Security Standards
# PodSecurityPolicy was removed in Kubernetes 1.25 and is no longer available
# on supported GKE versions. Enforce the built-in Pod Security Standards with
# namespace labels (Pod Security Admission) instead.
# pod-security.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Surface warnings and audit-log violations as well
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
Security Scanning
# Scan an image for vulnerabilities (On-Demand Scanning API)
gcloud artifacts docker images scan IMAGE_URL
# Configure admission controller for security policies
kubectl apply -f - <<EOF
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: security-webhook
webhooks:
- name: validate.security.company.com
clientConfig:
service:
name: security-webhook
namespace: kube-system
path: "/validate"
caBundle: $(cat ca.crt | base64 | tr -d '\n')
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: ["apps", ""]
apiVersions: ["v1"]
resources: ["deployments", "pods"]
admissionReviewVersions: ["v1", "v1beta1"]
sideEffects: None
failurePolicy: Fail
namespaceSelector:
matchLabels:
security-scanning: "enabled"
EOF
Conclusion
Google Kubernetes Engine provides a robust platform for running production workloads at scale. By leveraging advanced features like Workload Identity, auto-scaling, and comprehensive monitoring, you can build resilient and efficient container orchestration systems.
Best Practices Summary
- Use Regional Clusters: For high availability in production
- Enable Workload Identity: For secure pod authentication
- Implement Auto-scaling: At cluster, node, and pod levels
- Use Binary Authorization: For supply chain security
- Monitor Everything: Leverage GCP's observability stack
- Plan for Disaster Recovery: Regular backups and multi-region setup
- Optimize Costs: Use spot instances and resource quotas
- Secure by Default: Pod security policies and network policies
Next Steps
- Explore Anthos for multi-cloud Kubernetes management
- Implement service mesh with Istio or Anthos Service Mesh
- Study advanced GKE security features
- Get certified as a Kubernetes Administrator (CKA)
- Learn about GKE Autopilot for simplified operations
Remember: GKE's strength is in providing enterprise-grade Kubernetes with Google's infrastructure expertise. Use managed features to reduce operational overhead while maintaining flexibility.