Google Kubernetes Engine: Advanced Production Guide
Google Kubernetes Engine (GKE) is Google Cloud's managed Kubernetes service, offering enterprise-grade container orchestration with the simplicity of managed infrastructure. This guide covers advanced GKE features for production deployments.
Why GKE for Production?
GKE provides:
- Fully Managed Control Plane: Automatic upgrades and patches
- Autopilot Mode: Hands-off cluster management
- Advanced Security: Binary Authorization, Workload Identity
- Auto-scaling: Cluster, node, and pod-level scaling
- Multi-cluster Management: Anthos for hybrid deployments
Advanced Cluster Setup
Creating Production-Ready Clusters
# Create a regional cluster for high availability
gcloud container clusters create production-cluster \
--region us-central1 \
--num-nodes 2 \
--enable-autoscaling \
--min-nodes 2 \
--max-nodes 10 \
--enable-autorepair \
--enable-autoupgrade \
--release-channel stable \
--enable-ip-alias \
--network custom-vpc \
--subnetwork k8s-subnet \
--cluster-secondary-range-name pods \
--services-secondary-range-name services \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--maintenance-window-start 2020-01-01T00:00:00Z \
--maintenance-window-end 2020-01-01T04:00:00Z \
--maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA" \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--workload-pool=PROJECT_ID.svc.id.goog \
--enable-shielded-nodes \
--shielded-secure-boot \
--shielded-integrity-monitoring
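Once the cluster is up, fetch credentials and confirm the nodes register as Ready:
# Configure kubectl and sanity-check the cluster
gcloud container clusters get-credentials production-cluster \
--region us-central1
kubectl get nodes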
# Create Autopilot cluster for simplified management
gcloud container clusters create-auto autopilot-cluster \
--region us-central1 \
--release-channel stable \
--network custom-vpc \
--subnetwork k8s-subnet \
--enable-private-nodes \
--enable-private-endpoint \
--master-ipv4-cidr 172.16.0.0/28
Node Pool Configuration
# Create specialized node pools
gcloud container node-pools create high-memory-pool \
--cluster=production-cluster \
--region=us-central1 \
--machine-type=n2-highmem-4 \
--num-nodes=1 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=5 \
--node-labels=workload=memory-intensive \
--node-taints=memory-intensive=true:NoSchedule
# GPU node pool
gcloud container node-pools create gpu-pool \
--cluster=production-cluster \
--region=us-central1 \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--num-nodes=0 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=3 \
--node-labels=workload=gpu \
--node-taints=nvidia.com/gpu=present:NoSchedule
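Creating the pool does not by itself make GPUs schedulable: unless you opt into automatic driver installation (the gpu-driver-version option in the accelerator flag on newer gcloud releases), deploy Google's NVIDIA driver installer DaemonSet first:
# Install NVIDIA drivers on COS GPU nodes
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml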
# Spot instance node pool for cost optimization
gcloud container node-pools create spot-pool \
--cluster=production-cluster \
--region=us-central1 \
--spot \
--machine-type=e2-standard-4 \
--num-nodes=2 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=20 \
--node-labels=workload=batch,node-type=spot \
--node-taints=spot=true:NoSchedule
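Because the spot pool is tainted, workloads must opt in explicitly. A minimal sketch of a batch pod that tolerates the taint and targets the pool (the image and names are illustrative):
# spot-batch-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
  namespace: production
spec:
  nodeSelector:
    node-type: spot
  tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: worker
    image: gcr.io/PROJECT_ID/batch-worker:latest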
Workload Identity and Security
Setting Up Workload Identity
# Enable Workload Identity on existing cluster
gcloud container clusters update production-cluster \
--workload-pool=PROJECT_ID.svc.id.goog
# Create Google Service Account
gcloud iam service-accounts create gke-workload-sa \
--display-name="GKE Workload Service Account"
# Grant necessary permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
# Create Kubernetes Service Account
kubectl create serviceaccount workload-sa \
--namespace production
# Bind Kubernetes SA to Google SA
gcloud iam service-accounts add-iam-policy-binding \
gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:PROJECT_ID.svc.id.goog[production/workload-sa]"
# Annotate Kubernetes SA
kubectl annotate serviceaccount workload-sa \
--namespace production \
iam.gke.io/gcp-service-account=gke-workload-sa@PROJECT_ID.iam.gserviceaccount.com
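With the IAM binding and annotation in place, any pod that runs as workload-sa authenticates as the Google service account without key files. A minimal sketch to verify (the image choice is illustrative; anything with the gcloud CLI works):
# wi-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-test
  namespace: production
spec:
  serviceAccountName: workload-sa
  containers:
  - name: test
    image: google/cloud-sdk:slim
    command: ["gcloud", "auth", "list"]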
Binary Authorization
# binary-authorization-policy.yaml
# Note: Binary Authorization policies imported with gcloud use a flat YAML
# format, not a Kubernetes-style apiVersion/kind wrapper.
globalPolicyEvaluationMode: ENABLE
admissionWhitelistPatterns:
- namePattern: gcr.io/PROJECT_ID/*
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
  - projects/PROJECT_ID/attestors/prod-attestor
clusterAdmissionRules:
  us-central1.production-cluster:
    evaluationMode: REQUIRE_ATTESTATION
    enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
    requireAttestationsBy:
    - projects/PROJECT_ID/attestors/prod-attestor
# Enable Binary Authorization
gcloud container binauthz policy import binary-authorization-policy.yaml
# Create attestor
gcloud container binauthz attestors create prod-attestor \
--attestation-authority-note=prod-attestor-note \
--attestation-authority-note-project=PROJECT_ID
# Create attestation
gcloud container binauthz attestations sign-and-create \
--artifact-url="gcr.io/PROJECT_ID/app:v1.0" \
--attestor="prod-attestor" \
--attestor-project="PROJECT_ID" \
--keyversion-project="PROJECT_ID" \
--keyversion-location="global" \
--keyversion-keyring="binauthz" \
--keyversion-key="attestor-key" \
--keyversion="1"
Advanced Networking
Service Mesh with Istio
# The Istio add-on for GKE is deprecated; on current clusters, use managed
# Anthos Service Mesh or install open-source Istio with istioctl:
istioctl install -f istio-control-plane.yaml

# istio-control-plane.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
  namespace: istio-system
spec:
  # "production" is not a built-in profile; start from "default" and tune it
  profile: default
values:
pilot:
resources:
requests:
cpu: 1000m
memory: 1024Mi
global:
proxy:
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
components:
egressGateways:
- name: istio-egressgateway
enabled: true
k8s:
hpaSpec:
minReplicas: 2
maxReplicas: 5
ingressGateways:
- name: istio-ingressgateway
enabled: true
k8s:
hpaSpec:
minReplicas: 3
maxReplicas: 10
service:
type: LoadBalancer
loadBalancerIP: STATIC_IP
Network Policies
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: production
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: production
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
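Note that GKE enforces NetworkPolicy objects only when enforcement is enabled on the cluster (--enable-network-policy, or GKE Dataplane V2). Policies like the one above are typically paired with a default-deny baseline so that anything not explicitly allowed is blocked:
# default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress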
Auto-scaling Strategies
Horizontal Pod Autoscaling (HPA)
# hpa-advanced.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-deployment
minReplicas: 3
maxReplicas: 100
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
- type: External
external:
metric:
name: pubsub_queue_depth
selector:
matchLabels:
queue: work-queue
target:
type: Value
value: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 5
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max
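Note that the Pods and External metrics above do not resolve out of the box: they require a metrics adapter such as the Custom Metrics Stackdriver Adapter or the Prometheus adapter configured in the Monitoring section below.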
Vertical Pod Autoscaling (VPA)
# vpa-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-deployment
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2
memory: 4Gi
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
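A VerticalPodAutoscaler object only takes effect once VPA is enabled at the cluster level, and in "Auto" mode it should not manage the same CPU and memory signals an HPA is already scaling on, or the two controllers will fight:
# Enable Vertical Pod Autoscaling on the cluster
gcloud container clusters update production-cluster \
--region us-central1 \
--enable-vertical-pod-autoscaling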
Cluster Autoscaling Configuration
# On GKE the cluster autoscaler is fully managed; the cluster-autoscaler-status
# ConfigMap in kube-system is read-only status, not configuration. Tune the
# autoscaler per node pool and through the cluster-wide autoscaling profile.

# Favor bin-packing over headroom to reduce cost
gcloud container clusters update production-cluster \
--region us-central1 \
--autoscaling-profile optimize-utilization

# Adjust autoscaling limits on an existing node pool
gcloud container clusters update production-cluster \
--region us-central1 \
--node-pool spot-pool \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 20
Production Deployments
Blue-Green Deployment
# blue-green-deployment.yaml
apiVersion: v1
kind: Service
metadata:
name: api-service
namespace: production
spec:
selector:
app: api
version: blue # Switch between blue and green
ports:
- port: 80
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-blue
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: api
version: blue
template:
metadata:
labels:
app: api
version: blue
spec:
containers:
- name: api
image: gcr.io/PROJECT_ID/api:v1.0
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-green
namespace: production
spec:
replicas: 10
selector:
matchLabels:
app: api
version: green
template:
metadata:
labels:
app: api
version: green
spec:
containers:
- name: api
image: gcr.io/PROJECT_ID/api:v2.0
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
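Cutting over is a single selector change on the Service, so traffic shifts atomically from blue to green (run the same patch with version: blue to roll back):
# Shift traffic to the green deployment
kubectl patch service api-service -n production \
-p '{"spec":{"selector":{"app":"api","version":"green"}}}'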
Canary Deployment with Flagger
# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-canary
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
progressDeadlineSeconds: 300
service:
port: 80
targetPort: 8080
gateways:
- public-gateway.istio-system.svc.cluster.local
hosts:
- api.example.com
analysis:
interval: 30s
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 30s
- name: request-duration
thresholdRange:
max: 500
interval: 30s
webhooks:
- name: load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://api.production:80/"
Monitoring and Observability
Custom Metrics with Prometheus
# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
seriesFilters: []
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total$"
as: "${1}_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
- seriesQuery: 'pubsub_queue_depth{topic!=""}'
resources:
template: <<.Resource>>
name:
matches: "^(.*)$"
as: "${1}"
metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>})'
Application Performance Monitoring
# apm-deployment.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: apm-agent
namespace: monitoring
spec:
selector:
matchLabels:
app: apm-agent
template:
metadata:
labels:
app: apm-agent
spec:
serviceAccountName: apm-agent
containers:
- name: agent
image: gcr.io/PROJECT_ID/apm-agent:latest
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /var/secrets/google/key.json
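        # NOTE: a mounted key file is shown here for portability; on GKE,
        # Workload Identity (configured earlier) is the preferred way to
        # authenticate and removes the need for this key volume.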
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
limits:
memory: 500Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: google-cloud-key
mountPath: /var/secrets/google
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: google-cloud-key
secret:
secretName: apm-key
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Disaster Recovery and Backup
Velero Backup Configuration
# Install Velero for GKE backups
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.5.0 \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config serviceAccount=velero@PROJECT_ID.iam.gserviceaccount.com \
--snapshot-location-config project=PROJECT_ID,snapshotLocation=us-central1
# Create backup schedule
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces production,staging \
--exclude-resources events,events.events.k8s.io \
--ttl 720h0m0s
# Create pre-backup hooks
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: database-backup
namespace: production
annotations:
pre.hook.backup.velero.io/container: database-backup
pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "pg_dump -h $DB_HOST -U $DB_USER -d $DB_NAME > /backup/dump.sql"]'
spec:
containers:
- name: database-backup
image: postgres:13
  env:
  - name: DB_HOST
    value: postgres-service
  # DB_USER and DB_NAME are referenced by the pg_dump hook above
  # (example values; in practice source them from a Secret)
  - name: DB_USER
    value: postgres
  - name: DB_NAME
    value: app
volumeMounts:
- name: backup
mountPath: /backup
volumes:
- name: backup
persistentVolumeClaim:
claimName: backup-pvc
EOF
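A backup schedule is only proven once a restore works, so it is worth exercising restores regularly:
# List backups, then restore from the latest backup of the daily schedule
velero backup get
velero restore create --from-schedule daily-backup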
Multi-Region Failover
# multi-region-ingress.yaml
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
name: api-multicluster-ingress
namespace: production
spec:
template:
spec:
backend:
serviceName: api-multicluster-service
servicePort: 80
rules:
- host: api.example.com
http:
paths:
- path: /*
backend:
serviceName: api-multicluster-service
servicePort: 80
---
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
name: api-multicluster-service
namespace: production
spec:
template:
spec:
selector:
app: api
ports:
- port: 80
targetPort: 8080
clusters:
- link: "us-central1/production-cluster"
- link: "europe-west1/production-cluster-eu"
- link: "asia-southeast1/production-cluster-asia"
Cost Optimization
Pod Disruption Budgets
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
namespace: production
spec:
minAvailable: 2
selector:
matchLabels:
app: api
Resource Quotas
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "1000"
requests.memory: "1000Gi"
limits.cpu: "2000"
limits.memory: "2000Gi"
persistentvolumeclaims: "20"
services.loadbalancers: "5"
pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
name: production-limits
namespace: production
spec:
limits:
- max:
cpu: "2"
memory: "4Gi"
min:
cpu: "100m"
memory: "128Mi"
default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "200m"
memory: "256Mi"
type: Container
- max:
storage: "10Gi"
type: PersistentVolumeClaim
CI/CD Integration
GitOps with Config Sync
# config-sync.yaml
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
name: config-management
spec:
clusterName: production-cluster
git:
    syncRepo: git@github.com:company/k8s-config
syncBranch: main
secretType: ssh
policyDir: "clusters/production"
policyController:
enabled: true
templateLibraryInstalled: true
referentialRulesEnabled: true
logDeniesEnabled: true
mutationEnabled: true
hierarchyController:
enabled: true
enablePodTreeLabels: true
sourceFormat: unstructured
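Once Config Sync is running, the nomos CLI gives a quick fleet-wide view of sync health:
# Check sync status across registered clusters
nomos status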
Cloud Build Integration
# cloudbuild.yaml
steps:
# Build Docker image
- name: 'gcr.io/cloud-builders/docker'
args: ['build', '-t', 'gcr.io/$PROJECT_ID/api:$SHORT_SHA', '.']
# Run tests
- name: 'gcr.io/$PROJECT_ID/api:$SHORT_SHA'
args: ['npm', 'test']
# Push image
- name: 'gcr.io/cloud-builders/docker'
args: ['push', 'gcr.io/$PROJECT_ID/api:$SHORT_SHA']
# Deploy to GKE
- name: 'gcr.io/cloud-builders/gke-deploy'
args:
- run
- --filename=k8s/
- --image=gcr.io/$PROJECT_ID/api:$SHORT_SHA
- --cluster=production-cluster
- --location=us-central1
- --namespace=production
# Run smoke tests
- name: 'gcr.io/cloud-builders/gcloud'
args:
- 'builds'
- 'submit'
- '--config=smoke-tests/cloudbuild.yaml'
- '--substitutions=_ENDPOINT=${_ENDPOINT}'
options:
machineType: 'N1_HIGHCPU_8'
substitutionOption: 'ALLOW_LOOSE'
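To run this pipeline on every push, wire it to a trigger; the repository owner and name below are illustrative:
# Create a GitHub trigger for the main branch
gcloud builds triggers create github \
--repo-owner=company \
--repo-name=api \
--branch-pattern='^main$' \
--build-config=cloudbuild.yaml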
Security Hardening
Pod Security Standards
# PodSecurityPolicy was removed in Kubernetes 1.25 and is no longer available
# on supported GKE versions. Enforce the built-in Pod Security Standards with
# namespace labels (Pod Security Admission) instead.
# pod-security.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the "restricted" profile
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Surface warnings and audit-log violations as well
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
Security Scanning
# Scan an image for vulnerabilities (On-Demand Scanning API)
gcloud artifacts docker images scan IMAGE_URL
# Configure admission controller for security policies
kubectl apply -f - <<EOF
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: security-webhook
webhooks:
- name: validate.security.company.com
clientConfig:
service:
name: security-webhook
namespace: kube-system
path: "/validate"
caBundle: $(cat ca.crt | base64 | tr -d '\n')
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: ["apps", ""]
apiVersions: ["v1"]
resources: ["deployments", "pods"]
admissionReviewVersions: ["v1", "v1beta1"]
sideEffects: None
failurePolicy: Fail
namespaceSelector:
matchLabels:
security-scanning: "enabled"
EOF
Conclusion
Google Kubernetes Engine provides a robust platform for running production workloads at scale. By leveraging advanced features like Workload Identity, auto-scaling, and comprehensive monitoring, you can build resilient and efficient container orchestration systems.
Best Practices Summary
- Use Regional Clusters: For high availability in production
- Enable Workload Identity: For secure pod authentication
- Implement Auto-scaling: At cluster, node, and pod levels
- Use Binary Authorization: For supply chain security
- Monitor Everything: Leverage GCP's observability stack
- Plan for Disaster Recovery: Regular backups and multi-region setup
- Optimize Costs: Use spot instances and resource quotas
- Secure by Default: Pod security policies and network policies
Next Steps
- Explore Anthos for multi-cloud Kubernetes management
- Implement service mesh with Istio or Anthos Service Mesh
- Study advanced GKE security features
- Get certified as a Kubernetes Administrator (CKA)
- Learn about GKE Autopilot for simplified operations
Remember: GKE's strength is in providing enterprise-grade Kubernetes with Google's infrastructure expertise. Use managed features to reduce operational overhead while maintaining flexibility.