This guide shows you how to use Linkerd and Flagger to automate canary deployments and A/B testing.
Flagger Linkerd traffic split (diagram)
Prerequisites

Flagger requires a Kubernetes cluster v1.16 or newer and Linkerd 2.10 or newer.

Install Linkerd and Prometheus (part of Linkerd Viz):

```bash
linkerd install | kubectl apply -f -
linkerd viz install | kubectl apply -f -
```

Install Flagger in the linkerd namespace:

```bash
kubectl apply -k github.com/fluxcd/flagger//kustomize/linkerd
```
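Before moving on, it is worth confirming that both the control plane and the Viz extension are healthy. A quick sanity check with the Linkerd CLI:

```bash
# validate the control plane installation
linkerd check

# validate the Viz extension, which runs the Prometheus instance Flagger queries
linkerd viz check
```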
Bootstrap

Flagger takes a Kubernetes deployment and an optional horizontal pod autoscaler (HPA), then creates a series of objects (Kubernetes deployments, ClusterIP services and an SMI traffic split). These objects expose the application inside the mesh and drive the canary analysis and promotion.

Create a test namespace and enable Linkerd proxy injection:

```bash
kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled
```

Install the load testing service to generate traffic during the canary analysis:

```bash
kubectl apply -k https://github.com/fluxcd/flagger//kustomize/tester?ref=main
```

Create a deployment and a horizontal pod autoscaler:

```bash
kubectl apply -k https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main
```
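Since the namespace is annotated for injection, every pod scheduled in it should come up with a linkerd-proxy sidecar. A quick check (the app=podinfo label is an assumption about the podinfo manifests):

```bash
# meshed pods report an extra container, e.g. READY 2/2
kubectl -n test get pods

# list the container names of the podinfo pods; linkerd-proxy should appear
kubectl -n test get pods -l app=podinfo \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
```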
Create a Canary custom resource for the podinfo deployment:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  analysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
        interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"
```
Save the above resource as podinfo-canary.yaml and then apply it:

```bash
kubectl apply -f ./podinfo-canary.yaml
```
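You can then watch the Canary resource while Flagger finishes the bootstrap; the STATUS column should eventually read Initialized:

```bash
watch kubectl -n test get canary podinfo
```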
When the canary analysis starts, Flagger will call the pre-rollout webhooks before routing traffic to the canary. The canary analysis will run for five minutes (maxWeight 50 divided by stepWeight 5 gives ten steps, one per 30s interval) while validating the HTTP metrics and rollout hooks every half a minute.

After a few seconds Flagger will create the canary objects:
```text
# applied
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
ingresses.extensions/podinfo
canary.flagger.app/podinfo

# generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
trafficsplits.split.smi-spec.io/podinfo
```
After the bootstrap, the podinfo deployment will be scaled to zero and the traffic to podinfo.test will be routed to the primary pods. During the canary analysis, the podinfo-canary.test address can be used to target the canary pods directly.
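For reference, the generated SMI TrafficSplit keeps all of the weight on the primary until an analysis starts. A minimal sketch of what it might look like (the apiVersion and weight notation depend on your Flagger and Linkerd versions):

```yaml
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: podinfo
  namespace: test
spec:
  # the apex service that clients address
  service: podinfo
  backends:
    # all traffic stays on the primary between analyses
    - service: podinfo-primary
      weight: 100
    # Flagger raises this in stepWeight increments during an analysis
    - service: podinfo-canary
      weight: 0
```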
Automated canary promotion

Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators like HTTP request success rate, average request duration and pod health. Based on the analysis of the KPIs, the canary is promoted or aborted, and the analysis result is published to Slack.
Flagger canary stages (diagram)
Trigger a canary deployment by updating the container image:

```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.1
```
Flagger detects that the deployment revision changed and starts a new rollout:
```text
kubectl -n test describe canary/podinfo

Status:
  Canary Weight:  0
  Failed Checks:  0
  Phase:          Succeeded
Events:
  New revision detected! Scaling up podinfo.test
  Waiting for podinfo.test rollout to finish: 0 of 1 updated replicas are available
  Pre-rollout check acceptance-test passed
  Advance podinfo.test canary weight 5
  Advance podinfo.test canary weight 10
  Advance podinfo.test canary weight 15
  Advance podinfo.test canary weight 20
  Advance podinfo.test canary weight 25
  Waiting for podinfo.test rollout to finish: 1 of 2 updated replicas are available
  Advance podinfo.test canary weight 30
  Advance podinfo.test canary weight 35
  Advance podinfo.test canary weight 40
  Advance podinfo.test canary weight 45
  Advance podinfo.test canary weight 50
  Copying podinfo.test template spec to podinfo-primary.test
  Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
  Promotion completed! Scaling down podinfo.test
```
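After the promotion you can verify that the primary is running the new image; a quick check, assuming a single container in the pod spec:

```bash
# should print the promoted image, e.g. stefanprodan/podinfo:3.1.1
kubectl -n test get deployment podinfo-primary \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```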
Note that if you apply new changes to the deployment during the canary analysis, Flagger will restart the analysis.

A canary deployment is triggered by changes in any of the following objects (see the sketch after this list for an example trigger):

- Deployment PodSpec (container image, command, ports, env, resources, etc.)
- ConfigMaps mounted as volumes or mapped to environment variables
- Secrets mounted as volumes or mapped to environment variables
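For instance, assuming the deployment mounted a hypothetical ConfigMap named podinfo-config, updating that ConfigMap would also kick off a new analysis:

```bash
# podinfo-config is hypothetical; it is not part of this tutorial's manifests
kubectl -n test create configmap podinfo-config \
  --from-literal=ui-color=blue \
  --dry-run=client -o yaml | kubectl apply -f -
```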
You can monitor all canaries with:

```text
watch kubectl get canaries --all-namespaces

NAMESPACE   NAME       STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo    Progressing   15       2019-06-30T14:05:07Z
prod        frontend   Succeeded     0        2019-06-30T16:15:07Z
prod        backend    Failed        0        2019-06-30T17:05:07Z
```
Automated rollback

During the canary analysis you can generate HTTP 500 errors and high latency to test whether Flagger pauses the rollout and rolls back the faulted version.

Trigger another canary deployment:

```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.2
```

Exec into the load tester pod with:

```bash
kubectl -n test exec -it flagger-loadtester-xx-xx sh
```

Generate HTTP 500 errors:

```bash
watch -n 1 curl http://podinfo-canary.test:9898/status/500
```

Generate latency:

```bash
watch -n 1 curl http://podinfo-canary.test:9898/delay/1
```
When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero and the rollout is marked as failed:
```text
kubectl -n test describe canary/podinfo

Status:
  Canary Weight:  0
  Failed Checks:  10
  Phase:          Failed
Events:
  Starting canary analysis for podinfo.test
  Pre-rollout check acceptance-test passed
  Advance podinfo.test canary weight 5
  Advance podinfo.test canary weight 10
  Advance podinfo.test canary weight 15
  Halt podinfo.test advancement success rate 69.17% < 99%
  Halt podinfo.test advancement success rate 61.39% < 99%
  Halt podinfo.test advancement success rate 55.06% < 99%
  Halt podinfo.test advancement request duration 1.20s > 0.5s
  Halt podinfo.test advancement request duration 1.45s > 0.5s
  Rolling back podinfo.test failed checks threshold reached 5
  Canary failed! Scaling down podinfo.test
```
Custom metrics

The canary analysis can be extended with Prometheus queries.

Let's define a check for not-found errors. Edit the canary analysis and add the following metric:
```yaml
analysis:
  metrics:
    - name: "404s percentage"
      threshold: 3
      query: |
        100 - sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="podinfo",
                    status_code!="404",
                    direction="inbound"
                }[1m]
            )
        )
        /
        sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="podinfo",
                    direction="inbound"
                }[1m]
            )
        )
        * 100
```
The above configuration validates the canary version by checking whether the rate of HTTP 404 requests per second stays below 3% of the total traffic. If the 404s rate reaches the 3% threshold, the analysis is aborted and the canary is marked as failed.
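Recent Flagger releases can also express such a check as a reusable MetricTemplate resource referenced from the canary instead of an inline query. A sketch, assuming the Linkerd Viz Prometheus is reachable at the address below:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: test
spec:
  provider:
    type: prometheus
    # assumed address of the Linkerd Viz Prometheus service
    address: http://prometheus.linkerd-viz:9090
  query: |
    100 - sum(
        rate(
            response_total{
                namespace="{{ namespace }}",
                deployment="{{ target }}",
                status_code!="404",
                direction="inbound"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            response_total{
                namespace="{{ namespace }}",
                deployment="{{ target }}",
                direction="inbound"
            }[{{ interval }}]
        )
    )
    * 100
```

The canary metric then points at the template through a templateRef (name plus namespace) together with a thresholdRange, instead of carrying the query inline.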
Trigger a canary deployment by updating the container image:

```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.3
```

Generate 404s:

```bash
watch -n 1 curl http://podinfo-canary:9898/status/404
```
Watch the Flagger logs:
```text
kubectl -n linkerd logs deployment/flagger -f | jq .msg

Starting canary deployment for podinfo.test
Pre-rollout check acceptance-test passed
Advance podinfo.test canary weight 5
Halt podinfo.test advancement 404s percentage 6.20 > 3
Halt podinfo.test advancement 404s percentage 6.45 > 3
Halt podinfo.test advancement 404s percentage 7.22 > 3
Halt podinfo.test advancement 404s percentage 6.50 > 3
Halt podinfo.test advancement 404s percentage 6.34 > 3
Rolling back podinfo.test failed checks threshold reached 5
Canary failed! Scaling down podinfo.test
```
If you have configured Slack, Flagger will send a notification with the reason why the canary failed.
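Wiring up Slack happens when installing Flagger; as a sketch, the Flagger Helm chart exposes slack.* values (this assumes a Helm install rather than the kustomize one used above, and the webhook URL is a placeholder for your own):

```bash
helm upgrade -i flagger flagger/flagger \
  --namespace linkerd \
  --set meshProvider=linkerd \
  --set metricsServer=http://prometheus.linkerd-viz:9090 \
  --set slack.url=https://hooks.slack.com/services/YOUR/HOOK/URL \
  --set slack.channel=general \
  --set slack.user=flagger
```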
Linkerd Ingress
There are two ingress controllers that are compatible with both Flagger and Linkerd: NGINX and Gloo.

Install NGINX:

```bash
helm upgrade -i nginx-ingress stable/nginx-ingress \
  --namespace ingress-nginx
```
Create an ingress definition for podinfo that rewrites the incoming header to the internal service name (required by Linkerd):
```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: podinfo
  namespace: test
  labels:
    app: podinfo
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:9898;
      proxy_hide_header l5d-remote-ip;
      proxy_hide_header l5d-server-id;
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - backend:
              serviceName: podinfo
              servicePort: 9898
```
When using an ingress controller, the Linkerd traffic split does not apply to incoming traffic, since NGINX runs outside of the mesh. In order to run a canary analysis for a frontend app, Flagger creates a shadow ingress and sets the NGINX-specific annotations.
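To sanity-check the routing you can send a request through the controller with the configured host header. The lookup below assumes the chart exposed the controller as a LoadBalancer Service named nginx-ingress-controller:

```bash
# fetch the external IP of the NGINX controller (assumed Service name and type)
LB=$(kubectl -n ingress-nginx get svc nginx-ingress-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -H 'Host: app.example.com' "http://${LB}/"
```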
A/B Testing

Besides weighted routing, Flagger can be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario you'll use HTTP headers or cookies to target a specific segment of your users. This is particularly useful for frontend applications that require session affinity.
Flagger Linkerd Ingress (diagram)
Edit the podinfo canary analysis: set the provider to nginx, add the ingress reference, remove the max/step weights, and add the match conditions and iterations:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # ingress reference
  provider: nginx
  ingressRef:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: podinfo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
  analysis:
    interval: 1m
    threshold: 10
    iterations: 10
    match:
      # curl -H 'X-Canary: always' http://app.example.com
      - headers:
          x-canary:
            exact: "always"
      # curl -b 'canary=always' http://app.example.com
      - headers:
          cookie:
            exact: "canary"
    # Linkerd Prometheus checks
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 -H 'Cookie: canary=always' http://app.example.com"
```
The above configuration will run an analysis for ten minutes (ten iterations at a one-minute interval) targeting users that have a canary cookie set to always or that call the service using the X-Canary: always header.

Note that the load test now targets the external address and uses the canary cookie.
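As the comments in the manifest show, you can also reach the canary yourself during the analysis with either match condition:

```bash
# route to the canary via the HTTP header match
curl -H 'X-Canary: always' http://app.example.com

# route to the canary via the cookie match
curl -b 'canary=always' http://app.example.com
```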
Trigger a canary deployment by updating the container image:

```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.4
```
Flagger detects that the deployment revision changed and starts the A/B test:
```text
kubectl -n test describe canary/podinfo

Events:
  Starting canary deployment for podinfo.test
  Pre-rollout check acceptance-test passed
  Advance podinfo.test canary iteration 1/10
  Advance podinfo.test canary iteration 2/10
  Advance podinfo.test canary iteration 3/10
  Advance podinfo.test canary iteration 4/10
  Advance podinfo.test canary iteration 5/10
  Advance podinfo.test canary iteration 6/10
  Advance podinfo.test canary iteration 7/10
  Advance podinfo.test canary iteration 8/10
  Advance podinfo.test canary iteration 9/10
  Advance podinfo.test canary iteration 10/10
  Copying podinfo.test template spec to podinfo-primary.test
  Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
  Promotion completed! Scaling down podinfo.test
```
Source: https://mp.weixin.qq.com/s/8ThwH9DvFAnc-trOSf_nNQ