繼續離題：繼續查修Prometheus

前提

前提就是前一篇得內容，Prometheus開始用了PV、PVC

緊接著昨日晚上，prometheus pod異常了，以為小問題，刪除了prometheus pod期許它自行恢復，但...無情的是"沒有恢復"。 Prometheus log 顯示

component=tsdb \
msg="last page of the wal is torn, filling it with zeros" \
segment=/prometheus/wal/00000012

按照上述 log 特徵去查詢網路資訊，輾轉查到這幾篇： https://github.com/prometheus/prometheus/issues/3632 https://github.com/prometheus/tsdb/issues/590 https://github.com/prometheus/tsdb/pull/623 得知在v.2.11.0-rc.0版本中獲得WAL問題的解決，於是....

更新版本

當時我使用的版本是v2.10.0，更新前去瀏覽 Prometheus github 網站查看目前版本進展，於是我選擇了v2.12.0版本。緊接著趕緊去更新，驗證問題是否真的解決。

更新後狀況

更新之後，於是看見了新問題...嗎？

level=error ts=2019-09-26T15:58:39.427Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=error ts=2019-09-26T15:58:39.427Z caller=pod.go:85 component="discovery manager scrape" discovery=k8s role=pod msg="pod informer unable to sync cache"
level=error ts=2019-09-26T15:58:39.427Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"

Prometheus仍然無法正常啟動，無解...... 即使使用v2.11.2版本，情況相同。

於是我想到，當Prometheus pod啟動時，會去讀取PV裡頭的數據資料，於是我選擇了捨棄PVC、PV掛載使用。 Prometheus pod重新更新後，迅速恢復了～～～

不死心，再重新將PVC、PV掛載回來，果然Prometheus無法正常運作，總是讀取了眾多WAL資訊後就停擺了。

於是我做了個動作：移除PV，再重新掛載新的PV~ 就這樣 "期許" 放著運作數日看看，是否有新問題發生。

但殘酷是，一小時之後，結果prometheus pod又故障不工作了！只好先捨棄不採用PV、PVC磁碟方案了。

一夜之後

發現Prometheus仍然會異常，都出現在記憶體使用不足的情況下列是正常Prometheus pod的記憶體使用情況（下圖綠色線條接近2.4 ~ 2.6GB）。

下列是不正常Prometheus的記憶體使用情況，一直猛用記憶體（下圖綠色線條接近5GB）。

什麼原因啊？

到今晚的心得是

不一定是掛載了PV、PVC磁碟關係，因為其他掛載的Prometheus仍正常運作中。
不像是讀取WAL的相關bug，因為換了版本仍然故障。
我一直看見Prometheus吃了不少記憶體。

回想穩定與不穩定之間的環境差異，就是pod數量了！

不穩定的環境中，K8s cluster 內pod數量超過100，每當我大量操作restart pod時，Prometheus會特別明顯的立即故障重啟～
相對穩定環境的Prometheus中，該環境的pod數量約略40個。

目前我為了可以放心放假，先為了Prometheus準備了資源更充足的K8s node給予使用。觀察其穩定性狀況為何。當然另外再花時間，研究Prometheus的效能與記憶體需求面向。

以下省略

WAL 是啥？

Prome 其他問題

level=warn ts=2019-09-26T17:31:15.299Z caller=manager.go:526 component="rule manager" group=k8s.rules msg="Evaluating rule failed" rule="record: namespace_name:container_cpu_usage_seconds_total:sum_rate\nexpr: sum by(namespace, label_name) (sum by(namespace, pod_name) (rate(container_cpu_usage_seconds_total{container_name!=\"\",image!=\"\",job=\"kubelet\"}[5m]))\n  * on(namespace, pod_name) group_left(label_name) label_replace(kube_pod_labels{job=\"kube-state-metrics\"},\n  \"pod_name\", \"$1\", \"pod\", \"(.*)\"))\n" err="found duplicate series for the match group {namespace=\"b2pro\", pod_name=\"prometheus-b2pro-prometheus-operator-prometheus-0\"} on the right hand-side of the operation: [{__name__=\"kube_pod_labels\", endpoint=\"http\", instance=\"10.56.9.3:8080\", job=\"kube-state-metrics\", label_app=\"prometheus\", label_controller_revision_hash=\"prometheus-b2pro-prometheus-operator-prometheus-554fb6c98d\", label_prometheus=\"b2pro-prometheus-operator-prometheus\", label_statefulset_kubernetes_io_pod_name=\"prometheus-b2pro-prometheus-operator-prometheus-0\", namespace=\"b2pro\", pod=\"prometheus-b2pro-prometheus-operator-prometheus-0\", pod_name=\"prometheus-b2pro-prometheus-operator-prometheus-0\", service=\"b2pro-kube-state-metrics\"}, {__name__=\"kube_pod_labels\", endpoint=\"http\", instance=\"10.56.9.3:8080\", job=\"kube-state-metrics\", label_app=\"prometheus\", label_controller_revision_hash=\"prometheus-b2pro-prometheus-operator-prometheus-54bf49b9cf\", label_prometheus=\"b2pro-prometheus-operator-prometheus\", label_statefulset_kubernetes_io_pod_name=\"prometheus-b2pro-prometheus-operator-prometheus-0\", namespace=\"b2pro\", pod=\"prometheus-b2pro-prometheus-operator-prometheus-0\", pod_name=\"prometheus-b2pro-prometheus-operator-prometheus-0\", service=\"b2pro-kube-state-metrics\"}];many-to-many matching not allowed: matching labels must be unique on one side"

Previous臨時題目：查修 prometheus Next臨時題目：限定 Pod 訪問外網時，固定 public ip

Last updated 5 years ago