Background
The customer was running a customized Sangfor system built on top of CentOS. After half a day of troubleshooting, the culprit turned out to be a corrupted file, not a Docker problem at all.
Environment
Docker info:
```
$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.09.3
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 125.3GiB
Name: eds-1f21a854
ID: VZLV:26PU:ZUN6:QQEJ:GW3I:YETT:HMEU:CK6J:SIPL:CHKV:6ASN:5NDF
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 reg.wps.lan:5000
 treg.yun.wps.cn
 127.0.0.0/8
Registry Mirrors:
 https://registry.docker-cn.com/
 https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine
```
System info:
```
$ uname -a
Linux eds-1f21a854 3.10.0 #1 SMP Mon Sep 28 12:00:30 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```
Server info:
```
$ cat product_name
SUGON-60G16
$ cat sys_vendor
SANGFOR
$ cat product_version
1.2
```
Troubleshooting
The server "died" while installing Docker
2020-10-29 19:51:
Field engineer: The server went down at the customer site while Docker was being installed during deployment.
Me: After it comes back up, is there anything in /var/log/messages?
Field engineer: We can only get back in by restoring a snapshot; the server won't boot, so there's nothing to look at.
Me: It won't come up at all without restoring the snapshot?
Field engineer: Correct.
At this point I assumed some kernel bug had been triggered, the kernel had panicked, and that was why the server couldn't boot.
2020-10-30 09:07:
Me: When it wouldn't boot, did anyone get on the console to see why?
Field engineer: It's the customer's server, so we couldn't check it ourselves.
Me: Didn't the customer take a look?
Then the field engineer simply sent me a Sunlogin remote session. Once on the box I saw it wasn't a standard OS but something customized on top of CentOS, and I couldn't find /var/log/messages, so I ran our Docker install script by hand:
```
bash -x install-docker.sh
```
The output stopped at some step with nothing further printed, so the machine had presumably "hung" again. Looking at the script, the last debug line printed came right before starting Docker, so starting Docker looked like the trigger. A long while later I still couldn't connect to or ping the machine, so I asked the field engineer to have the people on site check whether the physical server had iDRAC, iLO or the like and look at the tty console.
The on-site people reported the server as "powered on and running normally", yet I still couldn't connect. They asked whether our steps had changed the routing; systemctl on site showed Docker was running, but the machine still couldn't ping its gateway. It suddenly occurred to me that maybe it had never gone down at all...
I asked them to run uptime -s to check the last boot time — the machine had never rebooted...
The on-site team then traced it to iptables: starting Docker had wiped out their own rules. They later adjusted things to allow the traffic through. So the earlier "starting Docker killed the server" was really iptables cutting off the network; the machine never rebooted at all.
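This kind of "Docker start broke the network" is easy to confirm by diffing the firewall rules around the daemon start. A minimal sketch using standard iptables and dockerd options (nothing here is taken from the customer's environment):

```
# Snapshot the rules, start Docker, and diff to see what changed
iptables-save > /tmp/iptables.before
systemctl start docker
iptables-save > /tmp/iptables.after
diff /tmp/iptables.before /tmp/iptables.after

# If Docker must not touch the host firewall at all, the daemon can be told so
# in /etc/docker/daemon.json (container networking rules then have to be managed manually):
#   { "iptables": false }
```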
Crash when starting containers
Moving on: the field engineer said that on other identical machines the problem above hadn't shown up while installing Docker, but at startup time instead. So I ran the deployment by hand and it errored out. With the script under -x debugging, the failure was at the step that loads the deployment image:
```
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: EOF
```
Running it manually:
```
$ docker load -i ./kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer
```
journalctl showed no related logs from the Docker daemon at all. Searching for this error, some people said the docker group for /var/run/docker.sock was missing, and others had fixed it with a blunt chmod 777. I tried both; neither helped.
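Those folk remedies are easy to sanity-check. A minimal sketch assuming a stock Docker package layout (nothing specific to this environment):

```
# On a default install the socket is owned by root:docker with mode 660
ls -l /var/run/docker.sock
# Does a docker group exist at all?
getent group docker
# Restore sane ownership/permissions rather than resorting to 777
chown root:docker /var/run/docker.sock
chmod 660 /var/run/docker.sock
```

With permissions ruled out, the next step was to run the daemon in the foreground in debug mode and see whether it printed anything useful: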
```
systemctl stop docker
pkill dockerd
dockerd -D
```
Then, in another terminal, run the image load again:
```
$ docker load -i ./kube/images/deploy.tar
ab6425526dab: Loading layer [==================================================>]  126.3MB/126.3MB
c7fe3ea715ef: Loading layer [==================================================>]  340.3MB/340.3MB
7f7eae7934f7: Loading layer [==================================================>]  3.584kB/3.584kB
e99a66706b50: Loading layer [==================================================>]  2.105MB/2.105MB
245b79de6bb7: Loading layer [==================================================>]  283.5MB/283.5MB
2a56a4432cef: Loading layer [==================================================>]  93.18kB/93.18kB
0c2ea71066ab: Loading layer [==================================================>]  276.5kB/276.5kB
bb3f6df0f87c: Loading layer [==================================================>]  82.94kB/82.94kB
6f34ead3cef7: Loading layer [==================================================>]  946.7kB/946.7kB
c21422dd15f6: Loading layer [==================================================>]  1.97MB/1.97MB
940288517f4c: Loading layer [==================================================>]  458.2kB/458.2kB
0c52f1af61f4: Loading layer [==================================================>]  5.12kB/5.12kB
049e7edd48bc: Loading layer [==================================================>]  1.57MB/1.57MB
73307245d702: Loading layer [==================================================>]  5.632kB/5.632kB
f109309d8ffc: Loading layer [==================================================>]  2.175MB/2.175MB
Loaded image: xxxxxxxxxxxx.cn/platform/deploy-amd64:ubuntu.16.04
$ docker images
REPOSITORY                              TAG            IMAGE ID       CREATED        SIZE
xxxxxxxxxxxx.cn/platform/deploy-amd64   ubuntu.16.04   3ad94a76af5d   3 months ago   734MB
```
On the daemon side, the foreground debug log looked perfectly normal:
```
...
DEBU[2020-10-30T14:48:39.955963986+08:00] Applied tar sha256:049e7edd48bc46e3dd5edf89c9caa8f0f7efbb41af403c5a54dd4f1008f604a7 to d58edd0d97bb672ef40e82e45c1603ca3ceaad847d9b9fc7c9b0588087019649, size: 1518278
DEBU[2020-10-30T14:48:39.960091040+08:00] Applying tar in /data/kube/docker/overlay2/b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18/diff  storage-driver=overlay2
DEBU[2020-10-30T14:48:40.059510528+08:00] Applied tar sha256:73307245d7021f9627ca0b2cbfeab3aac0b65abfd476f6ec26bb92c75892d7e2 to b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18, size: 3284
DEBU[2020-10-30T14:48:40.063040538+08:00] Applying tar in /data/kube/docker/overlay2/03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f/diff  storage-driver=overlay2
DEBU[2020-10-30T14:48:40.148209852+08:00] Applied tar sha256:f109309d8ffcb76589ad6389e80335d986b411c80122d990ab00a02a3a916e3e to 03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f, size: 2072803
^CINFO[2020-10-30T14:48:55.593523177+08:00] Processing signal 'interrupt'
DEBU[2020-10-30T14:48:55.593617229+08:00] daemon configured with a 15 seconds minimum shutdown timeout
DEBU[2020-10-30T14:48:55.593638628+08:00] start clean shutdown of all containers with a 15 seconds timeout...
DEBU[2020-10-30T14:48:55.594074457+08:00] Unix socket /run/docker/libnetwork/ebd15186e86385c48c4c5508d5f30eb83d5d74e56f09af5c82b6d6d9d63ec8b8.sock doesn't exist. cannot accept client connections
DEBU[2020-10-30T14:48:55.594106623+08:00] Cleaning up old mountid : start.
INFO[2020-10-30T14:48:55.594157536+08:00] stopping event stream following graceful shutdown  error="<nil>" module=libcontainerd namespace=moby
DEBU[2020-10-30T14:48:55.594343122+08:00] Cleaning up old mountid : done.
DEBU[2020-10-30T14:48:55.594501828+08:00] Clean shutdown succeeded
INFO[2020-10-30T14:48:55.594520918+08:00] stopping healthcheck following graceful shutdown  module=libcontainerd
INFO[2020-10-30T14:48:55.594531978+08:00] stopping event stream following graceful shutdown  error="context canceled" module=libcontainerd namespace=plugins.moby
DEBU[2020-10-30T14:48:55.594603119+08:00] received signal  signal=terminated
INFO[2020-10-30T14:48:55.594739890+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, TRANSIENT_FAILURE  module=grpc
INFO[2020-10-30T14:48:55.594751465+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, CONNECTING  module=grpc
```
The systemd unit configuration had nothing unusual in it, which left me baffled: why did the import succeed when the daemon ran in the foreground? Out of better ideas, I suspected a socket problem and tried exposing the socket over TCP with socat — still no luck. (In hindsight I should have had the daemon itself listen on TCP on 127.0.0.1 rather than going through the socket, since socat ultimately still passes through the same socket; a sketch of that follows the socat output below.)
```
$ socat -d -d TCP-LISTEN:2375,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock
2020/10/30 17:39:58 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N accepting connection from AF=2 127.0.0.1:35370 on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N forked off child process 11501
2020/10/30 17:42:06 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[11501] N opening connection to AF=1 "/var/run/docker.sock"
2020/10/30 17:42:06 socat[11501] N successfully connected from local address AF=1 "\0\0\0\0\0\0 \x0D\x@7\xE9\xEC\x7E\0\0\0\x01\0\0\0\0"
2020/10/30 17:42:06 socat[11501] N starting data transfer loop with FDs [6,6] and [5,5]
2020/10/30 17:42:12 socat[11501] E write(5, 0x55f098349920, 8192): Broken pipe
2020/10/30 17:42:12 socat[11501] N exit(1)
2020/10/30 17:42:12 socat[5201] N childdied(): handling signal 17

$ docker --log-level debug -H tcp://127.0.0.1:2375 load -i kube/images/deploy.tar
c7fe3ea715ef: Loading layer [==========================================>        ]  286.9MB/340.3MB
unexpected EOF
```
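Per the note above, the cleaner experiment would have been to take the Unix socket out of the picture entirely and have the daemon listen on loopback TCP itself. A minimal sketch using standard dockerd/docker flags (the port choice is arbitrary):

```
# Stop the service and run the daemon by hand with an extra TCP listener on loopback
systemctl stop docker
dockerd -D -H unix:///var/run/docker.sock -H tcp://127.0.0.1:2375 &

# Point the client at the TCP endpoint instead of the socket
docker -H tcp://127.0.0.1:2375 load -i kube/images/deploy.tar
```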
This dragged on for quite a while. I was busy at the time and went off to look at another customer's issue; when I came back, on a whim I tried loading the other images — and they all worked...
```
$ docker load -i kube/images/pause_3.1.tar
e17133b79956: Loading layer [==================================================>]  744.4kB/744.4kB
Loaded image: mirrorgooglecontainers/pause-amd64:3.1
$ docker load -i kube/images/tiller_v2.16.1.tar
77cae8ab23bf: Loading layer [==================================================>]  5.815MB/5.815MB
679105aa33fb: Loading layer [==================================================>]  6.184MB/6.184MB
639eab5d05b1: Loading layer [==================================================>]  40.46MB/40.46MB
87e5687e03f2: Loading layer [==================================================>]  41.13MB/41.13MB
Loaded image: gcr.io/kubernetes-helm/tiller:v2.16.1
$ docker load -i kube/images/calico_v3.1.3.tar
cd7100a72410: Loading layer [==================================================>]  4.403MB/4.403MB
ddc4cb8dae60: Loading layer [==================================================>]  7.84MB/7.84MB
77087b8943a2: Loading layer [==================================================>]  249.3kB/249.3kB
c7227c83afaf: Loading layer [==================================================>]  4.801MB/4.801MB
2e0e333a66b6: Loading layer [==================================================>]  231.8MB/231.8MB
Loaded image: calico/node:v3.1.3
2580685bfb60: Loading layer [==================================================>]  50.84MB/50.84MB
Loaded image: calico/kube-controllers:v3.1.3
0314be9edf00: Loading layer [==================================================>]  1.36MB/1.36MB
15db169413e5: Loading layer [==================================================>]  28.05MB/28.05MB
4252efcc5013: Loading layer [==================================================>]  2.818MB/2.818MB
76cf2496cf36: Loading layer [==================================================>]  3.03MB/3.03MB
91d3d3a16862: Loading layer [==================================================>]  2.995MB/2.995MB
18a58488ba3b: Loading layer [==================================================>]  3.474MB/3.474MB
8d8197f49da2: Loading layer [==================================================>]  27.34MB/27.34MB
7520364e0845: Loading layer [==================================================>]  9.216kB/9.216kB
b9d064622bd6: Loading layer [==================================================>]  2.56kB/2.56kB
Loaded image: calico/cni:v3.1.3
```
Only loading this particular image produced the error:
```
$ docker load -i kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer
```
Then I compared this file's checksum against the copy on the machine that produced the package — and it didn't match...
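Verifying the image tarballs against checksums recorded at packaging time would have caught this immediately. A minimal sketch with sha256sum (the file layout simply mirrors the paths used above):

```
# On the machine that builds the release package
sha256sum kube/images/*.tar > images.sha256

# On the target machine, after copying the package over
sha256sum -c images.sha256
# A corrupted tarball shows up as:
#   kube/images/deploy.tar: FAILED
```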
Summary
Two open questions remain: why did the load work when the daemon ran in the foreground, and why, when the corrupted file was imported, did the Docker daemon log nothing at all and simply reset the connection? I haven't tested whether newer versions behave the same way.
Original link: https://zhangguanzhang.github.io/2020/10/30/stupid-check-of-docker/