Corrupted Accept-Encoding header in incoming requests

The requests look like this:

GET /api/v1/feed HTTP/1.1
model: Android
If-Modified-Since: Thu, 10 Sep 2015 08:42:41 GMT+00:00
If-None-Match: "03e6c6bafcf335f108dd2fdc8359976cc"
Host: baobab.xxx.com
Connection: Keep-Alive
Accdpt-Encoding: gzip
User-Agent: okhttp/2.4.0

Packet capture taken on the LVS external NIC: lvs.pcap

Filtering in Wireshark with http.request and http contains "Accdpt-Encoding", then narrowing down further by ip.addr and tcp.port, shows the following:

(screenshot: the filtered capture, where the requests already arrive carrying the corrupted Accdpt-Encoding header)

So the requests are already corrupted by the time they reach us.
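
For reference, the same filter can also be run over the saved capture with tshark (recent versions; older ones take -R instead of -Y):

# tshark -r lvs.pcap -Y 'http.request and http contains "Accdpt-Encoding"'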

 

Well, this post has no real conclusion yet; I'm leaving it here as a marker and will update it if there is any follow-up.

 

Fixing HTTP requests being reset by LVS

Over the last two days I ran into HTTP requests being reset by LVS. LVS ships with three timeout parameters:

# ipvsadm -L --timeout
Timeout (tcp tcpfin udp): 90 3 300

The first value is the TCP idle-connection timeout: a connection that carries no data for more than 90 s is reset by LVS, and the HTTP client sees Connection reset by peer. One of my services makes HTTP requests that run for a long time, at least 5 min, so every one of them got reset.
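
For reference, these timeouts can also be changed globally with ipvsadm, e.g. raising the TCP idle timeout to 900 s (the values are in the order tcp tcpfin udp); note that this affects every service behind the LVS, which is why I looked for a narrower fix below:

# ipvsadm --set 900 3 300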

 

My first idea was to set so_keepidle on the HTTP client side (the requests are made with the requests library) to something below 90 s.

Since requests uses urllib3, and urllib3 uses the standard library http.client on Python 3 (httplib on Python 2.x), both of which end up calling socket.create_connection, we can monkey-patch the connect method to set so_keepidle:

import socket
import httplib  # Python 2; on Python 3 this is http.client

orig_connect = httplib.HTTPConnection.connect

def monkey_connect(self):
    # Connect as usual, then enable TCP keepalive with a short idle time
    # so probes start well before the LVS 90 s idle timeout.
    orig_connect(self)
    self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    self.sock.setsockopt(socket.SOL_TCP, socket.TCP_KEEPIDLE, 10)
    # self.sock.setsockopt(socket.SOL_TCP, socket.TCP_KEEPINTVL, 10)
    # self.sock.setsockopt(socket.SOL_TCP, socket.TCP_KEEPCNT, 100)

httplib.HTTPConnection.connect = monkey_connect

import requests  # import only after patching

But a packet capture showed that the HTTP client never sent a single TCP keepalive probe, which was puzzling. (Most likely the patch simply never takes effect: urllib3 ships its own HTTPConnection subclass that overrides connect(), so the patched base-class method is never called. I didn't debug further; leaving a marker here.)
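
A cleaner route, and the one the Stack Overflow link in the references below describes, is to inject socket options through a custom HTTPAdapter. A minimal sketch, assuming a urllib3 recent enough to support socket_options (with older requests the import path would be requests.packages.urllib3.connection):

import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection

class TCPKeepAliveAdapter(HTTPAdapter):
    # Enable TCP keepalive with a 10 s idle time on every pooled connection.
    def init_poolmanager(self, *args, **kwargs):
        kwargs["socket_options"] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
            (socket.SOL_TCP, socket.TCP_KEEPIDLE, 10),  # Linux-only option
        ]
        super(TCPKeepAliveAdapter, self).init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("http://", TCPKeepAliveAdapter())
session.get("http://baobab.xxx.com/api/v1/feed")  # hypothetical long-running request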

 

Later I found that so_keepidle can be set on Nginx instead (the LVS forwards to Nginx): Nginx exposes it through the so_keepalive parameter of the listen directive, with this format:

so_keepalive=on|off|[keepidle]:[keepintvl]:[keepcnt]

For example, listen 80 so_keepalive=60::; sets TCP_KEEPIDLE to 60 s and leaves keepintvl and keepcnt at their system defaults.

Tested and confirmed working: this finally fixed the resets of long-running HTTP requests. It also takes effect per listen socket, so other domains are unaffected (changing the LVS TCP idle timeout would affect every domain).
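
In context, a minimal sketch of the server block (server_name and upstream are hypothetical):

server {
    # Start TCP keepalive probes after 60 s idle, safely under the LVS 90 s timeout.
    listen 80 so_keepalive=60::;
    server_name slow-api.xxx.com;

    location / {
        proxy_pass http://upstream-nodes;
    }
}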

 

References:

http://stackoverflow.com/questions/15148225/is-it-possible-to-override-the-default-socket-options-in-requests

https://linux.cn/article-3766-1.html

http://nginx.org/en/docs/http/ngx_http_core_module.html

 

Troubleshooting requests with method T reaching Nginx

Requests whose method is T have been showing up in the Nginx logs for a long time; today I finally took the chance to get to the bottom of it. Besides the majority of T requests there is also a small number of ST requests, which looks like the method string is being truncated.

An entry in the Nginx access log looks like this:

58.62.163.217 - - [16/Aug/2015:00:05:36 +0800] "T /xxx/v1/index?format=proto&v=5.1.1&udid=woeruwouw&launchedCount=257&channel=tencent_app&vc=8897&capacity=3 HTTP/1.1" 502 787 "-" "Dalvik/2.1.0 (Linux; U; Android 5.0.2; L50t Build/23.1.A.0.726)" "-" "10.0.69.48:8080, 10.0.29.129:8080, 10.0.31.105:8080, 10.0.22.123:8080, 10.0.27.179:8080, 10.0.27.178:8080, 10.0.66.138:8080, 10.0.69.135:8080, 10.0.42.90:8080, 10.0.23.78:8080, 10.0.41.109:8080, 10.0.70.93:8080, 10.0.27.37:8080, 10.0.29.69:8080, 10.0.25.47:8080, 10.0.66.81:8080" "scheme=http" "reqtime=0.299" "upstream_resptime=0.002, 0.001, 0.002, 0.001, 0.003, 0.001, 0.002, 0.001, 0.001, 0.003, 0.001, 0.002, 0.002, 0.001, 0.006, 0.003" "-" "host=xxx.xxx.com"

The method in $request is T; one upstream machine returned 502, Nginx retried every other upstream, and the final answer was still 502. (The exact 502 depends on the backend framework: here the backend is Node.js Express; a Java backend might return 501 instead.)

The Nginx error log shows:

2015/08/17 16:40:49 [error] 12217#0: *303246388 upstream prematurely closed connection while reading response header from upstream, client: 121.34.19.123, server: apis.xxx.com, request: "T /xxx/v1/search/hint?format=proto&pattern=3&v=5.1.1&capacity=3&vc=8900&udid=woeruwouw&channel=xiaomi HTTP/1.1", upstream: "http://10.0.27.37:8080/five/v1/search/hint?format=proto&pattern=3&v=5.1.1&capacity=3&vc=8900&udid=woeruwouw&channel=xiaomi", host: "xxx.xxx.com"
2015/08/17 16:40:49 [error] 12217#0: *303246322 no live upstreams while connecting to upstream, client: 14.17.22.49, server: xxx.xxx.com, request: "GET /xxx/v1/search/hint?format=proto&pattern=0&v=5.1.1&capacity=3&vc=8900&udid=woeruwouw&channel=xxx_pc_baidu_pt HTTP/1.1", upstream: "http://upstream-nodes/five/v1/search/hint?format=proto&pattern=0&v=5.1.1&capacity=3&vc=8900&udid=woeruwouw&channel=xxx_pc_baidu_pt", host: "xxx.xxx.com"

Because every machine in the upstream got retried, the error log also reports no live upstreams while connecting to upstream, which looks rather alarming.

 

After round after round of packet capturing, we found the problem was tied to synproxy: turning synproxy off made it go away. But synproxy defends against SYN flood attacks, so disabling it outright is a bad idea. We then went through the synproxy-related kernel parameters and found /proc/sys/net/ipv4/vs/synproxy_defer; flipping it from 1 to 0 solved the problem.
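
Concretely, on the LVS director (sysctl -w net.ipv4.vs.synproxy_defer=0 is equivalent):

# echo 0 > /proc/sys/net/ipv4/vs/synproxy_defer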

synproxy_defer is a parameter of LVS synproxy. With synproxy on and synproxy_defer set to 1, the client first completes the handshake with the LVS, and the LVS only connects to the backend Nginx once the client actually sends request data; with synproxy_defer set to 0, the LVS connects to Nginx as soon as it receives the client's ACK. Here is one capture taken on the LVS while the problem was occurring:

(screenshot: LVS-side capture showing the client's out-of-order segments NO.4-NO.6 and the ACK in NO.7)

After the client completed the handshake with the LVS, its data arrived out of order: NO.6 has the smallest seq, NO.5 the next, and NO.4 the third. NO.6 carries "PO" and NO.5 carries "ST".

NO.7 then shows the LVS ACKing NO.5, and although that ACK number already covers NO.6's seq, the fact that Nginx received "ST" means NO.6's data must have been dropped.

This is very likely caused by an LVS bug; turning synproxy_defer off changes the LVS code path, the bug no longer triggers, and the problem is solved. Click dump.pcap to download the full capture. Two more related captures are attached: the first is client-to-LVS with synproxy off, the second is LVS-to-Nginx with synproxy on and synproxy_defer off:

syn_proxy_closed_lvs.pcap

synproxy_defer_closed_nginx.pcap

 

Update: after talking to Alibaba, this turned out to be a known bug that is already fixed on the master branch of https://github.com/alibaba/LVS. The bug report reads:

Problem: with synproxy on, if the client's second data packet reaches the LVS before the first one, and the first packet is shorter than 15 bytes, the synproxy seq check wrongly passes, so the SYN that the LVS sends to the RS carries a wrong seq.
Fix: move mssidx from data[3-0] to data[15-12] so it can no longer be mixed up with a seq offset, and during the seq check require data[11-0] to be zero.
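
To make the failure mode concrete, here is a toy model of the seq check, reconstructed purely from the bug description above (my own hedged sketch, not the actual LVS code):

MSS_BITS_OLD = 0xF         # old layout: mssidx stored in data[3-0]
MSS_BITS_NEW = 0xF << 12   # fixed layout: mssidx stored in data[15-12]

def old_seq_check(cookie, seq):
    # mssidx lives in the low 4 bits, so they must be ignored when comparing;
    # a data packet whose seq is off by < 15 bytes therefore still "matches".
    return (cookie & ~MSS_BITS_OLD) == (seq & ~MSS_BITS_OLD)

def new_seq_check(cookie, seq):
    # After the fix, mssidx sits in data[15-12] and data[11-0] must be zero,
    # so a small seq offset can no longer masquerade as a valid cookie.
    return (cookie & ~MSS_BITS_NEW) == (seq & ~MSS_BITS_NEW) and (seq & 0xFFF) == 0

cookie = 0x12345000          # a valid cookie: low 12 bits zero by construction
early_seq = cookie + 10      # second packet arriving first, 10 bytes into the stream
assert old_seq_check(cookie, early_seq)        # old check wrongly passes
assert not new_seq_check(cookie, early_seq)    # fixed check rejects it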

 

 

As an appendix, here are three tests covering the synproxy and synproxy_defer settings, showing how connections are established between the client, the LVS, and the backend.

All captures below are taken on the LVS.

Case 1: synproxy on, /proc/sys/net/ipv4/vs/synproxy_defer on (set to 1).

Capture on the LVS VIP:

# tcpdump -n -i em1 host 60.28.208.10 and port 80

20:24:46.493678 IP 111.206.15.131.53471 > 60.28.208.10.http: Flags [S], seq 358124746, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 255746849 ecr 0,sackOK,eol], length 0
20:24:46.493694 IP 60.28.208.10.http > 111.206.15.131.53471: Flags [S.], seq 1490864584, ack 358124747, win 65535, options [mss 1452,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,sackOK,eol], length 0
20:24:46.501224 IP 111.206.15.131.53471 > 60.28.208.10.http: Flags [.], ack 1, win 65535, length 0
20:24:46.501658 IP 111.206.15.131.53471 > 60.28.208.10.http: Flags [P.], seq 1:77, ack 1, win 65535, length 76
20:24:46.502390 IP 60.28.208.10.http > 111.206.15.131.53471: Flags [.], ack 77, win 14600, length 0
20:24:46.507444 IP 60.28.208.10.http > 111.206.15.131.53471: Flags [P.], seq 1:846, ack 77, win 14600, length 845
20:24:46.515149 IP 111.206.15.131.53471 > 60.28.208.10.http: Flags [.], ack 846, win 65535, length 0
20:24:46.515813 IP 111.206.15.131.53471 > 60.28.208.10.http: Flags [F.], seq 77, ack 846, win 65535, length 0
20:24:46.516076 IP 60.28.208.10.http > 111.206.15.131.53471: Flags [F.], seq 846, ack 78, win 14600, length 0
20:24:46.524315 IP 111.206.15.131.53471 > 60.28.208.10.http: Flags [.], ack 847, win 65535, length 0

Capture on the backend's internal IP (in fullnat mode the LVS connects to backends from a localip; 10.0.19.0/24 is the localip range):

# tcpdump -n -i em2 host 10.0.23.152 and port 80

20:24:46.501701 IP 10.0.19.245.commplex-link > 10.0.23.152.http: Flags [S], seq 4279320233, win 5000, options [Unknown Option 200d0df6fce0f83,mss 1440,nop,nop,sackOK], length 0
20:24:46.502002 IP 10.0.23.152.http > 10.0.19.245.commplex-link: Flags [S.], seq 4098256091, ack 4279320234, win 14600, options [mss 1460,nop,nop,sackOK], length 0
20:24:46.502045 IP 10.0.19.245.commplex-link > 10.0.23.152.http: Flags [P.], seq 1:77, ack 1, win 65535, options [Unknown Option 200d0df6fce0f83], length 76
20:24:46.502362 IP 10.0.23.152.http > 10.0.19.245.commplex-link: Flags [.], ack 77, win 14600, length 0
20:24:46.507429 IP 10.0.23.152.http > 10.0.19.245.commplex-link: Flags [P.], seq 1:846, ack 77, win 14600, length 845
20:24:46.515167 IP 10.0.19.245.commplex-link > 10.0.23.152.http: Flags [.], ack 846, win 65535, length 0
20:24:46.515828 IP 10.0.19.245.commplex-link > 10.0.23.152.http: Flags [F.], seq 77, ack 846, win 65535, length 0
20:24:46.516062 IP 10.0.23.152.http > 10.0.19.245.commplex-link: Flags [F.], seq 846, ack 78, win 14600, length 0
20:24:46.524331 IP 10.0.19.245.commplex-link > 10.0.23.152.http: Flags [.], ack 847, win 65535, length 0

Comparing timestamps, 20:24:46.501658 is earlier than 20:24:46.501701: the LVS only connects to the backend when the client, having completed the handshake with the LVS, sends its first request data (the fourth packet).

 

Case 2: synproxy on, /proc/sys/net/ipv4/vs/synproxy_defer off (set to 0).

Capture on the LVS VIP:

# tcpdump -n -i em1 host 60.28.208.10 and port 80

20:09:16.360931 IP 111.206.15.131.53400 > 60.28.208.10.http: Flags [S], seq 689511913, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 254818388 ecr 0,sackOK,eol], length 0
20:09:16.360945 IP 60.28.208.10.http > 111.206.15.131.53400: Flags [S.], seq 826028321, ack 689511914, win 65535, options [mss 1452,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,sackOK,eol], length 0
20:09:16.368457 IP 111.206.15.131.53400 > 60.28.208.10.http: Flags [.], ack 1, win 65535, length 0
20:09:16.368727 IP 111.206.15.131.53400 > 60.28.208.10.http: Flags [P.], seq 1:77, ack 1, win 65535, length 76
20:09:16.369483 IP 60.28.208.10.http > 111.206.15.131.53400: Flags [.], ack 77, win 14600, length 0
20:09:16.374830 IP 60.28.208.10.http > 111.206.15.131.53400: Flags [P.], seq 1:846, ack 77, win 14600, length 845
20:09:16.383219 IP 111.206.15.131.53400 > 60.28.208.10.http: Flags [.], ack 846, win 65535, length 0
20:09:16.383265 IP 111.206.15.131.53400 > 60.28.208.10.http: Flags [F.], seq 77, ack 846, win 65535, length 0
20:09:16.383482 IP 60.28.208.10.http > 111.206.15.131.53400: Flags [F.], seq 846, ack 78, win 14600, length 0
20:09:16.390569 IP 111.206.15.131.53400 > 60.28.208.10.http: Flags [.], ack 847, win 65535, length 0

Capture on the backend's internal IP:

# tcpdump -n -i em2 host 10.0.23.152 and port 80

20:09:16.368513 IP 10.0.19.244.commplex-link > 10.0.23.152.http: Flags [S], seq 24896414, win 5000, options [Unknown Option 200d0986fce0f83,mss 1440,nop,nop,sackOK], length 0
20:09:16.368995 IP 10.0.23.152.http > 10.0.19.244.commplex-link: Flags [S.], seq 325609968, ack 24896415, win 14600, options [mss 1460,nop,nop,sackOK], length 0
20:09:16.369028 IP 10.0.19.244.commplex-link > 10.0.23.152.http: Flags [.], ack 1, win 65535, options [Unknown Option 200d0986fce0f83], length 0
20:09:16.369035 IP 10.0.19.244.commplex-link > 10.0.23.152.http: Flags [P.], seq 1:77, ack 1, win 65535, options [Unknown Option 200d0986fce0f83], length 76
20:09:16.369461 IP 10.0.23.152.http > 10.0.19.244.commplex-link: Flags [.], ack 77, win 14600, length 0
20:09:16.374815 IP 10.0.23.152.http > 10.0.19.244.commplex-link: Flags [P.], seq 1:846, ack 77, win 14600, length 845
20:09:16.383233 IP 10.0.19.244.commplex-link > 10.0.23.152.http: Flags [.], ack 846, win 65535, length 0
20:09:16.383273 IP 10.0.19.244.commplex-link > 10.0.23.152.http: Flags [F.], seq 77, ack 846, win 65535, length 0
20:09:16.383470 IP 10.0.23.152.http > 10.0.19.244.commplex-link: Flags [F.], seq 846, ack 78, win 14600, length 0
20:09:16.390583 IP 10.0.19.244.commplex-link > 10.0.23.152.http: Flags [.], ack 847, win 65535, length 0

Comparing, 20:09:16.368457 is earlier than 20:09:16.368513: the LVS connects to the backend as soon as it receives the ACK that completes the client's handshake.

 

Case 3: synproxy off (this case was tested before; see here).

Capture on the LVS VIP:

# tcpdump -n -i em1 host 60.28.208.10 and port 80

20:37:46.231923 IP 111.206.15.131.53519 > 60.28.208.10.http: Flags [S], seq 4290864149, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 256524656 ecr 0,sackOK,eol], length 0
20:37:46.232248 IP 60.28.208.10.http > 111.206.15.131.53519: Flags [S.], seq 401558380, ack 4290864150, win 14600, options [mss 1452,nop,nop,sackOK,nop,wscale 7], length 0
20:37:46.241856 IP 111.206.15.131.53519 > 60.28.208.10.http: Flags [.], ack 1, win 8192, length 0
20:37:46.242187 IP 111.206.15.131.53519 > 60.28.208.10.http: Flags [P.], seq 1:77, ack 1, win 8192, length 76
20:37:46.242397 IP 60.28.208.10.http > 111.206.15.131.53519: Flags [.], ack 77, win 115, length 0
20:37:46.247565 IP 60.28.208.10.http > 111.206.15.131.53519: Flags [P.], seq 1:846, ack 77, win 115, length 845
20:37:46.262458 IP 111.206.15.131.53519 > 60.28.208.10.http: Flags [.], ack 846, win 8165, length 0
20:37:46.262480 IP 111.206.15.131.53519 > 60.28.208.10.http: Flags [F.], seq 77, ack 846, win 8192, length 0
20:37:46.262713 IP 60.28.208.10.http > 111.206.15.131.53519: Flags [F.], seq 846, ack 78, win 115, length 0
20:37:46.272600 IP 111.206.15.131.53519 > 60.28.208.10.http: Flags [.], ack 847, win 8192, length 0

Capture on the backend's internal IP:

# tcpdump -n -i em2 host 10.0.23.152 and port 80

20:37:46.231972 IP 10.0.19.227.commplex-link > 10.0.23.152.http: Flags [S], seq 1490789410, win 65535, options [Unknown Option 200d10f6fce0f83,mss 1460,nop,wscale 5,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,nop,sackOK,eol], length 0
20:37:46.232212 IP 10.0.23.152.http > 10.0.19.227.commplex-link: Flags [S.], seq 401558380, ack 1490789411, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
20:37:46.241875 IP 10.0.19.227.commplex-link > 10.0.23.152.http: Flags [.], ack 1, win 8192, options [Unknown Option 200d10f6fce0f83], length 0
20:37:46.242199 IP 10.0.19.227.commplex-link > 10.0.23.152.http: Flags [P.], seq 1:77, ack 1, win 8192, options [Unknown Option 200d10f6fce0f83], length 76
20:37:46.242381 IP 10.0.23.152.http > 10.0.19.227.commplex-link: Flags [.], ack 77, win 115, length 0
20:37:46.247551 IP 10.0.23.152.http > 10.0.19.227.commplex-link: Flags [P.], seq 1:846, ack 77, win 115, length 845
20:37:46.262471 IP 10.0.19.227.commplex-link > 10.0.23.152.http: Flags [.], ack 846, win 8165, length 0
20:37:46.262486 IP 10.0.19.227.commplex-link > 10.0.23.152.http: Flags [F.], seq 77, ack 846, win 8192, length 0
20:37:46.262691 IP 10.0.23.152.http > 10.0.19.227.commplex-link: Flags [F.], seq 846, ack 78, win 115, length 0
20:37:46.272614 IP 10.0.19.227.commplex-link > 10.0.23.152.http: Flags [.], ack 847, win 8192, length 0

20:37:46.231923 is earlier than 20:37:46.231972: the LVS starts connecting to the backend as soon as it receives the client's first SYN. Look closely and you can see the data simply "passing through" the LVS.

 

References:

http://shanks.blog.51cto.com/3899909/1387615

http://ljlion.sinaapp.com/?p=355

A big pitfall in the Nginx ssl config used for LVS health checks

Following up on an earlier pitfall I ran into, this one was a bit bigger: it took down several of our most important domains for 6 min. Truly heartbreaking.

Look at the following config used for the LVS health check:

server {
    listen 80 default_server;
    listen 443 ssl default_server;

    # Do not delete it, otherwise lvs http check will fail, it's terrible!!!
    location /http_check/index.html {
        root /home/work/nginx/conf/;
        open_file_cache off;
    }

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 10.0.0.0/8;
        deny all;
    }
}

 

The culprit is the ssl after listen 443. If the ssl is missing here, and no other Nginx config contains a listen 443 ssl either, the LVS health check fails with a 400, and the Nginx log shows gibberish like this:

10.0.19.34 - - [05/Feb/2015:00:01:05 +0800] "\x16\x03\x01\x01\x02\x01\x00\x00\xFE\x03\x03T\xD2BA\xC2\x03G\xCB\xF1M\xCE\xD5\
xB5\xF0 \xB8e\x0F\x1Cwh\x06%,\xDB\xFD\xD7\x0C\xE9\xD8\xBB\xCA\x00\x00\x94\xC00\xC0,\xC0(\xC0$\xC0\x14\xC0" 400 588 "-" "-"
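
(The leading \x16\x03\x01 is the record header of a TLS ClientHello: the health checker speaks TLS, but without ssl on the listen directive Nginx parses those bytes as a malformed plain-HTTP request, hence the 400.) What the checker sees can be reproduced by hand against a backend, e.g. (backend IP borrowed from the next section, purely illustrative):

# openssl s_client -connect 10.0.11.54:443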

 

Today I "got lucky": a config change happened to trigger exactly this 400 on the LVS SSL check. As the link above explains, when the LVS health check for some port fails on every backend of a VIP, the VIP itself is removed, and the other ports on that VIP go down with it. That is how several important domains became unreachable today. Mea culpa.

 

An LVS outage caused by closing port 443 on the Nginx machines

Today, while decommissioning port 443 on Nginx, the perfectly healthy port 80 service went down as well, and the LVS in front of Nginx logged the following:

Dec 20 16:44:04 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.54]:443.
Dec 20 16:44:04 lb16 Keepalived_healthcheckers: Removing service [10.0.11.54]:443 from VS [125.39.216.11_443]:0
Dec 20 16:44:04 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.71]:443.
Dec 20 16:44:04 lb16 Keepalived_healthcheckers: Removing service [10.0.11.71]:443 from VS [123.150.178.140_443]:0
Dec 20 16:44:04 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.71]:443.
Dec 20 16:44:04 lb16 Keepalived_healthcheckers: Removing service [10.0.11.71]:443 from VS [125.39.216.11_443]:0
Dec 20 16:44:09 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.54]:443.
Dec 20 16:44:09 lb16 Keepalived_healthcheckers: Removing service [10.0.11.54]:443 from VS [123.150.178.140_443]:0
Dec 20 16:44:09 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.56]:443.
Dec 20 16:44:09 lb16 Keepalived_healthcheckers: Removing service [10.0.11.56]:443 from VS [123.150.178.140_443]:0
Dec 20 16:44:13 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.56]:443.
Dec 20 16:44:13 lb16 Keepalived_healthcheckers: Removing service [10.0.11.56]:443 from VS [125.39.216.11_443]:0
Dec 20 16:44:23 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.25]:443.
Dec 20 16:44:23 lb16 Keepalived_healthcheckers: Removing service [10.0.11.25]:443 from VS [123.150.178.140_443]:0
Dec 20 16:44:23 lb16 Keepalived_healthcheckers: Lost quorum 1-0=1 > 0 for VS [123.150.178.140_443]:0
Dec 20 16:44:23 lb16 Keepalived_healthcheckers: Executing [ip addr del 123.150.178.140/32 dev lo] for VS [123.150.178.140_443]:0
Dec 20 16:44:24 lb16 Keepalived_healthcheckers: SSL handshake/communication error connecting to server (openssl errno: 1) [10.0.11.25]:443.
Dec 20 16:44:24 lb16 Keepalived_healthcheckers: Removing service [10.0.11.25]:443 from VS [125.39.216.11_443]:0
Dec 20 16:44:24 lb16 Keepalived_healthcheckers: Lost quorum 1-0=1 > 0 for VS [125.39.216.11_443]:0
Dec 20 16:44:24 lb16 Keepalived_healthcheckers: Executing [ip addr del 125.39.216.11/32 dev lo] for VS [125.39.216.11_443]:0

 

So once the LVS health check on the backends' port 443 failed everywhere, it deleted the corresponding VIPs; reloading LVS brought everything back.

This is a big pitfall. The LVS automatic failover works like this:

1. An LVS VIP can carry multiple ports. When the check for one port fails on every backend (i.e. that port is down everywhere), LVS deletes the VIP, and normal traffic to the VIP's other ports starts failing too.

2. If some port is already down on all backends at the moment LVS is reloaded, the VIP is not deleted, and services on the other ports keep working.

So be very careful when removing a port from the Nginx config: check the LVS side first, or you may cause a serious outage.
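
A hedged pre-flight routine before removing a port from Nginx: first list what the VIP still serves, then confirm the VIP is still present on lo after the change (the VIP below is taken from the logs above, purely illustrative):

# ipvsadm -Ln | grep -A 20 125.39.216.11
# ip addr show dev lo | grep 125.39.216.11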