etcd
001-Installation
References:
002-Common etcd interfaces
* View etcd contents
etcd backup and restore plan
003-Jin Xin's etcd summary
Etcd testing and analysis
Etcd data storage research
Etcd data storage research
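004-etcd connection storm optimization
Problem
Testing showed that when a large number of clients connect to the etcd server at the same time, many connection errors occur and recovery takes a long time.
The connection performance results are as follows:
- Single client: each connection takes 200 ms, and the server-side etcd process CPU usage rises by 55%.
- Multiple clients under load: the connection rate holds steady at about 9 QPS, with server-side etcd process CPU usage around 110%.
Based on these figures, with 40,000 agent clients, if the whole etcd cluster (5 nodes) loses its network or goes down, reconnecting all agents takes at least 15 minutes, during which the etcd service is highly unstable.
40000 / (9 × 5) ≈ 889 s ≈ 15 min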
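The estimate above can be reproduced with a small sketch. It assumes (as the formula implies) that the 9 QPS figure is per node, that agents spread evenly across the 5 nodes, and that there are no retries or backoff; the function name `reconnect_seconds` is made up for illustration.

```python
def reconnect_seconds(clients: int, qps_per_node: float, nodes: int) -> float:
    """Lower bound on the time for all clients to reconnect.

    Assumes every node authenticates connections at the same steady rate
    and that clients are spread evenly (an idealized assumption).
    """
    return clients / (qps_per_node * nodes)


# Figures taken from the measurements above: 40,000 agents, 9 QPS, 5 nodes.
secs = reconnect_seconds(40_000, 9, 5)
print(f"{secs:.0f} s = {secs / 60:.1f} min")
```

Because real clients retry failed connections, the actual recovery time is likely to be longer than this lower bound.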
Cause
Profiling during the connection storm showed that CPU time is spent mainly in the blowfish.encryptBlock function:
[root@sndspstdb52 ~]# perf top
67.01% etcd [.] etcd-3.3.13/cmd/vendor/golang.org/x/crypto/blowfish.encryptBlock
4.06% [kernel] [k] _spin_unlock_irqrestore
3.68% [kernel] [k] finish_task_switch
3.06% etcd [.] etcd-3.3.13/cmd/vendor/golang.org/x/crypto/blowfish.ExpandKey
2.34% [kernel] [k] find_busiest_group
1.74% [kernel] [k] iowrite16
1.33% [kernel] [k] __do_softirq
0.58% etcd [.] runtime.mallocgc
0.50% [kernel] [k] __rcu_process_callbacks
0.49% [kernel] [k] _spin_lock
0.45% [kernel] [k] pvclock_clocksource_read
0.43% etcd [.] runtime.findrunnable
0.42% [kernel] [k] rcu_process_gp_end
0.40% etcd [.] runtime.selectgo
0.37% [kernel] [k] rcu_process_callbacks
0.34% [kernel] [k] system_call_after_swapgs
0.33% etcd [.] runtime.lock
0.33% etcd [.] runtime.heapBitsSetType
0.31% [kernel] [k] rebalance_domains
0.31% [kernel] [k] tick_nohz_stop_sched_tick
0.28% etcd [.] runtime.deferreturn
etcd uses the bcrypt (Blowfish-based) algorithm for password verification, as the stack trace below shows:
[root@sndspstdb52 ~]# pstack $etcd_pid
Thread 9 (Thread 0x7feb8abfd700 (LWP 2339)):
#0 0x000000000097b9da in etcd-3.3.13/cmd/vendor/golang.org/x/crypto/blowfish.encryptBlock ()
#1 0x000000000097b120 in etcd-3.3.13/cmd/vendor/golang.org/x/crypto/blowfish.ExpandKey ()
#2 0x000000000097dc34 in etcd-3.3.13/cmd/vendor/golang.org/x/crypto/bcrypt.expensiveBlowfishSetup ()
#3 0x000000000097d935 in etcd-3.3.13/cmd/vendor/golang.org/x/crypto/bcrypt.bcrypt ()
#4 0x000000000097d076 in etcd-3.3.13/cmd/vendor/golang.org/x/crypto/bcrypt.CompareHashAndPassword ()
#5 0x00000000009828d7 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/auth.(authStore).CheckPassword ()
#6 0x0000000000ad6cda in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/etcdserver.(EtcdServer).Authenticate ()
#7 0x0000000000b75395 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(AuthServer).Authenticate ()
#8 0x00000000008d9196 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._Auth_Authenticate_Handler.func1 ()
#9 0x0000000000b53151 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1 ()
#10 0x0000000000b5721e in etcd-3.3.13/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(ServerMetrics).UnaryServerInterceptor.func1 ()
#11 0x0000000000b530d9 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1 ()
#12 0x0000000000b80968 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1 ()
#13 0x0000000000b530d9 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1 ()
#14 0x0000000000b80b5e in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1 ()
#15 0x0000000000b532f3 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1 ()
#16 0x000000000088beb8 in etcd-3.3.13/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._Auth_Authenticate_Handler ()
#17 0x00000000008568dc in etcd-3.3.13/cmd/vendor/google.golang.org/grpc.(Server).processUnaryRPC ()
#18 0x00000000008598f5 in etcd-3.3.13/cmd/vendor/google.golang.org/grpc.(Server).handleStream ()
#19 0x000000000085f66f in etcd-3.3.13/cmd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1 ()
#20 0x000000000045ea31 in runtime.goexit ()
#21 0x000000c024d42e60 in ?? ()
#22 0x000000c0000f8780 in ?? ()
#23 0x0000000001018580 in ?? ()
#24 0x000000c0139b54a0 in ?? ()
#25 0x000000c01e07a100 in ?? ()
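#26 0x0000000000000000 in ?? ()
Further investigation showed that bcrypt's high CPU cost is deliberate: it is a security measure that makes brute-forcing passwords harder, and the cost can be tuned through a parameter.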
https://segmentfault.com/q/1010000003054250
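Quoted from the answer linked above: MD5 hashing is fast; when passwords consist only of lowercase letters and digits, a reasonably good PC can enumerate every password within 40 s.
bcrypt is slow, but not too slow for verifying a single user's password; for brute-force enumeration, however, it is very slow. bcrypt is built on the Blowfish cipher and introduces a work factor that lets you decide how expensive the algorithm is. As a result, the algorithm does not get cheaper to attack as CPUs get faster, because you can raise the work factor to slow it back down.
Note: with password authentication disabled, the connection performance problem does not occur.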
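To make the work factor concrete: bcrypt's cost is the base-2 logarithm of the number of key-setup iterations, so each step down in cost halves the work. The sketch below only does that arithmetic; it does not call a bcrypt library, and the function names are illustrative.

```python
def bcrypt_key_expansions(cost: int) -> int:
    """Number of expensive key-setup iterations bcrypt runs: 2**cost."""
    if not 4 <= cost <= 31:
        raise ValueError("bcrypt cost must be between 4 and 31")
    return 1 << cost


def relative_work(cost: int, baseline: int = 10) -> float:
    """Work at `cost` relative to work at `baseline` (10 is the usual default)."""
    return bcrypt_key_expansions(cost) / bcrypt_key_expansions(baseline)


for cost in (4, 10, 12):
    print(f"cost={cost:2d} iterations={bcrypt_key_expansions(cost):5d} "
          f"relative to cost 10: {relative_work(cost):.4g}x")
```

Under this model, cost 4 does 1/64 of the hashing work of cost 10, which is why lowering the cost trades brute-force resistance for login throughput.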
Optimization
etcd 3.4.0, the latest release at the time of writing, provides a bcrypt-cost flag that adjusts the cost of bcrypt.
[root@sndspstdb51 etcd-v3.4.0-linux-amd64]# ./etcd --help
Auth:
...
--bcrypt-cost 10
Specify the cost / strength of the bcrypt algorithm for hashing auth passwords. Valid values are between 4 and 31.
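The test results are as follows:
| bcrypt-cost | Single-client connect time | Single-client connect CPU | Multi-client connect QPS |
| --- | --- | --- | --- |
| 10 (default) | 100 ms | 99% | 10 |
| 4 | 6 ms | 25% | 600 |
With bcrypt-cost=4, starting 4,000 agents that connect simultaneously produced no connection errors or blocking.
The optimization plan is therefore:
- Use etcd 3.4.0
- Start etcd with the flag --bcrypt-cost=4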
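As a rough sanity check on these results, the sketch below compares the measured gains with the theoretical 2^(10-4) reduction in bcrypt work and re-estimates the full-cluster reconnect time. It assumes the QPS figures are per node and that 40,000 agents spread evenly over 5 nodes, as in the earlier estimate; all variable names are illustrative.

```python
AGENTS = 40_000  # agent count from the problem statement
NODES = 5        # etcd cluster size from the problem statement

# Measurements copied from the results table, keyed by bcrypt-cost.
measured = {
    10: {"connect_ms": 100, "qps": 10},
    4: {"connect_ms": 6, "qps": 600},
}

# bcrypt work scales as 2**cost, so cost 10 -> 4 is a 64x reduction in theory.
theoretical_gain = 2 ** (10 - 4)
qps_gain = measured[4]["qps"] / measured[10]["qps"]
latency_gain = measured[10]["connect_ms"] / measured[4]["connect_ms"]

# Idealized reconnect time per setting (no retries, even client spread).
recovery = {cost: AGENTS / (m["qps"] * NODES) for cost, m in measured.items()}

print(f"theoretical work reduction: {theoretical_gain}x")
print(f"measured QPS gain: {qps_gain:.0f}x, latency gain: {latency_gain:.1f}x")
for cost, secs in recovery.items():
    print(f"bcrypt-cost={cost}: about {secs:.0f} s to reconnect all agents")
```

The measured throughput gain is close to the theoretical factor, which supports the conclusion that bcrypt dominated connection cost. Since bcrypt stores the cost inside each password hash, it may be worth checking whether existing users' passwords need to be set again after changing the flag; the notes here do not say.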
005-etcd client
Start etcd grpc-proxy
etcd grpc-proxy start --endpoints=http://10.244.208.3:2379,http://10.244.208.4:2379,http://10.244.208.5:2379 --listen-addr=10.242.4.92:2370
Start etcd watch
./watch -etcd=10.242.4.92:2370,10.242.4.93:2370