irpas技术客

记一次zookeeper连接慢的问题和解决方法_clooker_zookeeper 连接慢

网络投稿 5389

1. 环境

服务器:CentOS Linux release 7.9.2009 (Core) zookeeper:3.6.3 java:1.8.0_231

2. 问题

zookeeper客户端连接到zookeeper节点有20秒的延迟,而且每次都是20秒。自己编写客户端进行测试,通过日志看到在某个环节上会出现阻塞,导致延迟。

15:49:12.424 [main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=x.x.x.x:2181 sessionTimeout=5000 watcher=zk.ZKTest$1@393671df 15:49:12.427 [main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation 15:49:12.431 [main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 1048575 Bytes 15:49:12.436 [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false 15:49:32.465 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server x.x.x.x/x.x.x.x:2181. 15:49:32.465 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error) 15:49:32.477 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.20.140.23:56244, server: x.x.x.x/x.x.x.x:2181 15:49:32.498 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server x.x.x.x/x.x.x.x:2181, session id = 0x201249468f70009, negotiated timeout = 5000 3. 解决过程 3.1 编写测试客户端,客户端只是连接到zookeeper package zk; import org.apache.zookeeper.WatchedEvent; import org.apache.zookeeper.Watcher; import org.apache.zookeeper.ZooKeeper; import org.apache.zookeeper.client.ConnectStringParser; import org.apache.zookeeper.client.StaticHostProvider; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.net.InetSocketAddress; import java.text.SimpleDateFormat; import java.util.List; import java.util.concurrent.CountDownLatch; /** * @author cc * @function * @date 2021/7/22 15:14 */ public class ZKTest { private static final Logger LOG = LoggerFactory.getLogger(ZKTest.class); public static void main(String[] args) throws Exception { ZKTest main = new ZKTest(); main.run(args[0]); } String date() { return new SimpleDateFormat("HH:mm:ss.SSS").format(System.currentTimeMillis()); } void run(String zkIps) throws Exception { CountDownLatch latch = new CountDownLatch(1); LOG.info("running..."); StaticHostProvider providers = new StaticHostProvider(new ConnectStringParser(zkIps).getServerAddresses()); InetSocketAddress address = providers.next(1000); LOG.debug(" " + address.getAddress().getHostAddress()); address.getHostName(); LOG.info(" init StaticHostProvider done"); try (ZooKeeper zoo = new ZooKeeper(zkIps, 5000, new Watcher() { @Override public void process(WatchedEvent event) { if (event.getState() == Event.KeeperState.SyncConnected) { latch.countDown();; LOG.info(" connected"); return; } System.out.println(event.toString()); } })) { latch.await(); for (int i=0; i<10; i++) { List<String> results = zoo.getChildren("/ledgers/00/0000", null); if (results != null) { LOG.info(" result size " + results.size()); } Thread.sleep(1000); } } } } 3.2 查看连接日志,确认在哪一个环节上出现阻塞

15:49:12.436 [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=false

15:49:32.465 [main-SendThread(x.x.x.x:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server x.x.x.x/x.x.x.x:2181.

结合程序日志,可以看到上边两个日志之间阻塞了20秒。

3.3 查看源代码

调用过程:

// 1.程序初始化 ZooKeeper zoo = new ZooKeeper(zkIps, 5000, new Watcher()); // 2.初始化zookeeper对象 public ZooKeeper( String connectString, int sessionTimeout, Watcher watcher, boolean canBeReadOnly, HostProvider aHostProvider, ZKClientConfig clientConfig) throws IOException { LOG.info( "Initiating client connection, connectString={} sessionTimeout={} watcher={}", connectString, sessionTimeout, watcher); if (clientConfig == null) { clientConfig = new ZKClientConfig(); } this.clientConfig = clientConfig; watchManager = defaultWatchManager(); watchManager.defaultWatcher = watcher; ConnectStringParser connectStringParser = new ConnectStringParser(connectString); hostProvider = aHostProvider; cnxn = createConnection( connectStringParser.getChrootPath(), hostProvider, sessionTimeout, this, watchManager, getClientCnxnSocket(), canBeReadOnly); cnxn.start(); } // 3.查看createConnection方法 // @VisibleForTesting protected ClientCnxn createConnection( String chrootPath, HostProvider hostProvider, int sessionTimeout, ZooKeeper zooKeeper, ClientWatchManager watcher, ClientCnxnSocket clientCnxnSocket, boolean canBeReadOnly) throws IOException { return new ClientCnxn( chrootPath, hostProvider, sessionTimeout, this, watchManager, clientCnxnSocket, canBeReadOnly); } // 4.查看ClientCnxn初始化过程 public ClientCnxn( String chrootPath, HostProvider hostProvider, int sessionTimeout, ZooKeeper zooKeeper, ClientWatchManager watcher, ClientCnxnSocket clientCnxnSocket, long sessionId, byte[] sessionPasswd, boolean canBeReadOnly) throws IOException { this.zooKeeper = zooKeeper; this.watcher = watcher; this.sessionId = sessionId; this.sessionPasswd = sessionPasswd; this.sessionTimeout = sessionTimeout; this.hostProvider = hostProvider; this.chrootPath = chrootPath; connectTimeout = sessionTimeout / hostProvider.size(); readTimeout = sessionTimeout * 2 / 3; readOnly = canBeReadOnly; sendThread = new SendThread(clientCnxnSocket); eventThread = new EventThread(); this.clientConfig = zooKeeper.getClientConfig(); initRequestTimeout(); } // 5.查看initRequestTimeout方法,可以看到出现阻塞的前一条日志在这个方法中打印。然后返回到第二步中,查看ClientCnxn线程启动的部分(cnxn.start())。 private void initRequestTimeout() { try { requestTimeout = clientConfig.getLong( ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT, ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT_DEFAULT); LOG.info( "{} value is {}. feature enabled={}", ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT, requestTimeout, requestTimeout > 0); } catch (NumberFormatException e) { LOG.error( "Configured value {} for property {} can not be parsed to long.", clientConfig.getProperty(ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT), ZKClientConfig.ZOOKEEPER_REQUEST_TIMEOUT); throw e; } } // 6.查看线程启动方法(run),这里我们只看前面一部分代码。当线程启动后,会调用startConnect方法创建到zookeeper的socket连接,下面查看这个方法的内容。 @Override public void run() { clientCnxnSocket.introduce(this, sessionId, outgoingQueue); clientCnxnSocket.updateNow(); clientCnxnSocket.updateLastSendAndHeard(); int to; long lastPingRwServer = Time.currentElapsedTime(); final int MAX_SEND_PING_INTERVAL = 10000; //10 seconds InetSocketAddress serverAddress = null; while (state.isAlive()) { try { if (!clientCnxnSocket.isConnected()) { // don't re-establish connection if we are closing if (closing) { break; } if (rwServerAddress != null) { serverAddress = rwServerAddress; rwServerAddress = null; } else { serverAddress = hostProvider.next(1000); } onConnecting(serverAddress); startConnect(serverAddress); clientCnxnSocket.updateLastSendAndHeard(); } // 7.通过下面两个方法,我们可以确定zookeeper尝试创建ZooKeeperSaslClient连接,但是失败了。而最上边的连接日志显示是创建连接失败以上的代码造成了阻塞,由此可以确认是创建ZooKeeperSaslClient的时候阻塞了,所以需要查看ZooKeeperSaslClient创建过程。 private void startConnect(InetSocketAddress addr) throws IOException { // initializing it for new connection saslLoginFailed = false; if (!isFirstConnect) { try { Thread.sleep(ThreadLocalRandom.current().nextLong(1000)); } catch (InterruptedException e) { LOG.warn("Unexpected exception", e); } } changeZkState(States.CONNECTING); String hostPort = addr.getHostString() + ":" + addr.getPort(); MDC.put("myid", hostPort); setName(getName().replaceAll("\\(.*\\)", "(" + hostPort + ")")); if (clientConfig.isSaslClientEnabled()) { try { if (zooKeeperSaslClient != null) { zooKeeperSaslClient.shutdown(); } zooKeeperSaslClient = new ZooKeeperSaslClient(SaslServerPrincipal.getServerPrincipal(addr, clientConfig), clientConfig); } catch (LoginException e) { // An authentication error occurred when the SASL client tried to initialize: // for Kerberos this means that the client failed to authenticate with the KDC. // This is different from an authentication error that occurs during communication // with the Zookeeper server, which is handled below. LOG.warn( "SASL configuration failed. " + "Will continue connection to Zookeeper server without " + "SASL authentication, if Zookeeper server allows it.", e); eventThread.queueEvent(new WatchedEvent(Watcher.Event.EventType.None, Watcher.Event.KeeperState.AuthFailed, null)); saslLoginFailed = true; } } logStartConnect(addr); clientCnxnSocket.connect(addr); } private void logStartConnect(InetSocketAddress addr) { LOG.info("Opening socket connection to server {}.", addr); if (zooKeeperSaslClient != null) { LOG.info("SASL config status: {}", zooKeeperSaslClient.getConfigStatus()); } } // 8.通过查看ZooKeeperSaslClient创建过程代码,可以看到程序在调用SaslServerPrincipal.getServerPrincipal(addr, clientConfig)方法的时候使用了getHostName方法... static String getServerPrincipal(WrapperInetSocketAddress addr, ZKClientConfig clientConfig) { String configuredServerPrincipal = clientConfig.getProperty(ZKClientConfig.ZOOKEEPER_SERVER_PRINCIPAL); if (configuredServerPrincipal != null) { // If server principal is already configured then return it return configuredServerPrincipal; } String principalUserName = clientConfig.getProperty( ZKClientConfig.ZK_SASL_CLIENT_USERNAME, ZKClientConfig.ZK_SASL_CLIENT_USERNAME_DEFAULT); // 此方法会在某些情况下造成阻塞(会对ip进行dns反向解析,如果dns服务不好用或者连接不上,则出现阻塞)。 String hostName = addr.getHostName(); 3.4 问题修复

通过上述过程,可以确定是zookeeper默认使用了ZooKeeperSaslClient,而这个过程中调用了getHostName方法,从而造成程序阻塞。

所以可以通过在hosts文件添加上zookeeper的ip地址,使得getHostName可以直接使用hosts里面的dns结果。

在/etc/hosts文件中添加记录,记录格式如下: ip1 host1 ip2 hosts2

4. 总结

问题排查过程比较繁琐,核心思想就是通过修改服务器、程序参数,确定延迟产生的环节,然后结合代码查看具体的处理过程。

以后一定谨慎使用InetSocketAddress的getHostName方法,这个方法会对传进去的字符串进行域名解析,即使是一个ip地址也会进行解析,而dns解析的快慢和服务器配置有关系。


1.本站遵循行业规范,任何转载的稿件都会明确标注作者和来源;2.本站的原创文章,会注明原创字样,如未注明都非原创,如有侵权请联系删除!;3.作者投稿可能会经我们编辑修改或补充;4.本站不提供任何储存功能只提供收集或者投稿人的网盘链接。

标签: #zookeeper #连接慢 #1 #环境服务器CentOS #Linux #release #792009