Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-10403

Distributed deadlock when stopping gateway sender

    XMLWordPrintableJSON

Details

    Description

      A distributed deadlock has been found during some tests of a Geode system with WAN replication when stopping the gateway sender while sending a fair amount of operations to the servers.

      The distributed deadlock manifests in the gateway sender stop command hanging forever and by all normal Geode operations from clients (gets, puts,...) not being responded.
      The situation is provoked by the Gateway sender stop command that first takes the lifecycle lock and then, at a given point, tries to retrieve the size of the gateway sender. This operation, that requires communication with the other peers never finishes, probably because the response from one of the peers is never received.
      Another thread is blocked when trying to acquire the lifecycle lock in AbstractGatewaySender.distribute().
      Finally many threads handling Geode operations (get, put...) get blocked in the DistributedCacheOperation._distribute() call waiting for a response from another peer.

      Thread dump section from blocked gateway sender stop command in call to get size of queue:

      "ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x00007f92bc1c2000 nid=0x2154 waiting on condition  [0x00007f9179cd2000]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
              - parking to wait for  <0x000000031ca2be50> (a java.util.concurrent.CountDownLatch$Sync)
              at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
              at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
              at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
              at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
              at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
              at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)
              at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)
              at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)
              at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)
              at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
              at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
              at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)
              at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)
              at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)
              at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)
              at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)
              at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)
              at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
              at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

       

      Thread dump section from blocked call to AbstractGatewaySender.distribute() call trying to acquire the lifecycle lock:

      "P2P message reader for 192.168.78.164(eric-data-kvdb-ag-server-0:1)<v31>:41000 shared ordered uid=6 local port=60360 remote port=57246" #56 daemon prio=10 os_prio=0 cpu=462104.83ms elapsed=7095.02s tid=0x00007f93a8007800 nid=0x50 waiting on condition  [0x00007f93e59d0000]
         java.lang.Thread.State: WAITING (parking)
              at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
              - parking to wait for  <0x00000000ed9cb9f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
              at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1009)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1324)
              at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:738)
              at org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1104)
              at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6144)
              at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6108)
              at org.apache.geode.internal.cache.BucketRegion.notifyGatewaySender(BucketRegion.java:719)
              at org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5775)
              at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:704)
              at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$515/0x00000008006e0440.run(Unknown Source)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
              - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
              - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006e0040.run(Unknown Source)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006ca440.run(Unknown Source)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
              at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
              at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)
              at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)
              at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
              at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)
              at org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:194)
              at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)
              at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:278)
              at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1208)
              at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1110)
              at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)
              at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2060)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1826)
              at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$178/0x0000000800380440.messageReceived(Unknown Source)
              at org.apache.geode.distributed.internal.membership.gms.GMSMembership.dispatchMessage(GMSMembership.java:936)
              at org.apache.geode.distributed.internal.membership.gms.GMSMembership.handleOrDeferMessage(GMSMembership.java:867)
              at org.apache.geode.distributed.internal.membership.gms.GMSMembership.processMessage(GMSMembership.java:1209)
              at org.apache.geode.distributed.internal.DistributionImpl$MyDCReceiver.messageReceived(DistributionImpl.java:828)
              at org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:614)
              at org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:679)
              at org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3261)
              at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:2988)
              at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2794)
              at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1648)
              at org.apache.geode.internal.tcp.Connection.run(Connection.java:1479)
              at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

       

      Thread dump section from blocked calls to DistributedCacheOperation._distribute() waiting a for a response from a remote peer:

      "ServerConnection on port 40404 Thread 2" #88 daemon prio=5 os_prio=0 cpu=81268.62ms elapsed=7050.08s tid=0x00007f8160001800 nid=0x73 waiting on condition  [0x00007f8196f57000]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
              - parking to wait for  <0x000000031befad38> (a java.util.concurrent.CountDownLatch$Sync)
              at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)
              at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)
              at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
              at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
              at org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)
              at org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)
              at org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)
              at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:694)
              at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006c9c40.run(Unknown Source)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
              - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
              - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006c9840.run(Unknown Source)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$512/0x00000008006ca440.run(Unknown Source)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
              at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
              at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
              at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)
              at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)
              at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5578)
              at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1213)
              at org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:3005)
              at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2215)
              at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
              at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)
              at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5531)
              at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5210)
              at org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)
              at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)
              at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)
              at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)
              at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1181)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
              at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:691)
              at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$$Lambda$495/0x00000008006be440.invoke(Unknown Source)
              at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:120)
              at org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$166/0x000000080034c040.run(Unknown Source)
              at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

      Attachments

        Issue Links

          Activity

            People

              alberto.gomez Alberto Gomez
              alberto.gomez Alberto Gomez
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: