CASSANDRA-18560

Incorrect IP used for gossip across DCs with prefer_local=true

Details

    • Correctness - Transient Incorrect Response
    • Critical
    • Normal
    • User Report
    • All
    • None

    Description

      After installing a new node running 4.0.10, we found that the new node attempted to connect to the private IPs of a seemingly random subset of nodes in remote DCs, which are only reachable via public IP for cross-DC communication.

      The only impact was on the new node's outbound connections; inbound connections from pre-4.0.10 nodes were not affected. system.peers_v2 (below) showed preferred_ip and preferred_port as null for these peers; only peers in the 4.0.10 node's own DC had preferred_ip values, as expected.

      We believe the issue originated with https://issues.apache.org/jira/browse/CASSANDRA-16718 

      Details on cluster:

      • All nodes have public IP configured as well as private IP
      • Listen/rpc addresses are configured for the private IP; broadcast is the public IP
      • prefer_local=true is enabled for all nodes

      The log that showed the connection failing:

      INFO  [Messaging-EventLoop-3-8] 2023-06-01 00:14:21,565 NoSpamLogger.java:92 - /99.81.<redacted>:7000->/44.208.<redacted>:7000-URGENT_MESSAGES-[no-channel] failed to connect
      io.netty.channel.ConnectTimeoutException: connection timed out: /10.26.5.11:7000
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)

      The 99.81.<redacted> and 44.208.<redacted> instances can reach each other only via their public IPs.
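      The behaviour we expected from prefer_local=true can be sketched as below. This is only an illustration of the expectation, not Cassandra's actual code; the function and parameter names are hypothetical, and "dc-local" stands in for the new node's DC, which is not named in this report.

```python
def expected_connect_address(local_dc, peer_dc, peer_broadcast, peer_internal):
    """Return the address we expected a node to dial for a given peer.

    Assumption: with prefer_local=true, the internal (private) address is
    used only for peers in the same DC; every other peer is reached via its
    broadcast (public) address.
    """
    if peer_dc == local_dc and peer_internal is not None:
        return peer_internal
    return peer_broadcast


# Remote-DC peer from the log above: the public address is expected,
# but the node dialed 10.26.5.11 instead.
print(expected_connect_address(
    "dc-local", "us-east-1", "44.208.<redacted>", "10.26.5.11"))
```

      Under that assumption, the remote-DC peer should have been dialed on its public address, which is the opposite of what the log shows.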

      gossipinfo output from 4.0.10 node:

      /44.208.<redacted>
        generation:1661113358
        heartbeat:25267691
        LOAD:25267683:1.7882044268E10
        SCHEMA:24692061:e98b918d-499f-3ccc-8dbe-5af31f685bda
        DC:13:us-east-1
        RACK:15:1a
        RELEASE_VERSION:6:4.0.5
        NET_VERSION:2:12
        HOST_ID:3:9a41e668-060d-4cfe-bb1e-013f5116422d
        RPC_READY:1407:true
        INTERNAL_ADDRESS_AND_PORT:9:10.26.5.11:7000
        NATIVE_ADDRESS_AND_PORT:4:44.208.<redacted>:9042
        STATUS_WITH_PORT:1393:NORMAL,-2262036356854762881
        SSTABLE_VERSIONS:7:big-nb
        TOKENS:1392:<hidden> 
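      To list which remote-DC endpoints advertise a private INTERNAL_ADDRESS_AND_PORT, gossipinfo output in the layout shown above can be scanned with a short script. This is a rough sketch: both helper names are hypothetical, and it assumes endpoint lines start with "/" and state lines look like KEY:version:value, as in the excerpt above.

```python
def parse_gossipinfo(text):
    """Map each endpoint to a dict of its gossip states (values as strings)."""
    endpoints = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            # Endpoint header line, e.g. "/44.208.<redacted>"
            current = endpoints.setdefault(line[1:], {})
        elif current is not None and ":" in line:
            key, rest = line.split(":", 1)
            if key in ("generation", "heartbeat"):
                # These two lines carry no version field.
                current[key] = rest
            else:
                # Application states look like KEY:version:value.
                _, _, value = rest.partition(":")
                current[key] = value
    return endpoints


def remote_internal_addresses(endpoints, local_dc):
    """Return (endpoint, internal address) pairs for peers outside local_dc."""
    return [
        (endpoint, states["INTERNAL_ADDRESS_AND_PORT"])
        for endpoint, states in endpoints.items()
        if states.get("DC") != local_dc and "INTERNAL_ADDRESS_AND_PORT" in states
    ]
```

      Any pair this returns is a peer whose private address the local node should not be dialing directly.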

      Peers output from 4.0.10 node:

       peer              | peer_port | data_center | host_id                              | native_address    | native_port | preferred_ip | preferred_port | rack | release_version | schema_version                       | tokens
      -------------------+-----------+-------------+--------------------------------------+-------------------+-------------+--------------+----------------+------+-----------------+--------------------------------------+--------
       44.208.<redacted> |      7000 |   us-east-1 | 9a41e668-060d-4cfe-bb1e-013f5116422d | 44.208.<redacted> |        9042 |         null |           null |   1a |           4.0.5 | e98b918d-499f-3ccc-8dbe-5af31f685bda | {'-2262036356854762881', '-4197710115038136897', '-7072386316096662315', '2085255826742630980', '249732489387853170', '4976300208126705818', '7187184456885833289', '8777189009399731927'}

      As a temporary workaround, we used iptables to redirect outbound traffic destined for the remote nodes' private IPs to their public IPs, which resulted in successful outbound connections.
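      The exact rules are not recorded in this report. One plausible form of such a workaround is a DNAT rule in the nat table's OUTPUT chain for each affected peer. The sketch below only builds the command strings and does not run them; the helper name is hypothetical and 203.0.113.10 is a documentation placeholder, not the real (redacted) public IP.

```python
def dnat_rules(private_to_public, port=7000):
    """Build iptables commands that rewrite locally generated traffic
    for each private IP to the matching public IP on the storage port.

    Assumption: a per-peer DNAT rule in the nat/OUTPUT chain; the rules
    actually used are not stated in the report.
    """
    rules = []
    for private_ip, public_ip in sorted(private_to_public.items()):
        rules.append(
            f"iptables -t nat -A OUTPUT -d {private_ip} -p tcp --dport {port} "
            f"-j DNAT --to-destination {public_ip}:{port}"
        )
    return rules


# Placeholder mapping: private IP from the log -> hypothetical public IP.
for rule in dnat_rules({"10.26.5.11": "203.0.113.10"}):
    print(rule)
```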

            People

              brandon.williams Brandon Williams
              bvernon Brad Vernon
              Brandon Williams