
GEODE-8553: Fix inter-locks in ThinClientLocator #660

Merged
merged 11 commits into from Nov 17, 2020

Conversation

gaussianrecurrence
Contributor

  • ThinClientLocatorHelper uses a mutex called m_locatorLock, which is
    held for the entire scope of all of this class's public member
    functions.
  • The problem is that all of those functions perform network
    communication, so in the event of networking issues, all the
    application threads end up inter-locked.
  • Also, the only purpose of this mutex is to prevent the locators list
    from being read while it is being updated.
  • Given all of the above, the class has been refactored so that every
    time the locators list has to be accessed, a copy of it is created,
    with that copy created only while owning the mutex.
  • For updates, a new locators list is composed, and only at the end of
    the update function is the mutex locked and the locators list swapped
    with the new one.
  • This change ensures that the time the mutex is held is minimal while
    the functionality remains intact.
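The copy-under-lock and swap-at-the-end strategy described above can be sketched like this (a minimal illustration with hypothetical names, not the actual geode-native code):

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Sketch of the locking strategy: readers copy the list while holding the
// mutex, and the updater builds a new list outside the lock, taking the
// mutex only for the final swap. Names are illustrative only.
class LocatorList {
 public:
  // Readers hold the mutex only long enough to copy the list; any slow
  // network I/O then happens on the private copy.
  std::vector<std::string> getLocators() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return locators_;  // copy created while owning the mutex
  }

  // The updater composes the new list without the lock and only locks for
  // the swap, so the mutex is held for a minimal amount of time.
  void updateLocators(std::vector<std::string> fresh) {
    std::lock_guard<std::mutex> lock(mutex_);
    locators_.swap(fresh);
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::string> locators_;
};
```

Readers pay the cost of one vector copy, but slow network I/O never happens while the mutex is held, so application threads no longer serialize behind it.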


@mreddington mreddington left a comment


I'll tell you, I want this - but can we get a unit test demonstrating what you're trying to fix? I need to see a demonstration of the before and after so we can further capture edge cases in tests. We need to be cautious of unintended breaking changes in behavior.

Contributor

@pdxcodemonkey pdxcodemonkey left a comment


Looks good overall, and existing tests all pass. Got a few practical questions and/or small improvements to make, but otherwise looks good. I agree with Matt's opinion, though, let's figure out how we can write a test that shows this code working with the copy of the locators list and not blowing up.

@gaussianrecurrence
Contributor Author

Regarding the general comment about the missing test ensuring this PR doesn't break anything, let me give it some thought and I'll come back with a proposal :)

 - Refactored ConnectionWrapper so it is move-constructible.
 - The locators list is now initialized using the iterator-range constructor.
 - Added a getConnRetries utility function.
 - Changed the mutex to a shared mutex.
 - The getLocators function now returns a shuffled list each time.
 - Cleaned up repeated code. As a result, createConnection has been
   refactored and a new function called sendRequest has been created.
 - Implemented the proposal for locator connection retries, treating the
   locators list as a circular buffer.
 - TODO: Add tests to verify functionality is not broken.
@gaussianrecurrence
Contributor Author

gaussianrecurrence commented Oct 2, 2020

Hi again,

Regarding the test proposal, I've got an idea I would like to share with you in order to get your opinion on the matter.

The purpose of this PR is to ensure that whenever a thread is connecting to a locator or performing a request, other threads accessing the pool's locators don't get locked.

Therefore what I thought is to:

  • Introduce a new abstract class, ConnectorFactory, which basically creates a new Connector.
  • Create a ConnectorFactory implementation called TcpConnFactory which basically creates TcpConn/TcpSslConn. This would also help reduce duplicated code, as connection creation is currently duplicated in both TcrConnection and ThinClientLocatorHelper.
  • PoolFactory would be the one responsible for setting the ConnectorFactory instance on the created pool, so a method to set the ConnectorFactory instance would be added there.
  • ConnectorFactory should then be accessible through the Pool.
  • Create a Google Mock for ConnectorFactory.
  • Create a Google Mock for Connector.

PS. While seeking a solution for the test proposal, I think I've noticed something: "Is it possible that many of the factory classes existing in NC should instead be builders?"

Ok, so having set up all of the above scaffolding, the idea would be to create a ThinClientLocatorHelper test suite which tries to perform requests from multiple threads. Having the ability to control what's returned by the connections and how much time is spent on a connection operation, you can then check that while one thread is waiting for an operation to complete, another thread can immediately return.

A pseudo-code example would be the following:

testGetAllServerWhileUpdating
```
createCache();
createPoolFactoryWithCustomConnectorFactory();

updateT = createUpdateThread(connector_sleep_time=10s);
getAllServersT = createGetAllServersThread(connector_sleep_time=0s);

ASSERT_EQ(getAllServersT.wait_for(1s), std::future_status::ready);
```


So, what do you think about it?
What I don't know is where I would put the test; for that I might need a hand from your side :)

BR
/Mario

@pdxcodemonkey
Contributor

I'll think about your idea today and get back to you. My first instinct is that it's a much larger change than even the initial one, however, so it may be beyond the scope of what I'm comfortable with, absent a lot more testing etc.

In the meantime, your last commit broke our Windows build, maybe due to some platform difference in boost(?). Here's the error output:

  ThinClientPoolStickyDM.cpp
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(73): error C2039: 'shared_lock': is not a member of 'boost' [C:\build\cppcache\static\apache-geode-static.vcxproj]
  C:\build\dependencies\boost\RelWithDebInfo\include\boost/thread/shared_mutex.hpp(36): note: see declaration of 'boost'
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(73): error C2065: 'shared_lock': undeclared identifier [C:\build\cppcache\static\apache-geode-static.vcxproj]
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(73): error C2226: syntax error: unexpected type 'boost::shared_mutex' [C:\build\cppcache\static\apache-geode-static.vcxproj]
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(73): error C2143: syntax error: missing ';' before '{' [C:\build\cppcache\static\apache-geode-static.vcxproj]
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(73): error C2143: syntax error: missing ';' before '}' [C:\build\cppcache\static\apache-geode-static.vcxproj]
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(303): error C2039: 'unique_lock': is not a member of 'boost' [C:\build\cppcache\static\apache-geode-static.vcxproj]
  C:\build\dependencies\boost\RelWithDebInfo\include\boost/thread/shared_mutex.hpp(36): note: see declaration of 'boost'
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(303): error C2065: 'unique_lock': undeclared identifier [C:\build\cppcache\static\apache-geode-static.vcxproj]
C:\nativeclient\cppcache\src\ThinClientLocatorHelper.cpp(303): error C2226: syntax error: unexpected type 'boost::shared_mutex' [C:\build\cppcache\static\apache-geode-static.vcxproj]

I'll try a build on a Windows instance here, and let you know if I find out what's up before you. Thanks!

 - Added missing include to fix compilation for Windows.
@gaussianrecurrence
Contributor Author


Regarding the size of the PR, I totally agree. I usually tend to make many more but much smaller commits, but I don't know what your way of working is here. I guess we could always split this PR, right?

Regarding the Windows error, sorry, my mistake. The thing is that Boost.Thread includes different headers depending on the platform: the pthread implementation implicitly includes boost/thread/lock_types.hpp, but the Windows implementation does not. I've already pushed a commit solving it.

Btw, do you have a publicly available CI cluster or somewhere where the tests can be executed for every supported platform?
Or are we supposed to execute them all manually?

Thanks for everything :)

@pdxcodemonkey
Contributor

@gaussianrecurrence Yes we have a CI pipeline for all the builds/tests, we're working out how to donate and make it public. In the meantime, we run all incoming PR branches through it to avoid unpleasant surprises.

I'm now seeing a ton of test failures on Windows:

18% tests passed, 86 tests failed out of 105

Label Time Summary:
FLAKY   =     0.00 sec*proc (42 tests)
OMITTED =     0.00 sec*proc (20 tests)
QUICK   =    60.32 sec*proc (15 tests)
STABLE  = 13222.11 sec*proc (105 tests)

Total Test time (real) = 3604.23 sec

The following tests did not run:
7 - testFwPerf (Disabled)
11 - testOverflowPutGetSqLite (Disabled)
22 - testThinClientAfterRegionLive (Disabled)
25 - testThinClientCacheables (Disabled)
31 - testThinClientCq (Disabled)
33 - testThinClientCqDurable (Disabled)
34 - testThinClientCqFailover (Disabled)
35 - testThinClientCqHAFailover (Disabled)
44 - testThinClientDurableConnect (Disabled)
47 - testThinClientDurableDisconnectNormal (Disabled)
48 - testThinClientDurableDisconnectTimeout (Disabled)
49 - testThinClientDurableFailoverClientClosedNoRedundancy (Disabled)
51 - testThinClientDurableFailoverClientNotClosedRedundancy (Disabled)
54 - testThinClientDurableKeepAliveFalseTimeout (Disabled)
55 - testThinClientDurableKeepAliveTrueNormal (Disabled)
57 - testThinClientDurableReconnect (Disabled)
58 - testThinClientFailover (Disabled)
59 - testThinClientFailover2 (Disabled)
61 - testThinClientFailoverInterest (Disabled)
62 - testThinClientFailoverInterest2 (Disabled)
64 - testThinClientFailoverRegex (Disabled)
65 - testThinClientFixedPartitionResolver (Disabled)
66 - testThinClientGatewayTest (Disabled)
68 - testThinClientHADistOps (Disabled)
69 - testThinClientHAEventIDMap (Disabled)
70 - testThinClientHAFailover (Disabled)
71 - testThinClientHAFailoverRegex (Disabled)
72 - testThinClientHAMixedRedundancy (Disabled)
74 - testThinClientHAQueryFailover (Disabled)
86 - testThinClientLRUExpiration (Disabled)
87 - testThinClientLargePutAllWithCallBackArg (Disabled)
95 - testThinClientLocatorFailover (Disabled)
96 - testThinClientMultiDS (Disabled)
100 - testThinClientPRPutAllFailover (Disabled)
101 - testThinClientPRSingleHop (Disabled)
103 - testThinClientPartitionResolver (Disabled)
104 - testThinClientPdxDeltaWithNotification (Disabled)
109 - testThinClientPdxTests (Disabled)
110 - testThinClientPoolAttrTest (Disabled)
114 - testThinClientPoolLocator (Disabled)
115 - testThinClientPoolRedundancy (Disabled)
117 - testThinClientPoolServer (Disabled)
118 - testThinClientPutAll (Disabled)
119 - testThinClientPutAllPRSingleHop (Disabled)
122 - testThinClientPutAllWithCallBackArgWithoutConcurrency (Disabled)
124 - testThinClientPutWithDelta (Disabled)
135 - testThinClientRemoteQueryTimeout (Disabled)
140 - testThinClientRemoveOps (Disabled)
141 - testThinClientSecurityAuthentication (Disabled)
142 - testThinClientSecurityAuthenticationMU (Disabled)
144 - testThinClientSecurityAuthorization (Disabled)
145 - testThinClientSecurityAuthorizationMU (Disabled)
146 - testThinClientSecurityCQAuthorizationMU (Disabled)
147 - testThinClientSecurityDurableCQAuthorizationMU (Disabled)
149 - testThinClientSecurityPostAuthorization (Disabled)
150 - testThinClientTXFailover (Disabled)
151 - testThinClientTicket303 (Disabled)
152 - testThinClientTicket304 (Disabled)
154 - testThinClientTracking (Disabled)
157 - testThinClientTransactionsXA (Disabled)
162 - testThinClientWriterException (Disabled)
163 - testTimedSemaphore (Disabled)

The following tests FAILED:
1 - testCacheless (Timeout)
12 - testPdxMetadataCheckTest (Failed)
14 - testRegionAccessThreadSafe (Failed)
18 - testSerialization (Failed)
23 - testThinClientBigValue (Failed)
24 - testThinClientCacheableStringArray (Failed)
26 - testThinClientCacheablesLimits (Failed)
27 - testThinClientCallbackArg (Failed)
28 - testThinClientClearRegion (Failed)
29 - testThinClientConflation (Failed)
30 - testThinClientContainsKeyOnServer (Failed)
32 - testThinClientCqDelta (Failed)
36 - testThinClientCqIR (Failed)
37 - testThinClientDeltaWithNotification (Failed)
38 - testThinClientDisconnectionListioner (Failed)
39 - testThinClientDistOps2 (Failed)
40 - testThinClientDistOpsDontUpdateLocatorList (Failed)
41 - testThinClientDistOpsNotSticky (Failed)
42 - testThinClientDistOpsSticky (Failed)
43 - testThinClientDistOpsUpdateLocatorList (Failed)
45 - testThinClientDurableCrashNormal (Failed)
46 - testThinClientDurableCrashTimeout (Failed)
50 - testThinClientDurableFailoverClientClosedRedundancy (Failed)
52 - testThinClientDurableInterest (Failed)
53 - testThinClientDurableKeepAliveFalseNormal (Failed)
56 - testThinClientDurableKeepAliveTrueTimeout (Failed)
60 - testThinClientFailover3 (Failed)
63 - testThinClientFailoverInterestAllWithCache (Failed)
67 - testThinClientGetInterests (Failed)
73 - testThinClientHAPeriodicAck (Failed)
75 - testThinClientHeapLRU (Failed)
76 - testThinClientIntResPolKeysInv (Failed)
77 - testThinClientInterest1 (Failed)
78 - testThinClientInterest1Cacheless (Failed)
79 - testThinClientInterest1_Bug1001 (Failed)
80 - testThinClientInterest2Pooled (Failed)
81 - testThinClientInterest3 (Failed)
82 - testThinClientInterest3Cacheless (Failed)
83 - testThinClientInterestList (Failed)
84 - testThinClientInterestList2 (Failed)
85 - testThinClientInterestNotify (Timeout)
88 - testThinClientListenerCallbackArgTest (Failed)
89 - testThinClientListenerEvents (Failed)
90 - testThinClientListenerInit (Failed)
91 - testThinClientListenerWriterWithSubscription (Failed)
92 - testThinClientListenerWriterWithoutSubscription (Failed)
94 - testThinClientLocator (Failed)
97 - testThinClientMultipleCaches (Failed)
98 - testThinClientNotification (Failed)
99 - testThinClientNotificationWithDeltaWithoutcache (Failed)
102 - testThinClientPRSingleHopServerGroup (Failed)
105 - testThinClientPdxEnum (Failed)
106 - testThinClientPdxInstance (Failed)
107 - testThinClientPdxSerializer (Failed)
108 - testThinClientPdxSerializerForJava (Failed)
111 - testThinClientPoolExecuteFunctionThrowsException (Failed)
112 - testThinClientPoolExecuteHAFunction (Failed)
113 - testThinClientPoolExecuteHAFunctionPrSHOP (Failed)
116 - testThinClientPoolRegInterest (Failed)
120 - testThinClientPutAllTimeout (Failed)
121 - testThinClientPutAllWithCallBackArgWithConcurrency (Failed)
123 - testThinClientPutGetAll (Failed)
125 - testThinClientRIwithlocalRegionDestroy (Failed)
126 - testThinClientRegex (Failed)
127 - testThinClientRegex2 (Failed)
128 - testThinClientRegex3 (Failed)
129 - testThinClientRegionQueryDifferentServerConfigs (Failed)
130 - testThinClientRegionQueryExclusiveness (Failed)
131 - testThinClientRemoteQueryFailover (Failed)
132 - testThinClientRemoteQueryFailoverPdx (Failed)
133 - testThinClientRemoteQueryRS (Failed)
134 - testThinClientRemoteQuerySS (Failed)
136 - testThinClientRemoteRegionQuery (Failed)
137 - testThinClientRemoveAll (Failed)
139 - testThinClientRemoveAllSequence (Failed)
143 - testThinClientSecurityAuthenticationSetAuthInitialize (Failed)
148 - testThinClientSecurityMultiUserTest (Failed)
153 - testThinClientTicket317 (Failed)
155 - testThinClientTransactionsWithSticky (Failed)
156 - testThinClientTransactionsWithoutSticky (Failed)
158 - testThinClientVersionedOps (Failed)
159 - testThinClientVersionedOpsPartitionPersistent (Failed)
160 - testThinClientVersionedOpsReplicate (Failed)
161 - testThinClientVersionedOpsReplicatePersistent (Failed)
165 - testXmlCacheCreationWithPools (Timeout)
167 - testXmlCacheInitialization (Timeout)
Errors while running CTest

@gaussianrecurrence
Contributor Author

Oh my... let me see if I can set up an environment on Windows and run all the tests. Sorry for the inconvenience :S

@pdxcodemonkey
Contributor

I'm running tests on a Windows EC2 instance to confirm it's a real thing and not just a hiccup in our infrastructure.

@pdxcodemonkey
Contributor

I'm hitting the same test results on a Windows VM, so something major is wrong on that platform.

@gaussianrecurrence
Contributor Author

Hi,
Ok... I finally got it. After fighting with VS and the old testing framework for a couple of days, I located the problem.
The problem has to do with move semantics and ConnectionWrapper. The thing is, for some reason I forgot how to program and implemented the move of the conn_ attribute just as conn_(std::move(other.conn_)); so whenever the original object's conn_ pointer is not nullptr, the pointer is destroyed by the original's destructor, leaving the second object pointing to an invalid address.

The thing is, it seems GCC elides the call to the move constructor where MSVC does not, and that's why the tests were passing on Linux.

I'll upload a fix ASAP.
BR,
Mario.
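The bug described here boils down to the fact that moving a raw pointer just copies it, so both wrappers ended up deleting the same connection. A sketch of the corrected move constructor (simplified names, not the actual ConnectionWrapper):

```cpp
#include <cassert>
#include <utility>

// Simplified stand-in for the connection type.
struct Connection {
  bool open = true;
};

// Sketch of the fix: conn_(std::move(other.conn_)) alone copies the raw
// pointer, and the moved-from object's destructor then deletes it, leaving
// the moved-to object dangling. Nulling out the source pointer transfers
// ownership correctly.
class ConnectionWrapper {
 public:
  explicit ConnectionWrapper(Connection* conn) : conn_(conn) {}

  // Correct move: take the pointer and leave the source empty, so its
  // destructor no longer deletes the connection we now own.
  ConnectionWrapper(ConnectionWrapper&& other) noexcept : conn_(other.conn_) {
    other.conn_ = nullptr;
  }

  ConnectionWrapper(const ConnectionWrapper&) = delete;
  ConnectionWrapper& operator=(const ConnectionWrapper&) = delete;

  ~ConnectionWrapper() { delete conn_; }  // deleting nullptr is a no-op

  Connection* get() const { return conn_; }

 private:
  Connection* conn_;
};
```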

 - Fixed move semantics for ConnectionWrapper.
 - Changed signed types to unsigned types.
@lgtm-com

lgtm-com bot commented Oct 4, 2020

This pull request introduces 2 alerts when merging 66c5679 into 2e64e74 - view on LGTM.com

new alerts:

  • 2 for Wrong type of arguments to formatting function

 - Fix LGTM issues
@pdxcodemonkey
Contributor

@gaussianrecurrence Still seeing a clangformat failure in ThinClientLocatorHelper.hpp, see Travis results above. Note that you need clangformat v6 or v7 to be compatible with the v6 we're using in Travis. When you have the right version installed on your dev machine, from a clean repo run cmake --build . --target all-clangformat, then check in any formatting changes the tool has made. Thanks!

 - Fix formatting
@gaussianrecurrence
Contributor Author


Yes, sorry I made the change quickly and forgot that :S

@pdxcodemonkey
Contributor

Windows build still hitting an issue:

C:\nativeclient\cppcache\src\ThinClientPoolDM.cpp(572): error C2220: warning treated as error - no 'object' file generated [C:\build\cppcache\static\apache-geode-static.vcxproj]
C:\nativeclient\cppcache\src\ThinClientPoolDM.cpp(572): warning C4267: 'argument': conversion from 'size_t' to 'int32_t', possible loss of data [C:\build\cppcache\static\apache-geode-static.vcxproj]
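MSVC's C4267 above fires when a 64-bit size_t is implicitly narrowed to int32_t; the usual fix is an explicit cast at the call site. A minimal sketch (the helper and the setLocators context are assumed for illustration, not the actual geode-native change):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: an explicit static_cast documents the intentional narrowing and
// silences C4267, e.g. for something like getStats().setLocators(...).
int32_t toLocatorCount(const std::vector<std::string>& locators) {
  return static_cast<int32_t>(locators.size());
}
```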

@gaussianrecurrence
Contributor Author

So sorry @pdxcodemonkey,
I'll fix the issue and compile/run the tests on all platforms, just to be sure. It might take a while.

@pdxcodemonkey
Contributor

If you push the build fix to your branch, my pipeline will run build/test for all platforms automatically...

 - Fixed type conversion in getStats().setLocators
@gaussianrecurrence
Contributor Author


Oh, that's nice. In either case, I completed the execution of the tests.
They are all passing except for this new integration test:
BasicIPv6Test.queryResultForRange

The thing is, the problem seems to be that ip6-localhost does not exist on my host. I modified the test so it uses ::1 as the binding address, just to see if that was the problem, but it failed due to ACE not being able to find the host. Which leads me to think:
Shouldn't the IPv6 tests be disabled if the WITH_IPV6 CMake option is not enabled?
Or should I always compile with IPV6 in order to run the tests?

Summarizing: "In theory all tests should pass except for the IPv6 one"

@pdxcodemonkey
Contributor

Re: BasicIPV6Test, you're right this one won't pass unless your build is compiled WITH_IPV6 and your network is configured to route IPv6. We've never had a test bed that worked for this one, so in our CI we use the ctest -E switch to avoid running the test.

@gaussianrecurrence
Contributor Author

  • Branch rebased.
  • clang-format passed.
  • Compilation ran OK on both Linux and Windows.
  • All tests passed on both Linux and Windows.

Please let me know your thoughts on the pending topic (the tests): whether you think I should come up with an intermediate test, or maybe make a prior PR to set up all the necessary scaffolding for writing connection tests in the future.

Thanks 🙂

@codecov-io

codecov-io commented Oct 11, 2020

Codecov Report

Merging #660 (ebb7cb8) into develop (0d9a99d) will increase coverage by 0.00%.
The diff coverage is 94.65%.

Impacted file tree graph

@@            Coverage Diff            @@
##           develop     #660    +/-   ##
=========================================
  Coverage    74.04%   74.04%            
=========================================
  Files          644      644            
  Lines        51189    51086   -103     
=========================================
- Hits         37903    37828    -75     
+ Misses       13286    13258    -28     
Impacted Files Coverage Δ
cppcache/src/ThinClientLocatorHelper.cpp 93.87% <94.44%> (+7.76%) ⬆️
cppcache/src/ThinClientLocatorHelper.hpp 100.00% <100.00%> (ø)
cppcache/src/ThinClientPoolDM.cpp 76.32% <100.00%> (-0.06%) ⬇️
cppcache/src/ReadWriteLock.cpp 62.50% <0.00%> (-18.75%) ⬇️
cppcache/src/PdxTypeRegistry.cpp 77.94% <0.00%> (-2.21%) ⬇️
cppcache/src/ClientMetadataService.cpp 62.24% <0.00%> (-0.46%) ⬇️
cppcache/src/ThinClientRedundancyManager.cpp 76.09% <0.00%> (-0.32%) ⬇️
cppcache/src/TcrConnection.cpp 72.95% <0.00%> (+0.47%) ⬆️
cppcache/src/TcrEndpoint.cpp 55.11% <0.00%> (+0.56%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d9a99d...ebb7cb8. Read the comment docs.

@pdxcodemonkey
Contributor

@gaussianrecurrence Is this ready for re-review?

@gaussianrecurrence
Contributor Author


Sure. The only thing left here is the testing part, for which I would really appreciate any ideas.
As I told you, we could even make a prior PR in which we add the necessary scaffolding to test this.
Or maybe there is another, much simpler way that I did not come up with.

Thanks for putting some time into this :)

@gaussianrecurrence
Contributor Author

Hi,
Sorry to bring up the topic again, but @mreddington, @pdxcodemonkey or @pivotal-jbarrett, did you have a chance to look into it?

 - Added a new integration test to verify that threads don't inter-lock anymore.
 - The test works as follows:
   * Set a 5s connect-timeout.
   * Deploy 1 locator.
   * Add 1 non-existent locator to the pool.
   * Send 64 concurrent requests to get the server list.
   * By probability, if there are no inter-locks, ~50% ± E% of the requests
     should have completed within a brief period of time.
   * Note that E% is the error allowance percentage, which is set to 10%.
   * Therefore, a 10-second timeout is set in order to reach the
     request-completion threshold.
   * The test passes if the number of requests completed before the timeout
     is greater than or equal to the threshold; otherwise it fails.
@gaussianrecurrence
Contributor Author

Ok @pdxcodemonkey and @pivotal-jbarrett, I think I came up with an integration test for this change. It works as follows:

  • Set a 5s connect-timeout.
  • Deploy 1 locator.
  • Add 1 non-existent locator to the pool.
  • Send 64 concurrent requests to get the server list.
  • By probability, if there are no inter-locks, ~50% ± E% of the requests
    should have completed within a brief period of time.
  • Note that E% is the error allowance percentage, which is set to 10%.
  • Therefore, a 10-second timeout is set in order to reach the
    request-completion threshold.
  • The test passes if the number of requests completed before the timeout
    is greater than or equal to the threshold; otherwise it fails.

It does not cover everything, but I would appreciate your feedback in order to see if I am in the right direction. Thanks!
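The timing argument behind this test can be sketched with plain standard-library futures (a simplified stand-in with scaled-down durations; the real test drives an actual pool against a live locator):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Simplified model of the integration test: launch many concurrent
// "getAllServers" requests where roughly half hit a dead locator (simulated
// by a long sleep standing in for the connect-timeout). If requests don't
// inter-lock on a shared mutex, the fast half still completes well before
// the deadline. Durations are scaled down from the real 5s/10s values.
int completedWithin(std::chrono::milliseconds timeout, int requests) {
  std::vector<std::future<void>> futures;
  for (int i = 0; i < requests; ++i) {
    // Even requests reach the live locator instantly; odd ones block on the
    // dead locator's connect-timeout.
    auto delay = (i % 2 == 0) ? std::chrono::milliseconds(0)
                              : std::chrono::milliseconds(500);
    futures.push_back(std::async(std::launch::async, [delay] {
      std::this_thread::sleep_for(delay);
    }));
  }
  int completed = 0;
  auto deadline = std::chrono::steady_clock::now() + timeout;
  for (auto& f : futures) {
    if (f.wait_until(deadline) == std::future_status::ready) {
      ++completed;
    }
  }
  return completed;
}
```

With the old whole-function lock, a single request stuck on the dead locator would have stalled every other request past the deadline; without it, the fast half finishes almost immediately.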

 - Fix Travis issues having to do with a missing newline at the end of
   LocatorRequestsTest
@gaussianrecurrence
Contributor Author

Sorry to bother you again; it has been quite a long time since this had any feedback. Is there anything else you would like me to do in order to get this ready?

Thanks!

Contributor

@jake-at-work jake-at-work left a comment


Unfortunately I don't have bandwidth to review this right now. Perhaps @pdxcodemonkey can.

@jake-at-work jake-at-work dismissed their stale review November 12, 2020 19:20

Don't have time to re-review.

@pdxcodemonkey pdxcodemonkey merged commit 72455cd into apache:develop Nov 17, 2020
gaussianrecurrence added a commit to Nordix/geode-native that referenced this pull request Nov 20, 2020
Jira: ADPPRG-34718
albertogpz pushed a commit to Nordix/geode-native that referenced this pull request Nov 20, 2020