GEODE-8553: Fix inter-locks in ThinClientLocator #660
Conversation
- ThinClientLocatorHelper uses a mutex called m_locatorLock, which is held for the entire scope of all the public member functions of this class.
- The problem is that all of those functions perform some network communication, so in the event of networking issues all the application threads end up inter-locked.
- Also, the only purpose of this mutex is to prevent the locators list from being read while it is being updated.
- Given all of the above, the class has been refactored so that every time the locators list has to be accessed, a copy of it is created, and that copy is created only while owning the mutex.
- For updates, a new locators list is composed, and only at the end of the update function is the mutex locked and the locators list swapped with the new one.
- This change ensures that the time the mutex is held is minimal while the functionality remains intact.
Force-pushed from 47e108e to 6453bf9
I'll tell you, I want this - but can we get a unit test demonstrating what you're trying to fix? I need to see a demonstration of the before and after so we can further capture edge cases in tests. We need to be cautious of unintended breaking changes in behavior.
Looks good overall, and existing tests all pass. Got a few practical questions and/or small improvements to make, but otherwise looks good. I agree with Matt's opinion, though, let's figure out how we can write a test that shows this code working with the copy of the locators list and not blowing up.
Regarding the general comment about the missing test ensuring this PR is not breaking anything, let me give it a thought and I will come back with a proposal :)
- Refactored ConnectionWrapper so it is move-constructible.
- The locators list is now initialized using the iterator-range constructor.
- Added a getConnRetries utility function.
- Changed the mutex to a shared mutex.
- The getLocators function now returns a shuffled list each time.
- Cleaned up repeated code. As a result, createConnection has been refactored and a new function called sendRequest has been created.
- Implemented the proposal for locator connection retries, treating the locators list as a circular buffer.
- TODO: add tests to verify functionality is not broken.
Hi again. Regarding the test proposal, I've got an idea I would like to share with you in order to get your opinion on the matter. The purpose of this PR is to avoid that, whenever a thread is connecting to a locator or performing a request, other threads accessing the pool locator get locked. Therefore, what I thought is to:
PS: While seeking a solution for the test proposal I think I've noticed something: "Is it possible that many of the factory classes existing in NC should instead be builders?" Ok, so having set up all the above scaffolding, the idea would then be to create a ThinClientLocatorHelper test suite which tries to perform requests from multiple threads. Having the ability to control what is returned by the connections and how much time is spent on a connection operation, you can then check that while one thread is waiting for an operation to complete, another thread can return immediately. A pseudo-code example would be the following:

```
testGetAllServerWhileUpdating
  updateT = createUpdateThread(connector_sleep_time=10s);
  ASSERT_EQ(getAllServersT.wait_for(1s), std::future_status::ready);
```

So, what do you think about it? BR
I'll think about your idea today and get back to you. My first instinct is that it's a much larger change than even the initial one, however, so it may be beyond the scope of what I'm comfortable with, absent a lot more testing etc. In the meantime, your last commit broke our Windows build, maybe due to some platform difference in boost(?). Here's the error output:
I'll try a build on a Windows instance here and let you know if I find out what's up before you do. Thanks!
- Added missing include to fix compilation for Windows.
Regarding the size of the PR, I totally agree. Indeed, I usually tend to make many more but much smaller commits, but I don't know what your way of working is here. I guess we could always split this PR, right? Regarding the Windows error, sorry, my mistake. The thing is, Boost.Thread includes different headers depending on the platform: the pthread implementation implicitly includes boost/thread/lock_types.hpp, but the Windows implementation does not. I've already pushed a commit solving it. Btw, do you have a publicly available CI cluster or somewhere where tests can be executed for every supported platform? Thanks for everything :)
@gaussianrecurrence Yes, we have a CI pipeline for all the builds/tests; we're working out how to donate it and make it public. In the meantime, we run all incoming PR branches through it to avoid unpleasant surprises. I'm now seeing a ton of test failures on Windows:

```
18% tests passed, 86 tests failed out of 105

Label Time Summary:
Total Test time (real) = 3604.23 sec

The following tests did not run:
The following tests FAILED:
```
Oh my... let me see if I can set up an environment on Windows and run all the tests. Sorry for the inconvenience :S
I'm running tests on a Windows EC2 instance to confirm it's a real thing and not just a hiccup in our infrastructure.
I'm hitting the same test results on a Windows VM, so something major is wrong on that platform. |
Hi. The thing is, it seems GCC's implementation of the standard differs from MSVC's and it avoids calling the move constructor, which is why the tests were passing on Linux. Will upload a fix ASAP.
- Fixed move semantics for ConnectionWrapper. - Changed signed types to unsigned types.
This pull request introduces 2 alerts when merging 66c5679 into 2e64e74 - view on LGTM.com

new alerts:
- Fix LGTM issues
@gaussianrecurrence Still seeing a clang-format failure in ThinClientLocatorHelper.hpp, see Travis results above. Note that you need clang-format v6 or v7 to be compatible with the v6 we're using in Travis. When you have the right version installed on your dev machine, from a clean repo run
- Fix formatting
Yes, sorry, I made the change quickly and forgot that :S
Windows build still hitting an issue:
So sorry @pdxcodemonkey, |
If you push the build fix to your branch, my pipeline will run build/test for all platforms automatically... |
- Fixed type conversion in getStats().setLocators
Oh, that's nice. In either case, I completed the execution of the tests. The thing is, the problem seems to be that ip6-localhost does not exist on my host. I modified it so it uses ::1 as the binding address, just to see if that was the problem, but it failed due to ACE not being able to find the host. Which leads me to think. Summarizing: "In theory all tests should pass except for the IPv6 one"
Re: BasicIPV6Test, you're right, this one won't pass unless your build is compiled WITH_IPV6 and your network is configured to route IPv6. We've never had a test bed that worked for this one, so in our CI we use the
Please let me know how you feel about the pending topic (the tests): whether I should come up with an intermediate test, or maybe make a prior PR to set up all the necessary scaffolding for connection tests in the future. Thanks 🙂
Codecov Report
```
@@           Coverage Diff            @@
##           develop     #660   +/-  ##
=========================================
  Coverage    74.04%   74.04%
=========================================
  Files          644      644
  Lines        51189    51086    -103
=========================================
- Hits         37903    37828     -75
+ Misses       13286    13258     -28
=========================================
```
Continue to review full report at Codecov.
@gaussianrecurrence Is this ready for re-review?
Sure. The only thing left here is the testing part, for which I would really appreciate any ideas. Thanks for putting some time into this :)
Hi
- Added a new integration test to verify that threads no longer inter-lock.
- The test works as follows:
  * Set a 5s connect-timeout.
  * Deploy 1 locator.
  * Add 1 non-existent locator to the pool.
  * Send 64 concurrent requests to get the server list.
  * By probability, if there are no inter-locks, ~50% +- E% of the requests should complete in a brief period of time.
  * Note that E% is the error allowance percentage, which is set to 10%.
  * Therefore a 10-second timeout is set in order to reach the request-completed threshold.
  * The test passes if the number of requests completed before the timeout is greater than or equal to the threshold; otherwise it fails.
Ok @pdxcodemonkey and @pivotal-jbarrett, I think I came up with an integration test for this change. It works as follows:
It does not cover everything, but I would appreciate your feedback to see if I am on the right track. Thanks!
- Fix Travis issues caused by a missing newline at the end of LocatorRequestsTest
Sorry to bother you again; it has been quite a long time since this had any feedback. Is there anything else you would like me to do to get this ready? Thanks!
Unfortunately I don't have bandwidth to review this right now. Perhaps @pdxcodemonkey can.
- ThinClientLocatorHelper uses a mutex called m_locatorLock, which is held for the entire scope of all the public member functions of this class.
- The problem is that all of those functions perform some network communication, so in the event of networking issues all the application threads end up inter-locked.
- Also, the only purpose of this mutex is to prevent the locators list from being read while it is being updated.
- Given all of the above, the class has been refactored so that every time the locators list has to be accessed, a copy of it is created, and that copy is created only while owning the mutex.
- For updates, a new locators list is composed, and only at the end of the update function is the mutex locked and the locators list swapped with the new one.
- This change ensures that the time the mutex is held is minimal while the functionality remains intact.
- Refactored ConnectionWrapper so it is move-constructible.
- The locators list is now initialized using the iterator-range constructor.
- Added a getConnRetries utility function.
- Changed the mutex to a shared mutex.
- The getLocators function now returns a shuffled list each time.
- Cleaned up repeated code. As a result, createConnection has been refactored and a new function called sendRequest has been created.
- Implemented the proposal for locator connection retries, treating the locators list as a circular buffer.
- Added a new integration test to verify that threads no longer inter-lock.
- The test works as follows:
  * Set a 5s connect-timeout.
  * Deploy 1 locator.
  * Add 1 non-existent locator to the pool.
  * Send 64 concurrent requests to get the server list.
  * By probability, if there are no inter-locks, ~50% +- E% of the requests should complete in a brief period of time.
  * Note that E% is the error allowance percentage, which is set to 10%.
  * Therefore a 10-second timeout is set in order to reach the request-completed threshold.
  * The test passes if the number of requests completed before the timeout is greater than or equal to the threshold; otherwise it fails.

Jira: ADPPRG-34718