GEODE-8553: Fix inter-locks in ThinClientLocator #660
Conversation
- ThinClientLocatorHelper uses a mutex called m_locatorLock, which is held for the entire scope of all the public member functions of this class.
- The problem is that all of those functions perform some network communication, so in the event of networking issues all the application threads end up inter-locked.
- Also, the only purpose of this mutex is to prevent the locators list from being read while it is being updated.
- Given all of the above, the class has been refactored so that every time the locators list has to be accessed, a copy of it is created, and that copy is created only while owning the mutex.
- For updates, a new locators list is composed, and only at the end of the update function is the mutex locked and the locators list swapped with the new one.
- This change ensures that the time the mutex is held is minimal while the functionality remains intact.
Force-pushed from 47e108e to 6453bf9
I'll tell you, I want this - but can we get a unit test demonstrating what you're trying to fix? I need to see a demonstration of the before and after so we can further capture edge cases in tests. We need to be cautious of unintended breaking changes in behavior.
Looks good overall, and existing tests all pass. Got a few practical questions and/or small improvements to make, but otherwise looks good. I agree with Matt's opinion, though, let's figure out how we can write a test that shows this code working with the copy of the locators list and not blowing up.
Regarding the general comment about the missing test ensuring this PR is not breaking anything, let me give it a thought and I will come back with a proposal :)
- Refactored ConnectionWrapper so it is move-constructible.
- The locators list is now initialized using the iterator-range constructor.
- Added a getConnRetries utility function.
- Changed the mutex to a shared mutex.
- The getLocators function now returns a shuffled list each time.
- Cleaned up repeated code. As a result, createConnection has been refactored and a new function called sendRequest has been created.
- Implemented the proposal for locator connection retries, treating the locators list as a circular buffer.
- TODO: add tests to verify functionality is not broken.
Hi again. Regarding the test proposal, I've got an idea I would like to share with you in order to get your opinion on the matter. The purpose of this PR is to avoid that, whenever a thread is connecting to a locator or performing a request, other threads accessing the pool locator get locked. Therefore, what I thought is to:
PS: While seeking a solution for the test proposal I think I've noticed something: "Is it possible that many of the factory classes existing in NC should instead be builders?" Ok, so having set up all the above scaffolding, the idea would then be to create a ThinClientLocatorHelper test suite which tries to perform requests from multiple threads. Having the ability to control what is returned by the connections and how much time is spent on a connection operation, you can then check that while one thread is waiting for an operation to complete, another thread can return immediately. A pseudo-code example would be the following:

```
testGetAllServerWhileUpdating
  updateT = createUpdateThread(connector_sleep_time=10s);
  ASSERT_EQ(getAllServersT.wait_for(1s), std::future_status::ready);
```

So, what do you think about it? BR
I'll think about your idea today and get back to you. My first instinct is that it's a much larger change than even the initial one, however, so it may be beyond the scope of what I'm comfortable with, absent a lot more testing etc. In the meantime, your last commit broke our Windows build, maybe due to some platform difference in boost(?). Here's the error output:
I'll try a build on a Windows instance here and let you know if I find out what's up before you do. Thanks!
- Added missing include to fix compilation for Windows.
Regarding the size of the PR, I totally agree. Indeed, I usually tend to make many more but much smaller commits, but I don't know what your way of working is here. I guess we could always split this PR, right? Regarding the Windows error, sorry, my mistake. The thing is, Boost.Thread includes different headers depending on the platform: the pthread implementation implicitly includes boost/thread/lock_types.hpp, but the Windows implementation does not. I've already pushed a commit solving it. Btw, do you have a publicly available CI cluster or somewhere where tests can be executed for every supported platform? Thanks for everything :)
@gaussianrecurrence Yes, we have a CI pipeline for all the builds/tests; we're working out how to donate it and make it public. In the meantime, we run all incoming PR branches through it to avoid unpleasant surprises. I'm now seeing a ton of test failures on Windows:

```
18% tests passed, 86 tests failed out of 105

Label Time Summary:
Total Test time (real) = 3604.23 sec

The following tests did not run:
The following tests FAILED:
```
Oh my... let me see if I can set up an environment on Windows and run all the tests. Sorry for the inconvenience :S
I'm running tests on a Windows EC2 instance to confirm it's a real thing and not just a hiccup in our infrastructure.
I'm hitting the same test results on a Windows VM, so something major is wrong on that platform. |
Hi. The thing is, it seems GCC's implementation of the standard differs from MSVC's and it avoids calling the move constructor, which is why the tests were passing on Linux. Will upload a fix ASAP.
- Fixed move semantics for ConnectionWrapper. - Changed signed types to unsigned types.
This pull request introduces 2 alerts when merging 66c5679 into 2e64e74 - view on LGTM.com

new alerts:
- Fix LGTM issues
@gaussianrecurrence Still seeing a clang-format failure in ThinClientLocatorHelper.hpp, see Travis results above. Note that you need clang-format v6 or v7 to be compatible with the v6 we're using in Travis. When you have the right version installed on your dev machine, from a clean repo run
- Fix formatting
Yes, sorry, I made the change quickly and forgot that :S
Windows build still hitting an issue:
So sorry @pdxcodemonkey, |
If you push the build fix to your branch, my pipeline will run build/test for all platforms automatically... |
- Fixed type conversion in getStats().setLocators
Oh, that's nice. In either case, I completed the execution of the tests. The thing is, the problem seems to be that ip6-localhost does not exist on my host. I modified it so it uses ::1 as the binding address, just to see if that was the problem, but it failed due to ACE not being able to find the host. Which leads me to think. Summarizing: "In theory all tests should pass except for the IPv6 one"
Re: BasicIPV6Test, you're right, this one won't pass unless your build is compiled WITH_IPV6 and your network is configured to route IPv6. We've never had a test bed that worked for this one, so in our CI we use the
Please let me know how you feel about the pending topic (the tests): whether I should come up with an intermediate test, or maybe make a prior PR to set up all the necessary scaffolding for connection tests in the future. Thanks 🙂
Codecov Report
```
@@           Coverage Diff            @@
##           develop     #660   +/-  ##
=========================================
  Coverage    74.04%   74.04%
=========================================
  Files          644      644
  Lines        51189    51086    -103
=========================================
- Hits         37903    37828     -75
+ Misses       13286    13258     -28
=========================================
```
Continue to review full report at Codecov.
@gaussianrecurrence Is this ready for re-review?
Sure. The only thing left here is the testing part, for which I would really appreciate any ideas. Thanks for putting some time into this :)
Hi
- Added a new integration test to verify that threads no longer inter-lock.
- The test works as follows:
  * Set a 5s connect-timeout.
  * Deploy 1 locator.
  * Add 1 non-existent locator to the pool.
  * Send 64 concurrent requests to get the server list.
  * By probability, if there are no inter-locks, ~50% +- E% of the requests should complete in a brief period of time.
  * Note that E% is the error allowance percentage, which is set to 10%.
  * Therefore a 10-second timeout is set in order to reach the request-completed threshold.
  * The test passes if the number of requests completed before the timeout is greater than or equal to the threshold; otherwise it fails.
Ok @pdxcodemonkey and @pivotal-jbarrett, I think I came up with an integration test for this change. It works as follows:
It does not cover everything, but I would appreciate your feedback to see if I am on the right track. Thanks!
- Fix Travis issues caused by a missing newline at the end of LocatorRequestsTest
Sorry to bother you again; it has been quite a long time since this had any feedback. Is there anything else you would like me to do to get this ready? Thanks!
Unfortunately I don't have bandwidth to review this right now. Perhaps @pdxcodemonkey can.
- ThinClientLocatorHelper uses a mutex called m_locatorLock, which is held for the entire scope of all the public member functions of this class.
- The problem is that all of those functions perform some network communication, so in the event of networking issues all the application threads end up inter-locked.
- Also, the only purpose of this mutex is to prevent the locators list from being read while it is being updated.
- Given all of the above, the class has been refactored so that every time the locators list has to be accessed, a copy of it is created, and that copy is created only while owning the mutex.
- For updates, a new locators list is composed, and only at the end of the update function is the mutex locked and the locators list swapped with the new one.
- This change ensures that the time the mutex is held is minimal while the functionality remains intact.
- Refactored ConnectionWrapper so it is move-constructible.
- The locators list is now initialized using the iterator-range constructor.
- Added a getConnRetries utility function.
- Changed the mutex to a shared mutex.
- The getLocators function now returns a shuffled list each time.
- Cleaned up repeated code. As a result, createConnection has been refactored and a new function called sendRequest has been created.
- Implemented the proposal for locator connection retries, treating the locators list as a circular buffer.
- Added a new integration test to verify that threads no longer inter-lock.
- The test works as follows:
  * Set a 5s connect-timeout.
  * Deploy 1 locator.
  * Add 1 non-existent locator to the pool.
  * Send 64 concurrent requests to get the server list.
  * By probability, if there are no inter-locks, ~50% +- E% of the requests should complete in a brief period of time.
  * Note that E% is the error allowance percentage, which is set to 10%.
  * Therefore a 10-second timeout is set in order to reach the request-completed threshold.
  * The test passes if the number of requests completed before the timeout is greater than or equal to the threshold; otherwise it fails.

Jira: ADPPRG-34718