New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GEODE-8698: Remove TcrPoolEndpoint::registerDM lock #699
Conversation
34ae3c4
to
90d23a0
Compare
- Removed pool lock when registering a new DM. - This solves the thread inter-lock whenever putting back connections.
90d23a0
to
7f42c52
Compare
Just to clarify. This is part of the effort to remove all not necessary dead-locks having to do with connection establishment. I am not sure if there is a clear way on how to test this. I've done some manual stress testing and haven't seen any negative effect on this change, rather than the opposite. Maybe @mreddington @echobravopapa @moleske @pivotal-jbarrett @pdxcodemonkey @mivanac @mkevo could take a look at this? Thanks! |
Do you know why there would have been a lock here in the first place. Without rationalizing the locking, and yes its a mess, I fear removing locks. Perhaps we should use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to @pivotal-jbarrett comments, it would be great to add testing around a change like this...
@gaussianrecurrence What's the status on this one? It's been pending with changes requested for a little over a month... |
Sorry @pdxcodemonkey. Thing is testing this change is quite complex, and I am working in prioritized issues, so I will try to work on this again ASAP. |
@gaussianrecurrence you might consider closing this PR until it becomes a priority to finish, thanks for keeping geode-native tidy. |
I am currently working on looking to write a suitable test. This might be not easy, but let's see if I can find a consistent way of reproducing the inter-lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent a bit more time reviewing this. Based on the issue and the code change I am comfortable with it.
While a test would be nice, I really don't know how we would produce one that is deterministic without some serious refactoring.
@echobravopapa I am not sure a deterministic test is reasonably achievable for this change. I think it would require a significant refactor to add hooks and metrics around this block of code just to maybe detect the other threads didn't pause for some meaningful time. |
Actually, I managed to write a test on linux by pausing the maintain redundancy threads. However thread naming is not supported until recent versions of Windows. So this test won't be available in windows. I agree with you @pivotal-jbarrett, I am not sure is worth it to put that much effort into making this test. |
Any thoughts on this @echobravopapa? If you really think there is a way to test this, I'll gladly do it :) |
- Removed pool lock when registering a new DM. - This solves the thread inter-lock whenever putting back connections.
- Removed pool lock when registering a new DM. - This solves the thread inter-lock whenever putting back connections.
while recovering subscription on a failed endpoint.
Context: During our tests, whenever a server goes down, all the threads in the geode-native client were getting stuck while trying to recover subscriptions on the endpoint which is down. This was due to registerDM method holding the pool lock.
As this pool was only used to guarantee the max-connections hard-limit and it seems that's not a hard-limit anymore as @albertogpz demonstrated with PR #676, then I'd say it's safe to remove this lock, avoiding therefore any dead-locks.