Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEODE-8698: Remove TcrPoolEndpoint::registerDM lock #699

Merged
merged 2 commits into from Jun 7, 2021

Conversation

gaussianrecurrence
Copy link
Contributor

  • Removed pool lock when registering a new DM.
  • This solves the thread inter-lock whenever putting back connections
    while recovering subscription on a failed endpoint.

Context: During our tests, whenever a server goes down, all the threads in the geode-native client were getting stuck while trying to recover subscriptions on the endpoint which is down. This was due to registerDM method holding the pool lock.
As this pool was only used to guarantee the max-connections hard-limit and it seems that's not a hard-limit anymore as @albertogpz demonstrated with PR #676, then I'd say it's safe to remove this lock, avoiding therefore any dead-locks.

@gaussianrecurrence gaussianrecurrence changed the title Remove TcrPoolEndpoint::registerDM lock GEODE-8698: Remove TcrPoolEndpoint::registerDM lock Nov 19, 2020
 - Removed pool lock when registering a new DM.
 - This solves the thread inter-lock whenever putting back connections.
@gaussianrecurrence
Copy link
Contributor Author

Just to clarify. This is part of the effort to remove all not necessary dead-locks having to do with connection establishment. I am not sure if there is a clear way on how to test this. I've done some manual stress testing and haven't seen any negative effect on this change, rather than the opposite.

Maybe @mreddington @echobravopapa @moleske @pivotal-jbarrett @pdxcodemonkey @mivanac @mkevo could take a look at this? Thanks!

@jake-at-work
Copy link
Contributor

Do you know why there would have been a lock here in the first place. Without rationalizing the locking, and yes its a mess, I fear removing locks. Perhaps we should use std::lock to avoid deadlocks on these locks. This would allow us to keep the locks in place until we have a clearer picture about the locking relationships between these two. If you have already done that analysis can you share it here?

Copy link

@echobravopapa echobravopapa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to @pivotal-jbarrett comments, it would be great to add testing around a change like this...

@pdxcodemonkey
Copy link
Contributor

@gaussianrecurrence What's the status on this one? It's been pending with changes requested for a little over a month...

@gaussianrecurrence
Copy link
Contributor Author

@gaussianrecurrence What's the status on this one? It's been pending with changes requested for a little over a month...

Sorry @pdxcodemonkey. Thing is testing this change is quite complex, and I am working in prioritized issues, so I will try to work on this again ASAP.

@echobravopapa
Copy link

@gaussianrecurrence you might consider closing this PR until it becomes a priority to finish, thanks for keeping geode-native tidy.

@gaussianrecurrence
Copy link
Contributor Author

@gaussianrecurrence you might consider closing this PR until it becomes a priority to finish, thanks for keeping geode-native tidy.

I am currently working on looking to write a suitable test. This might be not easy, but let's see if I can find a consistent way of reproducing the inter-lock.

Copy link
Contributor

@jake-at-work jake-at-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent a bit more time reviewing this. Based on the issue and the code change I am comfortable with it.

While a test would be nice, I really don't know how we would produce one that is deterministic without some serious refactoring.

@jake-at-work
Copy link
Contributor

@echobravopapa I am not sure a deterministic test is reasonably achievable for this change. I think it would require a significant refactor to add hooks and metrics around this block of code just to maybe detect the other threads didn't pause for some meaningful time.

@gaussianrecurrence
Copy link
Contributor Author

@echobravopapa I am not sure a deterministic test is reasonably achievable for this change. I think it would require a significant refactor to add hooks and metrics around this block of code just to maybe detect the other threads didn't pause for some meaningful time.

Actually, I managed to write a test on linux by pausing the maintain redundancy threads. However thread naming is not supported until recent versions of Windows. So this test won't be available in windows. I agree with you @pivotal-jbarrett, I am not sure is worth it to put that much effort into making this test.

@gaussianrecurrence
Copy link
Contributor Author

Any thoughts on this @echobravopapa? If you really think there is a way to test this, I'll gladly do it :)

@pdxcodemonkey pdxcodemonkey merged commit ab6fe7d into apache:develop Jun 7, 2021
@gaussianrecurrence gaussianrecurrence deleted the feature/GEODE-8698 branch June 7, 2021 16:23
davebarnes97 pushed a commit to davebarnes97/geode-native that referenced this pull request Feb 9, 2022
- Removed pool lock when registering a new DM.
 - This solves the thread inter-lock whenever putting back connections.
davebarnes97 pushed a commit to davebarnes97/geode-native that referenced this pull request Feb 9, 2022
- Removed pool lock when registering a new DM.
 - This solves the thread inter-lock whenever putting back connections.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants