MPIRun Hangs after Slurm 20.11 Update #8378
Comments
Please see https://www.open-mpi.org/faq/?category=slurm#slurm-20.11-mpirun. Does this help? FYI: @wickberg
I have seen that page, but nothing there helps me. It's not that my job is killed or runs slowly; mpirun just doesn't run at all. It hangs the moment I run it. I'm using pmi2 in Slurm.
I'm not sure how to proceed. Does Open MPI have any logs I can examine or share other than the strace? I couldn't make much sense of the strace; it just hangs on poll(). Please let me know if I can provide additional information.
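Since PMI2 was mentioned above, one quick sanity check is which PMI plugins this Slurm build actually exposes, and whether a direct Slurm launch (bypassing mpirun entirely) also hangs. A minimal sketch; `./hello_mpi` is a stand-in for any MPI binary:

```shell
# List the MPI/PMI plugin types this Slurm installation supports
srun --mpi=list

# Launch directly through Slurm with PMI2, bypassing mpirun's launcher,
# to see whether the hang is specific to mpirun (./hello_mpi is hypothetical)
srun --mpi=pmi2 -N 2 -n 2 ./hello_mpi
```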
Can you try the steps listed here: https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems (obviously the […] ones). In addition to that, try running with […]. Let's see if that shows anything illuminating.
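The exact flags in this suggestion were lost when the thread was exported. As a rough sketch of the kind of test usually meant here — the host names are this thread's, and the verbosity level is an assumption:

```shell
# Non-MPI sanity check: can mpirun merely launch a process on each host?
mpirun --host node19,node20 -np 2 hostname

# Same test with the process-launch framework turned up, so the srun command
# that mpirun generates internally is printed
mpirun --mca plm_base_verbose 10 --host node19,node20 -np 2 hostname
```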
Using the host flag, I'm able to get output on the node I'm on, but not on the other node. (This is an srun job allocated to nodes 19 and 20; I'm on node19.)
That srun error seems promising. Slurm isn't allowing the srun command launched by mpirun to create a job step because that node is busy. Not sure why it would be flagging it as busy, though.

If it makes a difference, we are using a module openmpi, so I have to run […]. Here's the output of that command:
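The module command itself didn't survive the export. A sketch of what's meant — the module name is hypothetical, substitute whatever `module avail` shows on this system:

```shell
# Hypothetical module name; pick the one the site actually provides
module load openmpi/4.1.0

# Confirm which mpirun is actually on PATH and that this build has Slurm support
which mpirun
ompi_info | grep -i slurm
```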
It hangs when running the srun command. Thanks for the ongoing help!
This is how I'm launching the srun job: […]
Oh -- the results from running […]. Is node 20 actually in your job and available? E.g., if you […]?
Yes:
Both nodes work for non-MPI jobs.
This also works:
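The command and output in the comments just above were lost in the export. For orientation only, checking that both nodes really are in the allocation usually looks something like this:

```shell
# Inside the allocation: which nodes does Slurm think we have?
echo $SLURM_JOB_NODELIST
scontrol show hostnames $SLURM_JOB_NODELIST

# Ask every node in the allocation to report its hostname
srun hostname
```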
Ok, let's try this... in your […], run […]. This is what […]:
We obviously don't want to run […].
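The suggested command block was lost here. Piecing it together from the options discussed below (--nodelist and --ntasks), the manual test was presumably along these lines — treat the exact values as assumptions, with hostname standing in for the daemon mpirun would normally start:

```shell
# Roughly what mpirun's Slurm launcher does: create a job step inside the
# existing allocation, one process per node (hostname used as a harmless stand-in)
srun --nodes=2 --ntasks=2 --nodelist=node19,node20 hostname
```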
That command just hangs:
This leads me to believe this is definitely a Slurm issue, though I don't know what has changed that is causing Slurm to disable job step creation.
Try removing the […] option.
Basically, I'm suggesting you remove one of those options at a time until we can identify the one causing the problem.
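In concrete terms, the elimination looks something like this. The base command is the reconstruction sketched above, so these variants are equally speculative:

```shell
# Drop one option per attempt and note which variant stops hanging
srun --ntasks=2 --nodelist=node19,node20 hostname   # without --nodes
srun --nodes=2 --nodelist=node19,node20 hostname    # without --ntasks
srun --nodes=2 --ntasks=2 hostname                  # without --nodelist
```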
That command hangs the same. Getting rid of --nodelist also causes a hang. When I get rid of the --ntasks parameter, it prints this:
Sorry to have you playing "whack-a-mole", but try leaving the […] off as well.
Hmmm... you know, I wonder if you specified only 1 core/node in your allocation request? That might be the root cause here - I believe Slurm changed their command line so that it defaulted to 1 core/node. Might want to check that out.
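If the 1-core-per-node default is the culprit, the allocation request is where it shows up. A sketch of making the per-node resources explicit — the task counts here are examples only, not the values actually used on this cluster:

```shell
# Ask for the per-node resources explicitly instead of relying on the default
salloc --nodes=2 --ntasks-per-node=4

# or, for an interactive shell inside the allocation:
srun --nodes=2 --ntasks-per-node=4 --pty bash
```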
I am a user on the system that @hakasapl manages. I can confirm the following: if I run with one core per node, mpirun will work, e.g. […]
But if I request more cores per node, I get a dead hang, e.g. […]
The […]. Is this related to @rhc54's comment about the one-core-per-node default?
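Both example commands in this comment were lost in the export. As a hypothetical rendering of the contrast being described (node and task counts are illustrative):

```shell
# Works: one task per node in the interactive allocation
srun --nodes=2 --ntasks-per-node=1 --pty bash
mpirun hostname

# Hangs: more than one task per node requested
srun --nodes=2 --ntasks-per-node=4 --pty bash
mpirun hostname
```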
Yeah - what was your […]?
I simply used […]. I don't think I'm misusing […].
So here's the problem. In 20.11, Slurm changed their allocation methods. I won't go into all the gory details as it gets very confusing. Suffice to say that SchedMD got yelled at a lot by people encountering all kinds of problems such as this one, and the change has been backed out. They are strongly advising everyone to skip 20.11.0 and jump to 20.11.3, which has the reversion in it. I suggest you follow their advice and do the upgrade. If you aren't ready for that, then I would strongly advise you to revert to an earlier-than-20.11.0 release. Otherwise, you'll spend endless time trying to wade through this disaster.
We are currently running 20.11.2, which is the latest release on GitHub and their website. Is there another location I'm missing that has the .3 release? If not, we will have to revert to 20.02.
I'm afraid I don't know - could be that the .3 release is imminent? If you don't see it, then it hasn't come out yet.
Okay, we will revert to 20.02.6 for the moment. I'll let you know if that fixes the issue on this thread. Thank you for all the support!
Looks like 20.11.3 is on GitHub but not officially released yet. Thanks, @rhc54!
Downgrading to 20.02.6 resolved the issue.
Background information
I'm running an HPC Slurm cluster. Recently, we updated to Slurm 20.11. I was aware of the Open MPI-related changes going into it; however, an unexpected problem arose: when running mpirun in an interactive srun job spanning at least two nodes, it just hangs.
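For concreteness, a minimal way to demonstrate the hang inside such an allocation — the node/task counts are illustrative, and `timeout` is only there to bound the wait:

```shell
# Inside a two-node interactive allocation, even a trivial non-MPI command
# launched via mpirun never returns; bound it with a timeout for demonstration
timeout 60 mpirun -np 2 hostname
echo "mpirun exit status: $?"   # 124 means the timeout fired, i.e. the hang
```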
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Source distribution, built with slurm support.
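The exact configure line isn't given; a typical Slurm-enabled build looks roughly like the following. The prefix and PMI path are placeholders, not the values actually used here:

```shell
# Hypothetical paths; --with-slurm and --with-pmi are the relevant switches
./configure --prefix=/opt/openmpi/4.1.0 \
            --with-slurm \
            --with-pmi=/usr    # location of Slurm's PMI headers/libs (placeholder)
make -j 8 && make install
```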
Details of the problem
I'm inclined to believe this is a networking issue. There is no firewall between the two hosts, and they are on the same subnet. I ran an strace on the mpirun command, which I'll attach. I'm not sure how to proceed. Would you recommend some other troubleshooting steps?
strace-mpirun.txt
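For reference, a trace like the attached one is typically gathered along these lines — the exact mpirun arguments used for the trace are not shown in the thread:

```shell
# Follow forks so the srun process that mpirun spawns is captured as well
strace -f -o strace-mpirun.txt mpirun hostname
```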