MPIRun Hangs after Slurm 20.11 Update #8378

Closed
hakasapl opened this issue Jan 14, 2021 · 26 comments

@hakasapl

Background information

I'm running an HPC Slurm cluster. We recently updated to Slurm 20.11, and I was aware of the Open MPI-related changes going into it. However, an unexpected problem arose: when running mpirun inside an interactive srun job spanning at least two nodes, it just hangs.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source distribution tarball, built with Slurm support.
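For reference, a build of this kind is typically configured along these lines (a sketch; the prefix and PMI path are placeholders, not our exact configure invocation):

./configure --prefix=/opt/openmpi/4.1.0 --with-slurm --with-pmi=/usr   # placeholder paths
make -j all
make install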

  • Operating system/version: Ubuntu 20.04
  • Network type: 10 Gb fiber

Details of the problem

I'm inclined to believe this is a networking issue. There is no firewall between the two hosts, and they are on the same subnet. I ran strace on the mpirun command and will attach the output. I'm not sure how to proceed. Would you recommend some other troubleshooting steps?

strace-mpirun.txt
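For reference, a trace like the one attached can be captured with something along these lines (illustrative; the exact options and process count may differ from what was used here):

strace -f -tt -o strace-mpirun.txt mpirun -np 8 hostname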

@jsquyres (Member)

Please see https://www.open-mpi.org/faq/?category=slurm#slurm-20.11-mpirun. Does this help?

FYI: @wickberg

@hakasapl (Author)

I have seen that page, but nothing there helps me. It's not that my job is killed or runs slowly; mpirun just doesn't run at all and hangs the moment I run it. I'm using PMI2 in Slurm.
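(For context, by PMI2 I mean Slurm's PMI2 plugin, i.e. the kind of direct launch sketched below, as opposed to going through mpirun; ./my_mpi_app is just a placeholder name.)

srun --mpi=pmi2 -N 2 -n 8 ./my_mpi_app   # placeholder application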

@hakasapl (Author)

I'm not sure how to proceed. Does Open MPI have any logs I can examine/share other than the strace? I couldn't make much sense of the strace; it just hangs on poll().

Please let me know if I can provide additional information.

@jsquyres (Member)

Can you try the steps listed here: https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems

(obviously the ssh-related stuff in there is not relevant, but the idea of testing with a non-MPI program, etc. is relevant)

In addition to that, try running with mpirun --mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 ...? That should emit a LOT of output.

Let's see if that shows anything illuminating.

@hakasapl (Author)

Using the --host flag, I'm able to get output on the node I'm logged into, but not on the other node (this is an srun job allocated to node19 and node20; I'm on node19).

hsaplakoglu_umass_edu@node19:~$ mpirun --host node19 hostname
node19
node19
node19
node19
hsaplakoglu_umass_edu@node19:~$ mpirun --host node20 hostname
srun: Job 1333683 step creation temporarily disabled, retrying (Requested nodes are busy)

That srun error seems promising: Slurm isn't allowing the srun command launched by mpirun to create a job step because the node is reported as busy. I'm not sure why it would be flagged as busy, though.
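One way to see what Slurm thinks is holding the node would be to inspect the job's steps and allocation from another shell, e.g. (a sketch using the job ID from the message above):

squeue --steps --jobs=1333683                                    # list the job's active steps
scontrol show job 1333683 | grep -E 'NumNodes|NumCPUs|NumTasks'  # what the allocation granted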

If it makes a difference, we use Open MPI from an environment module, so I have to run module load openmpi before mpirun is available. The same issue persists even with the system-installed Open MPI, though.

Here's the output of that command:

hsaplakoglu_umass_edu@node19:~$ mpirun --mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 uptime
[node19:2167501] mca: base: components_register: registering framework plm components
[node19:2167501] mca: base: components_register: found loaded component isolated
[node19:2167501] mca: base: components_register: component isolated has no register or open function
[node19:2167501] mca: base: components_register: found loaded component rsh
[node19:2167501] mca: base: components_register: component rsh register function successful
[node19:2167501] mca: base: components_register: found loaded component slurm
[node19:2167501] mca: base: components_register: component slurm register function successful
[node19:2167501] mca: base: components_open: opening plm components
[node19:2167501] mca: base: components_open: found loaded component isolated
[node19:2167501] mca: base: components_open: component isolated open function successful
[node19:2167501] mca: base: components_open: found loaded component rsh
[node19:2167501] mca: base: components_open: component rsh open function successful
[node19:2167501] mca: base: components_open: found loaded component slurm
[node19:2167501] mca: base: components_open: component slurm open function successful
[node19:2167501] mca:base:select: Auto-selecting plm components
[node19:2167501] mca:base:select:(  plm) Querying component [isolated]
[node19:2167501] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[node19:2167501] mca:base:select:(  plm) Querying component [rsh]
[node19:2167501] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[node19:2167501] mca:base:select:(  plm) Querying component [slurm]
[node19:2167501] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[node19:2167501] mca:base:select:(  plm) Selected component [slurm]
[node19:2167501] mca: base: close: component isolated closed
[node19:2167501] mca: base: close: unloading component isolated
[node19:2167501] mca: base: close: component rsh closed
[node19:2167501] mca: base: close: unloading component rsh
[node19:2167501] mca: base: components_register: registering framework ras components
[node19:2167501] mca: base: components_register: found loaded component simulator
[node19:2167501] mca: base: components_register: component simulator register function successful
[node19:2167501] mca: base: components_register: found loaded component slurm
[node19:2167501] mca: base: components_register: component slurm register function successful
[node19:2167501] mca: base: components_open: opening ras components
[node19:2167501] mca: base: components_open: found loaded component simulator
[node19:2167501] mca: base: components_open: found loaded component slurm
[node19:2167501] mca: base: components_open: component slurm open function successful
[node19:2167501] mca:base:select: Auto-selecting ras components
[node19:2167501] mca:base:select:(  ras) Querying component [simulator]
[node19:2167501] mca:base:select:(  ras) Querying component [slurm]
[node19:2167501] mca:base:select:(  ras) Query of component [slurm] set priority to 50
[node19:2167501] mca:base:select:(  ras) Selected component [slurm]
[node19:2167501] mca: base: close: unloading component simulator
[node19:2167501] mca: base: components_register: registering framework rmaps components
[node19:2167501] mca: base: components_register: found loaded component mindist
[node19:2167501] mca: base: components_register: component mindist register function successful
[node19:2167501] mca: base: components_register: found loaded component ppr
[node19:2167501] mca: base: components_register: component ppr register function successful
[node19:2167501] mca: base: components_register: found loaded component rank_file
[node19:2167501] mca: base: components_register: component rank_file register function successful
[node19:2167501] mca: base: components_register: found loaded component resilient
[node19:2167501] mca: base: components_register: component resilient register function successful
[node19:2167501] mca: base: components_register: found loaded component round_robin
[node19:2167501] mca: base: components_register: component round_robin register function successful
[node19:2167501] mca: base: components_register: found loaded component seq
[node19:2167501] mca: base: components_register: component seq register function successful
[node19:2167501] [[8505,0],0] rmaps:base set policy with NULL device NONNULL
[node19:2167501] mca: base: components_open: opening rmaps components
[node19:2167501] mca: base: components_open: found loaded component mindist
[node19:2167501] mca: base: components_open: component mindist open function successful
[node19:2167501] mca: base: components_open: found loaded component ppr
[node19:2167501] mca: base: components_open: component ppr open function successful
[node19:2167501] mca: base: components_open: found loaded component rank_file
[node19:2167501] mca: base: components_open: component rank_file open function successful
[node19:2167501] mca: base: components_open: found loaded component resilient
[node19:2167501] mca: base: components_open: component resilient open function successful
[node19:2167501] mca: base: components_open: found loaded component round_robin
[node19:2167501] mca: base: components_open: component round_robin open function successful
[node19:2167501] mca: base: components_open: found loaded component seq
[node19:2167501] mca: base: components_open: component seq open function successful
[node19:2167501] mca:rmaps:select: checking available component mindist
[node19:2167501] mca:rmaps:select: Querying component [mindist]
[node19:2167501] mca:rmaps:select: checking available component ppr
[node19:2167501] mca:rmaps:select: Querying component [ppr]
[node19:2167501] mca:rmaps:select: checking available component rank_file
[node19:2167501] mca:rmaps:select: Querying component [rank_file]
[node19:2167501] mca:rmaps:select: checking available component resilient
[node19:2167501] mca:rmaps:select: Querying component [resilient]
[node19:2167501] mca:rmaps:select: checking available component round_robin
[node19:2167501] mca:rmaps:select: Querying component [round_robin]
[node19:2167501] mca:rmaps:select: checking available component seq
[node19:2167501] mca:rmaps:select: Querying component [seq]
[node19:2167501] [[8505,0],0]: Final mapper priorities
[node19:2167501]        Mapper: ppr Priority: 90
[node19:2167501]        Mapper: seq Priority: 60
[node19:2167501]        Mapper: resilient Priority: 40
[node19:2167501]        Mapper: mindist Priority: 20
[node19:2167501]        Mapper: round_robin Priority: 10
[node19:2167501]        Mapper: rank_file Priority: 0

======================   ALLOCATED NODES   ======================
        node19: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
        node20: flags=0x10 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
[node19:2167501] [[8505,0],0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "557383680" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:19-20]@0(2)" -mca orte_hnp_uri "557383680.0;tcp://10.100.10.19:56759" --mca plm_base_verbose "100" --mca ras_base_verbose "100" --mca rss_base_verbose "100" --mca rmaps_base_verbose "100"

mpirun hangs when it gets to running that srun command.

Thanks for the ongoing help!

@hakasapl (Author)

This is how I'm launching the srun job: srun -p cpu --nodelist=node[19,20] -N 2 -n 8 --pty bash

@jsquyres (Member)

Oh -- the results from running hostname may well be telling.

Is node 20 actually in your job and available? E.g., if you srun hostname, does it actually run on both nodes 19 and 20? If it hangs while trying to run on 20, that would likely be effectively the same thing that's happening to Open MPI (because mpirun uses srun under the covers to launch on other nodes in your SLURM job).

@hakasapl (Author)

Yes:

hsaplakoglu_umass_edu@login:~$ srun --nodelist=node19 hostname
node19
hsaplakoglu_umass_edu@login:~$ srun --nodelist=node20 hostname
node20

Both nodes work for non-MPI jobs.

@hakasapl (Author)

This also works:

hsaplakoglu_umass_edu@login:~$ srun --nodelist=node[19-20] hostname
node20
node19

@jsquyres (Member)

Ok, let's try this... in your srun -p cpu --nodelist=node[19,20] -N 2 -n 8 --pty bash job, try to run (effectively) the same command that mpirun is invoking to launch on node 20.

This is what mpirun tried to launch:

srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "557383680" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:19-20]@0(2)" -mca orte_hnp_uri "557383680.0;tcp://10.100.10.19:56759" --mca plm_base_verbose "100" --mca ras_base_verbose "100" --mca rss_base_verbose "100" --mca rmaps_base_verbose "100"

We obviously don't want to run orted here, so just try running:

srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 hostname

@hakasapl (Author)

That command just hangs:

hsaplakoglu_umass_edu@node19:~$ srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 hostname
srun: Job 1333693 step creation temporarily disabled, retrying (Requested nodes are busy)

This leads me to believe it is definitely a Slurm issue, though I don't know what has changed that is causing Slurm to disable job step creation.

@rhc54 (Contributor) commented Jan 15, 2021

Try removing the --ntasks-per-node option and see if that helps

@rhc54 (Contributor) commented Jan 15, 2021

Basically, I'm suggesting you remove one of those options at a time until we can identify the one causing the problem

@hakasapl (Author)

hsaplakoglu_umass_edu@node19:~$ srun --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 hostname

That command hangs the same way. Getting rid of --nodelist also causes a hang.

When I get rid of the --ntasks parameter, it prints this:

hsaplakoglu_umass_edu@node19:~$ srun --kill-on-bad-exit --nodes=1 --nodelist=node20 hostname
srun: error: Unable to create step for job 1333693: More processors requested than permitted

@rhc54 (Contributor) commented Jan 15, 2021

Sorry to have you playing "whack-a-mole", but try leaving the --ntasks-per-node and removing --ntasks

@rhc54 (Contributor) commented Jan 15, 2021

hsaplakoglu_umass_edu@node19:~$ srun --kill-on-bad-exit --nodes=1 --nodelist=node20 hostname
srun: error: Unable to create step for job 1333693: More processors requested than permitted

Hmmm... you know, I wonder if you specified only 1 core/node in your allocation request? That might be the root cause here; I believe Slurm changed their command line so that it defaults to 1 core/node. You might want to check that out.
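One quick way to confirm what the allocation actually granted is to check, from inside the interactive job, something like:

echo "$SLURM_JOB_NUM_NODES nodes, $SLURM_NTASKS tasks, CPUs per node: $SLURM_JOB_CPUS_PER_NODE"
scontrol show job $SLURM_JOB_ID | grep -E 'NumNodes|NumCPUs|NumTasks'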

@hakasapl (Author)

hsaplakoglu_umass_edu@node19:~$ srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 hostname
srun: Warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 8 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 1333693: More processors requested than permitted

@The9Cat commented Jan 15, 2021

I am a user on the system that @hakasapl manages, and I can confirm the following: if I run with one core per node, mpirun works. E.g.

$srun --ntasks-per-node=1 --nodes=8 --ntasks=8 --pty $SHELL
$mpirun --mca btl tcp,self hostname
node3
node15
node16
node22
node17
node21
node23
node18

But if I request more cores per node, I get a dead hang, e.g.

$srun --ntasks-per-node=2 --nodes=4 --ntasks=8 --pty $SHELL
$mpirun --mca btl tcp,self hostname

The mpirun above appears to hang. I also checked that I can execute a complex MPI application with the first srun allocation, not just hostname, but not with the second srun allocation.

Is this related to @rhc54's comment about the one-core-per-node default?

@rhc54 (Contributor) commented Jan 15, 2021

Yeah, what was your salloc command to get that allocation? I think that is where the problem begins.

@The9Cat commented Jan 15, 2021

I simply used srun to make the allocation. If I use salloc to get the same allocation followed by srun --jobid xxxx, the behavior is the same. I'm probably missing your point here, sorry.

I don't think I'm misusing srun here. To make sure, I repeated the same commands from the post above on a different cluster, and they work as expected. That suggests to me that Slurm and Open MPI are interacting in a strange way. Any thoughts on that?

@rhc54 (Contributor) commented Jan 15, 2021

So here's the problem: in 20.11, Slurm changed their allocation methods. I won't go into all the gory details, as it gets very confusing. Suffice it to say that SchedMD got yelled at a lot by people encountering all kinds of problems such as this one, and the change has been backed out. They are strongly advising everyone to skip 20.11.0 and jump to 20.11.3, which has the reversion in it.

I suggest you follow their advice and do the upgrade. If you aren't ready for that, then I would strongly advise you to revert to an earlier-than-20.11.0 release. Otherwise, you'll spend endless time trying to wade through this disaster.
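If neither moving forward nor reverting is possible right away, one interim workaround that has been suggested for 20.11.x is to let job steps overlap again. A hedged sketch, to be verified against the Slurm 20.11 release notes:

export SLURM_OVERLAP=1    # ask the srun-launched steps to share the allocation; confirm your Slurm honors this
mpirun ./my_mpi_app       # ./my_mpi_app is a placeholder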

@hakasapl (Author)

We are currently running 20.11.2, which is the latest release on GitHub and their website. Is there another location I'm missing that has the .3 release?

If not, we will have to revert back to 20.02.

@rhc54 (Contributor) commented Jan 15, 2021

I'm afraid I don't know; it could be that the .3 release is imminent. If you don't see it, then it hasn't come out yet.

@hakasapl (Author)

Okay, we will revert to 20.02.6 for the moment. I'll let you know on this thread whether that fixes the issue. Thank you for all the support!

@The9Cat commented Jan 15, 2021

Looks like 20.11.3 is on GitHub but not officially released yet.

Thanks, @rhc54!

@hakasapl (Author)

Downgrading to 20.02.6 did the trick. Thanks for all the support!
