MPIRun Hangs after Slurm 20.11 Update #8378

Closed
hakasapl opened this issue Jan 14, 2021 · 26 comments

@hakasapl

Background information

I'm running an HPC Slurm cluster. We recently updated to Slurm 20.11, and I was aware of the Open MPI-related changes going into it. However, an unexpected problem arose: when running mpirun inside an interactive srun job spanning at least two nodes, it just hangs.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source distribution tarball, built with Slurm support.
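For reference, a build of this kind is typically configured along these lines (a sketch; the prefix and PMI path are placeholders, not our exact configure invocation):

./configure --prefix=/opt/openmpi/4.1.0 --with-slurm --with-pmi=/usr   # placeholder paths
make -j all
make install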

  • Operating system/version: Ubuntu 20.04
  • Network type: 10 Gb fiber

Details of the problem

I'm inclined to believe this is a networking issue. There is no firewall between the two hosts, and they are on the same subnet. I ran strace on the mpirun command and will attach the output. I'm not sure how to proceed. Would you recommend some other troubleshooting steps?

strace-mpirun.txt
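For reference, a trace like the one attached can be captured with something along these lines (illustrative; the exact options and process count may differ from what was used here):

strace -f -tt -o strace-mpirun.txt mpirun -np 8 hostname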

@jsquyres (Member)

Please see https://www.open-mpi.org/faq/?category=slurm#slurm-20.11-mpirun. Does this help?

FYI: @wickberg

@hakasapl (Author)

I have seen that page, but nothing there helps me. It's not that my job is killed or runs slowly; mpirun just doesn't run at all and hangs the moment I run it. I'm using PMI2 in Slurm.
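(For context, by PMI2 I mean Slurm's PMI2 plugin, i.e. the kind of direct launch sketched below, as opposed to going through mpirun; ./my_mpi_app is just a placeholder name.)

srun --mpi=pmi2 -N 2 -n 8 ./my_mpi_app   # placeholder application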

@hakasapl (Author)

I'm not sure how to proceed. Does Open MPI have any logs I can examine/share other than the strace? I couldn't make much sense of the strace; it just hangs on poll().

Please let me know if I can provide additional information.

@jsquyres (Member)

Can you try the steps listed here: https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems

(obviously the ssh-related stuff in there is not relevant, but the idea of testing with a non-MPI program, etc. is relevant)

In addition to that, try running with mpirun --mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 ...? That should emit a LOT of output.

Let's see if that shows anything illuminating.

@hakasapl (Author)

Using the --host flag, I'm able to get output on the node I'm logged into, but not on the other node (this is an srun job allocated to node19 and node20; I'm on node19).

hsaplakoglu_umass_edu@node19:~$ mpirun --host node19 hostname
node19
node19
node19
node19
hsaplakoglu_umass_edu@node19:~$ mpirun --host node20 hostname
srun: Job 1333683 step creation temporarily disabled, retrying (Requested nodes are busy)

That srun error seems promising: Slurm isn't allowing the srun command launched by mpirun to create a job step because the node is reported as busy. I'm not sure why it would be flagged as busy, though.
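One way to see what Slurm thinks is holding the node would be to inspect the job's steps and allocation from another shell, e.g. (a sketch using the job ID from the message above):

squeue --steps --jobs=1333683                                    # list the job's active steps
scontrol show job 1333683 | grep -E 'NumNodes|NumCPUs|NumTasks'  # what the allocation granted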

If it makes a difference, we use Open MPI from an environment module, so I have to run module load openmpi before mpirun is available. The same issue persists even with the system-installed Open MPI, though.

Here's the output of that command:

hsaplakoglu_umass_edu@node19:~$ mpirun --mca plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca rmaps_base_verbose 100 uptime
[node19:2167501] mca: base: components_register: registering framework plm components
[node19:2167501] mca: base: components_register: found loaded component isolated
[node19:2167501] mca: base: components_register: component isolated has no register or open function
[node19:2167501] mca: base: components_register: found loaded component rsh
[node19:2167501] mca: base: components_register: component rsh register function successful
[node19:2167501] mca: base: components_register: found loaded component slurm
[node19:2167501] mca: base: components_register: component slurm register function successful
[node19:2167501] mca: base: components_open: opening plm components
[node19:2167501] mca: base: components_open: found loaded component isolated
[node19:2167501] mca: base: components_open: component isolated open function successful
[node19:2167501] mca: base: components_open: found loaded component rsh
[node19:2167501] mca: base: components_open: component rsh open function successful
[node19:2167501] mca: base: components_open: found loaded component slurm
[node19:2167501] mca: base: components_open: component slurm open function successful
[node19:2167501] mca:base:select: Auto-selecting plm components
[node19:2167501] mca:base:select:(  plm) Querying component [isolated]
[node19:2167501] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[node19:2167501] mca:base:select:(  plm) Querying component [rsh]
[node19:2167501] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[node19:2167501] mca:base:select:(  plm) Querying component [slurm]
[node19:2167501] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[node19:2167501] mca:base:select:(  plm) Selected component [slurm]
[node19:2167501] mca: base: close: component isolated closed
[node19:2167501] mca: base: close: unloading component isolated
[node19:2167501] mca: base: close: component rsh closed
[node19:2167501] mca: base: close: unloading component rsh
[node19:2167501] mca: base: components_register: registering framework ras components
[node19:2167501] mca: base: components_register: found loaded component simulator
[node19:2167501] mca: base: components_register: component simulator register function successful
[node19:2167501] mca: base: components_register: found loaded component slurm
[node19:2167501] mca: base: components_register: component slurm register function successful
[node19:2167501] mca: base: components_open: opening ras components
[node19:2167501] mca: base: components_open: found loaded component simulator
[node19:2167501] mca: base: components_open: found loaded component slurm
[node19:2167501] mca: base: components_open: component slurm open function successful
[node19:2167501] mca:base:select: Auto-selecting ras components
[node19:2167501] mca:base:select:(  ras) Querying component [simulator]
[node19:2167501] mca:base:select:(  ras) Querying component [slurm]
[node19:2167501] mca:base:select:(  ras) Query of component [slurm] set priority to 50
[node19:2167501] mca:base:select:(  ras) Selected component [slurm]
[node19:2167501] mca: base: close: unloading component simulator
[node19:2167501] mca: base: components_register: registering framework rmaps components
[node19:2167501] mca: base: components_register: found loaded component mindist
[node19:2167501] mca: base: components_register: component mindist register function successful
[node19:2167501] mca: base: components_register: found loaded component ppr
[node19:2167501] mca: base: components_register: component ppr register function successful
[node19:2167501] mca: base: components_register: found loaded component rank_file
[node19:2167501] mca: base: components_register: component rank_file register function successful
[node19:2167501] mca: base: components_register: found loaded component resilient
[node19:2167501] mca: base: components_register: component resilient register function successful
[node19:2167501] mca: base: components_register: found loaded component round_robin
[node19:2167501] mca: base: components_register: component round_robin register function successful
[node19:2167501] mca: base: components_register: found loaded component seq
[node19:2167501] mca: base: components_register: component seq register function successful
[node19:2167501] [[8505,0],0] rmaps:base set policy with NULL device NONNULL
[node19:2167501] mca: base: components_open: opening rmaps components
[node19:2167501] mca: base: components_open: found loaded component mindist
[node19:2167501] mca: base: components_open: component mindist open function successful
[node19:2167501] mca: base: components_open: found loaded component ppr
[node19:2167501] mca: base: components_open: component ppr open function successful
[node19:2167501] mca: base: components_open: found loaded component rank_file
[node19:2167501] mca: base: components_open: component rank_file open function successful
[node19:2167501] mca: base: components_open: found loaded component resilient
[node19:2167501] mca: base: components_open: component resilient open function successful
[node19:2167501] mca: base: components_open: found loaded component round_robin
[node19:2167501] mca: base: components_open: component round_robin open function successful
[node19:2167501] mca: base: components_open: found loaded component seq
[node19:2167501] mca: base: components_open: component seq open function successful
[node19:2167501] mca:rmaps:select: checking available component mindist
[node19:2167501] mca:rmaps:select: Querying component [mindist]
[node19:2167501] mca:rmaps:select: checking available component ppr
[node19:2167501] mca:rmaps:select: Querying component [ppr]
[node19:2167501] mca:rmaps:select: checking available component rank_file
[node19:2167501] mca:rmaps:select: Querying component [rank_file]
[node19:2167501] mca:rmaps:select: checking available component resilient
[node19:2167501] mca:rmaps:select: Querying component [resilient]
[node19:2167501] mca:rmaps:select: checking available component round_robin
[node19:2167501] mca:rmaps:select: Querying component [round_robin]
[node19:2167501] mca:rmaps:select: checking available component seq
[node19:2167501] mca:rmaps:select: Querying component [seq]
[node19:2167501] [[8505,0],0]: Final mapper priorities
[node19:2167501]        Mapper: ppr Priority: 90
[node19:2167501]        Mapper: seq Priority: 60
[node19:2167501]        Mapper: resilient Priority: 40
[node19:2167501]        Mapper: mindist Priority: 20
[node19:2167501]        Mapper: round_robin Priority: 10
[node19:2167501]        Mapper: rank_file Priority: 0

======================   ALLOCATED NODES   ======================
        node19: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
        node20: flags=0x10 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
[node19:2167501] [[8505,0],0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "557383680" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:19-20]@0(2)" -mca orte_hnp_uri "557383680.0;tcp://10.100.10.19:56759" --mca plm_base_verbose "100" --mca ras_base_verbose "100" --mca rss_base_verbose "100" --mca rmaps_base_verbose "100"

mpirun hangs when it gets to running that srun command.

Thanks for the ongoing help!

@hakasapl (Author)

This is how I'm launching the srun job: srun -p cpu --nodelist=node[19,20] -N 2 -n 8 --pty bash

@jsquyres (Member)

Oh -- the results from running hostname may well be telling.

Is node 20 actually in your job and available? E.g., if you srun hostname, does it actually run on both nodes 19 and 20? If it hangs while trying to run on 20, that would likely be effectively the same thing that's happening to Open MPI (because mpirun uses srun under the covers to launch on other nodes in your SLURM job).

@hakasapl (Author)

Yes:

hsaplakoglu_umass_edu@login:~$ srun --nodelist=node19 hostname
node19
hsaplakoglu_umass_edu@login:~$ srun --nodelist=node20 hostname
node20

Both nodes work for non-MPI jobs.

@hakasapl (Author)

This also works:

hsaplakoglu_umass_edu@login:~$ srun --nodelist=node[19-20] hostname
node20
node19

@jsquyres (Member)

Ok, let's try this... in your srun -p cpu --nodelist=node[19,20] -N 2 -n 8 --pty bash job, try to run (effectively) the same command that mpirun is invoking to launch on node 20.

This is what mpirun tried to launch:

srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "557383680" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:19-20]@0(2)" -mca orte_hnp_uri "557383680.0;tcp://10.100.10.19:56759" --mca plm_base_verbose "100" --mca ras_base_verbose "100" --mca rss_base_verbose "100" --mca rmaps_base_verbose "100"

We obviously don't want to run orted here, so just try running:

srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 hostname

@hakasapl (Author)

That command just hangs:

hsaplakoglu_umass_edu@node19:~$ srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 hostname
srun: Job 1333693 step creation temporarily disabled, retrying (Requested nodes are busy)

This leads me to believe it is definitely a Slurm issue, though I don't know what has changed that is causing Slurm to disable job step creation.

@rhc54 (Contributor) commented Jan 15, 2021

Try removing the --ntasks-per-node option and see if that helps

@rhc54 (Contributor) commented Jan 15, 2021

Basically, I'm suggesting you remove one of those options at a time until we can identify the one causing the problem

@hakasapl (Author)

hsaplakoglu_umass_edu@node19:~$ srun --kill-on-bad-exit --nodes=1 --nodelist=node20 --ntasks=1 hostname

That command hangs the same way. Getting rid of --nodelist also causes a hang.

When I get rid of the --ntasks parameter, it prints this:

hsaplakoglu_umass_edu@node19:~$ srun --kill-on-bad-exit --nodes=1 --nodelist=node20 hostname
srun: error: Unable to create step for job 1333693: More processors requested than permitted

@rhc54 (Contributor) commented Jan 15, 2021

Sorry to have you playing "whack-a-mole", but try leaving the --ntasks-per-node and removing --ntasks

@rhc54 (Contributor) commented Jan 15, 2021

hsaplakoglu_umass_edu@node19:~$ srun --kill-on-bad-exit --nodes=1 --nodelist=node20 hostname
srun: error: Unable to create step for job 1333693: More processors requested than permitted

Hmmm... you know, I wonder if you specified only 1 core/node in your allocation request? That might be the root cause here; I believe Slurm changed their command line so that it defaults to 1 core/node. You might want to check that out.
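One quick way to confirm what the allocation actually granted is to check, from inside the interactive job, something like:

echo "$SLURM_JOB_NUM_NODES nodes, $SLURM_NTASKS tasks, CPUs per node: $SLURM_JOB_CPUS_PER_NODE"
scontrol show job $SLURM_JOB_ID | grep -E 'NumNodes|NumCPUs|NumTasks'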

@hakasapl (Author)

hsaplakoglu_umass_edu@node19:~$ srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1 --nodelist=node20 hostname
srun: Warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 8 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 1333693: More processors requested than permitted

@The9Cat commented Jan 15, 2021

I am a user on the system that @hakasapl manages, and I can confirm the following: if I run with one core per node, mpirun works. E.g.

$srun --ntasks-per-node=1 --nodes=8 --ntasks=8 --pty $SHELL
$mpirun --mca btl tcp,self hostname
node3
node15
node16
node22
node17
node21
node23
node18

But if I request more cores per node, I get a dead hang, e.g.

$srun --ntasks-per-node=2 --nodes=4 --ntasks=8 --pty $SHELL
$mpirun --mca btl tcp,self hostname

The mpirun above appears to hang. I also checked that I can execute a complex MPI application with the first srun allocation, not just hostname, but not with the second srun allocation.

Is this related to @rhc54's comment about the one-core-per-node default?

@rhc54 (Contributor) commented Jan 15, 2021

Yeah, what was your salloc command to get that allocation? I think that is where the problem begins.

@The9Cat commented Jan 15, 2021

I simply used srun to make the allocation. If I use salloc to get the same allocation followed by srun --jobid xxxx, the behavior is the same. I'm probably missing your point here, sorry.

I don't think I'm misusing srun here. To make sure, I repeated the same commands from the post above on a different cluster, and they work as expected. That suggests to me that Slurm and Open MPI are interacting in a strange way. Any thoughts on that?

@rhc54 (Contributor) commented Jan 15, 2021

So here's the problem: in 20.11, Slurm changed their allocation methods. I won't go into all the gory details, as it gets very confusing. Suffice it to say that SchedMD got yelled at a lot by people encountering all kinds of problems such as this one, and the change has been backed out. They are strongly advising everyone to skip 20.11.0 and jump to 20.11.3, which has the reversion in it.

I suggest you follow their advice and do the upgrade. If you aren't ready for that, then I would strongly advise you to revert to an earlier-than-20.11.0 release. Otherwise, you'll spend endless time trying to wade through this disaster.
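If neither moving forward nor reverting is possible right away, one interim workaround that has been suggested for 20.11.x is to let job steps overlap again. A hedged sketch, to be verified against the Slurm 20.11 release notes:

export SLURM_OVERLAP=1    # ask the srun-launched steps to share the allocation; confirm your Slurm honors this
mpirun ./my_mpi_app       # ./my_mpi_app is a placeholder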

@hakasapl (Author)

We are currently running 20.11.2, which is the latest release on GitHub and their website. Is there another location I'm missing that has the .3 release?

If not, we will have to revert back to 20.02.

@rhc54 (Contributor) commented Jan 15, 2021

I'm afraid I don't know; it could be that the .3 release is imminent. If you don't see it, then it hasn't come out yet.

@hakasapl (Author)

Okay, we will revert to 20.02.6 for the moment. I'll let you know on this thread whether that fixes the issue. Thank you for all the support!

@The9Cat commented Jan 15, 2021

Looks like 20.11.3 is on GitHub but not officially released yet.

Thanks, @rhc54!

@hakasapl (Author)

Downgrading to 20.02.6 did the trick. Thanks for all the support!
