Slurm Troubleshooting Guide

This guide is meant as a tool to help system administrators or operators troubleshoot Slurm failures and restore services. The Frequently Asked Questions document may also prove useful.

Slurm is not responding
Jobs are not getting scheduled
Jobs and nodes are stuck in COMPLETING state
Nodes are getting set to a DOWN state
Networking and configuration problems

Slurm is not responding

Execute "scontrol ping" to determine if the primary and backup controllers are responding.
If it responds for you, this could be a networking or configuration problem specific to some user or node in the cluster.
If it does not respond, log in directly to the controller machine and try again to rule out network and configuration problems.
If still not responding, check if there is an active slurmctld daemon by executing "ps -el | grep slurmctld".
If slurmctld is not running, restart it (typically as user root using the command "/etc/init.d/slurm start"). You should check the log file (SlurmctldLog in the slurm.conf file) for an indication of why it failed.
If slurmctld is running but not responding (a very rare situation), then kill and restart it (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm start").
If it hangs again, increase the verbosity of debug messages (increase SlurmctldDebug in the slurm.conf file) and restart. Again check the log file for an indication of why it failed.
If it continues to fail without an indication as to the failure mode, restart without preserving state (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm startclean"). Note: All running jobs and other state information will be lost.
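
The first checks above can be sketched as a small shell helper. This is only an illustration, not part of Slurm: the function names are hypothetical, and it assumes that each line printed by "scontrol ping" contains the word UP or DOWN for the controller it describes (a host name containing those letters in upper case would confuse it).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Slurm commands.

# Reads "scontrol ping" output on stdin.
# Succeeds only if at least one controller is reported UP and none DOWN.
# Assumption: the output marks each controller with the word UP or DOWN.
controllers_all_up() {
    awk '/DOWN/ { down = 1 } /UP/ { up = 1 } END { exit !(up && !down) }'
}

# Succeeds if a process with exactly this command name is running.
daemon_running() {
    ps -e -o comm= | grep -qx -- "$1"
}

# Walks the first triage steps and prints a suggestion; it changes nothing.
triage_controller() {
    if scontrol ping | controllers_all_up; then
        echo "controllers responding"
    elif daemon_running slurmctld; then
        echo "slurmctld is running but not responding: check SlurmctldLog"
    else
        echo "slurmctld is not running: restart it and check SlurmctldLog"
    fi
}
```

Calling triage_controller on the controller machine prints one suggestion and leaves the restart decision to the operator.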

Jobs are not getting scheduled

This depends on the scheduler used by Slurm. Execute the command "scontrol show config | grep SchedulerType" to determine which scheduler is configured.
For any scheduler, you can check priorities of jobs using the command "scontrol show job".

If the scheduler type is builtin, then jobs will be executed in the order of submission for a given partition. Even if resources are available to initiate a job immediately, the job will be deferred until no previously submitted job is pending.
If the scheduler type is backfill, then jobs will generally be executed in the order of submission for a given partition with one exception: later submitted jobs will be initiated early if doing so does not delay the expected execution time of an earlier submitted job. In order for backfill scheduling to be effective, users' jobs should specify reasonable time limits.
If jobs do not specify time limits, then all jobs will receive the same time limit (that associated with the partition), and the ability to backfill jobs will be limited.
The backfill scheduler does not alter job specifications of required or excluded nodes, so jobs which specify nodes will substantially reduce the effectiveness of backfill scheduling.
See the backfill documentation for more details.
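
As a rough sketch of the check described above, the helper below pulls the SchedulerType value out of "scontrol show config" output. The function name is hypothetical, and it assumes the usual "Name = value" layout of that output.

```shell
#!/bin/sh
# Sketch only: scheduler_type is a hypothetical helper name.

# Reads "scontrol show config" output on stdin and prints the value of
# SchedulerType (for example sched/backfill or sched/builtin).
# Assumption: the line looks like "SchedulerType   = sched/backfill".
scheduler_type() {
    awk -F'= *' '/^SchedulerType[ \t]*=/ { print $2; exit }'
}

# Typical use on a cluster:
#   scontrol show config | scheduler_type
#
# With backfill scheduling, submitting with an explicit time limit helps,
# e.g. a 30-minute limit (job.sh is a placeholder script name):
#   sbatch --time=30 job.sh
```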

Jobs and nodes are stuck in COMPLETING state

This is typically due to non-killable processes associated with the job. Slurm will continue to attempt terminating the processes with SIGKILL, but some processes may be stuck performing I/O and cannot be killed. This is typically due to a file system problem and may be addressed in a couple of ways.

Fix the file system and/or reboot the node. -OR-
Set the node to a DOWN state and then return it to service ("scontrol update NodeName=<node> State=down Reason=hung_proc" and "scontrol update NodeName=<node> State=resume"). This permits other jobs to use the node, but leaves the non-killable process in place.
If the process should ever complete the I/O, the pending SIGKILL should terminate it immediately. -OR-
Use the UnkillableStepProgram and UnkillableStepTimeout configuration parameters to automatically respond to processes which cannot be killed, by sending email or rebooting the node. For more information, see the slurm.conf documentation.
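
The second option can be written down as a dry-run helper that only prints the two scontrol commands for a given node so they can be reviewed before being run as root. The function name and the node name tux3 in the usage note are placeholders, not taken from any real cluster.

```shell
#!/bin/sh
# Sketch only: node_cycle_commands is a hypothetical helper name.

# Prints (does not execute) the commands that set a node DOWN and then
# return it to service. $1 is the node name, $2 an optional reason string.
node_cycle_commands() {
    node=$1
    reason=${2:-hung_proc}
    printf 'scontrol update NodeName=%s State=down Reason=%s\n' "$node" "$reason"
    printf 'scontrol update NodeName=%s State=resume\n' "$node"
}
```

For example, "node_cycle_commands tux3" prints the down and resume commands for a node named tux3. The non-killable process still remains on the node afterwards.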

Nodes are getting set to a DOWN state

Check the reason why the node is down using the command "scontrol show node <name>". This will show the reason why the node was set down and the time when it happened.
If there is insufficient disk space, memory space, etc. compared to the parameters specified in the slurm.conf file then either fix the node or change slurm.conf.
If the reason is "Not responding", then check communications between the control machine and the DOWN node using the command "ping <address>", being sure to specify the NodeAddr values configured in slurm.conf.
If ping fails, then fix the network or addresses in slurm.conf.
Next, log in to a node that Slurm considers to be in a DOWN state and check if the slurmd daemon is running with the command "ps -el | grep slurmd".
If slurmd is not running, restart it (typically as user root using the command "/etc/init.d/slurm start").
You should check the log file (SlurmdLog in the slurm.conf file) for an indication of why it failed.
You can get the status of the running slurmd daemon by executing the command "scontrol show slurmd" on the node of interest.
Check the value of "Last slurmctld msg time" to determine if the slurmctld is able to communicate with the slurmd.
If slurmd is running but not responding (a very rare situation), then kill and restart it (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm start").
If still not responding, try again to rule out network and configuration problems.
If still not responding, increase the verbosity of debug messages (increase SlurmdDebug in the slurm.conf file) and restart.
Again check the log file for an indication of why it failed.
If still not responding without an indication as to the failure mode, restart without preserving state (typically as user root using the commands "/etc/init.d/slurm stop" and then "/etc/init.d/slurm startclean").
Note: All jobs and other state information on that node will be lost.
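
A minimal sketch of the first step in this section: extract the Reason field from "scontrol show node <name>" output and decide whether the "Not responding" path applies. The helper names are hypothetical, and the sketch assumes the reason appears on its own line as "Reason=<text>", possibly indented.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Slurm commands.

# Reads "scontrol show node <name>" output on stdin and prints whatever
# follows "Reason=" (typically the reason text plus who set it and when).
# Assumption: the field starts a line, optionally preceded by whitespace.
node_down_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Succeeds if the given reason text starts with "Not responding", which is
# the case where ping and the slurmd checks are the next steps.
reason_is_not_responding() {
    case $1 in
        "Not responding"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use (replace <name> with a real node name):
#   reason=$(scontrol show node <name> | node_down_reason)
#   reason_is_not_responding "$reason" && echo "ping the NodeAddr next"
```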

Networking and configuration problems

Check the controller and/or slurmd log files (SlurmctldLog and SlurmdLog in the slurm.conf file) for an indication of why it is failing.
Check for consistent slurm.conf and credential files on the node(s) experiencing problems.
If this is a user-specific problem, check that the user is configured on the controller computer(s) as well as the compute nodes.
The user doesn't need to be able to log in, but the user ID must exist.
Check that compatible versions of Slurm exist on all of the nodes (execute "sinfo -V" or "rpm -qa | grep slurm").
The Slurm version number contains three period-separated numbers that represent both the major Slurm release and maintenance release level. The first two parts combine to represent the major release, and match the year and month of that major release. The third number in the version designates a specific maintenance level:
year.month.maintenance-release (e.g. 17.11.5 is major Slurm release 17.11, and maintenance version 5).
Thus version 17.11.x was initially released in November 2017.
Slurm daemons will support RPCs and state files from the two previous major releases (e.g. a version 17.11.x SlurmDBD will support slurmctld daemons and commands with a version of 17.11.x, 17.02.x or 16.05.x).
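
The version layout described above can be illustrated with a few shell helpers that split a version string into its major release and maintenance level. The function names are hypothetical. Deciding whether two different major releases are compatible still requires knowing the actual release sequence, which this sketch does not encode.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Slurm commands.

# Prints the major release (year.month) of a version such as 17.11.5.
slurm_major() {
    echo "$1" | cut -d. -f1,2
}

# Prints the maintenance level (third field) of a version such as 17.11.5.
slurm_maintenance() {
    echo "$1" | cut -d. -f3
}

# Succeeds if both versions belong to the same major release.
same_major() {
    [ "$(slurm_major "$1")" = "$(slurm_major "$2")" ]
}

# Typical use, assuming "sinfo -V" prints a line like "slurm 17.11.5":
#   version=$(sinfo -V | awk '{ print $2 }')
#   slurm_major "$version"
```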

Last modified 26 April 2019