Failure Management Support

Overview
Architecture
Configuration
Commands and API
Example

Overview

Modern high performance computers can include thousands of nodes and millions of cores. The shear number of components can result in a mean time between failure measured in hours, which is lower than the execution time of many applications. In order to effectively manage such large systems, Slurm includes infrastructure designed to work with applications allowing them to manage failures and keep running. Slurm infrastructure includes support for:
現代の高性能コンピューターには、数千のノードと数百万のコアが含まれる場合があります。コンポーネントのせん断数により、障害間の平均時間は時間単位で測定され、多くのアプリケーションの実行時間よりも短くなります。このような大規模システムを効率的に管理するために、Slurmには、アプリケーションを操作して障害を管理し、実行を継続できるように設計されたインフラストラクチャが含まれています。Slurmインフラストラクチャには、次のサポートが含まれます。

Permitting users to drain nodes they believe are failing
失敗していると思われるノードをユーザーがドレインできるようにする
A pool of hot-spare resources, which applications can use to replace failed or failing resources in their current allocation
ホットスペアリソースのプール。アプリケーションは、現在の割り当てで失敗したリソースまたは失敗したリソースを置き換えるために使用できます。
Extending a job's time limit to recover from failures
障害から回復するためのジョブの制限時間の延長
Real time application notification of failures and availability of replacement resources
障害のリアルタイムのアプリケーション通知と交換リソースの可用性
Access control list control over failure management capabilities
障害管理機能に対するアクセス制御リスト制御
A library that applications can link to directly without contamination by GPL licensing
アプリケーションがGPLライセンスによる汚染なしに直接リンクできるライブラリ
A command line interface to these capabilities
これらの機能へのコマンドラインインターフェイス

Architecture

The failure management is implemented as a client server in which the server is a plugin in slurmctld and the client is an API that negotiate the resource management allocation from an already running job.
障害管理はクライアントサーバーとして実装され、サーバーはslurmctldのプラグインであり、クライアントは既に実行中のジョブからのリソース管理割り当てをネゴシエートするAPIです。

These are the architectural components:
これらは建築コンポーネントです：

The plugin implements the server logic which keeps track of nodes allocated to jobs and job steps, the state of nodes and their availability. Having the complete view of resource allocations the server can effectively help applications to be resilient to node failures.
プラグインは、ジョブとジョブステップに割り当てられたノード、ノードの状態、およびそれらの可用性を追跡するサーバーロジックを実装します。サーバーはリソース割り当ての完全なビューを持っているので、アプリケーションがノードの障害に対して回復力を持つのに効果的に役立ちます。
libsmd.so, libsmd.a and smd.h are the client interface and library to the non stop services based on a jobID.
libsmd.so、libsmd.a、およびsmd.hは、jobIDに基づく非停止サービスへのクライアントインターフェイスおよびライブラリです。
smd command build on top of the library provides command line interface to failure management services. nonstop.sh shell script which automates the node replacement based on user supplied environment variables.
ライブラリの上に構築されたsmdコマンドは、障害管理サービスへのコマンドラインインターフェイスを提供します。ユーザーが指定した環境変数に基づいてノードの置換を自動化するnonstop.shシェルスクリプト。

The controller keeps nodes in several states which reflects their usability.
コントローラーはノードをいくつかの状態に保ち、その使いやすさを反映しています。

Failed hosts, currently out of service.
障害が発生したホスト、現在サービス停止中。
Failing hosts, malfunctioning and/or expected to fail.
ホストの障害、誤動作、または障害が予想されます。
Hot Spare. A cluster-wide pool of resources to be made available to jobs with failed/failing nodes. The hot spare pool can be partition based, the administrator specifies how many spares in a given partition.
ホットスペア。障害のあるノードまたは失敗したノードのあるジョブで利用できるようにする、クラスター全体のリソースのプール。ホットスペアプールはパーティションベースにすることができ、管理者は特定のパーティション内のスペアの数を指定します。

Failing hosts can be drained and then dropped from the allocation, giving application flexibility to manage its own resources
障害のあるホストはドレインしてから割り当てから削除できるため、アプリケーションが独自のリソースを管理する柔軟性が得られます

Drained nodes can be put back on-line by the administrator and they will go automatically back to the spare pool. Failed nodes can be put back on-line and they will go automatically back to the spare pool.
ドレインされたノードは、管理者がオンラインに戻すことができ、自動的にスペアプールに戻ります。障害が発生したノードはオンラインに戻すことができ、自動的にスペアプールに戻ります。

Application usually detects the failure by itself when losing one or more of its component, then it is able to notify Slurm of failures and drain nodes. Application can also query Slurm about state of nodes in its allocation and/or asks Slurm to replace its failed/failing nodes, then it can
アプリケーションは通常、コンポーネントを1つ以上失うと、それ自体で障害を検出し、Slurmに障害とドレインノードを通知できます。アプリケーションは、割り当てのノードの状態についてSlurmにクエリを実行したり、Slurmに障害のある/障害のあるノードを交換するように要求したりできます。

Wait for nodes become available, eventually increasing its runtime till then
ノードが利用可能になるまで待ち、最終的にはそれまでの実行時間を増やします
Increase its runtime upon node replacement
ノード交換時にランタイムを増加させる
Drop the nodes and continue, eventually increasing its runtime
ノードを削除して続行すると、最終的にはランタイムが増加します

Fig1. - Components of the failure management.

The figure 1 shows the component interaction.
図1は、コンポーネントの相互作用を示しています。

The job step. Rank 0 is the rank responsible for monitoring the state of other ranks, it then negotiates with the controller resource allocation in case of failure. There are two ways in which rank 0 interacts with the controller:
ジョブステップ。ランク0は、他のランクの状態の監視を担当するランクで、障害が発生した場合にコントローラーのリソース割り当てとネゴシエートします。ランク0がコントローラーと相互作用する方法は2つあります。
1. Using a command that detects failure and generates a new hostlist for the application.
  障害を検出し、アプリケーションの新しいホストリストを生成するコマンドを使用します。
2. Link with the smd library and subscribe with the controller for events.
  smdライブラリーとリンクし、イベントのコントローラーをサブスクライブします。
The slurmctld process itself which loads the failure management plugin.
障害管理プラグインをロードするslurmctldプロセス自体。
The failure management plugin itself. The plugin talks to the smd library over a tcp/ip connection.
障害管理プラグイン自体。プラグインは、tcp / ip接続を介してsmdライブラリと通信します。

The figure 2 show a simple protocol interaction between the application and the controller. In this case the application detects the failure by itself so the job steps terminates. The application then asks for a node replacement either using the smd command with appropriate parameters or via an API call. In this specific example the controller allocates a new node to the application which can then initiates a new job step.
図2は、アプリケーションとコントローラー間の単純なプロトコル対話を示しています。この場合、アプリケーション自体が障害を検出するため、ジョブステップが終了します。次に、アプリケーションは、適切なパラメーターを指定したsmdコマンドを使用するか、API呼び出しを介して、ノードの交換を要求します。この特定の例では、コントローラーは新しいノードをアプリケーションに割り当て、アプリケーションは新しいジョブステップを開始できます。

Fig3. - More complex protocol interaction.

The figure 3 shows more complex protocol negotiation. In this case the application query the controller to check if any node in the step has failed, then ask for replacement which is available in 1 hour. The application then decides to drop the node and keep running with less resources but having its runtime extended by 1 hour.
図3は、より複雑なプロトコルネゴシエーションを示しています。この場合、アプリケーションはコントローラーにクエリを送信して、ステップ内のいずれかのノードが失敗したかどうかを確認し、1時間以内に利用可能な交換を要求します。次に、アプリケーションはノードを削除し、少ないリソースで実行を継続することを決定しますが、ランタイムは1時間延長されます。

Configuration

Slurm's slurmctld/nonstop plugin acts as the server for these functions. Configure in slurm.conf as follows:
Slurmのslurmctld / nonstopプラグインは、これらの機能のサーバーとして機能します。slurm.confで次のように構成します。

SlurmctldPlugin=slurmctld/nonstop

This plugin uses a nonstop.conf file for it's configuration details. See "man nonstop.conf" for details about the options. The nonstop.conf file must be in the same directory as your slurm.conf file. A sample nonstop.conf file appears below.
このプラグインは、設定の詳細にnonstop.confファイルを使用します。オプションの詳細については、「man nonstop.conf」を参照してください。nonstop.confファイルは、slurm.confファイルと同じディレクトリにある必要があります。nonstop.confファイルのサンプルを以下に示します。

#
# Sample nonstop.conf file
#
# Same as BackupAddr in slurm.conf
BackupAddr=bacco
# Same as ControlAddr in slurm.conf
ControlAddr=prometeo
# Verbosity of plugin logging
Debug=0
# Count of hot spare nodes in partition "batch"
HotSpareCount=batch:8
# Maximum hot spare node use by any single job
MaxSpareNodeCount=4
# Port the slurmctld/nonstop plugin reads from
Port=9114
# Maximum time limit extension if no hot spares
TimeLimitDelay=600
# Time limit extension for each replaced node
TimeLimitExtend=2
# Users allowed to drain nodes
UserDrainAllow=alan,brenda

An additional package named smd will also need to be installed. This package is available at no additional cost for systems under a support contract with SchedMD LLC (including support contracts procured through other vendors of Slurm support such as Bull and Cray). Please contact the appropriate vendor for the software.
smdという名前の追加パッケージもインストールする必要があります。このパッケージは、SchedMD LLCとのサポート契約（BullやCrayなど、Slurmサポートの他のベンダーを通じて調達されたサポート契約を含む）に基づくシステムでは追加費用なしで利用できます。ソフトウェアの適切なベンダーにお問い合わせください。

The smd package includes a library and command line interface, libsmd and smd respectively. Note that libsmd is not under the GPL license and can be link by other programs without those programs needing to be released under a GPL license.
smdパッケージには、ライブラリとコマンドラインインターフェイス、それぞれlibsmdとsmdが含まれています。libsmdはGPLライセンスのもとではなく、他のプログラムからリンクできることに注意してください。これらのプログラムはGPLライセンスのもとでリリースする必要はありません。

Commands and API

In this paragraph we examine the user interaction with the system. As stated previously users can use the smd command or the API.
この段落では、ユーザーとシステムとの対話を調べます。前述のとおり、ユーザーはsmdコマンドまたはAPIを使用できます。

The smd command is driven by command line options, see smd man page for more details. However the smd command can also be driven by environment variables which makes it easier to be deployed in user applications. The environment variables describes what action to take upon host failure or if the host is failing.
smdコマンドは、コマンドラインオプションによって駆動されます。詳細については、smdのマニュアルページを参照してください。ただし、smdコマンドは環境変数によっても制御できるため、ユーザーアプリケーションでの展開が容易になります。環境変数は、ホストの障害時、またはホストに障害が発生した場合に実行するアクションを記述します。

The environmental variables to set are SMD_NONSTOP_FAILED or SMD_NONSTOP_FAILING they control the controller actions when a node failed or it is failing.
設定する環境変数は、SMD_NONSTOP_FAILEDまたはSMD_NONSTOP_FAILINGで、ノードに障害が発生したとき、またはノードが失敗したときのコントローラーのアクションを制御します。

The variables can take the following values based on the desired action.
変数は、目的のアクションに基づいて次の値をとることができます。

REPLACE
Replace the failed/failing nodes.
故障した/故障したノードを交換します。
DROP Drop the nodes from the allocation.
ノードを割り当てから削除します。
TIME_LIMIT_DELAY=Xmin
If a job requires replacement resources and none are immediately available, then permit a job to extend its time limit by the length of time required to secure replacement resources up to the number of minutes specified.
ジョブに交換リソースが必要であり、すぐに利用できるリソースがない場合は、ジョブがその制限時間を、指定された分数まで交換リソースを確保するのに必要な時間だけ延長できるようにします。
TIME_LIMIT_EXTEND=Ymin
Specifies the number of minutes that a job can extend its time limit for each replaced node.
置き換えられた各ノードについて、ジョブが時間制限を延長できる分数を指定します。
TIME_LIMIT_DROP=Zmin
Specifies the number of minutes that a job can extend its time limit for each failed or failing node removed from the job's allocation.
ジョブの割り当てから削除された、失敗したノードまたは失敗したノードごとに、ジョブが時間制限を延長できる分数を指定します。
EXIT_JOB
Exit the job upon node failure.
ノード障害時にジョブを終了します。

The variables can be combined in to logical and expressions.
変数は、論理式と式に組み合わせることができます。

SMD_NONSTOP_FAILED="REPLACE:TIME_LIMIT_DELAY=4:EXIT_JOB
This directive instructs the smd command to attempt to replace failed/failing nodes, if replacement is not immediately available wait up to 4 minutes, then if still not available exit the job.
このディレクティブは、障害が発生したノードまたは障害が発生したノードの交換を試みるようにsmdコマンドに指示します。交換がすぐに利用できない場合は、最大4分間待機し、それでも利用できない場合はジョブを終了します。
SMD_NONSTOP_FAILED="REPLACE:TIME_LIMIT_DELAY=4:DROP:TIME_LIMIT_DROP=10
This instructs smd command to attempt to replace nodes with a 4 minutes timeout, then drop the failed nodes and extend the step runtime by 10 minutes.
これにより、smdコマンドは、4分のタイムアウトでノードの置き換えを試み、失敗したノードを削除して、ステップランタイムを10分延長します。

Example

In this use case analysis we illustrate how to use the nonstop.sh and steps.sh script provided with the failure management package. The steps.sh is the Slurm batch script that executes job steps.
このユースケース分析では、障害管理パッケージで提供されるnonstop.shおよびsteps.shスクリプトの使用方法を示します。steps.shは、ジョブステップを実行するSlurmバッチスクリプトです。

Its main loop looks like this:
メインループは次のようになります。

for ((i = 1; i <= $num; i++))
do
# Run the $i step of my application
  srun -l --mpi=pmi2 $PWD/star -t 10 > /dev/null
# After each step invoke the snonstop.sh script
# which will detect if any node has failed and
# will execute the actions specified by the user
# via the SMD_NONSTOP_FAILED variable
  $PWD/nonstop.sh
  if [ $? -ne 0 ]; then
     exit 1
  fi
echo "start: step $i `date`"
done

Let's run several tests, in all test we submit the batch jobs as sbatch steps.sh. Note The batch job must use the --no-kill option to prevent Slurm from terminating the entire batch job upon one step failure.
いくつかのテストを実行してみましょう。すべてのテストで、バッチジョブをsbatch steps.shとして送信します。注バッチジョブでは、-no-killオプションを使用して、1つのステップが失敗したときにSlurmがバッチジョブ全体を終了しないようにする必要があります。

Set SMD_NONSTOP_FAILED="REPLACE"
Setting num=2 to run only 2 steps.
With no host failure the job will run to completion and the nonstop.sh will print to its stdout the status of the nodes in the allocation.
ホストに障害が発生しない場合、ジョブは最後まで実行され、nonstop.shは割り当て内のノードのステータスを標準出力に出力します。
```
start: step 0 Fri Mar 28 11:47:19 PDT 2014
is_failed: job 58 searching for FAILED hosts
is_failed: job 58 has no FAILED nodes
start: step 1 Fri Mar 28 11:47:55 PDT 2014
is_failed: job 58 searching for FAILED hosts
is_failed: job 58 has no FAILED nodes
```

Set SMD_NONSTOP_FAILED="REPLACE
Setting num=10 to run 10 steps.
As the job starts set one of the execution nodes down using the scontrol command. As soon as the running step finishes the nonstop.sh runs instructing the smd command to detect node failure and ask for replacement. The script stdout is:
ジョブが開始したら、scontrolコマンドを使用して実行ノードの1つを停止します。実行中のステップが完了するとすぐに、nonstop.shが実行され、smdコマンドにノードの障害を検出して交換を要求するように指示します。スクリプトstdoutは次のとおりです。
```
is_failed: job 59 searching for FAILED hosts
is_failed: job 59 has 1 FAILED nodes
is_failed: job 59 FAILED node ercole cpu_count 1
_handle_fault: job 59 handle failed_hosts
_try_replace: job 59 trying to replace 1 nodes
_try_replace: job 59 node ercole replaced by prometeo
_generate_node_file: job 59 all nodes replaced
```
Examine the output of sinfo--format="%12P %.10n %.5T %.14C" command.
```
->sinfo
PARTITION HOSTNAMES  STATE   CPUS(A/I/O/T)
markab*     dario    mixed    1/15/0/16
markab*     prometeo mixed    1/15/0/16
markab*     spartaco idle     0/16/0/16
markab*     ercole   down     0/0/16/16
```
From the output we see the initial host ercole which went down was replaced by the host prometeo.
出力から、ダウンした最初のホストercoleがホストprometeoに置き換えられたことがわかります。

Set SMD_NONSTOP_FAILED="REPLACE:TIME_LIMIT_DELAY=1:DROP:TIME_LIMIT_DROP=30"
Modify steps.sh to run on all available nodes so when a node fails there is no replacement available.
使用可能なすべてのノードで実行されるようにsteps.shを変更して、ノードに障害が発生したときに代替が利用できないようにします。

Upon the node failure the system will try for one minute to get a replacement after that period of time will drop the node and extend the job's runtime. This time the stdout is more verbose.
ノード障害が発生すると、システムは1分間試行して交換を取得します。その期間が経過すると、ノードが削除され、ジョブのランタイムが延長されます。今回はstdoutがより冗長になりました。

is_failed: job 151 searching for FAILED hosts
is_failed: job 151 has 1 FAILED nodes
is_failed: job 151 FAILED node spartaco cpu_count 1
_handle_fault: job 151 handle failed_hosts
_try_replace: job 151 trying to replace 1 nodes
_try_replace: smd_replace_node() error job_id 151: Failed to replace the node
_time_limit_extend: job 151 extending job time limit by 1 minutes
_increase_job_runtime: job 151 run time limit extended by 1min successfully
_try_replace: job 151 waited for 0 sec cnt 0 trying every 20 sec...
_try_replace: job 151 trying to replace 1 nodes
_try_replace: smd_replace_node() error job_id 151: Failed to replace the node
_try_replace: job 151 waited for 20 sec cnt 1 trying every 20 sec..._try_replace: smd_replace_node() error job_id 151: Failed to replace the node
_try_replace: job 151 waited for 40 sec cnt 2 trying every 20 sec...
_try_replace: job 151 failed to replace down or failing nodes:
   spartaco
_drop_nodes: job 151 node spartaco dropped all right
_generate_node_file: job 151 all nodes replaced
source the /tmp/smd_job_151_nodes.sh hostfile to get the new job environment
_time_limit_extend: job 151 extending job time limit by 30 minutes
_increase_job_runtime: job 151 run time limit extended by 30min successfully

As instructed the system tried to replace the failed node for a minute, then it drop the node and extended the runtime for 30 minutes. The extended runtime must be within the system TimeLimitDrop configured in nonstop.conf.
指示されたとおり、システムは障害が発生したノードを1分間交換しようとしましたが、ノードを削除し、ランタイムを30分間延長しました。拡張ランタイムは、nonstop.confで構成されたシステムTimeLimitDrop内にある必要があります。

The job step acquires the new environment from the hosts file smd_job_$SLURM_JOB_ID_nodes.sh which is created by the smd command in /tmp on the node where the batch script runs.
ジョブステップは、バッチスクリプトが実行されるノードの/ tmpにあるsmdコマンドによって作成されるホストファイルsmd_job_ ＄ SLURM_JOB_ID_nodes.shから新しい環境を取得します。

export SLURM_NODELIST=dario,prometeo
export SLURM_JOB_NODELIST=dario,prometeo
export SLURM_NNODES=2
export SLURM_JOB_NUM_NODES=2
export SLURM_JOB_CPUS_PER_NODE=2\(x2\)
unset SLURM_TASKS_PER_NODE

This environment is sourced by the steps.sh script so the next job steps will run using the new environment.
この環境は、steps.shスクリプトによって提供されるため、次のジョブステップは新しい環境を使用して実行されます。

Last modified 20 February 2014