gres.conf

Section: Slurm Configuration File (5)
Updated: Slurm Configuration File

NAME

gres.conf - Slurm configuration file for Generic RESource (GRES) management.

DESCRIPTION

gres.conf is an ASCII file which describes the configuration of Generic RESource (GRES) on each compute node. If the GRES information in the slurm.conf file does not fully describe those resources, then a gres.conf file should be included on each compute node. The file location can be modified at system build time using the DEFAULT_SLURM_CONF parameter or at execution time by setting the SLURM_CONF environment variable. The file will always be located in the same directory as the slurm.conf file.

If the GRES information in the slurm.conf file fully describes those resources (i.e. no "Cores", "File" or "Links" specification is required for that GRES type or that information is automatically detected), that information may be omitted from the gres.conf file and only the configuration information in the slurm.conf file will be used. The gres.conf file may be omitted completely if the configuration information in the slurm.conf file fully describes all GRES.

Parameter names are case insensitive. Any text following a "#" in the configuration file is treated as a comment through the end of that line. Changes to the configuration file take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure" unless otherwise noted.

NOTE: Slurm support for gres/mps requires the use of the select/cons_tres plugin. For more information on how to configure MPS, see https://slurm.schedmd.com/gres.html#MPS_Management.

For more information on GRES scheduling in general, see https://slurm.schedmd.com/gres.html.

The overall configuration parameters available include:

AutoDetect

The hardware detection mechanisms to enable for automatic GRES configuration. This should be on a line by itself. Current options are:

nvml: Used to automatically detect NVIDIA GPUs
rsmi: Used to automatically detect AMD GPUs

Count

Number of resources of this type available on this node. The default value is set to the number of File values specified (if any), otherwise the default value is one. A suffix of "K", "M", "G", "T" or "P" may be used to multiply the number by 1024, 1048576, 1073741824, etc. respectively. For example: "Count=10G".
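
As a small illustration of the suffix arithmetic described above, the following line is a sketch of a purely countable resource. The GRES name "bandwidth" is a hypothetical name chosen for this example and, like any GRES name, would also have to be listed in GresTypes in slurm.conf.

```
# Hypothetical countable resource with no device files.
# "G" multiplies by 1073741824, so 10G = 10 * 1073741824 = 10737418240.
Name=bandwidth Count=10G
```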

Cores

Optionally specify the first thread CPU index numbers for the specific cores which can use this resource. For example, it may be strongly preferable to use specific cores with specific GRES devices (e.g. on a NUMA architecture). While Slurm can track and assign resources at the CPU or thread level, the scheduling algorithms it uses to co-allocate GRES devices with CPUs operate at a socket or NUMA level. Therefore it is not possible to preferentially assign GRES to different specific CPUs on the same NUMA node or socket, and this option should be used to identify all cores on some socket.

Multiple cores may be specified using a comma delimited list or a range may be specified using a "-" separator (e.g. "0,1,2,3" or "0-3"). If a job specifies --gres-flags=enforce-binding, then only the identified cores can be allocated with each generic resource. This will tend to improve performance of jobs, but delay the allocation of resources to them. If specified and a job is not submitted with the --gres-flags=enforce-binding option, the identified cores will be preferred for scheduling with each generic resource.

If --gres-flags=disable-binding is specified, then any core can be used with the resources, which also increases the speed of Slurm's scheduling algorithm but can degrade the application performance. The --gres-flags=disable-binding option is currently required to use more CPUs than are bound to a GRES (e.g. if a GPU is bound to the CPUs on one socket, but resources on more than one socket are required to run the job). If any core can be effectively used with the resources, then omit the Cores option to speed up the Slurm scheduling logic. A restart of the slurmctld is needed for changes to the Cores option to take effect.

NOTE: If your cores contain multiple threads only the first thread (processing unit) of each core needs to be listed. Also note that since Slurm must be able to perform resource management on heterogeneous clusters having various processing unit numbering schemes, a logical processing unit index must be specified instead of the physical processing unit index. That processing unit logical index might not correspond to your physical index number. Processing unit 0 will be the first socket, first core and (if configured) first thread. If hyperthreading is enabled, processing unit 1 will always be the first socket, first core and second thread. If hyperthreading is not enabled, processing unit 1 will always be the first socket and second core. This numbering coincides with the processing unit logical number (PU L#) seen in "lstopo -l" command output.
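
As an illustration of the socket-level binding described above, the following sketch assumes a hypothetical node with two sockets of eight cores each, hyperthreading disabled (so logical processing unit indexes and core indexes coincide), and one GPU attached to each socket. The layout and device paths are assumptions made for this example and should be checked against "lstopo -l" output on the actual node.

```
# Assumed layout: socket 0 holds logical cores 0-7, socket 1 holds 8-15.
# Each line lists all cores of the socket nearest to that GPU.
Name=gpu File=/dev/nvidia0 Cores=0-7
Name=gpu File=/dev/nvidia1 Cores=8-15
```

With such a configuration, a job submitted with --gres-flags=enforce-binding would only be allocated cores from the socket listed for its GPU, while a job without that flag would merely prefer them.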

File

Fully qualified pathname of the device files associated with a resource. The name can include a numeric range suffix to be interpreted by Slurm (e.g. File=/dev/nvidia[0-3]).

This field is generally required if enforcement of generic resource allocations is to be supported (i.e. prevents users from making use of resources allocated to a different user). Enforcement of the file allocation relies upon Linux Control Groups (cgroups) and Slurm's task/cgroup plugin, which will place the allocated files into the job's cgroup and prevent use of other files. Please see Slurm's Cgroups Guide for more information: https://slurm.schedmd.com/cgroups.html.

If File is specified then Count must be either set to the number of file names specified or not set (the default value is the number of files specified). The exception to this is MPS. For MPS, each GPU would be identified by device file using the File parameter and Count would specify the number of MPS entries that would correspond to that GPU (typically 100 or some multiple of 100).

NOTE: If you specify the File parameter for a resource on some node, the option must be specified on all nodes and Slurm will track the assignment of each specific resource on each node. Otherwise Slurm will only track a count of allocated resources rather than the state of each individual device file.

NOTE: Drain a node before changing the count of records with File parameters (i.e. if you want to add or remove GPUs from a node's configuration). Failure to do so will result in any job using those GRES being aborted.
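
The relationship between File and Count described above might look like the following sketch for a node with two GPUs. The device paths and the MPS count of 200 are illustrative assumptions, not recommended values.

```
# Two device files on one line: Count may be omitted and defaults to 2.
Name=gpu File=/dev/nvidia[0-1]
# MPS exception: one line per GPU, and Count is the number of MPS
# entries for that GPU rather than the number of device files.
Name=mps Count=200 File=/dev/nvidia0
Name=mps Count=200 File=/dev/nvidia1
```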

Flags

Optional flags that can be specified to change configured behavior of the GRES.

Allowed values at present are:

CountOnly: Do not attempt to load a plugin, as this GRES will only be used to track counts of GRES used. This avoids attempting to load a non-existent plugin, which can affect filesystems with high-latency metadata operations for non-existent files.
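
A count-only resource could be declared as in the following sketch. The name "scratch" is hypothetical, standing in for any site-defined GRES that has no plugin and no device files. It would also need to appear in GresTypes in slurm.conf.

```
# Hypothetical GRES tracked purely by count; no plugin load is attempted.
Name=scratch Count=4 Flags=CountOnly
```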

Links

A comma-delimited list of numbers identifying the number of connections between this device and other devices to allow coscheduling of better connected devices. This is an ordered list in which the number of connections this specific device has to device number 0 would be in the first position, the number of connections it has to device number 1 in the second position, etc. A -1 indicates the device itself and a 0 indicates no connection. If specified, then this line can only contain a single GRES device (i.e. can only contain a single file via File).

This is an optional value and is usually automatically determined if AutoDetect is enabled. A typical use case would be to identify GPUs having NVLink connectivity. Note that for GPUs, the minor number assigned by the OS and used in the device file (i.e. the X in /dev/nvidiaX) is not necessarily the same as the device number/index. The device number is created by sorting the GPUs by PCI bus ID and then numbering them starting from the smallest bus ID. See https://slurm.schedmd.com/gres.html#GPU_Management
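
To make the ordered-list format concrete, the following sketch assumes a hypothetical node with four GPUs in which devices 0 and 1 share two connections, devices 2 and 3 share two connections, and there are no other links. It also assumes that the device numbers happen to match the minor numbers in the device file names, which, as noted above, is not guaranteed on a real system.

```
# Position N in each list is the number of connections to device N.
# -1 marks the device itself, 0 means no connection.
Name=gpu File=/dev/nvidia0 Links=-1,2,0,0
Name=gpu File=/dev/nvidia1 Links=2,-1,0,0
Name=gpu File=/dev/nvidia2 Links=0,0,-1,2
Name=gpu File=/dev/nvidia3 Links=0,0,2,-1
```

Each line names exactly one device file, as required when Links is specified.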

Name

Name of the generic resource. Any desired name may be used. The name must match a value in GresTypes in slurm.conf. Each generic resource has an optional plugin which can provide resource-specific functionality. Generic resources that currently include an optional plugin are:

gpu: Graphics Processing Unit
mps: CUDA Multi-Process Service (MPS)
nic: Network Interface Card
mic: Intel Many Integrated Core (MIC) processor

NodeName

An optional NodeName specification can be used to permit one gres.conf file to be used for all compute nodes in a cluster by specifying the node(s) that each line should apply to. The NodeName specification can use a Slurm hostlist specification as shown in the example below.

Type

An optional arbitrary string identifying the type of device. For example, this might be used to identify a specific model of GPU, which users can then specify in a job request. If Type is specified, then Count is limited in size (currently 1024).

EXAMPLES

##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define GPU devices with MPS support
##################################################################
AutoDetect=nvml
Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
Name=gpu Type=tesla File=/dev/nvidia1 COREs=2,3
Name=mps Count=100 File=/dev/nvidia0 COREs=0,1
Name=mps Count=100 File=/dev/nvidia1 COREs=2,3

##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Overwrite system defaults and explicitly configure three GPUs
##################################################################
Name=gpu Type=tesla File=/dev/nvidia[0-1] COREs=0,1
# Name=gpu Type=tesla File=/dev/nvidia[2-3] COREs=2,3
# NOTE: nvidia2 device is out of service
Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3

##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Use a single gres.conf file for all compute nodes - positive method
##################################################################
## Explicitly specify devices on nodes tux0-tux15
# NodeName=tux[0-15] Name=gpu File=/dev/nvidia[0-3]
# NOTE: tux3 nvidia1 device is out of service
NodeName=tux[0-2] Name=gpu File=/dev/nvidia[0-3]
NodeName=tux3 Name=gpu File=/dev/nvidia[0,2-3]
NodeName=tux[4-15] Name=gpu File=/dev/nvidia[0-3]

##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Use NVML to gather GPU configuration information
# Information about all other GRES gathered from slurm.conf
##################################################################
AutoDetect=nvml

COPYING

This file is part of Slurm, a resource management program. For details, see <https://slurm.schedmd.com/>.

Slurm is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.