Slurm Job Manager
Basics of Using the Slurm Job Scheduler
To run computations on the cluster, you must use the Slurm queuing system. It lets users request the compute resources they need and have their jobs start automatically as soon as those resources become available, without having to monitor server load themselves. Once submitted, a job requires no further user intervention. To submit your job to the queue, run:
sbatch <script name>
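For example, if your job script is saved as my_job.sh (a placeholder name), submission looks like this, and Slurm replies with the assigned job ID:
sbatch my_job.sh
# Slurm prints a confirmation such as:
# Submitted batch job 12345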
Slurm examples
You can find examples in the cluster directory home/INFO.
General example:
#!/bin/bash
#SBATCH --job-name=python_job # Arbitrary job name
#SBATCH --mem=60G # Total RAM
#SBATCH --ntasks=8 # Number of CPU cores
#SBATCH --time=00:15:00 # Time limit hh:mm:ss
#SBATCH --output=%x.%J.out # Output filename
module load <names of modules used in your job>
<your commands>
Example using Guix packages instead of environment modules:
#!/bin/bash
#SBATCH --job-name=python_job # Arbitrary job name
#SBATCH --mem=60G # Total RAM
#SBATCH --ntasks=8 # Number of CPU cores
#SBATCH --time=00:15:00 # Time limit hh:mm:ss
#SBATCH --output=%x.%J.out # Output filename
guix install <names of packages used in your job>
<your commands>
After submitting the job, an output file named as specified by --output will appear.
This log file contains both Slurm system messages and everything your commands would normally print to the console (standard output).
In the example above, the output filename uses the sbatch filename-pattern variables %x (job name) and %J (Slurm-assigned job ID). See the sbatch documentation for other filename-pattern variables.
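For instance, with the job name python_job from the script above and a hypothetical job ID of 12345, the log file would be python_job.12345.out, and you can follow it while the job is running:
tail -f python_job.12345.out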
It is particularly important to specify the number of CPU cores, amount of RAM, and expected runtime in your script. If you omit these parameters, the defaults are: 1 CPU core, 1 GB RAM, 24 h runtime in the regular and weakold queues, and 2 h in the debug queue.
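The same parameters can also be given on the sbatch command line, where they override the #SBATCH directives in the script. A minimal sketch (the script name is a placeholder):
sbatch --ntasks=4 --mem=8G --time=02:00:00 <script name>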
During execution, Slurm will enforce the resource limits you requested. If a job tries to use more RAM than allocated, it will not be killed but will be throttled to the requested amount, and you will see warnings in the log file. If a job exceeds its time limit, it will be terminated.
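If job accounting is enabled on the cluster (an assumption here), you can check how much memory and time a finished job actually used with sacct, which helps you choose realistic limits for the next submission:
sacct -j <job ID> --format=JobID,JobName,MaxRSS,Elapsed,State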
You can get general information about cluster load with the command sinfo (for detailed per-node information: sinfo -Nl). To view all queued jobs: squeue. To list your own jobs with their job IDs and statuses, run:
squeue -u <your-username>
To cancel or remove your jobs from the waiting queue, use the command:
scancel <job ID>
It is also possible to launch an interactive job, i.e., to get a command line on a compute node for compiling or testing computations using its resources. The example below will allow you to work interactively with 1 CPU core on the debug partition:
srun --partition=debug --pty bash
The srun command can be extended with additional parameters to request more resources. The example below shows how to request 4 CPU cores, 12 GB of RAM, and X11 forwarding for graphical display:
srun --cpus-per-task=4 --mem=12G --x11 --pty bash
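Inside the interactive shell you work as on a regular node: load the modules you need and run your commands. Type exit when you are done so that the allocated resources are released; a rough sketch:
module load <names of modules used in your job>
<your commands>
exit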
Frequently Used Commands
Command | Description
---|---
sbatch <script name> | Submit a job to the queue
sinfo | Get information about cluster load
sinfo -Nl | Get detailed information for each node
squeue | List all queued jobs
squeue -u <your-username> | List jobs of a specific user
scancel <job ID> | Cancel or remove a job from the queue
srun | Launch an interactive job with specified parameters
tail -f <filename> | Follow the output of a running job
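Putting it together, a typical session might look like the following sketch (the script name, job name, and job ID are placeholders):
sbatch my_job.sh              # submit the script; Slurm prints the job ID
squeue -u <your-username>     # check that the job is waiting or running
tail -f python_job.12345.out  # follow the job's output as it runs
scancel 12345                 # cancel the job if something goes wrong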