SLURM sbatch headers
Last update: 20240820
Here we outline the usage of SLURM sbatch headers to efficiently manage jobs that process genomic data across multiple samples, chromosomes, or genomic regions.
Some nodes report 32G of memory for their partition but will not run jobs that request more than #SBATCH --mem 28G. Keep this in mind for other types of overhead that might prevent a job from launching.
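To see how much memory each node advertises before choosing a --mem value, a quick check along these lines may help. This is a minimal sketch: the function name list_node_memory is hypothetical, and the gap between advertised and usable memory (such as the 32G versus 28G case above) still has to be found by experiment or from your cluster's documentation.

```shell
#!/bin/bash
# Hypothetical helper: list each node with its partition and advertised memory.
# sinfo format fields: %N = node name, %P = partition, %m = memory in MB.
list_node_memory() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a cluster login node"
        return 0
    fi
    sinfo -N -o "%N %P %m"
}

list_node_memory
```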
Key SLURM sbatch directives
Here are examples of SLURM sbatch headers used in our scripts:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --mem 30G
#SBATCH --time 96:00:00
#SBATCH --job-name=genomic_analysis
#SBATCH --output=/path/to/log/%x_%A_%a_%J.out
#SBATCH --error=/path/to/log/%x_%A_%a_%J.err
#SBATCH --partition=all-nodes-cpu
#SBATCH --array=1-1215
Explanation of directives
- --nodes: Number of nodes required.
- --ntasks: Number of tasks.
- --cpus-per-task: Number of CPUs per task.
- --mem: Memory required.
- --time: Time limit.
- --job-name: Name of the job.
- --output and --error: Output and error file paths.
- --partition: Specifies the partition.
- --array: Job array settings to split tasks.
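For a large array such as 1-1215, the --array directive also accepts a % suffix that caps how many tasks run at once. The fragment below is a sketch; the limit of 50 is an arbitrary assumption and should be tuned to your cluster's policies.

```shell
#!/bin/bash
#SBATCH --job-name=genomic_analysis
# 1215 tasks in total, with at most 50 running at the same time (assumed limit)
#SBATCH --array=1-1215%50

# Outside of SLURM the variable is unset, so fall back to a placeholder.
echo "Running array task ${SLURM_ARRAY_TASK_ID:-unset}"
```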
Common output and error placeholders:
- %x: Job name specified by --job-name.
- %A: Master job ID of the job array.
- %a: Array index of the specific task within a job array.
- %J: Job ID and step ID of the running job, formatted as jobID.stepID (for example, 128.0).
Additional placeholders:
- %j: Job ID of the running job. Each array task also has its own job ID, so %j works with or without an array.
- %N: Short hostname of the first compute node where the job runs.
- %n: Node index relative to the job.
- %u: Username of the job owner.
Examples:
- Track job details: --output=/path/to/log/%x_%A_%a_%J.out
- Log by node: --output=/path/to/log/%x_%N.out
- User-specific logs: --output=/path/to/log/%u_%x_%j.out
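To preview what a log path will look like before submitting, the placeholder substitution can be imitated locally. The function below is a toy emulation covering only %x, %A, %a, and %u; it is not how SLURM itself performs the expansion, and the name expand_log_pattern and all sample values are made up for illustration.

```shell
#!/bin/bash
# Toy emulation of a few sbatch filename placeholders (not SLURM's own code).
# Arguments: pattern, job name, array job ID, array task index, username.
expand_log_pattern() {
    local pattern=$1 job_name=$2 array_job_id=$3 task_id=$4 user=$5
    # sed is case-sensitive, so %A and %a are replaced independently.
    printf '%s\n' "$pattern" | sed \
        -e "s|%x|${job_name}|g" \
        -e "s|%A|${array_job_id}|g" \
        -e "s|%a|${task_id}|g" \
        -e "s|%u|${user}|g"
}

# Sample values are invented for the demonstration.
expand_log_pattern "/path/to/log/%x_%A_%a.out" genomic_analysis 123456 7 alice
```

With the sample values shown, the call prints /path/to/log/genomic_analysis_123456_7.out.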
Best practices:
- Organize Logs: Use structured directories and naming conventions.
- Use Unique Identifiers: Include %J or %A_%a to prevent overwrites.
- Maintain Privacy: Be cautious about sensitive information in filenames.
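One practical point when organizing logs: SLURM does not create missing directories for --output and --error, and a job whose log directory is absent typically fails without writing a log. A minimal sketch of a submission wrapper follows; the LOG_DIR default and the script name genomic_analysis.sh are placeholders, not part of the original setup.

```shell
#!/bin/bash
# Placeholder log directory; it should match the directory used in --output/--error.
LOG_DIR="${LOG_DIR:-./logs}"

# Create the directory (and any parents) before submitting.
mkdir -p "$LOG_DIR"

# Submit only where sbatch is available; genomic_analysis.sh is a hypothetical script.
if command -v sbatch >/dev/null 2>&1 && [[ -f genomic_analysis.sh ]]; then
    sbatch --output="${LOG_DIR}/%x_%A_%a.out" \
           --error="${LOG_DIR}/%x_%A_%a.err" \
           genomic_analysis.sh
else
    echo "Log directory ready: ${LOG_DIR}"
fi
```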
Using SLURM_ARRAY_TASK_ID
The SLURM_ARRAY_TASK_ID environment variable holds the array index of the current task within a job array. Each task can use it to select a specific file or part of the dataset to process.
Example: Genomic analysis per sample
This example shows how to use SLURM_ARRAY_TASK_ID to process individual genomic samples:
#!/bin/bash
#SBATCH --array=0-99 # Adjust based on the number of samples
# BAM_FILES is an array of BAM paths; with --array=0-99 it must hold 100 entries
BAM_FILES=("/path/to/sample1.bam" "/path/to/sample2.bam" ...)
# DATA must already point to the directory containing the smoove image
SMOOVE="singularity exec ${DATA}/smoove_latest.sif smoove"
EXCLUDE_BED="/path/to/exclude.bed"
REF_NONZIP="/path/to/reference.fasta"
OUTDIR="/path/to/output"
# Pick this task's BAM and derive the sample name from its file name
BAM="${BAM_FILES[$SLURM_ARRAY_TASK_ID]}"
SAMPLE_ID=$(basename "$BAM" .bam)
$SMOOVE call --outdir "$OUTDIR" \
    --exclude "$EXCLUDE_BED" \
    --name "$SAMPLE_ID" \
    --fasta "$REF_NONZIP" \
    -p 1 \
    --genotype "$BAM"
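Hardcoding every BAM path in the script becomes unwieldy for many samples. One common alternative, sketched here under the assumption of a plain-text manifest with one BAM path per line, is to load the manifest into an array and index it with the task ID. The function name bam_for_task and the manifest contents are hypothetical.

```shell
#!/bin/bash
# Print the path on line (index + 1) of a manifest file, so that array
# index 0 maps to the first line. Returns non-zero if the index is out of range.
bam_for_task() {
    local manifest=$1 index=$2
    local -a bam_files
    mapfile -t bam_files < "$manifest"   # requires bash 4 or newer
    if (( index < 0 || index >= ${#bam_files[@]} )); then
        echo "Index ${index} is outside 0..$(( ${#bam_files[@]} - 1 ))" >&2
        return 1
    fi
    printf '%s\n' "${bam_files[$index]}"
}

# Demo with a throwaway manifest; outside SLURM the index falls back to 0.
demo_manifest=$(mktemp)
printf '%s\n' /path/to/sample1.bam /path/to/sample2.bam > "$demo_manifest"
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
BAM=$(bam_for_task "$demo_manifest" "$TASK_ID") && echo "Task ${TASK_ID} -> ${BAM}"
rm -f "$demo_manifest"
```

The explicit range check makes a mismatch between the --array range and the number of manifest lines fail loudly instead of passing an empty path to the downstream tool.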
Example: Processing genomic data by chromosome
This example demonstrates setting up an array to process data by chromosome:
#!/bin/bash
#SBATCH --array=0-24 # For chromosomes 1..22, X, Y, M
# Brace expansion yields 1..22; together with X, Y, M this gives 25 entries (indices 0-24)
CHROMOSOMES=({1..22} X Y M)
CHROM=${CHROMOSOMES[$SLURM_ARRAY_TASK_ID]}
INPUT_DIR="/path/to/vcfs"
OUTPUT_DIR="/path/to/output"
VCF_FILE="${INPUT_DIR}/chr${CHROM}_data.vcf.gz"
OUTPUT_VCF="${OUTPUT_DIR}/chr${CHROM}_processed.vcf.gz"
echo "Processing chromosome: ${CHROM}"
echo "Input: ${VCF_FILE}"
# Check if the input file exists
if [[ ! -f "$VCF_FILE" ]]; then
    echo "Input file for chromosome ${CHROM} does not exist: $VCF_FILE"
    echo "Skipping processing for this job."
    exit 0
fi
# Run processing commands
bcftools filter -i 'QUAL>=30 & INFO/DP>=20' -Oz -o "${OUTPUT_VCF}" "${VCF_FILE}"
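The same indexing pattern extends to genomic regions. The sketch below assumes a hypothetical headerless BED file with one region per line and converts the line selected by the task index into a chr:start-end string. BED starts are 0-based, so 1 is added to obtain the 1-based inclusive start that bcftools region strings use. The function name region_for_task and the demo coordinates are illustrative only.

```shell
#!/bin/bash
# Print line (index + 1) of a headerless BED file as a 1-based region string.
region_for_task() {
    local bed=$1 index=$2
    awk -v n="$(( index + 1 ))" \
        'NR == n { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$bed"
}

# Demo with a throwaway BED file; outside SLURM the index falls back to 0.
demo_bed=$(mktemp)
printf 'chr1\t0\t1000\nchr2\t500\t2000\n' > "$demo_bed"
REGION=$(region_for_task "$demo_bed" "${SLURM_ARRAY_TASK_ID:-0}")
echo "Region for this task: ${REGION}"
rm -f "$demo_bed"
```

In a real job the resulting string could be passed to a region-aware command, for example via the -r option of bcftools view on an indexed VCF.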
Conclusion
Combining SLURM sbatch headers with the SLURM_ARRAY_TASK_ID variable parallelizes tasks across a cluster and increases throughput for large-scale genomic analyses.