BAM to fastq
Last update: 20240227
See reference: https://www.metagenomics.wiki/tools/samtools/converting-bam-to-fastq
We have WGS data aligned to an old reference genome which required updated analysis. We must convert BAM to FASTQ so that we can re-align. For subsequent re-alignement we will use BWA. The Samtools method is the most reasonable first approach bacuase it was written by the author of BWA, Heng Li, with modifications by Martin Pollard and Jennifer Liddle, all from the Sanger Institute. The other alternative methods are listed below for comparison.
SAMtools
- sort paired read alignment .bam file (sort by name -n)
samtools sort -n SAMPLE.bam -o SAMPLE_sorted.bam
- save fastq reads in separate R1 and R2 files
samtools fastq -@ 8 SAMPLE_sorted.bam \
-1 SAMPLE_R1.fastq.gz \
-2 SAMPLE_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
http://www.htslib.org/doc/samtools-fasta.html
On our system:
save fastq reads in separate R1 and R2 files
- Example input: 78G BAM file aligned to GRCh37
- Example output: Size R1 and R2 FASTQ
- Time: 1hr 30min
- Expected size: 70 GB per subject (total FASTQ)
- Memory use: 1GB
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 24G
#SBATCH --time 12:00:00
# Result indicates that reads are not paired and we require singletons
bam2fastq_1_40002.err
[M::bam2fq_mainloop] discarded 903778738 singletons
[M::bam2fq_mainloop] processed 909146794 reads
Collate interleaved
echo "samsort collate shuffles and groups reads together by their names."
samtools sort -n ${TEMP_DIR}/${file1} -o ${TEMP_DIR}/${file1}_collate.bam
echo "samsort -n complete"
echo "bam2fastq for interleaved file start"
samtools fastq -@ 8 \
-0 /dev/null \
${TEMP_DIR}/${file1}_collate.bam \
> ${TEMP_DIR}/${file1}_all_reads.fq
Paried files
echo "samsort -n start"
samtools sort -n ${TEMP_DIR}/${file1} -o ${TEMP_DIR}/${file1}_sortn.bam
echo "samsort -n complete"
echo "bam2fastq for paired files start"
samtools fastq -@ 8 ${TEMP_DIR}/${file1}_sortn.bam \
-1 ${TEMP_DIR}/${file1}_R1.fastq.gz \
-2 ${TEMP_DIR}/${file1}_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
- Using bam2fq
samtools bam2fq SAMPLE.bam > SAMPLE.fastq
paired-end reads: ‘/1’ or ‘/2’ is added to the end of read names
http://www.htslib.org/doc/samtools.html
- How to split a single .fastq file of paired-end reads into two separated files?
# extracting reads ending with '/1' or '/2' cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_R1.fastq cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_R2.fastq
Picard
converting a SAMPLE.bam file into paired end SAMPLE_R1.fastq and SAMPLE_R2.fastq files
- F2 to get two files for paired-end reads (R1 and R2)
- -Xmx2g allows a maximum use of 2GB memory for the JVM
java -Xmx2g -jar Picard/SamToFastq.jar I=SAMPLE.bam F=SAMPLE_R1.fastq F2=SAMPLE_R2.fastq
http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq
bam2fastx
http://manpages.ubuntu.com/manpages/quantal/man1/bam2fastx.1.html
Bedtools
bedtools bamtofastq -i input.bam -fq output.fastq
paired-end reads:
samtools sort -n input.bam -o input_sorted.bam # sort reads by identifier-name (-n)
bedtools bamtofastq -i input_sorted.bam -fq output_R3.fastq -fq2 output_R2.fastq
http://bedtools.readthedocs.org/en/latest/content/tools/bamtofastq.html
Bamtools
http://github.com/pezmaster31/bamtools