Skip to main content Link Menu Expand (external link) Document Search Copy Copied

BAM to fastq

Last update: 20240227

See reference: https://www.metagenomics.wiki/tools/samtools/converting-bam-to-fastq

We have WGS data aligned to an old reference genome which required updated analysis. We must convert BAM to FASTQ so that we can re-align. For subsequent re-alignement we will use BWA. The Samtools method is the most reasonable first approach bacuase it was written by the author of BWA, Heng Li, with modifications by Martin Pollard and Jennifer Liddle, all from the Sanger Institute. The other alternative methods are listed below for comparison.

SAMtools

  • sort paired read alignment .bam file (sort by name -n)
samtools sort -n SAMPLE.bam -o SAMPLE_sorted.bam
  • save fastq reads in separate R1 and R2 files
samtools fastq -@ 8 SAMPLE_sorted.bam \
    -1 SAMPLE_R1.fastq.gz \
    -2 SAMPLE_R2.fastq.gz \
    -0 /dev/null -s /dev/null -n

http://www.htslib.org/doc/samtools-fasta.html

On our system:

  • save fastq reads in separate R1 and R2 files
  • Example input: 78G BAM file aligned to GRCh37
  • Example output: Size R1 and R2 FASTQ
  • Time: 1hr 30min
  • Expected size: 70 GB per subject (total FASTQ)
  • Memory use: 1GB
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 24G
#SBATCH --time 12:00:00
# Result indicates that reads are not paired and we require singletons
bam2fastq_1_40002.err
[M::bam2fq_mainloop] discarded 903778738 singletons
[M::bam2fq_mainloop] processed 909146794 reads

Collate interleaved

echo "samsort collate shuffles and groups reads together by their names."
samtools sort -n ${TEMP_DIR}/${file1} -o ${TEMP_DIR}/${file1}_collate.bam
echo "samsort -n complete"

echo "bam2fastq for interleaved file start"
samtools fastq -@ 8 \
-0 /dev/null \
${TEMP_DIR}/${file1}_collate.bam \
> ${TEMP_DIR}/${file1}_all_reads.fq

Paried files

echo "samsort -n start"
samtools sort -n ${TEMP_DIR}/${file1} -o ${TEMP_DIR}/${file1}_sortn.bam
echo "samsort -n complete"

echo "bam2fastq for paired files start"
samtools fastq -@ 8 ${TEMP_DIR}/${file1}_sortn.bam \
-1 ${TEMP_DIR}/${file1}_R1.fastq.gz \
-2 ${TEMP_DIR}/${file1}_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
  • Using bam2fq
    samtools bam2fq SAMPLE.bam > SAMPLE.fastq
    

    paired-end reads: ‘/1’ or ‘/2’ is added to the end of read names

http://www.htslib.org/doc/samtools.html

  • How to split a single .fastq file of paired-end reads into two separated files?
    # extracting reads ending with '/1' or '/2'
    cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_R1.fastq
    cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_R2.fastq
    

Picard

  • converting a SAMPLE.bam file into paired end SAMPLE_R1.fastq and SAMPLE_R2.fastq files

    • F2 to get two files for paired-end reads (R1 and R2)
    • -Xmx2g allows a maximum use of 2GB memory for the JVM
      java -Xmx2g -jar Picard/SamToFastq.jar I=SAMPLE.bam F=SAMPLE_R1.fastq F2=SAMPLE_R2.fastq
      

http://broadinstitute.github.io/picard/command-line-overview.html#SamToFastq

bam2fastx

http://manpages.ubuntu.com/manpages/quantal/man1/bam2fastx.1.html

Bedtools

bedtools bamtofastq -i input.bam -fq output.fastq

paired-end reads:

samtools sort -n input.bam -o input_sorted.bam   # sort reads by identifier-name (-n)
bedtools bamtofastq -i input_sorted.bam -fq output_R3.fastq -fq2 output_R2.fastq

http://bedtools.readthedocs.org/en/latest/content/tools/bamtofastq.html

Bamtools

http://github.com/pezmaster31/bamtools