
Last update: 20240819

Benchmarking pipeline output

To compare the output of GATK and DeepVariant, we first look at a subset of known variants for which we have “ground truth” Sanger sequencing confirmation of which subjects are carriers and of their genotypes.

Script: 17_get_momic_known_subset.sh

This script processes VCF files by reading a list of specific genomic positions known to carry disease-causing variants, using various command-line tools to extract and index the data at those positions.

  • Environment Setup and Loading Modules: The script starts by telling the shell to stop on errors (set -e) and loading the modules needed for handling VCF files: vcftools, bcftools, and vcflib.

  • Defining Known Variant List: A list of known variant positions is sourced and used to restrict the analysis to the genomic locations of interest. Format: no header and two columns, chromosome and position, e.g. chr1 11790681.

  • Copying and Indexing VCF Files: The script copies VCF files from a shared directory and indexes them using bcftools index, allowing efficient querying of the VCF data.

  • Extracting Chromosome Numbers and Looping: It extracts the unique chromosome identifiers from the position list and processes each chromosome separately, so each chromosome's VCFs can be filtered independently rather than loading everything at once.

  • Defining Input Paths for Each Source: Multiple input sources of VCF files are defined: unfiltered GATK calls, VQSR-processed files, and files processed with BCFTOOL_QC and DeepVariant. These represent different variant-calling methods and stages of quality control.

  • Filtering VCF for Known Positions: For each chromosome and each data source, the script uses bcftools view to restrict the VCF to variants at the known positions. The results are written in both VCF and TSV formats.

  • Indexing and Converting VCF to TSV: The filtered VCF files are re-indexed and converted to TSV format using vcf2tsv, facilitating easier data manipulation in subsequent analyses, typically done in environments like R.

  • Logging and Error Handling: Throughout the script, echo statements log each step, which is useful for debugging and for tracking the progress of the execution.
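The steps above can be sketched in shell. The file names, source labels, and directory layout below are illustrative assumptions, not the script's actual variables, and the bcftools/vcf2tsv commands are printed rather than executed so the sketch runs without the real VCF inputs:

```shell
#!/usr/bin/env bash
set -e  # stop on the first error, as in the script

# Illustrative two-column positions file: chromosome and position, no header.
printf 'chr1\t11790681\nchr1\t11794419\nchr2\t47403191\n' > known_positions.txt

# Extract the unique chromosomes and process each one separately.
for chr in $(cut -f1 known_positions.txt | sort -u); do
    # One input VCF per source; these labels are stand-ins for the
    # GATK-unfiltered, VQSR, BCFTOOL_QC, and DeepVariant call sets.
    for src in gatk_unfiltered gatk_vqsr bcftools_qc deepvariant; do
        in_vcf="${src}.${chr}.vcf.gz"
        out_vcf="${src}.${chr}.known.vcf.gz"
        # Printed, not run: restrict to the known positions (-R),
        # re-index the result, and flatten it to TSV with vcflib's vcf2tsv.
        echo "bcftools view -R known_positions.txt -Oz -o ${out_vcf} ${in_vcf}"
        echo "bcftools index ${out_vcf}"
        echo "vcf2tsv ${out_vcf} > ${out_vcf%.vcf.gz}.tsv"
    done
done
```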

Script: 18_momic_known_subset_interpret.R

The R script uses libraries such as dplyr, ggplot2, and patchwork to analyze the filtered data. The main focus of this script is on counting overlaps between datasets and visualizing these overlaps to understand how different variant calling methods compare.

  • Data Loading and Preparation: The script loads data from TSV files, ensuring correct data types and selecting relevant columns. This preparation is crucial for accurate downstream analysis.

  • Combining Data: Data from different sources are combined into a single dataframe, which is then used to identify shared and unique variants.

  • Analyzing Overlaps: The script identifies variants that are shared across different data sources and those that are unique to each source. It summarizes these findings both in text and visually using histograms and Venn diagrams.

  • Visualizing Data: Using ggplot2 and patchwork, the script creates visual summaries of the overlaps and differences between the data sources, providing insights into the consistency and discrepancies of variant calling methods.
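The shared/unique logic can be mimicked outside R with coreutils: build one sorted key per variant (chromosome:position:ref:alt) for each source, then use comm to split shared calls from source-unique ones. The keys below are invented for illustration; the actual script performs this comparison with dplyr on the TSV exports.

```shell
# Toy variant keys for two sources (invented values; chrom:pos:ref:alt).
printf 'chr1:11790681:G:A\nchr1:11794419:C:T\n' | sort > gatk.keys
printf 'chr1:11790681:G:A\nchr2:47403191:A:G\n' | sort > deepvariant.keys

# comm needs sorted input: column 1 = unique to file 1, column 2 = unique
# to file 2, column 3 = shared; suppress the columns we do not want.
comm -12 gatk.keys deepvariant.keys > shared.keys     # called by both
comm -23 gatk.keys deepvariant.keys > gatk_only.keys  # GATK only
comm -13 gatk.keys deepvariant.keys > dv_only.keys    # DeepVariant only

wc -l shared.keys gatk_only.keys dv_only.keys
```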

Summary

The Bash script efficiently prepares and filters genomic data for analysis, focusing specifically on a subset of known disease-related variants across different sources and chromosome segments. The R script then takes this prepared data to perform a comparative analysis, visualizing overlaps and unique findings in an interpretable manner. This two-step approach allows for rigorous analysis of genomic data while focusing on specific areas of interest, such as disease-related genetic variants.