Last update: 20240819
Benchmarking pipeline output
To compare the output of GATK and DeepVariant, we first look at a subset of known variants for which we have “ground truth” Sanger sequencing confirmation of which subjects are carriers and of their genotypes.
Script: 17_get_momic_known_subset.sh
This script processes VCF files by reading a list of specific genomic positions known to carry disease-causing variants, using command-line tools to extract and index the data.
Environment Setup and Loading Modules: The script starts by setting the environment to stop on errors (set -e) and loading the modules needed for handling VCF files: vcftools, bcftools, and vcflib.
Defining Known Variant List: A list of known variant positions is sourced and used to restrict the analysis to the genomic locations of interest. The file has no header and two columns, chromosome and position, e.g. chr1 11790681.
Copying and Indexing VCF Files: The script copies VCF files from a shared directory and indexes them with bcftools index, allowing efficient querying of the VCF data.
Extracting Chromosome Numbers and Looping: Unique chromosome identifiers are extracted from the position list and each chromosome is processed separately, keeping the analysis streamlined and resource-efficient.
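The chromosome-extraction step above can be sketched with standard tools. The file name and example positions here are illustrative, not the script's actual paths; the indexing commands are shown as comments because they need real VCF inputs.

```shell
#!/usr/bin/env bash
set -e  # stop on the first error, as the script does

# Hypothetical positions file: no header, two columns (chromosome, position)
POSITIONS=known_positions.txt
printf 'chr1 11790681\nchr1 11794419\nchr2 47414420\n' > "$POSITIONS"

# Extract the unique chromosome identifiers to loop over
CHROMS=$(awk '{print $1}' "$POSITIONS" | sort -u)

for CHR in $CHROMS; do
    echo "processing $CHR"
    # In the real script, each copied VCF is indexed once, e.g.:
    # bcftools index "${CHR}.vcf.gz"
done
```

Looping per chromosome keeps each bcftools query small and lets the chromosomes be processed independently.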
Defining Input Paths for Each Source: Several input sources of VCF files are defined: unfiltered GATK calls, VQSR-processed files, files processed with BCFTOOL_QC, and DeepVariant calls. These represent different variant callers and stages of quality control.
Filtering VCF for Known Positions: For each chromosome and each data source, the script uses bcftools view to restrict the VCF to variants from the known positions list. The results are written in both VCF and TSV formats.
Indexing and Converting VCF to TSV: The filtered VCF files are re-indexed and converted to TSV format with vcf2tsv, making the data easier to manipulate in subsequent analyses, typically done in R.
Logging and Error Handling: echo statements throughout the script log progress, which helps with debugging and tracking the script's execution.
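The per-chromosome filtering step can be sketched as follows. The positions file, chromosome, and VCF names here are assumptions for illustration; the bcftools and vcf2tsv calls are commented out because they need real indexed VCF inputs.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical positions list (no header; chromosome and position)
printf 'chr1 11790681\nchr1 11794419\nchr2 47414420\n' > known_positions.txt

# Build a tab-separated regions file for one chromosome
CHR=chr1
awk -v chr="$CHR" '$1 == chr {print $1 "\t" $2}' known_positions.txt > "${CHR}.regions.tsv"

# The real script would then, per source (file names are assumptions):
# bcftools view -R "${CHR}.regions.tsv" source.vcf.gz -Oz -o "${CHR}.known.vcf.gz"
# bcftools index "${CHR}.known.vcf.gz"
# vcf2tsv "${CHR}.known.vcf.gz" > "${CHR}.known.tsv"
```

bcftools view -R takes a regions file of chromosome/position pairs and requires the input VCF to be indexed, which is why the indexing step precedes the filtering.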
Script: 18_momic_known_subset_interpret.R
The R script uses the dplyr, ggplot2, and patchwork libraries to analyze the filtered data. Its main focus is counting overlaps between datasets and visualizing them to understand how the different variant calling methods compare.
Data Loading and Preparation: The script loads data from TSV files, ensuring correct data types and selecting relevant columns. This preparation is crucial for accurate downstream analysis.
Combining Data: Data from different sources are combined into a single dataframe, which is then used to identify shared and unique variants.
Analyzing Overlaps: The script identifies variants that are shared across different data sources and those that are unique to each source. It summarizes these findings both in text and visually using histograms and Venn diagrams.
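The overlap logic itself is simple set arithmetic. As a minimal sketch (in shell rather than R, with made-up CHROM:POS:REF:ALT keys standing in for the real TSV contents), shared and source-unique variants can be counted from sorted key lists:

```shell
#!/usr/bin/env bash
set -e

# Hypothetical, pre-sorted variant keys for two sources
printf 'chr1:11790681:G:A\nchr2:47414420:C:T\n' > gatk.keys
printf 'chr1:11790681:G:A\nchr3:30686414:T:C\n' > deepvariant.keys

# comm needs sorted input: -12 = shared, -23 = first-only, -13 = second-only
shared=$(comm -12 gatk.keys deepvariant.keys | wc -l | tr -d ' ')
gatk_only=$(comm -23 gatk.keys deepvariant.keys | wc -l | tr -d ' ')
dv_only=$(comm -13 gatk.keys deepvariant.keys | wc -l | tr -d ' ')
echo "shared=$shared gatk_only=$gatk_only deepvariant_only=$dv_only"
```

The R script does the equivalent with dplyr joins across all sources before plotting the counts.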
Visualizing Data: Using ggplot2 and patchwork, the script creates visual summaries of the overlaps and differences between the data sources, providing insight into the consistency and discrepancies of the variant calling methods.
Summary
The Bash script efficiently prepares and filters genomic data for analysis, focusing specifically on a subset of known disease-related variants across different sources and chromosome segments. The R script then takes this prepared data to perform a comparative analysis, visualizing overlaps and unique findings in an interpretable manner. This two-step approach allows for rigorous analysis of genomic data while focusing on specific areas of interest, such as disease-related genetic variants.