
Last update: 20241217

Design DNA SNV INDEL v1

Germline short variant discovery (SNVs + Indels) and interpretation



Protocol name: design_dna_snvindel_v1

Aims

Phase | Aim | Status | Task
Phase 2 | (1) | Complete | Process all WGS from the study cohort (https://www.swisspedhealth.ch) to a consensus format.
Phase 2 | (2) | 2 of 2 complete | Prepare qualifying variant (QV) sets for each downstream aim.
Phase 2 | (3) | v1 complete, v2 in progress | Clinical genetics report per individual (i.e., baseline benchmark of known disease-causing variants).
Phase 2 | (4) | 2 experiments complete | GWAS: Statistical genomics to find new cohort-level associations with disease.
Phase 2 | (5) | 2 experiments complete | Gene-VSAT: Statistical genomics to find new cohort-level associations with disease.
Phase 2 | (6) | 1 experiment complete | Proteome-VSAT: Statistical genomics to find new cohort-level associations with disease (Proteome-VSAT).
Phase 2 | (7) | 1 experiment complete | ACAT: Statistical multiomics to find new cohort-level associations with disease.
Phase 2 | (8) | In progress | New methods (ML/DL, causal inference) for individual and cohort-level discovery.

Figure 1: Summary of design DNA SNV INDEL v1 pipeline plan.

Introduction

This protocol processes DNA WGS data in FASTQ format into qualifying variants (QV) based on consensus variables and thresholds (Figure 1). The QV can then be used in multiple applications, such as ML/DL, to find disease-related variants or gene functions. Additionally, the clinical genetics protocol applies further standardised filtering criteria to reach a single genetic determinant in a clinical genetics report for each subject.

The design name Design DNA SNV INDEL v1 indicates that this protocol is tailored to single nucleotide variants (SNVs) and short insertions/deletions (INDELs) (e.g. the GATK pipeline). We implement the Genome Analysis Toolkit (GATK) Best Practices workflow for germline short variant discovery (open source licence here). This GATK workflow is designed to operate on a set of samples constituting a study cohort; specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.

Single-variant and genomics-only analyses will be followed up to confirm whether causal effects are identified in the RNA and protein layers. Joint multiomic analysis will include all layers in a single statistical model.
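To illustrate what filtering on "consensus variables and thresholds" could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Variant` fields, the `QV_THRESHOLDS` values, and the function names are hypothetical and are not the variables or cut-offs used by this protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A simplified variant record (hypothetical fields, for illustration)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int               # read depth at the site
    genotype_quality: int    # genotype quality score
    pop_allele_freq: float   # allele frequency in a reference population


# Hypothetical thresholds; NOT the values used by design_dna_snvindel_v1.
QV_THRESHOLDS = {"min_depth": 10, "min_gq": 20, "max_af": 0.01}


def is_qualifying(variant, thresholds=QV_THRESHOLDS):
    """Return True if the variant passes every threshold."""
    return (
        variant.depth >= thresholds["min_depth"]
        and variant.genotype_quality >= thresholds["min_gq"]
        and variant.pop_allele_freq <= thresholds["max_af"]
    )


def qualifying_variants(variants, thresholds=QV_THRESHOLDS):
    """Keep only the variants that pass all thresholds, preserving order."""
    return [v for v in variants if is_qualifying(v, thresholds)]
```

Different QV sets could then be expressed as different threshold dictionaries passed to the same function, which keeps the filtering logic identical across downstream aims.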

Protocol summary

  1. Process all raw WGS into an analysis-ready format: genomic VCF (gVCF).
    • The first goal is to process all raw whole genome sequencing (WGS) data into an analysis-ready format: a joint cohort Variant Call Format (VCF) file, produced using the emit-ref-confidence (ERC) gVCF mode. This mode uses a reference model to emit records with condensed non-variant blocks, adhering to the gVCF format. The gVCF is split per chromosome.
  2. The joint cohort, chromosome-level gVCFs are filtered into qualifying variants (QV).
  3. The QV sets are used individually or mixed to produce the main analysis results:
    • QV set 1 for clinical genetics (known disease-causing) report for each individual
    • QV set 2 for statistical genomics (new associations with established methods) for cohort level discovery
    • QV set 1 or 2 for new methods (ML/DL, causal inference) for individual and cohort level discovery
  4. Release data.
  5. If not already included in an analysis model, the candidate causal variants will be followed up to confirm if causal effects are identified in RNA and protein layers.
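
As a rough illustration of step 1, the sketch below builds per-chromosome GATK HaplotypeCaller command lines in ERC gVCF mode. It only constructs the argument lists and does not run GATK. The file names, the output naming scheme, the `chr`-prefixed contig names, and the choice to restrict each call with `-L` are assumptions for illustration; the protocol's own scripts may split the gVCF per chromosome differently.

```python
from pathlib import Path

# Assumption: "chr"-prefixed contig names; the reference build is not stated here.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def haplotypecaller_cmd(bam, reference, out_dir, chrom):
    """Build one HaplotypeCaller command that emits a gVCF for one chromosome."""
    sample = Path(bam).stem  # hypothetical naming: sample ID taken from the BAM file name
    out_path = Path(out_dir) / f"{sample}.{chrom}.g.vcf.gz"
    return [
        "gatk", "HaplotypeCaller",
        "-R", str(reference),   # reference FASTA
        "-I", str(bam),         # pre-processed, analysis-ready BAM
        "-L", chrom,            # restrict calling to one chromosome
        "-ERC", "GVCF",         # emit reference confidence in gVCF mode
        "-O", str(out_path),
    ]


def per_chromosome_cmds(bam, reference, out_dir, chromosomes=CHROMOSOMES):
    """Build one command per chromosome for a single sample."""
    return [haplotypecaller_cmd(bam, reference, out_dir, c) for c in chromosomes]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` or written into a job script, so the same builder serves both local runs and a scheduler.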

Protocol steps

The major processing steps, in sequential order, are shown in Figure 2.

Figure 2: Extended methods of figure 1 DNA germline short variant discovery pipeline plan.

Metrics

Study book data:

  1. CollectWgsMetrics: 03b_collectwgsmetrics.sh -> study_book/qc_summary_stats (mapping, depth, and more). See metrics_collectwgsmetrics.
  2. bcftools stats and plot-vcfstats: 07c_qc_summary_stats.sh -> study_book/qc_summary_stats (gVCF summary after HaplotypeCaller, HC). See metrics_bcftoolsstats.
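
For readers who want to pull the headline numbers out of a `bcftools stats` report, the following is a minimal sketch. It relies on the tab-separated summary-number lines that `bcftools stats` writes (`SN`, an id, a key ending in a colon, and a value). The function name and the sample text in the usage note are hypothetical, and this is not the parsing used by 07c_qc_summary_stats.sh.

```python
def parse_bcftools_sn(text):
    """Collect the SN (summary numbers) section of a bcftools stats report.

    Returns a dict mapping each key (without its trailing colon) to an int.
    Lines that are not SN records, including '#' comments, are skipped.
    """
    summary = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4 or fields[0] != "SN":
            continue
        key = fields[2].rstrip(":")
        try:
            summary[key] = int(fields[3])
        except ValueError:
            # Assumption: SN values are integers; skip anything else.
            continue
    return summary
```

For example, given a report containing the line `SN`, `0`, `number of SNPs:`, `1200` (tab-separated), `parse_bcftools_sn(report)["number of SNPs"]` would give the integer 1200.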

Data release

The private internal data release: design_dna_snvindel_v1_release