Data concepts

Last update: 20230619

How data is collected and stored

Database concepts

The DCC is responsible for making research data suitable for database management. To do this, they design the concepts. Training material is available here, but this remains a DCC responsibility.

Several concepts for the types of data that we use already exist, e.g. genetic data concepts.

We also assist in generating new omics-related concepts for future expansion. This page will be updated to summarise the progress as these new concepts are defined.

Availability

The following information is copied from https://sphn.ch/2023/03/20/sphn-dataset-rdf-schema-2023-release/

Specific genetic examples

Potential concepts





Examples

Genomics England (GE)

GE represents one of the best national platforms so far, and they have completed analysis of ~200’000 genomes for clinical use.

GE model documentation shows the concepts in use: http://gelreportmodels.genomicsengland.co.uk/models.html#. Concept development could start with the GE 1.3.0-SNAPSHOT. For example, VariantMetadata is a good starting place since it includes “individual” (i.e. subject), “sample type”, “experiment”, etc., which are required in most genomic data scenarios. We could work through a prioritised list of concepts over several months to years for any new concept that matches user requirements. They have derived some logical structures from Ontobee, OpenCB, and probably other common sources.

Example: metadata

Here are some examples which most users will require, from VariantMetadata:

|-- VariantMetadata
   |-- Cohort
      |-- Experiment
         |-- center, date, molecule, technique, library, libraryLayout, platform, description
   |-- Individual
      |-- id, family, father, mother, sex, phenotype, samples
   |-- Program
      |-- ...
   |-- Sample
      |-- ...
   |-- SampleSetType
      |-- CASE_CONTROL, CASE_SET, CONTROL_SET, PAIRED, TIME_SERIES, FAMILY, TRIO, MISCELLANEOUS, UNKNOWN
   |-- Species
      |-- ...
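To make the hierarchy above more concrete, here is a minimal sketch of how part of it could be expressed as typed records. It is not the GE or OpenCB implementation: the field types, the snake_case attribute names, the `sample_set_type` attribute and the example values are all assumptions made for illustration, and the Program, Sample and Species branches are omitted because their fields are not listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SampleSetType(Enum):
    """Symbols copied from the SampleSetType branch of the tree."""
    CASE_CONTROL = "CASE_CONTROL"
    CASE_SET = "CASE_SET"
    CONTROL_SET = "CONTROL_SET"
    PAIRED = "PAIRED"
    TIME_SERIES = "TIME_SERIES"
    FAMILY = "FAMILY"
    TRIO = "TRIO"
    MISCELLANEOUS = "MISCELLANEOUS"
    UNKNOWN = "UNKNOWN"


@dataclass
class Experiment:
    # All fields are optional strings here; the real types are an assumption.
    center: Optional[str] = None
    date: Optional[str] = None
    molecule: Optional[str] = None
    technique: Optional[str] = None
    library: Optional[str] = None
    library_layout: Optional[str] = None  # "libraryLayout" in the tree
    platform: Optional[str] = None
    description: Optional[str] = None


@dataclass
class Individual:
    id: str
    family: Optional[str] = None
    father: Optional[str] = None  # id of the father, if known
    mother: Optional[str] = None  # id of the mother, if known
    sex: Optional[str] = None
    phenotype: Optional[str] = None
    samples: List[str] = field(default_factory=list)


@dataclass
class Cohort:
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class VariantMetadata:
    cohorts: List[Cohort] = field(default_factory=list)
    individuals: List[Individual] = field(default_factory=list)
    sample_set_type: SampleSetType = SampleSetType.UNKNOWN


# Illustrative trio; every identifier and value below is made up.
trio = VariantMetadata(
    cohorts=[Cohort(experiments=[Experiment(technique="WGS")])],
    individuals=[
        Individual(id="child", family="fam1", father="dad", mother="mum"),
        Individual(id="dad", family="fam1"),
        Individual(id="mum", family="fam1"),
    ],
    sample_set_type=SampleSetType.TRIO,
)
```

Writing the concept as explicit records like this makes it easy to see which fields a new concept would need to define before it can be mapped to a database schema.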

Example: variant

Some of the most important concepts for our genomics needs are listed on the variant protocol page, which come from org.opencb.biodata.models.variant.avro. These are beyond our current needs, but this list of 55 entries likely covers most conceivable needs:

AdditionalAttribute, AlleleOrigin Enum, AllelesCode Enum, AlternateCoordinate, ClinVar, ClinicalSignificance Enum, Confidence Enum, ConsequenceType, ConsistencyStatus Enum, Cosmic, Cytoband, Drug, DrugResponseClassification Enum, EthnicCategory Enum, EvidenceEntry, EvidenceImpact Enum, EvidenceSource, EvidenceSubmission, ExonOverlap, Expression, ExpressionCall Enum, FeatureTypes Enum, FileEntry, GeneDrugInteraction, GeneTraitAssociation, GenomicFeature, Genotype, Gwas, HeritableTrait, ModeOfInheritance Enum, Penetrance Enum, PopulationFrequency, Property, ProteinFeature, ProteinVariantAnnotation, Repeat, Score, SequenceOntologyTerm, SomaticInformation, StructuralVariantType Enum, StructuralVariation, StudyEntry, TraitAssociation Enum, TumorigenesisClassification Enum, VariantAnnotation, VariantAvro, VariantClassification, VariantFunctionalEffect Enum, VariantHardyWeinbergStats, VariantStats, VariantTraitAssociation, VariantType Enum, Xref.

Let’s pick one example from that list which would come under the heading “variant interpretation” - ClinicalSignificance:

Example: variant interpretation - ClinicalSignificance

We use this exact example for our clinical genetics work: ClinicalSignificance.

Mendelian variants classification with ACMG terminology as defined in Richards, S. et al. (2015). Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine, 17(5), 405–423. https://doi.org/10.1038/gim.2015.30.

Classification for variants associated with disease, etc., based on the ACMG recommendations and ClinVar classification (https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/).

  • benign_variant : Benign variants interpreted for Mendelian disorders
  • likely_benign_variant : Likely benign variants interpreted for Mendelian disorders with a certainty of at least 90%
  • pathogenic_variant : Pathogenic variants interpreted for Mendelian disorders
  • likely_pathogenic_variant : Likely pathogenic variants interpreted for Mendelian disorders with a certainty of at least 90%
  • uncertain_significance : Uncertain significance variants interpreted for Mendelian disorders. Variants with conflicting evidence should be classified as uncertain_significance
  • Enum symbols:
    • benign, likely_benign, VUS, likely_pathogenic, pathogenic, uncertain_significance
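As a rough illustration, the enum symbols listed above could be handled in code along these lines. This is only a sketch: the helper names `parse_significance` and `is_uncertain` are hypothetical, and treating VUS as equivalent to uncertain_significance is our assumption, not something stated in the model documentation.

```python
from enum import Enum


class ClinicalSignificance(Enum):
    """The six enum symbols listed for ClinicalSignificance."""
    benign = "benign"
    likely_benign = "likely_benign"
    VUS = "VUS"
    likely_pathogenic = "likely_pathogenic"
    pathogenic = "pathogenic"
    uncertain_significance = "uncertain_significance"


def parse_significance(label: str) -> ClinicalSignificance:
    """Map a free-text label (hypothetical helper) to an enum symbol.

    Spaces and hyphens are treated as underscores and case is ignored.
    Raises ValueError when the label matches no symbol.
    """
    key = label.strip().replace(" ", "_").replace("-", "_").lower()
    for member in ClinicalSignificance:
        if member.name.lower() == key:
            return member
    raise ValueError(f"Unknown clinical significance: {label!r}")


def is_uncertain(significance: ClinicalSignificance) -> bool:
    """Assumption: VUS and uncertain_significance mean the same thing."""
    return significance in (
        ClinicalSignificance.VUS,
        ClinicalSignificance.uncertain_significance,
    )
```

For example, `parse_significance("Likely pathogenic")` resolves to `ClinicalSignificance.likely_pathogenic`, while an unrecognised label raises an error instead of being stored silently.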

However, we do not restrict our use of the ACMG standard for variant interpretation to ClinicalSignificance alone, since a large number of other variant interpretation datasets can be used to make the final determination. One method we use is the ACMG scoring system, which scores all variants based on the ACMG evidence categorisation method. Therefore, ACMG_score could be derived from ClinicalSignificance, but also from other sources. Perhaps EvidenceEntry is a major entry which includes most of these subtypes for interpretation evidence.
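The idea of deriving an ACMG_score from several evidence sources can be sketched as follows. This is not our production scoring method. The point values and class thresholds are assumptions based on the commonly used point-based adaptation of the ACMG/AMP guidelines, and the `Evidence` record, its `source` field and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumed points per evidence strength (point-based ACMG/AMP adaptation).
POINTS = {"supporting": 1, "moderate": 2, "strong": 4, "very_strong": 8}


@dataclass(frozen=True)
class Evidence:
    code: str         # ACMG criterion code, e.g. "PS1" or "BP4"
    strength: str     # one of the keys in POINTS
    pathogenic: bool  # True for pathogenic evidence, False for benign
    source: str       # where the evidence came from (hypothetical field)


def acmg_score(evidence: Iterable[Evidence]) -> int:
    """Sum points: pathogenic evidence adds, benign evidence subtracts."""
    return sum(
        POINTS[item.strength] * (1 if item.pathogenic else -1)
        for item in evidence
    )


def classify(score: int) -> str:
    """Map a score to a class name, using assumed thresholds."""
    if score >= 10:
        return "pathogenic"
    if score >= 6:
        return "likely_pathogenic"
    if score >= 0:
        return "uncertain_significance"
    if score >= -6:
        return "likely_benign"
    return "benign"


# Illustrative evidence drawn from two different (made-up) sources.
evidence = [
    Evidence("PS1", "strong", True, source="ClinicalSignificance"),
    Evidence("PM2", "moderate", True, source="PopulationFrequency"),
]
score = acmg_score(evidence)
label = classify(score)
```

Keeping the `source` next to each evidence item is the point of the sketch: the final score does not depend on ClinicalSignificance alone, and each contribution stays traceable to the dataset it came from.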

OpenCB

Ontobee

|-- sequence_attribute
   |-- variant_quality
   |-- variant_origin
      |-- maternal_variant
      |-- paternal_variant
      |-- somatic_variant
      |-- pedigree_specific_variant
      |-- population_specific_variant
      |-- de_novo_variant
      |-- germline_variant
         |-- RDF source code
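A hierarchy like the one above can be held as a nested mapping, which makes it simple to look up where a term sits. The sketch below is an assumption about how we might represent it locally; it does not query Ontobee, and the `path_to` helper is hypothetical.

```python
from typing import Dict, List, Optional, Sequence

# Terms copied from the tree above; an empty dict marks a leaf.
SEQUENCE_ATTRIBUTE: Dict[str, dict] = {
    "sequence_attribute": {
        "variant_quality": {},
        "variant_origin": {
            "maternal_variant": {},
            "paternal_variant": {},
            "somatic_variant": {},
            "pedigree_specific_variant": {},
            "population_specific_variant": {},
            "de_novo_variant": {},
            "germline_variant": {},
        },
    }
}


def path_to(
    term: str, tree: Dict[str, dict], prefix: Sequence[str] = ()
) -> Optional[List[str]]:
    """Return the list of terms from the root down to `term`, or None."""
    for name, children in tree.items():
        current = [*prefix, name]
        if name == term:
            return current
        found = path_to(term, children, current)
        if found is not None:
            return found
    return None


lineage = path_to("de_novo_variant", SEQUENCE_ATTRIBUTE)
```

Here `lineage` holds the chain of parent terms for de_novo_variant, which is the kind of information needed when mapping a local field to an ontology term.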