
Last update: 20240920

Variant features to RDF concept metadata

The process starts with sequencing_assay, which includes library_preparation, sequencing_run, and related concepts. We then continue through the pipeline until we reach the final end-point required to report a pathogenic variant.

This documentation outlines how variant information from whole genome sequencing (WGS) data is transformed into a format that adheres to RDF-structured data concepts. The aim is to ensure that the omic output from genomic analyses can be integrated into clinical data warehouses with high fidelity and clarity.

Number of variables:

  • All SPHN RDF concept info (see SPHN_dataset_release_2024_2_20240502.xlsx) = 1503
  • Subset of relevant concepts = 76 (see example_subset_concepts.tsv)
  • Relevant WGS pipeline logs = 62 (see example_report.tsv)
  • Currently automated match = 13 (see example_report_concepts.Rds)

    This repository uses a public dataset of example genetic variants and sequencing/analysis log data.

Overview

The process begins with the extraction of variant data from a genomic study (no sensitive data is included in the public example set). Key variant features such as Chromosome (CHROM), Position (POS), Reference Allele (REF), and Alternate Allele (ALT) are formatted alongside metadata that describes their relationship to RDF concepts. This ensures downstream users can map these data accurately within clinical and research frameworks.

This document will be updated as we improve the linking of result terms to SPHN_dataset_release_2024_2_20240502.xlsx, which is critical for downstream users to map data correctly.

Aims

  1. Data preparation: Start with the extracted variant information from the genomic pipeline.
  2. Key term identification: Focus on essential genomic terms such as CHROM, POS, REF, and ALT, as well as sequencing run and sequencing instrument.
  3. Metadata addition: Attach metadata columns that specify RDF concept requirements such as type and cardinality.
  4. Validation checklist:
    • Are all necessary variant descriptors present?
    • Are all metadata explanations included and accurate?
    • Is the metadata aligned with SPHN omic concepts?
    • Downstream users (mapping) can choose from TSV, HTML, JSON, and Rds. Are any other formats needed?
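
As a minimal sketch of how the first checklist items could be automated, the Python below checks that the key variant descriptors are present and tests an observation count against a cardinality string such as 0:1 or 1:n. The function names and the example record are hypothetical, and the way the pipeline itself computes the cardinalityViolated column is not described here, so this is an illustration under those assumptions only.

```python
from typing import List, Optional, Tuple

# Key variant descriptors named in this document.
REQUIRED_DESCRIPTORS = ("CHROM", "POS", "REF", "ALT")


def parse_cardinality(card: str) -> Tuple[int, Optional[int]]:
    """Split a cardinality string such as '0:1' or '1:n' into (min, max).

    A max of None means unbounded ('n').
    """
    lower, upper = card.split(":")
    return int(lower), (None if upper == "n" else int(upper))


def cardinality_violated(card: str, n_observations: int) -> bool:
    """Return True when the observation count falls outside the cardinality."""
    lower, upper = parse_cardinality(card)
    if n_observations < lower:
        return True
    return upper is not None and n_observations > upper


def missing_descriptors(record: dict) -> List[str]:
    """List required descriptors that are absent or empty in a variant record."""
    return [k for k in REQUIRED_DESCRIPTORS if record.get(k) in (None, "", "NA")]


# Hypothetical example record (not taken from the public example set).
example = {"CHROM": "1", "POS": "12345", "REF": "A", "ALT": ""}
print(missing_descriptors(example))
print(cardinality_violated("1:1", 0))
```

Treating "NA" and empty strings as missing is a design choice for this sketch; a real check may need to distinguish the two.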

Current version

The observation column is highlighted in GREEN. It contains the data that we report as output from the pipeline for use in our database. Here are the completed concept observations (from the file example_report_concepts.html):

cardinalityViolated concept_reference_general_concept_name concept_reference general_concept_name observation release unique_ID IRI active_status_(yes/no) deprecated_in replaced_by concept_or_concept_compositions_or_inherited general_description contextualized_concept_name contextualized_description parent type excluded_type_descendants standard value_set_or_subset meaning_binding additional_information cardinality_for_composedOf cardinality_for_concept_to_Administrative_Case cardinality_for_concept_to_Data_Provider cardinality_for_concept_to_Subject_Pseudo_Identifier cardinality_for_concept_to_Source_System sensitive_(yes/no) color_inherited color_reference color_observation color_cardinality
FALSE sequencing_assay_sequencing_assay sequencing_assay sequencing_assay NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#sequencingassay yes NA NA concept an_assay_that_exploits_a_sequencer_as_the_instrument_to_generate_results sequencing_assay an_assay_that_exploits_a_sequencer_as_the_instrument_to_generate_results assay assay NA NA NA efo:0003740_|assay_by_sequencer| NA NA 0:n 1:1 0:n 1:n NA #7CCAFF #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_standard_operating_procedure sequencing_assay standard_operating_procedure wgs_with_illumina_novaseq_6000 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstandardoperatingprocedure yes NA NA inherited standard_operating_procedure_associated_to_the_concept standard_operating_procedure standard_operating_procedure_that_was_followed_for_this_sequencing_assay sphnattributeobject standard_operating_procedure NA NA NA NA NA 0:1 NA NA NA NA NA #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_predecessor sequencing_assay predecessor kispi_custom_sample_prep_v1 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haspredecessor yes NA NA inherited process_preceding_this_concept predecessor sample_processing_preceding_the_sequencing_assay sphnattributeobject sample_processing NA NA NA NA NA 0:n NA NA NA NA NA #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_code sequencing_assay code efo_0022396 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hascode yes NA NA inherited coded_information_specifying_the_concept code code_specifying_the_type_of_sequencing_assay sphnattributeobject code NA efo;_obi_or_other for_efo:_descendant_of:_efo:0001455_|assay|;_for_obi:_descendant_of:_obi:0000070_|assay| NA NA 1:1 NA NA NA NA NA #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_identifier sequencing_assay identifier obo:obi_002117_(wgs) 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasidentifier yes NA NA inherited unique_identifier_identifying_the_concept identifier unique_identifier_identifying_the_sequencing_assay sphnattributedatatype string NA NA NA NA NA 0:1 NA NA NA NA NA #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_start_datetime sequencing_assay start_datetime jul 01 2023 01:01:01 gmt / v0.9.0 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstartdatetime yes NA NA inherited datetime_at_which_the_concept_started start_datetime datetime_at_which_the_sequencing_assay_was_first_executed hasdatetime temporal NA NA NA NA NA 0:1 NA NA NA NA yes #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_data_file sequencing_assay data_file out.fastq 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasdatafile yes NA NA inherited data_file_associated_to_the_concept data_file data_file_associated_to_the_sequencing_assay sphnattributeobject data_file time_series_data_file NA NA NA NA 0:n NA NA NA NA NA #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_sample sequencing_assay sample blood_sample_1 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hassample yes NA NA inherited sample_associated_to_the_concept sample material_that_is_being_sequenced_by_this_sequencing_assay sphnattributeobject sample tumor_specimen;_isolate NA NA NA NA 0:n NA NA NA NA NA #abb1cf #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_library_preparation sequencing_assay library_preparation illumina_truseq_dna_pcr-free 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haslibrarypreparation yes NA NA composedof library_preparation_associated_to_the_concept library_preparation the_library_preparation_that_is_part_of_the_sequencing_assay sphnattributeobject library_preparation NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_sequencing_instrument sequencing_assay sequencing_instrument a00485 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hassequencinginstrument yes NA NA composedof device_associated_to_the_concept sequencing_instrument the_device_which_is_used_to_perform_the_sequencing_assay sphnattributeobject sequencing_instrument NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_sequencing_run sequencing_assay sequencing_run 334 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hassequencingrun yes NA NA composedof sequencing_run_associated_to_the_concept sequencing_run sequencing_run_performed_as_part_of_the_sequencing_assay sphnattributeobject sequencing_run NA NA NA NA NA 0:n NA NA NA NA NA #92a8d1 #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_intended_read_length sequencing_assay intended_read_length 150 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasintendedreadlength yes NA NA composedof intended_read_length_associated_to_the_concept intended_read_length the_number_of_nucleotides_intended_to_be_ordered_from_each_side_of_a_nucleic_acid_fragment_obtained_after_the_completion_of_a_sequencing_assay hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7cac9 #8ed3a0 #8ed3a0
FALSE sequencing_assay_intended_read_depth sequencing_assay intended_read_depth 30x 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasintendedreaddepth yes NA NA composedof intended_read_depth_associated_to_the_concept intended_read_depth the_number_of_times_a_particular_locus_(site,_nucleotide,_amplicon,_region)_was_intended_to_be_sequenced_as_part_of_the_sequencing_assay hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7cac9 #8ed3a0 #8ed3a0
FALSE library_preparation_library_preparation library_preparation library_preparation NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#librarypreparation yes NA NA concept process_which_results_in_the_creation_of_a_library_from_fragments_of_dna library_preparation process_which_results_in_the_creation_of_a_library_from_fragments_of_dna sampleprocessing sample_processing NA NA NA obi:0000711_|library_preparation| NA NA 0:n 1:1 0:n 1:n NA #7CCAFF #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_code library_preparation code NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hascode yes NA NA inherited coded_information_specifying_the_concept code code_specifying_the_type_of_library_preparation sphnattributeobject code NA obi;_efo_or_other for_obi:_descendant_of:_obi:0000711_|library_preparation| NA NA 0:1 NA NA NA NA NA #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_input library_preparation input NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasinput yes NA NA inherited input_associated_to_the_concept input the_sample_for_which_a_library_is_created sphnattributeobject sample NA NA NA NA NA 0:n NA NA NA NA NA #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_output library_preparation output NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasoutput yes NA NA inherited output_associated_to_the_concept output the_ngs_library_that_is_produced sphnattributeobject sample tumor_specimen NA NA NA NA 0:1 NA NA NA NA NA #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_start_datetime library_preparation start_datetime NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstartdatetime yes NA NA inherited datetime_at_which_the_concept_started start_datetime start_of_library_preparation hasdatetime temporal NA NA NA NA NA 0:1 NA NA NA NA yes #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_quality_control_metric library_preparation quality_control_metric NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasqualitycontrolmetric yes NA NA inherited quality_control_metric_associated_to_the_concept quality_control_metric quality_control_metric_related_to_the_output_of_the_library_preparation sphnattributeobject quality_control_metric NA NA NA NA NA 0:n NA NA NA NA NA #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_predecessor library_preparation predecessor NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haspredecessor yes NA NA inherited process_preceding_this_concept predecessor process_preceding_this_library_preparation sphnattributeobject sample_processing NA NA NA NA NA 0:n NA NA NA NA NA #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_standard_operating_procedure library_preparation standard_operating_procedure NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstandardoperatingprocedure yes NA NA inherited standard_operating_procedure_associated_to_the_concept standard_operating_procedure standard_operating_procedure_that_was_followed_for_this_library_preparation sphnattributeobject standard_operating_procedure NA NA NA NA NA 0:1 NA NA NA NA NA #abb1cf #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_kit_code library_preparation kit_code NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haskitcode yes NA NA composedof coded_information_specifying_the_kit_associated_to_the_concept library_preparation_kit_code pre-filled,_ready-to-use_reagent_cartridges_intended_to_improve_chemistry,_cluster_density_and_read_length_as_well_as_improve_quality_(q)_scores_for_this_sample._reagent_components_are_encoded_to_interact_with_the_sequencing_system_to_validate_compatibility_with_user-defined_applications. hascode code NA efo,_genepio,_fairgenomes_or_other NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_target_enrichment_kit_code library_preparation target_enrichment_kit_code NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hastargetenrichmentkitcode yes NA NA composedof coded_information_specifying_the_target_enrichment_kit_associated_to_the_concept target_enrichment_kit_code indicates_which_target_enrichment_kit_was_used_to_prepare_this_sample._target_enrichment_is_a_pre-sequencing_dna_preparation_step_where_dna_sequences_are_either_directly_amplified_(amplicon_or_multiplex_pcr-based)_or_captured_(hybrid_capture-based)_in_order_to_only_focus_on_specific_regions_of_a_genome_or_dna_sample. hascode code NA efo,_genepio,_fairgenomes_or_other NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_intended_insert_size library_preparation intended_insert_size NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasintendedinsertsize yes NA NA composedof intended_insert_size_associated_to_the_concept intended_insert_size in_paired-end_sequencing,_the_dna_between_the_adapter_sequences_is_the_insert._the_length_of_this_sequence_is_known_as_the_insert_size,_not_to_be_confused_with_the_inner_distance_between_reads._so,_fragment_length_equals_read_adapter_length_(2x)_plus_insert_size,_and_insert_size_equals_read_length_(2x)_plus_inner_distance. hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7f6c9 #8ed3a0 #8ed3a0
FALSE library_preparation_gene_panel library_preparation gene_panel NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasgenepanel yes NA NA composedof gene_panel_associated_to_the_concept gene_panel collection_of_genes_that_are_the_focus_of_sequencing sphnattributeobject gene_panel NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7f6c9 #8ed3a0 #8ed3a0
FALSE sequencing_instrument_sequencing_instrument sequencing_instrument sequencing_instrument a00485 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#sequencinginstrument yes NA NA concept a_sequencing_instrument_that_is_used_in_a_sequencing_assay sequencing_instrument a_sequencing_instrument_that_is_used_in_a_sequencing_assay sphnconcept NA NA NA NA efo:0000548_|instrument| NA NA NA 1:1 NA NA NA #7CCAFF #f7e7c9 #8ed3a0 #8ed3a0
FALSE sequencing_instrument_code sequencing_instrument code a00485 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hascode yes NA NA composedof coded_information_specifying_the_concept code code_specifying_the_type_of_sequencing_instrument sphnattributeobject code NA obi;_efo_or_other for_obi:_descendant_of:_obi:0400103_|dna_sequencer|;_for_efo:_descendant_of:_efo:0003739_|sequencer| NA NA 1:1 NA NA NA NA NA #92a8d1 #f7e7c9 #8ed3a0 #8ed3a0
FALSE sequencing_run_sequencing_run sequencing_run sequencing_run NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#sequencingrun yes NA NA concept the_valid_and_completed_operation_of_a_high-throughput_sequencing_instrument_associated_with_a_sequencing_assay sequencing_run the_valid_and_completed_operation_of_a_high-throughput_sequencing_instrument_associated_with_a_sequencing_assay sphnconcept NA NA NA NA ncit:c148088_|sequencing_run| NA NA NA 1:1 NA NA NA #7CCAFF #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_identifier sequencing_run identifier NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasidentifier yes NA NA composedof unique_identifier_identifying_the_concept identifier unique_identifier_identifying_the_sequencing_run sphnattributedatatype string NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_datetime sequencing_run datetime NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasdatetime yes NA NA composedof datetime_of_the_concept datetime datetime_the_sequencing_run_was_performed sphnattributedatatype temporal NA NA NA NA NA 0:1 NA NA NA NA yes #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_read_count sequencing_run read_count NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasreadcount yes NA NA composedof ready_count_associated_with_to_concept read_count the_number_of_sequencing_reaction_results_that_were_pooled_to_assemble_a_sequence_for_a_genomic_region_of_interest_in_a_sequencing_run hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_average_insert_size sequencing_run average_insert_size NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasaverageinsertsize yes NA NA composedof average_insert_size_associated_to_the_concept average_insert_size the_average_insert_size_found_during_the_nucleic_acid_sequencing_run hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_average_read_length sequencing_run average_read_length NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasaveragereadlength yes NA NA composedof average_read_length_associated_to_the_concept average_read_length the_average_length_for_nucleic_acid_sequencing_reads_generated_in_a_sequencing_run hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_mean_read_depth sequencing_run mean_read_depth NA 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasmeanreaddepth yes NA NA composedof mean_read_depth_associated_to_the_concept mean_read_depth the_number_of_times_a_particular_locus_(site,_nucleotide,_amplicon,_region)_was_sequenced_in_a_sequencing_run hasquantity quantity NA NA NA NA NA 0:1 NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_data_file sequencing_run data_file ../out/example_wgs_sequencing_report_deliverable_summary.tsv 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasdatafile yes NA NA composedof data_file_associated_to_the_concept data_file data_file_associated_to_the_sequencing_run sphnattributeobject data_file time_series_data_file NA NA NA NA 1:n NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0
FALSE sequencing_run_quality_control_metric sequencing_run quality_control_metric 5f4dcc3b5aa765d61d8327deb882cf99 2024.1 NA https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasqualitycontrolmetric yes NA NA composedof quality_control_metric_associated_to_the_concept quality_control_metric quality_control_metric_associated_with_the_sequencing_run sphnattributeobject quality_control_metric NA NA NA NA NA 1:n NA NA NA NA NA #92a8d1 #f7dac9 #8ed3a0 #8ed3a0

Semantic evidence network

I have added a method to automatically plot semantic evidence networks, which show how evidence provenance has been generated. The dataset is organised into three hierarchical grouping levels based on the column concept_or_concept_compositions_or_inherited. Level 1, the top level, includes entries where this column equals “concept”. Level 2 and Level 3 contain entries where this column does not equal “concept”. Level 3 is distinguished from Level 2 by the observation column: it holds the final observation associated with each general concept name from Level 2, that is, the rows with a non-empty observation. Level 3 is therefore a detailed continuation of Level 2.

Nodes in the network have the following attributes: general_concept_name, id, group, name, and observation. A general_concept_name recurs in both Level 2 and Level 3 but differs by the associated observation. Edges link nodes from Level 1 to Level 2 and from Level 2 to Level 3, using general_concept_name as a consistent link identifier. This connects each abstract concept to its more detailed observational breakdown.
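
The three-level structure described above can be sketched in Python as follows. This is an illustration, not the plotting method itself: the function name build_network is hypothetical, and linking a Level 2 node to its Level 1 parent through the concept_reference column is an assumption made for this sketch. The example rows reuse values from the table in the Current version section.

```python
KIND = "concept_or_concept_compositions_or_inherited"


def build_network(rows):
    """Build node and edge lists for the three grouping levels.

    Level 1: rows where KIND equals "concept".
    Level 2: all other rows, one node per general_concept_name.
    Level 3: the observation of a Level 2 row, when it is non-empty.
    """
    nodes, edges = [], []

    def add_node(name, group, general_concept_name, observation=None):
        node_id = len(nodes)
        nodes.append({
            "id": node_id,
            "group": group,
            "name": name,
            "general_concept_name": general_concept_name,
            "observation": observation,
        })
        return node_id

    # Level 1: index concept nodes by concept_reference (assumed parent key).
    level1 = {}
    for row in rows:
        if row[KIND] == "concept":
            name = row["general_concept_name"]
            level1[row["concept_reference"]] = add_node(name, 1, name)

    # Levels 2 and 3: attribute rows and their observations.
    for row in rows:
        if row[KIND] == "concept":
            continue
        name = row["general_concept_name"]
        level2_id = add_node(name, 2, name)
        parent_id = level1.get(row["concept_reference"])
        if parent_id is not None:
            edges.append((parent_id, level2_id))
        observation = row.get("observation")
        if observation not in (None, "", "NA"):
            level3_id = add_node(observation, 3, name, observation)
            edges.append((level2_id, level3_id))

    return nodes, edges


rows = [
    {"concept_reference": "sequencing_assay", "general_concept_name": "sequencing_assay",
     KIND: "concept", "observation": "NA"},
    {"concept_reference": "sequencing_assay", "general_concept_name": "intended_read_length",
     KIND: "composedof", "observation": "150"},
    {"concept_reference": "sequencing_assay", "general_concept_name": "sequencing_run",
     KIND: "composedof", "observation": "334"},
]
nodes, edges = build_network(rows)
print(edges)
```

The resulting node and edge lists can be passed to any graph or Sankey plotting library; the choice of library is left open here.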

Downloads

Example output (in multiple filetypes) can be downloaded from the public set:

File Name                                 Download Link
example_report_concepts.tsv               Download
example_report_concepts.html              Download
example_report_concepts.Rds               Download
plot_semantic_evidence_plot_network.pdf   Download
plot_semantic_evidence_plot_sankey.pdf    Download
plot_semantic_evidence_plot_network.html  Download
plot_semantic_evidence_plot_sankey.html   Download

Example inputs can be downloaded from the public set:

File Name                                                             Download Link
Canton_001_NGS000012345_NA_S46_L001_R1_001.fastq_head.text            Download
Canton_001_NGS000012345_NA_S46_L001_R1_001_sample_seq_assay_log.text  Download
SPHN_dataset_release_2024_2_20240502.xlsx                             Download
sequencing assay_van_der_Horst2023.txt                                Download
bwa_10351_101_10453.out.text                                          Download
example_variant.Rds                                                   Download
example_variant.tsv                                                   Download

Process Steps for Variant Features to RDF Concept Mapping

This section outlines the sequential processing steps from data extraction through to the final merged dataset, prepared for RDF concept mapping. Each step corresponds to a specific script and handles distinct data types or stages in data preparation and merging.

  1. Export Variant Data from Study
    • Extracts variant data from genomic projects focusing on specific genes and filtering for high-impact variants, saving them in formats like RDS and TSV for further processing.
  2. Read Variant Report Data
    • Loads and transforms variant data into a long format to facilitate metadata annotation, preparing the data by adding a column for metadata requirements.
  3. Read Sequencing Assay Data
    • Extracts key sequencing assay data such as identifiers, read depth, and file formats from logs or metadata files, providing crucial context for sequencing parameters.
  4. Read BWA Read Group Data
    • Parses BWA and samtools log files to extract detailed read group information, including metadata about the sequencing run such as machine, file paths, and read group specifications.
  5. Read Fastq Header Data
    • Analyzes headers from FASTQ files to extract sequencing instrument details and run metrics, offering a granular look at the sequencing runs which is instrumental in validating sequencing quality and parameters.
  6. Merge Datasets
    • Combines all processed data from the previous steps into a single dataset, aligning them by common identifiers and ensuring consistency across data types.
  7. Map Pipeline Output to SPHN Concepts
    • Maps the merged dataset to standard SPHN RDF concepts, ensuring each data point is correctly classified according to standardized ontology, thus aligning detailed genomic data with broader healthcare data standards.
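
To illustrate step 5, here is a minimal Python sketch that pulls the sequencing instrument and run number out of a FASTQ header line. It assumes the common Illumina header layout (instrument, run number, flowcell ID, lane, tile, x, y, followed by a space and read, filter flag, control number, index). The function name and the example header are hypothetical: the instrument and run values mirror the observations in the table above, while the flowcell ID and index are placeholders. The pipeline's own script may work differently.

```python
def parse_fastq_header(header: str) -> dict:
    """Extract instrument and run details from an Illumina-style FASTQ header.

    Assumed layout:
    @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    read_id, _, description = header[1:].strip().partition(" ")
    fields = read_id.split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields, got %d" % len(fields))
    instrument, run, flowcell, lane, _tile, _x, _y = fields
    result = {
        "sequencing_instrument": instrument,
        "sequencing_run": run,
        "flowcell": flowcell,
        "lane": int(lane),
    }
    # The description part is optional; parse it only when it has 4 fields.
    parts = description.split(":") if description else []
    if len(parts) == 4:
        result["read"] = int(parts[0])
        result["index"] = parts[3]
    return result


# Hypothetical header; flowcell ID and index sequence are placeholders.
header = "@A00485:334:HXXXXXXXX:1:1101:1000:1000 1:N:0:ACGTACGT"
print(parse_fastq_header(header))
```

Headers from other platforms or older software versions use different layouts, so the field count check is deliberately strict.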

Terms used in WGS logging

Descriptions for sequencing assay (WGS) terms

Column Name                     Description
seq_assay_identifier            The unique identifier for the sequencing assay, typically a standard ontology term such as obo:OBI_002117 for Whole Genome Sequencing.
seq_assay_intended_read_depth   The targeted depth of coverage for the sequencing assay, indicating how many times each base is expected to be sequenced; in this case, 30x.
seq_assay_intended_read_length  The expected length of each read in the sequencing process, measured in base pairs; here, 150 bp.
data_file_identifier            Identifier for the data file output from the sequencing, used to trace and access the file within data systems.
data_file_format                The format of the sequencing data files, specifying the standard used; here, EDAM format 1931, which is typical for FASTQ files from Illumina platforms.
quality_control_name            The name of the metric used to assess the quality of the sequencing data; in this case, the Phred quality score.
quality_control_value           The actual quality score achieved, indicating the reliability of the sequencing reads; 78.33% in this context.
library_prep_kit                Specifies the kit used for preparing DNA libraries for sequencing, critical for understanding the sample preparation methodology; Illumina TruSeq DNA PCR-Free is noted for high fidelity.
sample_identifier               The unique identifier for the sample being sequenced, used for tracking and reference throughout the sequencing process.
sample_material_type            The type of biological material from which the sample was derived, with its specific ontology code; snomed:119297000 denotes a blood sample.
seq_instrument_code             The identifier for the sequencing instrument used, linking to specific equipment details; obo:OBI_0002630 refers to the Illumina NovaSeq 6000.
sop_name                        The name of the Standard Operating Procedure followed during the sequencing, ensuring consistency and reproducibility; here, “WGS with Illumina NovaSeq 6000”.
sop_description                 A brief description of the SOP, providing context and specifics about the sequencing approach used.
sop_version                     The version number of the SOP, which helps in identifying any changes or updates that might affect the sequencing output or interpretation.

Descriptions for FASTQ/BAM readgroup terms

Column Name     Description
START AT        The start timestamp of the sequencing or analysis process.
END AT          The end timestamp of the sequencing or analysis process.
sample_read_id  A unique identifier for each sample read in the process.
rel_dir         Relative directory path where sequencing data is stored.
dir_id          Directory identifier combining the project and sample ID.
FILE1           Path to the first FASTQ file generated by sequencing.
FILE2           Path to the second FASTQ file generated by sequencing.
output_file     Path to the final BAM file generated after processing.
ID              Internal identifier used to track the sample in analysis.
SM              Sample name or identifier used within the BAM file.
PL              Sequencing platform used, indicating technology type.
PU              Platform unit (PU) tag, often a barcode identifier.
LB              Library ID which is crucial for distinguishing between libraries prepared differently.
RG              Read group identifier in a BAM file, encapsulating all other identifiers.
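
As an illustration of how the ID, SM, PL, PU, and LB terms relate to the RG entry, the Python sketch below splits a SAM-style @RG line into its tags. The function name and the example read group values are hypothetical. The sketch also accepts the literal backslash-t form in which read groups are often written on an aligner command line, which is an assumption about how the logs store them.

```python
def parse_read_group(rg_line: str) -> dict:
    """Split an @RG line into a dict of tag -> value.

    Accepts real tab characters as well as the literal two-character
    sequence backslash + 't' used when the read group is passed on a
    command line.
    """
    fields = rg_line.strip().replace("\\t", "\t").split("\t")
    if fields[0] != "@RG":
        raise ValueError("read group lines start with '@RG'")
    tags = {}
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if not sep:
            raise ValueError("malformed read group field: %r" % field)
        tags[key] = value
    return tags


# Hypothetical read group; all values are placeholders.
rg = r"@RG\tID:sample1_L001\tSM:sample1\tPL:ILLUMINA\tPU:HXXXXXXXX.1\tLB:lib1"
print(parse_read_group(rg))
```

Splitting each field at the first colon only keeps values that themselves contain colons intact.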

Descriptions for genetic variants terms

Column Name     Description
sample.id       Unique identifier for each sample.
rownames        Row names corresponding to data entries.
CHROM           Chromosome number where the variant is located.
REF             Reference allele at the variant locus.
ALT             Alternate allele at the variant locus.
POS             Position of the variant on the chromosome.
start           Start position of the variant.
end             End position of the variant.
width           Width of the variant region.
Gene            Gene name associated with the variant.
SYMBOL          Gene symbol.
HGNC_ID         HUGO Gene Nomenclature Committee ID.
HGVSp           Human Genome Variation Society protein nomenclature.
HGVSc           Human Genome Variation Society coding DNA sequence nomenclature.
Consequence     Consequence of the variant.
IMPACT          Impact of the variant on the gene or protein function.
genotype        Genotype showing the variant alleles.
Feature_type    Type of genomic feature (e.g., transcript, regulatory).
Feature         Specific feature affected by the variant (e.g., exon, intron).
BIOTYPE         Biological type of the feature affected (e.g., protein_coding, miRNA).
VARIANT_CLASS   Classification of the variant based on its genomic context.
CANONICAL       Indicates if the transcript is the canonical transcript.
CHROM_Metadata  Type: Chromosome; Cardinality: 1:1; Value Set: SNOMED CT: 91272006, LOINC:48000-4
POS_Metadata    Type: Genomic Position; Cardinality: 1:1; Value Set: GENO:0000902
REF_Metadata    Type: Reference Allele; Cardinality: 1:1; Value Set: string
ALT_Metadata    Type: Alternate Allele; Cardinality: 1:1; Value Set: string
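
The long-format transformation with an added metadata column (step 2 of the process) can be sketched in Python as below. The metadata strings are copied from the four *_Metadata entries above. The function name, the output column names (variable, value, metadata), and the example variant are hypothetical, so this shows the idea and not the pipeline's actual script.

```python
from typing import Dict, List

# Metadata strings as listed for the *_Metadata columns above.
VARIANT_METADATA: Dict[str, str] = {
    "CHROM": "Type: Chromosome; Cardinality: 1:1; Value Set: SNOMED CT: 91272006, LOINC:48000-4",
    "POS": "Type: Genomic Position; Cardinality: 1:1; Value Set: GENO:0000902",
    "REF": "Type: Reference Allele; Cardinality: 1:1; Value Set: string",
    "ALT": "Type: Alternate Allele; Cardinality: 1:1; Value Set: string",
}


def to_long(variant: dict, id_key: str = "sample.id") -> List[dict]:
    """Turn one wide variant record into long-format rows with metadata.

    Each output row holds the sample ID, one variable name, its value,
    and the metadata requirement for that variable ("NA" when none is defined).
    """
    sample_id = variant[id_key]
    return [
        {
            "sample.id": sample_id,
            "variable": key,
            "value": value,
            "metadata": VARIANT_METADATA.get(key, "NA"),
        }
        for key, value in variant.items()
        if key != id_key
    ]


# Hypothetical variant record (not taken from the public example set).
variant = {"sample.id": "sample_1", "CHROM": "1", "POS": 12345,
           "REF": "A", "ALT": "G", "SYMBOL": "GENE1"}
for row in to_long(variant):
    print(row)
```

Variables without a defined metadata entry are kept with "NA" so that unmapped terms stay visible for later linking to SPHN concepts.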