Variant features to RDF concept metadata

This process starts with sequencing_assay, which includes library_preparation, sequencing_run, etc. We will continue through the pipeline until we reach the final end-point required to report a pathogenic variant.

This documentation outlines the transformation of variant information from whole genome sequencing (WGS) data into a format that adheres to RDF-structured data concepts. The aim is to ensure that the omic output from genomic analyses can be integrated into clinical data warehouses with high fidelity and clarity.

Number of variables:

All SPHN RDF concept info (see SPHN_dataset_release_2024_2_20240502.xlsx) = 1503
Subset of relevant concepts = 76 (see example_subset_concepts.tsv)
Relevant WGS pipeline logs = 62 (see example_report.tsv)
Currently automated match = 13 (see example_report_concepts.Rds)
This repository uses a public dataset of example genetic variants and sequencing/analysis log data.

Overview

The process begins with the extraction of variant data from a genomic study (no sensitive data is included in the public example set). Key variant features such as Chromosome (CHROM), Position (POS), Reference Allele (REF), and Alternate Allele (ALT) are formatted alongside metadata that describes their relationship to RDF concepts. This ensures downstream users can map these data accurately within clinical and research frameworks.

This document will be updated as we improve the linking of result terms to SPHN_dataset_release_2024_2_20240502.xlsx. This linking is critical for downstream users to map data correctly.

Aims

Data preparation: Start with the extracted variant information from the genomic pipeline.
Key term identification: Focus on essential terms such as CHROM, POS, REF, and ALT for variants, and sequencing run and sequencing instrument for the assay.
Metadata addition: Attach metadata columns that specify RDF concept requirements such as type and cardinality.
Validation checklist:
- Are all necessary variant descriptors present?
- Are all metadata explanations included and accurate?
- Is the metadata aligned with SPHN omic concepts?
- Downstream users (mapping) can choose from TSV, HTML, JSON, and Rds. Are any other formats needed?
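
The first checklist item can be automated. The following is a minimal sketch, assuming each variant is available as a dictionary keyed by column name and that missing values appear as empty strings or "NA". The function name and the example records are hypothetical and are not taken from the example dataset.

```python
# Required variant descriptors named in this document.
REQUIRED_DESCRIPTORS = ("CHROM", "POS", "REF", "ALT")


def missing_descriptors(record: dict) -> list:
    """Return the required descriptors that are absent or empty in a variant record."""
    missing = []
    for key in REQUIRED_DESCRIPTORS:
        value = record.get(key)
        # Assumption: "NA" and empty strings both mean "no value".
        if value is None or str(value).strip() in ("", "NA"):
            missing.append(key)
    return missing


# Hypothetical records for illustration only.
complete = {"CHROM": "1", "POS": 12345, "REF": "A", "ALT": "G"}
incomplete = {"CHROM": "1", "POS": 12345, "REF": "A", "ALT": "NA"}

print(missing_descriptors(complete))    # []
print(missing_descriptors(incomplete))  # ['ALT']
```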

Current version

The observation column is highlighted in green. It contains the data that we report as output from the pipeline for use in our database. Here are the completed concept observations (file example_report_concepts.html):

cardinalityViolated	concept_reference_general_concept_name	concept_reference	general_concept_name	observation	release	unique_ID	IRI	active_status_(yes/no)	deprecated_in	replaced_by	concept_or_concept_compositions_or_inherited	general_description	contextualized_concept_name	contextualized_description	parent	type	excluded_type_descendants	standard	value_set_or_subset	meaning_binding	additional_information	cardinality_for_composedOf	cardinality_for_concept_to_Administrative_Case	cardinality_for_concept_to_Data_Provider	cardinality_for_concept_to_Subject_Pseudo_Identifier	cardinality_for_concept_to_Source_System	sensitive_(yes/no)	color_inherited	color_reference	color_observation	color_cardinality
FALSE	sequencing_assay_sequencing_assay	sequencing_assay	sequencing_assay	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#sequencingassay	yes	NA	NA	concept	an_assay_that_exploits_a_sequencer_as_the_instrument_to_generate_results	sequencing_assay	an_assay_that_exploits_a_sequencer_as_the_instrument_to_generate_results	assay	assay	NA	NA	NA	efo:0003740_\|assay_by_sequencer\|	NA	NA	0:n	1:1	0:n	1:n	NA	#7CCAFF	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_standard_operating_procedure	sequencing_assay	standard_operating_procedure	wgs_with_illumina_novaseq_6000	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstandardoperatingprocedure	yes	NA	NA	inherited	standard_operating_procedure_associated_to_the_concept	standard_operating_procedure	standard_operating_procedure_that_was_followed_for_this_sequencing_assay	sphnattributeobject	standard_operating_procedure	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_predecessor	sequencing_assay	predecessor	kispi_custom_sample_prep_v1	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haspredecessor	yes	NA	NA	inherited	process_preceding_this_concept	predecessor	sample_processing_preceding_the_sequencing_assay	sphnattributeobject	sample_processing	NA	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_code	sequencing_assay	code	efo_0022396	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hascode	yes	NA	NA	inherited	coded_information_specifying_the_concept	code	code_specifying_the_type_of_sequencing_assay	sphnattributeobject	code	NA	efo;_obi_or_other	for_efo:_descendant_of:_efo:0001455_\|assay\|;_for_obi:_descendant_of:_obi:0000070_\|assay\|	NA	NA	1:1	NA	NA	NA	NA	NA	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_identifier	sequencing_assay	identifier	obo:obi_002117_(wgs)	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasidentifier	yes	NA	NA	inherited	unique_identifier_identifying_the_concept	identifier	unique_identifier_identifying_the_sequencing_assay	sphnattributedatatype	string	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_start_datetime	sequencing_assay	start_datetime	jul 01 2023 01:01:01 gmt / v0.9.0	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstartdatetime	yes	NA	NA	inherited	datetime_at_which_the_concept_started	start_datetime	datetime_at_which_the_sequencing_assay_was_first_executed	hasdatetime	temporal	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	yes	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_data_file	sequencing_assay	data_file	out.fastq	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasdatafile	yes	NA	NA	inherited	data_file_associated_to_the_concept	data_file	data_file_associated_to_the_sequencing_assay	sphnattributeobject	data_file	time_series_data_file	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_sample	sequencing_assay	sample	blood_sample_1	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hassample	yes	NA	NA	inherited	sample_associated_to_the_concept	sample	material_that_is_being_sequenced_by_this_sequencing_assay	sphnattributeobject	sample	tumor_specimen;_isolate	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#abb1cf	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_library_preparation	sequencing_assay	library_preparation	illumina_truseq_dna_pcr-free	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haslibrarypreparation	yes	NA	NA	composedof	library_preparation_associated_to_the_concept	library_preparation	the_library_preparation_that_is_part_of_the_sequencing_assay	sphnattributeobject	library_preparation	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_sequencing_instrument	sequencing_assay	sequencing_instrument	a00485	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hassequencinginstrument	yes	NA	NA	composedof	device_associated_to_the_concept	sequencing_instrument	the_device_which_is_used_to_perform_the_sequencing_assay	sphnattributeobject	sequencing_instrument	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_sequencing_run	sequencing_assay	sequencing_run	334	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hassequencingrun	yes	NA	NA	composedof	sequencing_run_associated_to_the_concept	sequencing_run	sequencing_run_performed_as_part_of_the_sequencing_assay	sphnattributeobject	sequencing_run	NA	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#92a8d1	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_intended_read_length	sequencing_assay	intended_read_length	150	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasintendedreadlength	yes	NA	NA	composedof	intended_read_length_associated_to_the_concept	intended_read_length	the_number_of_nucleotides_intended_to_be_ordered_from_each_side_of_a_nucleic_acid_fragment_obtained_after_the_completion_of_a_sequencing_assay	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7cac9	#8ed3a0	#8ed3a0
FALSE	sequencing_assay_intended_read_depth	sequencing_assay	intended_read_depth	30x	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasintendedreaddepth	yes	NA	NA	composedof	intended_read_depth_associated_to_the_concept	intended_read_depth	the_number_of_times_a_particular_locus_(site,_nucleotide,_amplicon,_region)_was_intended_to_be_sequenced_as_part_of_the_sequencing_assay	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7cac9	#8ed3a0	#8ed3a0
FALSE	library_preparation_library_preparation	library_preparation	library_preparation	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#librarypreparation	yes	NA	NA	concept	process_which_results_in_the_creation_of_a_library_from_fragments_of_dna	library_preparation	process_which_results_in_the_creation_of_a_library_from_fragments_of_dna	sampleprocessing	sample_processing	NA	NA	NA	obi:0000711_\|library_preparation\|	NA	NA	0:n	1:1	0:n	1:n	NA	#7CCAFF	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_code	library_preparation	code	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hascode	yes	NA	NA	inherited	coded_information_specifying_the_concept	code	code_specifying_the_type_of_library_preparation	sphnattributeobject	code	NA	obi;_efo_or_other	for_obi:_descendant_of:_obi:0000711_\|library_preparation\|	NA	NA	0:1	NA	NA	NA	NA	NA	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_input	library_preparation	input	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasinput	yes	NA	NA	inherited	input_associated_to_the_concept	input	the_sample_for_which_a_library_is_created	sphnattributeobject	sample	NA	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_output	library_preparation	output	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasoutput	yes	NA	NA	inherited	output_associated_to_the_concept	output	the_ngs_library_that_is_produced	sphnattributeobject	sample	tumor_specimen	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_start_datetime	library_preparation	start_datetime	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstartdatetime	yes	NA	NA	inherited	datetime_at_which_the_concept_started	start_datetime	start_of_library_preparation	hasdatetime	temporal	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	yes	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_quality_control_metric	library_preparation	quality_control_metric	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasqualitycontrolmetric	yes	NA	NA	inherited	quality_control_metric_associated_to_the_concept	quality_control_metric	quality_control_metric_related_to_the_output_of_the_library_preparation	sphnattributeobject	quality_control_metric	NA	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_predecessor	library_preparation	predecessor	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haspredecessor	yes	NA	NA	inherited	process_preceding_this_concept	predecessor	process_preceding_this_library_preparation	sphnattributeobject	sample_processing	NA	NA	NA	NA	NA	0:n	NA	NA	NA	NA	NA	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_standard_operating_procedure	library_preparation	standard_operating_procedure	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasstandardoperatingprocedure	yes	NA	NA	inherited	standard_operating_procedure_associated_to_the_concept	standard_operating_procedure	standard_operating_procedure_that_was_followed_for_this_library_preparation	sphnattributeobject	standard_operating_procedure	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#abb1cf	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_kit_code	library_preparation	kit_code	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#haskitcode	yes	NA	NA	composedof	coded_information_specifying_the_kit_associated_to_the_concept	library_preparation_kit_code	pre-filled,_ready-to-use_reagent_cartridges_intended_to_improve_chemistry,_cluster_density_and_read_length_as_well_as_improve_quality_(q)_scores_for_this_sample._reagent_components_are_encoded_to_interact_with_the_sequencing_system_to_validate_compatibility_with_user-defined_applications.	hascode	code	NA	efo,_genepio,_fairgenomes_or_other	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_target_enrichment_kit_code	library_preparation	target_enrichment_kit_code	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hastargetenrichmentkitcode	yes	NA	NA	composedof	coded_information_specifying_the_target_enrichment_kit_associated_to_the_concept	target_enrichment_kit_code	indicates_which_target_enrichment_kit_was_used_to_prepare_this_sample._target_enrichment_is_a_pre-sequencing_dna_preparation_step_where_dna_sequences_are_either_directly_amplified_(amplicon_or_multiplex_pcr-based)_or_captured_(hybrid_capture-based)_in_order_to_only_focus_on_specific_regions_of_a_genome_or_dna_sample.	hascode	code	NA	efo,_genepio,_fairgenomes_or_other	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_intended_insert_size	library_preparation	intended_insert_size	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasintendedinsertsize	yes	NA	NA	composedof	intended_insert_size_associated_to_the_concept	intended_insert_size	in_paired-end_sequencing,_the_dna_between_the_adapter_sequences_is_the_insert._the_length_of_this_sequence_is_known_as_the_insert_size,_not_to_be_confused_with_the_inner_distance_between_reads._so,_fragment_length_equals_read_adapter_length_(2x)_plus_insert_size,_and_insert_size_equals_read_length_(2x)_plus_inner_distance.	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	library_preparation_gene_panel	library_preparation	gene_panel	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasgenepanel	yes	NA	NA	composedof	gene_panel_associated_to_the_concept	gene_panel	collection_of_genes_that_are_the_focus_of_sequencing	sphnattributeobject	gene_panel	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7f6c9	#8ed3a0	#8ed3a0
FALSE	sequencing_instrument_sequencing_instrument	sequencing_instrument	sequencing_instrument	a00485	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#sequencinginstrument	yes	NA	NA	concept	a_sequencing_instrument_that_is_used_in_a_sequencing_assay	sequencing_instrument	a_sequencing_instrument_that_is_used_in_a_sequencing_assay	sphnconcept	NA	NA	NA	NA	efo:0000548_\|instrument\|	NA	NA	NA	1:1	NA	NA	NA	#7CCAFF	#f7e7c9	#8ed3a0	#8ed3a0
FALSE	sequencing_instrument_code	sequencing_instrument	code	a00485	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hascode	yes	NA	NA	composedof	coded_information_specifying_the_concept	code	code_specifying_the_type_of_sequencing_instrument	sphnattributeobject	code	NA	obi;_efo_or_other	for_obi:_descendant_of:_obi:0400103_\|dna_sequencer\|;_for_efo:_descendant_of:_efo:0003739_\|sequencer\|	NA	NA	1:1	NA	NA	NA	NA	NA	#92a8d1	#f7e7c9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_sequencing_run	sequencing_run	sequencing_run	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#sequencingrun	yes	NA	NA	concept	the_valid_and_completed_operation_of_a_high-throughput_sequencing_instrument_associated_with_a_sequencing_assay	sequencing_run	the_valid_and_completed_operation_of_a_high-throughput_sequencing_instrument_associated_with_a_sequencing_assay	sphnconcept	NA	NA	NA	NA	ncit:c148088_\|sequencing_run\|	NA	NA	NA	1:1	NA	NA	NA	#7CCAFF	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_identifier	sequencing_run	identifier	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasidentifier	yes	NA	NA	composedof	unique_identifier_identifying_the_concept	identifier	unique_identifier_identifying_the_sequencing_run	sphnattributedatatype	string	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_datetime	sequencing_run	datetime	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasdatetime	yes	NA	NA	composedof	datetime_of_the_concept	datetime	datetime_the_sequencing_run_was_performed	sphnattributedatatype	temporal	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	yes	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_read_count	sequencing_run	read_count	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasreadcount	yes	NA	NA	composedof	ready_count_associated_with_to_concept	read_count	the_number_of_sequencing_reaction_results_that_were_pooled_to_assemble_a_sequence_for_a_genomic_region_of_interest_in_a_sequencing_run	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_average_insert_size	sequencing_run	average_insert_size	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasaverageinsertsize	yes	NA	NA	composedof	average_insert_size_associated_to_the_concept	average_insert_size	the_average_insert_size_found_during_the_nucleic_acid_sequencing_run	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_average_read_length	sequencing_run	average_read_length	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasaveragereadlength	yes	NA	NA	composedof	average_read_length_associated_to_the_concept	average_read_length	the_average_length_for_nucleic_acid_sequencing_reads_generated_in_a_sequencing_run	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_mean_read_depth	sequencing_run	mean_read_depth	NA	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasmeanreaddepth	yes	NA	NA	composedof	mean_read_depth_associated_to_the_concept	mean_read_depth	the_number_of_times_a_particular_locus_(site,_nucleotide,_amplicon,_region)_was_sequenced_in_a_sequencing_run	hasquantity	quantity	NA	NA	NA	NA	NA	0:1	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_data_file	sequencing_run	data_file	../out/example_wgs_sequencing_report_deliverable_summary.tsv	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasdatafile	yes	NA	NA	composedof	data_file_associated_to_the_concept	data_file	data_file_associated_to_the_sequencing_run	sphnattributeobject	data_file	time_series_data_file	NA	NA	NA	NA	1:n	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
FALSE	sequencing_run_quality_control_metric	sequencing_run	quality_control_metric	5f4dcc3b5aa765d61d8327deb882cf99	2024.1	NA	https://www.biomedit.ch/rdf/sphn-schema/sphn/2024/1#hasqualitycontrolmetric	yes	NA	NA	composedof	quality_control_metric_associated_to_the_concept	quality_control_metric	quality_control_metric_associated_with_the_sequencing_run	sphnattributeobject	quality_control_metric	NA	NA	NA	NA	NA	1:n	NA	NA	NA	NA	NA	#92a8d1	#f7dac9	#8ed3a0	#8ed3a0
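
The cardinalityViolated column in the table above compares the observations for a concept against its cardinality string (for example 0:1 or 1:n). The sketch below shows one way such a check could be written. It is an assumption about the logic, not the pipeline's actual implementation, and the function names are hypothetical.

```python
def parse_cardinality(text: str):
    """Split a cardinality string such as '0:1' or '1:n' into (minimum, maximum).

    A maximum of 'n' means unbounded and is returned as None.
    """
    low, high = text.strip().split(":")
    return int(low), (None if high == "n" else int(high))


def cardinality_violated(observations: list, cardinality: str) -> bool:
    """Return True when the count of non-missing observations is outside the range."""
    minimum, maximum = parse_cardinality(cardinality)
    # Assumption: None, empty strings and "NA" all count as missing.
    count = sum(1 for obs in observations if obs not in (None, "", "NA"))
    if count < minimum:
        return True
    return maximum is not None and count > maximum


print(cardinality_violated(["30x"], "0:1"))   # False
print(cardinality_violated(["NA"], "1:1"))    # True
print(cardinality_violated(["a", "b"], "0:1"))  # True
```

With this logic, a row with an NA observation and cardinality 0:1 is not a violation, which agrees with the FALSE values shown for the library_preparation rows.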

Semantic evidence network

I have added the following method to automatically plot semantic evidence networks, which show how evidence provenance has been generated. The dataset is organised into three hierarchical levels based on the column concept_or_concept_compositions_or_inherited. Level 1 contains entries where this column equals “concept”. Levels 2 and 3 contain entries where it does not. Level 2 holds the general concept names, and Level 3 holds the final observation associated with each Level 2 name, restricted to rows with a non-empty value in the observation column. Level 3 is therefore a detailed continuation of Level 2.

Each node has the attributes general_concept_name, id, group, name, and observation. The same general_concept_name occurs in both Level 2 and Level 3, and the associated observation distinguishes the two. Edges link Level 1 nodes to Level 2 nodes and Level 2 nodes to Level 3 nodes, using general_concept_name as a consistent link identifier, so that each abstract concept is connected to its more detailed observational breakdown.
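
The grouping into levels, nodes, and edges can be sketched as follows. This is a simplified illustration, not the plotting code used for the published figures. It assumes that rows are dictionaries with the column names from the table, that "NA" marks a missing observation, and that a Level 2 node is attached to its Level 1 parent through concept_reference (an assumption made here for the example). The function names are hypothetical.

```python
def is_missing(value) -> bool:
    """Treat None, empty strings and 'NA' as missing observations."""
    return value is None or str(value).strip() in ("", "NA")


def build_network(rows):
    """Return (nodes, edges) for the three-level grouping described above."""
    nodes = {}  # insertion-ordered; the key prevents duplicate nodes
    edges = []  # pairs of node ids (source, target)

    def node_id(key, group, name, general_concept_name, observation=None):
        if key not in nodes:
            nodes[key] = {
                "id": len(nodes),
                "group": group,
                "name": name,
                "general_concept_name": general_concept_name,
                "observation": observation,
            }
        return nodes[key]["id"]

    for row in rows:
        kind = row["concept_or_concept_compositions_or_inherited"]
        concept = row["concept_reference"]
        general = row["general_concept_name"]
        if kind == "concept":
            node_id((1, general), 1, general, general)  # Level 1
            continue
        # Assumption: the Level 1 parent is identified by concept_reference.
        parent = node_id((1, concept), 1, concept, concept)
        level2 = node_id((2, concept, general), 2, general, general)
        edges.append((parent, level2))
        if not is_missing(row["observation"]):
            obs = row["observation"]
            level3 = node_id((3, concept, general, obs), 3, obs, general, obs)
            edges.append((level2, level3))
    return list(nodes.values()), edges


KIND = "concept_or_concept_compositions_or_inherited"
rows = [
    {"concept_reference": "sequencing_assay", "general_concept_name": "sequencing_assay",
     "observation": "NA", KIND: "concept"},
    {"concept_reference": "sequencing_assay", "general_concept_name": "intended_read_depth",
     "observation": "30x", KIND: "composedof"},
    {"concept_reference": "library_preparation", "general_concept_name": "gene_panel",
     "observation": "NA", KIND: "composedof"},
]
nodes, edges = build_network(rows)
print(len(nodes), edges)  # 5 [(0, 1), (1, 2), (3, 4)]
```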

Downloads

Example output (in multiple file types) can be downloaded from the public set:

File Name	Download Link
`example_report_concepts.tsv`	Download
`example_report_concepts.html`	Download
`example_report_concepts.Rds`	Download
`pdf plot_semantic_evidence_plot_network.pdf`	Download
`pdf plot_semantic_evidence_plot_sankey.pdf`	Download
`html plot_semantic_evidence_plot_network.html`	Download
`html plot_semantic_evidence_plot_sankey.html`	Download

Example inputs can be downloaded from the public set:

File Name	Download Link
`Canton_001_NGS000012345_NA_S46_L001_R1_001.fastq_head.text`	Download
`Canton_001_NGS000012345_NA_S46_L001_R1_001_sample_seq_assay_log.text`	Download
`SPHN_dataset_release_2024_2_20240502.xlsx`	Download
`sequencing assay_van_der_Horst2023.txt`	Download
`bwa_10351_101_10453.out.text`	Download
`example_variant.Rds`	Download
`example_variant.tsv`	Download

Process Steps for Variant Features to RDF Concept Mapping

This section outlines the sequential processing steps from data extraction through to the final merged dataset, prepared for RDF concept mapping. Each step corresponds to a specific script and handles distinct data types or stages in data preparation and merging.

Export Variant Data from Study
- Extracts variant data from genomic projects focusing on specific genes and filtering for high-impact variants, saving them in formats like RDS and TSV for further processing.
Read Variant Report Data
- Loads and transforms variant data into a long format to facilitate metadata annotation, preparing the data by adding a column for metadata requirements.
Read Sequencing Assay Data
- Extracts key sequencing assay data such as identifiers, read depth, and file formats from logs or metadata files, providing crucial context for sequencing parameters.
Read BWA Read Group Data
- Parses BWA and samtools log files to extract detailed read group information, including metadata about the sequencing run such as machine, file paths, and read group specifications.
Read FASTQ Header Data
- Analyzes headers from FASTQ files to extract sequencing instrument details and run metrics, offering a granular look at the sequencing runs which is instrumental in validating sequencing quality and parameters.
Merge Datasets
- Combines all processed data from the previous steps into a single dataset, aligning them by common identifiers and ensuring consistency across data types.
Map Pipeline Output to SPHN Concepts
- Maps the merged dataset to standard SPHN RDF concepts, ensuring each data point is correctly classified according to standardized ontology, thus aligning detailed genomic data with broader healthcare data standards.
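
As an illustration of the FASTQ header step, the sketch below parses an Illumina-style header line, whose part before the space holds seven colon-separated fields (instrument, run number, flowcell, lane, tile, x, y). The function name and the example header are hypothetical. The instrument and run values were chosen to resemble the observations shown in the table above, and the flowcell and index are placeholders.

```python
def parse_fastq_header(header: str) -> dict:
    """Extract instrument and run fields from an Illumina-style FASTQ header line."""
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The part before the space locates the read; the part after describes it.
    location, _, description = header[1:].strip().partition(" ")
    fields = location.split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields before the space")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = fields
    parsed = {
        "sequencing_instrument": instrument,
        "sequencing_run": run,
        "flowcell": flowcell,
        "lane": int(lane),
    }
    if description:
        # First description field is the read number (1 or 2 for paired-end data).
        parsed["read"] = int(description.split(":")[0])
    return parsed


# Hypothetical header for illustration only.
example = "@A00485:334:HXXXXXXXX:1:1101:1000:2000 1:N:0:ACGTACGT"
info = parse_fastq_header(example)
print(info["sequencing_instrument"], info["sequencing_run"])  # A00485 334
```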

Terms used in WGS logging

Descriptions for sequencing assay (WGS) terms

Column Name	Description
`seq_assay_identifier`	The unique identifier for the sequencing assay, typically a standard ontology term such as obo:OBI_002117 for Whole Genome Sequencing.
`seq_assay_intended_read_depth`	The targeted depth of coverage for the sequencing assay, indicating how many times each base is expected to be sequenced; in this case, 150x.
`seq_assay_intended_read_length`	The expected length of each read in the sequencing process, measured in base pairs; here, 20 bp.
`data_file_identifier`	Identifier for the data file output from the sequencing, used to trace and access the file within data systems.
`data_file_format`	The format of the sequencing data files, specifying the standard used; here, EDAM format 1931, which is typical for FASTQ files from Illumina platforms.
`quality_control_name`	The name of the metric used to assess the quality of the sequencing data; in this case, the Phred quality score.
`quality_control_value`	The actual quality score achieved, indicating the reliability of the sequencing reads; 78.33% in this context.
`library_prep_kit`	Specifies the kit used for preparing DNA libraries for sequencing, critical for understanding the sample preparation methodology; Illumina TruSeq DNA PCR-Free is noted for high fidelity.
`sample_identifier`	The unique identifier for the sample being sequenced, used for tracking and reference throughout the sequencing process.
`sample_material_type`	The type of biological material from which the sample was derived, with its specific ontology code; snomed:119297000 denotes a blood sample.
`seq_instrument_code`	The identifier for the sequencing instrument used, linking to specific equipment details; obo:OBI_0002630 refers to the Illumina NovaSeq 6000.
`sop_name`	The name of the Standard Operating Procedure followed during the sequencing, ensuring consistency and reproducibility; here, “WGS with Illumina NovaSeq 6000”.
`sop_description`	A brief description of the SOP, providing context and specifics about the sequencing approach used.
`sop_version`	The version number of the SOP, which helps in identifying any changes or updates that might affect the sequencing output or interpretation.
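
Matching these log terms to concept rows can be sketched as a lookup table. The pairing below is an assumption inferred from the observation values in the concept table (for example, wgs_with_illumina_novaseq_6000 appears under sequencing_assay_standard_operating_procedure). It is not the repository's actual mapping. The function names and the sop_version value are hypothetical.

```python
# Assumed pairing of log columns to concept keys, for illustration only.
TERM_TO_CONCEPT = {
    "sop_name": "sequencing_assay_standard_operating_procedure",
    "library_prep_kit": "sequencing_assay_library_preparation",
}


def normalise(value: str) -> str:
    """Lower-case a value and replace spaces with underscores, as in the table."""
    return value.strip().lower().replace(" ", "_")


def map_log_to_concepts(log: dict):
    """Return (matched, unmatched): concept observations and terms with no mapping."""
    matched, unmatched = {}, []
    for term, value in log.items():
        concept = TERM_TO_CONCEPT.get(term)
        if concept is None:
            unmatched.append(term)
        else:
            matched[concept] = normalise(value)
    return matched, unmatched


log = {
    "sop_name": "WGS with Illumina NovaSeq 6000",
    "library_prep_kit": "Illumina TruSeq DNA PCR-Free",
    "sop_version": "1.0",  # hypothetical value
}
matched, unmatched = map_log_to_concepts(log)
print(matched["sequencing_assay_library_preparation"])  # illumina_truseq_dna_pcr-free
print(unmatched)  # ['sop_version']
```

Keeping the unmatched terms in a separate list makes it easy to track how many log terms still need a manual mapping.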

Descriptions for FASTQ/BAM readgroup terms

Column Name	Description
`START AT`	The start timestamp of the sequencing or analysis process.
`END AT`	The end timestamp of the sequencing or analysis process.
`sample_read_id`	A unique identifier for each sample read in the process.
`rel_dir`	Relative directory path where sequencing data is stored.
`dir_id`	Directory identifier combining the project and sample ID.
`FILE1`	Path to the first FASTQ file generated by sequencing.
`FILE2`	Path to the second FASTQ file generated by sequencing.
`output_file`	Path to the final BAM file generated after processing.
`ID`	Internal identifier used to track the sample in analysis.
`SM`	Sample name or identifier used within the BAM file.
`PL`	Sequencing platform used, indicating technology type.
`PU`	Platform unit (PU) tag, often a barcode identifier.
`LB`	Library ID which is crucial for distinguishing between libraries prepared differently.
`RG`	Read group identifier in a BAM file, encapsulating all other identifiers.
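
The read group tags above (ID, SM, PL, PU, LB) follow the SAM @RG header line layout, in which tab-separated TAG:value pairs follow the @RG record type. A minimal parsing sketch is shown below. It assumes the log may contain either real tabs or the literal two-character sequence \t, as typically passed to the bwa mem -R option. The function name and the example values are hypothetical.

```python
def parse_read_group(line: str) -> dict:
    """Parse an @RG line into a dictionary of tag -> value."""
    # Accept both real tabs and the literal "\t" escape often seen in logs.
    text = line.strip().replace("\\t", "\t")
    parts = text.split("\t")
    if parts[0] != "@RG":
        raise ValueError("not a read group line")
    tags = {}
    for part in parts[1:]:
        # Split at the first colon only, so values may contain colons.
        key, _, value = part.partition(":")
        tags[key] = value
    return tags


# Hypothetical read group string for illustration only.
example_rg = "@RG\\tID:sample1.L001\\tSM:sample1\\tPL:ILLUMINA\\tPU:HXXXXXXXX.1\\tLB:lib1"
rg = parse_read_group(example_rg)
print(rg["SM"], rg["PL"])  # sample1 ILLUMINA
```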

Descriptions for genetic variants terms

Column Name	Description
`sample.id`	Unique identifier for each sample.
`rownames`	Row names corresponding to data entries.
`CHROM`	Chromosome number where the variant is located.
`REF`	Reference allele at the variant locus.
`ALT`	Alternate allele at the variant locus.
`POS`	Position of the variant on the chromosome.
`start`	Start position of the variant.
`end`	End position of the variant.
`width`	Width of the variant region.
`Gene`	Gene name associated with the variant.
`SYMBOL`	Gene symbol.
`HGNC_ID`	HUGO Gene Nomenclature Committee ID.
`HGVSp`	Human Genome Variation Society protein nomenclature.
`HGVSc`	Human Genome Variation Society coding DNA sequence nomenclature.
`Consequence`	Consequence of the variant.
`IMPACT`	Impact of the variant on the gene or protein function.
`genotype`	Genotype showing the variant alleles.
`Feature_type`	Type of genomic feature (e.g., transcript, regulatory).
`Feature`	Specific feature affected by the variant (e.g., exon, intron).
`BIOTYPE`	Biological type of the feature affected (e.g., protein_coding, miRNA).
`VARIANT_CLASS`	Classification of the variant based on its genomic context.
`CANONICAL`	Indicates if the transcript is the canonical transcript.
`CHROM_Metadata`	Type: Chromosome; Cardinality: 1:1; Value Set: SNOMED CT: 91272006, LOINC:48000-4
`POS_Metadata`	Type: Genomic Position; Cardinality: 1:1; Value Set: GENO:0000902
`REF_Metadata`	Type: Reference Allele; Cardinality: 1:1; Value Set: string
`ALT_Metadata`	Type: Alternate Allele; Cardinality: 1:1; Value Set: string
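
The long-format step with a metadata column can be sketched as follows, using the four metadata strings listed above. This is a simplified illustration and not the repository's script. The output keys (feature, value, metadata), the function name, and the example record are hypothetical.

```python
# Metadata strings copied from the *_Metadata rows above.
METADATA = {
    "CHROM": "Type: Chromosome; Cardinality: 1:1; Value Set: SNOMED CT: 91272006, LOINC:48000-4",
    "POS": "Type: Genomic Position; Cardinality: 1:1; Value Set: GENO:0000902",
    "REF": "Type: Reference Allele; Cardinality: 1:1; Value Set: string",
    "ALT": "Type: Alternate Allele; Cardinality: 1:1; Value Set: string",
}


def to_long(record: dict, id_key: str = "sample.id") -> list:
    """Turn one wide variant record into one row per feature, with metadata attached."""
    rows = []
    for feature, value in record.items():
        if feature == id_key:
            continue
        rows.append({
            id_key: record[id_key],
            "feature": feature,
            "value": value,
            # Features without a metadata entry are marked "NA".
            "metadata": METADATA.get(feature, "NA"),
        })
    return rows


# Hypothetical variant record for illustration only.
variant = {"sample.id": "sample_1", "CHROM": "1", "POS": 12345,
           "REF": "A", "ALT": "G", "SYMBOL": "GENE1"}
long_rows = to_long(variant)
print(len(long_rows))            # 5
print(long_rows[1]["metadata"])  # Type: Genomic Position; Cardinality: 1:1; Value Set: GENO:0000902
```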