
Preparing a public data reference panel from 1000 genomes project for PCA

Last update: 20240807


Figure 1: Principal component analysis (PCA) of 1000 genomes project, reference genome GRCh38, showing population structure.

Overview

This documentation outlines the process of preparing a reference panel using the 1000 Genomes Project data, focusing on converting data to PLINK format and performing Principal Component Analysis (PCA). This enables mapping of cohort data to determine population labels and provides a reference for genetic diversity analysis.

Data Source

The data are derived from the GRCh38 liftover of the 1000 Genomes Project phase 3 release; the exact download location is given under "1000 genomes project" below.

Tools and tutorials referenced

1000 genomes project

This project provides public WGS data in VCF format together with related metadata. The pedigree information comes from 20130606_g1k.ped. We used the GRCh38 liftover data set: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/.

The phase 3 variant calls released by the 1000 Genomes Project were made on the GRCh37 reference. To compare them with the variant calls on the high coverage data, they had to be lifted over to GRCh38. The liftover was performed at the New York Genome Center (NYGC) using CrossMap version 0.5.4. The GRCh37 phase 3 calls used in the liftover are available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. The chain file used in the liftover is available from UCSC and can be downloaded from https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/ .

No attempt was made to lift over the SVs that were in phase 3. CrossMap does not lift over any record that had multiple hits on GRCh38 or whose REF allele matches the ALT allele after liftover. Additionally, any record that was lifted over to a different chromosome, or whose REF allele contained symbols (Y, W, Z etc.), was failed.
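As a rough sanity check on the filters described above, the sketch below scans VCF records and flags any whose REF equals ALT or whose REF contains characters other than A, C, G, T or N. The function name `flag_liftover_failures` and the output labels are hypothetical; this is not the NYGC procedure, only an illustration of two of the stated criteria.

```shell
# Hypothetical helper: read uncompressed VCF text on stdin and print
# "CHROM:POS<TAB>reason" for records that match the failure criteria.
flag_liftover_failures() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        $4 == $5 { print $1 ":" $2 "\tREF_EQUALS_ALT"; next }
        $4 !~ /^[ACGTNacgtn]+$/ { print $1 ":" $2 "\tNON_ACGTN_REF" }
    '
}
```

Once the files from step [1] are in place, it could be used as `bcftools view -H 1000genomes/phase3.chr22.GRCh38.GT.crossmap.vcf.gz | flag_liftover_failures | head`; an empty result would be consistent with the filtering described.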

[1] Download the files as VCF.gz (and tab-indices).

#!/bin/bash
prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/phase3.chr" ;

suffix=".GRCh38.GT.crossmap.vcf.gz" ;

for chr in {1..22} X Y; do
    wget "${prefix}""${chr}""${suffix}" \
               "${prefix}""${chr}""${suffix}".tbi ;
done

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/phase3.crossmap.GRCh38.07302021.README.html
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/phase3.crossmap.GRCh38.07302021.manifest.tsv

mkdir 1000genomes
mv phase3* 1000genomes/

[2] Download the 1000 Genomes PED file.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

[3] Download the reference genome (or link to an existing copy).

Read here: Reference genome.
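After step [1], it may be worth confirming that every per-chromosome VCF and its index arrived before starting the conversion. The sketch below assumes the files sit in the `1000genomes` directory created above and follow the same prefix and suffix as the download loop; the function name `check_downloads` is hypothetical.

```shell
# Print every expected file that is missing or empty, one path per line.
# Returns non-zero if anything is missing. The directory defaults to 1000genomes.
check_downloads() {
    dir=${1:-1000genomes}
    missing=0
    for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
        base="${dir}/phase3.chr${chr}.GRCh38.GT.crossmap.vcf.gz"
        for f in "$base" "$base.tbi"; do
            if [ ! -s "$f" ]; then      # -s: file exists and is not empty
                echo "$f"
                missing=1
            fi
        done
    done
    return "$missing"
}
```

Running `check_downloads 1000genomes` prints nothing when all 48 files are present, so any output is the list of downloads to retry.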

Pedigree data

Family.ID  Individual.ID  Paternal.ID  Maternal.ID  Gender  Phenotype  Population  Relationship
HG00096    HG00096        0            0            1       0          GBR         unrel
HG00097    HG00097        0            0            2       0          GBR         unrel
HG00099    HG00099        0            0            2       0          GBR         unrel
HG00100    HG00100        0            0            2       0          GBR         unrel
HG00101    HG00101        0            0            1       0          GBR         unrel
HG00102    HG00102        0            0            2       0          GBR         unrel
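A quick way to check the pedigree file is to count samples per population. The sketch below assumes the PED file is tab-delimited with a single header line and that Population is the 7th column, as in the excerpt above; the function name `count_populations` is hypothetical.

```shell
# Print "population<TAB>count" for each population code in a PED file,
# sorted by code. Assumes a tab-delimited file with one header line.
count_populations() {
    awk -F '\t' 'NR > 1 { n[$7]++ } END { for (p in n) print p "\t" n[p] }' "$1" \
        | LC_ALL=C sort
}
```

For example, `count_populations 20130606_g1k.ped` lists one line per population code, which can be compared against the codes in the table that follows.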

Population codes

Code  Population Description  Geographic/Ethnic Details
CHB   Han Chinese             Han Chinese in Beijing, China
JPT   Japanese                Japanese in Tokyo, Japan
CHS   Southern Han Chinese    Han Chinese South
CDX   Dai Chinese             Chinese Dai in Xishuangbanna, China
KHV   Kinh Vietnamese         Kinh in Ho Chi Minh City, Vietnam
CHD   Denver Chinese          Chinese in Denver, Colorado (pilot 3 only)
CEU   CEPH                    Utah residents (CEPH) with Northern and Western European ancestry
TSI   Tuscan                  Toscani in Italia
GBR   British                 British in England and Scotland
FIN   Finnish                 Finnish in Finland
IBS   Spanish                 Iberian populations in Spain
YRI   Yoruba                  Yoruba in Ibadan, Nigeria
LWK   Luhya                   Luhya in Webuye, Kenya
GWD   Gambian                 Gambian in Western Division, The Gambia
MSL   Mende                   Mende in Sierra Leone
ESN   Esan                    Esan in Nigeria
ASW   African-American SW     African Ancestry in Southwest US
ACB   African-Caribbean       African Caribbean in Barbados
MXL   Mexican-American        Mexican Ancestry in Los Angeles, California
PUR   Puerto Rican            Puerto Rican in Puerto Rico
CLM   Colombian               Colombian in Medellin, Colombia
PEL   Peruvian                Peruvian in Lima, Peru
GIH   Gujarati                Gujarati Indian in Houston, TX
PJL   Punjabi                 Punjabi in Lahore, Pakistan
BEB   Bengali                 Bengali in Bangladesh
STU   Sri Lankan              Sri Lankan Tamil in the UK
ITU   Indian                  Indian Telugu in the UK
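For plotting, the 26 phase 3 populations are often grouped into the five 1000 Genomes super-populations (EAS, EUR, AFR, AMR, SAS). The sketch below maps a population code to its super-population; the function name `superpop` is hypothetical, and returning `NA` for CHD (pilot only) and for unknown codes is an assumption of this sketch rather than something the scripts described here are known to do.

```shell
# Map a 1000 Genomes population code to its super-population code.
superpop() {
    case "$1" in
        CHB|JPT|CHS|CDX|KHV)         echo EAS ;;  # East Asian
        CEU|TSI|GBR|FIN|IBS)         echo EUR ;;  # European
        YRI|LWK|GWD|MSL|ESN|ASW|ACB) echo AFR ;;  # African
        MXL|PUR|CLM|PEL)             echo AMR ;;  # Admixed American
        GIH|PJL|BEB|STU|ITU)         echo SAS ;;  # South Asian
        *)                           echo NA  ;;  # CHD or unrecognised code
    esac
}
```

Calling `superpop GBR` prints `EUR`, which gives a coarser grouping to colour by when 26 separate labels make a bi-plot hard to read.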

PCA eigenvectors

Individual  PC1       PC2     PC3     PC4     PC5
HG00096     -0.01032  0.0270  0.0117  0.0192  0.002517
HG00097     -0.01054  0.0275  0.0104  0.0180  0.003890
HG00099     -0.01067  0.0275  0.0104  0.0168  0.001831
HG00100     -0.00968  0.0275  0.0109  0.0191  -0.000839
HG00101     -0.01038  0.0270  0.0116  0.0184  0.000796
HG00102     -0.01063  0.0272  0.0110  0.0178  0.003824
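To colour a bi-plot by population, the eigenvectors need to be joined to the pedigree labels by individual ID. The sketch below assumes a PLINK 1.9 style `.eigenvec` file (whitespace-delimited, no header, FID and IID in the first two columns) and a tab-delimited PED file with Individual ID in column 2 and Population in column 7; the function name `label_eigenvec` and the four-column output are hypothetical.

```shell
# Print "IID<TAB>population<TAB>PC1<TAB>PC2" for each row of an eigenvec file.
# Individuals absent from the PED file get the population label NA.
label_eigenvec() {
    ped=$1
    eigenvec=$2
    awk '
        BEGIN { OFS = "\t" }
        # First file (PED): split on tabs, skip the header, store population by IID.
        NR == FNR { if (FNR > 1) { split($0, f, "\t"); pop[f[2]] = f[7] }; next }
        # Second file (eigenvec): default whitespace splitting; $2 is the IID.
        { p = ($2 in pop) ? pop[$2] : "NA"; print $2, p, $3, $4 }
    ' "$ped" "$eigenvec"
}
```

The resulting table can be read directly into R for plotting. If the eigenvec file carries a header line, that line would need to be skipped first.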

Scripts Overview

The process is currently split across three scripts, run in order:

  • pca_biplot_1kg.sh: Converts the 1000 Genomes VCF files to BCF and then to PLINK format, processing each chromosome in parallel. The script normalises variants, reformats variant IDs, and prunes variants in linkage disequilibrium so that the PCA is not dominated by correlated markers.
  • pca_biplot_1kg_part2.sh: Continues from the first script, merging the per-chromosome PLINK files into a single dataset and performing PCA to explore population structure.
  • pca_biplot_1kg_part3_ggplot.R: Uses ggplot2 in R to visualise the PCA results. It joins the population labels to the eigenvectors and colour-codes populations in the bi-plot to show stratification among global populations.
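The per-chromosome conversion described for the first script might look roughly like the sketch below. It is not the content of pca_biplot_1kg.sh: the function name `convert_chr`, its arguments, the output file names, and the exact flag choices are assumptions, although each flag is a standard bcftools or PLINK 1.9 option.

```shell
# Convert one chromosome from VCF to a normalised BCF, then to PLINK format.
# Arguments (all hypothetical): chromosome, reference FASTA, input dir, output dir.
convert_chr() {
    chr=$1; ref_fa=$2; in_dir=$3; out_dir=$4
    vcf="${in_dir}/phase3.chr${chr}.GRCh38.GT.crossmap.vcf.gz"
    bcf="${out_dir}/chr${chr}.bcf"

    # 1. Split multiallelic sites and warn on REF mismatches against the FASTA.
    # 2. Overwrite variant IDs with CHROM:POS:REF:ALT.
    # 3. Drop duplicate records and write compressed BCF.
    bcftools norm -m -any --check-ref w -f "$ref_fa" "$vcf" \
        | bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT' - \
        | bcftools norm --rm-dup both -Ob -o "$bcf" -

    # Import into PLINK binary format, keeping the REF/ALT allele order.
    plink --bcf "$bcf" --keep-allele-order --const-fid \
          --allow-extra-chr --make-bed --out "${out_dir}/chr${chr}"
}
```

Because each chromosome is independent, calls such as `convert_chr 22 ref.fa 1000genomes plink_out` could be launched as background jobs and collected with `wait`, which is one simple way to get the per-chromosome parallelism mentioned above.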

Step-by-Step Summary

  1. Data conversion and normalisation
    • Convert original VCF files to BCF using bcftools, ensuring all variants have unique IDs based on chromosome positions.
    • Annotate and reformat using bcftools annotate to adjust variant IDs.
    • Normalise and remove duplicate variants to clean the dataset for further analysis.
  2. PLINK file preparation
    • Convert BCF files to PLINK format, ensuring allele orders are maintained.
    • Prune variants using PLINK to reduce the dataset based on minor allele frequency and linkage disequilibrium.
  3. Data merging and PCA analysis
    • Merge all chromosome-specific PLINK files into a single dataset.
    • Perform PCA to identify principal components that explain the maximum variance, indicative of population stratification.
  4. Visualisation and interpretation
    • Use ggplot2 and patchwork in R to create bi-plots of the first few principal components.
    • Overlay population data to visually interpret population structure and genetic diversity.
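Steps 2 and 3 above can be sketched with PLINK 1.9 as follows. The function names, file prefixes, the MAF and LD thresholds (`--maf 0.10`, `--indep-pairwise 50 5 0.2`), the restriction to autosomes, and the number of components (`--pca 10`) are all assumptions chosen for illustration; the source does not state the values used in the actual scripts.

```shell
# Prune one per-chromosome fileset by minor allele frequency and LD.
# $1 is a PLINK prefix such as plink_out/chr1 (hypothetical naming).
prune_chr() {
    prefix=$1
    # Write the list of variants to keep to <prefix>.prune.prune.in
    plink --bfile "$prefix" --maf 0.10 --indep-pairwise 50 5 0.2 \
          --out "${prefix}.prune"
    # Extract the retained variants into a new fileset <prefix>.pruned
    plink --bfile "$prefix" --extract "${prefix}.prune.prune.in" \
          --make-bed --out "${prefix}.pruned"
}

# Merge the pruned autosomes in directory $1 and run PCA on the result.
merge_and_pca() {
    dir=$1
    list="${dir}/merge_list.txt"
    : > "$list"                          # start with an empty merge list
    for chr in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22; do
        echo "${dir}/chr${chr}.pruned" >> "$list"
    done
    # chr1 is the base fileset; the list names the filesets merged into it.
    plink --bfile "${dir}/chr1.pruned" --merge-list "$list" \
          --make-bed --out "${dir}/merged"
    # Writes <dir>/pca.eigenvec and <dir>/pca.eigenval
    plink --bfile "${dir}/merged" --pca 10 --out "${dir}/pca"
}
```

The `.eigenvec` output is the input for the visualisation step, where the first few components are plotted against each other and coloured by population.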

Conclusion

This reference panel and PCA provide a framework for examining genetic diversity and population structure across global populations, using the GRCh38 liftover of the 1000 Genomes Project phase 3 data. Mapping cohort samples against the same reference allows their population labels to be determined, which is useful for genomic studies that need to account for genetic background.