HPCC & UKB

Sept 22, 2017

HPCC

Shared configuration

Please run the following snippet to enable shared configuration:

RS=/mnt/research/quantgen
echo -e "\nsource $RS/tools/configfiles/bash/bashrc" \
    >> ~/.bashrc
echo -e "\nsource $RS/tools/configfiles/bash/bash_profile" \
    >> ~/.bash_profile
touch "$RS/tools/configfiles/bash/subscribers/$USER"

Benefits: auto-loads R and PLINK, better defaults for working together, easier updates.

See: https://github.com/QuantGen/HPCC
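
Note that the snippet above appends a new line every time it is run. A minimal sketch of an idempotent variant follows, assuming a hypothetical helper `append_once` (not part of the QuantGen tools) and demonstrated on a throwaway file rather than on `~/.bashrc`:

```shell
# Hypothetical helper: append a line to a file only if that exact line is not there yet.
append_once() {
    local line=$1 file=$2
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a temporary file; point it at ~/.bashrc for real use.
RS=/mnt/research/quantgen
demo_rc=$(mktemp)
append_once "source $RS/tools/configfiles/bash/bashrc" "$demo_rc"
append_once "source $RS/tools/configfiles/bash/bashrc" "$demo_rc"   # no-op the second time
cat "$demo_rc"
```

The same helper could be used for the `~/.bash_profile` line.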

Directory structure of research space

datasets
projects
scratch
tools, logs, shares, etc.

See: /mnt/research/quantgen/README

`datasets` directory

Subdirectories of a dataset:

source (read-only, sometimes encrypted and with access control)
- Files received from institute.
derivative (read-only)
- Files derived from source files that are useful to all members of the group (e.g., file conversions, subset extractions).
playyard (read-and-write)
- The rest: experiments, data pipelines, etc.
- Deprecated in favor of projects directory.
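
The layout above can be sketched as a small shell function. This is an illustration only: `make_dataset_skeleton` is a hypothetical helper, the demo runs in a temporary directory, and plain `chmod` is an assumption (the actual access control on the research space may be set up differently, e.g., through groups).

```shell
# Hypothetical helper: create the three standard subdirectories of a dataset.
make_dataset_skeleton() {
    local root=$1
    mkdir -p "$root/source" "$root/derivative" "$root/playyard"
    # source and derivative are read-only; playyard stays writable for the owner.
    chmod a-w "$root/source" "$root/derivative"
    chmod u+w "$root/playyard"
}

# Demonstrated in a temporary directory, not on the research space.
demo=$(mktemp -d)
make_dataset_skeleton "$demo/EXAMPLE"
ls -ld "$demo/EXAMPLE"/*
```

In practice, files would be copied into `source` and `derivative` before write permission is removed.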

`projects` directory

project directories (e.g., UKB/landscape)
- Files to reproduce a particular paper.
user directories (e.g., gruenebe)

`scratch` directory

link to group-specific scratch space (/mnt/ls15/scratch/groups/quantgen)
follows the same directory structure as research space
very fast, lots of space
files may be purged after 45 days (purge policy not confirmed)
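
To see which files might be at risk, one option is to list files that have not been modified for a while. A minimal sketch, assuming the purge is based on modification time (this is not confirmed) and using a hypothetical helper `list_stale_files`:

```shell
# Hypothetical helper: list regular files last modified more than N days ago.
list_stale_files() {
    local dir=$1 days=$2
    # -mtime +N matches files whose age in whole days is greater than N.
    find "$dir" -type f -mtime +"$days" -print
}

# Example (path from the slide; 40 days is an arbitrary safety margin):
# list_stale_files /mnt/ls15/scratch/groups/quantgen 40
```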

Pipelines using scratch

For I/O-heavy projects:

Leave code on research space
Read input from scratch
Write output to scratch
Occasionally copy output from scratch to research space (e.g., using a cronjob)

cron: periodic job scheduler

$ crontab -l
0 0 * * * /mnt/research/quantgen/tools/cronjobs/ukb-500-output-transfer.sh

$ cat /mnt/research/quantgen/tools/cronjobs/ukb-500-output-transfer.sh
rsync -av /mnt/research/quantgen/scratch/projects/UKB/PIPELINE500/GWAS \
    /mnt/research/quantgen/projects/UKB/PIPELINE500/output/

Let me know if you need help setting this up.
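
One way to add such an entry without losing existing ones is to merge it into the current crontab. A minimal sketch, assuming a hypothetical helper `merge_cron_entry`; the code below only previews the result, and the install command is left commented out:

```shell
# Hypothetical helper: read a crontab on stdin and print it with `entry` added if missing.
merge_cron_entry() {
    local entry=$1 current
    current=$(cat)
    if printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$current"              # already present: unchanged
    elif [ -n "$current" ]; then
        printf '%s\n%s\n' "$current" "$entry" # append to existing entries
    else
        printf '%s\n' "$entry"                # empty crontab: entry only
    fi
}

entry='0 0 * * * /mnt/research/quantgen/tools/cronjobs/ukb-500-output-transfer.sh'

# Preview on an empty crontab; nothing is installed here.
printf '' | merge_cron_entry "$entry"

# To install for real (replaces your crontab with the merged version):
# crontab -l 2>/dev/null | merge_cron_entry "$entry" | crontab -
```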

UK Biobank

UKB

500k dataset was released.

Sample sizes

All: 488,377
White British: 409,703

Number of SNPs

Calls: 805,426

Data location on HPCC

Genotype Calls: /mnt/research/quantgen/datasets/UKB/source/genotypes/calls500

Phenotypes: /mnt/research/quantgen/datasets/UKB/source/phenotypes (no changes)

Genotype-derived phenotypes: /mnt/research/quantgen/datasets/UKB/source/genotypes/sample_qc

Where is the imputed data?

Problem:

> The genetic data was imputed using two different reference panels. The
> Haplotype Reference Consortium (HRC) panel was used as first choice
> option, but for SNPs not in that reference panel the UK10K + 1000 Genomes
> panel was used. The problem arose in the second set of imputed data from
> the UK10K + 1000 Genomes panel. The genotypes at these SNPs are imputed
> correctly, but have not been recorded as having the correct genome
> position in the files.

> For now we recommend that researchers focus exclusively on SNPs in the
> HRC panel, or work with the directly genotyped data until the new release
> is available.

http://www.ukbiobank.ac.uk/2017/07/important-note-about-imputed-genetics-data/

What's different?

XY chromosome as a separate file
Calls are now distributed in BED format
No new phenotype file: for updates to genotype-derived phenotypes (e.g., ethnic grouping, PCs, …) check the sample QC file

What has been done so far?

derivative directory: /mnt/research/quantgen/datasets/UKB/derivative/

BED/calls500_unfiltered (renamed original BED files)
cohorts/calls500_unfiltered/whites (white cohort, FID IID)
relabeled_phenotypes (uses labels instead of cryptic field IDs)

Project directory: /mnt/research/quantgen/projects/UKB/PIPELINE500

BED (whites only, minor QC)
phenotypes and phenotypes_genetic (whites only)
adjusted_phenotypes (height)
cohorts (genotyped_white, genotyped_white_related, genotyped_white_unrelated)
BGData, summaries
GMatrix and related_pairs
sample_sets, GWAS
ld, markers

Questions?

Office Hours:
TWT 10 AM - 11 AM