HPCC & UKB

Mar 15, 2018

Previous Presentation

https://slides.agrueneberg.info/2017-09-22-hpcc-ukb.html

HPCC

Storage Solutions

Source: HPCC Documentation and User Manual

Group-specific paths:

Research space: /mnt/research/quantgen
Scratch space: /mnt/ls15/scratch/groups/quantgen

Research Space Quota

Research space usage is currently 15 TB of the 19 TB quota. Storage space is limited and relatively expensive.

Possible solutions:

Request a quota increase for home directories to 1 TB (https://contact.icer.msu.edu/large_quota), and make home directories readable by members of the quantgen group to allow sharing.
Use the group-specific scratch directory for intermediate files that can be easily recreated (within 4 hours?).
Keep external backups of finished projects (we are investigating an in-house solution within IQ).
???
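
Making a directory readable by the group could look roughly like the sketch below. The helper name share_with_group is hypothetical, and the demo runs on a temporary directory with the caller's primary group so that it can be tried safely; on the HPCC the directory would be the home directory and the group would be quantgen. Note that this only opens the top-level directory: files and subdirectories inside it need group read permission of their own.

```shell
#!/bin/bash
# Hypothetical helper: give a group read and traverse access to a directory.
share_with_group() {
    local dir=$1 group=$2
    chgrp "$group" "$dir"   # hand the directory to the group
    chmod g+rx "$dir"       # let group members list and enter it
}

# Demo on a throwaway directory (mktemp -d creates it with mode 700).
demo="$(mktemp -d)"
share_with_group "$demo" "$(id -gn)"
ls -ld "$demo"
```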

Other News

Please do not use #PBS -l 'feature=intel14' or #PBS -l 'feature=intel14|intel16' anymore; use #PBS -l 'feature=intel16' instead (our shared version of R does not work on intel14 nodes and produces cryptic errors).
There is now a version of R with GPU support: please get in touch if you want to use it.

UKB

Structure

Files are organized into two directories:

source (/mnt/research/quantgen/datasets/UKB/source)
- as retrieved from the UK Biobank
- contains original genotypes and split phenotypes
derivative (/mnt/research/quantgen/datasets/UKB/derivative)
- derived from source files without data loss
- contains unfiltered BED files, cohorts (all, whites), merged phenotypes with different types of column names

Data Showcase: http://biobank.ctsu.ox.ac.uk/crystal/

PIPELINE500

projects (/mnt/research/quantgen/projects/UKB/PIPELINE500)
- this pipeline is opinionated
- feel free to use this as a template for your own projects
- QC, cohort, relatedness cutoff, etc. may not be suitable for your project
- may currently change without notice
- this directory just contains code
scratch (/mnt/ls15/scratch/groups/quantgen/projects/UKB/PIPELINE500)
- this directory contains the output

Cohort-Based Approach

Whites cohort (cohorts/whites)
QC-ed BED files for whites cohort (BED)
Subset of phenotypes for whites cohort (phenotypes)
Subset of genome-derived phenotypes for whites cohort (sample_qc)
Summaries for whites cohort (summaries)
BGData object for whites cohort (BGData)

Cohort-Based Approach

GMatrix for whites cohort (GMatrix)
Related pairs (related_pairs)
Related individuals (related_individuals)
Related whites subcohort (cohorts/whites_related)
Unrelated whites subcohort (cohorts/whites_unrelated)

Cohort-Based Approach

PCs for unrelated whites subcohort (PCs)
Adjusted phenotypes for unrelated whites subcohort (adjusted_phenotypes)
Sample sets for unrelated whites subcohort (sample_sets)
GWAS for sample set (GWAS)
Variant sets (variant_sets)
BGLR (BGLR)

Computational Approach

The data is too big to hold in memory: divide-apply-combine approach
Allows for easy parallelization (both on nodes and on cores)
Chunks are mapped into memory: fast, but places an I/O burden on the system
Research space and scratch are both network file systems: I/O-heavy jobs have a large runtime variability
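
The divide-apply-combine pattern can be illustrated with a toy example. The sketch below is not the pipeline's code: it splits a file of numbers into chunks, sums each chunk in a separate background process (parallelism across cores), and then combines the partial sums. The file names and chunk size are made up for illustration.

```shell
#!/bin/bash
# Toy divide-apply-combine: sum the numbers 1..100 in four chunks.
work="$(mktemp -d)"
seq 1 100 > "$work/data.txt"

# Divide: split the data into chunks of 25 lines each.
split -l 25 "$work/data.txt" "$work/chunk_"

# Apply: process each chunk in its own background process.
for chunk in "$work"/chunk_*; do
    awk '{ s += $1 } END { print s }' "$chunk" > "$chunk.sum" &
done
wait  # block until all chunk processes have finished

# Combine: add up the per-chunk results.
total="$(cat "$work"/chunk_*.sum | awk '{ s += $1 } END { print s }')"
echo "$total"
```

Because each chunk is processed independently, the same layout carries over to running one chunk per job on separate nodes.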

Local Disk

The hard drive of the node itself: /mnt/local (~100GB)
The amount of local disk space needed must be requested (#PBS -l file=XXgb)
Files can be placed in /mnt/local or in a subdirectory /mnt/local/$JOBID
- /mnt/local is purged after 8 days
- /mnt/local/$JOBID is purged after job completion
Use TMPDIR environment variable to refer to /mnt/local/$JOBID
Data needs to be copied (cp in bash, file.copy() in R)
TMPDIR permissions are currently 755; use a subdirectory with 700 permissions for sensitive datasets
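
A job script that follows these points might look like the sketch below. The 20gb request, the SRC_FILE variable, and the file names are placeholders rather than values from the pipeline. The fallback to /tmp and the generated placeholder input exist only so that the script can be tried outside of a job.

```shell
#!/bin/bash
#PBS -l feature=intel16
#PBS -l file=20gb

set -euo pipefail

# Hypothetical input file; a small placeholder is created if SRC_FILE is unset.
src="${SRC_FILE:-}"
if [ -z "$src" ]; then
    src="$(mktemp)"
    printf 'placeholder\n' > "$src"
fi

# Private working directory on the local disk. Inside a job, TMPDIR refers to
# /mnt/local/$JOBID; /tmp is used here only as a fallback outside of a job.
workdir="$(mktemp -d "${TMPDIR:-/tmp}/private.XXXXXX")"
chmod 700 "$workdir"  # owner-only access for sensitive data

# Copy the input from the network file system to the local disk.
cp "$src" "$workdir/input.dat"

# ... run the analysis against "$workdir/input.dat" here ...
```

Files placed under /mnt/local/$JOBID are purged when the job completes, so results need to be copied back to scratch or research space before the script ends.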

Load Balancing

To estimate the amount of data needed in /mnt/local and the walltime of all jobs, chunks should be balanced
BED files are already split into chunks: chromosomes
Chromosomes are unbalanced
Merge, and split into balanced BED files
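
One way to derive balanced chunks is to split the merged variants by position in the file rather than by chromosome. The function below is a hypothetical sketch that prints the first and last variant index of each chunk; it assumes there are at least as many variants as chunks, and it leaves the actual subsetting of the BED file to a tool of your choice.

```shell
#!/bin/bash
# Hypothetical helper: print "start end" (1-based, inclusive) for each of
# n_chunks chunks covering total variants, with sizes differing by at most one.
chunk_ranges() {
    local total=$1 n_chunks=$2
    local base=$(( total / n_chunks ))  # minimum chunk size
    local rem=$(( total % n_chunks ))   # number of chunks that get one extra
    local start=1 size i
    for (( i = 0; i < n_chunks; i++ )); do
        size=$base
        if (( i < rem )); then
            size=$(( base + 1 ))
        fi
        echo "$start $(( start + size - 1 ))"
        start=$(( start + size ))
    done
}

# Example: 10 variants in 3 chunks.
chunk_ranges 10 3
```

With chunks of near-equal size, the local disk request and the walltime can be estimated once and reused for every job.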

Questions?

Office Hours:
MWF 10 AM - 2 PM
