Mar 15, 2018

HPCC

Storage Solutions

Research Space Quota

Research space usage is currently 15 TB of the 19 TB quota. Storage space is limited and relatively expensive.

Possible solutions:

  1. Request a quota increase for home directories to 1 TB (https://contact.icer.msu.edu/large_quota), and make home directories readable by members of the quantgen group to allow sharing.
  2. Use the group-specific scratch directory for intermediate files that can easily be recreated (within 4 hours?).
  3. Keep external backups of finished projects (we are investigating an in-house solution within IQ).
  4. ???
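
The sharing part of option 1 could be sketched as follows, assuming standard POSIX permissions. The helper share_with_group and the subdirectory name "shared" are hypothetical, not HPCC tools or conventions.

```bash
#!/bin/bash
# share_with_group DIR GROUP: give GROUP read access to DIR and its contents.
# Hypothetical helper for illustration only.
share_with_group() {
    local dir="$1" group="$2"
    chgrp -R -- "$group" "$dir" || return 1
    # g+r makes files readable; g+X makes directories traversable
    # (and keeps already-executable files executable for the group).
    chmod -R g+rX -- "$dir"
}

# Example (assumes the quantgen group exists and "shared" is your own choice):
# share_with_group "$HOME/shared" quantgen
```

Group members can only reach a shared subdirectory if every parent directory, including the home directory itself, is traversable by the group (g+x). Sharing a dedicated subdirectory rather than recursing over the whole home directory avoids loosening permissions on private files such as SSH keys.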

Other News

  • Please stop using #PBS -l 'feature=intel14' and #PBS -l 'feature=intel14|intel16'; use #PBS -l 'feature=intel16' instead. Our shared version of R does not work on intel14 nodes and fails with cryptic errors.
  • There is now a version of R with GPU support: please get in touch if you want to use it.

Other Open Questions

  • How should we handle software updates?

UKB

Structure

Files are split across several locations:

  • source (/mnt/research/quantgen/datasets/UKB/source)
    • as retrieved from the UK Biobank
    • contains original genotypes and split phenotypes
  • derivative (/mnt/research/quantgen/datasets/UKB/derivative)
    • derived from source files without data loss
    • contains unfiltered BED files, cohorts (all, whites), merged phenotypes with different types of column names

Data Showcase: http://biobank.ctsu.ox.ac.uk/crystal/

PIPELINE500

  • projects (/mnt/research/quantgen/projects/UKB/PIPELINE500)
    • this pipeline is opinionated
    • feel free to use this as a template for your own projects
    • QC, cohort, relatedness cutoff, etc. may not be suitable for your project
    • still under development: may change without notice
    • this directory just contains code
  • scratch (/mnt/ls15/scratch/groups/quantgen/projects/UKB/PIPELINE500)
    • this directory contains the output

Cohort-Based Approach

  • Whites cohort (cohorts/whites)
  • QC-ed BED files for whites cohort (BED)
  • Subset of phenotypes for whites cohort (phenotypes)
  • Subset of genome-derived phenotypes for whites cohort (sample_qc)
  • Summaries for whites cohort (summaries)
  • BGData object for whites cohort (BGData)

  • GMatrix for whites cohort (GMatrix)
  • Related pairs (related_pairs)
  • Related individuals (related_individuals)
  • Related whites subcohort (cohorts/whites_related)
  • Unrelated whites subcohort (cohorts/whites_unrelated)

  • PCs for unrelated whites subcohort (PCs)
  • Adjusted phenotypes for unrelated whites subcohort (adjusted_phenotypes)
  • Sample sets for unrelated whites subcohort (sample_sets)
  • GWAS for sample set (GWAS)
  • Variant sets (variant_sets)
  • BGLR (BGLR)

Computational Approach

  • The data are too big to hold in memory, so we use a divide-apply-combine approach
  • This allows for easy parallelization (across nodes and across cores)
  • Chunks are memory-mapped: this is fast, but places an I/O burden on the system
  • Research space and scratch are both network file systems, so I/O-heavy jobs have highly variable runtimes
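
The apply and combine steps might look roughly like the sketch below. It is only an illustration: the chunk files (*.chunk), the per-chunk task (counting lines), and all function names are hypothetical stand-ins for the real analysis. Here the chunks run as background processes on the cores of one node; across nodes, each chunk would instead be submitted as its own job.

```bash
#!/bin/bash
# apply_chunk CHUNK_FILE OUT_FILE: hypothetical per-chunk task (count lines).
apply_chunk() {
    awk 'END { print NR }' "$1" > "$2"
}

# combine RESULT_FILE...: sum the per-chunk results.
combine() {
    awk '{ s += $1 } END { print s + 0 }' "$@"
}

# apply_and_combine CHUNK_DIR: run apply_chunk on every *.chunk file in
# parallel, wait for all of them, then combine the partial results.
apply_and_combine() {
    local dir="$1" out f
    out="$(mktemp -d)" || return 1
    for f in "$dir"/*.chunk; do
        apply_chunk "$f" "$out/$(basename "$f").count" &
    done
    wait
    combine "$out"/*.count
    rm -rf -- "$out"
}
```

Because each chunk writes its own result file, the apply steps do not need to coordinate with each other; only the combine step sees all results.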

Local Disk

  • The hard drive of the node itself: /mnt/local (~100 GB)
  • The amount of local disk space needed must be requested (#PBS -l file=XXgb)
  • Files can be placed in /mnt/local or in a subdirectory /mnt/local/$JOBID
    • /mnt/local is purged after 8 days
    • /mnt/local/$JOBID is purged after job completion
  • Use the TMPDIR environment variable to refer to /mnt/local/$JOBID
  • Data needs to be copied there explicitly (cp in bash, file.copy() in R)
  • TMPDIR currently has permissions 755; use a subdirectory with permissions 700 for sensitive datasets
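
A job script that stages its input on the local disk could be sketched as below. The 20gb request, the helper stage_to_local, and the file names in the usage comment are assumptions for illustration; the fallback to /tmp only exists so the sketch also runs outside a job.

```bash
#!/bin/bash
#PBS -l 'feature=intel16'
#PBS -l file=20gb
# (20gb is a placeholder; request what your chunks actually need.)

# stage_to_local FILE...: copy inputs into a private (700) subdirectory of
# TMPDIR and print the path of that subdirectory. Hypothetical helper.
stage_to_local() {
    local base="${TMPDIR:-/tmp}"
    local private
    private="$(mktemp -d "$base/private.XXXXXX")" || return 1
    chmod 700 "$private"
    cp -- "$@" "$private"/ || return 1
    printf '%s\n' "$private"
}

# Example use inside a job (paths are placeholders):
# workdir="$(stage_to_local /path/to/chunk.bed /path/to/chunk.bim /path/to/chunk.fam)"
# Rscript analysis.R "$workdir"
```

Copying once and then reading from the local copy moves the I/O off the network file system for the rest of the job.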

Load Balancing

  • Chunks should be balanced so that the amount of data needed in /mnt/local and the walltime can be estimated for all jobs
  • BED files are already split into chunks: chromosomes
  • Chromosomes are unbalanced (they differ widely in size)
  • Solution: merge the chromosomes, then split them into balanced BED files
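
One simple way to define balanced chunks, assuming balance is measured by the number of variants, is to cut the merged variant list into ranges whose sizes differ by at most one. The function chunk_ranges below is a hypothetical sketch of that arithmetic, not the pipeline's actual code; the total could come from the line count of the merged variant file.

```bash
#!/bin/bash
# chunk_ranges TOTAL N: print "chunk start end" (1-based, inclusive) for N
# chunks covering variants 1..TOTAL, with sizes differing by at most one.
chunk_ranges() {
    local total="$1" n="$2"
    local base=$(( total / n )) extra=$(( total % n ))
    local start=1 size i
    for (( i = 1; i <= n; i++ )); do
        # The first "extra" chunks each take one additional variant.
        size=$(( base + (i <= extra ? 1 : 0) ))
        printf '%d %d %d\n' "$i" "$start" "$(( start + size - 1 ))"
        start=$(( start + size ))
    done
}

# Example: chunk_ranges 10 3 prints
# 1 1 4
# 2 5 7
# 3 8 10
```

Each range can then be used to extract one balanced BED file from the merged data with the tool of your choice.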

Questions?

Office Hours:
MWF 10 AM - 2 PM