Sept 22, 2017

HPCC

Shared configuration

Please run the following snippet to enable shared configuration:

RS=/mnt/research/quantgen
echo -e "\nsource $RS/tools/configfiles/bash/bashrc" \
    >> ~/.bashrc
echo -e "\nsource $RS/tools/configfiles/bash/bash_profile" \
    >> ~/.bash_profile
touch "$RS/tools/configfiles/bash/subscribers/$USER"
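
Running the snippet above twice appends the source lines twice. A minimal sketch of a re-runnable variant follows; append_once is a hypothetical helper introduced here for illustration and is not part of the shared configuration.

```shell
#!/usr/bin/env bash
# append_once LINE FILE: append LINE to FILE unless that exact line is
# already present (hypothetical helper, not part of the shared configuration).
append_once() {
    local line=$1 file=$2
    touch "$file"
    # -x matches whole lines only, -F treats LINE as a fixed string
    grep -qxF -- "$line" "$file" || printf '\n%s\n' "$line" >> "$file"
}

RS=/mnt/research/quantgen
# Only modify the dotfiles when the research space is actually mounted.
if [ -d "$RS/tools/configfiles/bash" ]; then
    append_once "source $RS/tools/configfiles/bash/bashrc" ~/.bashrc
    append_once "source $RS/tools/configfiles/bash/bash_profile" ~/.bash_profile
    touch "$RS/tools/configfiles/bash/subscribers/$USER"
fi
```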

Benefits: auto-loads R and PLINK, better defaults for working together, easier updates.

See: https://github.com/QuantGen/HPCC

Directory structure of research space

  • datasets
  • projects
  • scratch
  • tools, logs, shares, etc.

See: /mnt/research/quantgen/README

datasets directory

Subdirectories of a dataset:

  • source (read-only, sometimes encrypted and with access control)
    • Files received from institute.
  • derivative (read-only)
    • Files derived from source files that are useful to all members of the group (e.g., file conversions, subset extractions).
  • playyard (read-and-write)
    • The rest: experiments, data pipelines, etc.
    • Deprecated in favor of projects directory.

projects directory

  • project directories (e.g., UKB/landscape)
    • Files to reproduce a particular paper.
  • user directories (e.g., gruenebe)

scratch directory

  • link to group-specific scratch space (/mnt/ls15/scratch/groups/quantgen)
  • follows the same directory structure as research space
  • very fast, lots of space
  • files may be purged after 45 days, so do not treat scratch as permanent storage
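
To see which scratch files are getting close to the purge window, something like the following may help. This is a sketch under assumptions: list_stale_files is a hypothetical helper, the 40-day default is an arbitrary safety margin, and it checks modification time, which may not be the timestamp the purge policy actually uses.

```shell
#!/usr/bin/env bash
# list_stale_files DIR [DAYS]: print regular files under DIR that were last
# modified more than DAYS days ago (default 40; hypothetical helper).
list_stale_files() {
    local dir=$1 days=${2:-40}
    find "$dir" -type f -mtime +"$days" -print
}

# Example call (path taken from the pipeline shown later in these slides):
# list_stale_files /mnt/research/quantgen/scratch/projects/UKB/PIPELINE500 40
```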

Pipelines using scratch

For I/O-heavy projects:

  • Leave code on research space
  • Read input from scratch
  • Write output to scratch
  • Occasionally copy output from scratch to research space (e.g., using a cronjob)
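
The layout above could be wired up roughly as follows. This is only a sketch: run_on_scratch, the INPUT_DIR and OUTPUT_DIR variables, and the input/output subdirectory names are assumptions made for illustration; only the two root paths come from these slides.

```shell
#!/usr/bin/env bash
# Roots of research space and of the group scratch link.
RESEARCH=${RESEARCH:-/mnt/research/quantgen}
SCRATCH=${SCRATCH:-/mnt/research/quantgen/scratch}

# run_on_scratch PROJECT SCRIPT: run SCRIPT, which stays on research space,
# with INPUT_DIR and OUTPUT_DIR pointing at the project's scratch directory.
# (Hypothetical helper; the input/output directory names are assumptions.)
run_on_scratch() {
    local project=$1 script=$2
    local work="$SCRATCH/projects/$project"
    mkdir -p "$work/input" "$work/output"
    INPUT_DIR="$work/input" OUTPUT_DIR="$work/output" \
        bash "$RESEARCH/projects/$project/$script"
}
```

The script itself then only reads from "$INPUT_DIR" and writes to "$OUTPUT_DIR", so the heavy I/O never touches research space.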

cron: periodic job scheduler

$ crontab -l
0 0 * * * /mnt/research/quantgen/tools/cronjobs/ukb-500-output-transfer.sh

$ cat /mnt/research/quantgen/tools/cronjobs/ukb-500-output-transfer.sh
rsync -av /mnt/research/quantgen/scratch/projects/UKB/PIPELINE500/GWAS \
    /mnt/research/quantgen/projects/UKB/PIPELINE500/output/
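
A transfer script for a new project could follow the same pattern with a little logging added. The sketch below is an assumption-laden variant, not the script installed on HPCC: transfer_output and the optional log file argument are hypothetical.

```shell
#!/usr/bin/env bash
# transfer_output SRC DEST [LOG]: copy SRC into DEST with rsync and append
# a timestamped record of what changed to LOG (default: discard the record).
# Hypothetical helper; the exit status is that of rsync.
transfer_output() {
    local src=$1 dest=$2 log=${3:-/dev/null}
    mkdir -p "$dest"
    {
        echo "== $(date '+%F %T') $src -> $dest"
        # -a preserves permissions and timestamps; --itemize-changes lists
        # each file that was transferred or updated
        rsync -a --itemize-changes "$src" "$dest"
    } >> "$log" 2>&1
}

# Example call mirroring the cron job above:
# transfer_output /mnt/research/quantgen/scratch/projects/UKB/PIPELINE500/GWAS \
#     /mnt/research/quantgen/projects/UKB/PIPELINE500/output/
```

The schedule 0 0 * * * in the crontab entry runs the job once a day at midnight.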

Let me know if you need help setting this up.

UK Biobank

UKB

The 500k dataset has been released.

Sample sizes

  • All: 488,377
  • White British: 409,703

Number of SNPs

Calls: 805,426

Data location on HPCC

Genotype Calls: /mnt/research/quantgen/datasets/UKB/source/genotypes/calls500

Phenotypes: /mnt/research/quantgen/datasets/UKB/source/phenotypes (no changes)

Genotype-derived phenotypes: /mnt/research/quantgen/datasets/UKB/source/genotypes/sample_qc

Where is the imputed data?

Problem:

> The genetic data was imputed using two different reference panels. The
> Haplotype Reference Consortium (HRC) panel was used as first choice
> option, but for SNPs not in that reference panel the UK10K + 1000 Genomes
> panel was used. The problem arose in the second set of imputed data from
> the UK10K + 1000 Genomes panel. The genotypes at these SNPs are imputed
> correctly, but have not been recorded as having the correct genome
> position in the files.

> For now we recommend that researchers focus exclusively on SNPs in the
> HRC panel, or work with the directly genotyped data until the new release
> is available.

http://www.ukbiobank.ac.uk/2017/07/important-note-about-imputed-genetics-data/

What's different?

  • XY chromosome as a separate file
  • Calls are now distributed in BED format
  • No new phenotype file: for updates to genotype-derived phenotypes (e.g., ethnic grouping, PCs, …) check the sample QC file
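
Because the calls are PLINK BED filesets, sample and variant counts can be read from the text files that accompany each .bed file: a .fam file has one line per sample and a .bim file has one line per variant. A minimal sketch; count_bed_fileset is a hypothetical helper, and the fileset prefix is left as a placeholder because the file names under calls500 are not listed here.

```shell
#!/usr/bin/env bash
# count_bed_fileset PREFIX: print "<samples> <variants>" for the PLINK
# fileset PREFIX.bed / PREFIX.bim / PREFIX.fam (hypothetical helper).
count_bed_fileset() {
    local prefix=$1
    local samples variants
    samples=$(wc -l < "$prefix.fam")   # one line per sample
    variants=$(wc -l < "$prefix.bim")  # one line per variant
    # Arithmetic expansion strips any padding that wc adds to its output.
    echo "$((samples)) $((variants))"
}

# Example call (replace <prefix> with an actual fileset prefix):
# count_bed_fileset /mnt/research/quantgen/datasets/UKB/source/genotypes/calls500/<prefix>
```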

What has been done so far?

derivative directory: /mnt/research/quantgen/datasets/UKB/derivative/

  • BED/calls500_unfiltered (renamed original BED files)
  • cohorts/calls500_unfiltered/whites (white cohort, FID IID)
  • relabeled_phenotypes (uses labels instead of cryptic field IDs)
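
Cohort files with FID and IID columns are in the format that PLINK's --keep option expects, so a BED fileset can be restricted to a cohort along these lines. This is a sketch, not necessarily how the derivative files were produced: subset_to_cohort is a hypothetical helper and all three arguments are placeholders.

```shell
#!/usr/bin/env bash
# subset_to_cohort BED_PREFIX COHORT_FILE OUT_PREFIX: write a new BED fileset
# containing only the samples listed (FID IID per line) in COHORT_FILE.
# Hypothetical helper; requires plink on the PATH.
subset_to_cohort() {
    local bed_prefix=$1 cohort_file=$2 out_prefix=$3
    plink --bfile "$bed_prefix" \
          --keep "$cohort_file" \
          --make-bed \
          --out "$out_prefix"
}

# Example call (all three arguments are placeholders):
# subset_to_cohort <bed-prefix> <cohort-file> <out-prefix>
```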

Project directory: /mnt/research/quantgen/projects/UKB/PIPELINE500

  • BED (whites only, minor QC)
  • phenotypes and phenotypes_genetic (whites only)
  • adjusted_phenotypes (height)
  • cohorts (genotyped_white, genotyped_white_related, genotyped_white_unrelated)
  • BGData, summaries
  • GMatrix and related_pairs
  • sample_sets, GWAS
  • ld, markers

Questions?

Office Hours:
Tue, Wed, Thu, 10 AM - 11 AM