tidyGWAS
Installation
tidyGWAS is not yet on CRAN. To install, you need to use
devtools::install_github()
or
remotes::install_github()
devtools::install_github("ararder/tidyGWAS")
remotes::install_github("ararder/tidyGWAS")
Second, you will need to download the reference data. tidyGWAS uses a slightly edited version of dbSNP v155, which is available here.
Quick start
This is what a typical call to tidyGWAS()
will look
like. A detailed explanation of the different arguments follows.
# set up a directory where we will store all summary statistics cleaned by tidyGWAS
gwas_folder <- tempfile()
# provide the filepath to a directory that tidyGWAS will create
outdir <- file.path(gwas_folder, "gwasName")
cleaned <- tidyGWAS(
tbl = "/filepath/to/sumstats",
# we read in a tab-separated file, so provide tab as the delimiter
delim = "\t",
dbsnp_path = dbsnp_path,
# map tidyGWAS column names to the column names in the input file
column_names = list(
"CHR" = "CHROM",
"POS" = "BP",
"EffectAllele" = "A1",
"B" = "Effect"
),
# sample sizes are missing from the file, so set them manually
CaseN = 54000,
ControlN = 73000,
logfile = TRUE,
output_dir = outdir
)
Tutorial
Provide tidyGWAS either with an in-memory
dplyr::tibble()
or a filepath.
- Here a filepath to a gzipped tsv file is passed, along with delim = "\t". tidyGWAS uses arrow::read_delim_arrow() to read in files from disk. If you have trouble parsing the file correctly, use another package such as data.table::fread() or readr::read_table() to first read the file into memory, and then pass the in-memory data.frame to tidyGWAS.
- dbsnp_path = dbsnp_path gives the filepath to the reference data.
- output_dir = out gives the filepath to the output directory that tidyGWAS will create. The directory cannot exist yet, to protect against accidentally overwriting files. If no argument is passed to output_dir, tidyGWAS will create a folder in the R tempdir, meaning that any cleaned summary statistics will be deleted when the R session ends.
library(tidyGWAS)
# we use the dummy version of dbSNP that comes with the package
dbsnp_path <- system.file("extdata/dbSNP155", package = "tidyGWAS")
# a dummy sumstats with 100 000 rows
gwas <- system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS")
# store the results in a temporary directory
out <- tempfile()
tidyGWAS(
tbl = gwas,
dbsnp_path = dbsnp_path,
output_dir = out
)
#> Parsing input summary statistics...
#>
#>
#> ── Running tidyGWAS 0.9.9.2 ────────────────────────────────────────────────────
#>
#> Starting at 2025-07-11 17:56:58.848487
#> with 100000 rows in input data.frame
#> ℹ Saving output in folder: /tmp/RtmpsKC2ps/file20988fd2438
#>
#>
#>
#> ── The detected format is "tidyGWAS"
#>
#> Was able to map 12 out of 12 to the "tidyGWAS" format
#> ✔ CHR -> CHR
#>
#> ✔ POS -> POS
#>
#> ✔ RSID -> RSID
#>
#> ✔ EffectAllele -> EffectAllele
#>
#> ✔ OtherAllele -> OtherAllele
#>
#> ✔ B -> B
#>
#> ✔ SE -> SE
#>
#> ✔ EAF -> EAF
#>
#> ✔ P -> P
#>
#> ✔ CaseN -> CaseN
#>
#> ✔ ControlN -> ControlN
#>
#> ✔ INFO -> INFO
#>
#>
#>
#> ── Checking that columns follow tidyGWAS format
#>
#> ✔ The following columns are used for further steps: CHR, POS, RSID, EffectAllele, OtherAllele, B, SE, EAF, INFO, P, CaseN, ControlN, and rowid
#>
#>
#>
#> ── Checking for columns with all NA
#>
#> ✔ Found no columns with all NA
#>
#> ℹ Found CaseN and ControlN, and no effective N:
#> Calculating EffectiveN by `EffectiveN = 4 / (1 / ControlN + 1 / CaseN)`
#>
#> ℹ Found CaseN and ControlN, Calculating N by `N = ControlN + CaseN`
#>
#>
#>
#> ── 1) Scanning for indels ──
#>
#>
#>
#> ✔ No rows contained missing values in EffectAllele and OtherAllele
#>
#> 1. EffectAllele or OtherAllele, character length > 1: A vs AA
#>
#> 2. EffectAllele or OtherAllele coded as 'D', 'I', or 'R'
#>
#> ✔ Detected 0 rows as indels
#>
#>
#>
#> ── 2) Scanning for rows with NA in critical columns ──
#>
#>
#>
#> ✔ No rows contained missing values in CHR and POS
#>
#>
#>
#> ── 3) Scanning for rows with duplications ──
#>
#>
#>
#> ℹ Looking for duplications with columns: CHR, POS, EffectAllele, and OtherAllele
#>
#> ✔ Found no duplications
#>
#>
#>
#> ── 4) Validating columns
#>
#> ℹ The median value of B is -0.0012, which seems reasonable
#>
#>
#>
#> ── All rows passed validation
#>
#>
#>
#> ── 5) Adding RSID based on CHR:POS. Adding dbSNP based QC flags
#>
#> ℹ Inferring build by matching 10000 rows to GRCh37 and GRCh38
#>
#> 99 snps matched GRCh38, 9998 for GRCh37, inferring build to be 37
#> ! Removed 21 rows with no dbSNP entry or with incompat alleles
#>
#> /tmp/RtmpsKC2ps/file20988fd2438/pipeline_info/removed_nodbsnp.parquet
#>
#>
#> ── 6) Repairing missings statistics columns if possible ──
#>
#>
#>
#> ℹ Z missing: Calculating Z using the formula: Z = B / SE
#>
#>
#>
#> ── Finished repair_stats:
#>
#> ℹ Added 1 new columns: Z
#>
#>
#>
#> ── Listing final breakdown of removed rows:
#>
#> nodbsnp: 21
#>
#>
#>
#> ── Finished tidyGWAS ───────────────────────────────────────────────────────────
#>
#> ℹ A total of 21 rows were removed
#>
#> ℹ Total running time: 3.8s
#>
#> Saving metadata from analysis to /tmp/RtmpsKC2ps/file20988fd2438/metadata.yaml
#> CHR POS_37 EffectAllele OtherAllele rowid B P
#> <char> <int> <char> <char> <int> <num> <num>
#> 1: 6 27521096 C T 77548 0.2120017 4.830e-39
#> 2: 6 28080777 T C 36396 0.2066970 4.672e-38
#> 3: 6 27480526 C T 77992 0.2062008 7.920e-38
#> 4: 6 27835272 T C 67347 0.1714038 1.454e-32
#> 5: 6 26328353 T C 7577 0.1908022 8.936e-30
#> ---
#> 99975: 11 33865194 A T 10863 0.0000000 1.000e+00
#> 99976: 8 126178011 A G 31103 0.0000000 1.000e+00
#> 99977: 3 7984591 G A 38135 0.0000000 1.000e+00
#> 99978: 19 37077172 A G 45371 0.0000000 1.000e+00
#> 99979: 1 191825752 A G 50847 0.0000000 1.000e+00
#> SE INFO CaseN ControlN EAF EffectiveN N POS_38
#> <num> <num> <int> <int> <num> <int> <int> <int>
#> 1: 0.0162 0.990 53386 77258 0.9189468 126281 130644 27553317
#> 2: 0.0160 1.000 53386 77258 0.9179468 126281 130644 28112999
#> 3: 0.0160 0.993 53386 77258 0.9175382 126281 130644 27512747
#> 4: 0.0144 1.000 53386 77258 0.8981296 126281 130644 27867494
#> 5: 0.0168 0.996 53386 77258 0.9261296 126281 130644 26328125
#> ---
#> 99975: 0.0122 0.670 53386 77258 0.2471827 126281 130644 33843648
#> 99976: 0.0104 0.996 53386 77258 0.7801827 126281 130644 125165769
#> 99977: 0.0286 0.986 52978 76825 0.9764081 125421 129803 7942904
#> 99978: 0.0163 0.998 53386 77258 0.9250000 126281 130644 36586270
#> 99979: 0.0097 1.000 53386 77258 0.2654086 126281 130644 191856622
#> RSID REF_37 REF_38 multi_allelic Z
#> <char> <char> <char> <lgcl> <num>
#> 1: rs35848276 C C FALSE 13.08653
#> 2: rs68188794 T T FALSE 12.91857
#> 3: rs34573979 C C FALSE 12.88755
#> 4: rs200948 T T FALSE 11.90304
#> 5: rs34107459 T T FALSE 11.35727
#> ---
#> 99975: rs7942366 A A FALSE 0.00000
#> 99976: rs7813225 A A FALSE 0.00000
#> 99977: rs140518486 G G FALSE 0.00000
#> 99978: rs12977285 A A FALSE 0.00000
#> 99979: rs9283418 A A FALSE 0.00000
Output files
tidyGWAS returns all its output in a directory.
fs::dir_tree(out)
#> /tmp/RtmpsKC2ps/file20988fd2438
#> ├── metadata.yaml
#> ├── pipeline_info
#> │ ├── removed_nodbsnp.parquet
#> │ └── removed_validate_chr_pos_path.parquet.parquet
#> ├── raw
#> │ └── sumstats.tsv.gz.parquet
#> └── tidyGWAS_hivestyle
#> ├── CHR=1
#> │ └── part-0.parquet
#> ├── CHR=10
#> │ └── part-0.parquet
#> ├── CHR=11
#> │ └── part-0.parquet
#> ├── CHR=12
#> │ └── part-0.parquet
#> ├── CHR=13
#> │ └── part-0.parquet
#> ├── CHR=14
#> │ └── part-0.parquet
#> ├── CHR=15
#> │ └── part-0.parquet
#> ├── CHR=16
#> │ └── part-0.parquet
#> ├── CHR=17
#> │ └── part-0.parquet
#> ├── CHR=18
#> │ └── part-0.parquet
#> ├── CHR=19
#> │ └── part-0.parquet
#> ├── CHR=2
#> │ └── part-0.parquet
#> ├── CHR=20
#> │ └── part-0.parquet
#> ├── CHR=21
#> │ └── part-0.parquet
#> ├── CHR=22
#> │ └── part-0.parquet
#> ├── CHR=3
#> │ └── part-0.parquet
#> ├── CHR=4
#> │ └── part-0.parquet
#> ├── CHR=5
#> │ └── part-0.parquet
#> ├── CHR=6
#> │ └── part-0.parquet
#> ├── CHR=7
#> │ └── part-0.parquet
#> ├── CHR=8
#> │ └── part-0.parquet
#> └── CHR=9
#> └── part-0.parquet
That’s a lot of files!
- pipeline_info/removed_* stores the rows that were removed.
- metadata.yaml contains metadata about the execution.
- raw contains the summary statistics before any munging was done. Useful to reproduce a run, or to identify why rows were removed.
- tidyGWAS_hivestyle contains the cleaned summary statistics, by default in something called a hive-style partition. The motivation for this is detailed further down in the vignette. If you just want a single standard file, use output_format = "csv" or output_format = "parquet".
To read in the cleaned summary statistics:
df <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |> dplyr::collect()
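Because the output is partitioned by chromosome, arrow can also push a filter down to the individual files, so only the partitions you need are read. A minimal sketch, assuming the same `out` directory as above (note that arrow infers the type of the `CHR` partition key from the directory names):

```r
# read only chromosome 6; arrow skips all other CHR=* partitions
chr6 <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |>
  dplyr::filter(CHR == 6) |>
  dplyr::collect()
```

For genome-wide work, collecting the whole dataset as shown above is still a single call; the partitioning mainly pays off for per-chromosome queries.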
Setting column names
Almost certainly, the summary statistics you want to clean will not have the same column names as the example file.
sumstats <- read.table(gwas, header = TRUE) |> tibble::as_tibble()
head(sumstats)
#> # A tibble: 6 × 13
#> CHR POS RSID EffectAllele OtherAllele B SE EAF INFO
#> <int> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 31819813 rs1839882… C G -0.00570 0.0292 0.975 0.936
#> 2 16 13991338 rs8044430 T C -0.0177 0.0092 0.641 0.951
#> 3 9 85016930 rs1114040… A T -0.00230 0.0185 0.916 0.701
#> 4 19 15743403 rs4807961 C G 0.00240 0.0099 0.256 0.996
#> 5 18 5948313 rs948293 T C 0.00170 0.0092 0.332 0.971
#> 6 13 47224772 rs837 G A 0.0272 0.0086 0.447 1.01
#> # ℹ 4 more variables: P <dbl>, CaseN <int>, ControlN <int>, rowid <int>
colnames(sumstats)
#> [1] "CHR" "POS" "RSID" "EffectAllele" "OtherAllele"
#> [6] "B" "SE" "EAF" "INFO" "P"
#> [11] "CaseN" "ControlN" "rowid"
To prevent files from being parsed incorrectly,
tidyGWAS()
will not guess
which columns are which. tidyGWAS therefore requires the column names
either to match its native format, or to come with a mapping between tidyGWAS column
names and the column names in the input file. This is done using the
column_names
argument, which takes a named list, where the
names are the tidyGWAS columns and the values are the corresponding columns
in the input file.
# what if the names were all wrong?
sumstats_with_wrong_names <- dplyr::rename(
sumstats,
CHROM = CHR,
BP = POS,
ID = RSID,
A1 = EffectAllele,
A2 = OtherAllele,
EFFECT = B
)
tidyGWAS(
tbl = sumstats_with_wrong_names,
dbsnp_path = dbsnp_path,
column_names = list(
CHR = "CHROM",
POS = "BP",
RSID = "ID",
EffectAllele = "A1",
OtherAllele = "A2",
B = "EFFECT"
)
)
The tidyGWAS columns
tidyGWAS uses the following column names:
CHR
POS
RSID
EffectAllele
OtherAllele
EAF
B
SE
P
CaseN
ControlN
N
INFO
Z
Note:
If your RSID column is in the format CHR:BP:A1:A2, you can still pass it as an RSID column.
Inputting sample size columns
Often, the sample size columns are missing from the summary statistics and have to be provided manually. tidyGWAS has three arguments that can be used to manually set the sample size if it is missing from the original file:
CaseN
ControlN
N
cleaned <- tidyGWAS(
tbl = sumstats,
dbsnp_path = dbsnp_path,
# CaseN, ControlN and N can all be used to set sample size
CaseN = 400,
ControlN = 800,
N = 1200
)
Reading in files
If you pass a filepath to tidyGWAS, it will attempt to read in the
file with arrow::read_delim_arrow()
The default delimiter is white space, so to read in comma-separated
files or tab-separated files, you can provide the delim
argument to arrow::read_delim_arrow()
through
...
cleaned <- tidyGWAS(
tbl = "filepath/to/gwas/ondisk.csv",
# here we specify the delimiter for csv files
delim = ",",
dbsnp_path = dbsnp_path
)
There are many different field delimiters in the wild, and
sometimes you can struggle to find the correct one. In such
cases, it’s often more convenient to use the efficient
data.table::fread()
or readr::read_table()
to
first read the summary statistics into memory before passing them to
tidyGWAS.
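A minimal sketch of that fallback, assuming a hypothetical filepath and a `dbsnp_path` set up as in the tutorial:

```r
# fread() guesses the delimiter automatically, which often succeeds
# where a manually supplied delim does not
sumstats <- data.table::fread("/filepath/to/sumstats.txt")

# pass the in-memory data.frame instead of a filepath
cleaned <- tidyGWAS(tbl = sumstats, dbsnp_path = dbsnp_path)
```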
Hivestyle-partitioning
The default output format is a hive-style partition of .parquet files. See for example here for some motivation for this. In essence, this format can significantly speed up downstream applications such as meta-analysis, LD querying, and other analyses.
Variant identity
tidyGWAS will add the following columns dealing with variant identity, in addition to saving and validating all valid columns in the input file.
- CHR: chromosome. The same across both builds.
- POS_38, POS_37: genomic position on GRCh38 and GRCh37.
- RSID: variant ID from dbSNP.
- REF_37, REF_38: the reference genome allele on GRCh37 and GRCh38.
- multi_allelic: a TRUE/FALSE column that flags rows that were multi-allelic IN the summary statistics, NOT whether there are multiple alleles in dbSNP (TRUE corresponds to multi-allelic).
- rowid: maps each row back to the input summary statistics file in raw, so that any row can be traced back to its original values.
- indel: TRUE/FALSE, whether the variant is an INsertion/DELetion.
- All other valid columns in the input summary statistics file.
If possible,
repair_stats()
will add statistics columns such as B, P, SE, and Z, if they are missing and possible to repair.
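As a sketch, the rowid column can be used to trace cleaned rows back to their raw values, assuming the same `out` directory and the raw parquet filename shown in the output tree above:

```r
# raw input as saved by tidyGWAS, before any munging
raw <- arrow::read_parquet(paste0(out, "/raw/sumstats.tsv.gz.parquet"))

# cleaned output
cleaned <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |>
  dplyr::collect()

# original values of the rows that survived cleaning
kept_raw <- dplyr::semi_join(raw, cleaned, by = "rowid")
```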
Parallel computation
tidyGWAS automatically detects the number of cores. In some cases,
for example when running tidyGWAS on an HPC cluster, you might need to
manually set the number of cores, which can be done using the
OMP_NUM_THREADS
environment variable. This should not be set to more cores
than you have requested in your HPC job (in the
example below, the --cpus-per-task flag).
#SBATCH --mem=60gb
#SBATCH --time=24:0:00
#SBATCH --cpus-per-task 8
export OMP_NUM_THREADS=8
workdir=$(pwd)
gwas=$workdir/my_gwas.tsv.gz
dbsnp_files="dbSNP155"
# output_dir must not exist yet, so point it at a new subdirectory
outdir=$workdir/cleaned_gwas
Rscript -e "tidyGWAS::tidyGWAS(commandArgs(trailingOnly = TRUE)[1], dbsnp_path = commandArgs(trailingOnly = TRUE)[2], output_dir = commandArgs(trailingOnly = TRUE)[3], logfile = TRUE)" $gwas $dbsnp_files $outdir
Computational cost and memory usage
Memory use and running time scale with the size of the summary statistics. Here is a rough estimate from running tidyGWAS. In particular, if the summary statistics are missing CHR and POS, tidyGWAS will require additional memory.
1. 10 million rows: ~5-15 GB
2. 40 million rows: ~10-40 GB
3. 60 million rows: ~65-85 GB