tidyGWAS
Installation
tidyGWAS is not yet on CRAN. To install, you need to use devtools::install_github() or remotes::install_github():
devtools::install_github("ararder/tidyGWAS")
remotes::install_github("ararder/tidyGWAS")Secondly, you will need to download the reference data. tidyGWAS uses a slightly edited version of dbSNP v155, that is available here.
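After downloading and unpacking the reference data, point tidyGWAS at it through the dbsnp_path argument. A minimal sketch (the path below is only a placeholder for wherever you stored the download):
# placeholder path to the downloaded dbSNP 155 reference data
dbsnp_path <- "/path/to/dbSNP155"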
NEWS
With the release of tidyGWAS 1.0, the package now supports automatic column-name guessing as well as automatic detection of the file delimiter, using data.table::fread(). This means that the delim argument has been superseded and it is often no longer necessary to pass the input column names with the column_names argument. But remember to double-check the logfile to confirm that the column names were parsed correctly.
Quick start
This is what a typical call to tidyGWAS() will look
like. A detailed explanation of the different arguments follows.
# set up a directory where we will store all summary statistics cleaned by tidyGWAS
gwas_folder <- tempfile()
# the filepath of a directory that tidyGWAS will create for this GWAS
outdir <- file.path(gwas_folder, "gwasName")

cleaned <- tidyGWAS(
  tbl = "/filepath/to/sumstats",
  dbsnp_path = dbsnp_path,
  # the sample size columns are missing from the file, so set them manually
  CaseN = 54000,
  ControlN = 73000,
  logfile = TRUE,
  output_dir = outdir
)
Tutorial
tidyGWAS now supports downloading files directly from the GWAS Catalog, or from a URL.
The first argument to tidyGWAS is tbl, and should be one of:
- A GWAS Catalog study ID, for example “GCST90101808”. The full summary statistics need to be available.
- A URL to a downloadable file, for example the FinnGEN R12 depression release.
- A local filepath, for example “~/summary_stats/raw/trait_x.tsv.gz”.
- An in-memory data.frame.
library(tidyGWAS)
# we use the dummy version of dbSNP that comes with the package
dbsnp_path <- system.file("extdata/dbSNP155", package = "tidyGWAS")
# a dummy sumstats with 100 000 rows
gwas <- system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS")
# store the results in a temporary directory
out <- tempfile()
# using a GWAS catalog study ID
tidyGWAS(
tbl = "GCST90101808",
dbsnp_path = dbsnp_path,
output_dir = "finnGEN_MDD_R12"
)
# using a URL
tidyGWAS(
tbl = "https://storage.googleapis.com/finngen-public-data-r12/summary_stats/release/finngen_R12_F5_DEPRESSIO.gz",
dbsnp_path = dbsnp_path,
output_dir = "finnGEN_MDD_R12"
)
# using a local filepath
gwas <- system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS")
tidyGWAS(
tbl = gwas,
dbsnp_path = dbsnp_path,
output_dir = "test_sumstat"
)
# or an in-memory data.frame
df <- read.table(gwas, header = TRUE, sep = "\t")
tidyGWAS(
tbl = df,
dbsnp_path = dbsnp_path,
output_dir = "test_sumstat"
)
An example call
Here’s what the output of tidyGWAS looks like, using the test file that ships with the package:
library(tidyGWAS)
tidyGWAS(
tbl = system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS"),
dbsnp_path = system.file("extdata/dbSNP155", package = "tidyGWAS"),
output_dir = tempfile()
)
#> Parsing input summary statistics...
#>
#>
#> ── Running tidyGWAS 1.0.0 ──────────────────────────────────────────────────────
#>
#> Starting at 2025-11-29 11:50:40.418185
#> with 100000 rows in input data.frame
#> ℹ Saving output in folder: /tmp/Rtmp2qwFE4/file1a7bba9964b
#>
#>
#>
#> ── The detected format is "tidyGWAS"
#>
#> Was able to map 12 out of 12 to the "tidyGWAS" format
#> ✔ CHR -> CHR
#>
#> ✔ POS -> POS
#>
#> ✔ RSID -> RSID
#>
#> ✔ EffectAllele -> EffectAllele
#>
#> ✔ OtherAllele -> OtherAllele
#>
#> ✔ B -> B
#>
#> ✔ SE -> SE
#>
#> ✔ EAF -> EAF
#>
#> ✔ P -> P
#>
#> ✔ CaseN -> CaseN
#>
#> ✔ ControlN -> ControlN
#>
#> ✔ INFO -> INFO
#>
#>
#>
#> ── Checking that columns follow tidyGWAS format
#>
#> ✔ The following columns are used for further steps: CHR, POS, RSID, EffectAllele, OtherAllele, B, SE, EAF, INFO, P, CaseN, ControlN, and rowid
#>
#>
#>
#> ── Checking for columns with all NA
#>
#> ✔ Found no columns with all NA
#>
#> ℹ Found CaseN and ControlN, and no effective N:
#> Calculating EffectiveN by `EffectiveN = 4 / (1 / ControlN + 1 / CaseN)`
#>
#> ℹ Found CaseN and ControlN, Calculating N by `N = ControlN + CaseN`
#>
#>
#>
#> ── 1) Scanning for rows with NA in critical columns ──
#>
#>
#>
#> ✔ No rows contained missing values in CHR, POS, EffectAllele, and OtherAllele
#>
#>
#>
#> ── 2) Scanning for rows with duplications ──
#>
#>
#>
#> ℹ Looking for duplications with columns: CHR, POS, EffectAllele, and OtherAllele
#>
#> ✔ Found no duplications
#>
#>
#>
#> ── 3) Scanning for indels ──
#>
#>
#>
#> 1. EffectAllele or OtherAllele, character length > 1: A vs AA
#>
#> 2. EffectAllele or OtherAllele coded as 'D', 'I', or 'R'
#>
#> ✔ Detected 0 rows as indels
#>
#>
#>
#> ── 4) Validating columns
#>
#> ℹ The median value of B is -0.0012, which seems reasonable
#>
#>
#>
#> ── All rows passed validation
#>
#>
#>
#> ── 5) Adding RSID based on CHR:POS. Adding dbSNP based QC flags
#>
#> ℹ Inferring build by matching 10000 rows to GRCh37 and GRCh38
#>
#> 99 snps matched GRCh38, 9998 for GRCh37, inferring build to be 37
#> ℹ 0 rows had EffectAllele and OtherAllele not matching dbSNP REF/ALT alleles,but which matched after strand-flipping
#>
#> ! Removed 21 rows with no dbSNP entry or with incompat alleles
#>
#> /tmp/Rtmp2qwFE4/file1a7bba9964b/pipeline_info/removed_nodbsnp.parquet
#>
#>
#> ── 6) Repairing missings statistics columns if possible ──
#>
#>
#>
#> ℹ Z missing: Calculating Z using the formula: Z = B / SE
#>
#>
#>
#> ── Finished repair_stats:
#>
#> ℹ Added 1 new columns: Z
#>
#>
#>
#> ── Listing final breakdown of removed rows:
#>
#> nodbsnp: 21
#>
#>
#>
#> ── Finished tidyGWAS ───────────────────────────────────────────────────────────
#>
#> ℹ A total of 21 rows were removed
#>
#> ℹ Total running time: 3.7s
#>
#> Saving metadata from analysis to /tmp/Rtmp2qwFE4/file1a7bba9964b/metadata.yaml
#> # A tibble: 99,979 × 21
#> CHR POS_37 EffectAllele OtherAllele rowid B P SE INFO CaseN
#> <chr> <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 6 2.75e7 C T 77548 0.212 4.83e-39 0.0162 0.99 53386
#> 2 6 2.81e7 T C 36396 0.207 4.67e-38 0.016 1 53386
#> 3 6 2.75e7 C T 77992 0.206 7.92e-38 0.016 0.993 53386
#> 4 6 2.78e7 T C 67347 0.171 1.45e-32 0.0144 1 53386
#> 5 6 2.63e7 T C 7577 0.191 8.94e-30 0.0168 0.996 53386
#> 6 6 3.08e7 G A 26615 0.157 9.77e-28 0.0144 0.954 53386
#> 7 6 2.83e7 C T 91092 0.116 4.33e-27 0.0108 1.01 53386
#> 8 6 2.64e7 C T 84820 0.151 9.52e-27 0.0141 1 53386
#> 9 6 3.11e7 C T 17683 0.143 1.54e-25 0.0137 1 53386
#> 10 6 3.03e7 A G 71635 0.154 2.58e-25 0.0148 0.992 49683
#> # ℹ 99,969 more rows
#> # ℹ 11 more variables: ControlN <int>, EAF <dbl>, EffectiveN <int>, N <int>,
#> # POS_38 <int>, RSID <chr>, REF_37 <chr>, REF_38 <chr>, indel <lgl>,
#> #   multi_allelic <lgl>, Z <dbl>
Manually setting column names
tidyGWAS will attempt to guess the column names of the input file. Always take a look at the logfile the first time you use a particular format. Sometimes the automatic parsing fails, as in the example below:
testdf <- arrow::read_tsv_arrow(system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS"))
# nonsense names
names(testdf) <- c("CHROM_X", "BP_NEW", "RsID", "A1", "A2", "B","SE", "EAF","INFO", "P", "CaseN", "ControlN", "rowid")
# wrap the call in tryCatch() so the example keeps running when parsing fails
tryCatch(
tidyGWAS(
tbl = testdf,
dbsnp_path = system.file("extdata/dbSNP155", package = "tidyGWAS"),
output_dir = tempfile()
),
error = function(e) {
message("tidyGWAS failed: ", e$message)
NULL
}
)
#> Parsing input summary statistics...
#>
#>
#> ── Running tidyGWAS 1.0.0 ──────────────────────────────────────────────────────
#>
#> Starting at 2025-11-29 11:50:44.335313
#> with 100000 rows in input data.frame
#> ℹ Saving output in folder: /tmp/Rtmp2qwFE4/file1a7b4c118c15
#>
#>
#>
#> ── The detected format is "tidyGWAS"
#>
#> Was able to map 7 out of 12 to the "tidyGWAS" format
#> ✔ B -> B
#>
#> ✔ SE -> SE
#>
#> ✔ EAF -> EAF
#>
#> ✔ P -> P
#>
#> ✔ CaseN -> CaseN
#>
#> ✔ ControlN -> ControlN
#>
#> ✔ INFO -> INFO
#>
#> ! Failed to map: "CHROM_X", "BP_NEW", "RsID", "A1", and "A2"
#>
#> ℹ Attempting to map: "CHROM_X", "BP_NEW", "RsID", "A1", and "A2" to a dictionary of column names
#>
#> ✔ EffectAllele -> A1
#>
#> ✔ OtherAllele -> A2
#>
#> ! extra column(s) "CHROM_X", "BP_NEW", and "RsID" were not mapped to a tidyGWAS column
#>
#>
#>
#> ── Checking that columns follow tidyGWAS format
#>
#> ✔ The following columns are used for further steps: EffectAllele, OtherAllele, B, SE, EAF, INFO, P, CaseN, ControlN, and rowid
#>
#> ✖ Removed columns: CHROM_X, BP_NEW, and RsID
#>
#>
#>
#> ── Checking for columns with all NA
#>
#> ✔ Found no columns with all NA
#>
#> tidyGWAS failed: Either CHR and POS or RSID are required columns
#> NULL
If you use column_names, those columns will be renamed, followed by an automatic attempt at parsing the remaining column names.
testdf <- arrow::read_tsv_arrow(system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS"))
# nonsense names
names(testdf) <- c("CHROM_X", "BP_NEW", "RsID", "A1", "A2", "B","SE", "EAF","INFO", "P", "CaseN", "ControlN", "rowid")
out <- tempfile()
tidyGWAS(
tbl = testdf,
dbsnp_path = system.file("extdata/dbSNP155", package = "tidyGWAS"),
output_dir = out,
column_names = list(
CHR = "CHROM_X",
POS = "BP_NEW",
RSID = "RsID"
)
)
#> Parsing input summary statistics...
#>
#>
#> ── Running tidyGWAS 1.0.0 ──────────────────────────────────────────────────────
#>
#> Starting at 2025-11-29 11:50:44.674695
#> with 100000 rows in input data.frame
#> ℹ Saving output in folder: /tmp/Rtmp2qwFE4/file1a7b47f6f2b9
#>
#>
#>
#> ── The detected format is "tidyGWAS"
#>
#> Was able to map 10 out of 12 to the "tidyGWAS" format
#> ✔ CHR -> CHR
#>
#> ✔ POS -> POS
#>
#> ✔ RSID -> RSID
#>
#> ✔ B -> B
#>
#> ✔ SE -> SE
#>
#> ✔ EAF -> EAF
#>
#> ✔ P -> P
#>
#> ✔ CaseN -> CaseN
#>
#> ✔ ControlN -> ControlN
#>
#> ✔ INFO -> INFO
#>
#> ! Failed to map: "A1" and "A2"
#>
#> ℹ Attempting to map: "A1" and "A2" to a dictionary of column names
#>
#> ✔ EffectAllele -> A1
#>
#> ✔ OtherAllele -> A2
#>
#>
#>
#> ── Checking that columns follow tidyGWAS format
#>
#> ✔ The following columns are used for further steps: CHR, POS, RSID, EffectAllele, OtherAllele, B, SE, EAF, INFO, P, CaseN, ControlN, and rowid
#>
#>
#>
#> ── Checking for columns with all NA
#>
#> ✔ Found no columns with all NA
#>
#> ℹ Found CaseN and ControlN, and no effective N:
#> Calculating EffectiveN by `EffectiveN = 4 / (1 / ControlN + 1 / CaseN)`
#>
#> ℹ Found CaseN and ControlN, Calculating N by `N = ControlN + CaseN`
#>
#>
#>
#> ── 1) Scanning for rows with NA in critical columns ──
#>
#>
#>
#> ✔ No rows contained missing values in CHR, POS, EffectAllele, and OtherAllele
#>
#>
#>
#> ── 2) Scanning for rows with duplications ──
#>
#>
#>
#> ℹ Looking for duplications with columns: CHR, POS, EffectAllele, and OtherAllele
#>
#> ✔ Found no duplications
#>
#>
#>
#> ── 3) Scanning for indels ──
#>
#>
#>
#> 1. EffectAllele or OtherAllele, character length > 1: A vs AA
#>
#> 2. EffectAllele or OtherAllele coded as 'D', 'I', or 'R'
#>
#> ✔ Detected 0 rows as indels
#>
#>
#>
#> ── 4) Validating columns
#>
#> ℹ The median value of B is -0.0012, which seems reasonable
#>
#>
#>
#> ── All rows passed validation
#>
#>
#>
#> ── 5) Adding RSID based on CHR:POS. Adding dbSNP based QC flags
#>
#> ℹ Inferring build by matching 10000 rows to GRCh37 and GRCh38
#>
#> 82 snps matched GRCh38, 9997 for GRCh37, inferring build to be 37
#> ℹ 0 rows had EffectAllele and OtherAllele not matching dbSNP REF/ALT alleles,but which matched after strand-flipping
#>
#> ! Removed 21 rows with no dbSNP entry or with incompat alleles
#>
#> /tmp/Rtmp2qwFE4/file1a7b47f6f2b9/pipeline_info/removed_nodbsnp.parquet
#>
#>
#> ── 6) Repairing missings statistics columns if possible ──
#>
#>
#>
#> ℹ Z missing: Calculating Z using the formula: Z = B / SE
#>
#>
#>
#> ── Finished repair_stats:
#>
#> ℹ Added 1 new columns: Z
#>
#>
#>
#> ── Listing final breakdown of removed rows:
#>
#> nodbsnp: 21
#>
#>
#>
#> ── Finished tidyGWAS ───────────────────────────────────────────────────────────
#>
#> ℹ A total of 21 rows were removed
#>
#> ℹ Total running time: 3.1s
#>
#> Saving metadata from analysis to /tmp/Rtmp2qwFE4/file1a7b47f6f2b9/metadata.yaml
#> # A tibble: 99,979 × 21
#> CHR POS_37 EffectAllele OtherAllele rowid B P SE INFO CaseN
#> <chr> <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 6 2.75e7 C T 77548 0.212 4.83e-39 0.0162 0.99 53386
#> 2 6 2.81e7 T C 36396 0.207 4.67e-38 0.016 1 53386
#> 3 6 2.75e7 C T 77992 0.206 7.92e-38 0.016 0.993 53386
#> 4 6 2.78e7 T C 67347 0.171 1.45e-32 0.0144 1 53386
#> 5 6 2.63e7 T C 7577 0.191 8.94e-30 0.0168 0.996 53386
#> 6 6 3.08e7 G A 26615 0.157 9.77e-28 0.0144 0.954 53386
#> 7 6 2.83e7 C T 91092 0.116 4.33e-27 0.0108 1.01 53386
#> 8 6 2.64e7 C T 84820 0.151 9.52e-27 0.0141 1 53386
#> 9 6 3.11e7 C T 17683 0.143 1.54e-25 0.0137 1 53386
#> 10 6 3.03e7 A G 71635 0.154 2.58e-25 0.0148 0.992 49683
#> # ℹ 99,969 more rows
#> # ℹ 11 more variables: ControlN <int>, EAF <dbl>, EffectiveN <int>, N <int>,
#> # POS_38 <int>, RSID <chr>, REF_37 <chr>, REF_38 <chr>, indel <lgl>,
#> #   multi_allelic <lgl>, Z <dbl>
Output files
tidyGWAS stores all of its output in a single directory:
fs::dir_tree(out)
#> /tmp/Rtmp2qwFE4/file1a7b47f6f2b9
#> ├── metadata.yaml
#> ├── pipeline_info
#> │ ├── removed_nodbsnp.parquet
#> │ └── removed_validate_chr_pos_path.parquet.parquet
#> ├── raw
#> │ └── raw.parquet
#> └── tidyGWAS_hivestyle
#> ├── CHR=1
#> │ └── part-0.parquet
#> ├── CHR=10
#> │ └── part-0.parquet
#> ├── CHR=11
#> │ └── part-0.parquet
#> ├── CHR=12
#> │ └── part-0.parquet
#> ├── CHR=13
#> │ └── part-0.parquet
#> ├── CHR=14
#> │ └── part-0.parquet
#> ├── CHR=15
#> │ └── part-0.parquet
#> ├── CHR=16
#> │ └── part-0.parquet
#> ├── CHR=17
#> │ └── part-0.parquet
#> ├── CHR=18
#> │ └── part-0.parquet
#> ├── CHR=19
#> │ └── part-0.parquet
#> ├── CHR=2
#> │ └── part-0.parquet
#> ├── CHR=20
#> │ └── part-0.parquet
#> ├── CHR=21
#> │ └── part-0.parquet
#> ├── CHR=22
#> │ └── part-0.parquet
#> ├── CHR=3
#> │ └── part-0.parquet
#> ├── CHR=4
#> │ └── part-0.parquet
#> ├── CHR=5
#> │ └── part-0.parquet
#> ├── CHR=6
#> │ └── part-0.parquet
#> ├── CHR=7
#> │ └── part-0.parquet
#> ├── CHR=8
#> │ └── part-0.parquet
#> └── CHR=9
#> └── part-0.parquet
That’s a lot of files!
- Removed rows are stored in pipeline_info/removed_*.
- metadata.yaml contains metadata about the execution.
- raw contains the summary statistics before any munging was done. Useful to reproduce the run, or to identify why rows were removed.
- tidyGWAS_hivestyle contains the cleaned summary statistics, stored by default in something called a hivestyle partition. The motivation for this is detailed further down in the vignette. If you just want a single flat file, use output_format = "csv" or output_format = "parquet".
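For example, to inspect the rows that were removed in the run above together with the run metadata (a sketch; it assumes the arrow and yaml packages are installed, and removed_nodbsnp.parquet is the filename produced by this particular run):
# rows removed because they had no dbSNP entry or incompatible alleles
removed <- arrow::read_parquet(file.path(out, "pipeline_info", "removed_nodbsnp.parquet"))
# metadata about the execution
metadata <- yaml::read_yaml(file.path(out, "metadata.yaml"))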
To read in the cleaned summary statistics:
df <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |> dplyr::collect()
The tidyGWAS columns
tidyGWAS uses the following column names:
CHR
POS
RSID
EffectAllele
OtherAllele
EAF
B
SE
P
CaseN
ControlN
N
EffectiveN
INFO
Z
Note:
If your RSID column is in the format CHR:BP:A1:A2, you can still pass it as an RSID column. tidyGWAS automatically detects and splits apart the values.
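A minimal sketch of such an input (the values are made up, and the tidyGWAS() call is shown commented out because these dummy positions would not match the bundled reference data):
# hypothetical input where the variant ID is stored as CHR:BP:A1:A2
df <- data.frame(
  RSID = c("1:12345:A:G", "2:67890:T:C"),
  EffectAllele = c("A", "T"),
  OtherAllele = c("G", "C"),
  B = c(0.01, -0.02),
  SE = c(0.005, 0.004),
  P = c(0.04, 0.001),
  N = 10000
)
# passed as usual; tidyGWAS detects the composite format and splits the values apart
# cleaned <- tidyGWAS(tbl = df, dbsnp_path = dbsnp_path, output_dir = tempfile())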
Inputting sample size columns
Often, the sample size columns are missing from the summary statistics and have to be provided manually. tidyGWAS has three arguments that can be used to set the sample size if it is missing from the original file: CaseN, ControlN, and N.
cleaned <- tidyGWAS(
  tbl = sumstats,
  dbsnp_path = dbsnp_path,
  # CaseN, ControlN and N can all be used to set the sample size
  CaseN = 400,
  ControlN = 800,
  N = 1200
)
Reading in files
If you pass a filepath to tidyGWAS, it will attempt to read in the file with data.table::fread(). If you need to pass custom arguments to fread() to make the parsing correct, you can pass them on through the ... argument.
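A sketch of what that looks like (na.strings and skip are standard data.table::fread() arguments, shown here only as examples of what can be forwarded through ...):
# extra arguments are forwarded to data.table::fread()
cleaned <- tidyGWAS(
  tbl = "/filepath/to/sumstats.txt",
  dbsnp_path = dbsnp_path,
  output_dir = tempfile(),
  # for example, tell fread() how missing values are coded and to skip one line
  na.strings = c("NA", "."),
  skip = 1
)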
Hivestyle-partitioning
The default output format is a hivestyle partition of .parquet files. See for example here for some motivation. In essence, this format significantly speeds up downstream applications such as meta-analysis, LD querying and other analyses.
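For example, with arrow and dplyr you can query a single region without reading the whole dataset, since only the partitions matching the filter on CHR are scanned (a sketch; depending on how the partition column is typed when read back, the CHR comparison may need to be numeric rather than character):
# only the partitions matching the CHR filter are read from disk
mhc <- arrow::open_dataset(file.path(out, "tidyGWAS_hivestyle")) |>
  dplyr::filter(CHR == "6", POS_37 > 25e6, POS_37 < 35e6) |>
  dplyr::collect()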
Variant identity
tidyGWAS will add the following columns dealing with variant identity, in addition to saving and validating all valid columns in the input file.
- CHR: chromosome. The same across both builds.
- POS_38, POS_37: genomic position on GRCh38 and GRCh37.
- RSID: variant ID from dbSNP.
- REF_37, REF_38: the reference genome allele on GRCh37 and GRCh38.
- multi_allelic: a TRUE/FALSE column that flags rows that were multi-allelic IN the summary statistics, NOT whether there are multiple alleles in dbSNP (TRUE corresponds to multi-allelic).
- rowid: maps each row back to the input summary statistics stored in raw, so that any row can be traced back to its original values (see the sketch after this list).
- indel: TRUE/FALSE, whether the variant is an insertion/deletion (indel).
- All other valid columns in the input summary statistics file are kept.
If possible, repair_stats() will add statistics columns such as B, P, SE and Z, if they are missing and can be derived from the other columns.
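As an illustration of the rowid column, the cleaned rows can be joined back to the raw input (a sketch; it assumes the output directory out from the tutorial above, and that raw.parquet keeps the rowid column as described):
cleaned <- arrow::open_dataset(file.path(out, "tidyGWAS_hivestyle")) |> dplyr::collect()
raw <- arrow::read_parquet(file.path(out, "raw", "raw.parquet"))
# compare cleaned rows with their original values via rowid
both <- dplyr::inner_join(cleaned, raw, by = "rowid", suffix = c("_clean", "_raw"))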
Parallel computation
tidyGWAS automatically detects the number of cores. In some cases, for example when running tidyGWAS on an HPC cluster, you might need to set the number of cores manually, which can be done through the OMP_NUM_THREADS environment variable. This should not be set to more cores than you have requested for your HPC job (in the example below, via the --cpus-per-task flag).
#SBATCH --mem=60gb
#SBATCH --time=24:0:00
#SBATCH --cpus-per-task 8
export OMP_NUM_THREADS=8
outdir=$(pwd)
gwas=$outdir/my_gwas.tsv.gz
dbsnp_files="dbSNP155"
Rscript -e "tidyGWAS::tidyGWAS(commandArgs(trailingOnly = TRUE)[1], dbsnp_path = commandArgs(trailingOnly = TRUE)[2], output_dir = commandArgs(trailingOnly = TRUE)[3], logfile = TRUE)" $gwas $dbsnp_files $outdir
Computational cost and memory usage
Memory use and running time scale with the size of the summary statistics. In particular, if the summary statistics are missing CHR and POS, tidyGWAS will require additional memory. From running tidyGWAS, here is a rough estimate:
1. 10 million rows: ~5-15 GB
2. 40 million rows: ~10-40 GB
3. 60 million rows: ~65-85 GB