tidyGWAS
Installation
tidyGWAS is not yet on CRAN. To install, you need to use
devtools::install_github()
or
remotes::install_github()
devtools::install_github("ararder/tidyGWAS")
remotes::install_github("ararder/tidyGWAS")
Second, you will need to download the reference data. tidyGWAS uses a slightly edited version of dbSNP v155, which is available here.
Quick start
This is what a typical call to tidyGWAS()
will look
like. A detailed explanation of the different arguments follows.
# set up a directory where we will store all summary statistics cleaned by tidyGWAS
gwas_folder <- tempfile()
# provide the filepath to a directory that tidyGWAS will create
outdir <- file.path(gwas_folder, "gwasName")
cleaned <- tidyGWAS(
tbl = "/filepath/to/sumstats",
# we read in a tab-separated file, so provide tab as the delimiter
delim = "\t",
dbsnp_path = dbsnp_path,
# map tidyGWAS column names to the column names in the input file
column_names = list(
"CHR" = "CHROM",
"POS" = "BP",
"EffectAllele" = "A1",
"B" = "Effect"
),
# sample sizes are missing from the file, so set them manually
CaseN = 54000,
ControlN = 73000,
logfile = TRUE,
output_dir = outdir
)
Tutorial
Provide tidyGWAS either with an in-memory
dplyr::tibble()
or a filepath.
- Here a filepath to a gzipped tsv file is passed, along with delim = "\t". tidyGWAS uses arrow::read_delim_arrow() to read in files from disk. If you have trouble parsing the file correctly, use another package such as data.table::fread() or readr::read_table() to first read the file into memory, and then pass the in-memory data.frame to tidyGWAS.
- dbsnp_path = dbsnp_path gives the filepath to the reference data.
- output_dir = out gives the filepath to the output directory that tidyGWAS will create. The directory cannot exist yet, to protect against accidentally overwriting files. If no argument is passed to output_dir, tidyGWAS will create a folder in the R tempdir, meaning that any cleaned summary statistics will be deleted when the R session ends.
library(tidyGWAS)
# we use the dummy version of dbSNP that comes with the package
dbsnp_path <- system.file("extdata/dbSNP155", package = "tidyGWAS")
# a dummy sumstats with 100 000 rows
gwas <- system.file("extdata/sumstats.tsv.gz", package = "tidyGWAS")
# store the results in a temporary directory
out <- tempfile()
tidyGWAS(
tbl = gwas,
dbsnp_path = dbsnp_path,
output_dir = out
)
#> Parsing input summary statistics...
#>
#>
#> ── Running tidyGWAS 0.9.9.2 ────────────────────────────────────────────────────
#>
#> Starting at 2025-07-11 17:56:58.848487
#> with 100000 rows in input data.frame
#> ℹ Saving output in folder: /tmp/RtmpsKC2ps/file20988fd2438
#>
#>
#>
#> ── The detected format is "tidyGWAS"
#>
#> Was able to map 12 out of 12 to the "tidyGWAS" format
#> ✔ CHR -> CHR
#>
#> ✔ POS -> POS
#>
#> ✔ RSID -> RSID
#>
#> ✔ EffectAllele -> EffectAllele
#>
#> ✔ OtherAllele -> OtherAllele
#>
#> ✔ B -> B
#>
#> ✔ SE -> SE
#>
#> ✔ EAF -> EAF
#>
#> ✔ P -> P
#>
#> ✔ CaseN -> CaseN
#>
#> ✔ ControlN -> ControlN
#>
#> ✔ INFO -> INFO
#>
#>
#>
#> ── Checking that columns follow tidyGWAS format
#>
#> ✔ The following columns are used for further steps: CHR, POS, RSID, EffectAllele, OtherAllele, B, SE, EAF, INFO, P, CaseN, ControlN, and rowid
#>
#>
#>
#> ── Checking for columns with all NA
#>
#> ✔ Found no columns with all NA
#>
#> ℹ Found CaseN and ControlN, and no effective N:
#> Calculating EffectiveN by `EffectiveN = 4 / (1 / ControlN + 1 / CaseN)`
#>
#> ℹ Found CaseN and ControlN, Calculating N by `N = ControlN + CaseN`
#>
#>
#>
#> ── 1) Scanning for indels ──
#>
#>
#>
#> ✔ No rows contained missing values in EffectAllele and OtherAllele
#>
#> 1. EffectAllele or OtherAllele, character length > 1: A vs AA
#>
#> 2. EffectAllele or OtherAllele coded as 'D', 'I', or 'R'
#>
#> ✔ Detected 0 rows as indels
#>
#>
#>
#> ── 2) Scanning for rows with NA in critical columns ──
#>
#>
#>
#> ✔ No rows contained missing values in CHR and POS
#>
#>
#>
#> ── 3) Scanning for rows with duplications ──
#>
#>
#>
#> ℹ Looking for duplications with columns: CHR, POS, EffectAllele, and OtherAllele
#>
#> ✔ Found no duplications
#>
#>
#>
#> ── 4) Validating columns
#>
#> ℹ The median value of B is -0.0012, which seems reasonable
#>
#>
#>
#> ── All rows passed validation
#>
#>
#>
#> ── 5) Adding RSID based on CHR:POS. Adding dbSNP based QC flags
#>
#> ℹ Inferring build by matching 10000 rows to GRCh37 and GRCh38
#>
#> 99 snps matched GRCh38, 9998 for GRCh37, inferring build to be 37
#> ! Removed 21 rows with no dbSNP entry or with incompat alleles
#>
#> /tmp/RtmpsKC2ps/file20988fd2438/pipeline_info/removed_nodbsnp.parquet
#>
#>
#> ── 6) Repairing missings statistics columns if possible ──
#>
#>
#>
#> ℹ Z missing: Calculating Z using the formula: Z = B / SE
#>
#>
#>
#> ── Finished repair_stats:
#>
#> ℹ Added 1 new columns: Z
#>
#>
#>
#> ── Listing final breakdown of removed rows:
#>
#> nodbsnp: 21
#>
#>
#>
#> ── Finished tidyGWAS ───────────────────────────────────────────────────────────
#>
#> ℹ A total of 21 rows were removed
#>
#> ℹ Total running time: 3.8s
#>
#> Saving metadata from analysis to /tmp/RtmpsKC2ps/file20988fd2438/metadata.yaml
#> CHR POS_37 EffectAllele OtherAllele rowid B P
#> <char> <int> <char> <char> <int> <num> <num>
#> 1: 6 27521096 C T 77548 0.2120017 4.830e-39
#> 2: 6 28080777 T C 36396 0.2066970 4.672e-38
#> 3: 6 27480526 C T 77992 0.2062008 7.920e-38
#> 4: 6 27835272 T C 67347 0.1714038 1.454e-32
#> 5: 6 26328353 T C 7577 0.1908022 8.936e-30
#> ---
#> 99975: 11 33865194 A T 10863 0.0000000 1.000e+00
#> 99976: 8 126178011 A G 31103 0.0000000 1.000e+00
#> 99977: 3 7984591 G A 38135 0.0000000 1.000e+00
#> 99978: 19 37077172 A G 45371 0.0000000 1.000e+00
#> 99979: 1 191825752 A G 50847 0.0000000 1.000e+00
#> SE INFO CaseN ControlN EAF EffectiveN N POS_38
#> <num> <num> <int> <int> <num> <int> <int> <int>
#> 1: 0.0162 0.990 53386 77258 0.9189468 126281 130644 27553317
#> 2: 0.0160 1.000 53386 77258 0.9179468 126281 130644 28112999
#> 3: 0.0160 0.993 53386 77258 0.9175382 126281 130644 27512747
#> 4: 0.0144 1.000 53386 77258 0.8981296 126281 130644 27867494
#> 5: 0.0168 0.996 53386 77258 0.9261296 126281 130644 26328125
#> ---
#> 99975: 0.0122 0.670 53386 77258 0.2471827 126281 130644 33843648
#> 99976: 0.0104 0.996 53386 77258 0.7801827 126281 130644 125165769
#> 99977: 0.0286 0.986 52978 76825 0.9764081 125421 129803 7942904
#> 99978: 0.0163 0.998 53386 77258 0.9250000 126281 130644 36586270
#> 99979: 0.0097 1.000 53386 77258 0.2654086 126281 130644 191856622
#> RSID REF_37 REF_38 multi_allelic Z
#> <char> <char> <char> <lgcl> <num>
#> 1: rs35848276 C C FALSE 13.08653
#> 2: rs68188794 T T FALSE 12.91857
#> 3: rs34573979 C C FALSE 12.88755
#> 4: rs200948 T T FALSE 11.90304
#> 5: rs34107459 T T FALSE 11.35727
#> ---
#> 99975: rs7942366 A A FALSE 0.00000
#> 99976: rs7813225 A A FALSE 0.00000
#> 99977: rs140518486 G G FALSE 0.00000
#> 99978: rs12977285 A A FALSE 0.00000
#> 99979: rs9283418 A A FALSE 0.00000
Output files
tidyGWAS returns all its output in a directory.
fs::dir_tree(out)
#> /tmp/RtmpsKC2ps/file20988fd2438
#> ├── metadata.yaml
#> ├── pipeline_info
#> │ ├── removed_nodbsnp.parquet
#> │ └── removed_validate_chr_pos_path.parquet.parquet
#> ├── raw
#> │ └── sumstats.tsv.gz.parquet
#> └── tidyGWAS_hivestyle
#> ├── CHR=1
#> │ └── part-0.parquet
#> ├── CHR=10
#> │ └── part-0.parquet
#> ├── CHR=11
#> │ └── part-0.parquet
#> ├── CHR=12
#> │ └── part-0.parquet
#> ├── CHR=13
#> │ └── part-0.parquet
#> ├── CHR=14
#> │ └── part-0.parquet
#> ├── CHR=15
#> │ └── part-0.parquet
#> ├── CHR=16
#> │ └── part-0.parquet
#> ├── CHR=17
#> │ └── part-0.parquet
#> ├── CHR=18
#> │ └── part-0.parquet
#> ├── CHR=19
#> │ └── part-0.parquet
#> ├── CHR=2
#> │ └── part-0.parquet
#> ├── CHR=20
#> │ └── part-0.parquet
#> ├── CHR=21
#> │ └── part-0.parquet
#> ├── CHR=22
#> │ └── part-0.parquet
#> ├── CHR=3
#> │ └── part-0.parquet
#> ├── CHR=4
#> │ └── part-0.parquet
#> ├── CHR=5
#> │ └── part-0.parquet
#> ├── CHR=6
#> │ └── part-0.parquet
#> ├── CHR=7
#> │ └── part-0.parquet
#> ├── CHR=8
#> │ └── part-0.parquet
#> └── CHR=9
#> └── part-0.parquet
That’s a lot of files!
- pipeline_info/removed_* stores the rows that were removed.
- metadata.yaml contains metadata about the execution.
- raw contains the summary statistics before any munging was done. Useful to reproduce a run, or to identify why rows were removed.
- tidyGWAS_hivestyle contains the cleaned summary statistics, by default in something called a hive-style partition. The motivation for this is detailed further down in the vignette. If you just want a single standard file, use output_format = "csv" or output_format = "parquet".
To read in the cleaned summary statistics:
df <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |> dplyr::collect()
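Because the output is partitioned by chromosome, arrow can also push a filter down to the individual files, so only the partitions you need are read. A minimal sketch, assuming the same `out` directory as above (note that arrow infers the type of the `CHR` partition key from the directory names):

```r
# read only chromosome 6; arrow skips all other CHR=* partitions
chr6 <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |>
  dplyr::filter(CHR == 6) |>
  dplyr::collect()
```

For genome-wide work, collecting the whole dataset as shown above is still a single call; the partitioning mainly pays off for per-chromosome queries.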
Setting column names
Almost certainly, the summary statistics you want to clean will not have the same column names as the example file.
sumstats <- read.table(gwas, header = TRUE) |> tibble::as_tibble()
head(sumstats)
#> # A tibble: 6 × 13
#> CHR POS RSID EffectAllele OtherAllele B SE EAF INFO
#> <int> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 31819813 rs1839882… C G -0.00570 0.0292 0.975 0.936
#> 2 16 13991338 rs8044430 T C -0.0177 0.0092 0.641 0.951
#> 3 9 85016930 rs1114040… A T -0.00230 0.0185 0.916 0.701
#> 4 19 15743403 rs4807961 C G 0.00240 0.0099 0.256 0.996
#> 5 18 5948313 rs948293 T C 0.00170 0.0092 0.332 0.971
#> 6 13 47224772 rs837 G A 0.0272 0.0086 0.447 1.01
#> # ℹ 4 more variables: P <dbl>, CaseN <int>, ControlN <int>, rowid <int>
colnames(sumstats)
#> [1] "CHR" "POS" "RSID" "EffectAllele" "OtherAllele"
#> [6] "B" "SE" "EAF" "INFO" "P"
#> [11] "CaseN" "ControlN" "rowid"
To prevent files from being parsed incorrectly,
tidyGWAS()
will not guess
which columns are which. tidyGWAS therefore requires the column names
either to match its native format, or to come with a mapping between tidyGWAS column
names and the column names in the input file. This is done using the
column_names
argument, which takes a named list, where the
names are the tidyGWAS columns and the values are the corresponding columns
in the input file.
# what if the names were all wrong?
sumstats_with_wrong_names <- dplyr::rename(
sumstats,
CHROM = CHR,
BP = POS,
ID = RSID,
A1 = EffectAllele,
A2 = OtherAllele,
EFFECT = B
)
tidyGWAS(
tbl = sumstats_with_wrong_names,
dbsnp_path = dbsnp_path,
column_names = list(
CHR = "CHROM",
POS = "BP",
RSID = "ID",
EffectAllele = "A1",
OtherAllele = "A2",
B = "EFFECT"
)
)
The tidyGWAS columns
tidyGWAS uses the following column names:
CHR
POS
RSID
EffectAllele
OtherAllele
EAF
B
SE
P
CaseN
ControlN
N
INFO
Z
Note:
If your RSID column is in the format CHR:BP:A1:A2, you can still pass it as an RSID column.
Inputting sample size columns
Often, the sample size columns are missing from the summary statistics and have to be provided manually. tidyGWAS has three arguments that can be used to manually set the sample size if it is missing from the original file:
CaseN
ControlN
N
cleaned <- tidyGWAS(
tbl = sumstats,
dbsnp_path = dbsnp_path,
# CaseN, ControlN and N can all be used to set sample size
CaseN = 400,
ControlN = 800,
N = 1200
)
Reading in files
If you pass a filepath to tidyGWAS, it will attempt to read in the
file with arrow::read_delim_arrow()
The default delimiter is white space, so to read in comma-separated
files or tab-separated files, you can provide the delim
argument to arrow::read_delim_arrow()
through
...
cleaned <- tidyGWAS(
tbl = "filepath/to/gwas/ondisk.csv",
# here we specify the delimiter for csv files
delim = ",",
dbsnp_path = dbsnp_path
)
There are many different field delimiters in the wild, and
sometimes you can struggle to find the correct one. In such
cases, it’s often more convenient to use the efficient
data.table::fread()
or readr::read_table()
to
first read the summary statistics into memory before passing them to
tidyGWAS.
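A minimal sketch of that fallback, assuming a hypothetical filepath and a `dbsnp_path` set up as in the tutorial:

```r
# fread() guesses the delimiter automatically, which often succeeds
# where a manually supplied delim does not
sumstats <- data.table::fread("/filepath/to/sumstats.txt")

# pass the in-memory data.frame instead of a filepath
cleaned <- tidyGWAS(tbl = sumstats, dbsnp_path = dbsnp_path)
```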
Hivestyle-partitioning
The default output format is a hive-style partition of .parquet files. See for example here for some motivation for this. In essence, this format can significantly speed up downstream applications such as meta-analysis, LD querying, and other analyses.
Variant identity
tidyGWAS will add the following columns dealing with variant identity, in addition to saving and validating all valid columns in the input file.
- CHR: chromosome. The same across both builds.
- POS_38, POS_37: genomic position on GRCh38 and GRCh37.
- RSID: variant ID from dbSNP.
- REF_37, REF_38: the reference genome allele on GRCh37 and GRCh38.
- multi_allelic: a TRUE/FALSE column that flags rows that were multi-allelic IN the summary statistics, NOT whether there are multiple alleles in dbSNP (TRUE corresponds to multi-allelic).
- rowid: maps each row back to the input summary statistics file in raw, so that any row can be traced back to its original values.
- indel: TRUE/FALSE, whether the variant is an INsertion/DELetion.
- All other valid columns in the input summary statistics file.
If possible,
repair_stats()
will add statistics columns such as B, P, SE, and Z, if they are missing and possible to repair.
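As a sketch, the rowid column can be used to trace cleaned rows back to their raw values, assuming the same `out` directory and the raw parquet filename shown in the output tree above:

```r
# raw input as saved by tidyGWAS, before any munging
raw <- arrow::read_parquet(paste0(out, "/raw/sumstats.tsv.gz.parquet"))

# cleaned output
cleaned <- arrow::open_dataset(paste0(out, "/tidyGWAS_hivestyle")) |>
  dplyr::collect()

# original values of the rows that survived cleaning
kept_raw <- dplyr::semi_join(raw, cleaned, by = "rowid")
```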
Parallel computation
tidyGWAS automatically detects the number of cores. In some cases,
for example when running tidyGWAS on an HPC cluster, you might need to
manually set the number of cores, which can be done using the
OMP_NUM_THREADS
environment variable. This should not be set to more cores
than you have requested in your HPC job (in the
example below, the --cpus-per-task flag).
#SBATCH --mem=60gb
#SBATCH --time=24:0:00
#SBATCH --cpus-per-task 8
export OMP_NUM_THREADS=8
workdir=$(pwd)
gwas=$workdir/my_gwas.tsv.gz
dbsnp_files="dbSNP155"
# output_dir must not exist yet, so point it at a new subdirectory
outdir=$workdir/cleaned_gwas
Rscript -e "tidyGWAS::tidyGWAS(commandArgs(trailingOnly = TRUE)[1], dbsnp_path = commandArgs(trailingOnly = TRUE)[2], output_dir = commandArgs(trailingOnly = TRUE)[3], logfile = TRUE)" $gwas $dbsnp_files $outdir
Computational cost and memory usage
Memory use and running time scale with the size of the summary statistics. Here is a rough estimate from running tidyGWAS. In particular, if the summary statistics are missing CHR and POS, tidyGWAS will require additional memory.
1. 10 million rows: ~5-15 GB
2. 40 million rows: ~10-40 GB
3. 60 million rows: ~65-85 GB