Execute validation and quality control of GWAS summary statistics

tidyGWAS() performs a set of validations on input columns, repairs missing columns, and can add missing CHR/POS or RSID. In addition, CHR and POS are standardised to GRCh38, and coordinates on GRCh37 are added as well.

Briefly, tidyGWAS() updates RSID where possible using the refsnp-merged file from dbSNP. Each input column is then validated and coerced to the correct type.

If statistics such as P or B are missing, tidyGWAS() will attempt to impute them, where possible, using repair_stats().
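As an illustration of the kind of imputation described above, a missing P can be recovered from an effect size and its standard error through the Wald Z-statistic. This is a minimal sketch in base R, not the actual repair_stats() implementation; the SE column name and the helper p_from_b_se() are assumptions made for the example.

```r
# Hypothetical helper, not part of tidyGWAS: two-sided P from B and SE.
p_from_b_se <- function(B, SE) {
  z <- B / SE                  # Wald Z-statistic
  2 * stats::pnorm(-abs(z))    # two-sided P-value under a standard normal
}

sumstats <- data.frame(
  RSID = c("rs1", "rs2", "rs3"),  # made-up identifiers
  B    = c(0.10, -0.25, 0.00),
  SE   = c(0.05,  0.05, 0.02),    # the SE column name is an assumption
  stringsAsFactors = FALSE
)

# Fill in the missing P column from B and SE.
sumstats$P <- p_from_b_se(sumstats$B, sumstats$SE)
```

A variant with B equal to 0 gets P equal to 1, and larger absolute Z-statistics give smaller P-values.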

Standard column names are assumed, so columns must be renamed BEFORE the data is passed to the function. This is a deliberate decision, as automatic parsing of some important column names can be ambiguous. For example, in some sumstats A1 refers to the effect allele, while other formats use A1 for the non-effect allele.

Usage

tidyGWAS(
  tbl,
  dbsnp_path = file.path(Sys.getenv("HOME"), ".config/dbSNP155"),
  ...,
  column_names = NULL,
  output_format = c("hivestyle", "parquet", "csv"),
  output_dir = tempfile(),
  CaseN = NULL,
  ControlN = NULL,
  N = NULL,
  impute_freq = c("None", "EUR", "AMR", "AFR", "SAS", "EAS"),
  impute_freq_file = NULL,
  impute_n = FALSE,
  min_EAF = NULL,
  flag_discrep_freq = c("None", "EUR", "AMR", "AFR", "SAS", "EAS"),
  allow_duplications = FALSE,
  build = c("NA", "37", "38"),
  default_build = c("37", "38"),
  indel_strategy = c("keep", "remove"),
  convert_p = 2.225074e-308,
  repair_cols = TRUE,
  logfile = FALSE
)

Arguments

tbl

a data.frame, or a character() vector giving the filepath to a summary statistics file

dbsnp_path

filepath to the dbSNP155 directory

...

pass additional arguments to arrow::read_delim_arrow(), if tbl is a filepath.

column_names

a named list of column names: list(RSID = "SNP", POS = "BP")
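To make the shape of this argument concrete, the sketch below applies such a mapping by hand in base R. It assumes that the list names are the standard names and the values are the names found in the input, as the example above suggests; rename_to_standard() is a hypothetical helper and not part of tidyGWAS.

```r
# Made-up input with non-standard column names.
raw <- data.frame(
  SNP = c("rs1", "rs2"),
  BP  = c(1000L, 2000L),
  P   = c(0.5, 0.01),
  stringsAsFactors = FALSE
)

# Same shape as the column_names argument: standard name = name in the input.
column_names <- list(RSID = "SNP", POS = "BP")

# Hypothetical helper, not part of tidyGWAS.
rename_to_standard <- function(tbl, column_names) {
  old <- unlist(column_names, use.names = FALSE)
  missing_cols <- setdiff(old, names(tbl))
  if (length(missing_cols) > 0) {
    stop("Columns not found in input: ", paste(missing_cols, collapse = ", "))
  }
  # Overwrite only the mapped columns; all other names are left untouched.
  names(tbl)[match(old, names(tbl))] <- names(column_names)
  tbl
}

renamed <- rename_to_standard(raw, column_names)
names(renamed)
```

Columns that are not mentioned in the mapping, such as P here, keep their original names.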

output_format

How should the finished cleaned file be saved?

'csv' corresponds to arrow::write_csv_arrow()
'parquet' corresponds to arrow::write_parquet()
'hivestyle' corresponds to arrow::write_dataset() split by CHR

output_dir

filepath to a folder where tidyGWAS output will be stored. The folder should not yet exist. Note that the default argument is tempfile(), meaning that, by default, tidyGWAS output will not persist across R sessions.
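One way to satisfy the "should not yet exist" requirement is to build a fresh folder name under an existing parent directory and check it before calling tidyGWAS(). The following is a small base R sketch; the directory names are hypothetical, and tempdir() is used as the parent only so that the example runs anywhere.

```r
# Hypothetical parent directory; in practice, use a location on persistent storage.
base_dir <- file.path(tempdir(), "tidyGWAS_demo")
dir.create(base_dir, showWarnings = FALSE, recursive = TRUE)

# The parent folder exists, but the output folder itself must not.
stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
output_dir <- file.path(base_dir, paste0("trait_X_", stamp))

if (dir.exists(output_dir)) {
  stop("output_dir already exists: ", output_dir)
}
```

The resulting output_dir can then be passed to the output_dir argument.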

CaseN

manually input number of cases

ControlN

manually input number of controls

N

manually input sample size

impute_freq

one of c("None", "EUR", "AMR", "AFR", "SAS", "EAS"). If "None", no imputation is done. Otherwise, precomputed allele frequencies from 1000KG for the selected ancestry are used.

impute_freq_file

filepath to a .parquet file with custom allele frequencies. The file needs to be a tabular dataframe with columns RSID, EffectAllele, OtherAllele, EAF. EAF should correspond to the frequency of the EffectAllele.
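A table with the required columns could be prepared along the following lines. This is a sketch with made-up values; the checks are illustrative rather than the validation tidyGWAS itself performs, and the write step is skipped when the arrow package is not installed.

```r
# Made-up allele frequencies with the four required columns.
freq <- data.frame(
  RSID         = c("rs1", "rs2"),
  EffectAllele = c("A", "C"),
  OtherAllele  = c("G", "T"),
  EAF          = c(0.12, 0.47),   # frequency of EffectAllele
  stringsAsFactors = FALSE
)

# Illustrative sanity checks before writing the file.
required <- c("RSID", "EffectAllele", "OtherAllele", "EAF")
stopifnot(
  all(required %in% names(freq)),
  is.numeric(freq$EAF),
  all(freq$EAF >= 0 & freq$EAF <= 1)
)

# Hypothetical file location; writing needs the arrow package.
freq_file <- file.path(tempdir(), "custom_freq.parquet")
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(freq, freq_file)
}
```

The path in freq_file is what would be passed as impute_freq_file.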

impute_n

Should N be imputed if it's missing?

min_EAF

Apply a filter on allele frequency before running the cleaning steps. Useful to speed up cleaning of very large files.
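The effect of a frequency filter can be sketched in base R as below. Whether tidyGWAS filters on EAF directly or on minor allele frequency is not stated here; the sketch assumes the latter, so that variants in both tails are removed, and uses made-up data.

```r
sumstats <- data.frame(
  RSID = c("rs1", "rs2", "rs3", "rs4"),   # made-up identifiers
  EAF  = c(0.001, 0.20, 0.995, 0.50),
  stringsAsFactors = FALSE
)
min_EAF <- 0.01

# Assumption: the filter is symmetric, i.e. applied to minor allele frequency.
maf  <- pmin(sumstats$EAF, 1 - sumstats$EAF)
kept <- sumstats[maf >= min_EAF, ]
```

Under this assumption, rs1 and rs3 are dropped because their minor allele frequencies fall below 0.01.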

flag_discrep_freq

Should variants with allele frequency discrepancies be flagged?

allow_duplications

Should duplicated variants be allowed? Useful if the munged sumstats are QTL sumstats

build

If you are sure of the genome build ('37' or '38'), set it here to skip infer_build() and speed up computation.

default_build

If only RSID exists, the build cannot be inferred. Nonetheless, tidyGWAS filters out alleles that are incompatible with GRCh37/38. In such a case, tidyGWAS needs to decide which reference genome to compare alleles with, and this argument sets that choice.

indel_strategy

Should indels be kept or removed?

convert_p

What value should be used when a P-value has been rounded to 0?
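The default value in the Usage section matches, to the printed precision, the smallest positive normalised double in R (.Machine$double.xmin). The replacement can be sketched as follows; the data are made up, and replacing exact zeros only is an assumption about how the conversion is applied.

```r
convert_p <- 2.225074e-308      # default from the Usage section
P <- c(0.03, 0, 5e-8)           # made-up P-values; the second was rounded to 0

# Replace exact zeros only, leaving all other values untouched.
P_fixed <- P
P_fixed[P_fixed == 0] <- convert_p

# For comparison: the smallest positive normalised double on this machine.
.Machine$double.xmin
```

Keeping P strictly positive means that later transformations such as -log10(P) stay finite.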

repair_cols

Should any missing statistical columns be repaired if possible? Calls repair_stats() if TRUE.

logfile

Should messages be redirected to a logfile?

Value

a dplyr::tibble()

Examples

if (FALSE) { # \dontrun{
tidyGWAS(
  tbl = "path/to/GWAS_trait_X_.tsv.gz", logfile = TRUE,
  output_dir = "/store/GWAS/tidyGWAS/trait_X"
  )
} # }