Execute validation and quality control of GWAS summmary statistics
tidyGWAS.Rd
tidyGWAS()
performs a set of validations on input colummns, repairs missing
columns, and can add missing CHR/POS or RSID. In addition, CHR and POS is
standardised to GRCh38, with coordinates on GRCh37 added in as well.
Briefly, tidyGWAS()
updates RSID if possible using the
refsnp-merged file
from dbSNP. Each inputed column is then validated and coerced to
the correct type.
If statistis such as P
, B
are missing, tidyGWAS()
will attempt to impute
them if possible using repair_stats()
Standard column names are assumed, BEFORE inputting into the function. This is a deliberate decision as automatic parsing of some important column names can be ambiguous For example, in some sumstats, A1 referes to effect allele, while other formats use A1 as non-effect allele.
Usage
tidyGWAS(
tbl,
dbsnp_path = file.path(Sys.getenv("HOME"), ".config/dbSNP155"),
...,
column_names = NULL,
output_format = c("hivestyle", "parquet", "csv"),
output_dir = tempfile(),
CaseN = NULL,
ControlN = NULL,
N = NULL,
impute_freq = c("None", "EUR", "AMR", "AFR", "SAS", "EAS"),
impute_freq_file = NULL,
impute_n = FALSE,
min_EAF = NULL,
flag_discrep_freq = c("None", "EUR", "AMR", "AFR", "SAS", "EAS"),
allow_duplications = FALSE,
build = c("NA", "37", "38"),
default_build = c("37", "38"),
indel_strategy = c("keep", "remove"),
convert_p = 2.225074e-308,
repair_cols = TRUE,
logfile = FALSE
)
Arguments
- tbl
a
data.frame
orcharacter()
vector- dbsnp_path
filepath to the dbSNP155 directory
- ...
pass additional arguments to
arrow::read_delim_arrow()
, if tbl is a filepath.- column_names
a named list of column names:
list(RSID = "SNP", POS = "BP")
- output_format
How should the finished cleaned file be saved?
'csv' corresponds to
arrow::write_csv_arrow()
'parquet' corresponds to
arrow::write_parquet()
'hivestyle' corresponds to
arrow::write_dataset()
split byCHR
- output_dir
filepath to a folder where tidyGWAS output will be stored. The folder should not yet exist. Note that the default argument is
tempfile()
, meaning that tidyGWAS output will not be saved by default over R sessions.- CaseN
manually input number of cases
- ControlN
manually input number of controls
- N
manually input sample size
- impute_freq
one of c("None", "EUR", "AMR", "AFR", "SAS", "EAS"). If None, no imputation is done. Otherwise precomputed alleles frequence from 1000KG, selected ancestry is used
- impute_freq_file
filepath to a .parquet file with custom allele frequencies. The file needs to be a tabular dataframe with columns RSID, EffectAllele, OtherAllele, EAF. EAF should correspond to the frequency of the EffectAllele.
- impute_n
Should N be imputed if it's missing?
- min_EAF
Apply a filter on allele frequency prior to applying the algorithm. Useful to speed up cleaning of very large files
- flag_discrep_freq
Should variants with allele frequency discrepancies be flagged?
- allow_duplications
Should duplicated variants be allowed? Useful if the munged sumstats are QTL sumstats
- build
If you are sure of what genome build ('37' or '38'), can be used to skip
infer_build()
and speed up computation- default_build
If only RSID exists, the build cannot be inferred. Nonetheless, tidyGWAS applies a filter on incompatible alleles with GRCh37/38. In such a case, tidyGWAS needs to decide on which reference genome to compare alleles with.
- indel_strategy
Should indels be kept or removed?
- convert_p
What value should be used for when P-value has been rounded to 0?
- repair_cols
Should any missing statistical columns be repaired if possible? calls
repair_stats()
if TRUE- logfile
Should messages be redirected to a logfile?