• ldsR

library(ldsR)
library(testthat)
library(fs)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#> 
#>     matches
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Introduction

ldsR is a R rewrite of the ldsc software, originally implemented in python 2.7. The aim is to provide a significantly simplified user experience.

Currently, ldsR is able to reproduce all features of ldsc, except:

1. LDscore calculation

2. Quantitative annotations in partitioned heritability.

The showcase here is based on a small example file of 100,000 rows, with effect sizes from schizophrenia and bipolar disorder.

df <- arrow::read_parquet(system.file("extdata", "sumstats.parquet", package = "ldsR"))
head(df)
#> # A tibble: 6 × 5
#>   SNP          Z.x    N.x    Z.y    N.y
#>   <chr>      <dbl>  <dbl>  <dbl>  <dbl>
#> 1 rs3094315  0.289 126849 -0.101 405254
#> 2 rs3131972  0.269 126849 -0.299 405254
#> 3 rs3131969 -0.074 126849 -0.296 405254
#> 4 rs1048488  0.273 126849 -0.051 405254
#> 5 rs3115850  0.263 126849 -0.045 405254
#> 6 rs2286139  0.034 126849 -0.68  387638

Munge

To perfectly reproduce LDSC results, the input summary statistics needs to filtered in an identical manner. munge() mirrors the functionality of munge_sumstats.py. Munge requires the following columns:

SNP A1 A2 Z N

test_df <- mutate(df, A1 = "G", A2 = "T", Z = Z.x, N=N.x)
test_df <- munge(test_df)
#> ! Removed 0 rows after filtering on reference panel
#> ! Removed 0 rows with non-RSIDs in SNP column
#> ! Removed 0 rows with duplicated RSIDs
#> ! Removed 0 rows due to strand ambigious alleles
#> ! Removed 0 rows with a sample size smaller than 87096

SNP heritability

ldsc_h2() estimates the SNP-heritability for a single trait, one at a time. By default, regression weights and LDscores are from the European 1000G reference panel.

ldsc_h2(test_df)
#> # A tibble: 1 × 6
#>      h2  h2_se   int int_se mean_chi2 lambda_gc
#>   <dbl>  <dbl> <dbl>  <dbl>     <dbl>     <dbl>
#> 1 0.371 0.0358  1.09 0.0328      2.14      1.80

For binary traits, it’s often useful to convert the heritability estimate to the liability scale. This can be done using the pop_prev and sample_prev arguments. In this particular case, the N column in the example file reflects the total number of cases and controls. In such a case, we need both the population prevalence and sample prevalence

n_cont <- 77344
n_cas <- 53300
s_prev <- n_cas/(n_cas + n_cont)
ldsc_h2(test_df, pop_prev = 0.01, sample_prev = s_prev)
#> # A tibble: 1 × 8
#>   lia_h2 lia_se    h2  h2_se   int int_se mean_chi2 lambda_gc
#>    <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>     <dbl>     <dbl>
#> 1  0.212 0.0205 0.371 0.0358  1.09 0.0328      2.14      1.80

Ideally, for estimation of heritability, the per SNP sample size should be calculated using the effective sample size summed across cohorts, see here. If the N column corresponds to the effective sample size, only the pop_prev argument is required (a 50% sample prevalence is assumed).

eff_n <- calc_effective_n(n_cas, n_cont)
test_df$N <- eff_n
ldsc_h2(test_df, pop_prev = 0.01, sample_prev = s_prev)
#> # A tibble: 1 × 8
#>   lia_h2 lia_se    h2  h2_se   int int_se mean_chi2 lambda_gc
#>    <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>     <dbl>     <dbl>
#> 1  0.219 0.0212 0.384 0.0371  1.09 0.0329      2.14      1.80

Lastly, you can estimate the observed scale heritability first, and then calculate the liability scale heritability at your convenience using liability_h2().

liability_h2(0.37, 0.01, 0.4)
#> [1] 0.2127143

Genetic correlations

ldsR also implements the estimation of genetic correlations in ldsc_rg(). By default, regression weights and LDscores are from the European 1000G reference panel.

Technically, ldsc_h2() only requires SNP, Z and N. ldsc_rg() on the other hand requires both A1 and A2, to be able to align Z scores to the same reference allele.

Using the example data:

sumstat1 <- dplyr::rename(df, Z = Z.x, N = N.x) |> 
  dplyr::mutate(A1 = "A", A2 = "G")
sumstat2 <- dplyr::rename(df, Z = Z.y, N = N.y) |> 
  dplyr::mutate(A1 = "A", A2 = "G")

ldsc_rg(sumstat1, sumstat2)
#> ! 0 SNPs were removed when merging summary statistics
#> # A tibble: 1 × 13
#>      rg  rg_se h2_trait1 h2_trait_se h2_trait2 h2_trait2_se  gcov gcov_se
#>   <dbl>  <dbl>     <dbl>       <dbl>     <dbl>        <dbl> <dbl>   <dbl>
#> 1 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> # ℹ 5 more variables: gcov_int <dbl>, gcov_int_se <dbl>, mean_z1z2 <dbl>,
#> #   z <dbl>, p <dbl>

Additionally, ldsc_rg() attemps to make it easier to calculate genetic correlations among many traits.

sumstats2 can be both a dplyr::tibble(), and a list containing multiple dplyr::tibble()’s. If a list is passed to the sumstats2 argument, the genetic correlation will be estimated between sumstats1 and all entries in the list.

sumstats <- list(
  "trait1" = sumstat2,
  "trait2" = sumstat2,
  "trait4" = sumstat2
)
          
ldsc_rg(sumstat1, sumstats)
#> ! 0 SNPs were removed when merging summary statistics
#> ⠙ 1/3 ETA:  4s | Computing genetic correlations...
#> ! 0 SNPs were removed when merging summary statistics
#> ⠙ 1/3 ETA:  4s | Computing genetic correlations...  ⠹ 2/3 ETA:  2s | Computing genetic correlations...  
#> ! 0 SNPs were removed when merging summary statistics
#> ⠹ 2/3 ETA:  2s | Computing genetic correlations...  
#> # A tibble: 3 × 14
#>   trait2    rg  rg_se h2_trait1 h2_trait_se h2_trait2 h2_trait2_se  gcov gcov_se
#>   <chr>  <dbl>  <dbl>     <dbl>       <dbl>     <dbl>        <dbl> <dbl>   <dbl>
#> 1 trait1 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> 2 trait2 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> 3 trait4 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> # ℹ 5 more variables: gcov_int <dbl>, gcov_int_se <dbl>, mean_z1z2 <dbl>,
#> #   z <dbl>, p <dbl>

If the list is not named, the names will correspond to the index in the list.

sumstats <- list(
  sumstat2,
  sumstat2,
  sumstat2
)
          
ldsc_rg(sumstat1, sumstats)
#> ! 0 SNPs were removed when merging summary statistics
#> ⠙ 1/3 ETA:  3s | Computing genetic correlations...
#> ! 0 SNPs were removed when merging summary statistics
#> ⠙ 1/3 ETA:  3s | Computing genetic correlations...  ! 0 SNPs were removed when merging summary statistics
#> ⠙ 1/3 ETA:  3s | Computing genetic correlations...  
#> # A tibble: 3 × 14
#>   trait2    rg  rg_se h2_trait1 h2_trait_se h2_trait2 h2_trait2_se  gcov gcov_se
#>    <int> <dbl>  <dbl>     <dbl>       <dbl>     <dbl>        <dbl> <dbl>   <dbl>
#> 1      1 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> 2      2 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> 3      3 0.734 0.0496     0.371      0.0358    0.0732      0.00685 0.121  0.0113
#> # ℹ 5 more variables: gcov_int <dbl>, gcov_int_se <dbl>, mean_z1z2 <dbl>,
#> #   z <dbl>, p <dbl>

Partitioning heritability

In creating the functions to partition heritability, ldsR deviates from ldsc in the file format used for the partitioned ldscores.

The examples here use a smaller version of the full ldscores, which are available here. The native ldsc for partitioned ldscores are available here. The primary difference between ldsR and ldsc partitioned ldscores is the number of files. ldsc splits out the files across chromosomes and data-type, resulting in 88 files per ldscore dataset (22 chromosomes, .ldscore, .annot, .m50, .m5_50).

ldsR collapses this to three files, in the parquet format.

dir_tree(system.file("extdata", "superclusters/", package = "ldsR"))
#> /home/runner/work/_temp/Library/ldsR/extdata/superclusters/
#> ├── annot.parquet
#> ├── annot_ref.parquet
#> └── ld.parquet

annot.parquet contains information on the number of SNPs, ie the M50 and M5_50 files per annotation

arrow::read_parquet(system.file("extdata", "superclusters/annot.parquet", package = "ldsR")) |> 
  head()
#> # A tibble: 4 × 3
#>   annot                    m50       m
#>   <chr>                  <dbl>   <dbl>
#> 1 amygdala_excitatory   911324 1540052
#> 2 astrocyte             713230 1207523
#> 3 bergmann_glia         696875 1183544
#> 4 cerebellar_inhibitory 690546 1168161

ld.parquet contains the LDscores per annotation

arrow::read_parquet(system.file("extdata", "superclusters/ld.parquet", package = "ldsR")) |> 
  head()
#> # A tibble: 6 × 5
#>   SNP       amygdala_excitatory astrocyte bergmann_glia cerebellar_inhibitory
#>   <chr>                   <dbl>     <dbl>         <dbl>                 <dbl>
#> 1 rs3094315                2.65      42.3         0.041                 0.034
#> 2 rs3131972                2.74      42.4         0.061                 0.175
#> 3 rs3131969                3.62      54.8         0.298                 0.625
#> 4 rs1048488                2.77      42.5         0.024                -0.024
#> 5 rs3115850                2.61      42.0        -0.025                -0.105
#> 6 rs2286139                4.08      55.0         0.25                  0.86

annot_ref.parquet contains the the SNP to annotation mapping for each annotation and SNP that is used to calculate LDscores. It’s used when you want to calculate the enrichment of an annotation.

arrow::read_parquet(system.file("extdata", "superclusters/annot_ref.parquet", package = "ldsR")) |> 
  head()
#> # A tibble: 6 × 32
#>   SNP       amygdala_excitatory astrocyte bergmann_glia cerebellar_inhibitory
#>   <chr>                   <int>     <int>         <int>                 <int>
#> 1 rs3094315                   0         0             0                     0
#> 2 rs3131972                   0         0             0                     0
#> 3 rs3131969                   0         0             0                     0
#> 4 rs1048488                   0         1             0                     0
#> 5 rs3115850                   0         1             0                     0
#> 6 rs2286139                   0         1             0                     0
#> # ℹ 27 more variables: cge_interneuron <int>, choroid_plexus <int>,
#> #   committed_oligodendrocyte_precursor <int>,
#> #   deep_layer_corticothalamic_and_6b <int>,
#> #   deep_layer_intratelencephalic <int>, deep_layer_near_projecting <int>,
#> #   eccentric_medium_spiny_neuron <int>, ependymal <int>, fibroblast <int>,
#> #   hippocampal_ca1_3 <int>, hippocampal_ca4 <int>,
#> #   hippocampal_dentate_gyrus <int>, lamp5_lhx6_and_chandelier <int>, …

Partition Heritability

The main function to partition heritability is partition_h2().

sumstats is the first argument, corresponding to a dataframe with SNP, Z and N. Use munge() first.
ldscore_dirs takes a vector of filepaths, to directories with partitioned ldscores
subset_annots allows for flexible subsetting of annotations to use in the model. Default is NULL; which means that all of the annotions will be used in the model
overlapping_annotations which will result in the enrichment for each annotation also being estimated.

In this first example we estimate the partitioned heritability across the 4 of the 53 baseline annotations

# use the dataframe we munged earlier

partition_h2(
  test_df,
  ldscore_dirs = system.file("extdata", "baseline1.1_test/", package = "ldsR"),
)
#> ℹ Removed 449 rows after merging with weights
#> ℹ Removed 5961 rows after merging with ldscores
#> ✔ A total of 93590 SNPs remain for the regression model
#> # A tibble: 4 × 7
#>   annot                                coef    coef_se     z   tot tot_se n_snps
#>   <chr>                               <dbl>      <dbl> <dbl> <dbl>  <dbl>  <int>
#> 1 Conserved_LindbladToh.bedL2  0.000000880     2.63e-7  3.35 0.350 0.0393 1.53e5
#> 2 baseL2                       0.0000000274    8.79e-9  3.12 0.350 0.0393 5.96e6
#> 3 H3K4me1_peaks_Trynka.bedL2   0.0000000717    4.81e-8  1.49 0.350 0.0393 1.01e6
#> 4 Coding_UCSC.bedL2           -0.000000236     1.66e-7 -1.42 0.350 0.0393 8.51e4

Perhaps we are interested in doing “cell-type analysis”, testing the enrichment of a specific annotation while adjusting for the 53 baseline annotations. With partition_h2(), we simply add the filepath to the directory with additional ldscores.

Here we use a subsetted example of LDscores derived from human brain cell-types.

To be more specific in which LDscores we want to include in the model, we need pass a list of annotation names that will be included in the model. For this we need to know what the annotation names are per ldscore dataset. Here we use get_annot_names(), which will simply return the names of all annotations in a ldscore dataset.

all_ldscores <- c(
  system.file("extdata", "baseline1.1_test/", package = "ldsR"), 
  system.file("extdata", "superclusters/", package = "ldsR")
  )

baseline_names <- get_annot_names(all_ldscores[1])
supercluster_names <- get_annot_names(all_ldscores[2])

baseline_names
#> [1] "baseL2"                      "Coding_UCSC.bedL2"          
#> [3] "Conserved_LindbladToh.bedL2" "H3K4me1_peaks_Trynka.bedL2"

supercluster_names
#> [1] "amygdala_excitatory"   "astrocyte"             "bergmann_glia"        
#> [4] "cerebellar_inhibitory"

With this, we specify which annotations to include in the model through subset_annots.

partition_h2(
  test_df,
  ldscore_dirs = all_ldscores,
  # here we test the association with hippocampal neurons
  subset_annot = c(baseline_names,"amygdala_excitatory")
) |> 
  filter(annot == "amygdala_excitatory")
#> ℹ Removed 449 rows after merging with weights
#> ℹ Removed 5961 rows after merging with ldscores
#> ✔ A total of 93590 SNPs remain for the regression model
#> # A tibble: 1 × 7
#>   annot                       coef      coef_se     z   tot tot_se n_snps
#>   <chr>                      <dbl>        <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1 amygdala_excitatory 0.0000000312 0.0000000180  1.73 0.350 0.0373 911324

With this setup we can fit any model we might like, for example by adding additional cell-types to run conditional analysis.

partition_h2(
  test_df,
  ldscore_dirs = all_ldscores,
  # here we test the association with hippocampal neurons
  subset_annot = c(baseline_names,"amygdala_excitatory","astrocyte")
) |> 
  filter(annot %in% c("amygdala_excitatory","astrocyte"))
#> ℹ Removed 449 rows after merging with weights
#> ℹ Removed 5961 rows after merging with ldscores
#> ✔ A total of 93590 SNPs remain for the regression model
#> # A tibble: 2 × 7
#>   annot                        coef      coef_se     z   tot tot_se n_snps
#>   <chr>                       <dbl>        <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1 amygdala_excitatory  0.0000000272 0.0000000184  1.48 0.349 0.0379 911324
#> 2 astrocyte           -0.0000000186 0.0000000157 -1.18 0.349 0.0379 713230

Lastly, we can use overlapping_annotations=TRUE to estimate the enrichment of each annotation.

partition_h2(
  test_df,
  ldscore_dirs = all_ldscores,
  subset_annot = c(baseline_names,"amygdala_excitatory","astrocyte"),
  overlapping_annotations=TRUE
) |> 
  filter(annot %in% c("amygdala_excitatory","astrocyte"))

Cell-type analysis

Lastly, celltype_analysis() is a convenience function for estimating the association across several annotations, while adjusting for a set of “baseline” annotations.

sumstat
covariate_dir filepath to annotations that are used as controls
ldscore_dir filepath to annotations that are to be tested, one at a time while adjusting for all annotations in covariate_dir

celltype_analysis() will loop over each annotation in ldscore_dir, testing it’s association while adjusting for all annotations in covariate_dir.

celltype_analysis(
  sumstat = test_df,
  covariate_dir = all_ldscores[1],
  ldscore_dir = all_ldscores[2]
  
)
#> ℹ Removed 0 rows after merging with ldscores
#> # A tibble: 4 × 8
#>   annot                     coef      coef_se enrich   prop      z   tot tot_se
#>   <chr>                    <dbl>        <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
#> 1 amygdala_excitatory    3.12e-8 0.0000000180  0.724 0.112   1.73  0.350 0.0373
#> 2 astrocyte             -1.49e-8 0.0000000165 -0.336 0.0900 -0.905 0.352 0.0400
#> 3 bergmann_glia          4.55e-9 0.0000000159  0.101 0.0881  0.286 0.357 0.0396
#> 4 cerebellar_inhibitory -1.84e-8 0.0000000143 -0.411 0.0874 -1.29  0.354 0.0404