Command-line utilities
=======================

.. grs-create
.. _grs-create:

``grs-create``
^^^^^^^^^^^^^^^

The most common algorithm to create a GRS is the LD pruning followed by p-value
thresholding approach. This script provides a convenient implementation of this
algorithm that can be easily integrated to computational pipelines.

This method consists of the following steps:

1. Rank the variants by increasing p-value.
2. Select the top variant, include it in the GRS and exclude all variants in LD
   at a pre-defined threshold (``--ld-threshold``).
3. Repeat 2. until the p-value threshold is reached (``--p-threshold``).

.. note::

    If a ``maf`` column is present in the summary statistics association file,
    all variants that are too rare will be automaticall excluded when reading
    the summary statistics file. If it is not available, the MAF will be
    computed from the genotype data to allow filtering, which is more
    computationally intensive.

For now, a plink binary file is required for the reference. A
population-specific reference panel like the 1000 genomes phase 3 should be
used.

This script generates a 'grs' file that is suitable for computing the GRS using
:ref:`grs-compute`.

.. literalinclude:: includes/grs_create_help.txt
    :language: none


.. grs-compute
.. _grs-compute:

``grs-compute``
^^^^^^^^^^^^^^^

Computes the GRS using a '.grs' file and individual-level genotypes. 

The GRS are computed using the following formula:

.. math::
    GRS_i = \sum_{j=0}^{m} \beta_j \cdot X_{ij} 

Where :math:`GRS_i` is the genetic risk score for the `i`:sup:`th` sample, j
indexes the genetic variants included in the score, :math:`\beta_j` is the
coefficient estimated from the summary statistics and :math:`X_{ij}` is the
number of effect alleles carried by individual `i` at variant `j`.

.. note::
    If the coefficient (:math:`\beta`) is negative, it's additive inverse is
    used instead (:math:`-\beta`) and the effect and reference alleles are
    swapped accordingly. This ensures that contributions to the GRS are always
    positive.

.. note::
   For missing genotypes, the mean dosage with respect to the effect allele is
   used when computing the variant's contribution. Another intuitive way of
   handling this case would have been to set the contribution to 0, but doing
   so is a stronger claim than using the mean, especially for variants with a
   MAF close to 0.5.

If supported imputation data formats are used to hold the genotype information,
the variants are further weighted by the INFO score as a form of global
genotype confidence weighting:

.. math::
    GRS_i = \sum_{j=0}^{m} \beta_j \cdot X_{ij}  \cdot info_j

.. note::
   At this point, few data formats are supported. This is dependant on
   the use of the "ImputedVariant" class in the geneparse reader.

.. literalinclude:: includes/grs_compute_help.txt
    :language: none

.. grs-mr
.. _grs-mr:

``grs-mr``
^^^^^^^^^^^

Mendelian randomization is a technique where genetic variants (or a genetic
risk score (G)) is used to estimate the effect of an exposure (X) on an outcome
(Y). Here, the estimation is done using the ratio method:

.. math::
    \beta = \frac{\beta_{Y \sim G}}{\beta_{X \sim G}}

In the current implementation, bootstrap resampling is used to estimate the
standard error which can be used for further inference.

.. literalinclude:: includes/grs_mr_help.txt
    :language: none

.. grs-utils
.. _grs-utils:

``grs-utils``
^^^^^^^^^^^^^^

This tool contains multiple features to manipulate GRS after they have been
computed. Most of these tools either generate plots (as png files) or modified
versions of the computed GRS. The latter file format is simply a CSV with a
'sample' and a 'grs' column.

histogram
++++++++++

The histogram sub-command generates a histogram of all the GRS values. Here is
an example:

.. image:: _static/images/histogram.png


quantiles
++++++++++

This function is used to dichotomize the GRS. The parameters are:
- ``-k`` The number of quantiles to take to form a group.
- ``-q`` The total number of quantiles.

Here are various examples of dichotomizations: ::

    # Create two groups with respect to the median.
    grs-utils quantiles my_grs.grs -k 1 -q 2

    # 1st vs 5th quintiles.
    grs-utils quantiles my_grs.grs -k 1 -q 5

    # 1st vs 5th quintiles, but keeping all the samples (will be set to NA).
    grs-utils quantiles my_grs.grs -k 1 -q 5 --keep-unclassified

The output file will contain two columns and a header: ``sample`` and
``group``. The group '0' contains the individuals with lower values of the GRS
and the group '1' contains individuals with high values of the GRS.

standardize
++++++++++++

This function standardizes a GRS:

.. math::
    GRS' = \frac{GRS - \bar{GRS}}{s}

This function is useful when different GRS are to be compared. For example, is
linear regression is used to assess the effect of a GRS on a given trait (Y),
then the coefficient can be interpreted as the number of units increase in Y
for a 1 s.d. increase in the GRS.

correlation
++++++++++++

This utility takes two computed GRS files as input and plots the correlation
between the GRS. This can be useful to test the variability of changing
different hyperparameters (e.g. p-value threshold).

This script opens the plot in the interactive matplotlib viewer. From there, it
can manipulated and saved to a file.

Here is an example:

.. image:: _static/images/correlation.png

On the plot, the linear regression fit is displayed as well as the identity
line to quickly assess the agreement between GRS.

.. literalinclude:: includes/grs_utils_help.txt
    :language: none

beta-plot
++++++++++

This utility compares the effect of variants between a summary statistics file
in the GRS format and regression results using the user's data. The goal is
to ensure that the reported effect sizes are compatible and that there are no
large or systematic discrepencies that could indicate a population or strand
mismatch.

The utility also displays the standard errors on the plot. For the error of the
expected coefficients, the standard errors are derived from the p-values and
effect sizes.

.. math::
    z = \frac{\beta}{s.e.}

Here is an example of the plot:

.. image:: _static/images/beta_plot.png

The execution also produces a text file displaying the SNPs information
(chromosome, position, alleles, coefficients, standard errors, maf in the
given population and n, the number of individuals used to compute each
observed coefficient).

.. literalinclude:: includes/grs_utils_beta_plot_help.txt
    :language: none

.. grs-evaluate
.. _grs-evaluate:

``grs-evaluate``
^^^^^^^^^^^^^^^^^

After calculating a GRS, it is important to quantify and evaluate its
predictive performance. This tool provides multiple commands to easily run
common diagnostics and validation steps.

regress
++++++++

The regress subcommand assesses the effect of the GRS as a continuous trait on
a given binary or continuous phenotype. In the case of a continuous outcome, 
linear regression is used and a plot like the following is generated:

.. image:: _static/images/evaluate-regress-continuous.png

In this plot, it is possible to quickly validate that the GRS indeed has an
effect on the trait of interest. It is also possible to estimate the fraction
of the variance explained by the GRS (corresponding to the :math:`R^2` value or
to evaluate the expected change in the phenotype attributable to a 1 unit
change in the GRS (:math:`\beta` coefficient and its 95% CI). Note that
standardizing the GRS first is useful to get an interpretation in units of GRS
s.d. and to make comparisons across different GRS possible.

If the outcome is discrete, then logistic regression is better suited to
evaluate the fit with the GRS. In this case, the generated plot looks as
follows:

.. image:: _static/images/evaluate-regress-binary.png

In this plot, the perspective is somewhat reversed, when compared with the
linear case. The GRS is now in the y axis and its distribution is shown in
cases and controls. The more general term "level" is used to describe the
discrete numerical values of the phenotype as it could be used for factor
variables.

The results for the logistic regression are shown on the plot. The OR
corresponds to the expected increase in odds for a one unit increase in the
GRS.

dichotomize-plot
+++++++++++++++++

This plot can be used to select an optimal quantile for dichotmization. The
tradeoff between the number of individuals classified and the GRS effect size
is highlighted. Note that selecting a local maxima in this plot can be a bad
idea because it increases the risk of overfitting. Traditional approaches such
as using a test set, using cross-validation or bootstrapping can be used to
select a more robust threshold.

The dichotmization is done by comparing the extremums of the distribution. For
example, at the 0.25 quantile, the first quartile (1/4) is compared to the last
quartile (4/4). All of the individuals in between can be used to form an
"intermediate" group or can be excluded from downstream analyses. This is why
the right axis represents the number of individuals that are classified in
either the "high" or "low" group for a given dichotmization.

.. image:: _static/images/evaluate-dichotomize-plot.png


ROC
++++

To evaluate the predictive accuracy of a GRS, it can be useful to look at its
ROC curve.

.. image:: _static/images/evaluate-roc.png

.. literalinclude:: includes/grs_evaluate_help.txt
    :language: none