Extending Forward
Forward was designed to be easily extensible. To achieve this, we have adopted a modular structure and have developed abstract classes that serve as templates for the different components. Bioinformaticians can implement their own versions of the phenotype and genotype databases if they want to make specific optimizations. They can also add new statistical tests by implementing new Tasks.
To summarize, the three modular components are the phenotype databases, the genotype databases and the tasks.
This section describes what is expected of the implementations of these abstract classes.
Phenotype databases
Phenotype databases should inherit forward.phenotype.db.AbstractPhenotypeDatabase.
class forward.phenotype.db.AbstractPhenotypeDatabase(*args, **kwargs)
Abstract class representing a collection of phenotypes.
This class parses the phenotype information source (e.g. flat files, Excel files, a relational database or anything else).
It is the responsibility of the phenotype database to handle phenotype-based exclusions and transformations.
Exclude correlated samples from controls.
Parameters: threshold (float) – A correlation coefficient threshold for exclusions.
In phenomic studies, it is common to exclude samples from controls if they are affected by a correlated phenotype. This is often described through the concept of “disease groups”.
A threshold of 0.8 means that if two discrete phenotypes A and B have a correlation coefficient >= 0.8, samples that are cases for A will be excluded from the control group of B and vice versa.
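The thresholding rule above can be illustrated with a minimal sketch. This is not Forward's implementation: the function name find_control_exclusions, the phenotype names and the case sets are all hypothetical, and the correlation matrix is assumed to follow the same order as the list of names.

```python
from itertools import combinations


def find_control_exclusions(names, corr, cases, threshold=0.8):
    """Return, for each phenotype, the samples to drop from its controls.

    names: phenotype names, in the same order as the rows of ``corr``.
    corr: square correlation matrix (nested lists or a numpy array).
    cases: dict mapping each phenotype name to the set of its case samples.
    """
    exclusions = {name: set() for name in names}
    for i, j in combinations(range(len(names)), 2):
        if corr[i][j] >= threshold:
            a, b = names[i], names[j]
            # Cases for A leave the controls of B, and vice versa.
            # Samples that are already cases are not controls to begin with.
            exclusions[b] |= cases[a] - cases[b]
            exclusions[a] |= cases[b] - cases[a]
    return exclusions


# Hypothetical data: two correlated phenotypes and one unrelated phenotype.
names = ["angina", "myocardial_infarction", "asthma"]
corr = [
    [1.0, 0.85, 0.1],
    [0.85, 1.0, 0.05],
    [0.1, 0.05, 1.0],
]
cases = {
    "angina": {"s1", "s2"},
    "myocardial_infarction": {"s2", "s3"},
    "asthma": {"s4"},
}
exclusions = find_control_exclusions(names, corr, cases)
```

With these made-up values, only the first two phenotypes exceed the threshold, so each one loses the other's cases from its control group while the third is left untouched.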
get_correlation_matrix(names)
Get a correlation matrix for the specified names.
Parameters: names (list) – A list of variable names.
Returns: A correlation matrix.
Return type: numpy.ndarray
This is useful to exclude correlated phenotypes as controls.
get_phenotype_vector(name)
Returns a numpy array representing the selected outcome for all samples.
Parameters: name (forward.phenotype.variables.Variable) – The Variable object representing the phenotype to extract.
Returns: A vector representing the outcome.
Return type: numpy.ndarray
This is one of the most important methods, as it is Forward’s way of accessing all phenotypic information, so it should be as efficient as possible. Missing values should be represented using NumPy NaNs.
Note
Experiment objects will call set_sample_order() with the results from the genotype container’s get_sample_order() to ensure consistency. This means that the order of the samples in this vector should be the same as the sample order for the genotype container.
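To make the sample-order contract concrete, here is a minimal sketch of an in-memory store that reorders its vectors when a new sample order is set. The class InMemoryPhenotypes is hypothetical and deliberately does not inherit the abstract class, so that the snippet runs without Forward installed. A real implementation would inherit AbstractPhenotypeDatabase and receive a Variable object instead of a plain name.

```python
import numpy as np


class InMemoryPhenotypes:
    """Hypothetical stand-in for a phenotype database (not Forward's class)."""

    def __init__(self, samples, data):
        # samples: list of sample IDs; data: dict of name -> list of values.
        self._samples = list(samples)
        self._data = {k: np.asarray(v, dtype=float) for k, v in data.items()}

    def get_sample_order(self):
        return list(self._samples)

    def set_sample_order(self, sequence):
        # Map each sample to its current position, then reindex every vector.
        position = {s: i for i, s in enumerate(self._samples)}
        idx = [position[s] for s in sequence]  # KeyError on unknown samples
        self._data = {k: v[idx] for k, v in self._data.items()}
        self._samples = list(sequence)

    def get_phenotype_vector(self, name):
        # Assumption: a plain string key is used here for brevity.
        return self._data[name]


db = InMemoryPhenotypes(
    samples=["s1", "s2", "s3"],
    data={"bmi": [22.5, np.nan, 31.0]},  # the NaN marks a missing value
)
# Stands in for the order reported by the genotype container.
db.set_sample_order(["s3", "s1", "s2"])
vector = db.get_phenotype_vector("bmi")
```

Precomputing the reordered arrays once in set_sample_order keeps get_phenotype_vector down to a dictionary lookup, which matters because it is called for every outcome.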
get_phenotypes()
Returns a list of phenotypes available in this database.
Returns: List of available phenotype names.
Return type: list of str
get_sample_order()
Get the order of the samples.
Returns: A list of samples in the same order as the phenotype vector.
Return type: list
set_experiment_variables(variables)
Signal from the experiment describing the subset of variables that will be analyzed.
Parameters: variables (list) – List of forward.phenotype.variables.Variable objects.
This method can be used to compute the exclusions for discrete variables based on correlation.
static validate_sample_sequences(old_seq, new_seq, allow_subset)
Compares sample sequences to validate the new sequence order.
Parameters:
- old_seq (iterable) – The initial order of samples.
- new_seq (iterable) – The new order of samples.
- allow_subset (bool) – If False, this method will raise a ValueError if the new sequence is a subset of the old sequence.
This can optionally be used by subclasses when writing the set_sample_order method. We recommend using this method to properly log relevant information.
If samples are missing from the new sequence, a warning will be displayed. If new samples are added, a ValueError will be raised.
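The documented checks can be sketched as follows. This is a hypothetical re-implementation written only from the description above, not the code shipped with Forward. The function name check_sample_sequences and the message wording are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def check_sample_sequences(old_seq, new_seq, allow_subset):
    """Hypothetical re-implementation of the documented checks."""
    old, new = set(old_seq), set(new_seq)

    # Samples that were never known to the database are always an error.
    extra = new - old
    if extra:
        raise ValueError(
            "Unknown samples in the new sequence: {}".format(sorted(extra))
        )

    # Dropped samples are an error or a warning, depending on allow_subset.
    missing = old - new
    if missing:
        if not allow_subset:
            raise ValueError(
                "{} samples are missing from the new sequence.".format(
                    len(missing)
                )
            )
        logger.warning(
            "%d samples are missing from the new sequence.", len(missing)
        )


# Reordering and dropping one sample is accepted (with a warning)
# when subsets are allowed.
check_sample_sequences(["a", "b", "c"], ["c", "a"], allow_subset=True)
```

A subclass could run such a check at the top of its set_sample_order method, before reordering any data, so that invalid orders are rejected before the database is modified.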
Genotype databases
Genotype databases should inherit forward.genotype.AbstractGenotypeDatabase.
class forward.genotype.AbstractGenotypeDatabase(*args, **kwargs)
Abstract genotype container.
This class defines the standardized methods to organize the genotype access procedures used by Forward.
experiment_init(experiment)
Experiment specific initialization.
This method has two main roles: building a database of Variant objects and doing the database-level filtering of the variants. It should also take care of loading the file in memory or of indexing it if needed.
filter_name(variant_list)
Filtering by variant id.
The argument can be either a path to a file or a list of names.
get_genotypes(variant_name)
Get a vector of encoded genotypes for the variant.
This is the core functionality of the Genotype Databases. It should be as fast as possible as it will be called repeatedly by the tasks. If the structure is in memory, using a hashmap or a pandas DataFrame is recommended. If the underlying structure is on disk, this should use very good indexing and potentially caching.
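The two storage strategies mentioned above can be sketched as follows. Both classes are hypothetical and do not inherit the abstract class, so that the snippet runs without Forward installed. The variant names and the 0/1/2 encoding are made up for the example, and the loader function only stands in for an indexed disk read.

```python
from functools import lru_cache


class DictGenotypeStore:
    """Hypothetical in-memory container backed by a hash map."""

    def __init__(self, samples, genotypes):
        self._samples = list(samples)
        # Hash map: variant name -> tuple of encoded genotypes.
        self._genotypes = {k: tuple(v) for k, v in genotypes.items()}

    def get_sample_order(self):
        return list(self._samples)

    def get_genotypes(self, variant_name):
        # Constant-time lookup on average.
        return self._genotypes[variant_name]


class CachedGenotypeStore:
    """Hypothetical disk-backed container with a small cache."""

    def __init__(self, loader, maxsize=1024):
        # loader: callable reading one variant's genotypes from disk.
        self.get_genotypes = lru_cache(maxsize=maxsize)(loader)


store = DictGenotypeStore(
    samples=["s1", "s2", "s3"],
    genotypes={"var1": [0, 1, 2], "var2": [1, 1, 0]},
)

calls = []


def slow_loader(variant_name):
    calls.append(variant_name)  # records each simulated disk access
    return store.get_genotypes(variant_name)


cached = CachedGenotypeStore(slow_loader)
cached.get_genotypes("var1")
cached.get_genotypes("var1")  # served from the cache, no second read
```

Caching helps when several tasks request the same variant in turn. The cache size is a trade-off between memory use and the number of repeated reads.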
get_sample_order()
Return a list of the (ordered) samples as represented in the database.
The experiment will pass the results of this method to the phenotype container’s set_sample_order method.
query_variants(session, fields=None)
Return a query object for variants.
Parameters:
- session (sqlalchemy.orm.session.Session) – A session object to interface with the Variant table.
- fields (list) – A list of attributes to query. They should correspond to columns of the Variant table.
If fields are given, they are queried. Otherwise, a query for the Variant objects is returned.
Variant data is stored in a SQLAlchemy database. This method provides a shortcut to query it in a more pythonic way.
Tasks
Tasks are classes that take care of statistical testing. Their run_task method will be called sequentially by the experiment.
class forward.tasks.AbstractTask(*args, **kwargs)
Abstract class for genetic tests.
Parameters:
- outcomes (list or str) – (optional) List of Variable names to include as outcomes for this task. Alternatively, “all” can be passed.
- covariates (list or str) – (optional) List of Variable names to include as covariates.
- variants (str) – List of variants. For now, we can’t subset at the task level, so this should either not be passed or be “all”.
- correction (str) – The multiple hypothesis testing correction. This will be automatically serialized in the task metadata (if the parent’s method is called).
- alpha (float) – Significance threshold (default: 0.05). This will be automatically serialized if the parent’s method is called.
Implementations of this class should either compute the statistics directly or manage the execution of external statistical programs and parse the results.
The run_task method will be called by the experiment and should result in the statistical analysis. When the task is done, the experiment will call the done method, which should take care of dumping metadata.
done()
Cleanup signal from the Experiment.
The abstract method writes the content of the info attribute to disk.
run_task(experiment, task_name, work_dir)
Method that triggers statistical computation.
Parameters:
- experiment (forward.experiment.Experiment) – The parent experiment which provides access to the whole experimental context.
- task_name (str) – The name of the task. This is useful to fill the results table, because the task name is one of the columns.
- work_dir (str) – Path to the Task’s work directory that was created by the experiment.
For implementations of this abstract class, calling the parent method will set the outcomes and covariates to filtered lists of Variable objects. It will also make sure that the outcomes, covariates, variants, alpha and correction are included as task metadata.
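The life cycle described above (the experiment calls run_task, then done) can be sketched with a stand-in base class, so that the snippet runs without Forward installed. Everything here is an assumption made for illustration: StandInTask only mimics the documented behaviour of forward.tasks.AbstractTask, the constructor defaults other than alpha are guesses, and the JSON format and the task_info.json file name are hypothetical. A real task would inherit AbstractTask and receive an Experiment object instead of None.

```python
import json
import os
import tempfile


class StandInTask:
    """Stand-in for forward.tasks.AbstractTask (not the real class)."""

    def __init__(self, outcomes="all", covariates=None, variants="all",
                 correction=None, alpha=0.05):
        self.outcomes = outcomes
        self.covariates = covariates
        self.variants = variants
        self.correction = correction
        self.alpha = alpha
        self.info = {}
        self.work_dir = None

    def run_task(self, experiment, task_name, work_dir):
        # Record the task parameters as metadata, as the parent method does.
        self.work_dir = work_dir
        self.info.update({
            "outcomes": self.outcomes,
            "covariates": self.covariates,
            "variants": self.variants,
            "correction": self.correction,
            "alpha": self.alpha,
        })

    def done(self):
        # Assumption: metadata is dumped as JSON; the real format may differ.
        path = os.path.join(self.work_dir, "task_info.json")
        with open(path, "w") as handle:
            json.dump(self.info, handle)


class CountingTask(StandInTask):
    """Hypothetical task that only counts the outcomes it was given."""

    def run_task(self, experiment, task_name, work_dir):
        # Call the parent first so that the metadata is recorded.
        super().run_task(experiment, task_name, work_dir)
        self.info["n_outcomes"] = len(self.outcomes)


# What the experiment would do for each task, in order.
work_dir = tempfile.mkdtemp()
task = CountingTask(outcomes=["bmi", "ldl"], covariates=["age"])
task.run_task(experiment=None, task_name="counting", work_dir=work_dir)
task.done()

with open(os.path.join(work_dir, "task_info.json")) as handle:
    saved = json.load(handle)
```

The point of the sketch is the ordering: the subclass calls the parent run_task before doing its own work, so the standard metadata is always present when done writes it out.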