AutoGeneS¶
AutoGeneS is a tool that automatically extracts informative genes and reveals the cellular heterogeneity of bulk RNA samples. AutoGeneS requires no prior knowledge about marker genes and selects genes by simultaneously optimizing multiple criteria: minimizing the correlation and maximizing the distance between cell types. It can be applied to reference profiles from various sources, such as single-cell experiments or sorted cell populations.
It is compatible with scanpy. To report issues or view the code, please refer to our GitHub page.
Background¶
The proposed feature selection approach solves a multi-objective optimization problem. As the name suggests, multi-objective optimization involves more than one objective function to be optimized at once. When no single solution exists that simultaneously optimizes each objective, the objective functions are said to be conflicting. In this case, the optimal solution of one objective function is different from that of the others. This gives rise to a set of trade-off optimal solutions known as Pareto-optimal solutions. In practice, the set reported by a search algorithm consists of the non-dominated solutions it has explored so far. A non-dominated solution cannot be improved in any objective without degrading at least one of the other objectives. Without additional subjective preference, all Pareto-optimal solutions are considered equally good.
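To make the notion of a non-dominated solution concrete, here is a minimal sketch in plain Python. It assumes that every objective is minimized; the names `dominates` and `pareto_front` and the sample values are illustrative and are not part of AutoGeneS.

```python
def dominates(a, b):
    """Return True if solution a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical objective values (objective_1, objective_2), both minimized.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (4.0, 4.0)]
front = pareto_front(candidates)
print(front)
```

An objective that should be maximized, such as distance, can be handled in this sketch by negating its values before the comparison.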
In AutoGeneS, we have n binary decision variables where n is equal to the number of genes from which the optimizer selects the markers. The value of a decision variable represents whether the corresponding gene is selected as a marker. Later, we evaluate the objective functions (correlation and distance) only for genes G whose decision variables are set to one.
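The following sketch illustrates how a bit string of decision variables restricts the evaluation to the selected genes. The exact objective definitions are not spelled out here, so the sketch assumes that correlation is the sum of pairwise Pearson correlations between cell-type centroids and that distance is the sum of pairwise Euclidean distances. The helper names `pearson` and `evaluate` and the data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def evaluate(centroids, mask):
    """Return (correlation, distance) over the genes whose bit in mask is 1."""
    selected = [[v for v, bit in zip(row, mask) if bit] for row in centroids]
    pairs = list(combinations(selected, 2))
    correlation = sum(pearson(a, b) for a, b in pairs)
    distance = sum(math.dist(a, b) for a, b in pairs)
    return correlation, distance


# Hypothetical centroids: 2 cell types x 4 genes.
centroids = [
    [1.0, 2.0, 3.0, 9.0],
    [3.0, 2.0, 1.0, 9.0],
]
mask = [1, 1, 1, 0]  # genes 0-2 are selected, gene 3 is ignored
correlation, distance = evaluate(centroids, mask)
print(correlation, distance)
```

Gene 3 has the same value in both cell types and therefore carries no information for telling them apart; leaving its bit at zero keeps it out of both objectives.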
AutoGeneS uses a genetic algorithm (GA), one of the main representatives of the family of multi-objective optimization techniques. A GA uses a population-based approach in which candidate solutions, representing the individuals of a population, are iteratively modified using heuristic rules to increase their fitness (i.e., objective function values). The main steps of a generic GA are as follows:
1. Initialization step: The initial population of individuals is randomly generated. Each individual represents a candidate solution that, in the feature selection problem, is a set of marker genes. The solution is represented as a bit string with each bit representing a gene. If a bit is one, the corresponding gene is selected as a marker.
2. Evaluation and selection step: The individuals are evaluated for fitness (objective) values and ranked from best to worst based on these values. After evaluation, the best feasible individuals are stored in an archive according to their objective values.
3. Termination step: If the termination conditions are satisfied (e.g., the simulation has run for a certain number of generations), the simulation exits with the current solutions in the archive. Otherwise, a new generation is created.
4. Offspring creation step: If the simulation continues, new individuals (offspring) are created. The general GA modifies solutions in the archive and creates offspring through random crossover and mutation operators. First, parents are selected among the candidates in the archive. Second, the crossover operator combines the bits of the parents to create the offspring. Third, the mutation operator makes random changes to the offspring. Offspring are then added to the population, and the GA continues with step 2.
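The four steps above can be sketched as a small, self-contained loop. This is a toy illustration under stated assumptions, not the AutoGeneS optimizer: both objectives are minimized, the archive is simply the current non-dominated set, crossover is one-point, and the names `run_ga`, `non_dominated`, `toy_fitness` and the `scores` list are hypothetical.

```python
import random


def dominates(a, b):
    """True if a is no worse than b in every objective and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def non_dominated(population, fitness):
    """Step 2: evaluate individuals and keep the non-dominated ones."""
    scored = {ind: fitness(ind) for ind in population}
    return [ind for ind, f in scored.items()
            if not any(dominates(g, f) for g in scored.values())]


def run_ga(n_genes, fitness, pop_size=20, ngen=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of bit strings (1 = gene selected).
    population = [tuple(rng.randint(0, 1) for _ in range(n_genes))
                  for _ in range(pop_size)]
    archive = non_dominated(population, fitness)
    # Step 3: terminate after a fixed number of generations.
    for _ in range(ngen):
        # Step 4: create offspring from parents drawn from the archive.
        offspring = []
        for _ in range(pop_size):
            p1, p2 = rng.choice(archive), rng.choice(archive)
            cut = rng.randrange(1, n_genes)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            offspring.append(child)
        # Back to step 2 with the archive plus the new individuals.
        archive = non_dominated(archive + offspring, fitness)
    return archive


# Hypothetical toy objectives, both minimized: few genes, high total score.
scores = [5, 1, 4, 2, 3, 1, 2, 4]


def toy_fitness(ind):
    return (sum(ind), -sum(s for s, bit in zip(scores, ind) if bit))


archive = run_ga(len(scores), toy_fitness)
for ind in archive:
    print(ind, toy_fitness(ind))
```

Unlike the `mode='fixed'` setting used later in this guide, the sketch does not constrain the number of selected genes; it only shows how the archive, crossover, and mutation interact.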
Getting Started¶
Install AutoGeneS with:
pip install --user autogenes
In the following, we show how to use AutoGeneS with an example.
Import the libraries and read the reference data and bulk samples:
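import anndata
import numpy as np
import pandas as pd
import scanpy as sc
import autogenes as ag
# Bulk samples, read from a CSV file and transposed
bulk_data = pd.read_csv('bulk_data.csv').transpose()
# address_to_your_sc_data is a placeholder for the path to your single-cell reference data
adata = sc.read(address_to_your_sc_data, cache=True).transpose()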
Before you can use AutoGeneS, it needs to be initialized with the reference data (see the API):
ag.init(adata,use_highly_variable=True,celltype_key='cellType')
If the data is given as an AnnData object, ag.init automatically computes the centroids of the cell types by averaging the gene expression of their cells.
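As an illustration of what averaging per cell type means, the sketch below computes centroids from a small expression matrix in plain Python. It is not the code used by ag.init; the function name `centroids_by_cell_type` and the data are hypothetical.

```python
from collections import defaultdict


def centroids_by_cell_type(expression, cell_types):
    """Average the expression vectors of all cells that share a label."""
    groups = defaultdict(list)
    for row, label in zip(expression, cell_types):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


# Hypothetical data: 4 cells x 3 genes, with one cell-type label per cell.
expression = [
    [1.0, 0.0, 4.0],
    [3.0, 0.0, 2.0],
    [0.0, 5.0, 1.0],
    [0.0, 7.0, 1.0],
]
cell_types = ["T cell", "T cell", "B cell", "B cell"]
centroids = centroids_by_cell_type(expression, cell_types)
print(centroids)
```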
In the next step, we run the optimizer:
ag.optimize(ngen=5000,nfeatures=400,seed=0,mode='fixed')
Here, we run the optimizer for 5,000 generations and ask for 400 genes. During this optimization process, a set of solutions is generated. Each solution is a set of 400 genes and is evaluated based on objectives that can be passed to optimize. In this example, the optimizer uses the default objectives, correlation and distance.
All the non-dominated solutions can be visualized using the plot function:
ag.plot(weights=(-1,0))
This plots the objective values of all non-dominated solutions. The arguments are used to select a solution, which is marked in the plot. In our case, the weights (-1,0) are applied to the objective values (correlation, distance), which selects the solution with minimum correlation.
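One plausible reading of the weights is a weighted sum over the objective values, with the highest-scoring solution being selected. The sketch below follows that assumption; AutoGeneS may scale the objective values or break ties differently. The function `select_by_weights` and the objective values are hypothetical.

```python
def select_by_weights(objective_values, weights):
    """Return the index of the solution with the largest weighted sum."""
    scores = [sum(w * v for w, v in zip(weights, values))
              for values in objective_values]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical (correlation, distance) values of four non-dominated solutions.
pareto_values = [(2.0, 40.0), (3.5, 60.0), (5.0, 75.5), (8.0, 90.0)]

min_corr = select_by_weights(pareto_values, (-1, 0))  # minimum correlation
max_dist = select_by_weights(pareto_values, (0, 1))   # maximum distance
print(min_corr, max_dist)
```

With the weights (-1,0), the distance is ignored and the negative sign turns the maximum of the weighted sum into the minimum of the correlation.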
To choose another solution, run:
ag.select(close_to=(1,75))
The following criterion is used to select a solution: the first number, 1, is the zero-based index of the objective and therefore refers to the second objective, which is distance in our case. The second number, 75, is the target value. So, the above will choose the solution whose distance value is closest to 75. There are other ways of selecting a solution, such as specifying the index of a solution with ag.select(index=0).
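The close_to criterion can be pictured as a nearest-value lookup on one objective. The sketch below is a plain-Python illustration under that assumption and is not the AutoGeneS implementation; `select_close_to` and the objective values are hypothetical.

```python
def select_close_to(objective_values, close_to):
    """Return the index of the solution whose chosen objective is nearest the target."""
    objective_index, target = close_to
    return min(range(len(objective_values)),
               key=lambda i: abs(objective_values[i][objective_index] - target))


# Hypothetical (correlation, distance) values of four non-dominated solutions.
pareto_values = [(2.0, 40.0), (3.5, 60.0), (5.0, 75.5), (8.0, 90.0)]
chosen = select_close_to(pareto_values, (1, 75))
print(chosen, pareto_values[chosen])
```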
Now that we have chosen a solution with a set of genes, we can deconvolute bulk samples:
coef = ag.deconvolve(bulk_data, model='nnls')
print(coef)
We recommend normalizing the coefficients after the analysis.
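A common way to normalize is to divide each sample's coefficients by their sum so that they can be read as proportions. The sketch below assumes one row per bulk sample and one column per cell type; this layout, the function `normalize_rows`, and the values are assumptions made for illustration.

```python
def normalize_rows(coef):
    """Scale each row of coefficients so that it sums to 1."""
    normalized = []
    for row in coef:
        total = sum(row)
        # Leave all-zero rows unchanged to avoid dividing by zero.
        normalized.append([value / total for value in row] if total else list(row))
    return normalized


# Hypothetical coefficients: 2 bulk samples x 3 cell types.
coef = [
    [2.0, 1.0, 1.0],
    [0.5, 0.0, 1.5],
]
proportions = normalize_rows(coef)
print(proportions)
```

If the coefficients are held in a NumPy array with the same layout, `coef / coef.sum(axis=1, keepdims=True)` performs the same row-wise scaling.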
It is possible to compress all of the above steps into a single line:
ag.pipeline(adata,bulk_data,ngen=5000,nfeatures=400,seed=0,mode='fixed',close_to=(1,75),model='nnls')
This should produce the same result!
For more information on each step, please refer to the API. For more extensive examples, see the examples section.
API¶
Import AutoGeneS as:
import autogenes as ag
Main functions¶
| ag.init | Preprocesses input data |
| ag.optimize | Runs multi-objective optimizer |
| ag.plot | Plots objective values of solutions |
| ag.select | Selects a solution |
| ag.deconvolve | Performs bulk deconvolution |
Auxiliary functions¶
| ag.pipeline | Runs the optimizer, selection and deconvolution using one method |
| ag.resume | Resumes an optimization process that has been interrupted |
| ag.save | Saves current state to a file |
| ag.load | Loads a state from a file |
References¶
[Aliee, Hananeh and Theis, Fabian, AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution](https://www.biorxiv.org/content/early/2020/02/23/2020.02.21.940650)