
Quick start tutorial

Assuming L1L2Py and PPlus are already installed, L1L2Signature may be installed using standard Python tools (with administrative or sudo permissions on GNU/Linux platforms):

$ pip install L1L2Signature

or

$ easy_install L1L2Signature

Installation from sources

If you prefer to install L1L2Signature manually, download the source tar.gz from our BitBucket repository, then extract it and move into the root directory:

$ tar xvf L1L2Signature-|release|.tar.gz
$ cd L1L2Signature-|release|/

From here, you may use the standard Python installation step:

$ python setup.py install

Note

If you would like to use L1L2Signature in a distributed environment, you need to install L1L2Py and PPlus on each node, as described in the documentation of both libraries. You can also run the L1L2Signature scripts on a single machine (installing the dependencies only there) using the PPlus debug mode, which is configurable from the L1L2Signature configuration file, as described below.

After installing L1L2Signature, you should have access to three scripts, named with a common l1l2_ prefix:

$ l1l2_<TAB>
l1l2_analysis.py    l1l2_run.py    l1l2_tau_choice.py

This tutorial assumes that you have downloaded and extracted the L1L2Signature source package, which contains a Golub99_Leukemia directory with test data that will be used to demonstrate the functionalities of the L1L2Signature tools.

L1L2Signature needs only 3 ingredients:

  • A gene expressions matrix
  • A set of labels, one for each sample (e.g. the phenotype)
  • A configuration file

Input data format

Input data (gene expression matrix and labels) are assumed to be textual and separated by a delimiter character. For example, the given data matrix (of Leukemia gene expressions) is a text file where samples are organized by column and microarray probes by row, and gene expression values are separated by a comma (','):

Gene Accession Number,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,34,35,36,37,38,28,29,30,31,32,33
AF009426_at,36,58,63,38,120,92,16,169,43,-18,26,-81,123,89,268,171,53,-5,86,267,206,-23,-87,135,-35,2,20,-105,-67,-43,-45,-42,-44,-21,-33,-41,9,-107
AF009674_at,-66,-239,-328,307,212,314,-653,332,221,-106,-169,-446,-93,66,123,-60,-404,-146,143,331,573,-155,-288,463,90,129,-799,-6,-315,-207,-653,62,55,143,87,-579,-387,479
AF010193_at,87,548,93,-31,159,101,119,589,245,451,91,484,458,158,128,225,555,288,97,1757,363,598,207,153,534,149,680,74,211,439,169,248,465,130,407,194,227,224
AF012270_at,-12,139,237,104,29,-8,223,416,10,59,29,91,-14,105,37,106,485,134,109,104,144,158,106,78,95,110,34,-95,113,110,234,240,232,86,69,194,73,264
AF014958_at,318,243,19,108,40,112,329,166,110,159,96,-71,139,112,-36,345,1057,176,268,180,382,15,197,32,306,110,424,81,259,418,159,340,17,174,197,478,444,565
AF015910_at,1219,1055,1820,1074,822,394,1154,1763,1157,902,837,178,1077,171,316,1317,2041,1140,1053,1767,630,1212,281,1559,460,1196,843,914,240,1735,967,666,563,539,1934,315,741,1908
AF015913_at,476,611,850,691,1304,346,653,550,826,472,439,239,1558,705,942,560,1366,369,584,2460,0,206,537,1205,337,385,536,424,300,412,270,253,425,346,352,474,493,410
AF015950_at,-466,-49,-277,-670,-180,-453,-700,-191,-163,-183,-329,-431,-43,-296,-73,334,-436,105,-282,29,-176,-56,-142,221,-137,-211,-863,-852,-529,-646,-574,-525,-346,-344,-261,-497,-544,-602
AJ000480_at,895,749,1049,1016,634,752,920,254,934,536,607,456,617,592,624,608,925,833,473,801,348,592,478,944,682,611,1391,924,524,2210,1525,983,2416,570,1827,1000,1456,2912
AJ001047_at,-451,-108,-265,-378,-171,-257,-514,-250,-291,-218,-129,-137,-146,-128,-159,-223,-311,-256,-209,-415,-122,-252,-355,-325,-173,-333,-552,-341,-310,-467,-267,-263,-150,89,-72,-177,-253,-577
AJ001421_at,687,829,914,982,1294,743,680,527,1352,675,383,616,1071,1004,1106,547,1790,574,660,1828,394,996,457,1366,486,141,342,588,613,1665,1164,533,1038,671,1821,644,630,792
...

The labels file contains information about the given samples, indicating whether they belong to the ALL (Acute Lymphoblastic Leukemia) or the AML (Acute Myeloid Leukemia) group:

sample,status
1,ALL
2,ALL 
3,ALL 
4,ALL 
5,ALL 
28,AML
29,AML 
30,AML 
31,AML 
32,AML 
33,AML 
...

See also the l1l2signature.utils.BioDataReader API and the next section for more information.
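
As a quick check, the two files can be read with a few lines of Python. This is a generic sketch using pandas, not the BioDataReader API; the file names and delimiter follow the default configuration shown in the next section:

import pandas as pd

# Generic reading sketch (not the l1l2signature.utils.BioDataReader API).
expressions = pd.read_csv('data/gedm.csv', sep=',', index_col=0)
labels = pd.read_csv('data/labels.csv', sep=',', index_col=0)

X = expressions.T.values      # samples on rows, probes on columns
y = labels['status'].values   # 'ALL' / 'AML' for each sample
print(X.shape, y.shape)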

Configuration File

The L1L2Signature configuration file is a standard Python script: it is imported as a module and all of its code is executed. All configuration options are mandatory, and a default configuration file, like the one that follows, can be generated with the l1l2_run.py script (see the Experiment runner section):

# Configuration file example for L1L2Signature
# version: '0.2.2'

import l1l2py

#~~ Data Input/Output ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# * Data assumed csv with samples and features labels
# * All the path are w.r.t. config file path
data_matrix = 'data/gedm.csv'
labels = 'data/labels.csv'
delimiter = ','
samples_on = 'col' # or 'row': samples on cols or rows
result_path = '.'

#~~ Data filtering/normalization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sample_remover = None # removes samples with this label value
variable_remover = 'affx' # remove vars where name starts with (not case-sens.)
data_normalizer = l1l2py.tools.center
labels_normalizer = None

#~~ Cross validation options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# * See l1l2py.tools.{kfold_splits, stratified_kfold_splits}
external_k = 4   # (None means Leave One Out)
internal_k = 3
cv_splitting = l1l2py.tools.stratified_kfold_splits

#~~ Errors functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# * See l1l2py.tools.{regression_error, classification_error,
#                     balanced_classification_error}
cv_error = l1l2py.tools.regression_error
error = l1l2py.tools.balanced_classification_error
positive_label = None # Indicates the positive class in case of 2-class task

#~~ L1l2 Parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# * Ranges will be sorted from smaller to bigger value!
# * See l1l2py.tools.{geometric_range, linear_range}
tau_range = l1l2py.tools.geometric_range(1e-3, 0.5, 20) # * MAX_TAU
mu_range = l1l2py.tools.geometric_range(1e-3, 1.0, 3)   # * CORRELATION_FACTOR
lambda_range = l1l2py.tools.geometric_range(1e0, 1e4, 10)

#~~ Signature Parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
frequency_threshold = 0.5

#~~ PPlus options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
debug = True # If True, the experiment runs only on the local pc cores
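
Because the configuration file is plain Python, it is loaded by importing and executing it as a module. A minimal sketch of this mechanism (not necessarily the exact loader used by the scripts; config.py is a placeholder name):

import importlib.util

def load_config(path):
    # Import the configuration file as a module and execute its code
    spec = importlib.util.spec_from_file_location('l1l2_config', path)
    config = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(config)
    return config

config = load_config('config.py')
print(config.external_k, config.tau_range)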

The configuration file is fully documented, and it imports L1L2Py in order to use some of its tools. Users are free to plug in their own functions as long as they follow the same API. For example, a custom error function must accept the true and the predicted labels and return a single float:

import numpy as np

def my_error(true_labels, predicted_labels):
    # Any scalar error measure works; here, the plain misclassification rate
    return float(np.mean(np.asarray(true_labels) != np.asarray(predicted_labels)))

After defining all the options needed to read the data and perform the model assessment, the crucial step is to properly define the ranges of the parameters involved, namely \tau, \mu and \lambda (see L1L2Py for a complete explanation). A good choice for \lambda is usually a wide range of values in geometric progression.

For \tau, instead, the range is automatically scaled with respect to the maximum value (MAX_TAU) it can assume: values greater than MAX_TAU produce empty models (no genes selected). The tau_range option is therefore interpreted as relative to MAX_TAU.

For \mu we implemented a heuristic that follows the same criterion used for \tau. Since \mu controls the amount of correlation we want to introduce into the generated signatures, we use values that are relative to each split (sub)matrix: a CORRELATION_FACTOR, based on the spectral properties of the correlation (sub)matrix, is computed and used as a scaling factor.
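
To illustrate how a relative tau_range is turned into absolute values, the following sketch scales the range by an estimated MAX_TAU. The bound formula used here is an assumption for illustration only (the scripts compute MAX_TAU internally), while l1l2py.tools.geometric_range is the same helper used in the configuration file:

import numpy as np
import l1l2py

def absolute_tau_range(data, labels, relative_range):
    # Assumed bound: for the (1/n)||y - Xb||^2 + tau||b||_1 + mu||b||^2 functional,
    # tau values above 2/n * max|X^T y| select no variables at all.
    n = data.shape[0]
    max_tau = 2.0 / n * np.max(np.abs(np.dot(data.T, labels)))
    return np.sort(np.asarray(relative_range)) * max_tau

# Example with the relative range used in the configuration file above
relative_taus = l1l2py.tools.geometric_range(1e-3, 0.5, 20)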

Tau choice helper

To help users choose a good relative \tau range, the l1l2_tau_choice.py script can be used; it has the following prototype:

$ l1l2_tau_choice.py --help
Usage: l1l2_tau_choice.py [-s] configuration-file.py

Options:
  --version   show program's version number and exit
  -h, --help  show this help message and exit
  -s, --show  show interactive plot

It needs a properly filled configuration file, with an initial tau range, and performs the following steps:

  • Reads data
  • Checks MAX_TAU value (it must select at least 1 variable)
  • Calculates a full path of \ell_1\ell_2 solutions using the given \tau range (and a \mu very close to zero). The user may check the minimum and maximum number of selected variables, estimating the minimal (correlation-free) gene signature lengths (see the sketch after this list).
  • Performs a 5-fold cross validation, saving a tau_choice_plot.png (like the one in the next figure) into the configuration file directory. The user can thus estimate a rough (and possibly biased) performance with the given ranges of values.
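
A rough version of the path step can be sketched directly with l1l2py; here l1l2py.algorithms.l1l2_path is assumed to be available with the signature (data, labels, mu, tau_range), so please check the API of your installed l1l2py version:

import numpy as np
import l1l2py

# Toy data standing in for the expression matrix (samples on rows) and the labels
rng = np.random.RandomState(0)
X, y = rng.randn(38, 200), np.sign(rng.randn(38))

tau_range = l1l2py.tools.geometric_range(0.01, 1.0, 20)  # absolute values here
# Assumed API: one coefficient vector per tau value, mu kept very close to zero
betas = l1l2py.algorithms.l1l2_path(X, y, 1e-12, tau_range)
selected = [int(np.count_nonzero(beta)) for beta in betas]
print(min(selected), max(selected))  # minimal/maximal signature length along the path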
[Figure tau_choice_plot.png: Tau choice plot for the Golub dataset with a default configuration file]

We usually perform this step several times in order to guide the choice of the parameter ranges before launching the real experiment (which usually requires many hours).

Experiment runner

The l1l2_run.py script executes the full framework described in the Framework overview section. Its prototype is the following:

$ l1l2_run.py --help
Usage: l1l2_run.py [-c] configuration-file.py

Options:
  --version     show program's version number and exit
  -h, --help    show this help message and exit
  -c, --create  create config file

As described above, this script may also be used to generate a valid default configuration file (-c option).

When launched, the script reads and splits the data, then runs L1L2Py on each external split and collects the results in a new sub-directory of the result_path directory (see Configuration File). Such a directory is named:

l1l2_result_<TIMESTAMP>

and it contains all information needed for the following analysis step.
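
For convenience, the most recent result directory can be located programmatically before running the analysis step; this is a small helper sketch, not part of the L1L2Signature scripts:

import glob
import os

# Pick the most recently modified result directory (naming pattern shown above)
result_dirs = sorted(glob.glob('l1l2_result_*'), key=os.path.getmtime)
if result_dirs:
    print(result_dirs[-1])  # pass this path to l1l2_analysis.py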

Note

Note that the data and the configuration file are hard-linked inside the result directory, which therefore becomes completely portable and self-contained.

Results analysis

This is the last step, performed in order to get useful summaries and plots from an already executed experiment. The l1l2_analysis.py script accepts as its only parameter a previously created result directory:

$ l1l2_analysis.py --help
Usage: l1l2_analysis.py result-dir

Options:
  --version   show program's version number and exit
  -h, --help  show this help message and exit
  -d DPI, --dpi=DPI  figures dpi resolution (default 300)
  -s, --show         show interactive plots

The script prints a short summary and produces a set of textual and graphical results.

Cross Validation Errors

The script generates a set of kcv_err_split_*.png plots, one for each external split (showing the error averaged across the internal splits). Moreover, it generates an averaged plot, avg_kcv_err.png. On each plot, a blue dot indicates the minimum.

[Figure avg_kcv_err.png: Averaged Cross Validation Error for the Golub dataset with a default configuration file]

See also: l1l2signature.plots.kfold_errors().

Prediction Errors

The script generates a box plot for both test and training errors, saved respectively as prediction_error_ts.png and prediction_error_tr.png. They show the prediction error averaged over the external splits, in order to assess the performance and stability of the signatures for each considered correlation level (\mu value).

[Figure prediction_error_ts.png: Prediction Error Box Plot for the Golub dataset with a default configuration file]

See also: l1l2signature.plots.errors_boxplot().

Frequencies Threshold

To help the user define a good stability threshold (see frequency_threshold in Configuration File), the script also plots (and prints, and saves as selected_over_threshold.png) an overall summary of the number of genes selected for different thresholds and for each correlation level.

[Figure selected_over_threshold.png: Number of selected variables with respect to a frequency (stability) threshold over the external split signatures (Golub dataset)]

See also: l1l2signature.plots.selected_over_threshold() and l1l2signature.utils.selection_summary().

Signatures Heatmaps

In the case of classification (automatically identified when the labels assume only two values), the script creates a heatmap plot for each final signature (so the heatmaps also depend on the frequency_threshold option value). Images are saved as heatmap_mu*.png files, where samples and variables are clustered in order to improve the visualization.

Note

With a very small frequency_threshold (e.g. 0.0), the signature contains the full list of variables. In that case, the variables are not clustered but only ordered by frequency across splits. The l1l2signature.plots.heatmap() function can be used to generate such a plot.

[Figure heatmap_mu1.png: Heatmap for the Golub dataset with a default configuration file]

See also: l1l2signature.plots.heatmap() and l1l2signature.utils.ordered_submatrices().

Samples in PCA space

In the case of classification (automatically identified when the labels assume only two values), the script also plots the samples in a 3D space obtained with Principal Component Analysis (PCA), for each final signature. Images are saved as pca_mu*.png files.

[Figure pca_mu1.png: PCA plot for the Golub dataset with a default configuration file]

See also: l1l2signature.plots.pca().

Performance Statistics

The analysis script also produces textual results, saved into a stats.txt file. The file is divided into sections, each containing at least one short table.

Optimal Parameters

Optimal Parameters
==========================================================================
Split  #  |    lambda*    |     tau*      |   log(lam*)   |   log(tau*)   
----------+---------------+---------------+---------------+---------------
Split   1 |         1.000 |        40.550 |         0.000 |         1.608 
Split   2 |         1.000 |       555.132 |         0.000 |         2.744 
Split   3 |         1.000 |       150.035 |         0.000 |         2.176 
Split   4 |         1.000 |        56.239 |         0.000 |         1.750 

This table reports the best parameter pair (\lambda^*, \tau^*) found in each cross validation split. They correspond to the blue points in the generated Cross Validation Errors images.
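
The log columns are base-10 logarithms of the corresponding parameter values, as a quick check on the numbers above shows:

import numpy as np

taus = np.array([40.550, 555.132, 150.035, 56.239])
print(np.log10(taus))  # approximately [1.608, 2.744, 2.176, 1.750], matching log(tau*)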

Prediction errors

Test set Prediction
==========================================================
Stat type |      mu1      |      mu2      |      mu3      
----------+---------------+---------------+---------------
mean      |         0.131 |         0.066 |         0.031 
std       |         0.139 |         0.077 |         0.035 
median    |         0.101 |         0.062 |         0.030 
----------+---------------+---------------+---------------
sqrt(mean)|         0.361 |         0.257 |         0.175 
sqrt(std) |         0.372 |         0.277 |         0.188 
sqrt(med) |         0.318 |         0.250 |         0.173 


Training set Prediction
==========================================================
Stat type |      mu1      |      mu2      |      mu3      
----------+---------------+---------------+---------------
mean      |         0.000 |         0.000 |         0.000 
std       |         0.000 |         0.000 |         0.000 
median    |         0.000 |         0.000 |         0.000 
----------+---------------+---------------+---------------
sqrt(mean)|         0.000 |         0.000 |         0.000 
sqrt(std) |         0.000 |         0.000 |         0.000 
sqrt(med) |         0.000 |         0.000 |         0.000 

These tables show the averaged prediction error of the signatures, before frequency/stability thresholding, for each value of the correlation parameter \mu. They correspond to the generated Prediction Errors box plots.
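
The sqrt(...) rows simply report the square roots of the statistics above, which is convenient when the chosen error is a squared loss such as regression_error. A minimal sketch with made-up per-split errors:

import numpy as np

# Hypothetical per-split test errors for a single mu value (illustrative numbers only)
errors = np.array([0.05, 0.10, 0.15, 0.30])

stats = {'mean': errors.mean(), 'std': errors.std(), 'median': np.median(errors)}
sqrt_stats = {name: np.sqrt(value) for name, value in stats.items()}
print(stats, sqrt_stats)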

Classification Performances

Confusion matrix and classification performance with mu3
===============================================================================
  Real vs Pred. |        AML         |        ALL         |   Predictive Values
----------------+--------------------+--------------------+--------------------
            AML |         11         |          2         |             84.62 %
----------------+--------------------+--------------------+--------------------
            ALL |          0         |         25         |            100.00 %
----------------+--------------------+--------------------+--------------------
     True rates |      100.00 %      |      92.59 %       |                    

This table, generated in classification only and for each \mu, summarizes the classification performance of the signature. It is a classical confusion matrix where, for example, the entry AML vs ALL indicates the number of real AML samples classified as belonging to the ALL class (results generated on the given Golub “Leukemia” dataset). On the bottom we have the True rates, #(Correctly predicted as C) / #(Predicted as C), while on the right we have the Predictive values, #(Correctly predicted as C) / #(Samples in C).
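
With the definitions above, both rates can be recomputed directly from the confusion matrix entries; the following sketch just re-derives the percentages reported in the table:

import numpy as np

# Confusion matrix from the table above: rows = real class, columns = predicted class
cm = np.array([[11.0,  2.0],   # real AML
               [ 0.0, 25.0]])  # real ALL

predictive_values = cm.diagonal() / cm.sum(axis=1)  # correctly predicted as C / samples in C
true_rates = cm.diagonal() / cm.sum(axis=0)         # correctly predicted as C / predicted as C

print(predictive_values)  # about [0.8462, 1.0]  -> 84.62 % and 100.00 %
print(true_rates)         # about [1.0, 0.9259]  -> 100.00 % and 92.59 %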

Moreover, this section contains some other performance measures:

Classification performance measures:
  * Accuracy:          94.74 %
  * Balanced Accuracy: 96.30 %
  * MCC:               0.8851

where MCC is the Matthews correlation coefficient.
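
The same confusion matrix also yields these aggregate measures; in this sketch balanced accuracy is computed as the mean of the true rates, which reproduces the 96.30 % reported above:

import numpy as np

cm = np.array([[11.0,  2.0],   # real AML: predicted as AML, ALL
               [ 0.0, 25.0]])  # real ALL: predicted as AML, ALL
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]  # AML taken as positive

accuracy = (tp + tn) / cm.sum()                              # 0.9474
balanced_accuracy = np.mean(cm.diagonal() / cm.sum(axis=0))  # 0.9630
mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # 0.8851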

Finally, if the positive_label parameter is given in the Configuration File, the script also calculates some measures that assume the presence of a positive class, as in the case of patients vs. controls. For example, the following table is generated if, for the “Leukemia” dataset, the AML class is considered positive:

Considering AML as the positive class:
  * Sensitivity:   100.00 %
  * Specificity:   92.59 %
  * Precision:     84.62 %
  * Recall:        100.00 %
  * F-measure:     0.9167

Note that some of these measures are simply aliases of the ones already reported in the last row or column of the confusion matrix.

See also: l1l2signature.utils.confusion_matrix() and l1l2signature.utils.classification_measures().

Signatures

The script also generates the set of signatures, each written to a separate text file signature_mu*.txt in order to simplify parsing. Each file contains the ordered list of probes belonging to the signature:

index	probe	freq	num_sel
1823	M27891_at	1.000	4/4
4877	Y00433_at	1.000	4/4
1720	M19507_at	1.000	4/4
6142	Y00787_s_at	1.000	4/4
2343	M96326_rna1_at	1.000	4/4
6150	Z19554_s_at	0.750	3/4
1317	L19779_at	0.750	3/4
5589	D49824_s_at	0.750	3/4
6730	HG3576-HT3779_f_at	0.750	3/4
6120	M14328_s_at	0.750	3/4
6825	U01317_cds4_at	0.750	3/4
2286	M91036_rna1_at	0.750	3/4
5249	L20688_at	0.750	3/4
5493	L06797_s_at	0.750	3/4
4560	X78992_at	0.750	3/4
1050	J04164_at	0.750	3/4
6554	Z48501_s_at	0.750	3/4
1615	M11147_at	0.750	3/4
1335	L20941_at	0.750	3/4
0	hum_alu_at	0.750	3/4
1704	M17733_at	0.500	2/4
3199	U46751_at	0.500	2/4
1645	M13792_at	0.500	2/4
1635	M12886_at	0.500	2/4
224	D21261_at	0.500	2/4
6109	M13560_s_at	0.500	2/4
2583	U05259_rna1_at	0.500	2/4
1491	L38941_at	0.500	2/4
4137	X17042_at	0.500	2/4
5880	J02621_s_at	0.500	2/4
6122	M14483_rna1_s_at	0.500	2/4
5657	X57351_s_at	0.500	2/4
2127	M69043_at	0.500	2/4
2175	M77232_rna1_at	0.500	2/4
6141	M28130_rna1_s_at	0.500	2/4
5651	M25079_s_at	0.500	2/4
5170	Z84721_cds2_at	0.500	2/4
4099	X15183_at	0.500	2/4

The file is tab-delimited; the signatures are thresholded with respect to the frequency_threshold option and correspond to the signatures used to generate the Heatmaps plots.

See also: l1l2signature.utils.signatures() and l1l2signature.utils.selection_summary().
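
Since the files are plain tab-delimited text with a header row, they are straightforward to parse; a minimal sketch (signature_mu1.txt is just one instance of the signature_mu*.txt pattern):

import csv

# Read one signature file and keep the probe names with their selection frequency
with open('signature_mu1.txt') as handle:
    reader = csv.DictReader(handle, delimiter='\t')
    signature = [(row['probe'], float(row['freq'])) for row in reader]

print(signature[:3])  # [('M27891_at', 1.0), ('Y00433_at', 1.0), ('M19507_at', 1.0)]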