SlipGURU Dipartimento di Informatica e Scienze dell'Informazione Università Degli Studi di Genova

Main functions (l1l2py)

This module implements the two main stages of the \ell_1\ell_2 variable selection with double optimization, as in [DeMol09b].

Given a supervised training set (\mathbf{X}, \mathbf{Y}), the aim is to select a linear model built on few relevant input variables with good prediction ability.

The linear model is \mathbf{X}\boldsymbol{\beta}, where \boldsymbol{\beta} is found as the minimizer of the (naive) elastic-net functional combined with a regularized least squares functional:

\frac{1}{n} \| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \|_2^2
+ \mu \|\boldsymbol{\beta}\|_2^2
+ \tau \|\boldsymbol{\beta}\|_1

\frac{1}{n} \| \mathbf{Y} - \mathbf{\tilde{X}}\boldsymbol{\tilde{\beta}} \|_2^2
+ \lambda \|\boldsymbol{\tilde{\beta}}\|_2^2

in which \boldsymbol{\tilde{\beta}} and \mathbf{\tilde{X}} represent, respectively, the weights vector and the input matrix restricted to the genes selected by the \ell_1\ell_2 selection.
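
To make the two functionals concrete, the following minimal sketch evaluates them for a given weights vector. It is plain Python written for illustration only and is not part of l1l2py; the helper names l1l2_functional and rls_functional are hypothetical, and data are passed as nested lists rather than ndarrays to keep the example dependency-free.

```python
def l1l2_functional(X, y, beta, mu, tau):
    """Evaluate (1/n)*||y - X beta||_2^2 + mu*||beta||_2^2 + tau*||beta||_1.

    Hypothetical helper: X is a list of rows, y and beta are flat lists.
    """
    n = len(y)
    # Residuals y_i - <x_i, beta> for every sample.
    residuals = [
        yi - sum(xij * bj for xij, bj in zip(row, beta))
        for row, yi in zip(X, y)
    ]
    data_term = sum(r * r for r in residuals) / n
    l2_term = mu * sum(b * b for b in beta)
    l1_term = tau * sum(abs(b) for b in beta)
    return data_term + l2_term + l1_term


def rls_functional(X_sel, y, beta_sel, lam):
    """Evaluate the regularized least squares functional on the selected variables.

    It has the same form as the functional above with tau = 0 and mu = lam.
    """
    return l1l2_functional(X_sel, y, beta_sel, mu=lam, tau=0.0)


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
value = l1l2_functional(X, y, beta=[1.0, 0.0], mu=0.5, tau=0.25)
```

Here only the first variable has a non-zero weight, so the restricted matrix passed to rls_functional would contain just the first column of X.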

The optimal solution depends on two regularization parameters, \tau and \lambda, and one correlation parameter, \mu, and is found in two stages:

  • Stage I (minimal_model)

    This stage aims at selecting the optimal pair of regularization parameters \tau_{opt} and \lambda_{opt} within a k-fold cross validation loop for a fixed and small value of the correlation parameter \mu.

    The function follows exactly the pseudocode described in [DeMol09b] (p. 7, Stage I).

  • Stage II (nested_models)

    For fixed \tau_{opt} and \lambda_{opt}, Stage II identifies the set of relevant lists of variables for increasing values of the correlation parameter \mu.

    Note

    For \tau_{opt} and \lambda_{opt}, the lists of relevant variables have the same prediction power [DeMol09a].

    The function follows exactly the pseudocode described in [DeMol09b] (p. 7, Stage II).

This module also provides a wrapper function (model_selection) that runs the two stages sequentially.

Stage I: Minimal Model Selection

l1l2py.minimal_model(data, labels, mu, tau_range, lambda_range, cv_splits, error_function, data_normalizer=None, labels_normalizer=None)

Minimal model selection.

Given a supervised training set (data and labels) and a fixed value of mu (which should be the minimum value considered), it finds the values in tau_range and lambda_range that minimize the prediction error via cross validation (see error functions in the l1l2py.tools module).

Cross validation splits must be provided (cv_splits) as a list of pairs containing training-set and validation-set indexes (see cross validation tools in the l1l2py.tools module).
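
The expected shape of cv_splits can be illustrated with a small sketch. The helper kfold_index_pairs below is hypothetical and is not the l1l2py.tools implementation; it only shows a list of (training indexes, validation indexes) pairs, built without shuffling or stratification.

```python
def kfold_index_pairs(n_samples, k):
    """Return k (training_indexes, validation_indexes) pairs over range(n_samples).

    Hypothetical helper: sample i is assigned to validation fold i % k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indexes = list(range(n_samples))
    splits = []
    for fold in range(k):
        validation = indexes[fold::k]  # every k-th sample, starting at `fold`
        training = [i for i in indexes if i % k != fold]
        splits.append((training, validation))
    return splits


cv_splits = kfold_index_pairs(6, 3)
```

Each sample appears in exactly one validation set, and no pair shares indexes between its training and validation lists.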

Data and labels will be normalized on each split using the functions data_normalizer and labels_normalizer (see data normalization functions in the l1l2py.tools module).

Warning

The number of valid (non-void) solutions may differ across cross validation splits, typically for high values of tau. The function calculates the optimum value of tau only among the values for which the model is not void on all cross validation splits.

This means that in extreme cases the output could be void.
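
The effect of the warning can be sketched as a simple filter. The function below is a hypothetical illustration, not the library's code: it assumes that for every split we already know which tau values produced a non-void model, and keeps only the values that are valid on every split.

```python
def taus_valid_on_all_splits(tau_range, nonvoid_per_split):
    """Keep the tau values whose model is non-void on every split.

    Hypothetical helper: nonvoid_per_split holds one list of booleans per
    split, aligned with tau_range (True means the model was not void).
    """
    return [
        tau
        for t, tau in enumerate(tau_range)
        if all(flags[t] for flags in nonvoid_per_split)
    ]


tau_range = [0.1, 0.5, 1.0]
flags = [
    [True, True, False],   # split 1: void only for the largest tau
    [True, True, True],    # split 2: never void
    [True, False, False],  # split 3: void for the two largest taus
]
valid_taus = taus_valid_on_all_splits(tau_range, flags)
```

If no tau value survives the filter, the list is empty, which corresponds to the extreme case mentioned in the warning.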

Parameters :

data : (N, P) ndarray

Data matrix.

labels : (N,) or (N, 1) ndarray

Labels vector.

mu : float

Minimum l2 norm penalty (l1l2 functional).

tau_range : array_like of T floats

l1 norm penalties (l1l2 functional).

lambda_range : array_like of L floats

l2 norm penalties (RLS functional).

cv_splits : array_like of tuples

Each tuple contains two lists with the training set and testing set indexes.

error_function : function object

Cross validation error function.

data_normalizer : function object, optional (default is None)

Data normalization function.

labels_normalizer : function object, optional (default is None)

Labels normalization function.

Returns :

err_ts : (< T, L) ndarray

Matrix of average cross validation error on the test set. The first dimension depends on the number of valid tau values, even zero.

err_tr : (< T, L) ndarray

Matrix of average cross validation error on the training set. The first dimension depends on the number of valid tau values, even zero.

Raises :

ValueError :

If the given range of tau values produces all void solutions with the given data splits.

Stage II: Nested lists generation

l1l2py.nested_models(data, labels, test_data, test_labels, mu_range, tau, lambda_, error_function, data_normalizer=None, labels_normalizer=None, return_predictions=False)

The function generates the models with the (almost) nested lists of selected variables.

Given a training set (data and labels) and a test set (test_data and test_labels), for fixed values of tau and lambda_ (which should be the optimal values estimated at Stage I), it calculates one model for each increasing value in mu_range.

Data and labels will be normalized using the functions data_normalizer and labels_normalizer (see data normalization functions in the l1l2py.tools module).

The function returns test and training errors using the error_function provided (see error functions in the l1l2py.tools module).

Parameters :

data : (N, P) ndarray

Data matrix.

labels : (N,) or (N, 1) ndarray

Labels vector.

test_data : (T, P) ndarray

Test set matrix.

test_labels : (T,) or (T, 1) ndarray

Test set labels vector.

mu_range : array_like of M floats

l2 norm penalties (l1l2 functional).

tau : float

Optimal l1 norm penalty (l1l2 functional).

lambda_ : float

Optimal l2 norm penalty (RLS functional).

error_function : function object

Error function.

data_normalizer : function object, optional (default is None)

Data normalization function.

labels_normalizer : function object, optional (default is None)

Labels normalization function.

Returns :

beta_list : list of M (S,1) ndarray

Models calculated for each value in mu_range.

selected_list : list of M (P,) ndarray of boolean

Selected features for each model calculated.

err_ts_list : list of M floats

Test error for the models calculated.

err_tr_list : list of M floats

Training error for the models calculated.

prediction_ts_list : list of M (T, 1) ndarray, optional

Prediction vector calculated for each value in mu_range on the test set. Returned only if return_predictions is True.

prediction_tr_list : list of M (N, 1) ndarray, optional

Prediction vector calculated for each value in mu_range on the training set. Returned only if return_predictions is True.

Raises :

ValueError :

If the given value of tau produces a void solution with the given data.
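
Because the lists are only almost nested, it can be useful to check where nesting fails. The sketch below is a hypothetical helper, not part of l1l2py; it assumes that selected_list is ordered by increasing mu and that the selections are expected to grow with mu, and it reports the variables selected in one model but missing from the next.

```python
def nesting_violations(selected_list):
    """For each consecutive pair of masks, list the variables that break nesting.

    Hypothetical helper: entry i contains the indexes selected in model i
    but not in model i + 1. Empty lists everywhere mean perfectly nested lists.
    """
    violations = []
    for smaller, larger in zip(selected_list, selected_list[1:]):
        dropped = [
            j for j, (a, b) in enumerate(zip(smaller, larger)) if a and not b
        ]
        violations.append(dropped)
    return violations


masks = [
    [True, False, False, False],
    [True, True, False, False],
    [False, True, True, False],  # variable 0 is dropped here
]
report = nesting_violations(masks)
```

The same function works on boolean ndarrays, since it only iterates over the masks and tests their entries for truth.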

Complete model selection

l1l2py.model_selection(data, labels, test_data, test_labels, mu_range, tau_range, lambda_range, cv_splits, cv_error_function, error_function, data_normalizer=None, labels_normalizer=None, sparse=False, regularized=True, return_predictions=False)

Complete model selection procedure.

It executes the two stages implemented in minimal_model and nested_models and returns their output wrapped in a dictionary.

Note that the error function calculated in Stage I may have more than one minimum.

By default, within the set of (tau, lambda) pairs with minimum error, the less sparse but more regularized solution (minimum value of tau and maximum value of lambda) is selected.

The boolean parameters sparse and regularized allow this behaviour to be changed.
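
The tie-breaking rule can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper select_optimal_pair is hypothetical, tau_values and lambda_values are assumed to be sorted in increasing order and aligned with the rows and columns of the error matrix, ties are detected by exact equality, and tau is chosen before lambda.

```python
def select_optimal_pair(err_ts, tau_values, lambda_values,
                        sparse=False, regularized=True):
    """Pick one (tau, lambda) pair among those with minimum error.

    Hypothetical helper. Assumes increasing tau_values and lambda_values,
    so a larger tau means a sparser model and a larger lambda a more
    regularized one. Tau is selected first, then lambda (an assumption).
    """
    best = min(min(row) for row in err_ts)
    ties = [
        (t, j)
        for t, row in enumerate(err_ts)
        for j, err in enumerate(row)
        if err == best
    ]
    pick_tau = max if sparse else min
    tau_idx = pick_tau(t for t, _ in ties)
    pick_lambda = max if regularized else min
    lambda_idx = pick_lambda(j for t, j in ties if t == tau_idx)
    return tau_values[tau_idx], lambda_values[lambda_idx]


errors = [
    [0.3, 0.1, 0.1],
    [0.1, 0.2, 0.1],
]
taus = [0.1, 0.2]
lambdas = [1.0, 10.0, 100.0]
default_pair = select_optimal_pair(errors, taus, lambdas)
sparse_pair = select_optimal_pair(errors, taus, lambdas, sparse=True, regularized=False)
```

With the default flags the sketch returns the smallest tau and, for that tau, the largest lambda among the tied minima.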

Note

See the documentation of minimal_model and nested_models for details on each stage and the meaning of each parameter. The Parameters section describes only the sparse and regularized parameters.

Parameters :

sparse : bool, optional (default is False)

If True, the function selects at STAGE I the sparsest solution with minimum cross validation error.

regularized : bool, optional (default is True)

If True, the function selects at STAGE I the most regularized solution with minimum cross validation error.

Returns :

out : dict

Output dictionary. Depending on the parameters, the dictionary has the following keys:

kcv_err_ts : (T, L) ndarray

[STAGE I] Mean cross validation errors on the test set.

kcv_err_tr : (T, L) ndarray

[STAGE I] Mean cross validation errors on the training set.

tau_opt : float

Optimal value of tau selected in tau_range.

lambda_opt : float

Optimal value of lambda selected in lambda_range.

beta_list : list of M (S,1) ndarray

[STAGE II] Models calculated for each value in mu_range.

selected_list : list of M (P,) ndarray of boolean

[STAGE II] Selected variables for each model calculated.

err_ts_list : list of M floats

[STAGE II] List of test errors evaluated for all the models.

err_tr_list : list of M floats

[STAGE II] List of training errors evaluated for all the models.

prediction_ts_list : list of M two-dimensional ndarray, optional

[STAGE II] Prediction vectors for the models evaluated on the test set.

prediction_tr_list : list of M two-dimensional ndarray, optional

[STAGE II] Prediction vectors for the models evaluated on the training set.
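
As a usage illustration, the Stage II entries of the output dictionary can be condensed into one row per value of mu. The helper summarize_nested_models is hypothetical; it relies only on the keys selected_list, err_ts_list and err_tr_list listed above, and the dictionary used here is made-up data standing in for a real result.

```python
def summarize_nested_models(out, mu_range):
    """Return one summary dict per mu: number of selected variables and errors.

    Hypothetical helper operating on a dictionary shaped like the output
    described above.
    """
    rows = []
    for mu, selected, err_ts, err_tr in zip(
        mu_range, out["selected_list"], out["err_ts_list"], out["err_tr_list"]
    ):
        rows.append({
            "mu": mu,
            "n_selected": sum(1 for flag in selected if flag),
            "err_ts": err_ts,
            "err_tr": err_tr,
        })
    return rows


# Made-up values, used only to show the expected structure.
fake_out = {
    "selected_list": [[True, False, False], [True, True, False]],
    "err_ts_list": [0.4, 0.3],
    "err_tr_list": [0.2, 0.1],
}
summary = summarize_nested_models(fake_out, mu_range=[0.01, 0.1])
```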
