# Getting started¶

L1L2Py is a simple and lightweight python package to perform variable selection. The algoritms proposed and implemented are been well studied in different experimental settings.

The package is self contained and gives all the needed tools to generate sparse solution for a given linear classification or regression problem.

## Installation¶

L1L2Py is available open-source under the GNU GPL license. It requires Python version 2.5 or higher and the NumPy package.

There are two ways you can get it:

• Automatic Installation (recomended)

L1L2Py is available on the Python Package Index and can be installed via easy-install

$easy_install -U L1L2Py  or pip $ pip install -U L1L2Py

• Manual Installation

$tar xzvf L1L2Py-1.0.5.tar.gz$ cd L1L2Py-1.0.5
$python setup.py install  Using this manual installation, if the testing framework Nose is installed, is possible to run the given Unit Test running the nosetests script $ nosetests
....................................
----------------------------------------------------------------------
Ran 36 tests in 3.002s

OK


Moreover, in order to generate the plots showed in the following tutorial the Matplotlib package (with the mplot3d toolkit enabled) is required.

## Tutorial¶

This tutorial aims to show how the regularization with double optimization is able to perform variable selection. Moreover, the tutorial shows how to use L1L2Py on a synthetic dataset generated with a given function which simulates a linear regression problem with a subset of relevant and linearly correlated variables.

Even if it’s not mandatory, in order to better understand this tutorial the reader should read the method described in [DeMol09b], or at least the introduction of Main functions (l1l2py).

### Synthetic Data Generation¶

Using the script dataset_generation.py (L1L2Py-1.0.5/docs/tutorial/dataset_generation.py), synthetic data can be generated for a supervised regression problem. The script contains a function (called correlated_dataset) which generates a data matrix with some relevant, correlated and noisy variables.

Running the script with only two parameters (i.e. the data matrix dimensions) two text files, namely data.txt and labels.txt are generated in the same directory containing the script file.

$python dataset_generation.py 100 40 Generation of 100 samples with 40 variables... done$ ls
dataset_generation.py  data.txt  labels.txt


The script generates a random dataset with 3 groups of 5 correlated variables. In total, there are 15 relevant variables and, following the example above, 40 - 15 = 25 noisy variables. The weight assigned to each relevant variable is 1.0.

The data matrix and the labels matrix generated with the script and used in this tutorial can be found in the the directory L1L2py-1.0.5/docs/tutorial), where the script itself is located.

To familiarize with the l1l2py code, the two files can be copied where needed and used following the tutorial steps below (alternatively, different datasets can be generated using either the script or the function correlated_dataset).

### Preparing the data¶

First, starting a new python terminal, import the needed packages:

>>> import numpy as np
>>> import l1l2py


and load the data from disk (change file paths as needed):

>>> X = np.loadtxt('tutorial/data.txt')


Then, split the data in two blocks, training-set and test-set using the standard NumPy functions numpy.vsplit and numpy.hsplit

>>> train_data, test_data = np.vsplit(X, 2)
>>> print train_data.shape, test_data.shape
(50, 40) (50, 40)
>>> train_labels, test_labels = np.hsplit(Y, 2)
>>> print train_labels.shape, test_labels.shape
(50,) (50,)


as shown, each set contains 50 samples.

### Setting parameters ranges¶

At this point a correct range for the regularization parameters has to be chosen. The function l1l2py.algorithms.l1_bound can be used to devise an optimal range for the sparsity parameter .

>>> train_data_centered = l1l2py.tools.center(train_data)
>>> tau_max = l1l2py.algorithms.l1_bound(train_data_centered, train_labels)
>>> print tau_max
10.5458157567


Note that the matrix is centered, because the same normalization will be used when running the model selection procedure (see later).

Using this parameter to solve a Lasso optimization problem, a void solution would be obtained:

>>> beta = l1l2py.algorithms.l1l2_regularization(train_data_centered,
...                                              train_labels, 0.0, tau_max)
>>> print np.allclose(np.zeros_like(beta), beta)
True


A good choice for the extreme values for could be

>>> tau_max = tau_max - 1e-2
>>> tau_min = tau_max * 1e-2
>>> beta_max = l1l2py.algorithms.l1l2_regularization(train_data_centered,
...                                                  train_labels, 0.0, tau_max)
>>> beta_min = l1l2py.algorithms.l1l2_regularization(train_data_centered,
...                                                  train_labels, 0.0, tau_min)
>>> print len(beta_max.nonzero()[0]), len(beta_min.nonzero()[0])
1 10


The minimum value of should be set in order to get a solution with more non-zero variables than the number of hypotetical number of relevant groups of correlated variables (in the our case we know to have 3 groups).

The range of values can therefore be set as:

>>> tau_range = l1l2py.tools.geometric_range(tau_min, tau_max, 20)


For the regularization parameter a wide range of values is advisable

>>> lambda_range = l1l2py.tools.geometric_range(1e-6, 1e-3, 7)


as for the correlation parameter . For this simple example some different levels of correlation are set, starting from 0.0

>>> mu_range = [0.0, 0.001, 0.01, 0.1, 1.0]


### Run the model selection¶

To correctly use the Stage I: Minimal Model Selection, cross validation splits must be generated:

>>> cv_splits = l1l2py.tools.kfold_splits(train_labels, k=5) #5-fold CV


Now, call the l1l2py.model_selection function to get the results of model selection (the complete execution of the function will take some minutes)

>>> out = l1l2py.model_selection(train_data, train_labels,
...                              test_data, test_labels,
...                              mu_range, tau_range, lambda_range,
...                              cv_splits,
...                              cv_error_function=l1l2py.tools.regression_error,
...                              error_function=l1l2py.tools.regression_error,
...                              data_normalizer=l1l2py.tools.center,
...                              labels_normalizer=None)


### Analyze the results¶

The optimal value of and found in the Stage I: Minimal Model Selection are:

>>> print out['tau_opt'], out['lambda_opt']
0.451073293459 0.000316227766017


The module plot.py (L1L2py-1.0.5/docs/tutorial/plot.py), provides a function (called kcv_errors) to plot the mean cross validation error (remember that for some high values of , the solution could be void on some cross validation splits, see Stage I: Minimal Model Selection, so the mean could be evaluated on a subset of (, ) pairs)

>>> from matplotlib import pyplot as plt
>>> from plot import kcv_errors
>>> tau_max_idx = out['kcv_err_ts'].shape[0]
>>> kcv_errors(out['kcv_err_ts'],
...            np.log10(tau_range[:tau_max_idx]), np.log10(lambda_range),
...            r'$log_{10}(\tau)$', r'$log_{10}(\lambda)$')
>>> plt.show()


Since the error increases rapidly with the highest value of , is useful to show the error surface removing the (corresponding) last row from the mean errors matrix

>>> tau_max_idx -= 1
>>> kcv_errors(out['kcv_err_ts'][:tau_max_idx,:],
...            np.log10(tau_range[:tau_max_idx]), np.log10(lambda_range),
...            r'$log_{10}(\tau)$', r'$log_{10}(\lambda)$')
>>> plt.show()


The (almost completely) nested list of relevant variables is stored in the selected_list entry of the resulting dict object:

>>> for mu, sel in zip(mu_range, out['selected_list']):
...     print "%.3f:" % mu, sel.nonzero()[0]
0.000: [ 3  4  5 10 12 14]
0.001: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14]
0.010: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
0.100: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
1.000: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 22 24 29 32 35 38 39]


Remembering that in the used dataset we have 3 groups of 5 relevant variables, with indexes from 0 to 14, the result shows that the minimal list contains two variables belonging to the first group (indexes 3 and 4), one variable belonging to the second group (index 5) and three variables belonging to the third group (indexes 10, 12 and 14), without any noisy variables! Incrementing the correlation parameter all the relevant variables can be included obtaining a model with (almost) constant prediction power.

In fact, the test error evaluated by the RLS solution computed on the selected variables (with the optimal value of ) is:

>>> for mu, err in zip(mu_range, out['err_ts_list']):
...     print "%.3f: %.3f" % (mu, err)
0.000: 2.131
0.001: 2.236
0.010: 2.224
0.100: 2.224
1.000: 2.227


### Appendix: functions used in this tutorial¶

plot.kcv_errors(errors, range_x, range_y, label_x, label_y)

Plot a 3D error surface.

Parameters : errors : (N, D) ndarray Error matrix. range_x : array_like of N values First axis values. range_y : array_like of D values Second axis values. label_x : str First axis label. label_y : str Second axis label.

Examples

>>> errors = numpy.empty((20, 10))
>>> x = numpy.arange(20)
>>> y = numpy.arange(10)
>>> for i in range(20):
...     for j in range(10):
...         errors[i, j] = (x[i] * y[j])
...
>>> kcv_errors(errors, x, y, 'x', 'y')
>>> plt.show()

dataset_generation.correlated_dataset(num_samples, num_variables, groups, weights, variables_stdev=1.0, correlations_stdev=0.01, labels_stdev=0.01)

Random supervised dataset generation with correlated variables.

The function returns a supervised training set with num_samples examples with num_variables variables.

Parameters : num_samples : int Number of samples. num_variables : int Number of variables. groups : tuple of int For each group of relevant variables indicates the group cardinality. weights : array_like of sum(groups) float True regression model. variables_stdev : float, optional (default is 1.0) Standard deviation of the zero-mean Gaussian distribution generating variables column vectors. correlations_stdev : float, optional (default is 1e-2) Standard deviation of the zero-mean Gaussian distribution generating errors between variables which belong to the same group labels_stdev : float, optional (default is 1e-2) Standard deviation of the zero-mean Gaussian distribution generating regression errors. X : (num_samples, num_variables) ndarray Data matrix. Y : (num_samples, 1) ndarray Regression output.

Notes

The data will have len(groups) correlated groups of variables, where for each one the function generates a column vector of num_samples values drawn from a zero-mean Gaussian distribution with standard deviation equal to variables_stdev.

For each variable of the group associated with the vector, the function generates the values as

where is additive noise drawn from a zero-mean Gaussian distribution with standard deviation equal to correlations_stdev.

The regression values will be generated as

where is the weights parameter, a list of sum(groups) coefficients of the relevant variables, is the submatrix containing only the column related to the relevant variables and is additive noise drawn from a zero-mean Gaussian distribution with standard deviation equal to labels_stdev.

At the end the function returns the matrices and where

is the concatenation of the matrix with the relevant variables with num_variables - sum(groups) noisy variables generated indipendently using values drawn from a zero-mean Gaussian distribution with standard deviation equal to variables_stdev.

Examples

>>> X, Y = correlated_dataset(30, 40, (5, 5, 5), [3.0]*15)
>>> X.shape
(30, 40)
>>> Y.shape
(30, 1)