L1L2Py is a simple and lightweight Python package to perform variable selection. The algorithms it proposes and implements have been thoroughly studied in different experimental settings.
The package is self-contained and provides all the tools needed to generate sparse solutions for a given linear classification or regression problem.
L1L2Py is available open source under the GNU GPL license. It requires Python version 2.5 or higher and the NumPy package.
There are two ways you can get it:
Automatic Installation (recommended)
L1L2Py is available on the Python Package Index and can be installed via easy_install
$ easy_install -U L1L2Py
or pip
$ pip install -U L1L2Py
Manual Installation
Download the latest official version L1L2Py-1.0.5.tar.gz, then:
$ tar xzvf L1L2Py-1.0.5.tar.gz
$ cd L1L2Py-1.0.5
$ python setup.py install
With this manual installation, if the Nose testing framework is installed, it is possible to run the provided unit tests with the nosetests script
$ nosetests
....................................

Ran 36 tests in 3.002s
OK
Moreover, the Matplotlib package (with the mplot3d toolkit enabled) is required to generate the plots shown in the following tutorial.
This tutorial aims to show how the regularization with double optimization is able to perform variable selection. Moreover, the tutorial shows how to use L1L2Py on a synthetic dataset generated with a given function which simulates a linear regression problem with a subset of relevant and linearly correlated variables.
Although it is not mandatory, to better understand this tutorial the reader should read the method described in [DeMol09b], or at least the introduction of Main functions (l1l2py).
Using the script dataset_generation.py (L1L2Py-1.0.5/docs/tutorial/dataset_generation.py), synthetic data can be generated for a supervised regression problem. The script contains a function (called correlated_dataset) which generates a data matrix with some relevant, correlated and noisy variables.
Running the script with only two parameters (the data matrix dimensions), two text files, namely data.txt and labels.txt, are generated in the same directory containing the script file.
$ python dataset_generation.py 100 40
Generation of 100 samples with 40 variables... done
$ ls
dataset_generation.py data.txt labels.txt
The script generates a random dataset with 3 groups of 5 correlated variables. In total, there are 15 relevant variables and, following the example above, 40 - 15 = 25 noisy variables. The weight assigned to each relevant variable is 1.0.
The data matrix and the labels generated with the script and used in this tutorial can be found in the directory L1L2Py-1.0.5/docs/tutorial, where the script itself is located.
To familiarize with the l1l2py code, the two files can be copied where needed and used following the tutorial steps below (alternatively, different datasets can be generated using either the script or the function correlated_dataset).
First, starting a new python terminal, import the needed packages:
>>> import numpy as np
>>> import l1l2py
and load the data from disk (change file paths as needed):
>>> X = np.loadtxt('tutorial/data.txt')
>>> Y = np.loadtxt('tutorial/labels.txt')
Then, split the data in two blocks, a training set and a test set, using the standard NumPy functions numpy.vsplit and numpy.hsplit:
>>> train_data, test_data = np.vsplit(X, 2)
>>> print train_data.shape, test_data.shape
(50, 40) (50, 40)
>>> train_labels, test_labels = np.hsplit(Y, 2)
>>> print train_labels.shape, test_labels.shape
(50,) (50,)
As shown, each set contains 50 samples.
At this point a correct range for the regularization parameters has to be chosen. The function l1l2py.algorithms.l1_bound can be used to devise an optimal range for the sparsity parameter τ.
>>> train_data_centered = l1l2py.tools.center(train_data)
>>> tau_max = l1l2py.algorithms.l1_bound(train_data_centered, train_labels)
>>> print tau_max
10.5458157567
Note that the data matrix is centered, because the same normalization will be used when running the model selection procedure (see later).
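The quantity returned by l1_bound has a simple closed form: assuming the ℓ1 functional is (1/n)||Y - Xβ||² + τ||β||₁, the solution is identically zero whenever τ ≥ (2/n)·max|XᵀY|. The sketch below computes such a bound with plain NumPy; the exact scaling used by the packaged l1l2py.algorithms.l1_bound is an assumption here and may differ.

```python
import numpy as np

def l1_bound_sketch(data, labels):
    # Smallest tau yielding an all-zero solution, assuming the functional
    # (1/n) * ||Y - X*beta||^2 + tau * ||beta||_1: tau_max = (2/n) * max|X' Y|
    n = data.shape[0]
    return 2.0 / n * np.abs(np.dot(data.T, labels)).max()

# Toy usage on centered random data (same shapes as in the tutorial)
rng = np.random.RandomState(0)
X = rng.randn(50, 40)
X -= X.mean(axis=0)  # center the data, as done before calling l1_bound
y = rng.randn(50)
tau_max = l1_bound_sketch(X, y)
```

Any τ at or above this value drives every coefficient to zero, which is why the tutorial then shrinks τ below the bound to obtain non-void solutions.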
Using this parameter to solve a Lasso optimization problem, a void solution would be obtained:
>>> beta = l1l2py.algorithms.l1l2_regularization(train_data_centered,
... train_labels, 0.0, tau_max)
>>> print np.allclose(np.zeros_like(beta), beta)
True
A good choice for the extreme values of τ could be
>>> tau_max = tau_max - 1e-2
>>> tau_min = tau_max * 1e-2
>>> beta_max = l1l2py.algorithms.l1l2_regularization(train_data_centered,
... train_labels, 0.0, tau_max)
>>> beta_min = l1l2py.algorithms.l1l2_regularization(train_data_centered,
... train_labels, 0.0, tau_min)
>>> print len(beta_max.nonzero()[0]), len(beta_min.nonzero()[0])
1 10
The minimum value of τ should be set so as to obtain a solution with more nonzero variables than the hypothetical number of relevant groups of correlated variables (in our case we know there are 3 groups).
The range of τ values can therefore be set as:
>>> tau_range = l1l2py.tools.geometric_range(tau_min, tau_max, 20)
For the regularization parameter λ a wide range of values is advisable
>>> lambda_range = l1l2py.tools.geometric_range(1e-6, 1e-3, 7)
as well as for the correlation parameter μ. For this simple example, some different levels of correlation are set, starting from 0.0
>>> mu_range = [0.0, 0.001, 0.01, 0.1, 1.0]
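A geometric range is simply a sequence of values equally spaced on a logarithmic scale between two extremes. A sketch of an equivalent helper (the name geometric_range_sketch is hypothetical, not the packaged function) built on numpy.logspace:

```python
import numpy as np

def geometric_range_sketch(min_value, max_value, number):
    # `number` values geometrically spaced from min_value to max_value,
    # i.e. equally spaced exponents on a base-10 log scale
    return np.logspace(np.log10(min_value), np.log10(max_value), number)

# Same extremes as the lambda range used in the tutorial
lambda_range = geometric_range_sketch(1e-6, 1e-3, 7)
```

Geometric spacing is the natural choice for regularization parameters, whose useful values typically span several orders of magnitude.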
To correctly use the Stage I: Minimal Model Selection, cross-validation splits must be generated:
>>> cv_splits = l1l2py.tools.kfold_splits(train_labels, k=5) # 5-fold CV
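A k-fold split is just a shuffled partition of the sample indices into k held-out blocks, each paired with the remaining samples for training. A minimal sketch (a hypothetical helper, not the packaged kfold_splits, whose shuffling strategy may differ):

```python
import numpy as np

def kfold_splits_sketch(labels, k, seed=0):
    # Shuffle the sample indices, partition them into k folds, and pair each
    # fold (used as test set) with the union of the others (training set)
    indices = np.arange(len(labels))
    np.random.RandomState(seed).shuffle(indices)
    folds = np.array_split(indices, k)
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train_idx, test_idx))
    return splits

# 5-fold splits over 50 training samples, as in the tutorial
splits = kfold_splits_sketch(np.arange(50), k=5)
```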
Now, call the l1l2py.model_selection function to get the results of the model selection (the complete execution of the function will take some minutes):
>>> out = l1l2py.model_selection(train_data, train_labels,
... test_data, test_labels,
... mu_range, tau_range, lambda_range,
... cv_splits,
... cv_error_function=l1l2py.tools.regression_error,
... error_function=l1l2py.tools.regression_error,
... data_normalizer=l1l2py.tools.center,
... labels_normalizer=None)
The optimal values of τ and λ found in the Stage I: Minimal Model Selection are:
>>> print out['tau_opt'], out['lambda_opt']
0.451073293459 0.000316227766017
The module plot.py (L1L2Py-1.0.5/docs/tutorial/plot.py) provides a function (called kcv_errors) to plot the mean cross-validation error. Remember that for some high values of τ the solution could be void on some cross-validation splits (see Stage I: Minimal Model Selection), so the mean could be evaluated on a subset of the (τ, λ) pairs:
>>> from matplotlib import pyplot as plt
>>> from plot import kcv_errors
>>> tau_max_idx = out['kcv_err_ts'].shape[0]
>>> kcv_errors(out['kcv_err_ts'],
... np.log10(tau_range[:tau_max_idx]), np.log10(lambda_range),
... r'$log_{10}(\tau)$', r'$log_{10}(\lambda)$')
>>> plt.show()
Since the error increases rapidly with the highest value of τ, it is useful to show the error surface after removing the corresponding last row from the mean errors matrix
>>> tau_max_idx -= 1
>>> kcv_errors(out['kcv_err_ts'][:tau_max_idx,:],
... np.log10(tau_range[:tau_max_idx]), np.log10(lambda_range),
... r'$log_{10}(\tau)$', r'$log_{10}(\lambda)$')
>>> plt.show()
The (almost completely) nested list of relevant variables is stored in the selected_list entry of the resulting dict object:
>>> for mu, sel in zip(mu_range, out['selected_list']):
...     print "%.3f:" % mu, sel.nonzero()[0]
0.000: [ 3 4 5 10 12 14]
0.001: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
0.010: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
0.100: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
1.000: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 22 24 29 32 35 38 39]
Recalling that the dataset used has 3 groups of 5 relevant variables, with indices from 0 to 14, the result shows that the minimal list contains two variables belonging to the first group (indices 3 and 4), one variable belonging to the second group (index 5) and three variables belonging to the third group (indices 10, 12 and 14), without any noisy variables! Incrementing the correlation parameter μ, all the relevant variables can be included, obtaining a model with (almost) constant prediction power.
In fact, the test error evaluated by the RLS solution computed on the selected variables (with the optimal value of λ) is:
>>> for mu, err in zip(mu_range, out['err_ts_list']):
...     print "%.3f: %.3f" % (mu, err)
0.000: 2.131
0.001: 2.236
0.010: 2.224
0.100: 2.224
1.000: 2.227
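Assuming l1l2py.tools.regression_error is the usual mean squared error between true and predicted labels (an assumption, not taken from the package sources), it can be sketched as:

```python
import numpy as np

def regression_error_sketch(labels, predictions):
    # Mean of the squared residuals between true and predicted values
    residuals = np.asarray(labels) - np.asarray(predictions)
    return np.mean(residuals ** 2)

# Toy usage with hand-made labels and predictions
err = regression_error_sketch([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```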
Plot a 3D error surface.
Parameters:
errors : (N, D) ndarray
range_x : array_like of N values
range_y : array_like of D values
label_x : str
label_y : str
Examples
>>> import numpy
>>> from matplotlib import pyplot as plt
>>> from plot import kcv_errors
>>> errors = numpy.empty((20, 10))
>>> x = numpy.arange(20)
>>> y = numpy.arange(10)
>>> for i in range(20):
...     for j in range(10):
...         errors[i, j] = (x[i] * y[j])
...
>>> kcv_errors(errors, x, y, 'x', 'y')
>>> plt.show()
Random supervised dataset generation with correlated variables.
The function returns a supervised training set with num_samples examples, each with num_variables variables.
Parameters:
num_samples : int
num_variables : int
groups : tuple of int
weights : array_like of sum(groups) float
variables_stdev : float, optional (default is 1.0)
correlations_stdev : float, optional (default is 1e-2)
labels_stdev : float, optional (default is 1e-2)

Returns:
X : (num_samples, num_variables) ndarray
Y : (num_samples, 1) ndarray
Notes
The data will have len(groups) correlated groups of variables; for each group the function generates a column vector v of num_samples values drawn from a zero-mean Gaussian distribution with standard deviation equal to variables_stdev.
Each variable x of the group associated with the vector v is then generated as

    x = v + eps,

where eps is additive noise drawn from a zero-mean Gaussian distribution with standard deviation equal to correlations_stdev.
The regression values will be generated as

    Y = X~ * w + eps,

where w is the weights parameter, a list of sum(groups) coefficients of the relevant variables, X~ is the submatrix containing only the columns related to the relevant variables, and eps is additive noise drawn from a zero-mean Gaussian distribution with standard deviation equal to labels_stdev.
At the end the function returns the matrices X and Y, where X is the concatenation of the matrix X~ of relevant variables with num_variables - sum(groups) noisy variables generated independently using values drawn from a zero-mean Gaussian distribution with standard deviation equal to variables_stdev.
Examples
>>> X, Y = correlated_dataset(30, 40, (5, 5, 5), [3.0]*15)
>>> X.shape
(30, 40)
>>> Y.shape
(30, 1)
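Following the Notes above, a generator with the same interface can be sketched as follows (a hypothetical reimplementation, not the packaged correlated_dataset; the seed handling is an assumption added for reproducibility):

```python
import numpy as np

def correlated_dataset_sketch(num_samples, num_variables, groups, weights,
                              variables_stdev=1.0, correlations_stdev=1e-2,
                              labels_stdev=1e-2, seed=0):
    rng = np.random.RandomState(seed)
    columns = []
    for group_size in groups:
        # One zero-mean Gaussian "seed" column per group of correlated variables...
        base = rng.normal(0.0, variables_stdev, num_samples)
        for _ in range(group_size):
            # ...replicated with small additive noise for each group member
            columns.append(base + rng.normal(0.0, correlations_stdev, num_samples))
    relevant = np.column_stack(columns)
    # Labels: linear combination of the relevant variables plus additive noise
    Y = np.dot(relevant, np.asarray(weights)).reshape(-1, 1)
    Y += rng.normal(0.0, labels_stdev, Y.shape)
    # Append independent noisy variables up to num_variables total columns
    num_noisy = num_variables - sum(groups)
    noisy = rng.normal(0.0, variables_stdev, (num_samples, num_noisy))
    X = np.hstack((relevant, noisy))
    return X, Y

# Same call as the Examples section above
X, Y = correlated_dataset_sketch(30, 40, (5, 5, 5), [3.0] * 15)
```

Variables within a group differ only by the small correlations_stdev noise, so they are nearly perfectly correlated with each other and carry (almost) the same information about Y.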