SlipGURU Dipartimento di Informatica e Scienze dell'Informazione Università Degli Studi di Genova

Utility functions and classes (utils)

This module contains functions and classes useful to manipulate input data (e.g. gene expressions, labels), create outputs and collect results.

Data and parameters

exception l1l2signature.utils.L1L2SignatureException

Exception raised by L1L2Signature classes and functions.

class l1l2signature.utils.BioDataReader(data_file, labels_file, sample_remover=None, variable_remover=None, delimiter=', ', samples_on='col', positive_label=None)

Biological Data reader.

This class reads a pair of CSV files containing respectively a data matrix and a list of labels.

The reader can discard some samples and some variables according to the given arguments (as described below) and assumes the presence of a header line in both files.

If the labels file contains exactly 2 distinct labels, BioDataReader automatically maps the two classes to -1 and +1, and the labels_reverse attribute contains a dictionary mapping the numeric labels back to the original string labels. Otherwise, BioDataReader expects numeric values (regression task) and labels_reverse is None.

Parameters :

data_file : file or str

File, filename, or generator to read (see also numpy.loadtxt())

labels_file : file or str

File, filename, or generator to read (see also numpy.loadtxt())

variable_remover : str or None, optional (default None)

Variable name prefix used to discard variables (e.g. AFFX for Affymetrix Gene Expression MicroArray)

sample_remover : str or None, optional (default None)

Label value used to discard samples. This value must refer to the original labels in the labels file.

delimiter : str, optional (default ‘,’)

CSV char delimiter (if it is not a comma)

samples_on : ‘col’ or ‘row’, optional (default ‘col’)

Indicates whether the samples are arranged on rows or columns in the data file

positive_label : str, optional (default None)

Indicates what label has to be considered as the positive class (mapped to +1 value). If None, mapping follows a lexicographic order.
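The two-class mapping and the role of positive_label can be illustrated with a minimal sketch. This is not the BioDataReader implementation: the helper name map_binary_labels is hypothetical, and the assumption that the lexicographically first label becomes -1 is inferred from the example in this section (where '0' maps to -1 and '1' to +1).

```python
def map_binary_labels(raw_labels, positive_label=None):
    """Map exactly two distinct string labels to -1.0 / +1.0 (illustrative only)."""
    classes = sorted(set(raw_labels))
    if len(classes) != 2:
        raise ValueError("expected exactly 2 distinct labels")
    if positive_label is None:
        # Assumption: lexicographic order, first label -> -1, second -> +1
        negative, positive = classes
    else:
        if positive_label not in classes:
            raise ValueError("positive_label is not one of the labels")
        positive = positive_label
        negative = [c for c in classes if c != positive][0]
    # Reverse mapping, from numeric classes back to the original strings
    labels_reverse = {-1.0: negative, 1.0: positive}
    labels = [1.0 if label == positive else -1.0 for label in raw_labels]
    return labels, labels_reverse


labels, labels_reverse = map_binary_labels(['1', '0', '1', '1'])
```

Passing positive_label='0' to the same call would flip the signs, which is the case the positive_label argument is meant to cover.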

Examples

>>> from l1l2signature.utils import BioDataReader
>>> from cStringIO import StringIO
>>> STD_DATA = '\n'.join(['probe, A,   B,   C,   D',
...                       'p1,    0.0, 0.1, 0.2, 0.3',
...                       'p2,    0.0, 0.1, 0.2, 0.3',
...                       'p3,    0.0, 0.1, 0.2, 0.3',
...                       'p4,    0.0, 0.1, 0.2, 0.3',
...                       'p5,    0.0, 0.1, 0.2, 0.3'])
>>> STD_LABELS = '\n'.join(['name, value',
...                         'A,    1',
...                         'B,    0',
...                         'C,    1',
...                         'D,    1'])
>>> br = BioDataReader(StringIO(STD_DATA), StringIO(STD_LABELS))
>>> print br.samples
['A' 'B' 'C' 'D']
>>> print br.variables
['p1' 'p2' 'p3' 'p4' 'p5']
>>> print br.data
[[ 0.   0.   0.   0.   0. ]
 [ 0.1  0.1  0.1  0.1  0.1]
 [ 0.2  0.2  0.2  0.2  0.2]
 [ 0.3  0.3  0.3  0.3  0.3]]
>>> print br.labels
[ 1. -1.  1.  1.]
>>>

Attributes

data : numpy.ndarray

Data matrix (float) of dimensions samples X variables

labels : numpy.ndarray

Labels array (float)

samples : numpy.ndarray

Sample names (str)

variables : numpy.ndarray

Variable names (str)

labels_reverse : dict

Mapping from the -1 and +1 classes to the original string labels. The attribute is None if automatic mapping is not performed.

class l1l2signature.utils.RangesScaler(data, labels, data_normalizer=None, labels_normalizer=None)

Given data and labels, helps to scale L1L2 parameter ranges properly.

This class works on the tau and mu ranges passed to the l1l2 selection framework (see also l1l2py.model_selection() and related functions for details).

Scaling the ranges allows relative (rather than absolute) parameter ranges to be used.

Attributes

norm_data : numpy.ndarray

Normalized data matrix.

norm_labels : numpy.ndarray

Normalized labels vector.

tau_range(trange)

Returns a scaled tau range.

The tau scaling factor is the maximum tau value that avoids an empty solution (where all variables are discarded). The value is estimated from the maximum correlation between data and labels.

Parameters :

trange : numpy.ndarray

Tau range containing relative values (expected maximum is less than 1.0 and minimum greater than 0.0).

Returns :

tau_range : numpy.ndarray

Scaled tau range.

Raises :

L1L2SignatureException :

If trange values are not in the [0, 1) interval (right extreme excluded).
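The idea behind relative tau ranges can be sketched as follows. This is an illustration under stated assumptions, not the RangesScaler code: it assumes an objective of the form (1/n) * ||X b - y||^2 + tau * ||b||_1, for which the all-zero solution appears once tau reaches (2/n) * max|X^T y|, and it assumes that scaling is a plain multiplication by that factor. The function names are hypothetical, ValueError stands in for L1L2SignatureException, and data normalization is skipped.

```python
def tau_scaling_factor_sketch(data, labels):
    """Upper bound on tau, assuming the (1/n)-scaled least squares objective."""
    n = len(data)
    p = len(data[0])
    # Absolute correlation |X^T y| between each variable (column) and the labels
    corr = [abs(sum(data[i][j] * labels[i] for i in range(n))) for j in range(p)]
    return 2.0 * max(corr) / n


def scale_tau_range(trange, factor):
    """Turn relative tau values in [0, 1) into absolute ones (assumed: product)."""
    if any(t < 0.0 or t >= 1.0 for t in trange):
        raise ValueError("relative tau values must lie in [0, 1)")
    return [t * factor for t in trange]


data = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 0.0]]
labels = [1.0, -1.0, 1.0, -1.0]
factor = tau_scaling_factor_sketch(data, labels)   # 2 * 3 / 4 = 1.5
scaled = scale_tau_range([0.1, 0.5], factor)       # approximately [0.15, 0.75]
```

With relative values, the same range (for example 0.1 to 0.5) can be reused across datasets, since each value is a fraction of the dataset-specific bound.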

mu_range(mrange)

Returns a scaled mu range.

The mu scaling factor is estimated from the maximum eigenvalue of the correlation matrix and is used to simplify the choice of parameters.

Parameters :

mrange : numpy.ndarray

Mu range containing relative values (expected maximum is less than 1.0 and minimum greater than 0.0).

Returns :

mu_range : numpy.ndarray

Scaled mu range.

Raises :

L1L2SignatureException :

If mrange values are not all greater than 0.

tau_scaling_factor

Tau scaling factor calculated on given data and labels.

mu_scaling_factor

Mu scaling factor calculated on given data matrix.

Results analysis

l1l2signature.utils.ordered_submatrices(data, labels, signatures_idxs)

Returns a list of sorted and filtered submatrices.

The matrices are sorted by labels and filtered by signatures_idxs.

Parameters :

labels : list

Data labels

signatures_idxs : list of numpy.ndarray

Each list item contains a signature, expressed either as a boolean mask over the variables or as a list of variable indexes. If indexes are given, the submatrix columns follow the order of the indexes.

Returns :

labels_idxs : numpy.ndarray

Labels ordering used to produce submatrices.

sub_matrices : list of numpy.ndarray

List of ordered and filtered submatrices.

Examples

>>> from l1l2signature.utils import ordered_submatrices
>>> data = [[1., 2., 3.],
...         [4., 5., 6.],
...         [7., 8., 9.]]
>>> labels = [1, -1, 1]
>>> signatures_idxs = [[True, False, False],
...                    [0, 1, 2],
...                    [2, 1]]
>>> labels_idxs, sub_matrices = ordered_submatrices(data, labels,
...                                                 signatures_idxs)
>>> print labels_idxs
[1 0 2]
>>> print sub_matrices[0]
[[ 4.]
 [ 1.]
 [ 7.]]
>>> print sub_matrices[1]
[[ 4.  5.  6.]
 [ 1.  2.  3.]
 [ 7.  8.  9.]]
>>> print sub_matrices[2]
[[ 6.  5.]
 [ 3.  2.]
 [ 9.  8.]]
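The behaviour shown in the example above can be reproduced with a short pure-Python sketch. It is not the library code: the name ordered_submatrices_sketch is hypothetical, it works on nested lists rather than numpy arrays, and it assumes that samples are ordered by a stable ascending sort of the labels (which is consistent with the [1 0 2] ordering printed above).

```python
def ordered_submatrices_sketch(data, labels, signatures_idxs):
    """Sort rows by label and keep only the columns of each signature."""
    # Stable sort of the sample indexes by label value
    labels_idxs = sorted(range(len(labels)), key=lambda i: labels[i])
    sub_matrices = []
    for signature in signatures_idxs:
        if all(isinstance(s, bool) for s in signature):
            # Boolean mask: keep the flagged columns in their original order
            columns = [j for j, keep in enumerate(signature) if keep]
        else:
            # Explicit indexes: keep the columns in the given order
            columns = list(signature)
        sub_matrices.append([[data[i][j] for j in columns]
                             for i in labels_idxs])
    return labels_idxs, sub_matrices


data = [[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]]
labels = [1, -1, 1]
signatures_idxs = [[True, False, False], [0, 1, 2], [2, 1]]
labels_idxs, sub_matrices = ordered_submatrices_sketch(data, labels,
                                                       signatures_idxs)
```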
l1l2signature.utils.signatures(splits_results, frequency_threshold=0.0)

Returns (almost) nested signatures for each correlation value.

The function returns 3 lists where each item refers to a signature (for increasing values of linear correlation). Each signature is ordered from the most to the least selected variable across KCV splits results.

Parameters :

splits_results : iterable

List of results from L1L2Py module, one for each external split.

frequency_threshold : float

Only the variables selected with a frequency greater than or equal to this threshold are included in the signature.

Returns :

sign_totals : list of numpy.ndarray

Counts the number of times each variable in the signature is selected.

sign_freqs : list of numpy.ndarray

Frequencies calculated from sign_totals.

sign_idxs : list of numpy.ndarray

Indexes of the signature variables.

Examples

>>> from l1l2signature.utils import signatures
>>> splits_results = [{'selected_list':[[True, False], [True, True]]},
...                   {'selected_list':[[True, False], [False, True]]}]
>>> sign_totals, sign_freqs, sign_idxs = signatures(splits_results)
>>> print sign_totals
[array([ 2.,  0.]), array([ 2.,  1.])]
>>> print sign_freqs
[array([ 1.,  0.]), array([ 1. ,  0.5])]
>>> print sign_idxs
[array([0, 1]), array([1, 0])]
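The counting and ordering behind this example can be sketched in pure Python. This is an illustration, not the library implementation: signatures_sketch is a hypothetical name, plain lists replace numpy arrays, and the handling of ties between equally selected variables (here: original variable order) is an assumption.

```python
def signatures_sketch(splits_results, frequency_threshold=0.0):
    """Per correlation value: count selections, order and filter the variables."""
    n_splits = len(splits_results)
    n_values = len(splits_results[0]['selected_list'])
    sign_totals, sign_freqs, sign_idxs = [], [], []
    for k in range(n_values):
        # One boolean mask per external split, for the k-th correlation value
        masks = [result['selected_list'][k] for result in splits_results]
        n_vars = len(masks[0])
        totals = [sum(1 for mask in masks if mask[j]) for j in range(n_vars)]
        # Most selected variables first (stable sort keeps ties in index order)
        order = sorted(range(n_vars), key=lambda j: -totals[j])
        kept = [j for j in order
                if totals[j] / float(n_splits) >= frequency_threshold]
        sign_idxs.append(kept)
        sign_totals.append([float(totals[j]) for j in kept])
        sign_freqs.append([totals[j] / float(n_splits) for j in kept])
    return sign_totals, sign_freqs, sign_idxs


splits_results = [{'selected_list': [[True, False], [True, True]]},
                  {'selected_list': [[True, False], [False, True]]}]
sign_totals, sign_freqs, sign_idxs = signatures_sketch(splits_results)
```

Raising frequency_threshold (for example to 0.75) would drop the variables that were not selected in at least that fraction of the splits.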
l1l2signature.utils.selection_summary(splits_results)

Counts how many times each variable was selected.

Parameters :

splits_results : iterable

List of results from L1L2Py module, one for each external split.

Returns :

summary : numpy.ndarray

Selection summary. # mu_values X # variables matrix.

l1l2signature.utils.confusion_matrix(labels, predictions)

Calculates a confusion matrix.

From the given real and predicted labels, the function calculates a confusion matrix as a doubly nested dictionary. The outer dictionary contains two keys, 'T' and 'F'. Both inner dictionaries contain a key for each class label. The ['T']['C1'] entry counts the correctly predicted 'C1' labels, while ['F']['C2'] counts the incorrectly predicted 'C2' labels.

Note that each of the 'T' and 'F' dictionaries corresponds to a confusion matrix diagonal, and that the function works only on two-class labels.

Parameters :

labels : iterable

Real labels.

predictions : iterable

Predicted labels.

Returns :

cm : dict

Dictionary containing the confusion matrix values.
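The nested structure described above can be sketched as follows. This is an illustration, not the library code: confusion_matrix_sketch is a hypothetical name, and it assumes that the inner key refers to the predicted label (so ['F']['C2'] counts samples wrongly predicted as 'C2'), which is one reading of the description.

```python
def confusion_matrix_sketch(labels, predictions):
    """Build a {'T': {...}, 'F': {...}} confusion matrix for two classes."""
    classes = sorted(set(labels) | set(predictions))
    if len(classes) > 2:
        raise ValueError("only two-class labels are supported")
    cm = {'T': dict((c, 0) for c in classes),
          'F': dict((c, 0) for c in classes)}
    for real, predicted in zip(labels, predictions):
        outcome = 'T' if real == predicted else 'F'
        # Assumption: counts are keyed by the predicted label
        cm[outcome][predicted] += 1
    return cm


labels = ['C1', 'C1', 'C1', 'C2', 'C2']
predictions = ['C1', 'C2', 'C2', 'C2', 'C2']
cm = confusion_matrix_sketch(labels, predictions)
```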

l1l2signature.utils.classification_measures(confusion_matrix, positive_label=None)

Calculates some classification measures.

Measures are calculated from a given confusion matrix (see confusion_matrix() for a detailed description of the required structure).

The positive_label argument specifies which label is to be considered the positive class. This is needed to calculate some measures, such as the F-measure, and to set some aliases (e.g. precision and recall are respectively the ‘predictive value’ and the ‘true rate’ for the positive class).

If positive_label is None, the resulting dictionary will not contain all the measures. Assuming two classes, ‘C1’ and ‘C2’, with ‘C1’ indicated as the positive (P) class, the function returns a dictionary with the following structure:

{
    'C1': {'predictive_value': --,  # TP / (TP + FP)
           'true_rate':        --}, # TP / (TP + FN)
    'C2': {'predictive_value': --,  # TN / (TN + FN)
           'true_rate':        --}, # TN / (TN + FP)
    'accuracy':          --,        # (TP + TN) / (TP + FP + FN + TN)
    'balanced_accuracy': --,        # 0.5 * ( (TP / (TP + FN)) +
                                    #         (TN / (TN + FP)) )
    'MCC':               --,        # ( (TP * TN) - (FP * FN) ) /
                                    # sqrt( (TP + FP) * (TP + FN) *
                                    #       (TN + FP) * (TN + FN) )

    # The following are present only when positive_label is not None
    'sensitivity':       --,        # P true rate: TP / (TP + FN)
    'specificity':       --,        # N true rate: TN / (TN + FP)
    'precision':         --,        # P predictive value: TP / (TP + FP)
    'recall':            --,        # P true rate: TP / (TP + FN)
    'F_measure':         --         # 2. * ( (Precision * Recall ) /
                                    #        (Precision + Recall) )
}
Parameters :

confusion_matrix : dict

Confusion matrix (as the one returned by confusion_matrix()).

positive_label : str

Positive class label.

Returns :

summary : dict

Dictionary containing calculated measures.
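The formulas listed in the structure above can be checked numerically with a small sketch that starts from raw counts rather than from the confusion matrix dictionary. The helper name measures_from_counts and the example counts are hypothetical, and no protection against empty classes (division by zero) is included.

```python
from math import sqrt


def measures_from_counts(tp, fp, fn, tn):
    """Compute the measures listed above from raw counts (P = positive class)."""
    tp, fp, fn, tn = float(tp), float(fp), float(fn), float(tn)
    sensitivity = tp / (tp + fn)   # P true rate
    specificity = tn / (tn + fp)   # N true rate
    precision = tp / (tp + fp)     # P predictive value
    recall = sensitivity           # alias of the P true rate
    return {
        'accuracy': (tp + tn) / (tp + fp + fn + tn),
        'balanced_accuracy': 0.5 * (sensitivity + specificity),
        'MCC': ((tp * tn) - (fp * fn)) / sqrt((tp + fp) * (tp + fn) *
                                              (tn + fp) * (tn + fn)),
        'sensitivity': sensitivity,
        'specificity': specificity,
        'precision': precision,
        'recall': recall,
        'F_measure': 2.0 * (precision * recall) / (precision + recall),
    }


measures = measures_from_counts(tp=40, fp=10, fn=5, tn=45)
```

For these counts the accuracy is 0.85 and the precision is 0.8, while the F-measure equals 2 * TP / (2 * TP + FP + FN) = 80 / 95.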

Plotting functions (plots)

This module contains all the utilities used to plot useful results.

l1l2signature.plots.kfold_errors(xrange, yrange, labels, ts_errors, tr_errors=None, fig_num=None)

Returns a matplotlib figure object containing a kfold error plot.

Parameters :

xrange : iterable

Range of values on the x-axis

yrange : iterable

Range of values on the y-axis

labels : iterable

Pair of string labels for X and Y axes

ts_errors : numpy.ndarray

Test Error matrix (float) of dimensions len(xrange) X len(yrange)

tr_errors : numpy.ndarray, optional

Train Error matrix (float) of dimensions len(xrange) X len(yrange)

fig_num : int, optional

Figure Number. If not given a new figure is initialized

Returns :

fig : matplotlib.figure.Figure

Created figure handle

l1l2signature.plots.errors_boxplot(errors, positions, label=None, title=None, fig_num=None)

Returns a matplotlib figure object containing errors box plots.

Parameters :

errors : numpy.ndarray

Error matrix (float) of dimensions K X len(positions)

positions : numpy.ndarray

Box plot x axis

label : str, optional

X-Axis label

title : str, optional

Plot title

fig_num : int, optional

Figure Number. If not given a new figure is initialized

Returns :

fig : matplotlib.figure.Figure

Created figure handle

l1l2signature.plots.heatmap(submatrix, labels, sample_names=None, var_names=None, clustering_method='ward', clustering_metric='euclidean', var_preorder=None, fig_num=None)

Returns a matplotlib figure object containing a heatmap plot.

If scipy is not installed, samples and variables will be shown in the given order.

Parameters :

submatrix : numpy.ndarray

Submatrix obtained from a signature.

labels : numpy.ndarray

Samples labels.

sample_names : iterable or str, optional

Sample names. If None, heatmap will be anonymous.

var_names : iterable or str, optional

Variable names. If None, the heatmap does not contain variable labels.

clustering_method : str, optional (default ‘ward’)

Clustering method used to order samples and variables. See scipy.cluster.hierarchy.linkage() function.

clustering_metric : str, optional (default ‘euclidean’)

Clustering metric used to order samples and variables. See scipy.cluster.hierarchy.linkage() function.

var_preorder : numpy.ndarray like, optional (default None)

If given, variables are not clustered but given indexes are used to order them.

fig_num : int, optional

Figure Number. If not given a new figure is initialized.

Returns :

fig : matplotlib.figure.Figure

Created figure handle

l1l2signature.plots.pca(submatrix, labels, fig_num=None)

Returns a matplotlib figure containing sample points.

Starting from the given submatrix, calculates a PCA projection to plot the samples in a 3D space. If the signature contains only 2 or 3 variables, PCA is not performed.

Parameters :

submatrix : numpy.ndarray

Submatrix obtained from a signature.

labels : numpy.ndarray

Samples labels.

fig_num : int, optional

Figure Number. If not given a new figure is initialized

Returns :

fig : matplotlib.figure.Figure

Created figure handle

l1l2signature.plots.selected_over_threshold(frequencies, mu_range, fig_num=None)

Returns a figure containing a plot of the cumulative count of selected variables.

For each mu value, plots a curve indicating how many variables have been selected at each frequency threshold.

Parameters :

frequencies : numpy.ndarray

List of len(mu_range) lists containing coordinates to plot.

mu_range : numpy.ndarray

Range of mu values.

fig_num : int, optional

Figure Number. If not given a new figure is initialized

Returns :

fig : matplotlib.figure.Figure

Created figure handle
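The quantity plotted by each curve, the number of variables selected at or above a given frequency threshold, can be sketched without any plotting code. This is an illustration for a single mu value: the helper name, the threshold grid and the use of a "greater than or equal" comparison are assumptions, and the exact layout of the frequencies argument expected by the function is not reproduced here.

```python
def selected_over_threshold_counts(freqs, thresholds):
    """For each threshold, count the variables with frequency >= threshold."""
    return [sum(1 for f in freqs if f >= t) for t in thresholds]


# Selection frequencies of five variables for one mu value (made-up numbers)
freqs = [1.0, 0.5, 0.5, 0.25, 0.0]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
counts = selected_over_threshold_counts(freqs, thresholds)
```

The resulting counts never increase as the threshold grows, which is why the plotted curves are monotonically decreasing.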
