Cross-validation

This module contains various provisions for cross-validation.

The main functions in this module are:

cross_validated()
generate_folds()
strata_by_labels()
random_permutation()
mean()
identity()
list_mean()

Module author: Marc Claesen

optunity.cross_validation.select(collection, indices)

Selects the subset specified by indices from collection.

>>> select([0, 1, 2, 3, 4], [1, 3])
[1, 3]
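In the example above the values happen to coincide with their positions, so it does not show whether selection is by position or by value. The following is a minimal sketch of positional selection, not the library's source; the name select_sketch is hypothetical.

```python
def select_sketch(collection, indices):
    """Return the items of `collection` found at the given positions, in order."""
    return [collection[i] for i in indices]


# Letters make it visible that `indices` are positions, not values.
print(select_sketch(["a", "b", "c", "d", "e"], [1, 3]))  # ['b', 'd']
```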

optunity.cross_validation.random_permutation(data)

Returns a list containing a random permutation of the elements of data.

Parameters:	data – an iterable containing the elements to permute over
Returns:	a list containing the permuted entries of data.
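One way to obtain such a permutation is to shuffle a copy of the input. The sketch below assumes that approach and is not taken from the library; the name random_permutation_sketch and the seed parameter are hypothetical and only make the example reproducible.

```python
import random


def random_permutation_sketch(data, seed=None):
    """Return a new list holding the elements of `data` in random order."""
    rng = random.Random(seed)   # hypothetical: a private generator for reproducibility
    items = list(data)          # copy, so the caller's data is left untouched
    rng.shuffle(items)          # in-place Fisher-Yates shuffle of the copy
    return items


original = [0, 1, 2, 3, 4]
shuffled = random_permutation_sketch(original, seed=42)
print(sorted(shuffled) == original)  # True: same elements, possibly reordered
```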

optunity.cross_validation.cross_validated(x, num_folds=10, y=None, strata=None, folds=None, num_iter=1, regenerate_folds=False, clusters=None, aggregator=<function mean>)

Function decorator to perform cross-validation as configured.

Parameters:

x – data to be used for cross-validation
num_folds – number of cross-validation folds (default 10)
y – (optional) labels to be used for cross-validation. If specified, len(y) must equal len(x)
strata – (optional) strata to account for when generating folds. Strata signify instances that must be spread across folds. Not every instance must be in a stratum. Specify strata as a list of lists of instance indices.
folds – (optional) prespecified cross-validation folds to be used (list of lists (iterations) of lists (folds)).
num_iter – (optional) number of iterations to use (default 1)
regenerate_folds – (optional) whether or not to regenerate folds on every evaluation (default False)
clusters – (optional) clusters to account for when generating folds. Clusters signify instances that must be assigned to the same fold. Not every instance must be in a cluster. Specify clusters as a list of lists of instance indices.
aggregator – function to aggregate scores of different folds (default: mean)

Returns:

a cross_validated_callable with the proper configuration.

The resulting decorator must be applied to a function that accepts the following arguments (plus any others it needs):

Parameters:

x_train (iterable) – training data
y_train (iterable) – training labels (optional)
x_test (iterable) – testing data
y_test (iterable) – testing labels (optional)

y_train and y_test must be available if the y argument to this function is not None.

These 4 keyword arguments will be bound upon decoration. Further arguments will remain free (e.g. hyperparameter names).

>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=5, folds=[[[i] for i in range(5)]], aggregator=identity)
... def f(x_train, x_test, a):
...     return x_test[0] + a
>>> f(a=1)
[1, 2, 3, 4, 5]
>>> f(1)
[1, 2, 3, 4, 5]
>>> f(a=2)
[2, 3, 4, 5, 6]
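The binding of x_train and x_test can be illustrated with a small standalone decorator. This is a simplified sketch, not Optunity's implementation: the names cross_validated_sketch and mean_sketch are hypothetical, labels, strata, clusters and repeated iterations are omitted, and free arguments are accepted by keyword only.

```python
import functools


def cross_validated_sketch(x, folds, aggregator):
    """Hypothetical decorator: evaluate f once per fold, then aggregate the scores."""
    def decorator(f):
        @functools.wraps(f)
        def wrapper(**params):
            scores = []
            for test_idx in folds:
                held_out = set(test_idx)
                # Everything outside the current fold is training data.
                x_train = [x[i] for i in range(len(x)) if i not in held_out]
                x_test = [x[i] for i in test_idx]
                # x_train / x_test are bound here; `params` stay free.
                scores.append(f(x_train=x_train, x_test=x_test, **params))
            return aggregator(scores)
        return wrapper
    return decorator


def mean_sketch(scores):
    return sum(scores) / float(len(scores))


data = list(range(5))
one_per_fold = [[i] for i in range(5)]


@cross_validated_sketch(x=data, folds=one_per_fold, aggregator=mean_sketch)
def f(x_train, x_test, a):
    return x_test[0] + a


print(f(a=1))  # 3.0, the mean of [1, 2, 3, 4, 5]
```

Swapping mean_sketch for a function that returns its argument unchanged would yield the per-fold list, which mirrors the use of identity in the doctest above.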

The number of folds must be less than or equal to the size of the data.

>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=6) 
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError

The number of labels (if specified) must match the number of data instances.

>>> data = list(range(5))
>>> labels = list(range(3))
>>> @cross_validated(x=data, y=labels, num_folds=2) 
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError

optunity.cross_validation.generate_folds(num_rows, num_folds=10, strata=None, clusters=None)

Generates folds for a given number of rows.

Parameters:

num_rows – number of data instances
num_folds – number of folds to use (default 10)
strata – (optional) list of lists to indicate different sampling strata. Not all rows must be in a stratum. The number of rows per stratum must be larger than or equal to num_folds.
clusters – (optional) list of lists indicating clustered instances. Clustered instances must be placed in a single fold to avoid information leaks.

Returns:

a list of folds, where each fold is a list of instance indices

>>> folds = generate_folds(num_rows=6, num_folds=2, clusters=[[1, 2]], strata=[[3,4]])
>>> len(folds)
2
>>> i1 = [idx for idx, fold in enumerate(folds) if 1 in fold]
>>> i2 = [idx for idx, fold in enumerate(folds) if 2 in fold]
>>> i1 == i2
True
>>> i3 = [idx for idx, fold in enumerate(folds) if 3 in fold]
>>> i4 = [idx for idx, fold in enumerate(folds) if 4 in fold]
>>> i3 == i4
False

Warning

Instances in strata are not necessarily spread out over all folds. Some folds may already be full due to clusters. This effect should be negligible.
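To make the cluster constraint concrete, the sketch below keeps each cluster together by treating it as one indivisible unit and placing units into the currently smallest fold. It is an illustrative assumption about how such folds could be built, not the library's algorithm; strata are omitted, and the name generate_folds_sketch and the seed parameter are hypothetical.

```python
import random


def generate_folds_sketch(num_rows, num_folds=10, clusters=None, seed=None):
    """Hypothetical fold generator that never splits a cluster across folds."""
    clusters = clusters or []
    assert num_folds <= num_rows
    rng = random.Random(seed)

    clustered = {i for cluster in clusters for i in cluster}
    # A unit is a group of row indices that must end up in the same fold.
    units = [list(cluster) for cluster in clusters]
    units += [[i] for i in range(num_rows) if i not in clustered]

    rng.shuffle(units)
    units.sort(key=len, reverse=True)  # stable sort: large units are placed first

    folds = [[] for _ in range(num_folds)]
    for unit in units:
        # Greedy balancing: add the whole unit to the smallest fold so far.
        min(folds, key=len).extend(unit)
    return folds


folds = generate_folds_sketch(num_rows=6, num_folds=2, clusters=[[1, 2]], seed=0)
print(sorted(i for fold in folds for i in fold))  # [0, 1, 2, 3, 4, 5]
```

Because a cluster occupies several slots of one fold at once, the remaining rows fill the other folds first, which is the same effect the warning above describes for strata.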

optunity.cross_validation.strata_by_labels(labels)

Constructs a list of strata (lists) based on unique values of labels.

Parameters:	labels – iterable, identical values will end up in identical strata
Returns:	the strata, as a list of lists
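A minimal sketch of the grouping follows. It assumes that each stratum holds instance indices, in line with the strata format described for cross_validated(); the name strata_by_labels_sketch is hypothetical and the order of the strata is not meant to match the library's.

```python
from collections import defaultdict


def strata_by_labels_sketch(labels):
    """Group the positions of `labels` by label value, one list per unique label."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return list(groups.values())


# Strata appear in order of first occurrence (dicts keep insertion order in Python 3.7+).
print(strata_by_labels_sketch(["a", "b", "a", "c", "b"]))  # [[0, 2], [1, 4], [3]]
```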

optunity.cross_validation.mean(x)

optunity.cross_validation.identity(x)

optunity.cross_validation.list_mean(list_of_measures)

Computes the means of corresponding elements across the tuples in the given list.

Parameters:	list_of_measures (list) – a list of tuples to compute means from
Returns:	a list containing the means

This function can be used as an aggregator in cross_validated(), when multiple performance measures are being returned by the wrapped function.

>>> list_mean([(1, 4), (2, 5), (3, 6)])
[2.0, 5.0]
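The position-wise averaging can be sketched by transposing the list of tuples with zip. This is an illustration under that assumption rather than the library's code; the name list_mean_sketch is hypothetical.

```python
def list_mean_sketch(list_of_measures):
    """Average the first entries together, the second entries together, and so on."""
    # zip(*rows) turns [(1, 4), (2, 5), (3, 6)] into the columns (1, 2, 3) and (4, 5, 6).
    return [sum(column) / float(len(column)) for column in zip(*list_of_measures)]


# Each tuple could be, for example, (accuracy, runtime) measured on one fold.
print(list_mean_sketch([(1, 4), (2, 5), (3, 6)]))  # [2.0, 5.0]
```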

optunity.cross_validation.mean_and_list(x)

Returns a tuple whose first element is the mean of x and whose second element is x itself.

This function can be used as an aggregator in cross_validated().

>>> mean_and_list([1,2,3])
(2.0, [1, 2, 3])
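Such an aggregator is useful when both the summary score and the individual fold scores are of interest. A minimal sketch, with the hypothetical name mean_and_list_sketch:

```python
def mean_and_list_sketch(x):
    """Return (mean of x, x) so callers keep access to the per-fold scores."""
    return sum(x) / float(len(x)), x


fold_scores = [1, 2, 3]
average, per_fold = mean_and_list_sketch(fold_scores)
print(average)   # 2.0
print(per_fold)  # [1, 2, 3]
```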