# Cross-validation

This module contains various provisions for cross-validation.

The main functions in this module are:

- cross_validated()
- generate_folds()
- strata_by_labels()
- random_permutation()

Module author: Marc Claesen

optunity.cross_validation.select(collection, indices)

Selects the subset specified by indices from collection.

>>> select([0, 1, 2, 3, 4], [1, 3])
[1, 3]
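To illustrate how index-based selection can split data along a fold, here is a minimal standalone sketch. The names `select_sketch` and `split_by_fold` are hypothetical and not part of Optunity; the sketch only assumes the list-indexing behaviour shown in the example above.

```python
def select_sketch(collection, indices):
    """Return the items of `collection` at the given `indices`, in order."""
    return [collection[i] for i in indices]


def split_by_fold(data, test_indices):
    """Split `data` into (train, test) given the indices of the test fold."""
    test_set = set(test_indices)
    train_indices = [i for i in range(len(data)) if i not in test_set]
    return select_sketch(data, train_indices), select_sketch(data, test_indices)


data = ["a", "b", "c", "d", "e"]
train, test = split_by_fold(data, [1, 3])
print(train)  # ['a', 'c', 'e']
print(test)   # ['b', 'd']
```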

optunity.cross_validation.random_permutation(data)

Returns a list containing a random permutation of the elements of data.

Parameters: data – an iterable containing the elements to permute over

Returns: a list containing permuted entries of data

optunity.cross_validation.cross_validated(x, num_folds=10, y=None, strata=None, folds=None, num_iter=1, regenerate_folds=False, clusters=None, aggregator=mean)

Function decorator to perform cross-validation as configured.

Parameters:

- x – data to be used for cross-validation
- num_folds – number of cross-validation folds (default 10)
- y – (optional) labels to be used for cross-validation. If specified, len(y) must equal len(x).
- strata – (optional) strata to account for when generating folds. Strata signify instances that must be spread across folds. Not every instance must be in a stratum. Specify strata as a list of lists of instance indices.
- folds – (optional) prespecified cross-validation folds to be used, given as a list of iterations, where each iteration is a list of folds and each fold is a list of instance indices.
- num_iter – (optional) number of iterations to use (default 1)
- regenerate_folds – (optional) whether to regenerate folds on every evaluation (default False)
- clusters – (optional) clusters to account for when generating folds. Clusters signify instances that must be assigned to the same fold. Not every instance must be in a cluster. Specify clusters as a list of lists of instance indices.
- aggregator – function to aggregate scores of different folds (default: mean)

Returns: a cross_validated_callable with the proper configuration.

The resulting decorator must be used on a function with the following signature (plus any other arguments):

Parameters:

- x_train (iterable) – training data
- y_train (iterable) – training labels (optional)
- x_test (iterable) – testing data
- y_test (iterable) – testing labels (optional)

y_train and y_test must be available if the y argument to this function is not None.

These four keyword arguments are bound upon decoration. Any further arguments (e.g. hyperparameters) remain free.

>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=5, folds=[[[i] for i in range(5)]], aggregator=identity)
... def f(x_train, x_test, a):
...     return x_test[0] + a
>>> f(a=1)
[1, 2, 3, 4, 5]
>>> f(1)
[1, 2, 3, 4, 5]
>>> f(a=2)
[2, 3, 4, 5, 6]


The number of folds must be less than or equal to the size of the data.

>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=6)
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError


The number of labels (if specified) must match the number of data instances.

>>> data = list(range(5))
>>> labels = list(range(3))
>>> @cross_validated(x=data, y=labels, num_folds=2)
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError

optunity.cross_validation.generate_folds(num_rows, num_folds=10, strata=None, clusters=None)

Generates folds for a given number of rows.

Parameters:

- num_rows – number of data instances
- num_folds – number of folds to use (default 10)
- strata – (optional) list of lists to indicate different sampling strata. Not all rows must be in a stratum. The number of rows per stratum must be larger than or equal to num_folds.
- clusters – (optional) list of lists indicating clustered instances. Clustered instances must be placed in a single fold to avoid information leaks.

Returns: a list of folds, each fold is a list of instance indices

>>> folds = generate_folds(num_rows=6, num_folds=2, clusters=[[1, 2]], strata=[[3,4]])
>>> len(folds)
2
>>> i1 = [idx for idx, fold in enumerate(folds) if 1 in fold]
>>> i2 = [idx for idx, fold in enumerate(folds) if 2 in fold]
>>> i1 == i2
True
>>> i3 = [idx for idx, fold in enumerate(folds) if 3 in fold]
>>> i4 = [idx for idx, fold in enumerate(folds) if 4 in fold]
>>> i3 == i4
False


Warning

Instances in strata are not necessarily spread out over all folds. Some folds may already be full due to clusters. This effect should be negligible.
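To see why clusters can fill up folds, consider the following minimal sketch of cluster-aware fold generation. It is not Optunity's algorithm: `generate_folds_sketch` and its `seed` parameter are hypothetical, and stratification is deliberately left out so that only the cluster constraint is shown.

```python
import random


def generate_folds_sketch(num_rows, num_folds, clusters=None, seed=None):
    """Hypothetical fold generator that keeps each cluster in a single fold."""
    assert num_folds <= num_rows, "more folds than rows"
    rng = random.Random(seed)
    clusters = [list(c) for c in (clusters or [])]
    clustered = set(i for c in clusters for i in c)
    # Treat every unclustered row as a group of size one.
    groups = clusters + [[i] for i in range(num_rows) if i not in clustered]
    rng.shuffle(groups)
    # Place larger groups first (stable sort keeps the shuffled order
    # among groups of equal size) so fold sizes stay balanced.
    groups.sort(key=len, reverse=True)
    folds = [[] for _ in range(num_folds)]
    for group in groups:
        # A whole group always goes to the currently smallest fold.
        min(folds, key=len).extend(group)
    return folds


folds = generate_folds_sketch(num_rows=6, num_folds=2, clusters=[[1, 2]], seed=0)
fold_of_1 = [idx for idx, fold in enumerate(folds) if 1 in fold]
fold_of_2 = [idx for idx, fold in enumerate(folds) if 2 in fold]
print(fold_of_1 == fold_of_2)  # True
```

Because a cluster occupies several slots of one fold at once, fewer slots in that fold remain for instances of a stratum, which is the effect the warning describes.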

optunity.cross_validation.strata_by_labels(labels)

Constructs a list of strata (lists) based on unique values of labels.

Parameters: labels – iterable, identical values will end up in identical strata

Returns: the strata, as a list of lists
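The grouping described above can be sketched as follows. `strata_by_labels_sketch` is a hypothetical re-implementation for illustration; in particular, the order in which the library returns the strata is an assumption here and may differ.

```python
def strata_by_labels_sketch(labels):
    """Group instance indices by label value."""
    strata = {}
    order = []  # first-seen order of labels, for a deterministic result
    for index, label in enumerate(labels):
        if label not in strata:
            strata[label] = []
            order.append(label)
        strata[label].append(index)
    return [strata[label] for label in order]


labels = ["pos", "neg", "pos", "neg", "neg"]
print(strata_by_labels_sketch(labels))  # [[0, 2], [1, 3, 4]]
```

A result of this shape (a list of lists of instance indices) matches the format expected by the strata arguments documented above.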
optunity.cross_validation.mean(x)
optunity.cross_validation.identity(x)
optunity.cross_validation.list_mean(list_of_measures)

Computes the means of corresponding elements across the tuples in the given list.

Parameters: list_of_measures (list) – a list of tuples to compute means from

Returns: a list containing the means

This function can be used as an aggregator in cross_validated(), when multiple performance measures are being returned by the wrapped function.

>>> list_mean([(1, 4), (2, 5), (3, 6)])
[2.0, 5.0]
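As an illustration of aggregating several performance measures per fold, the sketch below averages each position across the per-fold tuples. `list_mean_sketch` is a hypothetical stand-in, and the per-fold numbers are made up.

```python
def list_mean_sketch(list_of_measures):
    """Average each position across a list of equally long tuples."""
    num_folds = float(len(list_of_measures))
    return [sum(values) / num_folds for values in zip(*list_of_measures)]


# One (accuracy, runtime_seconds) tuple per fold; illustrative values only.
per_fold = [(0.5, 1.0), (0.75, 3.0), (1.0, 2.0)]
print(list_mean_sketch(per_fold))  # [0.75, 2.0]
```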

optunity.cross_validation.mean_and_list(x)

Returns a tuple, the first element is the mean of x and the second is x itself.

This function can be used as an aggregator in cross_validated() when both the mean score and the individual fold scores are needed.

>>> mean_and_list([1,2,3])
(2.0, [1, 2, 3])
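A minimal sketch of such an aggregator, useful when the spread of fold scores matters as well as their average. `mean_and_list_sketch` is a hypothetical name and the scores are made up.

```python
def mean_and_list_sketch(x):
    """Return (mean of x, x) so the per-fold scores stay available."""
    return (sum(x) / float(len(x)), x)


fold_scores = [0.5, 1.0, 0.75]
average, per_fold_scores = mean_and_list_sketch(fold_scores)
print(average)          # 0.75
print(per_fold_scores)  # [0.5, 1.0, 0.75]
```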