Cross-validation
This module contains various provisions for cross-validation.
The main functions in this module are:
cross_validated()
generate_folds()
strata_by_labels()
random_permutation()
mean()
identity()
list_mean()
Module author: Marc Claesen
optunity.cross_validation.select(collection, indices)
Selects the subset specified by indices from collection.
>>> select([0, 1, 2, 3, 4], [1, 3])
[1, 3]
optunity.cross_validation.random_permutation(data)
Returns a list containing a random permutation of the elements of data.
Parameters: data – an iterable containing the elements to permute over
Returns: a list containing the permuted entries of data.
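The behaviour described above can be illustrated with a minimal sketch built on the standard library. The name random_permutation_sketch is hypothetical and this is not necessarily how Optunity implements the function; it only shows the contract of returning a new, shuffled list.

```python
import random


def random_permutation_sketch(data):
    """Return a new list holding the elements of `data` in random order.

    Hypothetical stand-in for illustration, not Optunity's own code.
    """
    shuffled = list(data)     # copy, so the caller's sequence is left untouched
    random.shuffle(shuffled)  # shuffle the copy in place
    return shuffled


original = [0, 1, 2, 3, 4]
permuted = random_permutation_sketch(original)
# `permuted` holds the same elements as `original`, in some random order.
```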
optunity.cross_validation.cross_validated(x, num_folds=10, y=None, strata=None, folds=None, num_iter=1, regenerate_folds=False, clusters=None, aggregator=<function mean>)
Function decorator to perform cross-validation as configured.
Parameters: - x – data to be used for cross-validation
- num_folds – number of cross-validation folds (default 10)
- y – (optional) labels to be used for cross-validation. If specified, len(y) must equal len(x)
- strata – (optional) strata to account for when generating folds. Strata signify instances that must be spread across folds. Not every instance must be in a stratum. Specify strata as a list of lists of instance indices.
- folds – (optional) prespecified cross-validation folds to be used (list of lists (iterations) of lists (folds)).
- num_iter – (optional) number of iterations to use (default 1)
- regenerate_folds – (optional) whether or not to regenerate folds on every evaluation (default False)
- clusters – (optional) clusters to account for when generating folds. Clusters signify instances that must be assigned to the same fold. Not every instance must be in a cluster. Specify clusters as a list of lists of instance indices.
- aggregator – function to aggregate scores of different folds (default: mean)
Returns: a cross_validated_callable with the proper configuration.
The resulting decorator must be used on a function with the following signature (plus any other arguments):
Parameters: - x_train (iterable) – training data
- y_train (iterable) – training labels (optional)
- x_test (iterable) – testing data
- y_test (iterable) – testing labels (optional)
y_train and y_test must be available if the y argument to this function is not None.
These 4 keyword arguments will be bound upon decoration. Further arguments will remain free (e.g. hyperparameter names).
>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=5, folds=[[[i] for i in range(5)]], aggregator=identity)
... def f(x_train, x_test, a):
...     return x_test[0] + a
>>> f(a=1)
[1, 2, 3, 4, 5]
>>> f(1)
[1, 2, 3, 4, 5]
>>> f(a=2)
[2, 3, 4, 5, 6]
The number of folds must be less than or equal to the size of the data.
>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=6)
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError
The number of labels (if specified) must match the number of data instances.
>>> data = list(range(5))
>>> labels = list(range(3))
>>> @cross_validated(x=data, y=labels, num_folds=2)
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError
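To make the binding of x_train, x_test, y_train and y_test more concrete, here is a simplified, self-contained sketch of a decorator with the same shape. Everything in it (cross_validated_sketch, the flat folds argument, the local mean helper) is a hypothetical stand-in, not Optunity's implementation. It handles a single iteration of prespecified folds only, and it accepts free arguments by keyword only.

```python
import functools


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


def cross_validated_sketch(x, folds, y=None, aggregator=mean):
    """Simplified stand-in: `folds` is a flat list of test-index lists."""
    if y is not None:
        assert len(y) == len(x)  # labels must match the data, as documented above

    def decorator(f):
        @functools.wraps(f)
        def wrapper(**free_args):
            scores = []
            for test_idx in folds:
                held_out = set(test_idx)
                train_idx = [i for i in range(len(x)) if i not in held_out]
                # Bind the data-related keyword arguments for this fold.
                bound = dict(free_args)
                bound['x_train'] = [x[i] for i in train_idx]
                bound['x_test'] = [x[i] for i in test_idx]
                if y is not None:
                    bound['y_train'] = [y[i] for i in train_idx]
                    bound['y_test'] = [y[i] for i in test_idx]
                scores.append(f(**bound))
            # Combine the per-fold scores into the final result.
            return aggregator(scores)
        return wrapper
    return decorator


data = list(range(5))


@cross_validated_sketch(x=data, folds=[[i] for i in range(5)], aggregator=list)
def f(x_train, x_test, a):
    return x_test[0] + a


result = f(a=1)  # one score per fold, kept as a list by the aggregator
```

After decoration only the hyperparameter a remains free, which is what allows an optimizer to call f with hyperparameters alone.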
optunity.cross_validation.generate_folds(num_rows, num_folds=10, strata=None, clusters=None)
Generates folds for a given number of rows.
Parameters: - num_rows – number of data instances
- num_folds – number of folds to use (default 10)
- strata – (optional) list of lists to indicate different sampling strata. Not all rows must be in a stratum. The number of rows per stratum must be larger than or equal to num_folds.
- clusters – (optional) list of lists indicating clustered instances. Clustered instances must be placed in a single fold to avoid information leaks.
Returns: a list of folds, each fold is a list of instance indices
>>> folds = generate_folds(num_rows=6, num_folds=2, clusters=[[1, 2]], strata=[[3,4]])
>>> len(folds)
2
>>> i1 = [idx for idx, fold in enumerate(folds) if 1 in fold]
>>> i2 = [idx for idx, fold in enumerate(folds) if 2 in fold]
>>> i1 == i2
True
>>> i3 = [idx for idx, fold in enumerate(folds) if 3 in fold]
>>> i4 = [idx for idx, fold in enumerate(folds) if 4 in fold]
>>> i3 == i4
False
Warning
Instances in strata are not necessarily spread out over all folds. Some folds may already be full due to clusters. This effect should be negligible.
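The interaction between clusters and strata can be sketched in plain Python. The function below is a hypothetical illustration (generate_folds_sketch and its seed parameter are not part of Optunity) and assumes clusters and strata are disjoint. It places clusters first, deals strata round-robin, and balances the remaining rows. It does not model folds becoming "full", so it will not reproduce the edge case described in the warning.

```python
import random


def generate_folds_sketch(num_rows, num_folds=10, strata=None, clusters=None, seed=None):
    """Hypothetical fold generator that honours clusters and strata."""
    rng = random.Random(seed)
    folds = [[] for _ in range(num_folds)]
    assigned = set()

    # Clusters first: all members of a cluster go to the currently smallest fold.
    for cluster in (clusters or []):
        min(folds, key=len).extend(cluster)
        assigned.update(cluster)

    # Strata next: shuffle each stratum, then deal its members round-robin
    # so that consecutive members land in different folds.
    for stratum in (strata or []):
        members = [i for i in stratum if i not in assigned]
        rng.shuffle(members)
        start = rng.randrange(num_folds)
        for offset, idx in enumerate(members):
            folds[(start + offset) % num_folds].append(idx)
            assigned.add(idx)

    # Remaining rows: shuffle, then always add to the smallest fold to balance sizes.
    rest = [i for i in range(num_rows) if i not in assigned]
    rng.shuffle(rest)
    for idx in rest:
        min(folds, key=len).append(idx)
    return folds


folds = generate_folds_sketch(num_rows=6, num_folds=2,
                              clusters=[[1, 2]], strata=[[3, 4]], seed=0)
fold_of = {idx: k for k, fold in enumerate(folds) for idx in fold}
```

In this sketch fold_of[1] equals fold_of[2] because both rows belong to one cluster, while fold_of[3] differs from fold_of[4] because the two rows share a stratum.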
optunity.cross_validation.strata_by_labels(labels)
Constructs a list of strata (lists) based on the unique values of labels.
Parameters: labels – iterable, identical values will end up in identical strata
Returns: the strata, as a list of lists
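As an illustration of grouping by label, the sketch below builds one stratum per unique label. It assumes each stratum holds instance indices, which is the form the strata parameter of cross_validated() expects. The name strata_by_labels_sketch and the first-appearance ordering of strata are assumptions made for this example, not documented behaviour.

```python
from collections import OrderedDict


def strata_by_labels_sketch(labels):
    """Group instance indices by label; one stratum (list) per unique label."""
    groups = OrderedDict()  # keeps strata in order of first appearance
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return list(groups.values())


strata = strata_by_labels_sketch(['a', 'b', 'a', 'c', 'b'])
# strata == [[0, 2], [1, 4], [3]]
```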
optunity.cross_validation.list_mean(list_of_measures)
Computes the element-wise means of the tuples in the given list.
Parameters: list_of_measures (list) – a list of tuples to compute means from
Returns: a list containing the means
This function can be used as an aggregator in cross_validated() when the wrapped function returns multiple performance measures.
>>> list_mean([(1, 4), (2, 5), (3, 6)])
[2.0, 5.0]
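The element-wise averaging can be sketched with zip, which regroups the per-fold tuples into one column per performance measure. The name list_mean_sketch is hypothetical and the code is an illustration of the documented behaviour rather than Optunity's source.

```python
def list_mean_sketch(list_of_measures):
    """Average each position across a non-empty list of equally long tuples."""
    n = float(len(list_of_measures))
    # zip(*...) turns [(1, 4), (2, 5), (3, 6)] into the columns (1, 2, 3) and (4, 5, 6).
    return [sum(column) / n for column in zip(*list_of_measures)]


means = list_mean_sketch([(1, 4), (2, 5), (3, 6)])
# means == [2.0, 5.0]
```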
optunity.cross_validation.mean_and_list(x)
Returns a tuple, the first element is the mean of x and the second is x itself.
This function can be used as an aggregator in cross_validated().
>>> mean_and_list([1,2,3])
(2.0, [1, 2, 3])
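A minimal sketch of such an aggregator follows. The name mean_and_list_sketch is hypothetical. The point it illustrates is that the per-fold scores survive alongside their mean, which helps when the spread across folds matters as well as the average.

```python
def mean_and_list_sketch(x):
    """Return (mean of x, x) for a non-empty list of numbers."""
    return (sum(x) / float(len(x)), x)


fold_scores = [1, 2, 3]
average, per_fold = mean_and_list_sketch(fold_scores)
# average == 2.0 and per_fold == [1, 2, 3]
```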