This module contains various provisions for cross-validation.
The main functions in this module are:
Module author: Marc Claesen
Selects the subset specified by indices from collection.
>>> select([0, 1, 2, 3, 4], [1, 3])
[1, 3]
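In the doctest above the values happen to equal their own indices, which hides the fact that selection is positional. The following is a minimal sketch of that behaviour, not the module's implementation; the name `select_sketch` is hypothetical.

```python
def select_sketch(collection, indices):
    """Return the elements of `collection` found at positions `indices`."""
    # Positional lookup: the result follows the order of `indices`.
    return [collection[i] for i in indices]


print(select_sketch(["a", "b", "c", "d", "e"], [1, 3]))  # ['b', 'd']
```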
Returns a list containing a random permutation of r elements out of data.
Parameters: data – an iterable containing the elements to permute over
Returns: a list containing permuted entries of data
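A permutation of this kind can be sketched with the standard library alone. The sketch below assumes that all elements of `data` are permuted and that the input is left untouched; the name `random_permutation_sketch` and the optional `rng` argument are hypothetical and not part of this module.

```python
import random


def random_permutation_sketch(data, rng=None):
    """Return a new list holding the elements of `data` in random order."""
    if rng is None:
        rng = random.Random()
    items = list(data)   # copy, so the caller's data is not modified
    rng.shuffle(items)   # in-place shuffle of the copy
    return items


original = [0, 1, 2, 3, 4]
shuffled = random_permutation_sketch(original, rng=random.Random(42))
print(sorted(shuffled) == original)  # same elements, possibly new order
```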
cross_validated(x, num_folds=10, y=None, strata=None, folds=None, num_iter=1, regenerate_folds=False, clusters=None, aggregator=<function mean>)
Function decorator to perform cross-validation as configured.
- x – data to be used for cross-validation
- num_folds – number of cross-validation folds (default 10)
- y – (optional) labels to be used for cross-validation. If specified, len(y) must equal len(x)
- strata – (optional) strata to account for when generating folds. Strata signify instances that must be spread across folds. Not every instance must be in a stratum. Specify strata as a list of lists of instance indices.
- folds – (optional) prespecified cross-validation folds to be used (list of lists (iterations) of lists (folds)).
- num_iter – (optional) number of iterations to use (default 1)
- regenerate_folds – (optional) whether or not to regenerate folds on every evaluation (default False)
- clusters – (optional) clusters to account for when generating folds. Clusters signify instances that must be assigned to the same fold. Not every instance must be in a cluster. Specify clusters as a list of lists of instance indices.
- aggregator – function to aggregate scores of different folds (default: mean)
Returns: a cross_validated_callable with the proper configuration.
The resulting decorator must be used on a function with the following signature (plus any other arguments):
- x_train (iterable) – training data
- y_train (iterable) – training labels (optional)
- x_test (iterable) – testing data
- y_test (iterable) – testing labels (optional)
y_train and y_test must be available if the y argument to this function is not None.
These four keyword arguments are bound upon decoration. Any further arguments remain free (e.g. hyperparameter names).
>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=5, folds=[[[i] for i in range(5)]], aggregator=identity)
... def f(x_train, x_test, a):
...     return x_test[0] + a
>>> f(a=1)
[1, 2, 3, 4, 5]
>>> f(1)
[1, 2, 3, 4, 5]
>>> f(a=2)
[2, 3, 4, 5, 6]
The number of folds must be less than or equal to the size of the data.
>>> data = list(range(5))
>>> @cross_validated(x=data, num_folds=6)
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError
The number of labels (if specified) must match the number of data instances.
>>> data = list(range(5))
>>> labels = list(range(3))
>>> @cross_validated(x=data, y=labels, num_folds=2)
... def f(x_train, x_test, a):
...     return x_test[0] + a
AssertionError
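To make the binding of x_train, y_train, x_test and y_test more concrete, here is a simplified sketch of how a decorator of this kind could work. It is not this module's implementation: `cross_validated_sketch` is a hypothetical name, it takes a single iteration of prespecified folds (a list of lists of indices) rather than generating them, and free arguments must be passed by keyword.

```python
from statistics import mean


def cross_validated_sketch(x, folds, y=None, aggregator=mean):
    """Hypothetical, simplified stand-in for a cross-validation decorator."""
    if y is not None:
        assert len(y) == len(x), "labels and data must have equal length"

    def decorator(f):
        def wrapper(**free_kwargs):
            scores = []
            for test_idx in folds:
                held_out = set(test_idx)
                train_idx = [i for i in range(len(x)) if i not in held_out]
                # These keyword arguments are bound by the decorator.
                bound = {
                    "x_train": [x[i] for i in train_idx],
                    "x_test": [x[i] for i in test_idx],
                }
                if y is not None:
                    bound["y_train"] = [y[i] for i in train_idx]
                    bound["y_test"] = [y[i] for i in test_idx]
                # Everything else (e.g. hyperparameters) stays free.
                scores.append(f(**bound, **free_kwargs))
            return aggregator(scores)

        return wrapper

    return decorator


data = [1.0, 2.0, 3.0, 4.0]


@cross_validated_sketch(x=data, folds=[[0, 1], [2, 3]])
def score(x_train, x_test, offset):
    # Toy "score": mean of the held-out values plus a free hyperparameter.
    return mean(x_test) + offset


print(score(offset=0.5))  # 3.0
```

The decorated function is called with the free arguments only; each call evaluates the wrapped function once per fold and aggregates the per-fold results.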
generate_folds(num_rows, num_folds=10, strata=None, clusters=None)
Generates folds for a given number of rows.
- num_rows – number of data instances
- num_folds – number of folds to use (default 10)
- strata – (optional) list of lists to indicate different sampling strata. Not all rows must be in a stratum. The number of rows per stratum must be larger than or equal to num_folds.
- clusters – (optional) list of lists indicating clustered instances. Clustered instances must be placed in a single fold to avoid information leaks.
Returns: a list of folds; each fold is a list of instance indices
>>> folds = generate_folds(num_rows=6, num_folds=2, clusters=[[1, 2]], strata=[[3,4]])
>>> len(folds)
2
>>> i1 = [idx for idx, fold in enumerate(folds) if 1 in fold]
>>> i2 = [idx for idx, fold in enumerate(folds) if 2 in fold]
>>> i1 == i2
True
>>> i3 = [idx for idx, fold in enumerate(folds) if 3 in fold]
>>> i4 = [idx for idx, fold in enumerate(folds) if 4 in fold]
>>> i3 == i4
False
Instances in strata are not necessarily spread out over all folds. Some folds may already be full due to clusters. This effect should be negligible.
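The interaction between clusters and strata can be illustrated with a small, self-contained sketch. It is one possible scheme and not necessarily the algorithm used by generate_folds: clusters are placed first as indivisible units, strata members are then dealt out over consecutive folds, and the remaining rows fill up the smallest folds. The name `generate_folds_sketch` and the `seed` argument are hypothetical, and this sketch does not cap the size of a fold.

```python
import random


def generate_folds_sketch(num_rows, num_folds=10, strata=None, clusters=None, seed=None):
    """Hypothetical fold generator honouring clusters and strata."""
    rng = random.Random(seed)
    folds = [[] for _ in range(num_folds)]
    assigned = set()

    def smallest_fold():
        # Index of the fold that currently holds the fewest rows.
        return min(range(num_folds), key=lambda k: len(folds[k]))

    # Clusters: all members go to the same fold.
    for cluster in clusters or []:
        folds[smallest_fold()].extend(cluster)
        assigned.update(cluster)

    # Strata: members are dealt out over consecutive folds.
    for stratum in strata or []:
        members = [i for i in stratum if i not in assigned]
        rng.shuffle(members)
        start = smallest_fold()
        for offset, i in enumerate(members):
            folds[(start + offset) % num_folds].append(i)
            assigned.add(i)

    # Remaining rows: always top up the smallest fold.
    rest = [i for i in range(num_rows) if i not in assigned]
    rng.shuffle(rest)
    for i in rest:
        folds[smallest_fold()].append(i)
    return folds


folds = generate_folds_sketch(6, num_folds=2, clusters=[[1, 2]], strata=[[3, 4]], seed=0)
print(folds)
```

With these arguments, rows 1 and 2 always share a fold, while rows 3 and 4 end up in different folds.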
Constructs a list of strata (lists) based on unique values of labels.
Parameters: labels – iterable, identical values will end up in identical strata Returns: the strata, as a list of lists
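As an illustration of grouping by label, the sketch below assumes that each stratum holds instance indices, in line with how the strata argument of cross_validated is described. The name `strata_by_labels_sketch` is hypothetical, and the ordering of the strata is a choice of this sketch.

```python
from collections import defaultdict


def strata_by_labels_sketch(labels):
    """Group instance indices by label value; one stratum per unique label."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    # Strata appear in order of first occurrence of each label.
    return list(groups.values())


print(strata_by_labels_sketch(["a", "b", "a", "c", "b"]))  # [[0, 2], [1, 4], [3]]
```

The result has the same shape as the strata argument, so it could be passed on as such when stratifying by class label.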
Computes the means of corresponding elements across the tuples in the given list.
Parameters: list_of_measures (list) – a list of tuples to compute means from Returns: a list containing the means
This function can be used as an aggregator in cross_validated() when the wrapped function returns multiple performance measures.
>>> list_mean([(1, 4), (2, 5), (3, 6)])
[2.0, 5.0]
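The position-wise averaging shown in the doctest can be sketched as follows. This is an illustrative version under the assumption that all tuples have equal length; the name `list_mean_sketch` is hypothetical.

```python
def list_mean_sketch(list_of_measures):
    """Average the i-th entries of all tuples, for every position i."""
    # zip(*...) regroups the tuples by position: one column per measure.
    columns = zip(*list_of_measures)
    return [sum(column) / len(column) for column in columns]


# Three folds, each reporting two performance measures.
print(list_mean_sketch([(1, 4), (2, 5), (3, 6)]))  # [2.0, 5.0]
```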