Basic: cross-validation¶
This notebook explores the main elements of Optunity’s cross-validation facilities, including:
- standard cross-validation
- using strata and clusters while constructing folds
- using different aggregators
We recommend perusing the related documentation for more details.
Nested cross-validation is available as a separate notebook.
import optunity
import optunity.cross_validation
We start by generating some toy data containing 6 instances which we will partition into folds.
data = list(range(6))
labels = [True] * 3 + [False] * 3
Standard cross-validation¶
Each function to be decorated with cross-validation functionality must accept the following arguments: - x_train: training data - x_test: test data - y_train: training labels (required only when y is specified in the cross-validation decorator) - y_test: test labels (required only when y is specified in the cross-validation decorator)
These arguments will be set implicitly by the cross-validation decorator to match the right folds. Any remaining arguments to the decorated function remain as free parameters that must be set later on.
Lets start with the basics and look at Optunity’s cross-validation in action. We use an objective function that simply prints out the train and test data in every split to see what’s going on.
def f(x_train, y_train, x_test, y_test):
print("")
print("train data:\t" + str(x_train) + "\t train labels:\t" + str(y_train))
print("test data:\t" + str(x_test) + "\t test labels:\t" + str(y_test))
return 0.0
We start with 2 folds, which leads to equally sized train and test partitions.
f_2folds = optunity.cross_validated(x=data, y=labels, num_folds=2)(f)
print("using 2 folds")
f_2folds()
using 2 folds
train data: [4, 2, 0] train labels: [False, True, True]
test data: [3, 1, 5] test labels: [False, True, False]
train data: [3, 1, 5] train labels: [False, True, False]
test data: [4, 2, 0] test labels: [False, True, True]
0.0
# f_2folds as defined above would typically be written using decorator syntax as follows
# we don't do that in these examples so we can reuse the toy objective function
@optunity.cross_validated(x=data, y=labels, num_folds=2)
def f_2folds(x_train, y_train, x_test, y_test):
print("")
print("train data:\t" + str(x_train) + "\t train labels:\t" + str(y_train))
print("test data:\t" + str(x_test) + "\t test labels:\t" + str(y_test))
return 0.0
If we use three folds instead of 2, we get 3 iterations in which the training set is twice the size of the test set.
f_3folds = optunity.cross_validated(x=data, y=labels, num_folds=3)(f)
print("using 3 folds")
f_3folds()
using 3 folds
train data: [2, 1, 3, 0] train labels: [True, True, False, True]
test data: [5, 4] test labels: [False, False]
train data: [5, 4, 3, 0] train labels: [False, False, False, True]
test data: [2, 1] test labels: [True, True]
train data: [5, 4, 2, 1] train labels: [False, False, True, True]
test data: [3, 0] test labels: [False, True]
0.0
If we do two iterations of 3-fold cross-validation (denoted by 2x3 fold), two sets of folds are generated and evaluated.
f_2x3folds = optunity.cross_validated(x=data, y=labels, num_folds=3, num_iter=2)(f)
print("using 2x3 folds")
f_2x3folds()
using 2x3 folds
train data: [4, 1, 5, 3] train labels: [False, True, False, False]
test data: [0, 2] test labels: [True, True]
train data: [0, 2, 5, 3] train labels: [True, True, False, False]
test data: [4, 1] test labels: [False, True]
train data: [0, 2, 4, 1] train labels: [True, True, False, True]
test data: [5, 3] test labels: [False, False]
train data: [0, 2, 1, 4] train labels: [True, True, True, False]
test data: [5, 3] test labels: [False, False]
train data: [5, 3, 1, 4] train labels: [False, False, True, False]
test data: [0, 2] test labels: [True, True]
train data: [5, 3, 0, 2] train labels: [False, False, True, True]
test data: [1, 4] test labels: [True, False]
0.0
Using strata and clusters¶
Strata are defined as sets of instances that should be spread out across folds as much as possible (e.g. stratify patients by age). Clusters are sets of instances that must be put in a single fold (e.g. cluster measurements of the same patient).
Optunity allows you to specify strata and/or clusters that must be accounted for while construct cross-validation folds. Not all instances have to belong to a stratum or clusters.
Strata¶
We start by illustrating strata. Strata are specified as a list of lists of instances indices. Each list defines one stratum. We will reuse the toy data and objective function specified above. We will create 2 strata with 2 instances each. These instances will be spread across folds. We create two strata: \(\{0, 1\}\) and \(\{2, 3\}\).
strata = [[0, 1], [2, 3]]
f_stratified = optunity.cross_validated(x=data, y=labels, strata=strata, num_folds=3)(f)
f_stratified()
train data: [0, 4, 2, 5] train labels: [True, False, True, False]
test data: [1, 3] test labels: [True, False]
train data: [1, 3, 2, 5] train labels: [True, False, True, False]
test data: [0, 4] test labels: [True, False]
train data: [1, 3, 0, 4] train labels: [True, False, True, False]
test data: [2, 5] test labels: [True, False]
0.0
Clusters¶
Clusters work similarly, except that now instances within a cluster are guaranteed to be placed within a single fold. The way to specify clusters is identical to strata. We create two clusters: \(\{0, 1\}\) and \(\{2, 3\}\). These pairs will always occur in a single fold.
clusters = [[0, 1], [2, 3]]
f_clustered = optunity.cross_validated(x=data, y=labels, clusters=clusters, num_folds=3)(f)
f_clustered()
train data: [0, 1, 2, 3] train labels: [True, True, True, False]
test data: [4, 5] test labels: [False, False]
train data: [4, 5, 2, 3] train labels: [False, False, True, False]
test data: [0, 1] test labels: [True, True]
train data: [4, 5, 0, 1] train labels: [False, False, True, True]
test data: [2, 3] test labels: [True, False]
0.0
Strata and clusters¶
Strata and clusters can be used together. Lets say we have the following configuration:
- 1 stratum: \(\{0, 1, 2\}\)
- 2 clusters: \(\{0, 3\}\), \(\{4, 5\}\)
In this particular example, instances 1 and 2 will inevitably end up in a single fold, even though they are part of one stratum. This happens because the total data set has size 6, and 4 instances are already in clusters.
strata = [[0, 1, 2]]
clusters = [[0, 3], [4, 5]]
f_strata_clustered = optunity.cross_validated(x=data, y=labels, clusters=clusters, strata=strata, num_folds=3)(f)
f_strata_clustered()
train data: [4, 5, 0, 3] train labels: [False, False, True, False]
test data: [1, 2] test labels: [True, True]
train data: [1, 2, 0, 3] train labels: [True, True, True, False]
test data: [4, 5] test labels: [False, False]
train data: [1, 2, 4, 5] train labels: [True, True, False, False]
test data: [0, 3] test labels: [True, False]
0.0
Aggregators¶
Aggregators are used to combine the scores per fold into a single result. The default approach used in cross-validation is to take the mean of all scores. In some cases, we might be interested in worst-case or best-case performance, the spread, ...
Opunity allows passing a custom callable to be used as aggregator.
The default aggregation in Optunity is to compute the mean across folds.
@optunity.cross_validated(x=data, num_folds=3)
def f(x_train, x_test):
result = x_test[0]
print(result)
return result
f(1)
4
1
2
2.3333333333333335
This can be replaced by any function, e.g. min or max.
@optunity.cross_validated(x=data, num_folds=3, aggregator=max)
def fmax(x_train, x_test):
result = x_test[0]
print(result)
return result
fmax(1)
2
5
1
5
@optunity.cross_validated(x=data, num_folds=3, aggregator=min)
def fmin(x_train, x_test):
result = x_test[0]
print(result)
return result
fmin(1)
3
4
5
3
Retaining intermediate results¶
Often, it may be useful to retain all intermediate results, not just the
final aggregated data. This is made possible via
optunity.cross_validation.mean_and_list
aggregator. This aggregator
computes the mean for internal use in cross-validation, but also returns
a list of lists containing the full evaluation results.
@optunity.cross_validated(x=data, num_folds=3,
aggregator=optunity.cross_validation.mean_and_list)
def f_full(x_train, x_test, coeff):
return x_test[0] * coeff
# evaluate f
mean_score, all_scores = f_full(1.0)
print(mean_score)
print(all_scores)
2.33333333333
[0.0, 2.0, 5.0]
Note that a cross-validation based on the mean_and_list
aggregator
essentially returns a tuple of results. If the result is iterable, all
solvers in Optunity use the first element as the objective function
value. You can let the cross-validation procedure return other useful
statistics too, which you can access from the solver trace.
opt_coeff, info, _ = optunity.minimize(f_full, coeff=[0, 1], num_evals=10)
print(opt_coeff)
print("call log")
for args, val in zip(info.call_log['args']['coeff'], info.call_log['values']):
print(str(args) + '\t\t' + str(val))
{'coeff': 0.15771484375}
call log
0.34521484375 (0.8055013020833334, [0.0, 0.6904296875, 1.72607421875])
0.47021484375 (1.09716796875, [0.0, 0.9404296875, 2.35107421875])
0.97021484375 (2.2638346354166665, [0.0, 1.9404296875, 4.85107421875])
0.72021484375 (1.6805013020833333, [0.0, 1.4404296875, 3.60107421875])
0.22021484375 (0.5138346354166666, [0.0, 0.4404296875, 1.10107421875])
0.15771484375 (0.3680013020833333, [0.0, 0.3154296875, 0.78857421875])
0.65771484375 (1.53466796875, [0.0, 1.3154296875, 3.28857421875])
0.90771484375 (2.1180013020833335, [0.0, 1.8154296875, 4.53857421875])
0.40771484375 (0.9513346354166666, [0.0, 0.8154296875, 2.03857421875])
0.28271484375 (0.65966796875, [0.0, 0.5654296875, 1.41357421875])
Cross-validation with scikit-learn¶
In this example we will show how to use cross-validation methods that are provided by scikit-learn in conjunction with Optunity. To do this we provide Optunity with the folds that scikit-learn produces in a specific format.
In supervised learning datasets often have unbalanced labels. When performing cross-validation with unbalanced data it is good practice to preserve the percentage of samples for each class across folds. To achieve this label balance we will use StratifiedKFold.
data = list(range(20))
labels = [1 if i%4==0 else 0 for i in range(20)]
@optunity.cross_validated(x=data, y=labels, num_folds=5)
def unbalanced_folds(x_train, y_train, x_test, y_test):
print("")
print("train data:\t" + str(x_train) + "\ntrain labels:\t" + str(y_train)) + '\n'
print("test data:\t" + str(x_test) + "\ntest labels:\t" + str(y_test)) + '\n'
return 0.0
unbalanced_folds()
train data: [16, 6, 4, 14, 0, 11, 19, 5, 9, 2, 12, 8, 7, 10, 18, 3]
train labels: [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
test data: [15, 1, 13, 17]
test labels: [0, 0, 0, 0]
train data: [15, 1, 13, 17, 0, 11, 19, 5, 9, 2, 12, 8, 7, 10, 18, 3]
train labels: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
test data: [16, 6, 4, 14]
test labels: [1, 0, 1, 0]
train data: [15, 1, 13, 17, 16, 6, 4, 14, 9, 2, 12, 8, 7, 10, 18, 3]
train labels: [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
test data: [0, 11, 19, 5]
test labels: [1, 0, 0, 0]
train data: [15, 1, 13, 17, 16, 6, 4, 14, 0, 11, 19, 5, 7, 10, 18, 3]
train labels: [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
test data: [9, 2, 12, 8]
test labels: [0, 0, 1, 1]
train data: [15, 1, 13, 17, 16, 6, 4, 14, 0, 11, 19, 5, 9, 2, 12, 8]
train labels: [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
test data: [7, 10, 18, 3]
test labels: [0, 0, 0, 0]
0.0
Notice above how the test label sets have a varying number of postive samples, some have none, some have one, and some have two.
from sklearn.cross_validation import StratifiedKFold
stratified_5folds = StratifiedKFold(labels, n_folds=5)
folds = [[list(test) for train, test in stratified_5folds]]
@optunity.cross_validated(x=data, y=labels, folds=folds, num_folds=5)
def balanced_folds(x_train, y_train, x_test, y_test):
print("")
print("train data:\t" + str(x_train) + "\ntrain labels:\t" + str(y_train)) + '\n'
print("test data:\t" + str(x_test) + "\ntest labels:\t" + str(y_test)) + '\n'
return 0.0
balanced_folds()
train data: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
train labels: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
test data: [0, 1, 2, 3]
test labels: [1, 0, 0, 0]
train data: [0, 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
train labels: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
test data: [4, 5, 6, 7]
test labels: [1, 0, 0, 0]
train data: [0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 18, 19]
train labels: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
test data: [8, 9, 10, 11]
test labels: [1, 0, 0, 0]
train data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19]
train labels: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
test data: [12, 13, 14, 15]
test labels: [1, 0, 0, 0]
train data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
train labels: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
test data: [16, 17, 18, 19]
test labels: [1, 0, 0, 0]
0.0
Now all of our train sets have four positive samples and our test sets have one positive sample.
To use predetermined folds, place a list of the test sample idices into a list. And then insert that list into another list. Why so many nested lists? Because you can perform multiple cross-validation runs by setting num_iter appropriately and then append num_iter lists of test samples to the outer most list. Note that the test samples for a given fold are the idicies that you provide and then the train samples for that fold are all of the indices from all other test sets joined together. If not done carefully this may lead to duplicated samples in a train set and also samples that fall in both train and test sets of a fold if a datapoint is in multiple folds’ test sets.
data = list(range(6))
labels = [True] * 3 + [False] * 3
fold1 = [[0, 3], [1, 4], [2, 5]]
fold2 = [[0, 5], [1, 4], [0, 3]] # notice what happens when the indices are not unique
folds = [fold1, fold2]
@optunity.cross_validated(x=data, y=labels, folds=folds, num_folds=3, num_iter=2)
def multiple_iters(x_train, y_train, x_test, y_test):
print("")
print("train data:\t" + str(x_train) + "\t train labels:\t" + str(y_train))
print("test data:\t" + str(x_test) + "\t\t test labels:\t" + str(y_test))
return 0.0
multiple_iters()
train data: [1, 4, 2, 5] train labels: [True, False, True, False]
test data: [0, 3] test labels: [True, False]
train data: [0, 3, 2, 5] train labels: [True, False, True, False]
test data: [1, 4] test labels: [True, False]
train data: [0, 3, 1, 4] train labels: [True, False, True, False]
test data: [2, 5] test labels: [True, False]
train data: [1, 4, 0, 3] train labels: [True, False, True, False]
test data: [0, 5] test labels: [True, False]
train data: [0, 5, 0, 3] train labels: [True, False, True, False]
test data: [1, 4] test labels: [True, False]
train data: [0, 5, 1, 4] train labels: [True, False, True, False]
test data: [0, 3] test labels: [True, False]
0.0