verde.BlockShuffleSplit

class verde.BlockShuffleSplit(spacing=None, shape=None, n_splits=10, test_size=0.1, train_size=None, random_state=None, balancing=10)

Random permutation of spatial blocks cross-validator.

Yields indices to split data into training and test sets. Data are first grouped into rectangular blocks of size given by the spacing argument. Alternatively, blocks can be defined by the number of blocks in each dimension using the shape argument instead of spacing. The blocks are then split into testing and training sets randomly.
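
For instance, assuming the data later passed to split span a 3 x 3 unit region, the two splitters below (illustrative values, not from the original docs) should produce the same 2 x 2 grid of blocks:

>>> from verde import BlockShuffleSplit
>>> # A block size of 1.5 units gives 2 blocks along each dimension
>>> by_spacing = BlockShuffleSplit(spacing=1.5, random_state=0)
>>> # Asking for 2 x 2 blocks directly is equivalent for this region
>>> by_shape = BlockShuffleSplit(shape=(2, 2), random_state=0)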

The proportion of blocks assigned to each set is controlled by test_size and/or train_size. However, the total number of data points in each set can differ from these proportions because blocks can contain different numbers of points. To keep the proportion of actual data as close as possible to the proportion of blocks, this cross-validator generates several candidate splits and selects the one whose proportion of data points in each set is closest to the desired amount [Valavi_etal2019]. The number of candidate splits per iteration is controlled by the balancing argument.
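
For example (illustrative values), a larger balancing trades extra computation for split proportions that track test_size more closely:

>>> from verde import BlockShuffleSplit
>>> # Draw 20 candidate splits per iteration and keep the best balanced one
>>> balanced = BlockShuffleSplit(spacing=1.5, test_size=0.2, balancing=20)
>>> # balancing=1 keeps the single candidate, disabling the balancing step
>>> unbalanced = BlockShuffleSplit(spacing=1.5, test_size=0.2, balancing=1)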

This cross-validator is preferred over sklearn.model_selection.ShuffleSplit for spatial data to avoid overestimating cross-validation scores. This can happen because of the inherent autocorrelation that is usually associated with this type of data (points that are close together are more likely to have similar values). See [Roberts_etal2017] for an overview of this topic.

Note

Like sklearn.model_selection.ShuffleSplit, this cross-validator cannot guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Parameters
  • spacing (float, tuple = (s_north, s_east), or None) – The block size in the South-North and West-East directions, respectively. A single value means that the spacing is equal in both directions. If None, then shape must be provided.

  • shape (tuple = (n_north, n_east) or None) – The number of blocks in the South-North and West-East directions, respectively. If None, then spacing must be provided.

  • n_splits (int, default 10) – Number of re-shuffling & splitting iterations.

  • test_size (float, int, or None, default=0.1) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.1.

  • train_size (float, int, or None, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. An integer seed makes the splits reproducible, as sketched after this parameter list.

  • balancing (int) – The number of splits generated per iteration to try to balance the amount of data in each set so that test_size and train_size are respected. If 1, then no extra splits are generated (essentially disabling the balancing). Must be >= 1.
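
Because all of the randomness is driven by random_state, an integer seed yields identical splits across runs. A minimal sketch of this (coordinates and seed are made up for illustration):

>>> import numpy as np
>>> from verde import grid_coordinates, BlockShuffleSplit
>>> coords = grid_coordinates(region=(0, 3, -10, -7), spacing=1)
>>> X = np.transpose([i.ravel() for i in coords])
>>> cv1 = BlockShuffleSplit(spacing=1.5, n_splits=2, random_state=42)
>>> cv2 = BlockShuffleSplit(spacing=1.5, n_splits=2, random_state=42)
>>> # Same seed and arguments, so both yield the exact same index arrays
>>> all(
...     np.array_equal(tr1, tr2) and np.array_equal(ts1, ts2)
...     for (tr1, ts1), (tr2, ts2) in zip(cv1.split(X), cv2.split(X))
... )
True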

See also

train_test_split

Split a dataset into a training and a testing set.

cross_val_score

Score an estimator/gridder using cross-validation.

Examples

>>> from verde import grid_coordinates, BlockShuffleSplit
>>> import numpy as np
>>> # Make a regular grid of data points
>>> coords = grid_coordinates(region=(0, 3, -10, -7), spacing=1)
>>> # Need to convert the coordinates into a feature matrix
>>> X = np.transpose([i.ravel() for i in coords])
>>> shuffle = BlockShuffleSplit(spacing=1.5, n_splits=3, random_state=0)
>>> # These are the 1D indices of the points belonging to each set
>>> for train, test in shuffle.split(X):
...     print("Train: {} Test: {}".format(train, test))
Train: [ 0  1  2  3  4  5  6  7 10 11 14 15] Test: [ 8  9 12 13]
Train: [ 2  3  6  7  8  9 10 11 12 13 14 15] Test: [0 1 4 5]
Train: [ 0  1  4  5  8  9 10 11 12 13 14 15] Test: [2 3 6 7]
>>> # A better way to visualize this is to create a 2D array and put
>>> # "train" or "test" in the corresponding locations.
>>> shape = coords[0].shape
>>> mask = np.full(shape=shape, fill_value="     ")
>>> for iteration, (train, test) in enumerate(shuffle.split(X)):
...     # The index needs to be converted to 2D so we can index our matrix.
...     mask[np.unravel_index(train, shape)] = "train"
...     mask[np.unravel_index(test, shape)] = " test"
...     print("Iteration {}:".format(iteration))
...     print(mask)
Iteration 0:
[['train' 'train' 'train' 'train']
 ['train' 'train' 'train' 'train']
 [' test' ' test' 'train' 'train']
 [' test' ' test' 'train' 'train']]
Iteration 1:
[[' test' ' test' 'train' 'train']
 [' test' ' test' 'train' 'train']
 ['train' 'train' 'train' 'train']
 ['train' 'train' 'train' 'train']]
Iteration 2:
[['train' 'train' ' test' ' test']
 ['train' 'train' ' test' ' test']
 ['train' 'train' 'train' 'train']
 ['train' 'train' 'train' 'train']]
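
Because BlockShuffleSplit follows scikit-learn's splitter interface (split and get_n_splits), it can also be handed to scikit-learn utilities through their cv argument. A minimal sketch continuing the session above (the linear target y is invented for illustration):

>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_val_score
>>> # A made-up target that varies linearly with the coordinates
>>> y = X[:, 0] + 2 * X[:, 1]
>>> scores = cross_val_score(LinearRegression(), X, y, cv=shuffle)
>>> # One score per re-shuffling iteration
>>> len(scores)
3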

Methods Summary

BlockShuffleSplit.get_n_splits([X, y, groups])

Returns the number of splitting iterations in the cross-validator.

BlockShuffleSplit.split(X[, y, groups])

Generate indices to split data into training and test set.


BlockShuffleSplit.get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters
  • X (object) – Always ignored, exists for compatibility.

  • y (object) – Always ignored, exists for compatibility.

  • groups (object) – Always ignored, exists for compatibility.

Returns

n_splits (int) – Returns the number of splitting iterations in the cross-validator.
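
A quick illustration (values are made up):

>>> from verde import BlockShuffleSplit
>>> BlockShuffleSplit(spacing=1.0, n_splits=5).get_n_splits()
5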

BlockShuffleSplit.split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters
  • X (array-like, shape (n_samples, 2)) – Columns should be the easting and northing coordinates of data points, respectively.

  • y (array-like, shape (n_samples,)) – The target variable for supervised learning problems. Always ignored.

  • groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set. Always ignored.

Yields
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.
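
For example, to grab a single train/test split (a sketch with made-up coordinates):

>>> import numpy as np
>>> from verde import grid_coordinates, BlockShuffleSplit
>>> coords = grid_coordinates(region=(0, 3, -10, -7), spacing=1)
>>> X = np.transpose([i.ravel() for i in coords])
>>> cv = BlockShuffleSplit(spacing=1.5, n_splits=1, random_state=0)
>>> train, test = next(cv.split(X))
>>> # Every point lands in exactly one of the two sets
>>> train.size + test.size == X.shape[0]
True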