Reference

class rerf.rerfClassifier.rerfClassifier(projection_matrix='RerF', n_estimators=500, max_depth=None, min_samples_split=1, max_features='auto', feature_combinations=1.5, oob_score=False, n_jobs=None, random_state=None, image_height=None, image_width=None, patch_height_max=None, patch_height_min=1, patch_width_max=None, patch_width_min=1)

A random forest classifier.

Supports both Random Forest, developed by Breiman (2001) [1], and Randomer Forest or Random Projection Forests (RerF), developed by Tomita et al. (2016) [2].

The difference between the two algorithms is where the random linear combinations occur: Random Forest combines features at the tree level whereas RerF combines features at the node level.

There are two new parameters to be aware of:

projection_matrix

feature_combinations

For more information, see Parameters.

References

[1]	Breiman (2001). https://doi.org/10.1023/A:1010933404324

[2]	Tomita et al. (2016). https://arxiv.org/abs/1506.03410

Parameters:

projection_matrix (str, optional (default: "RerF")) – The random combination of features to use: either “RerF”, “Base”, or “S-RerF”. “RerF” randomly combines features for each mtry. “Base” is our implementation of Random Forest. “S-RerF” is structured RerF, combining multiple features together in random patches. See Tomita et al. (2016) [2] for further details.
n_estimators (int, optional (default: 500)) –
Number of trees in forest.

Note: This differs from scikit-learn’s default of 100.
max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split (int, optional (default: 1)) – The minimum splittable node size. A node with fewer than min_samples_split samples will be a leaf node. Note: called min.parent or minParent in other implementations.
max_features (int, float, string, or None, optional (default="auto")) –
The number of features or feature combinations to consider when looking for the best split. Note: also called mtry or d.
- If int, then consider max_features features or feature combinations at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
feature_combinations (float, optional (default: 1.5)) – Average number of features combined to form a new feature when using “RerF”. Otherwise, ignored. Each feature is independently included with probability feature_combinations / n_features.
oob_score (bool (default=False)) – Whether to use out-of-bag samples to estimate the generalization accuracy. Note: setting this to True currently runs our non-binned implementation, which has slower prediction times.
n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel for both fit and predict. None means 1. -1 means use all processors.
random_state (int or None, optional (default=None)) – Random seed to use. If None, set seed to np.random.randint(1, 1000000).
image_height (int, optional (default=None)) – S-RerF required parameter. Image height of each observation.
image_width (int, optional (default=None)) – S-RerF required parameter. Image width of each observation.
patch_height_max (int, optional (default=max(2, floor(sqrt(image_height))))) – S-RerF parameter. Maximum image patch height to randomly select from. If None, set to max(2, floor(sqrt(image_height))).
patch_height_min (int, optional (default=1)) – S-RerF parameter. Minimum image patch height to randomly select from.
patch_width_max (int, optional (default=max(2, floor(sqrt(image_width))))) – S-RerF parameter. Maximum image patch width to randomly select from. If None, set to max(2, floor(sqrt(image_width))).
patch_width_min (int, optional (default=1)) – S-RerF parameter. Minimum image patch width to randomly select from.
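
The feature_combinations description above states that each feature is included in a combination independently with probability feature_combinations / n_features. The following is a minimal sketch of that sampling step, not the library's implementation. The helper names sample_combination and project are hypothetical, and the +1/-1 weights are an assumption made for illustration; the parameter description only specifies the inclusion probability.

```python
import random


def sample_combination(n_features, feature_combinations=1.5, rng=None):
    """Hypothetical helper: sample one sparse random combination of features.

    Each feature index is included independently with probability
    feature_combinations / n_features. The +1/-1 weights are an assumption.
    """
    rng = rng or random.Random()
    p = min(1.0, feature_combinations / n_features)
    weights = {}
    for j in range(n_features):
        if rng.random() < p:
            weights[j] = rng.choice((-1, 1))
    return weights


def project(row, weights):
    """Apply a sampled combination to one observation (a sequence of numbers)."""
    return sum(w * row[j] for j, w in weights.items())


rng = random.Random(0)
combos = [sample_combination(100, 1.5, rng) for _ in range(2000)]
avg_size = sum(len(c) for c in combos) / len(combos)
print(round(avg_size, 2))  # close to feature_combinations on average
```

Because inclusion is independent per feature, an individual combination can be empty or contain several features; only the average size is controlled by feature_combinations.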

Examples

>>> from rerf.rerfClassifier import rerfClassifier
>>> from sklearn.datasets import make_classification

>>> X, y = make_classification(
...    n_samples=1000,
...    n_features=4,
...    n_informative=2,
...    n_redundant=0,
...    random_state=0,
...    shuffle=False,
... )
>>> clf = rerfClassifier(n_estimators=100, max_depth=2, random_state=0)
>>> clf.fit(X, y)
starting tree 1
max depth: 2
avg leaf node depth: 1.9899
num leaf nodes: 396
rerfClassifier(feature_combinations=1.5, max_depth=2, max_features='auto',
        min_samples_split=1, n_estimators=100, n_jobs=None,
        projection_matrix='RerF', random_state=0)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
>>> print(clf.predict_proba([[0, 0, 0, 0]]))
[[0.2 0.8]]

Notes

fit(X, y)

Fit estimator.

Parameters:
X (array-like, shape=(n_samples, n_features)) – Input data. Rows are observations and columns are features.
y (array-like, 1D numpy array) – Labels.

Returns:	self
Return type:	object

predict(X)

Predict class for X.

Parameters:	X (array_like of shape [n_samples, n_features]) – The input samples. If more than 1 row, run multiple predictions.
Returns:	y – Returns the class of prediction (int) or predictions (list) depending on input parameters.
Return type:	int, list of int

predict_proba(X)

Predict class probabilities for X. The predicted class probabilities of an input sample are computed as the mean predicted class of the trees in the forest.

Parameters:	X (array_like of shape [n_samples, n_features]) – The input samples. If more than 1 row, run multiple predictions.
Returns:	p – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type:	array of shape = [n_samples, n_classes]
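
As a rough illustration of the averaging described above, the sketch below combines per-tree predictions for a single sample. It is not the library's code: mean_class_probabilities is a hypothetical helper, and the per-tree inputs are made-up hard votes encoded as one-hot vectors.

```python
def mean_class_probabilities(per_tree_probs):
    """Average per-tree class vectors for one sample (hypothetical helper)."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    return [
        sum(tree[c] for tree in per_tree_probs) / n_trees
        for c in range(n_classes)
    ]


# Five hypothetical trees, two classes; each tree casts a hard vote.
votes = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(mean_class_probabilities(votes))  # [0.2, 0.8]
```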

predict_log_proba(X)

Predict class log-probabilities for X. The predicted class log-probabilities of an input sample are computed as the log of the mean predicted class probabilities of the trees in the forest.

Parameters:	X (array-like or sparse matrix of shape = [n_samples, n_features]) – The input samples. Internally, its dtype will be converted to `dtype=np.float32`.
Returns:	p – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type:	array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1
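
The relationship between probabilities and log-probabilities can be sketched as follows. The helper log_probabilities is hypothetical, and mapping a zero probability to negative infinity is an assumption made here, not documented behavior of predict_log_proba.

```python
import math


def log_probabilities(probs):
    """Element-wise natural log of class probabilities (hypothetical helper).

    A probability of 0 is mapped to -inf; this convention is an assumption.
    """
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


result = log_probabilities([0.5, 0.5, 0.0])
print(result)
```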

class rerf.urerf.UnsupervisedRandomForest(projection_matrix='RerF', n_estimators=100, max_depth=None, min_samples_split='auto', max_features='auto', feature_combinations='auto', n_jobs=None, random_state=None)

Unsupervised random(er) forest

Supports both Random Forest, developed by Breiman (2001) [1], and Randomer Forest or Random Projection Forests (RerF), developed by Tomita et al. (2016) [2].

The difference between the two algorithms is where the random linear combinations occur: Random Forest combines features at the tree level whereas RerF combines features at the node level.

In addition to the normal RandomForestClassifier parameters, there are two parameters to be aware of:

projection_matrix

feature_combinations

Parameters:

projection_matrix (str, optional (default: "RerF")) – The random combination of features to use: either “RerF” or “Base”. “RerF” randomly combines features for each mtry. “Base” is our implementation of Random Forest. “S-RerF” is structured RerF, combining multiple features together in random patches. See Tomita et al. (2016) [2] for further details.
n_estimators (int, optional (default: 100)) – Number of trees in forest.
max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split (int or str, optional (default: "auto")) –
The minimum splittable node size. A node with fewer than min_samples_split samples will be a leaf node. Note: called min.parent or minParent in other implementations.
- If “auto”, then min_samples_split=sqrt(num_obs).
- If int, then min_samples_split is used directly as the minimum splittable node size.
max_features (int, float, string, or None, optional (default="auto")) –
The number of features or feature combinations to consider when looking for the best split. Note: also called mtry or d.
- If int, then consider max_features features or feature combinations at each split.
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
- If “auto”, then max_features=sqrt(n_features).
- If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
feature_combinations (int, float, string, or None, optional (default: "auto")) –
Average number of features combined to form a new feature when using “RerF.” Otherwise, ignored.
- If int or float, then feature_combinations is average number of features to combine for each max_features to try.
- If “auto”, then feature_combinations=n_features.
- If “sqrt”, then feature_combinations=sqrt(n_features) (same as “auto”).
- If “log2”, then feature_combinations=log2(n_features).
- If None, then feature_combinations=n_features.
n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel for both fit and predict. None means 1. -1 means use all processors.
random_state (int or None, optional (default=None)) – Random seed to use. If None, set seed to np.random.randint(1, 1000000).
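
The string and fractional settings listed above for min_samples_split and max_features can be read as simple rules that produce an integer. The sketch below illustrates those rules under stated assumptions: the helper names are hypothetical, and rounding down with int() and clamping to at least 1 are choices made here, not behavior confirmed by this reference.

```python
import math


def resolve_min_samples_split(min_samples_split, n_samples):
    """Hypothetical helper: turn a min_samples_split setting into an int."""
    if min_samples_split == "auto":
        # "auto" means sqrt(num_obs); flooring is an assumption.
        return max(1, int(math.sqrt(n_samples)))
    return int(min_samples_split)


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: turn a max_features setting into a feature count."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A float is treated as a fraction of the available features.
        return max(1, int(max_features * n_features))
    return int(max_features)


print(resolve_min_samples_split("auto", 1000))  # 31
print(resolve_max_features("auto", 4))          # 2
print(resolve_max_features(0.5, 10))            # 5
```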

Examples

>>> from matplotlib import pyplot as plt
>>> from sklearn.cluster import AgglomerativeClustering
>>> from sklearn.datasets import make_classification
>>> from sklearn.metrics import adjusted_rand_score
>>> from rerf.urerf import UnsupervisedRandomForest

>>> X, y = make_classification(
...    n_samples=1000,
...    n_features=4,
...    n_informative=2,
...    n_redundant=0,
...    random_state=0,
...    shuffle=False,
... )
>>> clf = UnsupervisedRandomForest(n_estimators=100, random_state=0)
>>> clf.fit(X)
>>> sim_mat = clf.transform()
>>> plt.imshow(sim_mat)
>>> cluster = AgglomerativeClustering(n_clusters=2)
>>> predict_labels = cluster.fit_predict(sim_mat)
>>> score = adjusted_rand_score(y, predict_labels)
>>> print(score)
0.7601439767776818

Notes

fit(X, y=None)

Fit estimator.

Parameters:
X (array-like, shape=(n_samples, n_features)) – Input data. Rows are observations and columns are features.

Returns:	self
Return type:	object

transform(return_sparse=False)

Transform dataset into an affinity matrix / similarity matrix.

Returns:	affinity_matrix
Return type:	sparse matrix, shape=(n_samples, n_samples)
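
For intuition about what a forest-derived similarity matrix can look like, the sketch below computes one widely used forest proximity: the fraction of trees in which two samples land in the same leaf. This is an illustration only and not a description of what transform() computes internally. The helper cooccurrence_similarity and the leaf_ids input are hypothetical.

```python
def cooccurrence_similarity(leaf_ids):
    """Fraction of trees in which each pair of samples shares a leaf.

    leaf_ids[t][i] is the (hypothetical) leaf reached by sample i in tree t.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Four made-up trees, three samples.
leaf_ids = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [2, 2, 2],
]
sim = cooccurrence_similarity(leaf_ids)
print(sim[0])  # [1.0, 0.75, 0.25]
```

A matrix of this kind is symmetric with ones on the diagonal, which is the shape of input that the clustering step in the example above consumes.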

rerf.RerF.fastRerF(X=None, Y=None, CSVFile=None, Ycolumn=None, forestType='binnedBaseRerF', trees=500, minParent=1, maxDepth=None, numCores=1, mtry=None, mtryMult=1.5, fractionOfFeaturesToTest=None, seed=None, imageHeight=0, imageWidth=0, patchHeightMax=0, patchHeightMin=0, patchWidthMax=0, patchWidthMin=0)

Creates a decision forest based on an input matrix and class vector and grows the forest.

Parameters:
X (2D numpy array, optional) – Input data. Rows are observations and columns are features.
Y (list, 1D numpy array, optional) – Labels.
CSVFile (str, optional) – Training CSV filename.
Ycolumn (int, optional) – Column in the data that holds the labels.
forestType (str, optional) – The type of forest: binnedBase, binnedBaseRerF, binnedBaseTern, S-RerF (structured for 2-d images), rfBase, rerf (default: “binnedBaseRerF”).
trees (int, optional) – Number of trees in forest (default: 500).
minParent (int, optional) – The minimum splittable node size (default: 1).
maxDepth (int, optional) – Maximum depth of each tree (default: None). If None, set to the maximum value supported by the system.
numCores (int, optional) – Number of cores to use (default: 1).
mtry (int, optional) – d, the number of features to consider when splitting a node (default: None). If None, set to `sqrt(numFeatures)`.
mtryMult (double, optional) – Average number of features combined to form a new feature when using RerF (default: 1.5).
fractionOfFeaturesToTest (float, optional) – Sets mtry based on a fraction of the features instead of an exact number (default: None).
seed (int, optional) – Random seed to use (default: None). If None, set seed to `np.random.randint(1, 1000000)`.
Returns:	forest – forest class object
Return type:	pyfp.fpForest

Examples

>>> from multiprocessing import cpu_count
>>> from sklearn.datasets import make_classification
>>> from rerf.RerF import fastRerF
>>> X, y = make_classification(n_samples=1000, n_features=4,
...        n_informative=2, n_redundant=0,
...        random_state=0, shuffle=False)
>>> forest = fastRerF(
...    X=X,
...    Y=y,
...    forestType="binnedBaseRerF",
...    trees=500,
...    numCores=cpu_count(),
...    )

rerf.RerF.fastPredict(X, forest)

Predict class for X.

The predicted class of an input sample is the majority vote by the trees in the forest where each vote is the majority class of each tree’s leaf node.

Parameters:
X (array_like) – Numpy ndarray of data. If more than 1 row, run multiple predictions.
forest (pyfp.fpForest) – Forest to run predictions on.
Returns:	predictions – Returns the class of prediction (int) or predictions (list) depending on input parameters.
Return type:	int, list of int

Examples

>>> fastPredict([0, 1, 2, 3], forest)
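
The majority-vote rule described above can be sketched in a few lines. The helper majority_vote and its input are hypothetical; how fastPredict breaks ties is not stated in this reference, so the tie behavior here (the class seen first wins) is simply what collections.Counter does.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the most frequent class among per-tree votes (hypothetical helper)."""
    return Counter(tree_votes).most_common(1)[0][0]


# Each entry is the majority class of the leaf one tree assigned the sample to.
tree_votes = [1, 0, 1, 1, 0]
print(majority_vote(tree_votes))  # 1
```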

rerf.RerF.fastPredictPost(X, forest)

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the normalized votes of each tree in the forest.

Parameters:
X (array_like) – Numpy ndarray of data. If more than 1 row, run multiple predictions.
forest (pyfp.fpForest) – Forest to run predictions on.
Returns:	posterior_probabilities – Returns the class probabilities for a single observation (list) or numpy array of class probabilities for each observation depending on input parameters.
Return type:	list of ints, shape = [n_classes] or array, shape = [n_samples, n_classes]

Examples

>>> fastPredictPost([0, 1, 2, 3], forest)
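
Normalizing votes, as described above, amounts to dividing each class's vote count by the number of trees. The sketch below shows that arithmetic for one sample. The helper normalized_votes is hypothetical, and it assumes classes are labeled 0 to n_classes - 1.

```python
from collections import Counter


def normalized_votes(tree_votes, n_classes):
    """Share of trees voting for each class (hypothetical helper).

    Assumes class labels are the integers 0 .. n_classes - 1.
    """
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return [counts[c] / total for c in range(n_classes)]


posterior = normalized_votes([1, 0, 1, 1, 0], n_classes=2)
print(posterior)  # [0.4, 0.6]
```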

py-RerF

rerf.check_version()

Tells you if you have an old version of RerF.