Reference
class rerf.rerfClassifier.rerfClassifier(projection_matrix='RerF', n_estimators=500, max_depth=None, min_samples_split=1, max_features='auto', feature_combinations=1.5, oob_score=False, n_jobs=None, random_state=None, image_height=None, image_width=None, patch_height_max=None, patch_height_min=1, patch_width_max=None, patch_width_min=1)
A random forest classifier.
Supports both Random Forest, developed by Breiman (2001) [1], and Randomer Forest, also called Random Projection Forests (RerF), developed by Tomita et al. (2016) [2].
The difference between the two algorithms is where the random linear combinations occur: Random Forest combines features at the tree level whereas RerF combines features at the node level.
There are two new parameters to be aware of:
- projection_matrix
- feature_combinations
For more information, see Parameters.
References
[1] Breiman (2001). https://doi.org/10.1023/A:1010933404324
[2] Tomita et al. (2016). https://arxiv.org/abs/1506.03410
Parameters:
- projection_matrix (str, optional (default: "RerF")) – The random combination of features to use: either “RerF”, “Base”, or “S-RerF”. “RerF” randomly combines features for each mtry. “Base” is our implementation of Random Forest. “S-RerF” is structured RerF, combining multiple features together in random patches. See Tomita et al. (2016) [2] for further details.
- n_estimators (int, optional (default: 500)) –
Number of trees in forest.
Note: This differs from scikit-learn’s default of 100.
- max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split (int, optional (default: 1)) – The minimum splittable node size. A node with fewer than min_samples_split samples will be a leaf node. Note: other implementations call this min.parent or minParent.
- max_features (int, float, string, or None, optional (default="auto")) – The number of features or feature combinations to consider when looking for the best split. Note: also called mtry or d.
  - If int, then consider max_features features or feature combinations at each split.
  - If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
  - If “auto”, then max_features=sqrt(n_features).
  - If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
  - If “log2”, then max_features=log2(n_features).
  - If None, then max_features=n_features.
- feature_combinations (float, optional (default: 1.5)) – Average number of features combined to form a new feature when using “RerF”. Otherwise, ignored. Each feature is independently included with probability feature_combinations / n_features.
- oob_score (bool (default=False)) – Whether to use out-of-bag samples to estimate the generalization accuracy. Note, setting to True currently runs our non-binned implementation which has slower prediction times.
- n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel for both fit and predict. None means 1. -1 means use all processors.
- random_state (int or None, optional (default=None)) – Random seed to use. If None, set seed to np.random.randint(1, 1000000).
- image_height (int, optional (default=None)) – S-RerF required parameter. Image height of each observation.
- image_width (int, optional (default=None)) – S-RerF required parameter. Image width of each observation.
- patch_height_max (int, optional (default=max(2, floor(sqrt(image_height))))) – S-RerF parameter. Maximum image patch height to randomly select from. If None, set to max(2, floor(sqrt(image_height))).
- patch_height_min (int, optional (default=1)) – S-RerF parameter. Minimum image patch height to randomly select from.
- patch_width_max (int, optional (default=max(2, floor(sqrt(image_width))))) – S-RerF parameter. Maximum image patch width to randomly select from. If None, set to max(2, floor(sqrt(image_width))).
- patch_width_min (int, optional (default=1)) – S-RerF parameter. Minimum image patch width to randomly select from.
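The max_features rules and the S-RerF patch-size defaults listed above can be sketched as two small helpers. This is an illustration only, not the library's code: the names resolve_max_features and default_patch_max are hypothetical, and rounding the square root and base-2 logarithm down to an integer (with a floor of one feature) is an assumption, since the reference does not say how fractional results are rounded.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: turn a max_features setting into a feature count."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            value = math.sqrt(n_features)
        elif max_features == "log2":
            value = math.log2(n_features)
        else:
            raise ValueError(f"unknown max_features: {max_features!r}")
        # Assumption: round down, but never go below one feature.
        return max(1, int(value))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # A float is treated as a fraction of the available features.
        return int(max_features * n_features)
    raise TypeError("max_features must be int, float, str, or None")


def default_patch_max(image_dim):
    """Documented default for patch_height_max / patch_width_max."""
    return max(2, math.floor(math.sqrt(image_dim)))


print(resolve_max_features("auto", 16), default_patch_max(28))
```

For a 28 x 28 image, the documented formula gives a default maximum patch size of 5 in each dimension.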
Examples
>>> from rerf.rerfClassifier import rerfClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(
...     n_samples=1000,
...     n_features=4,
...     n_informative=2,
...     n_redundant=0,
...     random_state=0,
...     shuffle=False,
... )
>>> clf = rerfClassifier(n_estimators=100, max_depth=2, random_state=0)
>>> clf.fit(X, y)
starting tree 1
max depth: 2
avg leaf node depth: 1.9899
num leaf nodes: 396
rerfClassifier(feature_combinations=1.5, max_depth=2, max_features='auto',
    min_samples_split=1, n_estimators=100, n_jobs=None,
    projection_matrix='RerF', random_state=0)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
>>> print(clf.predict_proba([[0, 0, 0, 0]]))
[[0.2 0.8]]
Notes
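The inclusion rule for feature_combinations (each feature is included independently with probability feature_combinations / n_features) means that a combined feature uses feature_combinations input features on average. The sketch below simulates that rule with the standard library only. The name sample_feature_subset is hypothetical, and the sketch does not model how the library weights the selected features or treats an empty selection.

```python
import random


def sample_feature_subset(n_features, feature_combinations, rng):
    """Hypothetical helper: include each feature independently with
    probability feature_combinations / n_features."""
    p = feature_combinations / n_features
    return [j for j in range(n_features) if rng.random() < p]


rng = random.Random(0)
n_features, feature_combinations = 20, 1.5

# Draw many subsets and measure how many features each one contains.
sizes = [
    len(sample_feature_subset(n_features, feature_combinations, rng))
    for _ in range(20000)
]
mean_size = sum(sizes) / len(sizes)
print(f"average features per combination: {mean_size:.2f}")
```

The printed average should sit close to the feature_combinations value of 1.5, which is the expectation of a sum of n_features independent draws with that probability.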
fit(X, y)
Fit estimator.
Parameters:
- X (array-like, shape=(n_samples, n_features)) – Input data. Rows are observations and columns are features.
- y (array-like, 1D numpy array) – Labels.
Returns: self
Return type: object
predict(X)
Predict class for X.
Parameters: X (array_like of shape [n_samples, n_features]) – The input samples. If more than 1 row, run multiple predictions.
Returns: y – Returns the class of prediction (int) or predictions (list) depending on input parameters.
Return type: int, list of int
predict_proba(X)
Predict class probabilities for X. The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest.
Parameters: X (array_like of shape [n_samples, n_features]) – The input samples. If more than 1 row, run multiple predictions.
Returns: p – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type: array of shape = [n_samples, n_classes]
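As an illustration of the averaging described for predict_proba, the sketch below averages per-tree class probability vectors for one sample. The per-tree values are made up for the example, and mean_tree_probabilities is a hypothetical helper, not part of rerf.

```python
def mean_tree_probabilities(tree_probs):
    """Average per-tree class probability vectors for a single sample."""
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    return [
        sum(probs[c] for probs in tree_probs) / n_trees
        for c in range(n_classes)
    ]


# Made-up outputs of three trees for a two-class problem.
tree_probs = [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]]
forest_probs = mean_tree_probabilities(tree_probs)
print([round(p, 3) for p in forest_probs])
```

Each entry of the result is the forest-level probability for one class, in the same class order as the per-tree vectors.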
predict_log_proba(X)
Predict class log-probabilities for X. The predicted class log-probabilities of an input sample are computed as the log of the mean predicted class probabilities of the trees in the forest.
Parameters: X (array-like or sparse matrix of shape = [n_samples, n_features]) – The input samples. Internally, its dtype will be converted to dtype=np.float32.
Returns: p – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
Return type: array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1
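The relation between the two outputs can be sketched as an element-wise logarithm of the averaged probabilities. The helper name log_of_mean_probabilities is hypothetical, the natural logarithm is assumed (as in scikit-learn's predict_log_proba), and mapping a zero probability to negative infinity is a choice made for this sketch.

```python
import math


def log_of_mean_probabilities(mean_probs):
    """Element-wise natural log; a zero probability maps to -inf."""
    return [math.log(p) if p > 0 else float("-inf") for p in mean_probs]


log_probs = log_of_mean_probabilities([0.2, 0.8, 0.0])
print(log_probs)
```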
class rerf.urerf.UnsupervisedRandomForest(projection_matrix='RerF', n_estimators=100, max_depth=None, min_samples_split='auto', max_features='auto', feature_combinations='auto', n_jobs=None, random_state=None)
Unsupervised random(er) forest
Supports both Random Forest, developed by Breiman (2001) [1], and Randomer Forest, also called Random Projection Forests (RerF), developed by Tomita et al. (2016) [2].
The difference between the two algorithms is where the random linear combinations occur: Random Forest combines features at the tree level whereas RerF combines features at the node level.
In addition to the normal RandomForestClassifier parameters, there are two parameters to be aware of:
- projection_matrix
- feature_combinations
Parameters:
- projection_matrix (str, optional (default: "RerF")) – The random combination of features to use: either “RerF” or “Base”. “RerF” randomly combines features for each mtry. “Base” is our implementation of Random Forest. “S-RerF” is structured RerF, combining multiple features together in random patches. See Tomita et al. (2016) [2] for further details.
- n_estimators (int, optional (default: 100)) – Number of trees in forest.
- max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split (int or "auto", optional (default: "auto")) – The minimum splittable node size. A node with fewer than min_samples_split samples will be a leaf node. Note: other implementations call this min.parent or minParent.
  - If “auto”, then min_samples_split=sqrt(num_obs).
  - If int, then use min_samples_split as the minimum splittable node size.
- max_features (int, float, string, or None, optional (default="auto")) – The number of features or feature combinations to consider when looking for the best split. Note: also called mtry or d.
  - If int, then consider max_features features or feature combinations at each split.
  - If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
  - If “auto”, then max_features=sqrt(n_features).
  - If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
  - If “log2”, then max_features=log2(n_features).
  - If None, then max_features=n_features.
- feature_combinations (float, optional (default: "auto")) – Average number of features combined to form a new feature when using “RerF”. Otherwise, ignored.
  - If int or float, then feature_combinations is the average number of features to combine for each max_features to try.
  - If “auto”, then feature_combinations=n_features.
  - If “sqrt”, then feature_combinations=sqrt(n_features) (same as “auto”).
  - If “log2”, then feature_combinations=log2(n_features).
  - If None, then feature_combinations=n_features.
- n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel for both fit and predict. None means 1. -1 means use all processors.
- random_state (int or None, optional (default=None)) – Random seed to use. If None, set seed to np.random.randint(1, 1000000).
Examples
>>> from matplotlib import pyplot as plt
>>> from sklearn.cluster import AgglomerativeClustering
>>> from sklearn.datasets import make_classification
>>> from sklearn.metrics import adjusted_rand_score
>>> from rerf.urerf import UnsupervisedRandomForest
>>> X, y = make_classification(
...     n_samples=1000,
...     n_features=4,
...     n_informative=2,
...     n_redundant=0,
...     random_state=0,
...     shuffle=False,
... )
>>> clf = UnsupervisedRandomForest(n_estimators=100, random_state=0)
>>> clf.fit(X)
>>> sim_mat = clf.transform()
>>> plt.imshow(sim_mat)
>>> cluster = AgglomerativeClustering(n_clusters=2)
>>> predict_labels = cluster.fit_predict(sim_mat)
>>> score = adjusted_rand_score(y, predict_labels)
>>> print(score)
0.7601439767776818
Notes
rerf.RerF.fastRerF(X=None, Y=None, CSVFile=None, Ycolumn=None, forestType='binnedBaseRerF', trees=500, minParent=1, maxDepth=None, numCores=1, mtry=None, mtryMult=1.5, fractionOfFeaturesToTest=None, seed=None, imageHeight=0, imageWidth=0, patchHeightMax=0, patchHeightMin=0, patchWidthMax=0, patchWidthMin=0)
Creates a decision forest based on an input matrix and class vector and grows the forest.
Parameters:
- X (2D numpy array, optional) – Input data. Rows are observations and columns are features.
- Y (list, 1D numpy array, optional) – Labels
- CSVFile (str, optional) – training CSV filename
- Ycolumn (int, optional) – column in data with labels
- forestType (str, optional) – the type of forest: binnedBase, binnedBaseRerF, binnedBaseTern, S-RerF (structured for 2-d images), rfBase, rerf (default: “binnedBaseRerF”)
- trees (int, optional) – Number of trees in forest (default: 500)
- minParent (int, optional) – The minimum splittable node size (default: 1)
- maxDepth (int, optional) – maxDepth (default: None). If None, set to max system supported value
- numCores (int, optional) – Number of cores to use (default: 1).
- mtry (int, optional) – d, the number of features to consider when splitting a node (default: None). If None, sets to sqrt(numFeatures).
- mtryMult (double, optional) – Average number of features combined to form a new feature when using RerF (default: 1.5)
- fractionOfFeaturesToTest (float, optional) – Sets mtry based on a fraction of the features instead of an exact number (default: None).
- seed (int, optional) – Random seed to use (default: None). If None, set seed to np.random.randint(1, 1000000).
Returns: forest – forest class object
Return type: pyfp.fpForest
Examples
>>> from multiprocessing import cpu_count
>>> from sklearn.datasets import make_classification
>>> from rerf.RerF import fastRerF
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> forest = fastRerF(
...     X=X,
...     Y=y,
...     forestType="binnedBaseRerF",
...     trees=500,
...     numCores=cpu_count(),
... )
rerf.RerF.fastPredict(X, forest)
Predict class for X.
The predicted class of an input sample is the majority vote by the trees in the forest where each vote is the majority class of each tree’s leaf node.
Parameters:
- X (array_like) – Numpy ndarray of data. If more than 1 row, run multiple predictions.
- forest (pyfp.fpForest) – Forest to run predictions on
Returns: predictions – Returns the class of prediction (int) or predictions (list) depending on input parameters.
Return type: int, list of int
Examples
>>> fastPredict([0, 1, 2, 3], forest)
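The voting rule described for fastPredict can be sketched independently of the library. In this illustration each tree's vote (the majority class of the leaf the sample reaches) is given directly as a made-up list. The name majority_vote is hypothetical, and breaking ties in favour of the smallest class label is an assumption, since the reference does not describe tie handling.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the class label that received the most tree votes."""
    counts = Counter(tree_votes)
    best = max(counts.values())
    # Assumption: ties are broken by taking the smallest class label.
    return min(label for label, n in counts.items() if n == best)


# Made-up votes from five trees for one sample.
predicted = majority_vote([1, 0, 1, 1, 0])
print(predicted)
```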
rerf.RerF.fastPredictPost(X, forest)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the normalized votes of each tree in the forest.
Parameters:
- X (array_like) – Numpy ndarray of data. If more than 1 row, run multiple predictions.
- forest (pyfp.fpForest) – Forest to run predictions on
Returns: posterior_probabilities – Returns the class probabilities for a single observation (list) or numpy array of class probabilities for each observation depending on input parameters.
Return type: list of ints, shape = [n_classes] or array, shape = [n_samples, n_classes]
Examples
>>> fastPredictPost([0, 1, 2, 3], forest)
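The normalized votes described for fastPredictPost amount to counting each tree's vote per class and dividing by the number of trees. The sketch below shows that arithmetic on made-up votes. The name normalized_votes is hypothetical, and class labels are assumed to be the integers 0 to n_classes - 1.

```python
def normalized_votes(tree_votes, n_classes):
    """Fraction of trees voting for each class (labels 0..n_classes-1)."""
    counts = [0] * n_classes
    for label in tree_votes:
        counts[label] += 1
    return [c / len(tree_votes) for c in counts]


# Made-up votes from five trees for one sample in a two-class problem.
posterior = normalized_votes([1, 0, 1, 1, 1], n_classes=2)
print(posterior)
```

The returned fractions sum to 1 because every tree casts exactly one vote.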