calculate_nearest_neighbors_distances

calculate_nearest_neighbors_distances(data: DataFrame, cat_cols: List[Hashable], numeric_cols: List[Hashable], num_neighbors: int, samples_to_calc_neighbors_for: Optional[DataFrame] = None)

Calculate distance matrix for a dataset using Gower’s method.

Gower’s distance measures the distance between two samples as the average of their per-feature distances. For numeric features, the per-feature distance is the absolute difference divided by the range of the feature. For categorical features, it is an indicator of whether the values differ (0 if they are the same, 1 otherwise). See https://www.jstor.org/stable/2528823 for further details. This method minimizes memory usage by keeping in memory and returning only the closest neighbors of each sample. It can also handle missing values.
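The per-feature rules described above can be illustrated with a minimal sketch for a single pair of samples. This is not the library's implementation: the name `gower_distance`, the `ranges` argument, and the choice to skip features that are missing in either sample are all assumptions made for illustration.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def gower_distance(a, b, cat_cols, numeric_cols, ranges):
    """Average per-feature distance between two dict-like samples.

    `ranges` maps each numeric column to (max - min) over the dataset.
    Features missing in either sample are skipped (assumed behavior).
    """
    total, counted = 0.0, 0
    for col in numeric_cols:
        x, y = a[col], b[col]
        if _is_missing(x) or _is_missing(y):
            continue
        # Absolute difference scaled by the feature range; a constant
        # feature (range 0) contributes a distance of 0.
        total += abs(x - y) / ranges[col] if ranges[col] else 0.0
        counted += 1
    for col in cat_cols:
        x, y = a[col], b[col]
        if _is_missing(x) or _is_missing(y):
            continue
        # Indicator: 0 when the categories match, 1 when they differ.
        total += 0.0 if x == y else 1.0
        counted += 1
    return total / counted if counted else math.nan


sample_a = {"age": 20, "color": "red"}
sample_b = {"age": 30, "color": "blue"}
example = gower_distance(sample_a, sample_b, ["color"], ["age"], {"age": 40})
# Numeric part: |20 - 30| / 40 = 0.25; categorical part: 1.0; mean = 0.625
```

With a missing value on one side, the sketch averages over the remaining features only; how the actual function treats missing values is not specified here.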

Parameters
data: pd.DataFrame

DataFrame including all

cat_cols: List[Hashable]

List of categorical columns in the data.

numeric_cols: List[Hashable]

List of numerical columns in the data.

num_neighbors: int

Number of neighbors to return. For example, with num_neighbors=2, the function returns, for each sample, the distances to the two closest samples in the dataset.

samples_to_calc_neighbors_for: pd.DataFrame, default None

Samples for which to calculate nearest neighbors. If None, neighbors are calculated for every sample in data. These samples do not have to exist in data, but they must contain all the relevant feature columns.

Returns
numpy.ndarray

Array representing the distances from each sample to its nearest neighbors.

numpy.ndarray

Array representing the indexes of the nearest neighbors of each sample.
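To show the shape of the two returned values, here is a brute-force sketch in plain Python that keeps only the closest neighbors per sample rather than a full distance matrix. It is an illustration under assumptions, not the function documented here: `nearest_neighbors` and `pair_distance` are hypothetical helpers, rows are plain dicts instead of a DataFrame, results are nested lists instead of numpy arrays, missing values are not handled, and each sample is allowed to be its own nearest neighbor.

```python
import heapq


def pair_distance(a, b, cat_cols, numeric_cols, ranges):
    """Gower-style distance between two complete (no missing values) samples."""
    parts = [abs(a[c] - b[c]) / ranges[c] if ranges[c] else 0.0
             for c in numeric_cols]
    parts += [0.0 if a[c] == b[c] else 1.0 for c in cat_cols]
    return sum(parts) / len(parts)


def nearest_neighbors(data, cat_cols, numeric_cols, num_neighbors, queries=None):
    """Return (distances, indexes) of the closest rows in `data` per query."""
    queries = data if queries is None else queries
    # Numeric ranges are taken from `data` only (assumption).
    ranges = {c: max(r[c] for r in data) - min(r[c] for r in data)
              for c in numeric_cols}
    distances, indexes = [], []
    for query in queries:
        scored = ((pair_distance(query, row, cat_cols, numeric_cols, ranges), i)
                  for i, row in enumerate(data))
        # Only the num_neighbors smallest distances are kept per query.
        closest = heapq.nsmallest(num_neighbors, scored)
        distances.append([d for d, _ in closest])
        indexes.append([i for _, i in closest])
    return distances, indexes


rows = [
    {"age": 20, "color": "red"},
    {"age": 22, "color": "red"},
    {"age": 60, "color": "blue"},
]
dist, idx = nearest_neighbors(rows, ["color"], ["age"], num_neighbors=2)
```

In this sketch both results have one row per query sample and `num_neighbors` entries per row, with `idx[i][j]` giving the position in `rows` of the j-th closest sample to query `i`.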