calculate_nearest_neighbors_distances

calculate_nearest_neighbors_distances(data: DataFrame, cat_cols: List[Hashable], numeric_cols: List[Hashable], num_neighbors: int, samples_to_calc_neighbors_for: Optional[DataFrame] = None)

Calculate distance matrix for a dataset using Gower’s method.

Gower's distance measures the dissimilarity between two samples as the average of their per-feature distances. For a numeric feature, the distance is the absolute difference divided by the range of the feature. For a categorical feature, it is an indicator of whether the values differ (0 if they are the same, 1 otherwise). See https://www.jstor.org/stable/2528823 for further details. This method minimizes memory usage by keeping in memory and returning only the closest neighbors of each sample. It can also handle missing values.
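The per-feature averaging described above can be sketched in plain Python for a single pair of samples. This is a minimal illustration, not the library's implementation: the helper name `gower_distance`, the `ranges` argument, and the convention of skipping a feature when either value is missing are all assumptions made for the example.

```python
import math


def gower_distance(a, b, cat_cols, numeric_cols, ranges):
    """Average per-feature distance between two samples given as dicts.

    `ranges` maps each numeric column to (max - min) over the dataset.
    Assumption: a feature missing (None or NaN) in either sample is skipped.
    """
    total, counted = 0.0, 0

    for col in cat_cols:
        if a[col] is None or b[col] is None:
            continue
        # Categorical: 0 if the values match, 1 otherwise.
        total += 0.0 if a[col] == b[col] else 1.0
        counted += 1

    for col in numeric_cols:
        x, y = a[col], b[col]
        if x is None or y is None or math.isnan(x) or math.isnan(y):
            continue
        # Numeric: absolute difference scaled by the feature's range.
        # A constant feature (range 0) contributes a distance of 0.
        if ranges[col] > 0:
            total += abs(x - y) / ranges[col]
        counted += 1

    return total / counted if counted else float("nan")


sample_a = {"color": "red", "size": 10.0, "weight": 2.0}
sample_b = {"color": "blue", "size": 20.0, "weight": 2.0}
ranges = {"size": 40.0, "weight": 5.0}

# color differs (1), size differs by 10/40 (0.25), weight is equal (0).
d_full = gower_distance(sample_a, sample_b, ["color"], ["size", "weight"], ranges)

# With weight missing in one sample, only color and size are averaged.
sample_c = dict(sample_b, weight=None)
d_missing = gower_distance(sample_a, sample_c, ["color"], ["size", "weight"], ranges)
```

Here `d_full` is (1 + 0.25 + 0) / 3 and `d_missing` is (1 + 0.25) / 2. Averaging only over the features present in both samples is one common way to handle missing values; the library may treat them differently.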

Parameters

data: pd.DataFrame: DataFrame including all
cat_cols: List[Hashable]: List of categorical columns in the data.
numeric_cols: List[Hashable]: List of numerical columns in the data.
num_neighbors: int: Number of neighbors to return. For example, with num_neighbors=2 the function returns, for each sample, the distances to its two closest samples in the dataset.
samples_to_calc_neighbors_for: pd.DataFrame, default None: Samples for which to calculate nearest neighbors. If None, calculates for all given samples in data. These samples do not have to exist in data, but must share all relevant features.

Returns

numpy.ndarray: The distances from each sample to its nearest neighbors.
numpy.ndarray: The indexes of those nearest neighbors.
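The idea of keeping only the closest neighbors of each sample, rather than a full pairwise distance matrix, can be sketched as follows. This is a hedged illustration in plain Python, not the library's code: the helper `nearest_neighbors`, its `distance` callback, and the use of nested lists in place of numpy arrays are assumptions made for the example.

```python
import heapq


def nearest_neighbors(data, queries, num_neighbors, distance):
    """Return (distances, indexes) of the closest rows in `data` per query.

    Only `num_neighbors` candidates are kept per query, so the full
    len(queries) x len(data) distance matrix is never stored.
    """
    all_distances, all_indexes = [], []
    for query in queries:
        # Stream (distance, index) pairs and keep the smallest ones.
        closest = heapq.nsmallest(
            num_neighbors,
            ((distance(query, row), i) for i, row in enumerate(data)),
        )
        all_distances.append([d for d, _ in closest])
        all_indexes.append([i for _, i in closest])
    return all_distances, all_indexes


# Toy one-dimensional data with absolute difference as the distance.
data = [0.0, 1.0, 5.0, 6.0]
distances, indexes = nearest_neighbors(data, data, 2, lambda a, b: abs(a - b))
```

In this sketch a query that is also a row of `data` finds itself first, at distance 0. Whether the library function includes the sample itself among its neighbors is not stated here, so check the returned indexes before relying on either behavior.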
