calculate_nearest_neighbors_distances
- calculate_nearest_neighbors_distances(data: DataFrame, cat_cols: List[Hashable], numeric_cols: List[Hashable], num_neighbors: int, samples_to_calc_neighbors_for: Optional[DataFrame] = None)
Calculate distance matrix for a dataset using Gower’s method.
Gower's distance measures the distance between two samples as the average of their per-feature distances. For numeric features, the per-feature distance is the absolute difference divided by the range of the feature. For categorical features, it is 0 if the two values are equal and 1 otherwise. See https://www.jstor.org/stable/2528823 for further details. This method minimizes memory usage by keeping in memory and returning only the closest neighbors of each sample. In addition, it can deal with missing values.
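The per-feature averaging described above can be illustrated with a minimal sketch for a single pair of samples. The helper name `gower_distance_sketch` and the `ranges` argument are hypothetical and not part of this API. Skipping any feature where either value is missing is an assumption made for illustration; the library may treat missing values differently.

```python
import math

import pandas as pd


def gower_distance_sketch(a, b, cat_cols, numeric_cols, ranges):
    """Illustrative Gower distance between two samples (not the library's code)."""
    total, counted = 0.0, 0
    for col in cat_cols:
        if pd.isna(a[col]) or pd.isna(b[col]):
            continue  # assumption: skip features with a missing value
        # Categorical feature: 0 if the values match, 1 otherwise.
        total += 0.0 if a[col] == b[col] else 1.0
        counted += 1
    for col in numeric_cols:
        if pd.isna(a[col]) or pd.isna(b[col]):
            continue  # assumption: skip features with a missing value
        # Numeric feature: absolute difference divided by the feature range.
        if ranges[col] > 0:
            total += abs(a[col] - b[col]) / ranges[col]
        counted += 1
    # Average over the features that were actually compared.
    return total / counted if counted else math.nan


a = pd.Series({"color": "red", "size": 1.0})
b = pd.Series({"color": "blue", "size": 3.0})
c = pd.Series({"color": None, "size": 2.0})
ranges = {"size": 4.0}  # hypothetical range of the "size" feature

d_ab = gower_distance_sketch(a, b, ["color"], ["size"], ranges)
d_ac = gower_distance_sketch(a, c, ["color"], ["size"], ranges)
```

Here `d_ab` averages a categorical mismatch (1) with a numeric distance of 2 / 4, while `d_ac` uses only the numeric feature because `color` is missing in `c`.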
- Parameters
- data: pd.DataFrame
DataFrame including all
- cat_cols: List[Hashable]
List of categorical columns in the data.
- numeric_cols: List[Hashable]
List of numerical columns in the data.
- num_neighbors: int
Number of neighbors to return. For example, with num_neighbors=2 the function returns, for each sample, the distances to its two closest samples in the dataset.
- samples_to_calc_neighbors_for: pd.DataFrame, default None
Samples for which to calculate nearest neighbors. If None, neighbors are calculated for all samples in data. These samples do not have to exist in data, but they must share all relevant features.
- Returns
- numpy.ndarray
representing the distance matrix to the nearest neighbors.
- numpy.ndarray
representing the indexes of the nearest neighbors.
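To make the shape of the two returned arrays concrete, here is a brute-force sketch of a nearest-neighbor search under Gower's distance. It is not the library's implementation: the name `nearest_neighbors_sketch` is hypothetical, the data is assumed to contain no missing values, and each sample is assumed to count as a candidate neighbor of itself, which the documentation above does not state.

```python
import numpy as np
import pandas as pd


def nearest_neighbors_sketch(data, cat_cols, numeric_cols, num_neighbors):
    """Brute-force k-nearest neighbors under Gower's distance (illustration only)."""
    num = data[numeric_cols].to_numpy(dtype=float)
    cat = data[cat_cols].to_numpy(dtype=object)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns
    n_features = len(cat_cols) + len(numeric_cols)

    dists = np.empty((len(data), num_neighbors))
    idxs = np.empty((len(data), num_neighbors), dtype=int)
    for i in range(len(data)):
        # Only one row of the full distance matrix is held at a time.
        num_part = (np.abs(num - num[i]) / ranges).sum(axis=1)
        cat_part = (cat != cat[i]).sum(axis=1)
        row = (num_part + cat_part) / n_features
        order = np.argsort(row, kind="stable")[:num_neighbors]
        dists[i], idxs[i] = row[order], order
    return dists, idxs


data = pd.DataFrame({"color": ["red", "red", "blue"], "size": [1.0, 2.0, 5.0]})
dists, idxs = nearest_neighbors_sketch(data, ["color"], ["size"], num_neighbors=2)
```

Both arrays have one row per sample and `num_neighbors` columns. Because this sketch treats a sample as its own neighbor, the first column of `dists` is 0 for every row.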