# Machine learning notes: implementing KNN with pandas and scikit-learn

2022-01-26 22:53:37

## 1. K-nearest neighbor algorithm

``````
If most of the K samples most similar to a given sample in the feature space belong to a certain category, then the sample also belongs to that category.
``````
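This idea is easiest to see on a toy example. A minimal sketch using scikit-learn's `KNeighborsClassifier` (the data points here are made up for illustration):

``````python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters: label 0 near the origin, label 1 near (5, 5)
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Classify each query point by majority vote among its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
``````

Each query point takes the majority label of its K = 3 closest training samples, which is exactly the rule stated above.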

## 2. KNN algorithm flow

``````
1. Compute the distance between each point in the known categories and the current point
2. Sort the points by increasing distance
3. Select the K points nearest to the current point
4. Count the frequency of each category among these K points
5. Return the most frequent category among the K points as the predicted category of the current point
``````
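The five steps above can be written out directly. A sketch from scratch with NumPy (function name and toy data are my own, not from the original post):

``````python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # 1. compute the distance from every known point to the current point
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2./3. sort by increasing distance and keep the k nearest points
    nearest = np.argsort(dists)[:k]
    # 4./5. return the most frequent category among those k points
    return int(Counter(y_train[nearest].tolist()).most_common(1)[0][0])

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.3])))  # -> 0
``````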

## 3. Machine learning workflow

``````
1. Get the data
2. Basic data processing
3. Feature engineering
4. Machine learning (model training)
5. Model evaluation
``````
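The same five steps apply to any small classification task. A minimal end-to-end sketch using scikit-learn's bundled iris data (standing in for the check-in data used later):

``````python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. get the data
X, y = load_iris(return_X_y=True)
# 2. basic data processing: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, test_size=0.25)
# 3. feature engineering: standardization (fit on train, apply to test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# 4. machine learning
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# 5. model evaluation
acc = model.score(X_test, y_test)
print(acc)
``````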

## 4. Code implementation

``````python
# -*- coding: utf-8 -*-
# @Author : Dongze Xu
# @Time   : 2021/12/15 15:47

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
``````
``````python
'''
# 1. Get the data set
# 2. Basic data processing
#   2.1 Narrow the data range
#   2.2 Extract time features
#   2.3 Remove places with few check-ins
#   2.4 Determine feature values and target values
#   2.5 Split the data set
# 3. Feature engineering -- feature preprocessing (standardization)
# 4. Machine learning -- KNN + CV
# 5. Model evaluation
'''

# 1. Get the data set (file path is an assumption; not given in the original)
data = pd.read_csv("./data/train.csv")

# 2. Basic data processing
# 2.1 Narrow the data range
# select x in (2, 2.5) and y in (2, 2.5)
partial_data = data.query("x > 2.0 & x < 2.5 & y > 2.0 & y < 2.5")
# 2.2 Extract time features
# convert the timestamp (seconds) to datetime
time = pd.to_datetime(partial_data["time"], unit="s")
time = pd.DatetimeIndex(time)
partial_data["hour"] = time.hour
partial_data["day"] = time.day
partial_data["weekday"] = time.weekday
# 2.3 Remove places with few check-ins (the count threshold of 3 is an assumption)
place_count = partial_data.groupby("place_id").count()
partial_data = partial_data[partial_data["place_id"].isin(
    place_count[place_count["row_id"] > 3].index)]
# 2.4 Determine feature values and target values
x = partial_data[["x", "y", "accuracy", "hour", "day", "weekday"]]
y = partial_data["place_id"]
# 2.5 Split the data set
# random_state: random seed; test_size: proportion held out for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.25)
# 3. Feature engineering -- feature preprocessing (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# reuse the scaler fitted on the training set; do not refit on the test set
x_test = transfer.transform(x_test)

# 4. Machine learning -- KNN + CV
# 4.1 Instantiate an estimator
estimator = KNeighborsClassifier()
# 4.2 Cross-validation via grid search
param_grid = {"n_neighbors": [3, 5, 7, 9]}
estimator = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=10, n_jobs=4)

# 4.3 Model training
estimator.fit(x_train, y_train)

# 5. Model evaluation
# 5.1 Accuracy on the test set
score_ret = estimator.score(x_test, y_test)
# 5.2 Predictions
y_pre = estimator.predict(x_test)
# 5.3 Grid-search results
print("The best model is:\n", estimator.best_estimator_)
print("The best cross-validation score is:\n", estimator.best_score_)
print("All the results are:\n", estimator.cv_results_)
``````