
Machine learning notes: implementing the KNN algorithm with pandas and scikit-learn

2022-01-26 22:53:37 xMathematics

1. K-nearest neighbor (KNN) algorithm:

If most of the K samples most similar to a given sample in the feature space (its nearest neighbors) belong to a certain category, then that sample also belongs to this category.

2. KNN algorithm flow:

1. Calculate the distance between each point in the known-category data and the current point
2. Sort the points by increasing distance
3. Select the K points nearest to the current point
4. Count the frequency of each category among these K points
5. Return the most frequent category among these K points as the predicted classification of the current point
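For intuition, the five steps above can be written as a minimal from-scratch sketch. The function name knn_predict, the use of Euclidean distance, and the default k=5 are illustrative assumptions and are not part of the scikit-learn code later in this post:

import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=5):
    # x_train: (n, d) array of known points, y_train: (n,) array of their categories,
    # x_new: (d,) array for the current point
    # 1. Calculate the distance from every known point to the current point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # 2 & 3. Sort by increasing distance and keep the k nearest points
    nearest_idx = np.argsort(distances)[:k]
    # 4. Count how often each category appears among these k points
    votes = Counter(y_train[nearest_idx])
    # 5. Return the most frequent category as the prediction for the current point
    return votes.most_common(1)[0][0]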

3. Machine learning workflow:

1. Get the data
2. Basic data processing
3. Feature engineering
4. Machine learning (model training)
5. Model evaluation

4. Code implementation:

# -*- coding: utf-8 -*-
"""
@Author  : Dongze Xu
@Time    : 2021/12/15 15:47
@Function:
"""
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
'''
# 1. Get the data set
# 2. Basic data processing
#    2.1 Narrow the data range
#    2.2 Select time features
#    2.3 Drop places with few check-ins
#    2.4 Determine feature values and target values
#    2.5 Split the data set
# 3. Feature engineering -- feature preprocessing (standardization)
# 4. Machine learning -- KNN + CV
# 5. Model evaluation
'''

# 1. Get the data set
data = pd.read_csv("./data/FBlocation/train.csv")
# 2. Basic data processing
# 2.1 Narrow the data range
# select x ∈ (2.0, 2.5) and y ∈ (2.0, 2.5)
partial_data = data.query("x > 2.0 & x < 2.5 & y > 2.0 & y < 2.5").copy()  # copy so new feature columns can be added
# 2.2 Select time features
# Convert the Unix timestamp to a datetime index
time = pd.to_datetime(partial_data["time"], unit="s")
time = pd.DatetimeIndex(time)
# Derive the hour/day/weekday features used below
partial_data["hour"] = time.hour
partial_data["day"] = time.day
partial_data["weekday"] = time.weekday
# 2.3 Drop places with few check-ins (the threshold of more than 3 check-ins is an assumption)
place_count = partial_data.groupby("place_id").count()["row_id"]
partial_data = partial_data[partial_data["place_id"].isin(place_count[place_count > 3].index)]
# 2.4 Determine feature values and target values
x = partial_data[["x", "y", "accuracy", "hour", "day", "weekday"]]
y = partial_data["place_id"]
# 2.5 Split the data set
# random_state: random seed; test_size: fraction of the data held out for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.25)
# 3. Feature engineering -- feature preprocessing (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # reuse the scaler fitted on the training set; do not refit on the test data

# 4. Machine learning -- KNN + cross-validated grid search
# 4.1 Instantiate the estimator
estimator = KNeighborsClassifier()
# 4.2 Cross-validated grid search over n_neighbors
param_grid = {"n_neighbors": [3, 5, 7, 9]}
estimator = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=10, n_jobs=4)

# 4.3 Model training
estimator.fit(x_train, y_train)
# 5. Model evaluation
# 5.1 Accuracy on the test set
score_ret = estimator.score(x_test, y_test)
print("Accuracy on the test set:\n", score_ret)
# 5.2 Predictions for the test set
y_pre = estimator.predict(x_test)
print("Predicted values:\n", y_pre)
# 5.3 Other outputs from the grid search
print("The best model is:\n", estimator.best_estimator_)
print("The best cross-validation score is:\n", estimator.best_score_)
print("All cross-validation results are:\n", estimator.cv_results_)

Copyright notice
Author: xMathematics. Please include the original link when reprinting, thank you.
https://en.cdmana.com/2022/01/202201262253346670.html
