
Machine learning -- spectral clustering

2022-01-26 23:29:36 Nuggets user 007

This is day 16 of my participation in the August More-Text Challenge.

Introduction to spectral clustering

Spectral clustering is an algorithm that evolved from graph theory and has since been widely used for clustering. Its main idea is to treat all the data as points in space that can be connected by edges. The edge between two points that are far apart gets a low weight, while the edge between two nearby points gets a high weight. The graph formed by all the data points is then cut so that the sum of the edge weights between different subgraphs is as low as possible, while the sum of the edge weights within each subgraph is as high as possible, thereby achieving the goal of clustering.

Spectral clustering principle

The spectral clustering algorithm is easy to use but not so easy to understand in principle. It can be summarized in the following steps:
Input: sample set D = (x1, x2, ..., xn), a method for generating the similarity matrix, the dimension after dimensionality reduction k1, a clustering method, and the number of clusters k2.
Output: cluster partition C(c1, c2, ..., ck2).

1. Construct the similarity matrix S of the samples according to the input generation method.
2. Build the adjacency matrix W from the similarity matrix S, and construct the degree matrix D.
3. Compute the Laplacian matrix L = D - W.
4. Construct the normalized Laplacian matrix D^{-1/2} L D^{-1/2}.
5. Compute the eigenvectors f corresponding to the k1 smallest eigenvalues of D^{-1/2} L D^{-1/2}.
6. Normalize the matrix of these eigenvectors f by row to form the n × k1 feature matrix F.
7. Treat each row of F as a k1-dimensional sample (n samples in all) and cluster them with the input clustering method into k2 clusters.
8. Obtain the cluster partition C(c1, c2, ..., ck2).
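The steps above can be sketched with NumPy and scikit-learn. This is a minimal illustration, not a production implementation: the toy data, the RBF kernel used to generate the similarity matrix, and the choices gamma = 1.0 and k1 = k2 = 2 are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs of 20 points each
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Steps 1-2: similarity/adjacency matrix W and degree matrix D
W = rbf_kernel(X, gamma=1.0)
D = np.diag(W.sum(axis=1))

# Steps 3-4: normalized Laplacian D^{-1/2} L D^{-1/2} with L = D - W
d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L_sym = d_inv_sqrt @ (D - W) @ d_inv_sqrt

# Step 5: eigenvectors of the k1 smallest eigenvalues (eigh sorts ascending)
k1 = 2
eigvals, eigvecs = np.linalg.eigh(L_sym)
F = eigvecs[:, :k1]

# Step 6: row-normalize to get the n x k1 feature matrix F
F = F / np.linalg.norm(F, axis=1, keepdims=True)

# Steps 7-8: treat each row of F as a sample and cluster with k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(F)
```

On well-separated data like this, the rows of F for points in the same component line up in the same direction, so k-means on F recovers the two blobs.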
The detailed theory behind spectral clustering requires some background in graph theory and is not covered here. For a thorough explanation, see lecture 5 of Stanford's CS224W course; the course slides are at [Reference link 2], and [Reference link 3] can also help in understanding them.

Summary of the spectral clustering algorithm

The main advantages of spectral clustering

  1. Spectral clustering only needs the similarity matrix between the data points, so it is very effective for clustering sparse data. This is hard for traditional clustering algorithms such as K-Means.
  2. Thanks to its use of dimensionality reduction, its complexity when clustering high-dimensional data is better than that of traditional clustering algorithms.

The main disadvantages of spectral clustering

  1. If the dimension of the final clustering is very high, the dimensionality reduction may not be sufficient, and both the running speed and the final clustering quality of spectral clustering can be poor.
  2. The clustering result depends on the similarity matrix; different similarity matrices can produce very different final results.

The implementation of spectral clustering in sklearn

sklearn provides an implementation of the spectral clustering algorithm:

sklearn.cluster.SpectralClustering(n_clusters=8, *, eigen_solver=None, n_components=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, verbose=False)

Parameters:

'n_clusters' is the dimension of the projection subspace, i.e. the number of clusters; it corresponds to k in the principle above.
'eigen_solver' is the eigenvalue decomposition strategy to use. The options are 'arpack', 'lobpcg', and 'amg'. 'amg' requires pyamg to be installed; it can be faster on very large sparse problems, but may also be unstable. If None, 'arpack' is used.
'random_state' is a pseudo-random number generator, used for the eigenvector decomposition and in K-Means.
'gamma' is the kernel coefficient for the rbf, poly, sigmoid, laplacian, and chi2 kernels.
Only some of the parameters are covered here. For the full list, see the official documentation at [Reference link 4]; [Reference link 5] can also help in understanding the meaning and rationale of the parameters.
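A small usage sketch of sklearn.cluster.SpectralClustering follows. The toy data and the parameter values (gamma = 1.0, two clusters) are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs of 20 points each
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# RBF affinity builds the similarity matrix; k-means assigns the final labels
sc = SpectralClustering(n_clusters=2, affinity='rbf', gamma=1.0,
                        assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(X)
```

With 'affinity' left at its default 'rbf', the model builds the similarity matrix internally from the raw samples; with affinity='precomputed' you can instead pass your own similarity matrix to fit_predict.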

Copyright notice
Author: [Nuggets user 007]. Please include a link to the original when reprinting, thank you.
