Kmeans聚类算法_kmeans聚类算法的应用_kmeans聚类算法实现-数据分析

关键词：kmeans聚类算法优缺点、kmeans聚类算法matlab、kmeans聚类算法流程图、kmeans聚类算法的应用、kmeans聚类算法实现、kmeans聚类算法原理

聚类算法是这样的一种算法：给定一坨样本数据Sample，要求将样本Sample中相似的数据聚到一类。有了这个认识之后，就应该了解了聚类算法要干什么了吧。说白了，就是归类。

首先，我们需要考虑的是，如何衡量数据之间的相似程度？比如说，有一群说不同语言的人，我们一般是根据他们的方言来聚类的（当然，你也可以指定以身高来聚类）。这里，语言的相似性（或者身高）就成了我们衡量相似的量度了。在考虑存在海量数据，如微博上各种用户的关系网，如何根据用户的关注和被关注来聚类，给用户推荐他们感兴趣的用户？这就是聚类算法研究的内容之一了。

Kmeans就是这样的聚类算法中比较简单的算法，给定数据样本集Sample和应该划分的类数K，对样本数据Sample进行聚类，最终形成K个cluster，其相似的度量是某条数据i与中心点的“距离”（这里所说的距离，不止于二维）。如下图：

算法的过程是这样的：

1) :input Samples and it’s number N， numSample , k ;

2) :randomize select k sample from the Samples as the initialed clusters , set the maxCompute to m ,and the passNum to 0;

3) :for i=0 to N-1

a): for j=0 to K-1

i) : compute the distance between record i and cluster center j;

ii): add the record i to closest cluster;

4) :for j=0 to K-1

a) : reSelect center between the cluster j ,according given rule;

b) : clear all the older merber in the cluster;

5) :passNum++ ; if all the new center same to the older center or passNum >m ,terminate!else continue;

下面给出Kmeans运行的一个实例：

样本数据如下：

2.1 3.0 10.0
4.0 5.2 -1.0
5.1 1.5 2.3
10.5 12.6 10.8
12.1 10.9 11.0
4.2 5.3 -9.8
5.4 1.6 8.7
-1.0 -2.1 -0.9
0.5 0.3 0.4

需要分成三个聚类

用这个数据运行后，程序打印出划分结果：

—- 第1个聚类 —-
聚类中心：3.75,2.3,9.35
2.1,3,10
5.4,1.6,8.7

—- 第2个聚类 —-
聚类中心：4.1,5.25,-5.4
4,5.2,-1
4.2,5.3,-9.8

—- 第3个聚类 —-
聚类中心：1.53333,-0.1,0.6
5.1,1.5,2.3
-1,-2.1,-0.9
0.5,0.3,0.4

—- 第4个聚类 —-
聚类中心：11.3,11.75,10.9
10.5,12.6,10.8
12.1,10.9,11

至此，我们得到了三个相似性很高的聚类

下面，就是c++的粗略实现了：

//the single sample
class Sample{
public:
unsigned int r, g ,b;
Sample(){};
Sample(unsigned int x, unsigned int y , unsigned int z){
r=x ; g= y; b= z;
}
//compute the distance of two sample
int distance(Sample other){
return ((r-other.r)*(r-other.r) +
(g-other.g) * (g-other.g) +
(b-other.b) * (b-other.b));
}
};

//cluster class

class Cluster{
private :
Sample center; //the center of cluster
int iCenter; //the sub of center
int number; //number of the merber
int *merber;// index of the merber, record the sub
public :
Cluster(){merber = NULL; number = 0; iCenter = -1;}
Cluster(int n , Sample c , int *m , int i)
:number(n), center(c) , merber(m) , iCenter(i)
{

}
Sample getCenter(){
return center;
}

void setCenterNumber(int i){
iCenter = i;
}
int getCenterNumber(){
return iCenter;
}

void setCenter(Sample c){
center = c;
}
void setNumber(int n){
number = n;
}
int getMerberNumber (){
return number;
}
void addMerber(int n){
merber[number++] = n;
}
int getNewCenter(){
int newSub=0;
for(int i=0 ; i<number ; i++){
newSub+=merber[i];
}
return newSub/number;
}
};

//Kmeans class

#include <time.h>
#include <stdio>
#include “Sample.h”
#include “Cluster.h”
class Kmeans{
private:
int num , clusterNum ;
Cluster * clusters;
Sample * samples;
public :
Kmeans(int n, int d, int k, Sample * ob)
: num(n)
, clusterNum(k)
, samples(ob)
, clusters(new Cluster[k])
{
//assert(n>1 && k<n) ;
}
Kmeans(){};
~Kmeans(){delete []clusters ; delete []samples ; }

//initial the clusters
void initialCluster(){
int temp = num/clusterNum;
for(int i=0;i<clusterNum ; i++){
clusters[i].setCenter(samples[(temp*i)%num]) ;
clusters[i].setNumber(0);
}
}

bool run(){
bool converged = false; // signing converged
int passNum = 0;
while (!converged && passNum < 999)   //if not converged , try again
  // the compute times passNum should under 999
{
  for(int i=0 ; i<clusterNum ; i++){
   //converged = ( clusters[i].getCenter()) == clusters[clusters[i].getNewCenter()].getCenter();
   int temp = clusters[i].getCenterNumber();
   converged = clusters[i].getNewCenter()!=temp ;
   clusters[i].setCenter(samples[temp]);
  }
  distribute();                     //distribute the samples to the closest cluster
  //converged = (); //
  passNum++;
}
}

void distribute()
{
// clear all the older cluster merber , and reCluster
for(int k=0; k<clusterNum; k++)
clusters[k].setNumber(0);
// compute the new closest cluster
for(int i=0; i<num; i++){
clusters[getClosestCluster(i)].addMerber(i);
}
}

//get closest cluster,return the sub
int getClosestCluster(int i){
int iShortestDistance=0 , sub=0;
for(int j=0;j<clusterNum ; j++){
  int temp = samples[i].distance(clusters[j].getCenter());
  if(iShortestDistance< temp){
   iShortestDistance = temp;
   sub=j;
  }
}
return sub;
}
};

转载请注明：数据分析 » Kmeans聚类算法_kmeans聚类算法的应用_kmeans聚类算法实现