
Evaluation of k-means and hierarchical clustering without training and testing datasets?

Posted by: **DRam**

Date: November 16, 2017 12:33AM

Hi,

I’m a data mining research scholar. I have implemented k-means and hierarchical clustering in Java, without training and testing data. The program just computes measures to group similar word combinations. My input data consists of three-word combinations (e.g., new game changer, biggest game changer, new game version, ...), nearly 2000 items, extracted using an information retrieval approach. Now the problem is that I don’t know how to evaluate the results. Generally, we can use a human expert evaluation procedure, but I’m not sure whether it is used for clustering.

In particular, I don’t know how to evaluate the hierarchical clustering, because it groups more than two items into nested combinations (e.g., {2,4,3} => {{2,3},4}). So please help me clarify my doubts.

1. Can we cluster the data without training and testing data?

2. How can I give these word combinations as input to data mining tools such as Weka (ARFF file)?

3. How can I convert these word combinations into training and testing data (or do I need to process this data for training or testing)? Please give an example of the training and testing format using these word combinations.

4. How do I measure observed and expected values for cluster evaluation?

Please help. Thanks in advance.

Posted by: **webmasterphilfv**

Date: November 16, 2017 05:55AM

Hello,

In general, clustering is an unsupervised type of data mining technique. This means that you don't need training and testing data.

You can just apply a clustering algorithm to the data to find clusters directly. How can you then evaluate these clusters? There are several ways:

1) you could ask some experts to look at your clusters visually to see if they make sense

2) you could use some measures like the SSE (sum of squared errors). For this, you don't need testing data. You just need to calculate the distance between each point and the centroid of its own cluster, square it, and sum over all points. For a fixed number of clusters, a smaller SSE usually means more compact clusters. Keep in mind that the SSE generally decreases as you add more clusters, so it is mostly useful for comparing clusterings that have the same number of clusters. You can find the formula for the SSE in various data mining books. There are also many other measures to evaluate a set of clusters.

3) ...

Basically, you should read section 8.5 of this book, which explains how to evaluate a set of clusters to determine whether it is good or not:

https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf
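
To make the SSE measure mentioned above more concrete, here is a minimal Java sketch. It assumes each word combination has already been converted to a numeric vector (for example with term weights), which the thread does not specify. The class name `SseExample`, the method names, and the toy 2-D points are hypothetical and only serve as an illustration.

```java
public class SseExample {

    /** Squared Euclidean distance between two vectors of equal length. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Centroid (mean vector) of each of the k clusters. */
    public static double[][] centroids(double[][] points, int[] labels, int k) {
        int dim = points[0].length;
        double[][] c = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[labels[i]]++;
            for (int j = 0; j < dim; j++) {
                c[labels[i]][j] += points[i][j];
            }
        }
        for (int cl = 0; cl < k; cl++) {
            if (counts[cl] == 0) continue; // empty cluster: leave centroid at zero
            for (int j = 0; j < dim; j++) {
                c[cl][j] /= counts[cl];
            }
        }
        return c;
    }

    /** SSE: sum over all points of the squared distance to the centroid of their own cluster. */
    public static double sse(double[][] points, int[] labels, int k) {
        double[][] c = centroids(points, labels, k);
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            total += squaredDistance(points[i], c[labels[i]]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: four 2-D points assigned to two clusters.
        double[][] points = { {1, 1}, {1, 3}, {8, 8}, {10, 8} };
        int[] labels = { 0, 0, 1, 1 };
        System.out.println(sse(points, labels, 2)); // prints 4.0
    }
}
```

In the toy data the centroids are (1, 2) and (9, 8), and each point is at squared distance 1 from its centroid, so the SSE is 4.0. No testing data or labels are involved, only the points and the cluster assignment.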

Posted by: **Dang Nguyen**

Date: November 17, 2017 11:13PM

A clustering method doesn't require training and testing data. Its main goal is to discover meaningful groups (i.e., clusters) from the data.

To evaluate a clustering method, there are two ways:

1. If you have ground truth (i.e., you know the label of each data point), you can compare the labels with the clusters assigned by the clustering method to see whether they match. More specifically, you compute measures such as mutual information (MI), normalized mutual information (NMI), or the adjusted Rand index (ARI).

2. If you don't know the ground truth, you can compute the distance among data points within a cluster (intra-distance) and the distance among data points in different clusters (inter-distance). Ideally, intra-distance should be small and inter-distance should be large for a good clustering result. You can compute the Silhouette Coefficient in this case.
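
As a rough illustration of the second option, the Silhouette Coefficient can be sketched in Java as below. For each point i, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and s(i) = (b - a) / max(a, b). The sketch assumes the data are numeric vectors compared with Euclidean distance and that you have a flat cluster assignment (for hierarchical clustering, that would mean first cutting the tree into a fixed number of clusters). The class name `SilhouetteExample` and the toy points are hypothetical.

```java
public class SilhouetteExample {

    /** Euclidean distance between two vectors of equal length. */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Mean silhouette over all points; labels[i] is in the range 0..k-1. */
    public static double silhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            // Sum and count of distances from point i to the points of each cluster.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                sum[labels[j]] += distance(points[i], points[j]);
                count[labels[j]]++;
            }
            int own = labels[i];
            if (count[own] == 0) continue; // singleton cluster: s(i) is taken as 0
            double a = sum[own] / count[own]; // intra-cluster distance
            double b = Double.POSITIVE_INFINITY; // nearest other cluster
            for (int c = 0; c < k; c++) {
                if (c != own && count[c] > 0) {
                    b = Math.min(b, sum[c] / count[c]);
                }
            }
            if (Double.isInfinite(b)) continue; // no other non-empty cluster
            double max = Math.max(a, b);
            if (max > 0.0) {
                total += (b - a) / max;
            }
        }
        return total / n;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: two well-separated groups of 2-D points.
        double[][] points = { {0, 0}, {0, 1}, {10, 0}, {10, 1} };
        int[] wellSeparated = { 0, 0, 1, 1 };
        int[] mixedUp = { 0, 1, 0, 1 };
        System.out.println(silhouette(points, wellSeparated, 2)); // close to 1
        System.out.println(silhouette(points, mixedUp, 2));       // negative
    }
}
```

Values near 1 suggest compact, well-separated clusters, values near 0 suggest overlapping clusters, and negative values suggest that many points sit closer to another cluster than to their own. This version compares all pairs of points, which should still be manageable for about 2000 items.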
