The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Retaining the class label during classification in R and Python
Posted by: immahin
Date: September 09, 2018 07:45AM

I need some clarification. When classifying data in Python, the data is split into features and labels before the classifier is trained: the class label, which is the column we want to predict, is removed from the training features. For example, to classify the yeast data, where "class" is the label, I do this:

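import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

headers = ["name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "class"]
df = pd.read_csv("", header=None, names=headers, na_values="?")
# Features: every column except the label (recent pandas needs columns=, not a positional axis)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])
knn = NearestNeighbors(n_neighbors=6, algorithm='ball_tree', metric='euclidean')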

However, in R, the class label seems to be included in the distance calculation. For example, for the same dataset, the distance calculation in R goes like this:

df <- read.table(file="~/yeast.txt", header=T, sep=",")
names(df) <- c("name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "class")
dist <- distances("class", df, "Euclidean")

Here, we need to pass the class label too. Could someone explain the reason? Am I doing something wrong?

Re: Retaining the class label during classification in R and Python
Posted by: Dang Nguyen
Date: September 11, 2018 03:12PM

Given a data point, if you just want to find its k nearest neighbors, then you don't need the labels.

As for your R code, I think the result would be the same as with the Python code if you removed the labels from your training data.
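
To illustrate the point, here is a minimal pure-Python sketch of k-nearest neighbors. It is not the scikit-learn or R implementation. The toy rows, the two-feature layout and the helper names k_nearest and knn_predict are all made up for illustration. The distance is computed from the features only, and the labels come in afterwards, when the neighbors vote on a class.

```python
import math
from collections import Counter

# Hypothetical toy rows: (feature vector, class label). Values are made up.
rows = [
    ((0.58, 0.61), "MIT"),
    ((0.43, 0.67), "MIT"),
    ((0.64, 0.62), "MIT"),
    ((0.20, 0.15), "CYT"),
    ((0.25, 0.10), "CYT"),
]

def k_nearest(query, rows, k):
    """Return the indices of the k rows closest to query, using features only."""
    order = sorted(range(len(rows)), key=lambda i: math.dist(query, rows[i][0]))
    return order[:k]

def knn_predict(query, rows, k):
    """Majority vote over the labels of the k nearest rows."""
    votes = Counter(rows[i][1] for i in k_nearest(query, rows, k))
    return votes.most_common(1)[0][0]

query = (0.55, 0.60)
neighbors = k_nearest(query, rows, k=3)
prediction = knn_predict(query, rows, k=3)

# Stripping the labels leaves the neighbor search unchanged,
# because the labels never enter the distance calculation.
unlabeled = [(features, None) for features, _ in rows]
neighbors_unlabeled = k_nearest(query, unlabeled, k=3)
```

In this sketch, neighbors and neighbors_unlabeled hold the same indices, which is why dropping the label column should not change the neighbor search itself.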

