The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum.
Best approach to guess missing fields using incomplete datasets
Posted by: S Nath
Date: March 08, 2013 05:24PM

Hi,
Disclaimer: I’m a health informatics expert with very limited data mining knowledge, so my apologies for any obviously stupid comments that I make.
I’m developing an open source health application for underdeveloped countries. As part of this, I’m required to ‘guess’ missing data fields using existing data fields.
Now, I’ve heard that there are many ways to do this, but I’m afraid that I’ve failed to identify the best one for my scenario.
I’ve heard that clustering can be a good solution to this problem. However, my records can be very incomplete, which may affect the success of this approach.
It struck me that I could identify all association rules in my dataset, and then use associations to guesstimate the missing data based on their support and confidence.
My questions are:

Is this a valid way to do this?
What method would you recommend as the best approach to solve this?

Re: Best approach to guess missing fields using incomplete datasets
Date: March 10, 2013 11:11AM

Hi,

Welcome to the forum.

There are several ways to fill in incomplete data.

Using association rules is one solution. Since association rules represent associations between attribute values and come with a measure of probability (their confidence), it makes sense to use them for this.
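
To make the association-rule idea concrete, here is a minimal sketch in Python. It is only an illustration under several assumptions: the field names and records are made up, rules have a single antecedent and a single consequent, support is an absolute count, and the thresholds are arbitrary. A real project would more likely use an existing rule-mining implementation.

```python
from collections import Counter
from itertools import permutations

def mine_rules(records, min_support=2, min_confidence=0.6):
    """Mine rules (a=x) -> (b=y) using only the fields that are known."""
    pair_counts = Counter()  # ((a, x), (b, y)) -> number of records with both
    ante_counts = Counter()  # ((a, x), b) -> records with a=x where b is known
    for rec in records:
        known = [(k, v) for k, v in rec.items() if v is not None]
        for ante, cons in permutations(known, 2):
            pair_counts[(ante, cons)] += 1
            ante_counts[(ante, cons[0])] += 1
    rules = []
    for (ante, cons), support in pair_counts.items():
        confidence = support / ante_counts[(ante, cons[0])]
        if support >= min_support and confidence >= min_confidence:
            rules.append((ante, cons, support, confidence))
    return rules

def impute(record, rules):
    """Fill each missing field with the consequent of the best matching rule."""
    filled = dict(record)
    for field, value in record.items():
        if value is not None:
            continue
        candidates = [
            (conf, sup, guess)
            for (a, x), (b, guess), sup, conf in rules
            if b == field and record.get(a) == x
        ]
        if candidates:
            # Prefer the highest confidence, then the highest support.
            filled[field] = max(candidates, key=lambda c: (c[0], c[1]))[2]
    return filled

# Hypothetical records; None marks a missing field.
records = [
    {"fever": "yes", "cough": "yes", "diagnosis": "flu"},
    {"fever": "yes", "cough": "yes", "diagnosis": "flu"},
    {"fever": "yes", "cough": None, "diagnosis": "flu"},
    {"fever": "no", "cough": "yes", "diagnosis": "cold"},
    {"fever": "no", "cough": None, "diagnosis": "cold"},
    {"fever": "yes", "cough": "yes", "diagnosis": None},
]
rules = mine_rules(records)
completed = impute(records[5], rules)
print(completed)
```

Fields for which no rule passes the thresholds are left missing rather than guessed, which seems the safer default for health data.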

Another way would be to train a neural network on the records where an attribute is present, and then use it to predict the value of that attribute for the records where it is missing.

Another way is to use clustering, as you mentioned. Given a record, find the closest records and use their values to fill in its missing values.
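
The "closest records" idea can be sketched as a simple nearest-neighbour fill. The code below is a rough illustration, not a tuned method: it assumes categorical fields, measures distance as the fraction of mismatches over the fields that both records know, and lets the k closest records vote. The field names, the data and k=3 are all hypothetical.

```python
from collections import Counter

def distance(a, b):
    """Fraction of mismatches over fields known in both records."""
    shared = [k for k in a if a[k] is not None and b.get(k) is not None]
    if not shared:
        return float("inf")  # nothing in common to compare on
    return sum(a[k] != b[k] for k in shared) / len(shared)

def knn_impute(record, others, k=3):
    """Fill each missing field by majority vote among the k closest records."""
    filled = dict(record)
    for field, value in record.items():
        if value is not None:
            continue
        # Only records that know this field, and are comparable, can vote.
        donors = [
            o for o in others
            if o.get(field) is not None and distance(record, o) != float("inf")
        ]
        donors.sort(key=lambda o: distance(record, o))
        votes = Counter(o[field] for o in donors[:k])
        if votes:
            filled[field] = votes.most_common(1)[0][0]
    return filled

# Hypothetical records; None marks a missing field.
patients = [
    {"age_group": "child", "region": "north", "vaccinated": "yes"},
    {"age_group": "child", "region": "north", "vaccinated": "yes"},
    {"age_group": "adult", "region": "south", "vaccinated": "no"},
    {"age_group": "adult", "region": "north", "vaccinated": "no"},
    {"age_group": "child", "region": None, "vaccinated": "yes"},
]
target = {"age_group": "child", "region": "north", "vaccinated": None}
result = knn_impute(target, patients)
print(result)
```

Because the distance skips fields that are missing on either side, very incomplete records can still be compared, but with few shared fields the neighbours become less trustworthy.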

Another way would be not to fill in the missing data at all, and instead to analyze your data with algorithms that tolerate missing values.

Another would be to use a statistical approach. You may read this: http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html That page links to ways of filling in missing data with software such as SPSS. You may be able to find specialized software that performs this task for you.

This is what I know about this subject. Actually, I have never done it myself. ;-) My only concern with filling in missing data would be not to distort the statistical significance of the data.
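
One concrete way that filling in values can distort later statistics: if every missing value is replaced by the mean of the observed values, the mean stays the same but the variance shrinks, so the data looks more certain than it really is. The small sketch below shows this with made-up numbers; mean imputation is used here only as the simplest example, not as a recommendation.

```python
from statistics import mean, variance

# Hypothetical measurements: five observed values and five missing ones.
observed = [4.0, 6.0, 8.0, 10.0, 12.0]
n_missing = 5

# Mean imputation: every missing value becomes the observed mean.
imputed = observed + [mean(observed)] * n_missing

var_observed = variance(observed)  # sample variance of the real values
var_imputed = variance(imputed)    # sample variance after filling in

print("variance before filling:", var_observed)
print("variance after filling: ", var_imputed)
```

Any test or confidence interval computed on the filled-in column would treat the ten values as ten real observations, which is why the filled values should at least be flagged.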

Philippe

Re: Best approach to guess missing fields using incomplete datasets
Posted by: suranga
Date: March 11, 2013 04:17AM

Thank you Philippe,

No worries, this information itself was quite helpful to me. Hopefully, someone else with more knowledge on the subject will drop me some further tips.

But thank you for all your help :-)



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).