The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Evaluation of the generated patterns
Posted by: Niko
Date: July 06, 2021 06:58AM

Dear Prof,

First, thanks for your help and efforts.

I would like to know how I can evaluate the generated patterns.

I ran the FP-growth algorithm on a real dataset. Now that the patterns are generated, I want to evaluate them based on several statistical measures such as confidence, lift, Kulc, Allconf, etc.
How should this evaluation be done? Besides, how can we choose the best pattern based on this evaluation?

Could you please elaborate on this for me?

Kind regards,
Niko

Re: Evaluation of the generated patterns
Date: July 06, 2021 05:09PM

Hi,

Thanks for using the software.

In pattern mining, an algorithm will find the patterns that you ask for. This means that if you use an algorithm like Apriori to find the frequent patterns, then Apriori will give you exactly that.

Different measures can be used to find patterns, such as the support, lift and confidence. Each measure has some advantages and disadvantages.

In SPMF, if you use the user interface, and you choose "FPGrowth_itemsets", then you can set a minimum support threshold and the results will tell you about the support of itemsets (how many times they appear).

Now, if you want to use other measures, you can use other algorithms. The lift and confidence are measures for association rules, so you would need to use "FPGrowth_association_rules", which lets you specify a minimum confidence. If you also want to use the lift, you can use "FPGrowth_association_rules_with_lift", which lets you use support, confidence and lift.
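Once the rules have been written to an output file, the measure values can be read back for further analysis. The Python sketch below is only an illustration. It assumes that each output line looks like `1 ==> 2 4 #SUP: 3 #CONF: 0.75 #LIFT: 1.2`; this layout is an assumption that should be checked against the SPMF documentation for the algorithm used. The function name `parse_rule_line` is hypothetical.

```python
import re

# Assumed line layout (verify against the SPMF documentation):
#   <antecedent items> ==> <consequent items> #SUP: <n> #CONF: <x> #LIFT: <y>
LINE_RE = re.compile(r"^(?P<lhs>.+?)\s*==>\s*(?P<rhs>.+?)\s*(?P<tags>#.*)$")
TAG_RE = re.compile(r"#(\w+):\s*([-+0-9.eE]+)")


def parse_rule_line(line):
    """Parse one rule line into a dict with items and measure values."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized rule line: {line!r}")
    # Collect every '#NAME: value' tag, e.g. sup, conf, lift.
    measures = {name.lower(): float(value)
                for name, value in TAG_RE.findall(match.group("tags"))}
    return {
        "antecedent": match.group("lhs").split(),
        "consequent": match.group("rhs").split(),
        **measures,
    }


example = parse_rule_line("1 ==> 2 4 #SUP: 3 #CONF: 0.75 #LIFT: 1.2")
print(example)
```

With the rules loaded this way, they can be sorted or filtered by any of the reported measures.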

After that, you could also try other algorithms in SPMF, which offer other measures.

Note that many measures have been proposed in research papers. I have not implemented all of them because one of my goals is to design efficient software. Some measures would require redesigning the algorithms and could make them less efficient. But the most popular measures, like support, confidence and lift, can be used.

Hope that this helps.

Best regards

Re: Evaluation of the generated patterns
Posted by: Niko
Date: July 07, 2021 07:58AM

Thanks so much for your kind reply.

Yes, I got your idea.

However, for a given dataset, I will first generate the complete set of rules that satisfy several statistical measures such as lift, Allconf, Maxconf, Cosine, and Confidence.
I want to know how I can compare the resulting rules based on these measures. Besides, I want to know which of these statistical measures is better for extracting knowledge from a given dataset.

So, my question is: how can I compare the generated rules based on several measures?

Kind regards,
Niko

Re: Evaluation of the generated patterns
Date: July 07, 2021 05:15PM

Hi,

I see.

There are many measures for association rules. I have seen survey papers listing over 15 different measures. Which one is the best? I think it depends on your application and data.

In some books like "Introduction to Data Mining" by Tan, Steinbach & Kumar, there is a chapter about association rule mining that discusses the evaluation of patterns. It shows with examples that in some scenarios the confidence is a good measure, but in other scenarios the lift is better, etc. From this, we can see that no single measure is better than the others all the time. Some measures have advantages in some situations, but it depends on the data and, most importantly, on how you want to use the rules.
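To make the comparison concrete, here is a minimal Python sketch that computes several of the measures mentioned in this thread (confidence, lift, all-confidence, max-confidence, Kulczynski and cosine) from raw counts, using their usual textbook definitions. The function name `measures` and the counts are made up for illustration. The two calls use the same counts for X, Y and X with Y, but a different total number of transactions.

```python
import math


def measures(n_x, n_y, n_xy, n):
    """Interestingness measures for X ==> Y from raw counts.

    n_x, n_y : number of transactions containing X, respectively Y
    n_xy     : number of transactions containing both X and Y
    n        : total number of transactions
    """
    conf_xy = n_xy / n_x  # estimate of P(Y | X)
    conf_yx = n_xy / n_y  # estimate of P(X | Y)
    return {
        "confidence": conf_xy,
        # lift = P(XY) / (P(X) * P(Y)), rewritten with counts
        "lift": (n_xy * n) / (n_x * n_y),
        "all_conf": min(conf_xy, conf_yx),
        "max_conf": max(conf_xy, conf_yx),
        "kulc": 0.5 * (conf_xy + conf_yx),
        "cosine": n_xy / math.sqrt(n_x * n_y),
    }


# Illustrative counts only: same X/Y counts, different database size.
small_db = measures(n_x=100, n_y=100, n_xy=50, n=1_000)
large_db = measures(n_x=100, n_y=100, n_xy=50, n=100_000)

for name in small_db:
    print(f"{name:10s} small={small_db[name]:.3f} large={large_db[name]:.3f}")
```

In this sketch only the lift changes between the two calls, because it is the only one of these measures that depends on the total number of transactions. This is one reason why two measures can disagree about the same rule.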

For the basic measures, some interpretation can be like this:


support: A higher value is usually viewed as better.

confidence: This is like a conditional probability. Higher is better, and the value is between 0 and 100%.

lift: For an association rule X ==> Y, if the lift is equal to 1, it means that X and Y are independent. If the lift is higher than 1, it means that X and Y are positively correlated. If the lift is lower than 1, it means that X and Y are negatively correlated.

etc.
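These three basic measures can be computed directly from a transaction database. The following Python sketch uses a small made-up dataset and hypothetical helper names (`support`, `confidence`, `lift`). It is meant only to illustrate the definitions, not to reproduce how SPMF computes them.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of X ==> Y, an estimate of P(Y | X).

    Assumes X appears in at least one transaction.
    """
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Lift of X ==> Y: 1 means independent, >1 positive, <1 negative."""
    return confidence(x, y, transactions) / support(y, transactions)


# Made-up toy database of five transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

for x, y in [({"bread"}, {"milk"}), ({"butter"}, {"bread"})]:
    print(sorted(x), "==>", sorted(y),
          "support:", support(x | y, transactions),
          "confidence:", confidence(x, y, transactions),
          "lift:", lift(x, y, transactions))
```

On this toy data, bread ==> milk has a lift slightly below 1 (about 0.94), while butter ==> bread has a lift above 1 (about 1.25), so the second rule indicates a positive correlation even though its support is lower.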

But still, how good a rule is depends on the application.

For example, in some applications like information retrieval, you may be interested in finding the most frequent patterns, so the support may be quite important. But in some other applications, you may want to focus on patterns that are not necessarily frequent but have a strong correlation. In that case, the support is not very important and you may want to focus on a high confidence or lift.

Another example is using the rules for prediction. Some people build classifiers using association rules and use various measures to select the best rules for prediction. Some, for example, multiply the support by the confidence to select the best rule, or use the lift or other measures (I do not remember the details).

Personally, I notice that of the dozens of measures that exist, most people just use the simplest ones, like support, confidence and lift.

If you want to interpret many measures, I think you first need to read the definitions of these measures carefully to make sure you understand what they mean, and then try to interpret them in your application.
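As a rough sketch of this kind of rule selection, the Python code below ranks a few rules once by support multiplied by confidence and once by lift. The `Rule` class, the `rank_rules` helper and all the values are made up for illustration. This is not the exact procedure of any particular classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: tuple
    consequent: tuple
    support: float     # relative support of antecedent and consequent together
    confidence: float
    lift: float


def rank_rules(rules, score):
    """Return the rules sorted from highest to lowest score."""
    return sorted(rules, key=score, reverse=True)


# Illustrative values only.
rules = [
    Rule(("bread",), ("milk",), support=0.60, confidence=0.75, lift=0.94),
    Rule(("butter",), ("bread",), support=0.40, confidence=1.00, lift=1.25),
    Rule(("milk",), ("butter",), support=0.20, confidence=0.25, lift=0.63),
]

by_sup_conf = rank_rules(rules, score=lambda r: r.support * r.confidence)
by_lift = rank_rules(rules, score=lambda r: r.lift)

print("Best by support x confidence:", by_sup_conf[0])
print("Best by lift:", by_lift[0])
```

Here the two criteria select different rules as the best one (bread ==> milk for support x confidence, butter ==> bread for lift), which shows why the choice of measure should follow from how the rules will be used.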

Best regards,



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).