The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining book
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Sequential Rule Mining | Different definitions of Support and Confidence
Posted by: Fabian
Date: July 15, 2017 04:49AM

Hello,

I am looking into the problem of Sequential Rule Mining, and see different definitions of the support and confidence values regarding such a sequential rule. On this site's documentations of the different Algorithms implemented in SPMF (such as CMRules, RuleGrowth etc.), I find the support and confidence being defined in the same way as it is defined in Sequential Pattern Mining:
The support of a rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X.

Now, I have read into the different papers of the implemented algorithms, and find different definitions for the same values:
Here the support is defined as the number of sequences from a sequence database where all the items of X appear before all the items of Y. The confidence is defined accordingly.

Now, this is actually quite a difference as I see it. The examples on the documentary pages showing possible sequence databases and the identified sequential rules with its support and confidence values seem to be according to the definitions in the papers - not according to the definition given a few lines above.

Is this just an issue with the definitions being false in the documentation pages for the implemented algorithms here? Or does this have something to do with the alternative term "sequential support" and "sequential confidence" I also came across reading through some papers?

Any input would be greatly appreciated smiling smiley

Options: ReplyQuote
Re: Sequential Rule Mining | Different definitions of Support and Confidence
Posted by: Fabian
Date: July 15, 2017 05:42AM

I think I might have found an implementation error in the CMRules - Algorithm provided in SPMF. I tested the GUI Version on the following example (file contextPrefixSpan.txt as input) and checked with the given expected results on the documentation page:

image host

For me, the CMRules Algorithm outputs a confidence of 1.0 for all mined rules, which obviously is false. Maybe someone should check into the source code here. I haven't checked if the same happens for the Code-Version of SPMF.

Options: ReplyQuote
Re: Sequential Rule Mining | Different definitions of Support and Confidence
Date: July 22, 2017 09:47AM

Hello,

Sorry about the delay for replying. I usually check the forum very often and get notified by e-mails about new messages but for some reason, I did not see your message until today.

I have checked the confidence and support values for the rules in the table above for CMRules from the documentation and they are correct. The code also generates the same result.

Actually, there is no contradiction between the sentence "Here the support is defined as the number of sequences from a sequence database where all the items of X appear before all the items of Y. " and the paper. It is the same definition but I assume that it could be made clearer. Perhaps that it could be written as "all the items of X appear before all the items of Y at least once" to avoid ambiguity.

Let me explain with some example.

The definition of the support used in my sequential rule papers is actually a little bit different than that used in sequential pattern mining. I think that it might be what is confusing you. I will try to give you more details about that difference.

Let's consider a first sequence: {a,b,c}, {a,d}, {e, f, g} {d}, which contains 4 events(itemsets).

In sequential pattern mining, we consider that the pattern <{a,b},{d},{f}> appears in that sequence since {a,b} appear in the same itemset/event, which are followed by {d}, which is followed by {f}.

Now, for sequential rule mining, things are a little bit different for the support caculation. Consider the rule {a,b,d} --> {f}. This rule is considered to appear in the sequence. As you can see in that sequence, the items a,b and d actually never appear in the same event. But it does not matter for sequential rules. To calculate the support, we only care that a, b, and d appear before f. But there is no constraint on the order between a, b and d, and these items are not required to be in the same itemset (unlike sequential pattern mining). The items a, b and d can appear in any order, as long as they appear before f.

Now consider a second sequence: {d}, {a,b,c}, {e, f, g} {d}

For that sequence, we would still consider that {a,b,d} --> {f} appears in that sequence. However, the sequential pattern <{a,b},{d},{f}> does not appear in that sequence because in sequential pattern mining the order is very strict.

So, the difference between how the support is computed in sequential pattern mining and sequential rule mining is the following:
- In sequential pattern mining, the order is very strict. If we have a pattern <{a,b},{d},{f}>, then this pattern appears in a sequence only if it follows that exact order. This means that a and b must appear in the same itemset, then there must be another itemset containing d, and then another itemset containing f in the same sequence. This is very strict. If the order is not followed, then the pattern is considered as not appearing in the sequence.

- In sequential rule mining, we have a rule of the form X --> Y. Here we do not care about the order of items in X. We don't care if the items of X appear in the same itemset or not in the sequence. We also don't care about the order of the items of Y and also we don't care about whether the items of Y appear in the same itemset in the sequence or not. But what is important is that the items in X appear at least once before those in Y in the sequence. Note that the items may appear multiple times in a sequence but we only need it to appear at least once. If it appears multiple times it is counted as 1.

About the measures, in the first paper that I wrote about this topic (CMRules), I used the name "sequential support" and "sequential confidence". In my other papers about sequential rules (ERMiner RuleGrowth, etc.), I just decided to call these measures "support" and "confidence" to make it shorter. But they are the same measures.

Note that there exists some other sequential rule mining algorithm that utilize a different definition of support and confidence. So this can be another source of confusion. For example, the RuleGen algorithm, which is also offered in SPMF use a much more strict definition of support. It finds rules X--> Y where X and Y are sequential patterns. This definition is closer to sequential pattern mining than the definition that I used in my papers. But in my opinion the problem with that definition is that it is too strict.

Options: ReplyQuote
Re: Sequential Rule Mining | Different definitions of Support and Confidence
Posted by: Fabian
Date: July 23, 2017 11:17PM

Hello and thanks for your reply.

Quote

Actually, there is no contradiction between the sentence "Here the support is defined as the number of sequences from a sequence database where all the items of X appear before all the items of Y." and the paper. It is the same definition but I assume that it could be made clearer.

I agree with you on that point, those would say the same to me as well. Where I found was a difference were this definition and the one given on the documentation pages for SPMF here for example. There the support is defined with the following sentence:
Quote

The support of a rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database.
This definition is the exact same as the one given for the association rule mining algorithms and does not mention the fact at all, that it is about the number of sequences where all items of X need to appear BEFORE all items of Y. It just says that all items from X and all items from Y need to appear in it to me.
The only way it could say that is if the set-operator "∪" had a sequential meaning in this context, stating that X needs to appear before Y.

The rest of your explanation was very clear, thanks a lot. I understand the differences between sequential pattern and sequential rule mining much better now smiling smiley

Options: ReplyQuote
Re: Sequential Rule Mining | Different definitions of Support and Confidence
Posted by: Phil
Date: July 23, 2017 11:21PM

I see. Ok, in that case, yes the definition is not really correct, or it is ambiguous. I will try to fix it tonight when I go back home. Thanks.

Philippe

Options: ReplyQuote


Your Name: 
Your Email: 
Subject: 
Spam prevention:
Please, enter the code that you see below in the input field. This is for blocking bots that try to post this form automatically.
 **         ********  ********   ******   ******** 
 **    **      **        **     **    **     **    
 **    **      **        **     **           **    
 **    **      **        **     **           **    
 *********     **        **     **           **    
       **      **        **     **    **     **    
       **      **        **      ******      **    
This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.