Re: Lift for Sequential Rules (ERMiner) and Other Questions
Date: December 16, 2017 07:14AM
Sorry for the delay to answer your questions. I am currently travelling to attend an international conference and have been busy this week. My answers are below:
>I noticed that there is lift in SPMF's CMDeo algorithm but not ERMiner. Is there any reason for this?
Yes, the reason is that calculating the lift requires additional information that is not required for calculating the confidence. More specifically, to calculate the lift of a rule X -->Y we need to know the support of Y, which is not required to calculate the confidence. Calculating the support of Y requires to modify how the algorithm works a little bit, as they are not designed to calculate the support of Y, and this can change the speed of the algorithm. Besides, it requires some effort to integrate these modifications properly in the code. This is the main reason why it was not implemented. Actually, it is a contributor of SPMF that has provided the code for using the lift in CMDeo. If someone implement it in ERMiner, it would be great as ERMiner is faster than CMDeo.
Besides another reason is that there are actually more than 15 measures used in the literature for evaluating rules. Support and confidence are the most common in this field. So these are the reasons why they have been implemented first. But the more would be better of course.
>Is there any algorithm faster than ERMiner?
Not to my knowledge. To make it faster, another way would be to add some constraints such as finding all rules containing no more than 5 items. Besides, there is the TRuleGrowth algorithm that allows to specify time constraints. If we use constraints, it can improve the performance further, as the algorithms can avoid looking at many patterns.
By the way, although ERMiner is faster than RuleGrowth, it uses more memory so there is some trade-off.
Another possibility could also be to modify these algorithms to work using multiple threads. Algorithms like ERMiner and RuleGrowth could be parallelized as they do a depth-first search. But it would requires maybe a few days of programming to do something like that.
>What is the best way for determining the minimum confidence and support?
For the support, it needs to be found by trial and error, if you do not have background knowledge about the data. For some datasets, there might be nothing frequent for 0.01 % but for other datasets, there may be millions of patterns for minsup = 0.99%. So it really depends on the data. Usually, I start with a very high value, and decrease it until I find enough patterns.
For the confidence, it is easier to set than the support. The reason is that the confidence of a rule X-->Y represents the conditional probability that X will be followed by Y. So, for some applications, you may perhaps be interested in something having a confidence of at least 50% for instance or even higher. And rules with a very low confidence such as 10% would certainly be useless. So I would start with something high like 75 % or 80% as minconf threshold and decrease it if i find nothing.
>Will sequential rules generate less rules than association rule mining? Intuitively, it appears so because there are significantly less transactions when you can group together your transaction database.
Yes, that is correct. In the CMRules papers, it was shown that Sequential Rules are a subset of the set of association rules. There will always be no more sequential rules than there are association rules. Basically, a sequential rules is an association rules were we enforce some additional ordering constraint that X must be before Y.
>And just to be clear, is it safe to say that association rules cannot be used accurately for prediction? Because even though items may be closely associated with each other, we can not necessarily say that one item occurs because of the other item.
Because association rules do not consider the time or ordering between events, it is quite possible that they would not perform well for domains where the time or ordering between events is important. However, there are some applications where the time is not important. In that case, we could assume that they could perform well. So if time or the ordering is important, sequential rules have a good chance of performing better, in my opinion. But it is always good to compare different methods to see how they really compare to each other in real-life, as for different kind of data, some methods sometimes work better or worse.