The Data Mining Forum                             open-source data mining software open-source data mining software data science journal data mining conferences
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
choosing minsup in SPM
Posted by: jbd
Date: June 11, 2020 08:25AM

Hello,

Concerning your paper: "Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information", I would like to ask you on what basis did you make your choice of the values of minsup for each dataset ?

Thank you,
jbd

Options: ReplyQuote
Re: choosing minsup in SPM
Date: June 11, 2020 04:02PM

Hi,

In that paper, for the experiments, the minsup threshold was varied so as to show the performance gap between the algorithms. I started with a high minsup threshold and lower it down until it became very slow and i could see clearly the difference between the algorithms in terms of performance.

Generally, the minsup threshold will be different on each dataset. On some datasets, a minsup of 0.8 may results in finding 0 patterns (e.g. the BMS dataset I think) and on other datasets like Snake, you may find perhaps millions of patterns for the same threshold value. It really depends on the characteristics of the datasets (whether the sequences are long, whether the sequences are very similar or not, if there are many different items...)

From a practical perspective, if you use these algorithms in an application, I would suggest to start from a high minsup value and decrease it until when you find enough patterns.

As minsup is decreased, the number of frequent patterns may increase exponentially, and the speed or memory usage of algorithms may also increase in a similar way because as minsup is decrease the part of the search space to be explored may also increase exponentially.

If you dont want to set the minimum support, there exists a more intuitive way, which is to find the top-k sequential patterns. Here, instead of using minsup, you just say that you want to find for example the top-500 most frequent patterns. An algorithm like TKS will do that. Internally, it will automatically adjust the minsup threshold to find exactly k patterns (e.g. the top 500 patterns). TKS is very similar to the CM-SPAM algorithm in the PAKDD 2014 that you have read. The main difference is that TKS ask the user to set k the number of patterns to be found instead of the minsup threshold. This is in my opinion more intuitive.

Hope this helps.

Best regards,

Philippe



Edited 1 time(s). Last edit at 06/11/2020 04:03PM by webmasterphilfv.

Options: ReplyQuote
Re: choosing minsup in SPM
Posted by: jbd
Date: June 11, 2020 11:41PM

This is very helpful. Thank you a lot for your detailed response.

I have also read your blog post yesterday concerning the equation that you propose for choosing minsup, and I find it very useful.

jbd

Options: ReplyQuote
Re: choosing minsup in SPM
Date: June 12, 2020 09:42AM

Hi again,

You are welcome!

Yes, that equation is another solution to that problem of how to set the minsup!

Best regards,

Philippe

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.