The Data Mining Forum
This forum is about data mining
, data science
and big data
: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger
. No registration is required to use this forum!
Re: choosing minsup in SPM
Date: June 11, 2020 04:02PM
In that paper, for the experiments, the minsup threshold was varied so as to show the performance gap between the algorithms. I started with a high minsup threshold and lower it down until it became very slow and i could see clearly the difference between the algorithms in terms of performance.
Generally, the minsup threshold will be different on each dataset. On some datasets, a minsup of 0.8 may results in finding 0 patterns (e.g. the BMS dataset I think) and on other datasets like Snake, you may find perhaps millions of patterns for the same threshold value. It really depends on the characteristics of the datasets (whether the sequences are long, whether the sequences are very similar or not, if there are many different items...)
From a practical perspective, if you use these algorithms in an application, I would suggest to start from a high minsup value and decrease it until when you find enough patterns.
As minsup is decreased, the number of frequent patterns may increase exponentially, and the speed or memory usage of algorithms may also increase in a similar way because as minsup is decrease the part of the search space to be explored may also increase exponentially.
If you dont want to set the minimum support, there exists a more intuitive way, which is to find the top-k sequential patterns. Here, instead of using minsup, you just say that you want to find for example the top-500 most frequent patterns. An algorithm like TKS will do that. Internally, it will automatically adjust the minsup threshold to find exactly k patterns (e.g. the top 500 patterns). TKS is very similar to the CM-SPAM algorithm in the PAKDD 2014 that you have read. The main difference is that TKS ask the user to set k the number of patterns to be found instead of the minsup threshold. This is in my opinion more intuitive.
Hope this helps.
Edited 1 time(s). Last edit at 06/11/2020 04:03PM by webmasterphilfv.