This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Using CM-Spam for very long sequences
Posted by: Madiha Khan
Date: September 10, 2020 02:58AM


I am new to SPMF and am exploring the use of sequential pattern mining for my dataset. I'm finding the tool very useful, but have run into some difficulties using it for longer sequences.

My research uses unstructured data relating to tutor-student interactions. I have labelled the data using a framework, and now have a very long set of sequences, with each sequence relating to one tutoring session. The labels for the data are numerical positive integers, and I have prepared the input file using the instructions on the SPMF website (i.e. every event is separated by a -1, and there is a -2 at the end of every sequence). The sequences consist of a large number of events, rather than item-sets, and I have thus treated each event as its own item-set, and separated each event with a -1.
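
For illustration, the format described above (one event per item-set, -1 after every event, -2 at the end of the sequence, one sequence per line) can be sketched in Python. This is only a minimal sketch: the function name to_spmf_line is hypothetical and is not part of SPMF, and the example labels are made up.

```python
def to_spmf_line(events):
    """Format one session (a list of positive integer labels) as one SPMF line.

    Hypothetical helper, not part of SPMF. Each event is treated as its own
    item-set, so every event is followed by -1 and the line is closed by -2.
    """
    if not events:
        raise ValueError("a sequence needs at least one event")
    for event in events:
        if not isinstance(event, int) or event <= 0:
            raise ValueError(f"labels must be positive integers, got {event!r}")
    return " ".join(f"{event} -1" for event in events) + " -2"


# Made-up example session with three labelled events.
line = to_spmf_line([3, 7, 3])
print(line)
```

Writing each session with a helper like this keeps every sequence on a single physical line of the file, regardless of how long the sequence is.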

When I run the CM-Spam algorithm on a set of 8 sequences with a minsup of 50%, the output file shows a number of patterns with a support of either 3 or 2. I can tell manually that the support should be higher for many of these patterns, which leads me to think that CM-Spam is not recognizing the bottom 5 sequences for some reason. I'm not sure why this is. The sequences are very long and tend to spill over into multiple lines, so this may be affecting how the algorithm works. Could this be a possibility? If so, how could I adjust the files?
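
The manual support check mentioned above could be automated with a short brute-force script. The sketch below is not how CM-Spam works internally; it simply counts how many sequences contain a pattern as a subsequence, assuming every item-set holds exactly one item. The function names and the sample data are hypothetical.

```python
def parse_spmf_line(line):
    """Return the events of one SPMF line, assuming one item per item-set."""
    return [int(token) for token in line.split() if token not in ("-1", "-2")]


def contains(sequence, pattern):
    """True if the pattern's items occur in the sequence in the same order."""
    remaining = iter(sequence)
    # 'item in remaining' advances the iterator past the first match,
    # so later items must be found after earlier ones.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Number of sequences that contain the pattern."""
    return sum(1 for sequence in sequences if contains(sequence, pattern))


# Made-up data: four short sequences in SPMF format.
lines = [
    "1 -1 2 -1 3 -1 -2",
    "1 -1 3 -1 -2",
    "2 -1 1 -1 -2",
    "3 -1 1 -1 2 -1 3 -1 -2",
]
sequences = [parse_spmf_line(line) for line in lines]
print(support(sequences, [1, 3]))
```

Comparing these counts with the supports in the output file shows which patterns, and therefore which sequences, are being missed.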

Thanks in advance for the help.


Re: Using CM-Spam for very long sequences
Date: October 04, 2020 10:32PM

Dear Madiha,

Thanks for using SPMF, and I am sorry for not answering earlier. Usually, I receive an e-mail for each message posted on the forum and I try to answer each one quickly. But somehow, I did not see the notification for this one.

I think the problem is likely due to some issue in the input file. It could be a bug... but since this algorithm has been used by many people, I think it is more likely to be a problem with the input file such as a -2 that is missing or something else like that.
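
As a rough way to look for this kind of problem, a small script along the following lines could flag suspicious lines in the input file. It is only a sketch and not part of SPMF: the function name check_spmf_lines is hypothetical, and it assumes that each sequence must sit on a single line ending with -2 and that lines starting with #, % or @ are comments or metadata to be skipped.

```python
def is_int(token):
    """True if the token parses as an integer."""
    try:
        int(token)
        return True
    except ValueError:
        return False


def check_spmf_lines(lines):
    """Return a list of (line_number, message) for lines that look malformed."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with #, % or @ are skipped.
        if not line or line[0] in "#%@":
            continue
        tokens = line.split()
        bad = [token for token in tokens if not is_int(token)]
        if bad:
            problems.append((number, f"non-integer tokens: {bad}"))
            continue
        if tokens[-1] != "-2":
            problems.append((number, "line does not end with -2"))
        elif len(tokens) < 3 or tokens[-2] != "-1":
            problems.append((number, "-2 is not preceded by an item-set and -1"))
        if tokens.count("-2") > 1:
            problems.append((number, "more than one -2 on the line"))
    return problems


# Made-up example: the second sequence has been wrapped onto two lines.
sample = ["1 -1 2 -1 -2", "3 -1 4 -1", "5 -1 -2"]
for number, message in check_spmf_lines(sample):
    print(number, message)
```

To check a real file, the same function can be given an open file object, for example check_spmf_lines(open("input.txt")), where input.txt stands in for the actual file name. A sequence that has been broken across several lines will show up as a line that does not end with -2.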

If you have not found a solution yet, you can send me a direct email at philfv8 AT yahoo DOT COM with the input file, and I will try it on my computer to see what is happening. If you do so, please also tell me the parameters that you are using and which patterns you think are not correct.

Best regards,


