The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining workshop
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Quest for specific SPM algorithm
Posted by: iisys
Date: August 02, 2017 12:39AM


Your forum, blog and SPMF implementation have made quite an impression on me over the past few weeks. I can see your work is in high demand amongst data scientists, making it all the more impressive!

I'm a Computer Science master student and I'm working on my graduation project. I am currently in need of some advice; a nudge in the right direction.

What I have is a sequence database, where the itemsets have single items:

What I want to get from this is the pattern <a,b,c>, and (preferably) the information that it occurs 5 times in the first sequence and 2 times in the second sequence. So I am looking for the repeated subpattern, not merely the sequence with the minsupport. Algorithms such as CM-ClaSP and VMSP only return the <a,b,c,a,b,c> pattern. As such, I'm not looking for the maximal pattern and also not quite the closed pattern.

A few other requirements:
- I must be able to specify the max gap, which will be set to 1.
- I must be able to specify the minimum pattern length, which will probably be set to 2 or 3.
- I must be able to specify the minimum pattern repetitions, which will probably be set to somewhere between 5 and 10.

I'm somewhat skilled in Java, so I'm perfectly fine with modifying one of your implemented algorithms. I just don't know where to start. Do you have any suggestions?

Options: ReplyQuote
Re: Quest for specific SPM algorithm
Date: August 02, 2017 05:06AM


Glad that you like SPMF and it is useful.

You could modify CM-SPAM. CM-SPAM is an algorithm that extends SPAM with some pruning strategy to more quickly find sequential patterns. I think that it meets most of your requirements, except that it does not care about the number of repetitions in a sequence. If a pattern appears multiple times in a sequence, it will still be counted as 1.

To consider the number of repetitions in a sequence, you would need to modify the class AlgoCMSPAM in the package:

In that class, you would first have to change the runAlgorithm() method to add some new parameters that would be the minimum and maximum number of repetitions.

Then, the most tricky part is to change the algorithm so that it would count the repetitions and check if it exceeds the minimum or maximum for each pattern that is considered by the algorithm.

To do that, you would need to change the class Bitmap in the same package. For each pattern in the search space, there is an instance of the class Bitmap that is created. This class is a bit vector that stores information about where a given pattern appears in sequences as a list of 0 and 1. The bitmap representation is the one used in the SPAM algorithm (see the SPAM paper). Now, what you want to modify is the function createNewBitmapSStep(). This function is used to create the bitmap of a new pattern and at the same time calculate its support. You could change that function to at the same time calculate the number of repetitions and check if it exceeds the minimum and maximum number of repetitions.

If you allow multiple items per itemsets in sequences, you would also need to modify createNewBitmapSSte().

These two above methods are not so easy to modify, and it is easy to introduce some bugs. Before modifying them, you should read the SPAM paper and understand the bitmap representation. After you understand that, it should become easier to modify these functions.

The best is to start with a simple dataset for testing your modifications and to write the bitvectors on a piece of paper. It will make it easier to understand the bitmap representation and also to think about how to modify the code.

If you familiar with Java and take the time to read the paper to see how it works, I think that it should not be a problem to modify the algorithm. If you do it sucessfully, I would be glad to include the code in SPMF if you wish to share it so that other people can also use it. But it is up to you ;-)

Options: ReplyQuote

This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.