The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining book
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Timeout using VMSP
Posted by: mikazuki
Date: June 29, 2017 03:56AM

I was using VMSP to mine frequent maximal sequential patterns. For most of the inputs, it takes only a few ms to run. I set a timeout of one minute for VMSP and the algorithm does not halt for some inputs. I suspect that DFS or some loops in your algorithm do not halt.

For example, I have two strings and I expect VMSP to find a pattern like "{{#> layout}}{{/layout}}". I used 0.5 as the support when running VMSP.

{{#> layout}}Outer{{/layout}}
{{#> layout}}{{/layout}}

Here are my strings in SPMF format.
https://pastebin.com/J0mcDmuz

Options: ReplyQuote
Re: Timeout using VMSP
Posted by: mikazuki
Date: June 29, 2017 04:04AM

The above Pastebin is broken. Please check this one instead.

Link

Options: ReplyQuote
Re: Timeout using VMSP
Date: June 29, 2017 07:21AM

Hello,

The problem is not the algorithm. But it is the size of the search space.

The more the sequences are similar, the more frequent patterns there will be.

Moreover, the more the minsup threshold is set low, the more the search space will be large, and the algorithm will be slow.

For your data, there are only 2 sequences. If you set minsup = 0.5, it means that everything is frequent! That is everything that appear in at least one sequence is frequent.

In your file, the longest sequence has about 24 items. Thus, there is roughly 2^24 patterns that are frequent, which is about 16,777,216 sequences. Actually, there are probably more than that since there are two sequences... Moreover, the VMSP algorithm is an algorithm that can generate some candidates, so actually, it can consider more patterns than what actually appear in the database.

So the problem is the search space.

If you look at VMSP, there are several parameters that you can use to reduce the search space:
- raise the minsup threshold
- use the lenght constraint (for example, if you search only for patterns with 4 items or less, the search space will be much smaller)
- use the maxgap constraint
...

Options: ReplyQuote
Re: Timeout using VMSP
Posted by: mikazuki
Date: June 30, 2017 05:31AM

Thank you for your answer. I tried to reduce the max length and the max gap constraints and it is faster than before. However, the search space still seems to be large for VMSP.

Currently, I am trying to identify the longest and the most frequent subsequences from a group of strings. For example, given the following three strings.

{{#> layout}}Outer{{/layout}}
{{#> layout}}{{/layout}}
InOuter

I am looking for "{{#> layout}}{{/layout}}" and "Outer" as the solutions. It seems that the search space is too large for VMSP. I have also tried CM-ClaSP and TKS before but they gave me too many short patterns in the results. I tried to specify the minimum length for TKS but it threw an out of memory error. Which pattern mining algorithm from SPMF do you recommend for such a task?

Options: ReplyQuote
Re: Timeout using VMSP
Date: June 30, 2017 08:35AM

The fastest algorithm is probably CM-SPADE in SPMF if you want to find all sequential patterns. The second fastest should be CM-SPAM. CM-SPAM has the advantages of letting you set more constraints such as the length constraint and gaps...

But the above algorithms find all frequent patterns. In my opinion it is better to find only the closed patterns. Thus, I would use CM-CLASP, which should be the fastest for closed pattern mining.

For TKS, it is a top-k algorithm, so it consumes more memory than the regular algorithms, as mining top-k sequential patterns is more difficult than doing the traditional sequential pattern mining. So it is understandable that it run out of memory while other algorithms does not.

Another idea is to change the sequence representation if possible for your application, to make the sequences shorter or rewrite the items are more general items. In general, the more the sequences are long and similar, and the more minsup is set low, the larger the search space is. Pattern mining is a field where unlike big data, even with very small datasets the search space can be extremely large. For example, if you have a single sequence with 100 items and set minsup = 0.1, there could be 2^100 patterns, and the algorithms would never terminate or even be able to store all the patterns.



Edited 1 time(s). Last edit at 06/30/2017 08:37AM by webmasterphilfv.

Options: ReplyQuote


Your Name: 
Your Email: 
Subject: 
Spam prevention:
Please, enter the code that you see below in the input field. This is for blocking bots that try to post this form automatically.
 ********    ******   ********  **     **  **     ** 
 **     **  **    **  **        **     **   **   **  
 **     **  **        **        **     **    ** **   
 **     **  **        ******    *********     ***    
 **     **  **        **        **     **    ** **   
 **     **  **    **  **        **     **   **   **  
 ********    ******   **        **     **  **     ** 
This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.