This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Max Gap Constraint in VGEN Algorithm
Posted by: Andrea
Date: July 31, 2017 02:56AM

Hello to everyone,

I am currently using VGEN as a sub-algorithm inside another learner, for research purposes.
Recently, I have been working on adding support for gaps between itemsets in sequences to my learner, so I am relying on the "max gap" parameter of the VGEN algorithm.

The documentation states: If "max gap" is set to N, a gap of N-1 itemsets is allowed between two consecutive itemsets of a pattern. If the parameter is not used, by default "max gap" is set to +∞.

However, I have noticed that sometimes the results seem to be inconsistent. For example, given max gap set to 4:

SEQUENCE (> corresponds to -1): g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a
PATTERN: h>g>h

The pattern should not be matched in the sequence; however, according to the output of the algorithm, it is.

Am I missing something here?

Any help would be greatly appreciated!

Thank you in advance,
Andrea

Re: Max Gap Constraint in VGEN Algorithm
Date: July 31, 2017 03:25AM

Hello Andrea,

In the above example, the pattern h>g>h can match the sequence as follows (the matched items are shown in brackets):

g>g>h>i>h>h>h>h>[h]>g>g>[g]>g>i>[h]>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

Why does it match? If we set max gap = 4, it means that up to 3 itemsets may be skipped between two consecutive items of the pattern h > g > h.

If we consider the first two items of the pattern h > g > h, the gap is equal to 2 because there are two itemsets between the first h and the g (both shown in brackets):

g>g>h>i>h>h>h>h>[h]>g>g>[g]>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

This does not exceed the 3 skipped itemsets that are allowed, so the max gap constraint is not violated.

Now let's check the last two items of the pattern h > g > h. The gap is equal to 2 because there are two itemsets between the g and the second h (both shown in brackets):

g>g>h>i>h>h>h>h>h>g>g>[g]>g>i>[h]>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

Again, this does not exceed the 3 skipped itemsets that are allowed, so the max gap constraint is not violated.

Because the max gap constraint is not violated for any two consecutive items of the pattern h > g > h in that sequence, the pattern h > g > h is said to match that sequence.
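
The check described above can be sketched in a few lines of Python. This is only an illustration of the "max gap" definition quoted from the documentation, not the way VGEN is actually implemented in SPMF. The function name matches_with_max_gap is made up for this sketch, and it assumes that every itemset contains a single item, as in the example sequence.

```python
def matches_with_max_gap(sequence, pattern, max_gap):
    """Return the 0-based positions of one occurrence of `pattern` in
    `sequence` that respects `max_gap`, or None if there is none.

    Assumption: two consecutive pattern items may be at most `max_gap`
    positions apart, i.e. at most max_gap - 1 itemsets are skipped.
    """
    def extend(prev_pos, k):
        # All pattern items have been placed: nothing left to match.
        if k == len(pattern):
            return []
        if prev_pos is None:
            # The first pattern item may start anywhere in the sequence.
            candidates = range(len(sequence))
        else:
            # Later items must fall within max_gap positions of the previous one.
            last = min(prev_pos + max_gap, len(sequence) - 1)
            candidates = range(prev_pos + 1, last + 1)
        for pos in candidates:
            if sequence[pos] == pattern[k]:
                rest = extend(pos, k + 1)
                if rest is not None:
                    return [pos] + rest
        return None  # no placement of pattern[k] works from here

    return extend(None, 0)


sequence = "g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a".split(">")
print(matches_with_max_gap(sequence, ["h", "g", "h"], 4))
```

On the example sequence this prints [6, 10, 14] (0-based positions), which is one valid occurrence found by the search. With max_gap set to 2 the function returns None, because no occurrence fits within that gap.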

Note that there might be more than one way that the pattern h > g > h could match that sequence. For example, another way is:

g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

But we only need one match to say that the pattern matches the sequence.
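
To see the different ways the pattern can match, one can list every occurrence that respects the constraint. The sketch below is a plain brute-force enumeration written for illustration only; the name all_occurrences is hypothetical, itemsets are assumed to contain a single item, and this is not how VGEN itself works, since the algorithm only needs to know that at least one occurrence exists.

```python
def all_occurrences(sequence, pattern, max_gap):
    """List every occurrence of `pattern` in `sequence` as a tuple of
    0-based positions, keeping consecutive pattern items at most
    `max_gap` positions apart (assumed reading of the max gap rule).
    """
    results = []

    def extend(chosen):
        k = len(chosen)
        if k == len(pattern):
            results.append(tuple(chosen))
            return
        if chosen:
            # Next item must lie within max_gap positions of the last one.
            start = chosen[-1] + 1
            stop = min(chosen[-1] + max_gap, len(sequence) - 1)
        else:
            # First item can be anywhere.
            start, stop = 0, len(sequence) - 1
        for pos in range(start, stop + 1):
            if sequence[pos] == pattern[k]:
                extend(chosen + [pos])

    extend([])
    return results


sequence = "g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a".split(">")
for occurrence in all_occurrences(sequence, ["h", "g", "h"], 4):
    print(occurrence)
```

Any single tuple printed here is enough for the pattern to count as matching the sequence.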

Hope that this is clear!



Edited 3 time(s). Last edit at 07/31/2017 03:27AM by webmasterphilfv.

Re: Max Gap Constraint in VGEN Algorithm
Posted by: Andrea
Date: July 31, 2017 03:48AM

Understood, thanks a lot! :-)
