Max Gap Constraint in VGEN Algorithm

Andrea
Date: July 31, 2017 02:56AM

Hello to everyone,

I am currently using VGEN as a sub-algorithm inside another learner, for research purposes.

Recently, I have been adding support to my learner for gaps between itemsets in sequences, so I am relying on the "max gap" parameter of the VGEN algorithm.

The documentation states: If "max gap" is set to N, a gap of N-1 itemsets is allowed between two consecutive itemsets of a pattern. If the parameter is not used, by default "max gap" is set to +∞.

However, I have noticed that sometimes the results seem to be inconsistent. For example, given max gap set to 4:

SEQUENCE (> corresponds to -1): g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

PATTERN: h>g>h

I expected the pattern not to match this sequence, but according to the output of the algorithm it does.

Am I missing something here?

Any help would be greatly appreciated!

Thank you in advance,

Andrea

webmasterphilfv
Date: July 31, 2017 03:25AM

Hello Andrea,

In the above example, the pattern h>g>h can match with the sequence as follows:

g>g>h>i>h>h>h>h>**h**>g>g>**g**>g>i>**h**>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

Why does it match? If we set max gap = 4, it means that we allow skipping up to 3 itemsets between two consecutive letters in the pattern h > g > h.

If we consider the first two letters of the pattern h > g > h, the gap is equal to 2 because there are two itemsets between the first h and the g:

g>g>h>i>h>h>h>h>**h***>g>g>***g**>g>i>**h**>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

This does not exceed the limit of 3 skipped itemsets, so the max gap constraint is not violated.

Now let's check the last two letters of the pattern h > g > h. The gap is equal to 2 because there are two itemsets between the g and the second h:

g>g>h>i>h>h>h>h>**h**>g>g>**g***>g>i>***h**>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

Again, this does not exceed the limit of 3 skipped itemsets, so the max gap constraint is not violated.

Because the max gap constraint is not violated for any two consecutive items of the pattern h > g > h in that sequence, the pattern h > g > h is said to match that sequence.
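
As a rough illustration (this is not SPMF's actual code), the two gaps above can be recomputed from the positions of the matched letters. The sketch assumes every itemset holds a single item and counts positions from 0; the helper name `skipped_itemsets` is made up for this example.

```python
# Sequence from the question, one single-item itemset per position (0-based).
sequence = "g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a".split(">")

def skipped_itemsets(positions):
    """Count the itemsets skipped between each pair of consecutive matched positions."""
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

max_gap = 4                # parameter value used in the question
positions = [8, 11, 14]    # the h, g, h highlighted in the first match above

# Sanity check: the chosen positions really spell the pattern h > g > h.
assert [sequence[p] for p in positions] == ["h", "g", "h"]

gaps = skipped_itemsets(positions)
print(gaps)                                  # [2, 2]
print(all(g <= max_gap - 1 for g in gaps))   # True: at most 3 skipped itemsets each
```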

Note that there might be more than one way for the pattern h > g > h to match that sequence. For example, another way is:

g>g>h>i>h>h>h>**h**>h>g>g>**g**>g>i>**h**>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a

But we only need one match to say that the pattern matches the sequence.
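
To check this kind of case by hand, here is a minimal sketch of a matcher that stops at the first valid match. It is only an illustration under the same simplifying assumptions (single-item itemsets, 0-based positions) and not how VGEN is implemented; the function name `find_match` is hypothetical.

```python
def find_match(sequence, pattern, max_gap):
    """Return the positions of one match of pattern in sequence, or None.

    With max gap = N, at most N - 1 itemsets may be skipped between two
    consecutive pattern items, so the next item must appear within the
    N positions that follow the previous one.
    """
    def extend(prev, k):
        if k == len(pattern):
            return []                        # every pattern item is placed
        stop = min(len(sequence), prev + max_gap + 1)
        for pos in range(prev + 1, stop):
            if sequence[pos] == pattern[k]:
                rest = extend(pos, k + 1)    # backtrack if this choice fails later
                if rest is not None:
                    return [pos] + rest
        return None

    # The first pattern item may start anywhere in the sequence.
    for first, item in enumerate(sequence):
        if item == pattern[0]:
            rest = extend(first, 1)
            if rest is not None:
                return [first] + rest
    return None

sequence = "g>g>h>i>h>h>h>h>h>g>g>g>g>i>h>i>h>l>n>m>o>o>n>g>d>b>a>a>a>a".split(">")
pattern = ["h", "g", "h"]

print(find_match(sequence, pattern, 4))   # [6, 10, 14]
print(find_match(sequence, pattern, 3))   # [8, 11, 14]
print(find_match(sequence, pattern, 2))   # None
```

With max gap = 4 the search returns yet another valid match (3 skipped itemsets on each side), and with max gap = 2 there is no match at all, so the pattern would only be rejected for this sequence under a stricter setting.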

Hope that this is clear!

Andrea
Date: July 31, 2017 03:48AM

Understood, thanks a lot! :-)