This forum is about

Order of items in itemset matters?

Posted by:
**
Victor
**

Date: June 07, 2018 12:24PM

Hey all,

I am experimenting a SPM algorithm: http://www.philippe-fournier-viger.com/spmf/SequentialPatterns_TimeConstraintsValuesItems.php

And notice that order of items in itemset matters.

E.g itemset {1,2,3} and itemset (2,1,3} will generate different frequent sequences.

Itemset {1,2,3} will generate {1,2} as subsequence while itemset (2,1,3} will generate {2,1}.

Previously, I used an R package called arulesequence and it will sort items in itemset internally. E.g {1,2,3} and {2,1,3} will both become {1,2,3}.

I just want to double check if this is truly the case for SPMF?

Thanks,

Victor

I am experimenting a SPM algorithm: http://www.philippe-fournier-viger.com/spmf/SequentialPatterns_TimeConstraintsValuesItems.php

And notice that order of items in itemset matters.

E.g itemset {1,2,3} and itemset (2,1,3} will generate different frequent sequences.

Itemset {1,2,3} will generate {1,2} as subsequence while itemset (2,1,3} will generate {2,1}.

Previously, I used an R package called arulesequence and it will sort items in itemset internally. E.g {1,2,3} and {2,1,3} will both become {1,2,3}.

I just want to double check if this is truly the case for SPMF?

Thanks,

Victor

Posted by:
**
webmasterphilfv
**

Date: June 07, 2018 05:42PM

Hi Victor,

Yes, in sequential pattern mining the order in itemsets should not be important. However, for practical purposes, all itemsets should be sorted according to some order in your input file, as explained in the documentation:

That order can be any order such as increasing order or decreasing order but it must be the same for all itemsets in your file.

Now, having said that, I want to also tell you that this particular algorithm for valued items may not alway generate the same result. The reason is that this algorithm applies k-means internally. Since k-means is a randomized algorithm, each time your run it, it may generate different clusters. Thus for this reason, if you apply this algorithm several times, you may get different results, even with the same input file.

Best,

Yes, in sequential pattern mining the order in itemsets should not be important. However, for practical purposes, all itemsets should be sorted according to some order in your input file, as explained in the documentation:

Quote

Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.

That order can be any order such as increasing order or decreasing order but it must be the same for all itemsets in your file.

Now, having said that, I want to also tell you that this particular algorithm for valued items may not alway generate the same result. The reason is that this algorithm applies k-means internally. Since k-means is a randomized algorithm, each time your run it, it may generate different clusters. Thus for this reason, if you apply this algorithm several times, you may get different results, even with the same input file.

Best,

Posted by:
**
Victor
**

Date: June 10, 2018 08:10PM

Thanks for replying!

My current way of defining itemset is that for a sequence of events that a customer took in the history, an itemset is the events happened in the same day. So in the final frequent sequence, I am able to know this sequence covers how many days. But I also want to reserve the original order of events in an itemset. So what if I reserve the order in an itemset anyway, will this cause some problem? Or destroy the robustness of algorithm?

And for your last comment about randomized result, is this just for valued items, which mean item is like A{3}, B{4}? (referred from your paper)

If my dataset only contains A, B and no values following, will the result be fixed?

Since I ran your sample data several times and found the results are the same.

Best,

Victor

My current way of defining itemset is that for a sequence of events that a customer took in the history, an itemset is the events happened in the same day. So in the final frequent sequence, I am able to know this sequence covers how many days. But I also want to reserve the original order of events in an itemset. So what if I reserve the order in an itemset anyway, will this cause some problem? Or destroy the robustness of algorithm?

And for your last comment about randomized result, is this just for valued items, which mean item is like A{3}, B{4}? (referred from your paper)

If my dataset only contains A, B and no values following, will the result be fixed?

Since I ran your sample data several times and found the results are the same.

Best,

Victor

Posted by:
**
webmasterphilfv
**

Date: June 12, 2018 02:56AM

Dear Victor,

> My current way of defining itemset is that for a

> sequence of events that a customer took in the

> history, an itemset is the events happened in the

> same day. So in the final frequent sequence, I am

> able to know this sequence covers how many days.

> But I also want to reserve the original order of

> events in an itemset. So what if I reserve the

> order in an itemset anyway, will this cause some

> problem? Or destroy the robustness of algorithm?

>

Yes, in sequential pattern mining, there is the assumption that there is no order between the items in an itemset appearing in a sequence. In order words, it is assume that items in an itemset occur at the same time or are simultaneous.

In SPMF, many algorithms further assume that all itemsets are sorted according to some total order such as the increasing order of items, or any other order. But this order must always be the same. The reason for this is that the algorithm will follow that order to explore the search space of all possible patterns. If there is no total order on items in itemsets, then it is possible that some patterns will not be found, or that the support will be incorrectly calculated. For example, consider this database of two sequences:

Sequence 1: (a b)(c) -- This means that A B appeared at the same time and were followed by C

Sequence 2: (a b) (c)

In that database, the pattern (a b） appears two times and has a support of 2.

Now consider another database:

Sequence 1: (b a )(c)

Sequence 2: (a b) (c)

This database violates the assumption that itemsets are sorted according to some order for processing purposes. In that case, if you apply PrefixSpan, it would probably find that the pattern (b a) has a support of 1 and that the pattern (a b) has a support of 1. This is obviously incorrect.

So that order between items for processing purposes is quite important.If you do not always use the same order inside the itemsets, it is possible that you miss some patterns or that the support is incorrectly calculated in the above example.

Now, there are several algorithms in SPMF. I cannot remember the behavior of each one of them. But I think that for most of them, that problem would occurs because all algorithms typically follow some order to process items in an itemset.

So, if you want to keep the order between items for some application, maybe you can find another way to do it. You could for example, save the information about the order in a separated file and keep it for the purpose of your application, and then give a file where itemsets are sorted to SPMF?

>

> And for your last comment about randomized result,

> is this just for valued items, which mean item is

> like A{3}, B{4}? (referred from your paper)

> If my dataset only contains A, B and no values

> following, will the result be fixed?

> Since I ran your sample data several times and

> found the results are the same.

Yes, you are right. It is only when using the valued items that K-Means will be used and that the results will be randomized.

Best,

Philippe

> My current way of defining itemset is that for a

> sequence of events that a customer took in the

> history, an itemset is the events happened in the

> same day. So in the final frequent sequence, I am

> able to know this sequence covers how many days.

> But I also want to reserve the original order of

> events in an itemset. So what if I reserve the

> order in an itemset anyway, will this cause some

> problem? Or destroy the robustness of algorithm?

>

Yes, in sequential pattern mining, there is the assumption that there is no order between the items in an itemset appearing in a sequence. In order words, it is assume that items in an itemset occur at the same time or are simultaneous.

In SPMF, many algorithms further assume that all itemsets are sorted according to some total order such as the increasing order of items, or any other order. But this order must always be the same. The reason for this is that the algorithm will follow that order to explore the search space of all possible patterns. If there is no total order on items in itemsets, then it is possible that some patterns will not be found, or that the support will be incorrectly calculated. For example, consider this database of two sequences:

Sequence 1: (a b)(c) -- This means that A B appeared at the same time and were followed by C

Sequence 2: (a b) (c)

In that database, the pattern (a b） appears two times and has a support of 2.

Now consider another database:

Sequence 1: (b a )(c)

Sequence 2: (a b) (c)

This database violates the assumption that itemsets are sorted according to some order for processing purposes. In that case, if you apply PrefixSpan, it would probably find that the pattern (b a) has a support of 1 and that the pattern (a b) has a support of 1. This is obviously incorrect.

So that order between items for processing purposes is quite important.If you do not always use the same order inside the itemsets, it is possible that you miss some patterns or that the support is incorrectly calculated in the above example.

Now, there are several algorithms in SPMF. I cannot remember the behavior of each one of them. But I think that for most of them, that problem would occurs because all algorithms typically follow some order to process items in an itemset.

So, if you want to keep the order between items for some application, maybe you can find another way to do it. You could for example, save the information about the order in a separated file and keep it for the purpose of your application, and then give a file where itemsets are sorted to SPMF?

>

> And for your last comment about randomized result,

> is this just for valued items, which mean item is

> like A{3}, B{4}? (referred from your paper)

> If my dataset only contains A, B and no values

> following, will the result be fixed?

> Since I ran your sample data several times and

> found the results are the same.

Yes, you are right. It is only when using the valued items that K-Means will be used and that the results will be randomized.

Best,

Philippe

Posted by:
**
Victor
**

Date: June 14, 2018 11:31AM

Gotcha! Thanks for your detailed reply!

Posted by:
**
Victor
**

Date: June 14, 2018 12:40PM

Hope this won't bother you much! I just found some more questions about this algorithm.

1. In the document of Fournier08-Closed+time algorithm: http://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php,

for**maximum time interval**, you mentioned: *Note : It is recommended to set max_whole_interval to a very large value because if it is used with closed sequential pattern mining, the algorithm become approximate and may not return all closed itemsets respecting the time constraint (the reason is that this constraint is not compatible with the "backScan pruning" of the BIDE+ algorithm).*

So how large this large should be set to make it exact not approximate algorithm? Does it depends on data size?

2. For the input data, should timestamp always start with timestamp 0? Or can I pass in a sequence like: <4> 48 <7> 18 23 31 45 48 -2? And the algorithm will convert first timestamp to 0 internally?

3. Now we know sequence ids for output patterns. But since the timestamps in output patterns are relative, it seems pretty hard to pinpoint the exact timestamps of certain pattern in the original sequence for each seq_id?

Thanks,

Victor

1. In the document of Fournier08-Closed+time algorithm: http://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php,

for

So how large this large should be set to make it exact not approximate algorithm? Does it depends on data size?

2. For the input data, should timestamp always start with timestamp 0? Or can I pass in a sequence like: <4> 48 <7> 18 23 31 45 48 -2? And the algorithm will convert first timestamp to 0 internally?

3. Now we know sequence ids for output patterns. But since the timestamps in output patterns are relative, it seems pretty hard to pinpoint the exact timestamps of certain pattern in the original sequence for each seq_id?

Thanks,

Victor

Posted by:
**
webmasterphilfv
**

Date: June 14, 2018 05:19PM

Hi Victor,

I will answer your question below.

> So how large this large should be set to make it

> exact not approximate algorithm? Does it depends

> on data size?

The problem with the maximal time interval constraint is that if you apply this constraint when doing closed sequential pattern mining, you may miss some patterns. If you don't care about missing a few patterns maybe, you can still use that constraint.

Now, how to avoid this? The solution is to not use that constraint. If you set the maximal time interval to a value that is larger than the time length of all your sequences (for example 1 million), it will be as if this constraint is not used. In that case, there will be no problem.

If you do closed sequential pattern mining and set the maximal time interval to a small value, then you may miss patterns. And yes, it depends on your data.

Otherwise, if you want to use that constraint and not miss any patterns, you can do sequential pattern mining but not find the closed patterns. Then, you will not miss patterns and be able to use that constraint without missing patterns.

>

> 2. For the input data, should timestamp always

> start with timestamp 0? Or can I pass in a

> sequence like: <4> 48 <7> 18 23 31 45 48 -2? And

> the algorithm will convert first timestamp to 0

> internally?

It does not matter because the algorithm calculates the relative time between the itemsets. So you could start from 0, 1 or any other numbers, and I think that it should work.

>

> 3. Now we know sequence ids for output patterns.

> But since the timestamps in output patterns are

> relative, it seems pretty hard to pinpoint the

> exact timestamps of certain pattern in the

> original sequence for each seq_id?

>

Yes, you could take the file containing the patterns and sequence IDs, and then do a post-processing step where you would compare the patterns with the sequences to get the real time stamps. I think if you write a small program to do that, it should not be too difficult.

I will answer your question below.

> So how large this large should be set to make it

> exact not approximate algorithm? Does it depends

> on data size?

The problem with the maximal time interval constraint is that if you apply this constraint when doing closed sequential pattern mining, you may miss some patterns. If you don't care about missing a few patterns maybe, you can still use that constraint.

Now, how to avoid this? The solution is to not use that constraint. If you set the maximal time interval to a value that is larger than the time length of all your sequences (for example 1 million), it will be as if this constraint is not used. In that case, there will be no problem.

If you do closed sequential pattern mining and set the maximal time interval to a small value, then you may miss patterns. And yes, it depends on your data.

Otherwise, if you want to use that constraint and not miss any patterns, you can do sequential pattern mining but not find the closed patterns. Then, you will not miss patterns and be able to use that constraint without missing patterns.

>

> 2. For the input data, should timestamp always

> start with timestamp 0? Or can I pass in a

> sequence like: <4> 48 <7> 18 23 31 45 48 -2? And

> the algorithm will convert first timestamp to 0

> internally?

It does not matter because the algorithm calculates the relative time between the itemsets. So you could start from 0, 1 or any other numbers, and I think that it should work.

>

> 3. Now we know sequence ids for output patterns.

> But since the timestamps in output patterns are

> relative, it seems pretty hard to pinpoint the

> exact timestamps of certain pattern in the

> original sequence for each seq_id?

>

Yes, you could take the file containing the patterns and sequence IDs, and then do a post-processing step where you would compare the patterns with the sequences to get the real time stamps. I think if you write a small program to do that, it should not be too difficult.

Posted by:
**
Victor
**

Date: June 15, 2018 06:48AM

Cool!

Since I only want to use min_time_interval and max_time_interval but not min_whole_interval and max_whole_interval, so I can still get the whole list of closed sequences if I set max_whole_interval to a larger value than my whole time range.

Thanks for your reply!

BTW, there is a tiny typo in the document for this algorithm (http://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php)

*For instance, the last pattern S7 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern ***S16** is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.

Here it should be S6 instead of S16

Since I only want to use min_time_interval and max_time_interval but not min_whole_interval and max_whole_interval, so I can still get the whole list of closed sequences if I set max_whole_interval to a larger value than my whole time range.

Thanks for your reply!

BTW, there is a tiny typo in the document for this algorithm (http://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php)

Here it should be S6 instead of S16

Posted by:
**
webmasterphilfv
**

Date: June 15, 2018 08:52AM

Great. I will fix the error in the documentation. Thanks for reporting it.

Posted by:
**
Victor
**

Date: June 15, 2018 11:15AM

Well.. I don't know if my understanding of **max_time_interval** is wrong but I try to run **java -jar spmf.jar run Fournier08-Closed+time spmf_input_ex.txt output1.txt 60% 0 5 0 1000000 true** in CML, where spmf_input_ex.txt is:

*<10> 42 45 <11> 31 42 45 <20> 18 23 31 42 45 <36> 48 -2
*

<10> 2 3 4 5 48 <11> 48 -2

<88> 2 3 4 5 6 <89> 48 -2

My value for**max_time_interval** is 5, so I expect to see <0> 2 3 4 5 <1> 48 as a frequent pattern with support of 2 (should appear in last two sequences). But the result does't contain this pattern:

*<0> 2 48 -1 #SUP: 2 #SID: 1 2
*

<0> 2 5 48 -1 #SUP: 2 #SID: 1 2

<0> 2 4 48 -1 #SUP: 2 #SID: 1 2

<0> 2 4 5 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 5 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 4 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 4 5 48 -1 #SUP: 2 #SID: 1 2

<0> 48 -1 #SUP: 3 #SID: 0 1 2

What's the reason?

Also, will <0> 1 2 <1> 3 4 and <0> 1 2 <2> 3 4 as different patterns?

Thanks,

Victor

<10> 2 3 4 5 48 <11> 48 -2

<88> 2 3 4 5 6 <89> 48 -2

My value for

<0> 2 5 48 -1 #SUP: 2 #SID: 1 2

<0> 2 4 48 -1 #SUP: 2 #SID: 1 2

<0> 2 4 5 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 5 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 4 48 -1 #SUP: 2 #SID: 1 2

<0> 2 3 4 5 48 -1 #SUP: 2 #SID: 1 2

<0> 48 -1 #SUP: 3 #SID: 0 1 2

What's the reason?

Also, will <0> 1 2 <1> 3 4 and <0> 1 2 <2> 3 4 as different patterns?

Thanks,

Victor

Posted by:
**
webmasterphilfv
**

Date: June 15, 2018 07:54PM

Hi,

1) the likely reason is that the input format is not correct.

At the end of each itemset, there should be a -1 to separate. For example, the first sequence should be in this format:

<10> 42 45**-1** <11> 31 42 45 **-1** <20> 18 23 31 42 45 **-1** <36> 48 **-1** -2

It is the same for the other sequences.

2) Yes, if you have a pattern:

<0> 1 2 <1> 3 4

it will be considered as different from:

<0> 1 2 <2> 3 4

Edited 2 time(s). Last edit at 06/15/2018 07:56PM by webmasterphilfv.

1) the likely reason is that the input format is not correct.

At the end of each itemset, there should be a -1 to separate. For example, the first sequence should be in this format:

<10> 42 45

It is the same for the other sequences.

2) Yes, if you have a pattern:

<0> 1 2 <1> 3 4

it will be considered as different from:

<0> 1 2 <2> 3 4

Edited 2 time(s). Last edit at 06/15/2018 07:56PM by webmasterphilfv.

Posted by:
**
Victor
**

Date: June 17, 2018 07:16PM

Thanks for your reply!

So is there a closed mining algorithm implementation in your library that can add time constraints but will treat the time interval between adjacent itemsets the same as long as they pass the time constraint? (e.g <0> 1 2 <1> 3 4 and <0> 1 2 <2> 3 4 are the same and they both looks like 1 2 -1 3 4 -1)

Since I want to add time gap between two itemsets. But whether the gap is 1 or 2 doesn't matter too much. And if we treat them as different patterns, it dilutes supports of both patterns.

I previously used a c-spade algorithm implemented in R (https://www.rdocumentation.org/packages/arulesSequences/versions/0.2-19/topics/cspade), which can add time constraints and treat both <0> 1 2 <1> 3 4

and <0> 1 2 <2> 3 4 as 1 2 -1 3 4 -1. Well.. But it can't do for closed pattern mining so I am seeking for other resources.

Thanks!

Victor

So is there a closed mining algorithm implementation in your library that can add time constraints but will treat the time interval between adjacent itemsets the same as long as they pass the time constraint? (e.g <0> 1 2 <1> 3 4 and <0> 1 2 <2> 3 4 are the same and they both looks like 1 2 -1 3 4 -1)

Since I want to add time gap between two itemsets. But whether the gap is 1 or 2 doesn't matter too much. And if we treat them as different patterns, it dilutes supports of both patterns.

I previously used a c-spade algorithm implemented in R (https://www.rdocumentation.org/packages/arulesSequences/versions/0.2-19/topics/cspade), which can add time constraints and treat both <0> 1 2 <1> 3 4

and <0> 1 2 <2> 3 4 as 1 2 -1 3 4 -1. Well.. But it can't do for closed pattern mining so I am seeking for other resources.

Thanks!

Victor

Posted by:
**
webmasterphilfv
**

Date: June 17, 2018 10:55PM

Hi Victor,

I see. There is no such implementation in SPMF that does exactly that. It could be done, I think, but it would require some programming to modify the algorithm and it can be more or less complicated. If one modifies it, then it would need to check to make sure that the algorithm remains correct, and sometimes combining two ideas results in an algorithm that cannot find all the patterns. This is the case for example when combining closed sequential pattern mining with BIDE and the max-whole-interval constraint.

Best,

Philippe

I see. There is no such implementation in SPMF that does exactly that. It could be done, I think, but it would require some programming to modify the algorithm and it can be more or less complicated. If one modifies it, then it would need to check to make sure that the algorithm remains correct, and sometimes combining two ideas results in an algorithm that cannot find all the patterns. This is the case for example when combining closed sequential pattern mining with BIDE and the max-whole-interval constraint.

Best,

Philippe

Posted by:
**
Victor
**

Date: June 18, 2018 06:43AM

I see. Thanks for your reply!