The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining workshop
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Memory Errors in some sequential pattern mining algorihtms
Posted by: Cristian V
Date: October 16, 2019 06:01AM

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefixHelper(PseudoSequenceBIDE.java:153)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefixHelper(PseudoSequenceBIDE.java:172)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefixHelper(PseudoSequenceBIDE.java:172)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefixHelper(PseudoSequenceBIDE.java:172)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefixHelper(PseudoSequenceBIDE.java:172)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefixHelper(PseudoSequenceBIDE.java:172)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getAllInstancesOfPrefix(PseudoSequenceBIDE.java:133)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getFirstInstanceOfPrefixSequence(PseudoSequenceBIDE.java:201)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getIthLastInFirstApearanceWithRespectToPrefix(PseudoSequenceBIDE.java:358)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.PseudoSequenceBIDE.getIthSemiMaximumPeriodOfAPrefix(PseudoSequenceBIDE.java:309)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.checkBackScanPruning(AlgoBIDEPlus.java:93)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.recursion(AlgoBIDEPlus.java:341)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.recursion(AlgoBIDEPlus.java:344)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.recursion(AlgoBIDEPlus.java:344)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.recursion(AlgoBIDEPlus.java:344)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.recursion(AlgoBIDEPlus.java:344)
at ca.pfv.spmf.algorithms.sequentialpatterns.BIDE_and_prefixspan_with_strings.AlgoBIDEPlus.runAlgorithm(AlgoBIDEPlus.java:63)

Options: ReplyQuote
Re: Memory Errors in some sequential pattern mining algorihtms
Date: October 16, 2019 06:58AM

Hi,

That means that the algorithm ran out of memory.

This is because the search space is very large (the number of potential patterns to consider).

The search space may be large because your dataset contains some very long patterns or the sequences in your datasets are very similar. It also depends on how you set the threshold. The lower you set the threshold, the number of patterns may increase exponentially.

So, how to fix this problem:
- Use a faster algorithm. You can try CM-Clasp or CloFast for example, which may be faster than BIDE. Besides, as mentionned in the download page, the BIDE algorithm has a bug that has not been fixed. So it is a good reason to change for another algorithm.
- You can apply some constraints to reduce the search space. Some algorithms will let you set a maximum lengths for patterns or a maximum gap between itemsets of a sequential pattern. If you use such constraints, the search space and the number of patterns found will be much smaller.
- You can start with a higher minsup values that will give you some result, and then decrease it. Most likely a part of the problem is that you have used a minsup value that is too low.
- You may also transform the dataset if needed, or do sampling if the dataset is too large.

Options: ReplyQuote
Re: Memory Errors in some sequential pattern mining algorihtms
Posted by: Cristian V
Date: October 16, 2019 07:17AM

Hi Philipe,

Thanks for answering.

But BIDE+ and PrefixSpan with integers (not Strings) seems to work OK.

Can you comment with respect with BIDE bug? Is also present in Integer version or only in _with Strings package ?

Regards,
Cristian


webmasterphilfv Wrote:
-------------------------------------------------------
> Hi,
>
> That means that the algorithm ran out of memory.
>
> This is because the search space is very large
> (the number of potential patterns to consider).
>
> The search space may be large because your dataset
> contains some very long patterns or the sequences
> in your datasets are very similar. It also depends
> on how you set the threshold. The lower you set
> the threshold, the number of patterns may increase
> exponentially.
>
> So, how to fix this problem:
> - Use a faster algorithm. You can try CM-Clasp or
> CloFast for example, which may be faster than
> BIDE. Besides, as mentionned in the download page,
> the BIDE algorithm has a bug that has not been
> fixed. So it is a good reason to change for
> another algorithm.
> - You can apply some constraints to reduce the
> search space. Some algorithms will let you set a
> maximum lengths for patterns or a maximum gap
> between itemsets of a sequential pattern. If you
> use such constraints, the search space and the
> number of patterns found will be much smaller.
> - You can start with a higher minsup values that
> will give you some result, and then decrease it.
> Most likely a part of the problem is that you have
> used a minsup value that is too low.
> - You may also transform the dataset if needed, or
> do sampling if the dataset is too large.

Options: ReplyQuote
Re: Memory Errors in some sequential pattern mining algorihtms
Date: October 16, 2019 07:46AM

Hi,

BIDE+ is the only algorithm in SPMF that has a known bug that we did not fix yet. The problem is that BIDE is a very complex algorithm and hard to implement. I have implemented twice for SPMF, the second time starting completely from scratch, and after spending many days, it still has some issue. The bug happens when the input file contains sequences where itemsets have more than one item. If each itemset has only a single item, I think the result is correct. But if itemsets of a sequences contains more than one item, it is possible that some patterns may be missed by the algorithm.

I have thought for a while about fixing the BIDE algorithm but since there are other algorithms in SPMF that have the same input and output, and BIDE is quite complicated to debug, I think I will eventually remove BIDE from SPMF rather than fixing it. That is my plan, but I have not removed BIDE yet.... But since my implementation of BIDE has no bug for sequences of items rather than sequences of itemsets, I may perhaps keep that part of the implementation and remove the ability to apply BIDE on sequence of itemsets in the future.

In any case, I dont think that I will remove BIDE soon, so dont worry. You can still use it if you want ;-)

Best regards,

Philippe



Edited 1 time(s). Last edit at 10/16/2019 07:48AM by webmasterphilfv.

Options: ReplyQuote
Re: Memory Errors in some sequential pattern mining algorihtms
Date: October 16, 2019 07:49AM

To answer the other question:

The bug is most likely in both version of BIDE.

The _string version of BIDE is based on my first implementation.

The integer version of BIDE that is used by default in SPMF is based on my second implementation of BIDE that I have done from scratch.

The second version should be more efficient because I have done a better implementation with some additional optimizations. However, I think both of them have some issues with respect to the final result.

I have spent a lot of time on that algorithm. For the first implementation, more than 1 month at the time... and the second implementation, perhaps a week...BIDE is a very complex algorithm when it comes to implementating it for sequences of itemsets. And it also quite complex to debug...



Edited 3 time(s). Last edit at 10/16/2019 07:52AM by webmasterphilfv.

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.