The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Posted by: john
Date: July 28, 2017 07:22AM

my training data set is as below:

--- training sequences ---
Number of sequences: 10,000,000
Number of distinct items: 1,000,000
Itemsets per sequence: 20
Distinct items per sequence: 17
Occurrences for each item: 1.13
Size of the dataset in MB: 3.7E7

with String optionalParameters = "CCF:true CBS:true CCFmin:1 CCFmax:6 CCFsup:2 splitMethod:5 minPredictionRatio:1.0 noiseRatio:1.0";

java -Xmx40960m -cp src ca.pfv.spmf.test.MainTestCPTPlus

I am getting the following error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181)
at java.util.ArrayList.grow(ArrayList.java:261)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
at java.util.ArrayList.add(ArrayList.java:458)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.FIFRaw.findFrequentItemsets(FIFRaw.java:84)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.CPTPlusPredictor.Train(CPTPlusPredictor.java:132)
at ca.pfv.spmf.test.MainTestCPTPlus.main(MainTestCPTPlus.java:58)

Any suggestion for processing such dataset using CPT+ algorithm?

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Date: July 28, 2017 07:47AM

Hello,

It means that you are running out of memory. I think there are a few ways to solve this problem:
1) Use only a subset of the dataset. Perhaps 100,000 or 1 million sequences is enough to perform accurate prediction, and having more sequences may not necessarily increase the prediction accuracy much. You can start with part of the dataset and increase its size until the algorithm runs out of memory, to see whether the accuracy actually increases with the number of sequences.
2) Increase the amount of memory available on your computer. You can also increase the RAM allowed for the Java Virtual Machine by using the -Xmx parameter when running the algorithm.
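If you want to try suggestion 1), the subsampling does not need any SPMF code: just keep the first N lines of the training file. A minimal sketch of that idea (the toy sequences below are placeholders for the lines of your actual file, which you would read with Files.readAllLines):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SubsampleDataset {

    // Keep only the first 'limit' sequences (lines) of the training data.
    public static List<String> takeFirst(List<String> sequences, int limit) {
        return sequences.stream().limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy stand-in for the lines of a sequence file in SPMF format
        // (items separated by -1, sequences terminated by -2).
        List<String> sequences = List.of("1 2 3 -1 -2", "4 5 -1 -2", "6 7 8 -1 -2");
        List<String> subset = takeFirst(sequences, 2);
        System.out.println(subset.size() + " of " + sequences.size() + " sequences kept");
    }
}
```

You would then write the subset back to a file and point MainTestCPTPlus at it, increasing the limit until memory or accuracy plateaus.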

But actually, I see from the stack trace that the algorithm runs out of memory while performing the CCF optimization.

CCF is an optimization that finds frequent subsequences and uses them to compress the model (tree). If the CCF parameters are not set properly, CCF can run for a very long time and run out of memory. This is what is happening when you run CPT+.

To solve this issue, you could first try setting CCF:false to see if it solves the problem. This will deactivate the CCF strategy. If it works, then the problem is that the CCF strategy finds too many patterns and runs out of memory. In that case, you could reactivate CCF with different parameters.

I think that CCFmax:6 should be set to no more than 3 or 4. Also, the value of CCFsup is far too low. It is currently set to 2, which means that anything appearing in at least two sequences out of 10,000,000 is considered a frequent pattern. This is one of the reasons why the CCF strategy runs out of memory. CCFsup should probably be set to some large value, perhaps 100,000 or 1,000,000, rather than 2.
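One way to sanity-check CCFsup is to derive it from a relative minimum support instead of picking an absolute count directly. A rough sketch (the 1% threshold here is just an illustration, not a recommendation):

```java
public class SupportThreshold {

    // Convert a relative minimum support (e.g. 0.01 = 1%) into the
    // absolute sequence count expected by an absolute parameter like CCFsup.
    public static long absoluteSupport(long sequenceCount, double relativeSupport) {
        return (long) Math.ceil(sequenceCount * relativeSupport);
    }

    public static void main(String[] args) {
        long n = 10_000_000L;
        // CCFsup:2 corresponds to a relative support of 2e-7, i.e. 0.00002%.
        System.out.println("CCFsup:2 as relative support = " + (2.0 / n));
        // A 1% relative support would instead give CCFsup:100000.
        System.out.println("1% of " + n + " sequences = " + absoluteSupport(n, 0.01));
    }
}
```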

I think that you could try these different ideas to see if it helps.

Besides, you could also split your sequences to reduce the model size. This can be done by setting the split length with the parameter splitLength:4, which keeps only 4 items per sequence. This can greatly reduce the size of the model. You can also try different values for that parameter. It should help.

Best regards,



Edited 1 time(s). Last edit at 07/28/2017 07:48AM by webmasterphilfv.

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Posted by: john
Date: July 31, 2017 05:01AM

Thank you for the suggestions. I tried a couple of options, but without success. I would still prefer to use the full data set rather than a 10% sample with CPT+. Any suggestions are appreciated.

1) I tested with CCF:false and 40 GB of RAM for the JVM, and I am getting a "java.lang.NullPointerException" error.
String optionalParameters = "CCF:false CBS:true CCFmin:1 CCFmax:4 CCFsup:10000 splitMethod:5 minPredictionRatio:1.0 noiseRatio:1.0";


---------------------
Exception in thread "main" java.lang.NullPointerException
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.profile.Profile.paramInt(Profile.java:37)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.Paramable.paramInt(Paramable.java:61)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.CPTPlusPredictor.Train(CPTPlusPredictor.java:146)
at ca.pfv.spmf.test.MainTestCPTPlus.main(MainTestCPTPlus.java:58)


2) With CCFmax:4, I still get an OutOfMemoryError.

String optionalParameters = "CCF:true CBS:true CCFmin:1 CCFmax:4 CCFsup:10000 splitMethod:5 minPredictionRatio:1.0 noiseRatio:1.0";
java -Xmx40960m -cp src ca.pfv.spmf.test.MainTestCPTPlus

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap.newNode(HashMap.java:1742)
at java.util.HashMap.putVal(HashMap.java:641)
at java.util.HashMap.put(HashMap.java:611)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.FIFRaw.findFrequentItemsets(FIFRaw.java:94)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.CPTPlusPredictor.Train(CPTPlusPredictor.java:132)
at ca.pfv.spmf.test.MainTestCPTPlus.main(MainTestCPTPlus.java:58)

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Date: July 31, 2017 05:29AM

1) For the null pointer exception, the problem is that CPT+ is trying to read the parameter "splitLength" but it is not defined. To fix this error, you need to set the split length like this:

splitLength:4

This will solve the null pointer exception. It tells CPT+ to split each sequence and keep only its last 4 items, which makes the model smaller.

When you set splitMethod:5, you must also set the splitLength parameter; otherwise it generates a null pointer exception.
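The effect of splitLength:4 can be illustrated in isolation: each training sequence is truncated to its last 4 items before being inserted into the model. A sketch of that truncation (not SPMF's actual implementation):

```java
import java.util.List;

public class SplitLengthDemo {

    // Keep only the last 'splitLength' items of a sequence,
    // mimicking the effect of the splitLength parameter.
    public static List<Integer> keepLast(List<Integer> sequence, int splitLength) {
        int from = Math.max(0, sequence.size() - splitLength);
        return sequence.subList(from, sequence.size());
    }

    public static void main(String[] args) {
        List<Integer> seq = List.of(1, 2, 3, 4, 5, 6, 7);
        System.out.println(keepLast(seq, 4)); // prints [4, 5, 6, 7]
    }
}
```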

2) In that case, the problem may be that CCFsup:10000 is still too low. How to set that parameter really depends on the characteristics of your data; it is dataset specific. For example, if you have 1,000,000 distinct items in your database and you set CCFmax:4, there could be about 1,000,000 to the power of 4 frequent subsequences of 4 items or less in the worst case, which might make your computer run out of memory even with 40 GB. So I think you should consider setting that parameter very high and then decreasing it gradually.
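That worst-case figure can be checked in a couple of lines: with m distinct items and patterns of length at most CCFmax, there are on the order of m^CCFmax candidate subsequences.

```java
import java.math.BigInteger;

public class WorstCaseCount {

    // Upper bound on the number of subsequences of length exactly k
    // over an alphabet of m distinct items: m^k.
    public static BigInteger candidates(long m, int k) {
        return BigInteger.valueOf(m).pow(k);
    }

    public static void main(String[] args) {
        // 1,000,000 items with CCFmax:4 -> 10^24 candidates of length 4.
        System.out.println(candidates(1_000_000L, 4));
    }
}
```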

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Posted by: john
Date: July 31, 2017 09:25AM

1) I added splitLength:4. The null pointer error is gone, but now I get a Java heap space error.

java -Xmx40960m -cp src ca.pfv.spmf.test.MainTestCPTPlus


String optionalParameters = "CCF:false CBS:true CCFmin:1 CCFmax:4 CCFsup:10000 splitMethod:5 splitLength:4 minPredictionRatio:1.0 noiseRatio:1.0";


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at java.util.BitSet.ensureCapacity(BitSet.java:337)
at java.util.BitSet.expandTo(BitSet.java:352)
at java.util.BitSet.set(BitSet.java:447)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.Bitvector.setBit(Bitvector.java:90)
at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.CPTPlusPredictor.Train(CPTPlusPredictor.java:172)
at ca.pfv.spmf.test.MainTestCPTPlus.main(MainTestCPTPlus.java:58)

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Date: July 31, 2017 01:24PM

at ca.pfv.spmf.algorithms.sequenceprediction.ipredict.predictor.CPT.CPTPlus.CPTPlusPredictor.Train(CPTPlusPredictor.java:172)

That means that the model runs out of memory during the training phase. So, as I said, one possibility is to first try with a subset of all sequences.

Regarding memory, the CPT model has to keep a few structures in memory: (1) the prediction tree, (2) the inverted index, and (3) a lookup table.

The inverted index has a size of about n * m bits, where n is the number of sequences and m is the number of distinct items. You have n = 10,000,000. I don't know the value of m, but if it is something very large like 1,000,000, it can use a huge amount of memory.
The size of the tree is also important.
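That n * m bit estimate can be turned into bytes to see the scale of the problem, taking m = 1,000,000 as an assumed value from the dataset description above:

```java
public class IndexSizeEstimate {

    // Rough size of an n x m bit matrix, in gigabytes (10^9 bytes).
    public static double sizeInGB(long n, long m) {
        double bits = (double) n * (double) m;
        return bits / 8.0 / 1e9;
    }

    public static void main(String[] args) {
        long n = 10_000_000L; // sequences
        long m = 1_000_000L;  // distinct items (assumed)
        // 10^13 bits = 1.25 * 10^12 bytes, i.e. 1250 GB -- far beyond 40 GB of heap.
        System.out.println(sizeInGB(n, m) + " GB");
    }
}
```

This back-of-the-envelope number suggests the full dataset simply cannot fit in 40 GB with this representation, regardless of parameter tuning.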

Perhaps some memory optimizations could also be done in the Java code to reduce the size of the CPT model.

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Posted by: john
Date: August 01, 2017 07:09AM

Thanks for the explanation. Is there any sequence prediction algorithm in the SPMF library, or another library, that can handle a large data set (10 million rows with about 1 million distinct items)? As I mentioned, I prefer using the full data set.

Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Date: August 01, 2017 12:41PM

I think you could try the smaller models such as DG and PPM. I guess that these models could handle a lot of sequences and items more easily, because they do not keep the full sequences in a tree. For TDAG, I don't remember how it works, but maybe it could also be used.

A model that would probably not work is AKOM, as it has an exponential size when the order "k" is increased.

Best,

Philippe



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.