Re: CPT+ large data set, Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Date: July 28, 2017 07:47AM
It means that you are running out of memory. I think there are different ways of solving this problem:
1) Use only a subset of the dataset. Perhaps 100,000 or 1 million sequences is enough to perform accurate prediction, and having more sequences may not necessarily increase the prediction accuracy much. You can start with part of the dataset and increase its size until it runs out of memory, and check whether the accuracy actually increases as you add more sequences.
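For example, here is a minimal sketch of how you could keep only the first N sequences of an input file before running CPT+ (the class and file names are just illustrations, not part of SPMF):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SubsetMaker {

    // Stream the first 'count' lines (sequences) of 'in' into 'out',
    // without loading the whole dataset into memory.
    static void firstN(Path in, Path out, int count) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            int n = 0;
            while (n < count && (line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
                n++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Small demonstration with temporary files.
        Path in = Files.createTempFile("sequences", ".txt");
        Files.write(in, List.of("1 -1 2 -1 -2", "3 -1 4 -1 -2", "5 -1 -2"));
        Path out = Files.createTempFile("subset", ".txt");
        firstN(in, out, 2);
        System.out.println(Files.readAllLines(out).size()); // prints 2
    }
}
```

You can then run the subset file with CPT+, and repeat with larger values of N to see how the accuracy evolves.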
2) Increase the amount of memory available to the program. By default, the Java Virtual Machine may not use all the RAM of your computer, so you can increase the maximum heap size allowed for the Java Virtual Machine by using the -Xmx parameter when running the algorithm.
But actually, I see from the error that you are getting that the algorithm ran out of memory while performing the CCF optimization.
CCF is an optimization that finds frequent subsequences and uses them to compress the model (the tree). If the parameters of CCF are not properly set, CCF can run for a very long time and run out of memory. This is what happened when you ran CPT+.
To solve this issue, you could first try setting CCF:false to see if it solves the problem. This will deactivate the CCF strategy. If it works, then the problem is that the CCF strategy finds too many patterns and runs out of memory. In that case, you could reactivate CCF but with different parameters.
I think that CCFmax:6 should be set to no more than 3 or 4. And I think that the value of CCFsup is way too low. It is currently set to 2, which means that anything appearing in at least two sequences out of 10,000,000 will be a frequent pattern. So this is one of the reasons why the CCF strategy runs out of memory. CCFsup should probably be set to some large value like 100,000 or perhaps 1,000,000, rather than 2.
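For example, instead of CCFmax:6 and CCFsup:2, the CCF-related parameters could look something like this (the exact values should be tuned on your dataset):

```
CCF:true CCFmax:3 CCFsup:100000
```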
I think that you could try these different ideas to see if they help.
Besides, you could also split your sequences to reduce the model size. This can be done by setting the split length with the parameter splitLength:4, which means keeping only 4 items per sequence. This can greatly reduce the size of the model. You can also try that parameter with different values. It should help.
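To illustrate the effect of splitting, here is a rough sketch (not SPMF's actual code, just the idea) of keeping only the last 4 items of each sequence:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitExample {

    // Keep only the last k items of a sequence,
    // which is the idea behind the splitLength parameter.
    static List<Integer> keepLastK(List<Integer> sequence, int k) {
        int from = Math.max(0, sequence.size() - k);
        return new ArrayList<>(sequence.subList(from, sequence.size()));
    }

    public static void main(String[] args) {
        List<Integer> sequence = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(keepLastK(sequence, 4)); // prints [4, 5, 6, 7]
    }
}
```

Shorter sequences mean a smaller tree, at the cost of discarding older items of each sequence.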
Edited 1 time(s). Last edit at 07/28/2017 07:48AM by webmasterphilfv.