The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining workshop
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Estimating Amount of Heap Space Needed for FPGrowth
Posted by: k
Date: December 07, 2017 12:09AM

So I've been running FPGrowth_association_rules for ~20 hours when I ran out of heap space. The error message produced is: java.lang.OutOfMemoryError: Java heap space .

The parameters given were minsup = 0.5%, minconfidence = 50%.

The size of the original dataset is:
344 000 rows x 946 columns
There are 946 unique items.

When converted to the input format that SPMF takes, the file is ~49M,

It appears that heap size can be increased using the "-Xmx" flag when executing the algorithm. But I'm not so sure what would be a reasonable size for the heap. How would you estimate a reasonable amount for the heap space?

On another note, is there a way to estimate the running time of the algorithm? It seems like it would be the running time to construct and traverse the FP-tree, but it goes way faster when minsup is increased. For example, it is almost instantaneous when minsup >50%, around 10 minutes when minsup = 25%, and >20hours for minsup = 0.5%.

The complete error log is this:
Quote

An error while trying to run the algorithm.
ERROR MESSAGE = java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.FPTree.<init>(FPTree.java:48)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:348)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.fpgrowth(AlgoFPGrowth.java:360)
at ca.pfv.spmf.algorithms.frequentpatterns.fpgrowth.AlgoFPGrowth.runAlgorithm(AlgoFPGrowth.java:199)
at ca.pfv.spmf.algorithmmanager.descriptions.DescriptionAlgoFPGrowthAssociationRules.runAlgorithm(DescriptionAlgoFPGrowthAssociationRules.jav
a:73)
at ca.pfv.spmf.gui.CommandProcessor.runAlgorithm(CommandProcessor.java:386)
at ca.pfv.spmf.gui.Main.processCommandLineArguments(Main.java:128)
at ca.pfv.spmf.gui.Main.main(Main.java:54)

Options: ReplyQuote
Re: Estimating Amount of Heap Space Needed for FPGrowth
Posted by: Phil
Date: December 07, 2017 12:46AM

Hello,

Thanks for using SPMF.

To generate association rules, there are two steps: (1)generate the frequent itemsets (using fpgrowth for example) and (2) using these itemsets to generate the rules. The first step is more costly and is where you are having the out of memory exception.

Let me explain this. The execution time and memory usage of itemset mining algorithms like FPGrowth depends a lot on the number of itemsets in the search space, which depends on your input database and how the minsup threshold is set.

In the best case, if you set minsup to a high value, then FPGrowth will be able to avoid looking at most of the search space and will be very fast, as it will prune most of the search space containing infrequent itemsets.

Now in the worst case,if you set minsup = 0, FPGrowth may look at all the possible itemsets. If there are 946 items in your database, then there are 2^946-1 possible itemsets (5.948068e+284).In that case, the algorithm may never terminate or just run out of memory or storage space.

And if you set minsup between 0 and 100%, then FPGrowth will explore some part of the search space that may be more or less large. In general, when the minimum support is decreased, the number of patterns can increase exponentially. It can go from a few patterns to millions of patterns, and the runtime and memory usage may increase greatly.

Can we predict the amount of memory and runtime? Not really. The reason is that there are such as the nature of your database. A database having long transactions will generally be more difficult to mine then a database having short transactions as longer itemsets will need to be considered in the search space. The number of patterns in the search space also depends on how similar the transactions are in your database and the total number of items.

In general, how to set the minsup threshold depends on the database. For some databases, minsup = 0.01% may return no patterns, while for another database minsup = 0.9 may generate millions of pattterns. Thus, in general, to set that parameter, I start from a high value and gradually decrease it until I find enough patterns.

Above, I have discussed the problem of itemset mining with FPGrowth. But if you are generating association rules with FPGrowth, the search space will be even larger. Instead of 2^n-1, it will be much larger.

Thus, a solution to make it faster is to add some constraints. IN SPMF, you can use the maximum length constraint to only explore rules containing no more than a few items in the antecedent and consequent of each rule. Those are optional parameters that can make the search for rules much faster. I think you should try these optional parameters.

By the way, about FPGrowth, Yes, it creates a tree. But the cost is much more than just scanning a tree. Actually, for EACH frequent patterns in the search space, FPGrowth will build a projected tree. Thus, if there are 1 million patterns, FPGrowth will create about 1 million trees (not all of them will be kept in memory at the same time though, but each tree can use quite a large amount of memory, though).

Options: ReplyQuote
Re: Estimating Amount of Heap Space Needed for FPGrowth
Posted by: k
Date: December 07, 2017 01:23AM

Hi Phil,

Thanks for responding so fast.

The maximum length constraint seems interesting. But I am not too sure how to execute it from command line. Looking at the documentation: http://www.philippe-fournier-viger.com/spmf/FPGrowth.php there is no mention of it.

It seems like it's set on this line:
Quote

fpgrowth.setMaximumPatternLength(maxAntecedentLength + maxConsequentLength);

It also seems like setting minimum item support might also speed up the process. As far as I can tell, it seems like CFPGrowth would be ideal for databases where you would like items that do not occur as often to appear in your rules. This occurs in the database I am working on. For example, some items that I would like to appear in the rules only have 0.3% support. So would CFPGrowth include those items if I set their MIS as 0.3%?

If the above is true, it seems like the commented out code in the source for generating MIS for items would be ideal (found from http://forum.ai-directory.com/read.php?5,3297,3301 , http://forum.ai-directory.com/read.php?5,2615 , http://www.philippe-fournier-viger.com/spmf/MISApriori.pdf). Would you suggest generating MIS in such a way, or by intuitively thinking of which items are important?

Options: ReplyQuote
Re: Estimating Amount of Heap Space Needed for FPGrowth
Posted by: k
Date: December 07, 2017 01:40AM

Sorry, I spoke too soon about the command line arguments. I found how to execute via command line by the lines here:
Quote

parameters[0] = new DescriptionOfParameter("Minimum support (%)", "(e.g. 0.5 or 50%)", Double.class, false);
parameters[1] = new DescriptionOfParameter("Minimum confidence (%)", "(e.g. 0.9 or 90%)", Double.class, false);
parameters[2] = new DescriptionOfParameter("Minimum lift ", "(e.g. 1.0)", Double.class, false);
parameters[3] = new DescriptionOfParameter("Max antecedent length", "(e.g. 2 items)", Integer.class, true);
parameters[4] = new DescriptionOfParameter("Max consequent length", "(e.g. 2 items)", Integer.class, true);

I am still interested about your thoughts on CFPGrowth however.

Options: ReplyQuote
Re: Estimating Amount of Heap Space Needed for FPGrowth
Posted by: Phil
Date: December 07, 2017 01:46AM

Yes, you are right, length constraints for FPGrowth is a new feature, and I forgot to update the documentation. I will do it soon. Thanks for letting me know about this. In the mean time, here is a simple example. To use it from the commandline for generating association rules with FPGrowth, you can follow this example:

java -jar spmf.jar run FPGrowth_association_rules contextIGB.txt output.txt 50% 60% 1 2

which means:

input is : contextIGB.txt
output will be written to: output.txt
minsup = 50 %
minconf = 60 %
max antecedent length = 1
max consequent length = 2

> So would CFPGrowth include those items if I set their MIS as 0.3%?

Yes, it seems like a good solution. You could set a very low threshold for such item and CPFgrowth could include them in the rules. I assume that such items would be more likely to appear in the antecedent of a rule, since if they are rare, the confidence may be low if they appear in the consequent. But it seems like a good idea to use CFPGrowth.

For CFPGrowth, it does not support the length constraint yet, though.

Yes, I think it is better to use the code for automatically setting the thresholds for each item since setting the threshold for each item may be very time-consuming. But if you want to have more control, you could always set the threshold as you like.

Options: ReplyQuote
Re: Estimating Amount of Heap Space Needed for FPGrowth
Posted by: k
Date: December 07, 2017 03:17AM

Hi Phil,

Thanks again for all the help.

Just another question. What's a good rule for choosing max antecedent length and max consequent length?

I've currently tried 15 and 15 but ran into this error: https://stackoverflow.com/questions/1393486/error-java-lang-outofmemoryerror-gc-overhead-limit-exceeded

I'm currently trying 5 and 5.

I quickly looked up a paper (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6295) and it appears that lengths of 20-30 are not unusual. So perhaps CFPGrowth may be a better algorithm. However, since length cannot be limited it may lead to memory issues again. Do you have a general idea of how the implementation would be like for a max length CFPGrowth?

Options: ReplyQuote
Re: Estimating Amount of Heap Space Needed for FPGrowth
Posted by: Phil
Date: December 07, 2017 05:40PM

Personally, if I was looking to find association rules in data, I would try to find small rules because a too large rules may be hard to interpret or use, and may perhaps only be applied in very specific cases. So I would only look at rules containing no more than 4 or 5 items. I would probably set max antecedent to something like 3 items and max consequent to 1 or 2 items. For the consequent, some people argue that using a single item as consequent is the most useful as maybe you just want to predict a single thing at a time.

About CFPGrowth++, I think it would not be difficult to add length constraints in CFPGrowth. Since it is based on FPGrowth, it should be almost the same as adding these constraints in FPGrowth. I did not do it just because of time. This week-end I have a few urgent things to finish related to writing papers. But if you really need this feature, you could add it or I could perhaps do it for your next week.

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.