The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Temp files generated using some of the algorithms
Posted by: Solomon Ioannou
Date: November 01, 2021 05:30AM

Dear Philippe,
Firstly, allow me to say that the work put into SPMF is amazing. I am grateful to have such an open-source tool available for research.

I would like to describe an issue I am facing while running SPADE and PrefixSpan algorithms for sequential pattern mining.

When running the algorithms on a database, a .tmp file is created after some time has elapsed (I have not determined how long), in the directory and with the file name specified in the OutputFilePath argument. I was wondering why this happens and at which step of the algorithm this .tmp file is created.

The above issue kept happening while running the aforementioned algorithms on a database of 32K sequences and 3K distinct items, with a mean (SD) sequence length of 490 (459) and a median itemset size of 5. The .tmp file had grown to 3TB by the time I had to stop the algorithm.

Your help to understand why this is happening would be valuable.

Kind regards, Solomon

Re: Temp files generated using some of the algorithms
Posted by: Solomon
Date: November 10, 2021 09:03AM

With the help of a kind colleague, I was able to understand the behaviour described above.

To recap:
When calling from the command line, the identified frequent patterns are stored in the "`outputFilePath`.tmp" file, and once everything has been collected, this file is returned as the output, without the .tmp extension.
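The write-to-a-temporary-file-then-rename behaviour described above can be sketched as follows. This is only an illustration of the general pattern in Python, not SPMF's actual code: the function name `write_patterns_via_temp_file` and the one-line-per-pattern format are assumptions made for the example.

```python
import os


def write_patterns_via_temp_file(patterns, output_path):
    """Write patterns to '<output_path>.tmp', then rename it to output_path.

    `patterns` is an iterable of (pattern_text, support) pairs.
    The line format used here is hypothetical, chosen only for illustration.
    """
    tmp_path = output_path + ".tmp"
    # While mining is in progress, only the .tmp file exists on disk.
    with open(tmp_path, "w", encoding="utf-8") as handle:
        for pattern_text, support in patterns:
            handle.write(f"{pattern_text} #SUP: {support}\n")
    # Once every pattern has been written, the .tmp file becomes the output.
    os.replace(tmp_path, output_path)
    return output_path
```

If a run is interrupted before the final rename, only the partial .tmp file is left behind, which matches what was observed when the job was stopped.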

The huge .tmp file (3TB) was created when the flag `show sequence Ids` was turned on, as every sequence id that fulfilled the pattern was also saved, resulting in a file with millions of lines.

Now, I am open to solutions for the above :)

Re: Temp files generated using some of the algorithms
Date: November 24, 2021 06:42PM

Good morning,

Thanks for using SPMF! I am sorry for the late reply. I have been very busy recently at work.

Yes, the size of the output file can be very large. It depends on how many patterns exist in your data for the parameters that you have set. It also depends on how much information is stored for each pattern (e.g. if you store the sequence identifiers in the results, as you have noticed).

The number of patterns that you find depends on the parameters of the algorithm. If you are finding sequential patterns, the more you lower the minsup parameter, the more patterns you will find. If you increase the minsup threshold, you will find fewer patterns and the file will be smaller.
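To illustrate the effect of minsup, here is a minimal sketch of a PrefixSpan-style miner in Python. It is a toy, not SPMF's implementation: it assumes each sequence is a plain list of single items (no itemsets), and the function name `mine_sequential_patterns` and the four-sequence database are made up for the example.

```python
def mine_sequential_patterns(sequences, minsup):
    """Return {pattern: support} for every subsequence pattern whose
    support (number of sequences containing it) is at least minsup."""
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence index, position just after the earliest
        # match of prefix) for each sequence that contains prefix.
        counts = {}
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < minsup:
                continue  # infrequent extensions are pruned here
            pattern = prefix + (item,)
            results[pattern] = support
            next_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue  # this sequence does not contain the extension
                next_projected.append((sid, pos + 1))
            grow(pattern, next_projected)

    grow((), [(sid, 0) for sid in range(len(sequences))])
    return results


# Hypothetical toy database of four short sequences.
toy_db = [list("abcd"), list("abd"), list("acd"), list("bcd")]

for threshold in (4, 3, 2):
    found = mine_sequential_patterns(toy_db, threshold)
    print(f"minsup={threshold}: {len(found)} patterns")
```

On this toy database, lowering minsup from 4 to 3 to 2 grows the result from 1 to 7 to 13 patterns. On a real database with long sequences the growth can be far steeper.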

Also, you may consider using algorithms like CM-SPAM that allow you to set constraints on the patterns to be found. For example, with CM-SPAM you can set a maximum pattern length. Using that constraint can greatly reduce the number of patterns that you will find. In fact, for many domains, it is not necessary to find very long patterns. Algorithms like CM-SPAM also have a gap constraint. If you use the maximum gap constraint, it will also reduce the output of the algorithm!
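A rough count shows why a maximum pattern length helps so much. As a simplifying assumption, take a single sequence of n distinct single items: every non-empty subsequence is a potential pattern, so there are 2^n - 1 of them, while a length cap of k leaves only the sum of C(n, i) for i = 1..k. The function name below is hypothetical, and the figure is an upper bound for that idealized case, not a prediction for any real dataset.

```python
from math import comb


def count_possible_patterns(n_items, max_length=None):
    """Count the non-empty subsequences of one sequence of n distinct items,
    optionally keeping only those of length <= max_length."""
    if max_length is None or max_length > n_items:
        max_length = n_items
    # There are C(n, k) subsequences of length exactly k.
    return sum(comb(n_items, k) for k in range(1, max_length + 1))


print(count_possible_patterns(20))                # no length limit
print(count_possible_patterns(20, max_length=3))  # patterns of length <= 3
```

For a sequence of 20 distinct items this gives 1,048,575 possible patterns without a limit and 1,350 with a maximum length of 3. The gap constraint is not modelled in this sketch.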

So I would recommend adjusting the parameters if you are finding too many patterns. This can make a huge difference!

Best regards,

Philippe



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).