The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Association Rule Mining
Date: September 29, 2017 10:27AM
I am currently working on a project for my university.
In this project, I am building a cross-selling model with association rule mining.
As a result, I have tons of rules, but I am not sure how to rank them to find the best ones.
Which of these two options would be better?
Option 1: Confidence = 20%, Lift = 5
Option 2: Confidence = 50%, Lift = 2
I know confidence is important, but I have heard that lift is very important as well. Should I sacrifice some confidence for more lift, or keep the two balanced?
Re: Association Rule Mining
Date: September 30, 2017 05:04AM
It really depends. There are some good reasons to say that the lift is better than the confidence. But there are also some cases where the lift is not the best measure and some other measures could be used. In fact, each measure has some cases where it works well and some cases where it does not. If you want to know more about this, you can read section 6.7 of this chapter:
It shows some examples of when the confidence and the lift do not work well, and also discusses alternative measures. It is a good read to understand more about the evaluation of patterns with measures.
Personally, I would try to find rules that have both a high lift and a high confidence, to make sure that they are reliable.
If a pattern has a high lift but a low confidence, or a high confidence but a low lift, I think it is not really good.
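To make the trade-off between the two options in the question more concrete, here is a small Python sketch of the arithmetic. It assumes the usual definitions, confidence(A -> B) = support(A and B) / support(A) and lift(A -> B) = confidence(A -> B) / support(B). The function names and the counts in the last lines are made up for illustration and do not come from any library or dataset.

```python
def rule_measures(n_total, n_a, n_b, n_ab):
    """Support, confidence and lift of the rule A -> B from raw transaction counts."""
    support = n_ab / n_total             # fraction of transactions containing A and B
    confidence = n_ab / n_a              # estimate of P(B | A)
    lift = confidence / (n_b / n_total)  # P(B | A) divided by P(B)
    return support, confidence, lift


def consequent_support(confidence, lift):
    """Baseline P(B) implied by a (confidence, lift) pair, since lift = confidence / P(B)."""
    return confidence / lift


# The two options from the question, as (confidence, lift) pairs.
options = {"Option 1": (0.20, 5.0), "Option 2": (0.50, 2.0)}
for name, (conf, lift) in options.items():
    baseline = consequent_support(conf, lift)
    print(f"{name}: P(B) = {baseline:.2f}, P(B|A) = {conf:.2f}, "
          f"absolute gain = {conf - baseline:.2f}")

# Hypothetical counts that would produce a rule like Option 1.
print(rule_measures(n_total=1000, n_a=100, n_b=40, n_ab=20))
```

Under these definitions, Option 1 is a rule about a consequent that appears in only 4% of all transactions, while Option 2 is about a consequent that appears in 25% of them. So the higher lift of Option 1 mostly reflects how rare its consequent is, which is one reason to look at both measures together.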
Hope that this helps a little bit.
Re: Association Rule Mining
Date: September 30, 2017 08:01AM
Glad you like the tool.
In terms of file format, the format is explained in the documentation of each algorithm. Moreover, the ARFF format, which is used by some other data mining software, is also supported for some algorithms.
However, in general, there is a strict file format that must be used. So if you want to use some algorithm, you likely need to transform your data to the proper format. This is necessary because I simply cannot support all possible file formats. So the decision for this software has been to focus on the algorithms rather than offering a lot of preprocessing tools.
But there are some preprocessing tools in SPMF for some specific functions. For example:
A tool for generating a synthetic transaction database
A tool for generating a synthetic sequence database
A tool for generating a synthetic sequence database with timestamps
A tool for calculating statistics about a transaction database
A tool for calculating statistics about a sequence database
A tool for converting a sequence database to a transaction database
A tool for converting a transaction database to a sequence database
A tool for converting a text file to a sequence database (each sentence becomes a sequence)
A tool for converting a sequence database in various formats (CSV, KOSARAK, BMS, IBM...) to a sequence database in SPMF format
A tool for converting a transaction database in various formats (CSV...) to a transaction database in SPMF format
A tool for converting time-series to a sequence database
A tool to generate utility values for a transaction database
A tool to add timestamps to a sequence database
A tool for removing utility information from a database having utility information
A tool to resize a database in SPMF format (a text file) using a percentage of lines of data from an original database.
A tool for visualizing time-series
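If you prefer to convert the data yourself, the following Python sketch shows the general idea for a transaction database. It is not the SPMF converter. It assumes the target format is one transaction per line, with items written as positive integers separated by single spaces and sorted in ascending order, so please check the documentation of the algorithm that you want to use. The function name and the sample data are hypothetical.

```python
import csv
import io


def csv_to_integer_transactions(csv_text):
    """Map item names to integer ids and return (lines, item_ids).

    Each CSV row is treated as one transaction. Duplicate items in a row
    are dropped and the ids on each output line are sorted in ascending order.
    """
    item_ids = {}
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        names = {cell.strip() for cell in row if cell.strip()}
        if not names:
            continue  # skip empty rows
        # New items receive the next unused id, starting from 1.
        ids = sorted(item_ids.setdefault(name, len(item_ids) + 1)
                     for name in sorted(names))
        lines.append(" ".join(str(i) for i in ids))
    return lines, item_ids


sample = "bread,milk\nmilk,beer,bread\n"
lines, mapping = csv_to_integer_transactions(sample)
print("\n".join(lines))
print(mapping)
```

Keeping the name-to-id mapping is useful, because the patterns found by an algorithm will be expressed with the integer ids and need to be translated back to item names.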
Besides, if you work with time series, there are also some additional preprocessing options:
an algorithm for calculating the moving average of a time series (to remove noise)
an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
an algorithm for calculating the linear regression line of a time series (using the least squares method)
an algorithm for splitting a time series into segments of a given length
an algorithm for splitting a time series into a given number of segments
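To give an idea of what the first two time series operations do, here is a minimal Python sketch. It is only an illustration and not the SPMF code: the function names are made up, and the SPMF implementations may handle the window edges and uneven segment sizes differently.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def piecewise_aggregate_approximation(series, n_segments):
    """Replace the series by the mean of each of `n_segments` nearly equal segments."""
    n = len(series)
    if n_segments < 1 or n_segments > n:
        raise ValueError("n_segments must be between 1 and len(series)")
    result = []
    for k in range(n_segments):
        start = k * n // n_segments        # first index of segment k
        end = (k + 1) * n // n_segments    # one past the last index of segment k
        segment = series[start:end]
        result.append(sum(segment) / len(segment))
    return result


print(moving_average([1, 2, 3, 4, 5], window=3))
print(piecewise_aggregate_approximation([1, 2, 3, 4, 5, 6], n_segments=3))
```

The moving average keeps almost as many points as the input but smooths them, while the piecewise aggregate approximation reduces the series to exactly `n_segments` points.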
But if you need something else, you can always ask. If it is something that can be useful for more than one person, maybe I can implement it. Or if you want to provide some code for a new tool or features, it is also possible.
Edited 1 time(s). Last edit at 09/30/2017 08:03AM by webmasterphilfv.