The Data Mining Forum
This forum is about data mining
, data science
and big data
: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger
. No registration is required to use this forum!
Difference between equivalence class and sequence prefix tree
Date: April 19, 2021 01:04PM
I have one question regarding theory about algorithms SPADE and SPAM. I already read both papers about these algorithms, but one thing is still not clear, and that is the difference between equivalence classes (SPADE) and lexicographic tree (SPAM). As I understand it, they both divide the search space using prefixes. Is there any significant difference between these two representations?
Re: Difference between equivalence class and sequence prefix tree
Date: April 19, 2021 09:52PM
To design an algorithm for pattern mining, a key issue is to represent the search space in such way that it can be explored without missing any patterns and that the same pattern will not be visited twice (that would waste time).
Both SPAM and SPADE are similar but they do have some differences.
First, a similarity is that both SPAM and SPADE use a vertical database representation (the ID-lists). In SPAM, the difference is that the ID-lists are represented as bit vectors, while in SPADE, they are represented as list of integers. However, there also exist a version of SPADE called BITSPADE, which represents ID-lists as bit vectors just like SPAM.
So what is the main difference between BitSPADE/SPADE and SPAM?
The main difference is how they generate candidate patterns.
SPAM will recursively try to append single items to each frequent pattern to make larger patterns. For this, SPAM performs two types of extensions : i-extensions and s-extensions.
SPADE on the other hand will try to merge two frequent patterns to make larger patterns.
The approach of SPADE is better because there will be less combinations to try.....
For example, if you have two frequent patters like ABC and ABD, SPADE will combine them to make ABCD or ABDC for example.
But SPAM, maybe will try to combine ABC with many single items to generate many possibilities like:
Maybe some of these possibilities are useless....
The idea in SPADE is to only combine two patterns if they are in the same equivalence class.
In SPAM, there is no concept of equivalence class. SPAM will just try to combine many single items with each pattern and this may make a lot more candidates...
That is the main idea. Hope this is clear.
You can also check our paper about CM-SPADE and CM-SPAM where we improved them with some techniques called CMAP to further reduced the search space and make both algorithms more efficient.
Edited 1 time(s). Last edit at 04/19/2021 09:52PM by webmasterphilfv.