This forum is about

SPMF 0.81 is released! - Improved SPAM implementation

Posted by:
**
webmasterphilfv
**

Date: April 13, 2012 06:37AM

Hello everyone,

This is just to let you know that I have published a new version of my SPMF open source data mining tool (0.81).

In this version, I have improved the SPAM implementation.

Before my SPAM implementation used a fixed number of bits for each sequence. I have modified the code so that it now uses a variable number of bits for each sequence. This is the main difference.

The implementation is therefore more memory efficient and can run on larger database with longer sequences. It is also faster on some datasets because it uses less bits.

Philippe

Edited 1 time(s). Last edit at 07/15/2012 09:03AM by webmasterphilfv.

This is just to let you know that I have published a new version of my SPMF open source data mining tool (0.81).

In this version, I have improved the SPAM implementation.

Before my SPAM implementation used a fixed number of bits for each sequence. I have modified the code so that it now uses a variable number of bits for each sequence. This is the main difference.

The implementation is therefore more memory efficient and can run on larger database with longer sequences. It is also faster on some datasets because it uses less bits.

Philippe

Edited 1 time(s). Last edit at 07/15/2012 09:03AM by webmasterphilfv.

Posted by:
**
Derek
**

Date: April 18, 2012 02:37PM

Could you implement the GSP, AprioriALL and WINEPI algorithms?

That would be awesome. Also if you have Visual Basic source code that would be great.

That would be awesome. Also if you have Visual Basic source code that would be great.

Posted by:
**
webmasterphilfv
**

Date: April 18, 2012 08:38PM

Hello Derek,

I don't have these algorithms and I do not plan to implement them anytime soon. GSP and AprioriAll are actually slower than newer algorithms like PrefixSpan and SPAM. Therefore, I recommend to use these algorithms instead.

For VB, i'm not interested either. I don't think that it is the best langage for data mining.

Best,

Philippe

I don't have these algorithms and I do not plan to implement them anytime soon. GSP and AprioriAll are actually slower than newer algorithms like PrefixSpan and SPAM. Therefore, I recommend to use these algorithms instead.

For VB, i'm not interested either. I don't think that it is the best langage for data mining.

Best,

Philippe

Posted by:
**
Dvijesh88
**

Date: June 27, 2012 09:11PM

i could not understand what is mean by

it now uses a variable number of bits for each sequence

it now uses a variable number of bits for each sequence

Posted by:
**
webmasterphilfv
**

Date: June 28, 2012 07:28AM

Hi Dvijesh,

Hope you are going well.

As you probably know, SPAM uses bit vectors to represents the set of itemsets and sequences that contains a sequential pattern.

A bit vector used by SPAM contains a bit for each itemset in each sequence of the sequence database.

In my previous implementation, I used a fixed number of bits to represent

each sequence in the bit vector. Each sequence was using 32 bits. This means that if a sequence database had 100 sequence, then a bit vector would contains 100 * 32 = 3200 bits.

I use 32 bits because it was easier to implement. But this solution has some problems. First, if some sequence contains less than 32 itemsets, then some memory is wasted. For example, if a sequence contains 5 itemsets, then we should only use 5 bits... if we use 32 bits, then 27 bits are wasted! Second, my implementation could not accept sequences larger than 32 itemsets unless that number is changed to a larger value manually.

So for these reasons, I have changed my implementation so that the number of bits that is used to represent each sequence in a bit vector is variable. This means that if a sequence has 5 itemsets, then 5 bits are used for that sequence. And if another sequence contains 56 itemsets, then 56 bits are used for that sequence.

This makes the algorithm use much less memory, because SPAM has to create a lot of bit vectors. It can also make the algorithm faster.

There are some further optimization that would be possible with bit vectors for example to compress the bit vectors if there is a lot of zeros. But I did not do that yet because it is not explained in the paper.

Hope this is clear,

Philippe

The advantage of using a fixed number of bits for each sequence was that it was easier to do.

I mean that in the previous version, my implementation of SPAM was using a

Hope you are going well.

As you probably know, SPAM uses bit vectors to represents the set of itemsets and sequences that contains a sequential pattern.

A bit vector used by SPAM contains a bit for each itemset in each sequence of the sequence database.

In my previous implementation, I used a fixed number of bits to represent

each sequence in the bit vector. Each sequence was using 32 bits. This means that if a sequence database had 100 sequence, then a bit vector would contains 100 * 32 = 3200 bits.

I use 32 bits because it was easier to implement. But this solution has some problems. First, if some sequence contains less than 32 itemsets, then some memory is wasted. For example, if a sequence contains 5 itemsets, then we should only use 5 bits... if we use 32 bits, then 27 bits are wasted! Second, my implementation could not accept sequences larger than 32 itemsets unless that number is changed to a larger value manually.

So for these reasons, I have changed my implementation so that the number of bits that is used to represent each sequence in a bit vector is variable. This means that if a sequence has 5 itemsets, then 5 bits are used for that sequence. And if another sequence contains 56 itemsets, then 56 bits are used for that sequence.

This makes the algorithm use much less memory, because SPAM has to create a lot of bit vectors. It can also make the algorithm faster.

There are some further optimization that would be possible with bit vectors for example to compress the bit vectors if there is a lot of zeros. But I did not do that yet because it is not explained in the paper.

Hope this is clear,

Philippe

The advantage of using a fixed number of bits for each sequence was that it was easier to do.

I mean that in the previous version, my implementation of SPAM was using a

Posted by:
**
Dvijesh88
**

Date: June 29, 2012 07:47AM

hello sir

yes i am fine and hope you are fine too

thank you sir for answer. yes sir i got all. but i did face one error which is about this when i try to run the SPAM(i dont know the conform detail but will provide after i found out).

yes i am fine and hope you are fine too

thank you sir for answer. yes sir i got all. but i did face one error which is about this when i try to run the SPAM(i dont know the conform detail but will provide after i found out).