The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum.
The BMSWebview Sequential pattern mining dataset
Posted by: Vartika
Date: August 23, 2017 08:14AM

Dear Philippe,

I have seen that many researchers use the BMS-WebView-1 dataset as a transaction dataset in their research.
I am also looking for a transaction dataset, or any sparse dataset, for my PhD research.

I am trying to understand the representation used in the BMS-WebView-1 dataset, but I have not been able to.

The first few lines of the dataset are shown below.
My questions are:
1. Does each line represent a different transaction?
2. What do -1 and -2 represent?

10307 -1 10311 -1 12487 -1 -2
12559 -1 -2
12695 -1 12703 -1 18715 -1 -2
10311 -1 12387 -1 12515 -1 12691 -1 12695 -1 12699 -1 12703 -1 12823 -1 12831 -1 12847 -1 18595 -1 18679 -1 18751 -1 -2
10291 -1 12523 -1 12531 -1 12535 -1 12883 -1 -2
12523 -1 12539 -1 12803 -1 12819 -1 -2

It would be great if you could help me with this.
Thanks in advance.

Regards,
Vartika



Edited 1 time(s). Last edit at 09/11/2017 07:12PM by webmasterphilfv.

Re: Sequential pattern mining datasets
Date: August 23, 2017 08:20AM

Hello,

The dataset above is in the SPMF format, which is the format used by the SPMF library. The SPMF documentation describes it as follows:

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer, and items from the same itemset within a sequence are separated by a single space. Note that it is assumed that items within the same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.
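To make the format concrete, here is a minimal Python sketch that reads one line in this format and returns its itemsets. It is not part of SPMF; the function name parse_spmf_line is made up for illustration, and the code only assumes the rules quoted above (-1 ends an itemset, -2 ends the sequence).

```python
def parse_spmf_line(line):
    """Parse one SPMF sequence line into a list of itemsets (lists of ints).

    Hypothetical helper for illustration; it is not an SPMF function.
    """
    sequence = []
    itemset = []
    for token in line.split():
        value = int(token)
        if value == -1:
            # -1 closes the current itemset
            sequence.append(itemset)
            itemset = []
        elif value == -2:
            # -2 closes the sequence, so anything after it is ignored
            break
        else:
            itemset.append(value)
    return sequence


example = "1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2"
print(parse_spmf_line(example))
# prints [[1], [1, 2, 3], [1, 3], [4], [3, 6]]
```

Applied to a BMS line such as "12559 -1 -2", the same function returns a sequence with a single itemset that contains a single item.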

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly lower than if the native SPMF file format is used, because the input file is automatically converted before the algorithm is launched and the result also has to be converted back. This cost, however, should be small.
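The idea of that text conversion can be sketched in a few lines of Python. This is only an illustration of the description above, not SPMF's actual converter: the function name text_to_spmf_lines, the lowercasing of words, the numbering of words in order of first appearance, and the choice to put each word in its own itemset are all assumptions.

```python
import re


def text_to_spmf_lines(text):
    """Turn raw text into SPMF-style sequence lines, one per sentence.

    Illustrative sketch only. Assumptions: words are lowercased, numbered
    from 1 in order of first appearance, and each word is its own itemset.
    """
    word_ids = {}
    lines = []
    # Sentences are delimited by ".", "?" and "!"
    for sentence in re.split(r"[.?!]", text):
        words = sentence.split()
        if not words:
            continue  # skip empty fragments, e.g. after the last delimiter
        tokens = []
        for word in words:
            # Reuse the id of a known word, otherwise assign the next integer
            item = word_ids.setdefault(word.lower(), len(word_ids) + 1)
            tokens.append(f"{item} -1")
        lines.append(" ".join(tokens) + " -2")
    return lines, word_ids


lines, word_ids = text_to_spmf_lines("The cat sat. The dog sat!")
print(lines)
# prints ['1 -1 2 -1 3 -1 -2', '1 -1 4 -1 3 -1 -2']
```

The returned word_ids dictionary is what a converter would need in order to translate the mined patterns back into words afterwards.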

So basically, each line in the file is a sequence. Each -1 marks the end of an itemset, and the -2 marks the end of the sequence. In the lines that you posted, every itemset contains a single item, which is why a -1 follows each number.
BMS is a dataset of sequences of webpages visited by users. I think that each sequence corresponds to one user and that each positive number is a webpage.
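If you want to treat each line as a transaction rather than as a sequence, one possible approach is to drop the -1 and -2 markers and keep the distinct items. The sketch below assumes that the order of page visits and repeated visits can be discarded; the function name sequence_line_to_transaction is hypothetical.

```python
def sequence_line_to_transaction(line):
    """Flatten one SPMF sequence line into a sorted list of distinct items.

    Hypothetical helper: it drops the -1/-2 markers, the ordering of the
    itemsets, and any repeated items.
    """
    # Items are positive integers, so the markers are filtered out by sign
    items = {int(token) for token in line.split() if int(token) > 0}
    return sorted(items)


transaction = sequence_line_to_transaction("12695 -1 12703 -1 18715 -1 -2")
print(transaction)
# prints [12695, 12703, 18715]
```

Whether this loss of ordering is acceptable depends on what you want to mine, so treat it as a starting point only.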

The dataset that you downloaded is in the SPMF format. As far as I remember, the original BMS dataset, which contained more information, was provided for KDD CUP 2000, but that website no longer exists. So that is all the information that I have about this dataset.

It is a good benchmark dataset for pattern mining, as many people use it. It is a sparse dataset.
