The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining book
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Need help on formatting my data and choose an algorithm
Posted by: Nema
Date: February 20, 2017 01:20AM

My data is a list of arbitrary strings and I would like to extract patterns from the strings and create a mapping between each string and its pattern. Please advise how I should format my data and which algorithm I should use. Here are examples of input and output.




h2ello -> hello
hell3o -> hello
worl4d -> world
w5orld -> world
hello67world -> helloworld
hellowor89ld -> helloworld

Options: ReplyQuote
Re: Need help on formatting my data and choose an algorithm
Date: February 20, 2017 07:13PM

You could extract sequential patterns with an algorithm such as CM-SPAM. I will give you a brief example. And there are some example in the documentation of SPMF which explains the details about the input and output formal.

An algorithm such as CM-SPAM will discover sequential patterns (subsequences that appears often in a set of sequences.). In your example, this means finding patterns that appears in several lines.

To prepare your file in the correct input format,
you would need to map each letter to an integer.
For example:
1 = h
2 = 2
3 = e
4 = l
5 = o
6 = 3
7 = w
8 = r
9 ....

Then, you would need to represent each line of your fileas a sequence in SPMF format:

For example


would be represented as:

1 -1 2 -1 3 -1 4 -1 4 -1 5 -1 6 -1 -2

Having a file in this correct format, you could apply the CM-SPAM sequential patter mining algorithm. This will discover a set of patterns that appears in your file. Let me give you an example of patterns that could be found:

1 -1 3 -1 4 -1 4 -1 5 -1 6 -1 -2 #SUP: 4 #SID: 0 1 2 3

This means that the pattern "hello" was found four times and that it appears in the sequences number #0 #1 #2 and #3 in your input file. This is just an example. But it is the kind of patterns that you could find using that algorithm.
By the way, to see the #SID in the output file you need to use the parameter (Show sequence identifiers (optional).

Options: ReplyQuote
Re: Need help on formatting my data and choose an algorithm
Posted by: Nema
Date: February 20, 2017 11:02PM

Thank you for your help. I am now able to format the data in the SPMF format.

I notice from the sequence identifiers that the patterns are 1-to-n mappings. i.e. One string can belong to more than one pattern. What can I do if I want to get 1-to-1 mappings? i.e. One input is mapped to only one pattern.

If I want to use the mined patterns to generate new strings, which files should I look at?

Options: ReplyQuote
Re: Need help on formatting my data and choose an algorithm
Date: February 21, 2017 06:48AM


A 1-to-n mapping is because a string may contain multiple patterns. If you want to get a 1-to-1 mapping, then you need to determine what would be the criteria to select only one pattern from a given string?

Some ideas could be to:
- find only the maximal patterns (using VMSP for example). This will reduce the number of patterns by finding only the largest one. But still, it will be 1-to-n.
- After that, for each string you could choose the pattern having the highest support as the pattern representing that string. So in other words, if you have 3 patterns for a given string, you could decide to choose the most frequent one. Or maybe that it would make more sense to choose the longest one... You can define your own criteria in your own code about how to choose the most representative pattern. Even, you could try different criteria and compare the results for your application...
- Besides that, you can also test the mingap and maxgap parameters of algorithms such as VMSP and CM-SPAM to reduce the number of patterns by not allowing large gaps between characters in strings.


Options: ReplyQuote

Your Name: 
Your Email: 
Spam prevention:
Please, enter the code that you see below in the input field. This is for blocking bots that try to post this form automatically.
 ********   ********  **     **  ********    ******  
 **     **  **    **  **     **  **     **  **    ** 
 **     **      **    **     **  **     **  **       
 ********      **     **     **  **     **  **       
 **           **      **     **  **     **  **       
 **           **      **     **  **     **  **    ** 
 **           **       *******   ********    ******  
This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.