The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
sequential pattern mining of a factory log data
Posted by: Stefan
Date: April 09, 2019 10:04AM

Hi,

I am analyzing log event data from a factory. I have the log data of around 180 robots that perform a wide variety of tasks. The robots also perform vastly different numbers of tasks (from a total of 59 tasks for robot 15 to more than 81,000 tasks for robot 33). I want to build a graph from all these events and then summarize it (a node will be an action a robot can be in, and an edge will be the transition from one action of a robot to another action). In total, there will be over 4 million edges in the original unfiltered graph and around 4,000 nodes (4,000 unique events can happen).

I have provided an example dataframe (where row 0 holds all the observed log events of robot_0 in order, row 1 those of robot_1, etc.):

    action 1        action 2        action 3        .... action 80,999   action 81,000
0   71518_robot0    50376_robot0    71518_robot0    .... NaN             NaN
1   10125_robot1    10156_robot1    10012_robot1    .... NaN             NaN
2   10399_robot2    10566_robot2    54333_robot2    .... 10033_robot2    10044_robot2 (total of 81,000 actions)
...
...
181 50049_robot_180 10433_robot_180 22999_robot_180 .... NaN             NaN


Info about row 0: total of 59 actions, so there are 80,941 NaNs after the 59 observations.
Info about row 1: total of 44,000 actions, so there are 37,000 NaNs after the 44,000 observations.
Info about row 2: completely filled with observations, total of 81,000.

I have some methodological questions about this dataset. Hopefully some of you have experience with such data!
My main goal: find the most interesting patterns. These are not the most frequent patterns, since those reflect the normal behavior of the factory. Nor are they the least frequent patterns, since those may appear in an ordered sequence just by chance. I am mostly interested in keeping only the interactions that fall in between: the occasionally frequent sequential patterns!

(1) What would be a feasible and working algorithm? I searched the site, but most algorithms are made for sequences of integers, and my actions have string values.

(2) Is there an algorithm that finds sequential patterns in exactly one big sequence [71518, 50376, 71400, ... , 43022]? This could be interesting if calculating it for all the 180 sequences is too computationally expensive.

(3) When I tried altering the pandas dataframe (which is originally a csv) to feed it into the SPMF tool, I could not find a way to get it into a form that SPMF accepts. How could I alter my dataset to work in this tool? Is there a way to transform a normal .csv file into a suitable format?

(4) Is there a more efficient way to store this dataframe without having it bloated with NaNs?

I have some more questions but maybe some suggestions for these will already get me in the right direction!

kind regards,

Stefan

Re: sequential pattern mining of a factory log data
Posted by: Dang Nguyen
Date: April 09, 2019 02:47PM

Hi Stefan,

If you need a sequential pattern mining algorithm for sequences of strings, you can use my implementation here -- file "sp_miner.exe":

https://github.com/nphdang/Sqn2Vec

I tested it on some datasets where the number of items in a sequence is > 1400. Hopefully, it will also work well on your data.

However, based on your sample data, I don't think you will find any frequent patterns: each item in a sequence is a combination of the robot ID and the action ID, so the items are unique to each sequence and never shared across different sequences.
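A minimal sketch of one way around this, assuming every item has the form "<action ID>_<robot name>" as in the sample data (so the action ID is everything before the first underscore). The helper names are only illustrative. Removing the robot part makes the same action comparable across robots:

```python
def strip_robot_suffix(item):
    """Return the action ID part of an item such as '71518_robot0'."""
    # Assumption: the action ID is everything before the first underscore.
    return item.split("_", 1)[0]


def normalize_sequences(sequences):
    """Apply strip_robot_suffix to every item of every sequence."""
    return [[strip_robot_suffix(item) for item in seq] for seq in sequences]


# Two short sequences taken from the sample dataframe in the question.
sequences = [
    ["71518_robot0", "50376_robot0", "71518_robot0"],
    ["10125_robot1", "10156_robot1", "10012_robot1"],
]
normalized = normalize_sequences(sequences)
print(normalized)
```

Whether this is appropriate depends on whether the same action ID really means the same thing on different robots.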

Cheers,
Dang Nguyen

Re: sequential pattern mining of a factory log data
Posted by: Stefan
Date: April 10, 2019 12:55AM

Wow, that looks great! Do you have a method that converts a pandas DataFrame like mine (loaded from a .csv) into the format that your code needs? Because if I understand correctly, the format looks like this:

0    1 2 3 4 5
0    2 3 4 
1    8 9 0 3 2
2    3 8 9 0 3

So the first two lines would both belong to person 0, even though they are not of equal length?

Re: sequential pattern mining of a factory log data
Posted by: Dang Nguyen
Date: April 10, 2019 02:07AM

The first column is the label of a sequence, not the sequence ID.
So each line is a sequence of symbols and it has the format: <sequence_label> <tab> <set_of_symbols>
Mapping to your dataset, each line will be a set of actions of a robot.
Since your dataset does not have <sequence_label>, you can use a dummy label, e.g., you set the first column of all rows to be 0.
You can adapt the following code to generate a dataset in the format that my code expects.

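with open("robot.txt", "w") as f:
    for row_id in range(your_df.shape[0]):
        row = your_df.iloc[row_id, :]
        # keep only the observed actions (drop the NaN padding)
        row = row[row.notnull()].tolist()
        # dummy label, since the dataset has no sequence labels
        label = "0"
        # row[1:] skips the first column; use row instead if it already holds an action
        words = " ".join(row[1:])
        f.write(label + "\t" + words + "\n")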

your_df means your data frame.

Re: sequential pattern mining of a factory log data
Posted by: Stefan
Date: April 11, 2019 05:46AM

Hi Dang Nguyen,

Thanks for the help so far! The code works great indeed! However, I can't get the program (the .exe file) to run. Should I run it in a terminal, or could I also run it from Python? It would be great to run it from Python because then I can make a script out of it.

Re: sequential pattern mining of a factory log data
Posted by: Dang Nguyen
Date: April 13, 2019 12:30AM

Yes, you should run it in a console, as it requires some parameters.

If you want to run it from Python, you can use the following code:

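import subprocess

# mine SPs (sequential patterns) from sequences by calling sp_miner.exe
def mine_SPs(file_seq, minSup, gap, file_seq_sp):
    # pass the arguments as a list so that paths containing spaces are handled correctly
    subprocess.run(["sp_miner.exe", "-dataset", str(file_seq), "-minsup", str(minSup),
                    "-gap", str(gap), "-seqsp", str(file_seq_sp)])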

Hope this helps.

Re: sequential pattern mining of a factory log data
Date: April 11, 2019 06:59AM

Hi Stefan,

Sorry, I did not see your message earlier. Thanks for using SPMF. My answers are below.

> (1) What would be an feasible and working
> algorithm? I searched the site but most are made
> for sequences of integers, and my actions have
> string values.

Hi, yes, you can use SPMF with sequences of strings. To define the string values corresponding to integers, you can use the following format:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

That format is explained in the documentation of the sequential pattern mining algorithms offered in SPMF (e.g. CM-SPAM). You can try various algorithms. Some of the algorithms offered in SPMF like CM-SPAM will also let you set constraints such as minimum and maximum gap between itemsets in sequential patterns. There are also various types of patterns such as closed, maximal sequential patterns, compressing sequential patterns, statistically significant sequential patterns (using Skopus), etc. And you can also try the sequential rule mining algorithms, which also use that text format.

This format works for the command line or GUI of SPMF. If you want to use it in the source code, it is also possible but a bit more complicated.
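As a rough illustration, here is a minimal Python sketch that writes sequences of strings in the text format shown above. It assumes that each action is its own itemset (so every item is followed by -1) and that integer IDs can be assigned in order of first appearance. The function name to_spmf_lines is hypothetical and not part of SPMF.

```python
def to_spmf_lines(sequences):
    """Convert sequences of string items into lines of the SPMF text format.

    Each item is treated as its own itemset, so every item is followed by -1
    and every sequence ends with -2.
    """
    item_ids = {}  # string item -> positive integer ID, in order of first appearance
    body = []
    for seq in sequences:
        tokens = []
        for item in seq:
            if item not in item_ids:
                item_ids[item] = len(item_ids) + 1
            tokens.append(str(item_ids[item]))
            tokens.append("-1")
        tokens.append("-2")
        body.append(" ".join(tokens))
    # Header lines mapping each integer back to its string, as in the example above.
    header = ["@CONVERTED_FROM_TEXT"]
    header += ["@ITEM={}={}".format(i, name) for name, i in item_ids.items()]
    header.append("@ITEM=-1=|")
    return header + body


lines = to_spmf_lines([["71518", "50376", "71518"], ["50376", "10012"]])
print("\n".join(lines))
```

Writing these lines to a text file, one per line, should give an input file in the same layout as the example above.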


>
> (2) Is there an algorithm that finds sequential
> patterns in exactly one big sequence [71518,
> 50376, 71400, ... , 43022]? This could be
> interesting if calculating it for all the 180
> sequences is to computationally expensive.

For a single sequence, we call that "episode mining" instead of sequential pattern mining. There are a few algorithms in SPMF that can deal with a single sequence. Besides the episode mining algorithms, PFPM can be used to find periodic patterns in a single sequence.
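To get a first feeling for a single long sequence before trying those algorithms, a much simpler stand-in is to count contiguous runs of n events. Note that this is plain n-gram counting, not an episode mining algorithm, and the function name is only illustrative:

```python
from collections import Counter


def count_ngrams(sequence, n):
    """Count every contiguous run of n events in a single sequence."""
    return Counter(
        tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)
    )


events = ["71518", "50376", "71518", "50376", "71400"]
pair_counts = count_ngrams(events, 2)
print(pair_counts.most_common())
```

With n = 2, these counts are also the edge weights of the transition graph described in the question.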

>
> (3) when i tried altering the pandas dataframe
> (which is csv originally) to put into the spmf
> tool, i could not find a way to get it into a form
> the spmf accepts. How could i alter my dataset to
> work in this tool? is there a way to transform a
> normal .csv file to a suited format?

Because different users have different needs, there is no tool to convert all formats to the SPMF format. But typically, it is a few lines of code to do that.
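For example, a minimal sketch of the reading step in Python, assuming the CSV has one robot per row, no header row and no index column, and that padding cells are either empty or contain the text NaN. The function name is hypothetical:

```python
import csv
import io


def read_ragged_sequences(csv_file, missing=("", "nan")):
    """Read one sequence per CSV row, dropping empty and NaN padding cells."""
    sequences = []
    for row in csv.reader(csv_file):
        seq = [cell.strip() for cell in row if cell.strip().lower() not in missing]
        sequences.append(seq)
    return sequences


# A tiny in-memory CSV standing in for the real file.
sample = io.StringIO(
    "71518_robot0,50376_robot0,,\n"
    "10125_robot1,10156_robot1,10012_robot1,NaN\n"
)
sequences = read_ragged_sequences(sample)
print(sequences)
```

Keeping the data as a list of variable-length lists also avoids storing the NaN padding at all, which relates to question (4).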


>
> (4) is there a more efficient way to store this
> dataframe without having it bloated with Nan's?
>
> I have some more questions but maybe some
> suggestions for these will already get me in the
> right direction!

I don't use Python, so I cannot provide much help about that. But someone has written a Python wrapper that uses the VMSP algorithm from SPMF to mine maximal sequential patterns in Python:

https://github.com/fandu/maximal-sequential-patterns-mining

Maybe if you look at the code, it will give you some ideas about how to use SPMF from Python.

But you can always just call SPMF from the command line with a text file.
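A minimal sketch of doing that from a Python script, assuming spmf.jar is in the working directory, Java is on the PATH, and the usual "run <algorithm> <input> <output> <parameters>" form of the SPMF command line. The file names and the 50% minimum support are placeholders, and the helper names are hypothetical:

```python
import subprocess


def build_spmf_command(algorithm, input_path, output_path, *params, jar="spmf.jar"):
    """Build the argument list for one SPMF command-line call."""
    return ["java", "-jar", jar, "run", algorithm,
            input_path, output_path, *map(str, params)]


def run_spmf(algorithm, input_path, output_path, *params, jar="spmf.jar"):
    """Run SPMF and raise an error if the process exits with a failure code."""
    cmd = build_spmf_command(algorithm, input_path, output_path, *params, jar=jar)
    subprocess.run(cmd, check=True)


# Placeholder file names; 50% is the minimum support passed to CM-SPAM.
cmd = build_spmf_command("CM-SPAM", "robots_spmf.txt", "patterns.txt", "50%")
print(" ".join(cmd))
```

Calling run_spmf with the same arguments would then execute the command and leave the patterns in the output file.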

Best regards,

Philippe



Edited 4 time(s). Last edit at 04/11/2019 07:04AM by webmasterphilfv.


