The Data Mining Forum                             open-source data mining software data science journal data mining conferences high utility mining book
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 09, 2017 03:30AM

Hello,

I want to predict the next user action based on a user-action database I gathered from system log files. Therefore I am looking into the Sequence Predictions provided here in SPMF, specifically trying to use the CPT or CPT+ Algorithm.

I got other Algorithms working (like the CMRules for example) by following the developers guide, which states to call the runAlgorithm()-method in the Algo~ class in the respective package of the algorithm implementaiton.
However, in case of the Sequence Prediction Algorithms there does not seem to be such a class nor a method like this and I cannot find additional information about how to use them.

Could someone maybe give me a simple code example on how to use the CPT Algorithm on something like a 2 dimensional array containing all the sequences that shall be used for training and another array containing the sequence for which the next symbol shall be predicted?

I would be very greatful for any help smiling smiley

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Date: June 09, 2017 08:05AM

Hello,

Welcome to the forum. Glad you are using SPMF and that CMRules is working already.

For CPT+ and the other algorithms for sequence prediction in SPMF, there is indeed no "runAlgorithm()" method. The reason is that it was designed by a student T. Gueniche, who first designed this code as a separate project.

But actually, it is not so difficult to run the code.

For CPT the main example for running the sequence prediction algorithm is in the file MainTestCPT in the package ca.pfv.spmf.tests


The code looks like this:

// Load the set of training sequences
		String inputPath = fileToPath("contextCPT.txt"winking smiley;  
		SequenceDatabase trainingSet = new SequenceDatabase();
		trainingSet.loadFileSPMFFormat(inputPath, Integer.MAX_VALUE, 0, Integer.MAX_VALUE);
		
		// Print the training sequences to the console
		System.out.println("--- Training sequences ---"winking smiley;
		for(Sequence sequence : trainingSet.getSequences()) {
			System.out.println(sequence.toString());
		}
		System.out.println();
		
		// Print statistics about the training sequences
		SequenceStatsGenerator.prinStats(trainingSet, " training sequences "winking smiley;
		
		// The following line is to set optional parameters for the prediction model. 
		// We want to 
		// activate the CCF and CBS strategies which generally improves its performance (see paper)
		String optionalParameters = "CCF:true CBS:true CCFmin:1 CCFmax:6 CCFsup:2 splitMethod:0 minPredictionRatio:1.0 noiseRatio:1.0";
		// 			List<List<Item>> itemsets = finder.findFrequentItemsets(trainingSequences, parameters.paramInt("CCFmin"winking smiley, parameters.paramInt("CCFmax"winking smiley, parameters.paramInt("CCFsup"winking smiley);
		
		// Train the prediction model
		CPTPlusPredictor predictionModel = new CPTPlusPredictor("CPT+", optionalParameters);
		predictionModel.Train(trainingSet.getSequences());
		
		// Now we will make a prediction.
		// We want to predict what would occur after the sequence <1, 3>.
		// We first create the sequence
		Sequence sequence = new Sequence(0);
		sequence.addItem(new Item(1));
		sequence.addItem(new Item(2));
		
		// Then we perform the prediction
		Sequence thePrediction = predictionModel.Predict(sequence);
		System.out.println("For the sequence <(1),(2)>, the prediction for the next symbol is: +" + thePrediction);


Now let me give you some explanation. In that code above, we will read an input file called "contextCPT.txt", which is encoded in the same file format as CMRules.

In your case, I understand that you want instead to run CPT using some sequences that are already in memory rather than loading a file. So to do that, instead of loading the data from a file, you could create the sequences in memory. Let's say that I want to create a database with two sequences:
1, 2, 3, 4
2, 2, 3, 4

You can do this as follows:

// Create a sequence database
		SequenceDatabase trainingSet = new SequenceDatabase();
		
		// Create a first sequence
		List<Item> listItemsSequence1 = new ArrayList<Item>();
		listItemsSequence1.add(new Item(1));
		listItemsSequence1.add(new Item(2));
		listItemsSequence1.add(new Item(3));
		listItemsSequence1.add(new Item(4));
		Sequence sequence1 = new Sequence(1, listItemsSequence1); // the id of the sequence is 1
		
		// Add the sequence to the database
		trainingSet.getSequences().add(sequence1);
		
		// Create a second sequence
		List<Item> listItemsSequence2 = new ArrayList<Item>();
		listItemsSequence2.add(new Item(2));
		listItemsSequence2.add(new Item(2));
		listItemsSequence2.add(new Item(3));
		listItemsSequence2.add(new Item(4));
		Sequence sequence2 = new Sequence(2, listItemsSequence2); // the id of the sequence is 2

		// Add the sequence to the database
		trainingSet.getSequences().add(sequence2);

Then after that you can apply the same code to train CPT:

// The following line is to set optional parameters for the prediction model. 
		// We want to 
		// activate the CCF and CBS strategies which generally improves its performance (see paper)
		String optionalParameters = "CCF:true CBS:true CCFmin:1 CCFmax:6 CCFsup:2 splitMethod:0 minPredictionRatio:1.0 noiseRatio:1.0";
		// 			List<List<Item>> itemsets = finder.findFrequentItemsets(trainingSequences, parameters.paramInt("CCFmin"winking smiley, parameters.paramInt("CCFmax"winking smiley, parameters.paramInt("CCFsup"winking smiley);
		
		// Train the prediction model
		CPTPlusPredictor predictionModel = new CPTPlusPredictor("CPT+", optionalParameters);
		predictionModel.Train(trainingSet.getSequences());

And then you can use the code for making a prediction. In the following, we want to predict what will be the next item following a sequence "1, 2".

// Now we will make a prediction.
		// We want to predict what would occur after the sequence <1, 3>.
		// We first create the sequence
		Sequence sequence = new Sequence(0);
		sequence.addItem(new Item(1));
		sequence.addItem(new Item(2));
		
		// Then we perform the prediction
		Sequence thePrediction = predictionModel.Predict(sequence);
		System.out.println("For the sequence <(1),(2)>, the prediction for the next symbol is: +" + thePrediction);


Ok, so that is all. Hope it is clear.

So you can also change the parameters of CPT. In the above example, the parameters are set here:

"CCF:true CBS:true CCFmin:1 CCFmax:6 CCFsup:2 splitMethod:0 minPredictionRatio:1.0 noiseRatio:1.0"

Besides, CPT, there is also CPT+ which works more or less in the same way but has additional optimizations. You could also check it. Maybe it would work better on your data. Or otherwise, there are also other sequence prediction algorithms in SPMF such as TDAG, AllKMarkov, DG, etc.

In the documentation of SPMF, example 118 to 125 are about the algorithms for sequence prediction (http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php#cptPlus)

Hope this helps.

Best regards,

Philippe



Edited 2 time(s). Last edit at 06/09/2017 08:08AM by webmasterphilfv.

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 11, 2017 11:06PM

Hello Philippe,

thank you very much for your effort. This helped me a lot and I got it working now smiling smiley

I have one further question though:
Wheneiver I try and run the algorithm with a sequence with 7 or more items in it, the program crashes. Here is a picture of the output of the MainTestCPT program with a modified input file containing one sequence longer than 6 items:

image upload


This does not happen if I only put 6 or less items in a sequence. Could this be a bug within the implementation or am I missing something here?


Best wishes,
Günther

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Date: June 11, 2017 11:42PM

Dear Günther,

Glad it has helped.

There should not be a limitation for 7 items. We actually tested with longer sequences. But it is always possible that there could be some bug or maybe that it is related to something in your code.

Can you send me your code in MainTestCPT ? and if you have used an input file, could you send me your input file? I can help you to check what is the problem.

You may send to my e-mail: philfv8 AT yahoo DOT com

Best

Philippe

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 13, 2017 11:18PM

Hello Philippe,

I have sent you an e-mail containing the MainTestCPT as well as the input file.

Also, I have one further question regarding the usage of the sequence prediction algorithms:
As of now, the prediction algorithms like CPT seem to only give me the item which they found to be the most likely one to follow next given the input sequence. Is there any way I could also get the exact probability of this item (e.g. Item '2' was found to be most likely to be the next symbol with a probility of p = 0,78)?

For context, I want to detect anomalous user behavior at run time based on the previous user behavior recorded in a log file. Herefore I can compare the newest user action 'actionA' recorded to the prediction the CPT/AKOM algorithms gave me ('actionB'). Now, it might be a false conclusion to view the user action as a potentially dangerous anomaly just because it differs from the algorithms prediction ('actionA' != 'actionB'), if 'actionB' only had a probability of let's say 31% and the actually recorded 'actionA' was the 2nd most likely with a probaility of 28%. So it would probabably make sense to only act upon an anomalous event if the prediction probability was above a certain threshhold (let's say > 85%) and the recorded user action differed from that to reduce false positives.


To summarize: Is there any way of obtaining the exact probability of the predicted item besides just the predicted item itself? I thought maybe the probability for each possible item to follow gets calculated and stored while applying the respective algorithm and just do not get returned to the user as of now.

Best regards smiling smiley

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Date: June 14, 2017 03:03PM

Hello again,

It has not been implemented yet. But I think it is a good idea for a novel feature. Actually, when the CPT model makes a prediction, it internally uses a structure called the "count table". In that structure, the model will assign a score to each symbol, and then select the symbol with the highest score as the prediction. So, if we access that structure, we can get the information that you want about all the symbols and their scores, and then you could compare the values to determine if A is much more likely then B as in your example.

I have updated the code on the SPMF website to add this feature, just now. You can download the new version, and update MainTestCPT with the following lines:

System.out.println();
		System.out.println("To make the prediction, the scores were calculated as follows:"winking smiley;
		 Map<Integer, Float> countTable = predictionModel.getCountTable();
		 for(Entry<Integer,Float> entry : countTable.entrySet()){
			 System.out.println("symbol"  + entry.getKey() + "\t score: " + entry.getValue());
		 }

Then it will show you the scores for each symbols for the latest prediction that you made:

To make the prediction, the scores were calculated as follows:
symbol2	 score: 0.6666667
symbol3	 score: 0.6666667
symbol6	 score: 0.33333334

Note that the concept of score is similar to a conditional probability but it may not exactly be a probability because several factors are taken into account in that score such as noise, etc. So the values may perhaps be greater than 1 (I would need to read the paper again to check that).

I have also added that feature to CPT+.The results looks like this with CPT+:

To make the prediction, the scores were calculated as follows:
symbol3	 score: 1.7780446
symbol4	 score: 1.3333833
symbol5	 score: 1.2501
symbol6	 score: 1.3333668

Here the scores can be greater than 1 because CPT+ uses some noise reduction strategy when computing the score, and it will take many things into account to obtain the final scores.

Best,

Philippe

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 15, 2017 12:36AM

Hello,

thank you very much for your time and effort! This is really helpful and was exactly what I was looking for.

If you don't mind, I would like to pick your brain a bit about the most applicable Data Mining "Problem" (is that what the categories of itemst mining, association rule mining, sequence prediction etc are called?) for my goal of identifying anomalous user actions at runtime based on their previous behaviour.

So far I think association rule and sequential pattern mining simply is not accurate enough and not very applicable to this problem. Sequential rule mining in its normal form would probably also yield problems, because if I have very long sequences of user actions from a big log file, each rule would probably always be fulfilled, since one item will eventually follow in pretty much all cases (except for a user action that has only been done once). I could handle this by dividing the recorded user actions into smaller subsequences, but I would lose some prediction accuracy by doing this.
The better way of using sequential rule mining as far as I see it would be to use a window time constraint (like TRuleGrowth) to avoid this problem and only look for the next single (or maybe 2-3) user action(s). This way I could categorize a newly occured user action as an anomaly if there is no fitting rule for it or the support and/or confidence is below a certain threshhold.
The problem with using patterns to identify anomalies might though be that it might not work correctly for rare cases, since I have to define a a minimum support. I think you wrote about this as the original reason for the creation of the CPT algorithm, because it does not use patterns for prediction and so does not lose accuracy.

The other possibility is of course the use of sequence prediction strategies like CPT(+). This way I could categorize the most recent user action as anomalous, if it is far off from the predicted user action (with the help of your recent addition of the getCountTable function to decrease false positives).

So my current understanding is, that either sequential rule mining with a window time constraint or sequence prediction algorithms would be best fit for this task.


What do you think personally would fit best to solve my problem?

Best regards smiling smiley

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Date: June 15, 2017 08:25AM

Hello,

I think that what you wrote makes a lot of sense.

Yes, for sequential pattern mining, an issue is that there is no measure of confidence that a pattern will be followed.

Then, for sequential rule mining, there is the measure of confidence, but it will probably not work too well for rare cases because of the minsup threshold. And yes, using the window constraint would be a good idea if rules are used.

But I agree that that the idea of using CPT(+) is quite interesting. Using some training data, you can build a model. Then for another sequence, that trained model tells us if it is different from the training sequences by making a prediction as you said.

Another possibility is to use the concept of "contrast patterns". A contrast pattern is a pattern where the support vary considerably in two datasets. There are some algorithms for finding this kind of patterns, but they are not implemented in SPMF. These algorithms can find that a pattern usually frequently appearing in a database, does not appear in another database.

Besides, another possibility could be to use some clustering algorithms, or some distance measure perhaps, to comare the distance between a sequence and some patterns extracted from a training sequences.

This is just some idea. Hope it helps

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 15, 2017 11:04PM

Hello again,

thanks for your input, I will definitely consider your ideas.

I have just looked into the paper of CPT(+). Do you think that it could yield a problem if I use very long sequences for training my model?
At the moment I thought about creating one model per user group/role (e.g. one model for all users "Administrator" etc.) and giving the training set one sequence containing all the actions recorded for one single user of this user group. So let's say I have 3 "Adminstrator" users in the system, then I would create a training set with 3 sequences, each containing all recorded user actions of this specific "Administrator" user. Then I could check any new incoming action of one user of the group "Administrator" using this prediction model.

This way though, each sequence would get really long over time and it seems like (if I understand the paper correctly) the prediction tree would get incredibly deep this way and probably take up a lot of memory.
If this was true, I could also split the training sequences into smaller sub-sequences (as with the optinalParameters already given by the CPT implementation), but I do not want to lose accuracy if it happens to cut right in the middle of a typical workflow of user actions. This loss of accuracy could then again probably be minimized by not dividing into too small sub-sequences (maybe instead of dividing into sub-sequences with length 6 only divide into sub-sequences with length 20), since the probability of cutting right in the middle of an important and typical workflow would be smaller the longer the sub-sequences get, right?


Do you maybe have any data or conducted any studies about such topics like the impacts on using really long sequences for training the model or possible loss of prediction accuracy by cutting into small sub-sequences?

Best regards smiling smiley

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 17, 2017 04:10AM

I am currently playing around a bit with different sequence lengths and the optionalParameter settings of CPT. It seems to make quite a big difference:

Having created a bit longer training sequence (~70 items long) and not splitting the sequence for training at all ( "splitMethod:0" ) predicts some item A for me and the scores for the possible items in the count table are pretty large ( in the 20 - 30 range). If I now let CPT split the training sequence into smaller sub-sequences of different lengths ( "splitMethod:5" up to "splitMethod:15" ) I always get another item B predicted and the count table scores are much smaller (2 - 5).
Once the split length surpasses a certain number (in my case it was around "splitMethod:20" ), Item A got predicted again (the one that got suggested without any splitting of the very long input sequence).

Now, one of them has to be wrong, correct? I would guess, that splitting it into too small sub-sequences for training the prediction model trades prediction accuracy due to losing possible longer sequence contexts for memory usage. The fact that the count table scores get that much larger by using longer, unsplit training sequences suggests that these are somehow correlated with the depth of the prediction tree.


So I guess my options would therefore be to not split the potentially very long sequences and just accept the high costs in terms of memory for better prediction accuracy, or find some sort of sweet spot for the split length ("splitMethod:50" or something along those lines would probably keep enough context and is still a lot smaller than handling one giant sequence of ten thousands of items).


Do I see this somewhat correctly or am I misunderstanding the mechanisms of how the CPT(+) Algorithm works in regard to sequence length, splitting training sequences etc.?

Best wishes smiling smiley

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 17, 2017 04:15AM

Oups sorry, I of course meant "splitLengthangry smileyY" when I wrote "splitMethodangry smileyY" in terms of splitting length.

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Günther
Date: June 17, 2017 04:25AM

I just noticed that I still had "splitMethod:1" in my code, meaning I only took the last k items out of the sequence, while playing with the splitLength - parameter.

Seems like this influences the prediction accuracy quite a lot, but once I correctly set "splitMethod:2" and split the training sequence into k sub-sequences, it actually was pretty accurate already with pretty small values ( "splitLength:5" also got me the same result as with no splitting at all).


Sorry for bothering you this much, but I think everything starts to make sense for me now regarding the CPT algorithm smiling smiley

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Date: June 17, 2017 07:11AM

I see that you have tried many things. It is great. I did not answer you right away, as I am currently travelling and did not have much time to catch up with e-mails and messages.

To answer your questions, my student did some experiments about the split length for web click data in the ADMA 2013 paper about CPT:
http://www.philippe-fournier-viger.com/ADMA2013_Compact_Prediction_trees.pdf

In that paper, you can see the influence of splitting sequences with different lengths in Fig. 6. We found that for the Web data considered in that paper, if we set splitLength = 5, the accuracy is almost the same that if we set it to higher values. Moreover, in that same figure, we can see that setting splitLength = 5, greatly reduces the time for making predictions (the "testing time"winking smiley and also the size of the model (from about 250K nodes to about 50K nodes).

So yes, the purpose of splitting sequences is mainly to reduce the size of the model (the tree), which influences the memory used and also how fast the predictions are made.

How to set the splitLength parameter or splitMethod parameter must be found by trying various values to find what works best on your dataset, I think.

Yes, I agree with you that if you have some very long sequences, a potential issue could be that depending on where the sequences are split, some important information could be lost. But I think that if you have enough data, this may not be too much of a problem. And I am glad to read that you have tried it and the results were almost as good as when not splitting the data. :-)

Glad you are getting some good results.

The CPT is actually not very complicated. It would be possible to improve the model in different ways with additional features if you need. I mean if you have some special requirements for your project, maybe you could get some novel ideas and propose some improvements to the model.

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Posted by: Cheena Ghatoura
Date: June 23, 2017 07:43AM

Hi Günther

My Name is Cheena and I am currently pursuing post graduate course. I am working on my dissertation and the methodology for my project is Sequential pattern mining. My data is of the following form. (ID's are patients unique number and codes are patients speciality visit in the hospital.)
NHS_Number TFC1
3230673923 120
3230673921 110
3274821558 301
3274821551 350
3274821551 301
3274821552 350
3274821553 301
3274821552 350
3274821558 301

How shall I make sure that my file is recognised by SPMF. I really appreciate your guidance on this
Regards
Cheena

Options: ReplyQuote
Re: Usage of Sequence Prediction Algorithms | CPT Example
Date: June 23, 2017 08:17AM

Hello,

You could first have a look at the documentation ("documentation" section on the SPMF website). There is the list of all the algorithms offered in SPMF. Since you have several patients, you could consider applying some sequential pattern mining, sequential rule mining or sequence prediction algorithms. Depending on the algorithm that you choose, the input file format is explained in the "documentation" page.

So the next step would be to transform your file to that format (as described in the documentation. Examples of input files are provided in SPMF. You can also check the example to see the format.

In your case, you have several patients. So I think that:
- You could create a sequence for each patients. The sequence would be the list of visit made by that patient.
- Then you could apply some algorithms either to find patterns common to several patients, or to make some sequence prediction, depending on what is your goal.

Hope this helps,

Philippe

Options: ReplyQuote


Your Name: 
Your Email: 
Subject: 
Spam prevention:
Please, enter the code that you see below in the input field. This is for blocking bots that try to post this form automatically.
 ********   **    **  **     **   ******   **      ** 
 **     **   **  **    **   **   **    **  **  **  ** 
 **     **    ****      ** **    **        **  **  ** 
 ********      **        ***     **        **  **  ** 
 **     **     **       ** **    **        **  **  ** 
 **     **     **      **   **   **    **  **  **  ** 
 ********      **     **     **   ******    ***  ***  
This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.