AKOM Algorithm functionality

Posted by: **Brett**

Date: July 30, 2017 02:32AM

Hello everyone,

I am currently looking into how some sequence prediction algorithms work and have read the papers of the algorithms implemented in SPMF. The paper on the AKOM algorithm does not go into much detail about how the algorithm works exactly, so I do not quite understand how it differs from, for example, PPM.

As I understand it, AKOM trains several Markov models of orders 1 up to k and uses all of them to make the prediction. How are they trained, though? And how is the prediction made with all the models being used? Is the predicted item the one with the highest average probability across all models?

Also, as far as I can tell, the PPM algorithm also trains all Markov models up to the maximal order k by keeping track of the tables for the different context lengths (explained quite well with an example training sequence here). It only uses the highest-order tables possible to make its prediction and falls back to lower orders only if no prediction can be made at the higher order, but the training part should be about as complex as that of AKOM. In the evaluation part of the CPT paper, however, the AKOM algorithm's complexity is massively higher than that of all the other algorithms, and it uses many times more nodes. Why is that? I think I do not understand this because the AKOM paper does not explain how the training of the Markov models is done.

Can someone with some insight into how the AKOM algorithm works maybe shed some light on this topic for me? I cannot find any good explanations or example training phases for AKOM anywhere.

Any help would be greatly appreciated!

Thanks a lot


Posted by: **webmasterphilfv**

Date: July 30, 2017 03:46AM

Hello,

The implementation of AKOM in SPMF is supposed to work as follows. If we set k = 5, it will create the Markov models of orders 1, 2, 3, 4, and 5. AKOM keeps all five models in memory. Then, to make a prediction, AKOM looks at the model of the highest order (5). If the current sequence to be predicted does not match that model, AKOM uses the model of order 4, and so on.

How is the training done? It is simple. The sequences are read one by one. A given sequence is read from left to right to update the models of orders 1 to 5, either by adding new entries or by incrementing the counts in the models.
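To make this concrete, here is a minimal Python sketch of the training and prediction just described (the function names `train_akom` and `predict_akom` are hypothetical, and this is a simplification, not the actual SPMF Java code; ties between candidate symbols are broken arbitrarily):

```python
from collections import defaultdict

def train_akom(sequences, k):
    # models[order] maps a context tuple to a {next_symbol: count} table
    models = {o: defaultdict(lambda: defaultdict(int)) for o in range(1, k + 1)}
    for seq in sequences:
        for i in range(1, len(seq)):              # read left to right
            for o in range(1, k + 1):             # update every model order
                if i - o >= 0:
                    ctx = tuple(seq[i - o:i])
                    models[o][ctx][seq[i]] += 1   # add entry or increment count
    return models

def predict_akom(models, ctx, k):
    # try the model of the highest order first, then fall back to lower orders
    for o in range(min(k, len(ctx)), 0, -1):
        c = tuple(ctx[-o:])
        if c in models[o]:
            counts = models[o][c]
            return max(counts, key=counts.get)    # most frequent successor
    return None

# using the two sequences from Slide 5 of the PPT mentioned below
models = train_akom(["ABCACBD", "CCABCBCA"], 2)
print(predict_akom(models, "AB", 2))  # AB is followed by C in both sequences, so prints C
```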

I suggest looking at the PPT presentation about the CPT+ model for some comparisons of AKOM and PPM:

http://www.philippe-fournier-viger.com/PAKDD2015_sequence_prediction_PPT.pdf

On Slide 5, there are two sequences:

S1: {A,B,C,A,C,B,D}

S2: {C,C,A,B,C,B,C,A}

On Slide 5, you can see the DG model for these sequences.

On Slide 7, you can see the PPM model for these sequences (order 1).

On Slide 8, you can see the PPM model for these sequences (order 2).

On Slide 10, you can see the AKOM model for these sequences for k = 2. The idea is simple: the AKOM model of order 2 is the combination of PPM (order 1) with PPM (order 2).

Actually, AKOM is called ALL-K-Order-Markov because it combines ALL the Markov models of orders 1 to k. This is the difference from PPM.

By the way, in this PPT all these models are represented as graphs, but in terms of implementation they could just as well be stored in tables or hash maps.

You can see the source code of some implementations of these models in SPMF. These models are very simple in terms of implementation.

By the way, I have not read these papers in a very long time, so I may be missing some details. Also, the implementations of these models in SPMF were not done by me but by my student at the time, T. Gueniche.

Hope this helps.

Philippe


Posted by: **Brett**

Date: July 31, 2017 01:08AM

Hello and thanks for your response!

I think what still confuses me is the so-called escape mechanism proposed by the PPM algorithm. Here is the part of the paper I am talking about:

So, as I understand it, PPM trains all the Markov models up to the maximal order as well and uses the highest order possible to make the prediction, only switching to a lower order if a prediction cannot be made using the higher one. I have actually created a figure demonstrating the training phase of PPM for the sequence 'ABABAC' using hash tables. It only shows the relevant tables for each encoding step, leaving out all the other order-2 and order-1 tables that are not updated during that step:

It shows the training of a PPM of order 2, incrementing the counts of the encountered symbols in the respective tables for each order and each observed context.

Let's say that after training this model, one wants to predict the next item of the sequence <B,B>. Since BB has not been observed in the training sequence, the highest entry in the order-2 table for the BB context (O2-BB) would be the 1 for the escape symbol; all others are still 0. So it escapes order 2 and moves on to order 1. Now it throws away the first B and looks into the order-1 table for the B context (O1-B), where the symbol A has the highest score of 2 (updated in the second-to-last step shown in my graphic). It will therefore predict A as the next symbol.
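In case it helps other readers, the walk-through above can be reproduced with a few lines of Python (a sketch only; the escape is modeled simply as falling back to the lower order when the context table is empty, rather than with explicit escape counts):

```python
from collections import defaultdict

# order-1 and order-2 count tables for the training sequence 'ABABAC'
seq = "ABABAC"
o1 = defaultdict(lambda: defaultdict(int))   # context of length 1
o2 = defaultdict(lambda: defaultdict(int))   # context of length 2
for i in range(1, len(seq)):
    o1[seq[i - 1]][seq[i]] += 1
    if i >= 2:
        o2[seq[i - 2:i]][seq[i]] += 1

print(dict(o2.get("BB", {})))  # {} -> BB was never observed, so escape to order 1
print(dict(o1["B"]))           # {'A': 2} -> predict A, as in the example above
```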

To me this sounds exactly like how you described AKOM working, which is why I do not understand why AKOM has such massively higher complexity and memory usage.

Maybe your PPM implementation does not actually fall back to lower-order models when a symbol has not been observed in the larger context yet. If the implementation in SPMF only uses the maximal order specified, that would explain why it is so much smaller.

Do you see any other difference between PPM and AKOM then at all if PPM is seen the way I described here?

I hope my explanations and my graphics were understandable.


Posted by: **webmasterphilfv**

Date: July 31, 2017 03:14AM

I think that you are right about PPM. I have checked the article quickly and it seems that PPM indeed works by checking order K, then K-1 until order 0, as you said.

Thus, I have to conclude that the implementation in SPMF is not faithful to the article. I have informed the person who implemented PPM about it. But as he is very busy (not a student anymore), I am not sure he will have time to fix it. If you want to contribute to SPMF, though, I think you could easily fix it and share the code. That would be great. Since you have already understood the paper and made some examples, it should not be too hard for you to implement. Personally, I currently do not have time to re-implement it. It would be great if someone could redo it in a more faithful way.


Posted by: **webmasterphilfv**

Date: July 31, 2017 03:59AM

Actually, I have talked with my previous student who implemented the code, and it is apparently a problem with how we named the algorithm. It is an implementation of PPM with k = 1; the implementation does not support k > 1. So, to be correct, the implementation on the website should be called "PPM order 1" rather than "PPM", because it only does k = 1.

I will fix the website tomorrow to make this clear and avoid this confusion.

But it would be actually great to implement the more general PPM. If you have time, it would be welcome.


Posted by: **Brett**

Date: August 01, 2017 12:02AM

This makes a lot of sense. I have actually looked back at the CPT paper a bit more closely, and the evaluation section I was referring to states that first-order PPM was used. So basically the AKOM implementation used there is the PPM algorithm with higher orders, as it is described in the original paper.

I just find the literature a bit confusing then, because the AKOM paper (Pitkow, 1999) always seems to get cited for the concept of using Markov models of all orders up to some maximal order k. Actually, though, the PPM paper (Cleary, 1984) already used that exact concept for the PPM algorithm.

Thanks for your time, though. I think I can sleep well again now that this is cleared up!

I do not have the spare time at the moment to implement the general PPM, but once my schedule clears up I might get around to it. I will let you know if I manage to implement it.


Posted by: **Brett**

Date: August 10, 2017 06:21AM

Hello once more,

This time I have come up with a question regarding the prediction phase of CPT, and I hope you can maybe help me out quickly:

What happens in the prediction phase when there are no symbols in the Prediction Tree after all the elements of the sequence whose next item is to be predicted have occurred?

For example, looking at slide 34 of the presentation you posted above, let's say we now want to predict the next item of the sequence <A,B,C> instead. The sequences similar to <A,B,C> would still be s1, s2, and s3, but since C is the last item in both of those, there would be no items to put into the count table, correct?

What would the prediction be in that case?

Thanks a lot


Posted by: **webmasterphilfv**

Date: August 10, 2017 12:40PM

Hello,

It depends on how you set the parameters of CPT. If we want to predict strictly what appears after <A,B,C>, then yes, we don't have the training data to make any prediction. However, the CPT and CPT+ models include strategies that deal with noise by removing items that prevent us from making a prediction, so that the models are more noise tolerant.

In the CPT model, the strategy is called "Recursive Divider". If CPT cannot make a prediction, then depending on how you set the parameter of that strategy, it will remove one, two, or more items and try to make a prediction without them. For example, in the case of ABC, since CPT cannot make a prediction, it would try removing some items:

AB

BC

AC

to make some predictions.

If you use that strategy, the prediction for ABC using AB can be "D" for example.
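As a sketch of how such subsequences could be generated (an illustration only, not SPMF's actual code; the name `recursive_divider` is hypothetical):

```python
from itertools import combinations

def recursive_divider(seq, max_level):
    # yield subsequences of seq with up to max_level items removed,
    # trying the fewest removals first (order among same-level peers is arbitrary)
    for level in range(max_level + 1):
        for kept in combinations(range(len(seq)), len(seq) - level):
            yield "".join(seq[i] for i in kept)

print(list(recursive_divider("ABC", 1)))  # ['ABC', 'AB', 'AC', 'BC']
```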

The Recursive Divider strategy is discussed very briefly in the PPT on page 45 and is described in the paper. As a user, you can set the maximum number of levels (how many items can be removed from the sequence to be predicted). At the bottom of the slide on p. 45 we can see some results. If we set **maxLevel** to 3, we obtain greater accuracy for the dataset used than if we don't use this strategy, as CPT becomes more noise tolerant. On the right side, you can see the influence of maxLevel on the coverage. The coverage is the percentage of the time that we were able to make a prediction. If we set maxLevel = 2, we can only make a prediction 70% of the time. If we set it to about 4 or 5, the coverage is about 100% for that dataset.

In CPT+, there is also a strategy to handle noise, but it works a little differently. It requires setting a parameter called the "noise ratio", and it assumes that less frequent items are more likely to be noise. Thus, these items are removed first, before other items, and this gave the best results in our experiments. This one is on page 59 of the PPT but is described in more detail in the paper, I think.
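A tiny sketch of that idea (the helper name and signature are hypothetical, not SPMF's API; it simply removes the n items of the sequence that are least frequent in the training data):

```python
from collections import Counter

def remove_noisy_items(seq, training_seqs, n):
    # CPT+-style noise-handling sketch: assume the n least frequent items
    # are the most likely to be noise, and remove those first
    freq = Counter(item for ts in training_seqs for item in ts)
    noisy = set(sorted(set(seq), key=lambda item: freq[item])[:n])
    return [item for item in seq if item not in noisy]

training = ["ABCACBD", "CCABCBCA"]
print(remove_noisy_items("ABD", training, 1))  # D is rarest -> ['A', 'B']
```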

Best regards,

Edited 1 time(s). Last edit at 08/10/2017 12:40PM by webmasterphilfv.
