The Data Mining Forum                             open-source data mining software data science journal data mining conferences iea aie 2020
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: January 05, 2020 11:19PM

Dear friends, colleagues, scientists,

First of all, happy new year to you all! Let me explain my problem as brief as I can with examples:

I have input data which is a categorical time categorical series (uniformly timed, i.e., t2-t1 = t3-t2), obtained from multiple sensor readings (even though the readings might be a number, I use discretize it to turn the time series into a categorical one. It looks like this:

Time S1 S2 S3 S4 S5
--------------------
t0 a c e f g
t1 a c e f g
t2 a d e f h
t3 b d e f g
...

A user selects the sensors that he/she is interested in. So, if the user selects S1, S2, S3, S5, the input becomes

Time S1 S2 S3 S5
-----------------
t0 a c e g
t1 a c e g
t2 a d e h
t3 b d e g
..

AND I horizontally trace the input (so basically put the rows one after the other) and turn the input into a big single string (I was initially hoping to use the Voting Expert algorithm of Cohen and Adams, that's why. An unsupervised way to segment the text and propose them to a subject matter expert, or might even creating a Markov-chain).

So, the input becomes:

Si = acegacegadehadehbdeg....

Then, I place a size 4 (because the number of selected sensors or the number of columns) at the beginning of the text and if the pattern repeats itself in the text it expands by 4 (Again for this example) and add the pattern to repetitive pattern set. If not, the start of the window moves by 4 (so the next row in the original format) and do the same check. I assume the longest pattern can be as big as the half of the input string as it has to repeat itself at least once.

E.g. Si = |aceg|acegadehadehbdeg --> does aceg exist in Si?
Yes. Then, window size +=1. Found rep pattern = {aceg}
Si = |acegaceg|adehadehbdeg
...

And I create a prefix tree (trie)...This is a pretty common approach so, so far nothing special. Plus, as you can imagine, this is a pretty expensive thing to do without a lower or upper boundary for the window size. A lot of overlapping patterns are found by the algorithm too.

The goal is to find a way to make an elimination (maybe by a distance measure etc.). In Voting Expert Alg, it checks the inner and boundary entropy, etc.

However, I'm curious about what kind of algorithms I can use within SMPF to do what I want to do?

I'm not a data scientist per se so I really appreciate a direction.

KR,

Charles

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: January 06, 2020 12:25AM

In addition to the above question, what would be the pros and cons of:

1) Keeping the data as it is and applying episode mining techniques?
2) Or trying to apply sequential pattern mining algorithms? However, I guess that requires the long sequence data to be converted into an itemset data?. Is that necessary or can a sequential pattern mining algorithm be directly applied to a long single sequence?
3) any other way mining task (clustering) could be beneficial? Although, the paper of Eamonn Keogh Jessica Lin suggests it's meaningless to cluster subsequences.

I feel like the options are there but I am not sure which direction is a way to move forward with my type of data and my goal of discovering frequent patterns to form Markov Chains for subject matter experts to re-study systems.

Apologies if any of my questions or assumptions are nonsense.

KR,

Charles

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Date: January 06, 2020 06:00AM

Dear Charles,

Happy new year!

Very interesting topic and problem. And I also see that you have put a lot of efforts and thoughts about this.

Since this data is a very long sequence of itemsets, you could definitely apply an episode mining algorithm to that data. You could apply a frequent episode mining algorithm to find episodes that appear multiple times in your sequence.

In episode mining, if you have data like this:

t0 a c e f g
t1 a c e f g
t2 a d e f h
t3 b d e f g

you could find for example that (ac) followed by (g) has two minimal occurrences. The first one is from t0 to t1. The second one is t1 to t3. Thus, the support of (ac) followed by (g) is 2. This would be the result of some typical algorithm like EMMA or MINEPI+ on the above data. But there are also other variations in episode mining and other way to calculate the support of episodes.

However, a drawback of episode mining algorithm is that most of them use a fixed window length. So if you want to find long episodes, you need to use a large window size. Setting a large window size means that the algorithm may be less able to reduce the search space, but an episode mining algorithm may still use the support to reduce the search space.

If you want to try frequent episode mining on your data, the good news is that I will soon release the source code of these algorithms in SPMF:
- MINEPI for frequent episode mining
- MINEPI+ for frequent episode mining
- EMMA for frequent episode mining
- HUE-SPAN for high utility episode mining (not relevant for your data)
The code of these algorithms is ready. I just need to integrate them in SPMF, which may take 1 week or two because I have many things to do. But if you want access to the code early of those algorithms only, I should share it with you by e-mail.

That is for my answer to episode mining.

I will think a little about it and maybe add another message below with some more comments on your problem.

Best,
Philippe

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: January 06, 2020 06:57AM

Dear Philippe, Dear Professor,

Thank you so much for your prompt reply and extensive input.

Your feedback gave me the confidence that I was not that far off with my initial understandings, but indeed my doubts are still relevant due to the fit of my input data to the existing episode mining algorithms. I also find your suggestion of the adaptation of the frequent sequence mining algorithms for this task very intriguing. I shall indeed approach it like that to see the initial outcome. I can wait a week or two for the integration of the algorithms to SPMF, so please take your time professor. Thanks in advance for your support.

About the fix frame size aspect of the episode mining algorithms, I indeed have a big dilemma there. My ambition in there to introduce a dynamic frame expansion until the frequency of the subsequence in the frame becomes 1 (non-repetitive) comes with the cost of overblown candidate patterns. However, the goal is to find a higher level (abstract) system behaviors and the size of those behaviors in complex systems varies drastically. Some are length 3, some are length 14 and some are length 8. So, dividing the data with a fixed frame might cause not discovering a certain behavior. But I still yet to test all aspects.

All the best,

KR,

Charles

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 22, 2020 09:58PM

Dear Philippe, Dear Professor,

Apologies beforehand for bothering you with yet another question/request. In your previous answer, you mentioned that you have the implementations of MINEPI, MINEPI+, EMMA ready and you need more time for their SPMF integration.

I can imagine that you are busy due to other projects and may even be due to the current situation in China, etc. so I believe I will maybe go for the implementations instead of waiting for the SPMF if that's ok for you. I think it's only fair for you to take your time and prioritize other things.

Would you mind sharing those codes with me?

Kind regards,

Charles

webmasterphilfv Wrote:
-------------------------------------------------------
> Dear Charles,
>
> Happy new year!
>
> Very interesting topic and problem. And I also
> see that you have put a lot of efforts and
> thoughts about this.
>
> Since this data is a very long sequence of
> itemsets, you could definitely apply an episode
> mining algorithm to that data. You could apply a
> frequent episode mining algorithm to find episodes
> that appear multiple times in your sequence.
>
> In episode mining, if you have data like this:
>
> t0 a c e f g
> t1 a c e f g
> t2 a d e f h
> t3 b d e f g
>
> you could find for example that (ac) followed by
> (g) has two minimal occurrences. The first one is
> from t0 to t1. The second one is t1 to t3. Thus,
> the support of (ac) followed by (g) is 2. This
> would be the result of some typical algorithm like
> EMMA or MINEPI+ on the above data. But there are
> also other variations in episode mining and other
> way to calculate the support of episodes.
>
> However, a drawback of episode mining algorithm is
> that most of them use a fixed window length. So
> if you want to find long episodes, you need to use
> a large window size. Setting a large window size
> means that the algorithm may be less able to
> reduce the search space, but an episode mining
> algorithm may still use the support to reduce the
> search space.
>
> If you want to try frequent episode mining on your
> data, the good news is that I will soon release
> the source code of these algorithms in SPMF:
> - MINEPI for frequent episode mining
> - MINEPI+ for frequent episode mining
> - EMMA for frequent episode mining
> - HUE-SPAN for high utility episode mining (not
> relevant for your data)
> The code of these algorithms is ready. I just need
> to integrate them in SPMF, which may take 1 week
> or two because I have many things to do. But if
> you want access to the code early of those
> algorithms only, I should share it with you by
> e-mail.
>
> That is for my answer to episode mining.
>
> I will think a little about it and maybe add
> another message below with some more comments on
> your problem.
>
> Best,
> Philippe

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 22, 2020 09:59PM

I realized I didn't fill the e-mail field in my previous reply so doing it now.

Thanks.

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Philipp
Date: February 23, 2020 07:57AM

Hi Charles,

Sorry for the delay. Yes, I have been busy with many things, and the situation surrounding the virus has delayed some of my work.

It is good that you remind me. I will try to release the new version of SPMF tomorrow or worst case after tomorrow with the episode mining algorithms. I have been starting to work on it tonight but it requires a few more hours of work. Will finish tomorrow.

Best,

Philippe

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 23, 2020 07:03PM

Thanks Professor. I can only imagine the stress and the busy-ness of the situation over there. Take care.

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Philippe
Date: February 25, 2020 09:00AM

Charles Wrote:
-------------------------------------------------------
> Thanks Professor. I can only imagine the stress
> and the busy-ness of the situation over there.
> Take care.


Hi Charles, Just to let you know that I have almost finished. I still need to fix the documentation. Tomorrow, I will publish the new version with episode mining.
But now, it is 2pm. I need to go to sleep.

Best,

Philippe

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 25, 2020 05:45PM

Thanks Professor! Please take your rest, too!
It sounds amazing and I'm really looking forward to it.

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Philippe
Date: February 25, 2020 10:54PM

Hi Charles,

The new version of SPMF has been published on the website with the following new algorithms for mining frequent episodes:

EMMA
MINEPI
MINEPI+

and this new algorithm for mining high utility episodes:

HUE-SPAN

I have uploaded also the documentation pages for these algorithms.

Best regards,

Philippe

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 26, 2020 01:25AM

Thanks a lot Professor! Just downloaded!
I'm immediately going to do some testing smiling smiley exciting!

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 27, 2020 03:06AM

Dear Professor, Dear Philippe,

I have been thinking about this problem for a while and I stuck at some point. I would like to borrow your hat if that's ok with you (apologies for the long-text beforehand but i realized i need your expertise).

So far, I've been in and out with the SPMF library and what you have done is amazing. Obviously, it would have been amazing if I had a sequence database for my problem so that I could use one of the algorithms in SPMF but I think I need modifications. I would like to know what you think:

As I mentioned earlier, my data is like this (I am going to use the input style that you have given in EMMA algorithm):

e1 e2 ts (timestamp)
1 10|1
1 10|2
2 11|3
2 12|4
2 11|5
1 10|6
2 11|7
2 11|8

1) The data is simply a state output from a queueing model. In the first column, I have 2 possible values: 1 is idle, 2 is not idle.

2) The second column is the number of people. 10 being 0, it increases 1 by 1. Nothing spectacular.

So, I tried this data with EMMA algorithm and obviously the SPMF implementation treats the data as itemsets within a sequence database (if I'm not wrong), and therefore it finds frequent episodes that contain non-successive itemsets (at least when I run the algorithm, I have encountered those).

To overcome the fact that the algorithm might find episodes like this 10 -1 12 -1 1 -1 2 11 -1 where the first column and second column values are shuffled, I removed the space in my data. So, 1 10|1 is 110|1. I ran it with the following parameters:

----
minsup: 1
max time duration: 10
has no timestamps: false
----


211 -1 4.0
110 -1 3.0
211 -1 211 -1 3.0
110 -1 211 -1 3.0
110 -1 211 -1 211 -1 3.0
211 -1 110 -1 2.0
110 -1 212 -1 2.0
110 -1 110 -1 2.0
211 -1 211 -1 211 -1 2.0
211 -1 110 -1 211 -1 2.0
110 -1 211 -1 212 -1 2.0
110 -1 211 -1 110 -1 2.0
110 -1 212 -1 211 -1 2.0

Question 1) Above bold rows might be true from the reasoning of the EMMA algorithm with the ambition to find frequent itemsets with no vertical ordering (at the end, 110 111 112 and 110 112 111 might mean the same thing for a market basket transaction with no order). Am I correct? Since in my particular challenge, the episode like 110 112 111 is not desired (as there is a strict ordering relation in the data), what kind of approach should I take?

Question 2) I remember TKS algorithm has a parameter called "max gap" and setting it to "1" allows only finding patterns with consecutive itemsets (i believe). Would that be a direction? In that case, should I modify the getFrequentKepisodes method in the algorithm?

Thank you for your help and patience.

KR,

Charles

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 27, 2020 03:17AM

I realized when I read my post that one way to tackle the un-ordered episodes I addressed in my Question 1) at the bottom of my post can be addressed with mincof. Am I right?

Because of the fact that even the SUP of 110 211 212 and 110 212 211 might be the same, their lift measure should be different. Am I correct? The probability of 110 followed by 211 and 110 followed by 212 should be different and that should give it away? Apologies, if I am totally off as this is not entirely my expertise.

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: February 27, 2020 03:28AM

Not minconf but lift measure I mean.

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: cagri
Date: February 27, 2020 07:37PM

Hi Professor, Dear Philippe,

Sorry for thinking out loud in here smiling smiley

While figuring out a way to explain my problem, I also ended up reading your post
Introduction to sequential rule mining and the subsequent questions and replies.

It indeed boils down to the choice of “standard sequential rules” (SSR) or “partially ordered sequential rules” (POSR) that the particular application needs. I can most certainly see that only considering the rule {A,B} -> {C,D,F,E} is much more rigid than also unifying with other rules containing {C,F,D,E} or {C,D,E,F} etc.

However, my particular problem requires me to be more rigid and generate SSR.

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Date: February 28, 2020 05:13AM

Hi Charles,

Great you found some idea. Today and yesterday, I have been quite busy. I need to review many papers for some conference where I am part of the orgnization committee, so I did not had time to do anything else. I still has to do some reviews tonight and tomorrow. I will check what you wrote in more details probably tomorrow and try to give you some idea maybe.

Best,

Philippe

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.