The Data Mining Forum                             open-source data mining software data science journal data mining conferences iea aie 2020
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.  
Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: January 05, 2020 11:19PM

Dear friends, colleagues, scientists,

First of all, happy new year to you all! Let me explain my problem as brief as I can with examples:

I have input data which is a categorical time categorical series (uniformly timed, i.e., t2-t1 = t3-t2), obtained from multiple sensor readings (even though the readings might be a number, I use discretize it to turn the time series into a categorical one. It looks like this:

Time S1 S2 S3 S4 S5
--------------------
t0 a c e f g
t1 a c e f g
t2 a d e f h
t3 b d e f g
...

A user selects the sensors that he/she is interested in. So, if the user selects S1, S2, S3, S5, the input becomes

Time S1 S2 S3 S5
-----------------
t0 a c e g
t1 a c e g
t2 a d e h
t3 b d e g
..

AND I horizontally trace the input (so basically put the rows one after the other) and turn the input into a big single string (I was initially hoping to use the Voting Expert algorithm of Cohen and Adams, that's why. An unsupervised way to segment the text and propose them to a subject matter expert, or might even creating a Markov-chain).

So, the input becomes:

Si = acegacegadehadehbdeg....

Then, I place a size 4 (because the number of selected sensors or the number of columns) at the beginning of the text and if the pattern repeats itself in the text it expands by 4 (Again for this example) and add the pattern to repetitive pattern set. If not, the start of the window moves by 4 (so the next row in the original format) and do the same check. I assume the longest pattern can be as big as the half of the input string as it has to repeat itself at least once.

E.g. Si = |aceg|acegadehadehbdeg --> does aceg exist in Si?
Yes. Then, window size +=1. Found rep pattern = {aceg}
Si = |acegaceg|adehadehbdeg
...

And I create a prefix tree (trie)...This is a pretty common approach so, so far nothing special. Plus, as you can imagine, this is a pretty expensive thing to do without a lower or upper boundary for the window size. A lot of overlapping patterns are found by the algorithm too.

The goal is to find a way to make an elimination (maybe by a distance measure etc.). In Voting Expert Alg, it checks the inner and boundary entropy, etc.

However, I'm curious about what kind of algorithms I can use within SMPF to do what I want to do?

I'm not a data scientist per se so I really appreciate a direction.

KR,

Charles

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: January 06, 2020 12:25AM

In addition to the above question, what would be the pros and cons of:

1) Keeping the data as it is and applying episode mining techniques?
2) Or trying to apply sequential pattern mining algorithms? However, I guess that requires the long sequence data to be converted into an itemset data?. Is that necessary or can a sequential pattern mining algorithm be directly applied to a long single sequence?
3) any other way mining task (clustering) could be beneficial? Although, the paper of Eamonn Keogh Jessica Lin suggests it's meaningless to cluster subsequences.

I feel like the options are there but I am not sure which direction is a way to move forward with my type of data and my goal of discovering frequent patterns to form Markov Chains for subject matter experts to re-study systems.

Apologies if any of my questions or assumptions are nonsense.

KR,

Charles

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Date: January 06, 2020 06:00AM

Dear Charles,

Happy new year!

Very interesting topic and problem. And I also see that you have put a lot of efforts and thoughts about this.

Since this data is a very long sequence of itemsets, you could definitely apply an episode mining algorithm to that data. You could apply a frequent episode mining algorithm to find episodes that appear multiple times in your sequence.

In episode mining, if you have data like this:

t0 a c e f g
t1 a c e f g
t2 a d e f h
t3 b d e f g

you could find for example that (ac) followed by (g) has two minimal occurrences. The first one is from t0 to t1. The second one is t1 to t3. Thus, the support of (ac) followed by (g) is 2. This would be the result of some typical algorithm like EMMA or MINEPI+ on the above data. But there are also other variations in episode mining and other way to calculate the support of episodes.

However, a drawback of episode mining algorithm is that most of them use a fixed window length. So if you want to find long episodes, you need to use a large window size. Setting a large window size means that the algorithm may be less able to reduce the search space, but an episode mining algorithm may still use the support to reduce the search space.

If you want to try frequent episode mining on your data, the good news is that I will soon release the source code of these algorithms in SPMF:
- MINEPI for frequent episode mining
- MINEPI+ for frequent episode mining
- EMMA for frequent episode mining
- HUE-SPAN for high utility episode mining (not relevant for your data)
The code of these algorithms is ready. I just need to integrate them in SPMF, which may take 1 week or two because I have many things to do. But if you want access to the code early of those algorithms only, I should share it with you by e-mail.

That is for my answer to episode mining.

I will think a little about it and maybe add another message below with some more comments on your problem.

Best,
Philippe

Options: ReplyQuote
Re: Looking for an algorithm to segment multivariate categorical time series
Posted by: Charles
Date: January 06, 2020 06:57AM

Dear Philippe, Dear Professor,

Thank you so much for your prompt reply and extensive input.

Your feedback gave me the confidence that I was not that far off with my initial understandings, but indeed my doubts are still relevant due to the fit of my input data to the existing episode mining algorithms. I also find your suggestion of the adaptation of the frequent sequence mining algorithms for this task very intriguing. I shall indeed approach it like that to see the initial outcome. I can wait a week or two for the integration of the algorithms to SPMF, so please take your time professor. Thanks in advance for your support.

About the fix frame size aspect of the episode mining algorithms, I indeed have a big dilemma there. My ambition in there to introduce a dynamic frame expansion until the frequency of the subsequence in the frame becomes 1 (non-repetitive) comes with the cost of overblown candidate patterns. However, the goal is to find a higher level (abstract) system behaviors and the size of those behaviors in complex systems varies drastically. Some are length 3, some are length 14 and some are length 8. So, dividing the data with a fixed frame might cause not discovering a certain behavior. But I still yet to test all aspects.

All the best,

KR,

Charles

Options: ReplyQuote


This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.