Re: SPMF format for sentiment analysis
Date: June 17, 2018 11:17PM
Sorry for the delay in answering. I saw your e-mail but was too busy over the last few days. Below are some answers, opinions and suggestions.
How to represent the data is always a good question, because depending on the representation you choose, you may obtain different results from a data mining algorithm.
One possibility is that each sequence represents the sequence of emotions associated with the posts of one politician. You would then have several sequences, as you have several politicians. If you do it that way, you do not need to encode the user id: the first sequence would be the first user, the second sequence would be the second user, etc.
I am not sure if this is what makes the most sense, but it is the idea I had when reading your message.
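To make this concrete, here is a small Python sketch of that representation. The record layout (politician, timestamp, emotion id), the names and the sample values are only assumptions for illustration; they are not taken from your data and are not part of SPMF.

```python
from collections import defaultdict

# Hypothetical records: (politician, ISO date of the post, emotion id 1-7).
posts = [
    ("alice", "2018-06-03", 2),
    ("bob", "2018-06-01", 5),
    ("alice", "2018-06-01", 1),
    ("alice", "2018-06-07", 3),
    ("bob", "2018-06-04", 1),
]

def build_sequences(records):
    """Return one list of emotion ids per politician, in chronological order."""
    by_user = defaultdict(list)
    for user, timestamp, emotion in records:
        by_user[user].append((timestamp, emotion))
    sequences = []
    # The user id is not stored in the output: only the position of the
    # sequence (first, second, ...) identifies the user.
    for user in sorted(by_user):
        ordered = sorted(by_user[user], key=lambda pair: pair[0])
        sequences.append([emotion for _, emotion in ordered])
    return sequences

sequences = build_sequences(posts)
```

Here the timestamps are used only to sort the posts of each politician; they are then dropped, and only the order remains.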
> 2- What about time stamp, i need to keep that
> field? does it make sense if I map chronological
> time stamps to 1, 2,3,4 ....for each userID?
In a sequence, you can have the sequential order between posts.
Now, is it worth being more specific and also keeping the timestamps? Maybe not. In my software SPMF, only a few algorithms can actually use timestamps. Most of them only care about the sequential order.
Some algorithms in SPMF that use timestamps, such as the Hirate-Yamana algorithm, are very strict about how time is handled. For example, a pattern (a time=1)(b time=2) is considered to be different from a pattern (a time=1)(b time=3) because the time difference between "a" and "b" is not the same in the two patterns. Thus, using timestamps may not always give you good results. So my suggestion is to first try without them, unless you really want to use the algorithms with timestamps.
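The following Python sketch only illustrates why the two patterns in this example do not match once time is taken into account. It is not the Hirate-Yamana implementation; the function name and the (item, time) representation are assumptions made for the illustration.

```python
def relative_times(pattern):
    """Shift the times of a pattern so that its first item is at time 0.

    `pattern` is a list of (item, time) pairs (a hypothetical representation).
    """
    start = pattern[0][1]
    return [(item, time - start) for item, time in pattern]

p1 = [("a", 1), ("b", 2)]
p2 = [("a", 1), ("b", 3)]

# Looking only at the sequential order, both patterns are "a then b".
same_order = [item for item, _ in p1] == [item for item, _ in p2]

# Looking also at the time gaps, they differ: the gap is 1 in p1 and 2 in p2.
same_timing = relative_times(p1) == relative_times(p2)
```

With strict time matching, the two occurrences are counted as two different patterns instead of supporting a single pattern "a then b", which is why results with timestamps can be disappointing.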
> 3- as for the emotion tags I am mapping them to
> integer values 1-7.
This seems reasonable.
I think, yes, you could have sequences like this:
1 -1 2 -1 3 -1 -2
which means that a post with emotion 1 was followed by a post with emotion 2, which was followed by a post with emotion 3. In this format, -1 marks the end of each post (itemset) and -2 marks the end of the sequence.
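As a minimal sketch, a line in this format could be produced from a list of emotion ids as follows. The function name is hypothetical, and it assumes that each post carries exactly one emotion.

```python
def to_spmf_line(emotions):
    """Format a list of emotion ids as one sequence line.

    Each post is one itemset containing a single item and ending with -1;
    the whole sequence ends with -2.
    """
    parts = []
    for emotion in emotions:
        parts.append(str(emotion))
        parts.append("-1")
    parts.append("-2")
    return " ".join(parts)

line = to_spmf_line([1, 2, 3])
print(line)
```

Writing one such line per politician to a text file gives an input file for the sequential pattern mining algorithms that use this format.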
Or as you said, you can try to include timestamps.
By the way, not all algorithms in SPMF use the same input format. The format that I have described above is the one used by most sequential pattern mining algorithms. But you could also consider finding other types of patterns, such as periodic patterns in a single sequence. In that case, another format must be used. I think you can check the various algorithms available in the documentation to see which is best for what you want to do.
Edited 1 time(s). Last edit at 06/20/2018 01:25AM by webmasterphilfv.