The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Right approach for problem unclear
Date: July 08, 2020 12:39PM
First of all, thank you for the great tool and the effort behind it.
I struggle to find the correct approach to tackle the following problem:
I have a sequence of nodes that represent applications/tools that are used by a (human) user to identify characteristics of a "target computer". About 250 tools are possible in total.
Each sequence shows the past use of the tools, e.g. A->B->C->D (first use tool A, then B...).
Each tool can (or not) yield "valuable" information, meaning add attributes to the target. Example:
After tool A, you get "operating system = Linux", tool B adds "Ubuntu", tool C adds nothing, tool D adds "SSH Version 3.4" etc.
Some tools use the gained attributes as input, others do not.
I want to create a "prediction" / recommendation for which tool to use next based on the previous tool use in the user's current sequence, and a repository of previously observed sequences (tools + gained attributes).
Which approach would be smart here? Frequent pattern mining (rank / compare to antecedent), association rule mining, HMMs? I am slightly overwhelmed with the many potential routes. Thank you!
Re: Right approach for problem unclear
Date: July 17, 2020 01:12AM
Sorry for the long delay to answer. My schedule has been very busy this week. I tried to answer you before but I lost the message and had to write again.
Happy that the software is useful. And welcome to the forum.
I think that there are many algorithms that could be applied and it depends a bit on what you do and also how you prepare the data.
I think that the basic data that you have is a set of sequences. You could consider the sequence database format, as used by sequential rule mining and sequential pattern mining algorithms.
In SPMF a sequence is an ordered list of itemsets (sets of items), where items in an itemset are considered to appear simultaneously.
So in your case, a sequence could be like this:
<(Tool1), (Tool2,Windows), (Tool3, SSHPortOPEN, HTTPPortOPEN)>
which means that Tool1 was applied, then Tool2 was applied and we found that the operating system is Windows, and then Tool3 was applied and we found that two ports are open.
To encode this in SPMF for sequential pattern mining that sequence would look like this:
1 -1 2 3 -1 4 5 6 -1 -2
where 1 would represent Tool1, 2 would represent Tool2, 3 would represent Windows, 4 would represent Tool3, 5 would represent SSHPortOPEN and 6 would represent HTTPPortOPEN. -1 is a separator that ends each itemset and -2 indicates the end of the sequence.
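If your data is stored with tool and attribute names, a small script can produce this encoding for you. The following is only a sketch in Python, not part of SPMF: the function name `encode_sequences` and the nested-list input layout are assumptions made for illustration. Item IDs are sorted inside each itemset because, as far as I know, SPMF expects the items of an itemset to follow a consistent order.

```python
def encode_sequences(sequences):
    """Turn sequences of itemsets (lists of names) into SPMF-style lines.

    Returns the encoded lines and the name -> integer ID mapping.
    """
    item_ids = {}  # assigned in order of first appearance, starting at 1
    lines = []
    for sequence in sequences:
        tokens = []
        for itemset in sequence:
            ids = set()
            for item in itemset:
                if item not in item_ids:
                    item_ids[item] = len(item_ids) + 1
                ids.add(item_ids[item])
            # Write the itemset in ascending ID order, then close it with -1.
            tokens.extend(str(i) for i in sorted(ids))
            tokens.append("-1")
        tokens.append("-2")  # end of the sequence
        lines.append(" ".join(tokens))
    return lines, item_ids


sequences = [
    [["Tool1"], ["Tool2", "Windows"], ["Tool3", "SSHPortOPEN", "HTTPPortOPEN"]],
]
lines, item_ids = encode_sequences(sequences)
print(lines[0])  # 1 -1 2 3 -1 4 5 6 -1 -2
```

Keep the returned mapping so that you can translate the numbers in the mined patterns back to tool and attribute names.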
So you could have an input file with many sequences like that and then apply a sequential pattern mining algorithm like TKS to find frequent sequences of tools that people use:
(tool1)(tool2, windows) for example
Or you could apply a sequential rule mining algorithm to find rules like this:
(tool1) --> (tool2, windows) support: 24 confidence 60 %
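To make the support and confidence of such a rule concrete, here is a rough sketch in Python. It assumes a common definition of sequential rules, where a rule X --> Y holds in a sequence if all items of X appear before all items of Y, support is the number of sequences where the rule holds, and confidence is that number divided by the number of sequences containing X. The function names and the toy data are made up for illustration, and the exact definition depends on the algorithm you choose.

```python
def rule_occurs(sequence, antecedent, consequent):
    """True if every antecedent item appears strictly before every consequent item."""
    remaining = set(antecedent)
    for i, itemset in enumerate(sequence):
        remaining -= set(itemset)
        if not remaining:
            # Antecedent is complete at position i (the earliest possible point),
            # so the consequent must be covered by the itemsets that follow.
            later = set().union(*sequence[i + 1:])
            return set(consequent) <= later
    return False


def support_and_confidence(sequences, antecedent, consequent):
    """Return (support as a sequence count, confidence as a fraction)."""
    with_antecedent = sum(
        1 for s in sequences if set(antecedent) <= set().union(*s)
    )
    with_rule = sum(1 for s in sequences if rule_occurs(s, antecedent, consequent))
    confidence = with_rule / with_antecedent if with_antecedent else 0.0
    return with_rule, confidence


# Toy data, invented for this example.
toy_sequences = [
    [["Tool1"], ["Tool2", "Windows"], ["Tool3"]],
    [["Tool1"], ["Tool3"], ["Tool2", "Windows"]],
    [["Tool1"], ["Tool2", "Linux"]],
    [["Tool2", "Windows"], ["Tool1"]],
    [["Tool4"], ["Tool2", "Windows"]],
]
support, confidence = support_and_confidence(
    toy_sequences, ["Tool1"], ["Tool2", "Windows"]
)
print(support, confidence)  # 2 0.5
```

Here Tool1 appears in four sequences, and in two of them Tool2 and Windows are observed afterwards, which gives a support of 2 and a confidence of 50 %.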
Or you could apply a sequence prediction algorithm such as CPT+, offered in SPMF, to predict the next tool that someone will use.
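Before trying a dedicated prediction model, it may help to have a very simple baseline to compare against. The sketch below is not CPT+ and is not taken from SPMF. It is a plain first-order transition count (which tool most often followed the last tool used), written in Python, and it ignores the gained attributes. The function names and the toy history are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def train_transitions(tool_sequences):
    """Count how often each tool was directly followed by each other tool."""
    counts = defaultdict(Counter)
    for seq in tool_sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def recommend_next(counts, current_sequence, k=3):
    """Return up to k tools that most often followed the last tool used."""
    if not current_sequence:
        return []
    last = current_sequence[-1]
    return [tool for tool, _ in counts.get(last, Counter()).most_common(k)]


# Toy history of tool usage, invented for this example.
history = [
    ["A", "B", "C", "D"],
    ["A", "B", "D"],
    ["B", "C"],
    ["A", "C"],
]
counts = train_transitions(history)
print(recommend_next(counts, ["A", "B"]))  # ['C', 'D']
```

If a prediction model does not clearly beat a baseline like this on held-out sequences, the extra complexity is probably not paying off for your data.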
Or, if you think the sequential order is not important, you could also consider the association rule or itemset mining algorithms, but they don't consider time or ordering.
Maybe you can also find some other ideas.
Hope this gives you some ideas!