The Data Mining Forum
This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to the source code of data mining algorithms, or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!
Any ideas or suggestions for my data mining project?
Posted by: Ju PENG
Date: April 22, 2013 12:57AM

I work in the field of classification and data mining. Here is my issue:

There are about 2 million invoices that need to be classified. All of these invoices are images, but we have already extracted the data from the images and exported it to XML, so we have 2 million XML files. Each invoice has a provider. Most invoices carry a key number that identifies the provider, but others don't. There is also other useful information, for example the email address and the website (not all invoices have them). Each provider uses its own invoice template, so the structure differs from one provider to another. My job is to classify all these invoices by provider.

At first, I used the provider number to classify the documents, and it worked for 80% of the files. Then I used the website or email address already associated with a provider number, and that solved 10% more. But I have no idea what to do next, because the data was extracted from the image files by OCR, so some words are garbled (unclear images, handwriting).

Now I think it would be better to classify the files by their structure, but I haven't worked out how. Do you have any good ideas? Thank you!

Re: Any ideas or suggestions for my data mining project?
Date: April 22, 2013 05:41AM


It is an interesting problem.

One idea could be to use an algorithm such as K-Nearest Neighbors (KNN). It works as follows. You have a set of instances, each described by attributes with values, and each with a known type (here, the provider). To determine the type of a new instance, KNN compares it with the known instances and assigns the type that is most common among the k most similar ones. To use KNN, you need to define a similarity measure between instances. For example, the similarity could be the number of attributes that have the same value.
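To make the KNN idea concrete, here is a minimal sketch in Python. It assumes each invoice has already been reduced to a dictionary of attributes; the attribute names ("city", "currency", "layout") and the provider labels are made up for illustration and are not taken from the real invoices.

```python
from collections import Counter

def similarity(a, b):
    """Count the attributes that have the same value in both instances."""
    return sum(1 for key, value in a.items() if key in b and b[key] == value)

def knn_classify(new_instance, training_set, k=3):
    """Return the most common label among the k most similar training instances."""
    ranked = sorted(
        training_set,
        key=lambda pair: similarity(new_instance, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical labeled invoices: (attributes, provider).
training_set = [
    ({"city": "Lyon", "currency": "EUR", "layout": "two-column"}, "provider_A"),
    ({"city": "Lyon", "currency": "EUR", "layout": "two-column"}, "provider_A"),
    ({"city": "Paris", "currency": "EUR", "layout": "single"}, "provider_B"),
    ({"city": "Paris", "currency": "USD", "layout": "single"}, "provider_B"),
]

new_invoice = {"city": "Lyon", "currency": "EUR", "layout": "single"}
print(knn_classify(new_invoice, training_set, k=3))  # provider_A
```

This version compares the new invoice with every training instance, which is the simplest way to do it and not necessarily the fastest.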

You could also try other classifiers such as neural networks, decision trees, etc.

Another idea would be to look for other data that is specific to a provider. For example, is there a product that is offered by only one provider? If so, we can conclude that all invoices listing that product come from the same provider. It would be possible to write a simple algorithm that scans for values that are specific to a provider.
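A rough sketch of such a scan, in Python, could look like this. It assumes the invoices that are already classified are available as (fields, provider) pairs; the field names, values and provider labels are hypothetical.

```python
from collections import defaultdict

def provider_specific_values(labeled_invoices):
    """Map each (field, value) pair seen for exactly one provider to that provider."""
    providers_by_value = defaultdict(set)
    for fields, provider in labeled_invoices:
        for field, value in fields.items():
            providers_by_value[(field, value)].add(provider)
    return {
        pair: next(iter(providers))
        for pair, providers in providers_by_value.items()
        if len(providers) == 1
    }

def classify_by_specific_value(fields, specific):
    """Return the provider of the first provider-specific value found, else None."""
    for pair in fields.items():
        if pair in specific:
            return specific[pair]
    return None

# Hypothetical invoices that were already classified by provider number.
labeled = [
    ({"product": "toner X200", "currency": "EUR"}, "provider_A"),
    ({"product": "toner X200", "currency": "EUR"}, "provider_A"),
    ({"product": "paper A4", "currency": "EUR"}, "provider_B"),
    ({"product": "paper A4", "currency": "EUR"}, "provider_C"),
]

specific = provider_specific_values(labeled)
print(specific)
print(classify_by_specific_value({"product": "toner X200", "currency": "USD"}, specific))
```

A value that appears on only one or two invoices may be specific to a provider just by chance, so in practice it may be safer to keep only values seen on a minimum number of invoices.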

Besides, you said that you are comparing the addresses, emails, etc. Do you use partial matching? I mean, when comparing the emails, addresses, etc., you could try to detect whether some letters are missing or were replaced by other letters by the OCR.
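As an illustration of partial matching, here is a small sketch that uses SequenceMatcher from Python's standard difflib module. The email addresses and the 0.85 threshold are assumptions chosen for the example; a suitable threshold would have to be tuned on the real data.

```python
from difflib import SequenceMatcher

def best_match(ocr_text, known_values, threshold=0.85):
    """Return the known value most similar to ocr_text, or None if no
    candidate reaches the similarity threshold."""
    best_value, best_score = None, 0.0
    for candidate in known_values:
        score = SequenceMatcher(None, ocr_text.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value if best_score >= threshold else None

# Hypothetical emails already linked to a provider number.
known_emails = [
    "billing@acme-supplies.example",
    "contact@northwind.example",
]

# OCR read the letter "l" as the digit "1" in two places.
print(best_match("bi11ing@acme-supp1ies.example", known_emails))
```

Returning None when nothing is similar enough leaves the invoice unclassified instead of forcing it onto the closest provider.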

Hope this helps,


Re: Any ideas or suggestions for my data mining project?
Posted by: Ju PENG
Date: April 22, 2013 06:37AM

Hi, Philippe

Thanks a lot for your answer!

I thought about using the K-Nearest Neighbors algorithm. It could be a solution, but there is a problem: the 2 million invoices come from more than 6000 providers. That is to say, I would have to build 6000 classes, each with several instances (with 5 per class, that is 30,000 instances in the training set), so the training set seems too large. For each new instance, should I compute the similarity with all the instances in the training set? I think that is too much. Also, some providers don't have a key number, so we cannot build a category for them; if we used KNN, their invoices would be assigned to the wrong class.

You are right: because the data comes from OCR, some words are garbled, with letters missing or replaced. I ran into this problem when I compared the emails and addresses.

I am really confused by this problem. If the words were read correctly, there would be fewer problems.

Thanks for your help. I will think about the other algorithms you mentioned.



This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).