Sunday, May 26, 2013

Text Mining Performance in RapidMiner

I load-tested RapidMiner 5.3 on my laptop (Core i3, 8 GB RAM, non-SSD hard drive). Here are the results.

I allowed Java to use up to 6500 MB of memory.

I used the Read Database operator to load the documents. Each document consisted of random Latin words and was 20 to 500 words long.

The text processing was deliberately simple: tokenize each document and build a binary word vector.
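
As a rough illustration of that step outside RapidMiner, here is a minimal Python sketch of tokenizing documents and building binary word vectors (1 if a word occurs in the document, 0 otherwise). The function names `tokenize` and `binary_word_vectors` are made up for this example, and splitting on non-letter characters is an assumption; this is not RapidMiner's actual implementation.

```python
import re


def tokenize(text):
    """Split text into tokens on runs of non-letter characters (assumed rule)."""
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]


def binary_word_vectors(documents):
    """Return (vocabulary, vectors); each vector has one 0/1 entry per vocabulary word."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct token seen in any document, sorted for stable column order.
    vocabulary = sorted(set().union(*token_sets))
    vectors = [
        [1 if word in tokens else 0 for word in vocabulary]
        for tokens in token_sets
    ]
    return vocabulary, vectors


docs = ["lorem ipsum dolor", "ipsum ipsum amet"]
vocab, vecs = binary_word_vectors(docs)
print(vocab)  # column order of the vectors
print(vecs)   # one 0/1 row per document
```

Note that a repeated word ("ipsum" in the second document) still produces a single 1, which is what distinguishes a binary vector from a term-count vector.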

I then stored the results in the RapidMiner repository, which creates a binary file.

In a separate process, I then read the stored results back and applied a Naive Bayes model to them. I didn't run this step for every dataset size, but there wasn't much difference among the ones I did. As you can see, applying the model is quite fast.


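| Records | Time to process + store (s) | Peak memory (GB) | Stored results file size (MB) | Time to apply (s) |
|---------|-----------------------------|------------------|-------------------------------|-------------------|
| 100     | 0                           | 0.400            | 0.223                         | 1                 |
| 1,000   | 1                           | 0.576            | 2.1                           | 0                 |
| 10,000  | 8                           | 1.3              | 21                            | 1                 |
| 20,000  | 15                          | 2.4              | 42                            |                   |
| 30,000  | 23                          | 2.6              | 63                            |                   |
| 40,000  | 30                          | 2.9              | 84                            |                   |
| 50,000  | 39                          | 3.8              | 105                           | 5                 |
| 60,000  | 48                          | 4.0              | 126                           | 5                 |
| 70,000  | 56                          | 4.1              | 148                           |                   |
| 80,000  | 66                          | 4.5              | 168                           |                   |
| 90,000  | 71                          | 4.7              | 190                           |                   |
| 100,000 | 88                          | 5.3              | 211                           |                   |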
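
To make the scaling easier to see, the measurements above can be turned into rates. The following is a small Python sketch that uses three rows copied from the results; the helper name `rates` is made up for this example, and it assumes 1 MB = 1000 KB.

```python
# (records, process_and_store_seconds, stored_file_mb), copied from the results above.
rows = [
    (10_000, 8, 21),
    (50_000, 39, 105),
    (100_000, 88, 211),
]


def rates(records, seconds, file_mb):
    """Return (records processed per second, stored KB per record)."""
    # Assumption: 1 MB = 1000 KB.
    return records / seconds, file_mb * 1000 / records


for records, seconds, file_mb in rows:
    per_second, kb_per_record = rates(records, seconds, file_mb)
    print(f"{records:>7,} records: {per_second:,.0f} records/s, {kb_per_record:.2f} KB/record")
```

For these rows the throughput stays between roughly 1,100 and 1,300 records per second and the stored size stays at about 2.1 KB per record, so both processing time and file size grow close to linearly with the number of records.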


The Store operator was much faster than the Write Database operator.

1 comment:

  1. Dear Neil,

    I have been following you for about a year. Recently, I did a binary classification in RapidMiner and applied three different algorithms (SVM, k-NN and Naive Bayes). I got the results, but my supervisor has asked me to report the threshold for the classification. How can I report the threshold that RapidMiner used for each of them? I know there is an operator named Find Threshold (Meta), but I was not sure that it was the correct one. It gave me the same threshold for all of the algorithms! I would be grateful if you could point me in the right direction.

    Regards,
    Reza
