Vancouver Data Blog by Neil McGuigan: Text Analytics with RapidMiner Part 1 of 6

Monday, November 8, 2010

Text Analytics with RapidMiner Part 1 of 6 - Loading Text

I'll be releasing a new video on text mining with RapidMiner every day this week.

Each one is about 10 minutes long, goes into a fair amount of detail, and should be easy to follow. Your feedback is appreciated!

Here is the first one. It covers loading text into RapidMiner in a variety of ways, from copy and paste to HTML files to database reads.

*NOTE: You may need to use the Nominal To Text operator to turn your text field into a field that RapidMiner understands as "text". It's under Data Transformation, Type Conversion.
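
For readers who think in code, here is a rough Python sketch of the same three loading routes: pasted text, an HTML file, and a database read. It is only an analogy and not how RapidMiner works internally. The `TextExtractor` class, the `reviews` table, its columns, and the sample strings are all made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip pure whitespace.
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def load_from_database(conn):
    """Read (id, body) rows from the assumed 'reviews' table."""
    return conn.execute("SELECT id, body FROM reviews ORDER BY id").fetchall()


# 1. Copy and paste: the text is simply a string.
pasted = "RapidMiner makes text mining approachable."

# 2. HTML: strip the markup and keep the text.
page = "<html><body><h1>Review</h1><p>Great <b>product</b>!</p></body></html>"
from_html = html_to_text(page)

# 3. Database read: an in-memory SQLite database stands in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO reviews (body) VALUES (?)",
    [("Loved it",), ("Not for me",)],
)
from_db = load_from_database(conn)

# Whatever the source, every document ends up as plain text.
documents = [pasted, from_html] + [body for _id, body in from_db]
print(documents)
```

The last line is the code-level counterpart of the note above: regardless of where the data comes from, it has to end up typed as text before any text processing can happen.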

Later this week:

Tuesday: Processing Text in RapidMiner - tokenizing, stripping HTML, stemming, stopwords, n-grams, and word frequency tables.

Wednesday: Association rules with text in RapidMiner - making word vectors, finding frequent item-sets and high-confidence association rules in text documents.

Thursday: Finding similar documents: how to automatically calculate the similarity between documents. TF-IDF, cosine similarity and K-Means clustering are covered.

Friday: Automatic classification: How to classify documents into classes (like positive/negative reviews, or spam/not spam or sports/finance/leisure news), and which words are important.

NEW: Applying A Model To New Documents
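
To make Tuesday's and Thursday's topics a little more concrete, here is a minimal Python sketch of tokenizing, stopword removal, TF-IDF weighting, and cosine similarity. It is an illustration under stated assumptions, not RapidMiner's implementation: the stopword list is a tiny made-up one, the sample documents are invented, and the TF-IDF formula (relative term frequency times the log of inverse document frequency) is one common variant that may differ from the weighting RapidMiner uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "a", "is", "and", "of", "on", "in", "to", "was"}


def tokenize(text):
    """Lowercase the text, split it into runs of letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "The movie was great and the acting was great",
    "Great acting in a boring movie",
    "The stock market fell on Friday",
]
vectors = tfidf_vectors(docs)
sim_reviews = cosine_similarity(vectors[0], vectors[1])  # two movie reviews
sim_mixed = cosine_similarity(vectors[0], vectors[2])    # review vs. finance news
print(sim_reviews, sim_mixed)
```

The two movie reviews share several terms and so score higher than the review paired with the finance sentence, which shares none. Clustering and classification build on exactly this kind of word vector.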

Hope you enjoy them.

See my other data mining videos here

56 comments:

Rene, November 9, 2010 at 3:24 AM
thank you very much for the tutorial!

Unknown, November 9, 2010 at 10:57 AM
Good start to this series, Neil!

Anonymous, November 9, 2010 at 2:14 PM
WOW!!!
I've been working with RapidMiner for two months already to make a positive/negative classification... It's been so hard, as there was no help on text mining :(
And you just posted a full scenario!!!
Can't wait till Friday!
Thanks a lot!

Neil McGuigan, November 9, 2010 at 2:43 PM
@Anonymous. No worries. I couldn't find a lot of in-depth text mining videos either, which is why I decided to make 'em. Good luck with your project. Let us know more about it when you're done!

Anonymous, November 18, 2010 at 9:41 AM
We need a link to where the dataset can be found.

Neil McGuigan, November 18, 2010 at 10:27 AM
Here is the sample data.

It's in zipped xls format

Neil

Vani, December 1, 2010 at 8:47 AM
Great Video!
Can you tell me how to export my RapidMiner result to an Excel file? I have more than 500,000 records in my result.

I would greatly appreciate your response.

RK, December 18, 2010 at 11:44 AM
Hi Neil,

I have all the IT customer feedback information for various cases in one Excel sheet. If I want to do a few operations like tokenizing, stemming, etc., what would be the root operator that can read the data from Excel? I hope you got my question; please ask me if you do not get it. An early response would be really helpful.

Thanks RK.

Neil McGuigan, December 28, 2010 at 6:30 PM
@RK, sorry for the delay, was out of town.

You should be able to use the Read Excel operator (use the search function to find it). You can only use .xls files, and not .xlsx.

Neil McGuigan, December 28, 2010 at 6:32 PM
@Vani, you can use the Write Excel operator to output your results to Excel. The result may be too large for Excel to handle, though, in which case you may want to consider a database, such as the free MS SQL Express.

Michael Kahler, January 1, 2011 at 9:08 PM
Thank you very much for these videos on text analytics! They are not only informative, but you have made them easy to follow! You were looking for input on videos for either web crawling or web scraping. I would like to put my vote in for web scraping, although I can see how web crawling would be useful first. Thank you again.
Michael Kahler

DMX, April 30, 2011 at 11:31 AM
Thank you very much for this tutorial. I have been trying to extract information from texts and have not been able to do it. Specifically, I want to extract protein names from full-text biological articles. Can you give me a hint on how to do that? I would really appreciate it. Are there any text plugins for RapidMiner that extract words from full-text articles?

Neil McGuigan, May 21, 2011 at 12:07 PM
@DMX, check out the information extraction plugin for RapidMiner. I believe there is a link to it on the RapidMiner forum.

Dhiera, June 28, 2011 at 6:31 AM
Hi Neil,

I would like to process Twitter messages, and my dataset is in Excel, but I could not find any video for further guidance. I tried to explore it myself but it doesn't work. Do you have any notes or videos on that? Any help is much appreciated. Thank you.

Anonymous, June 29, 2011 at 1:19 PM
Hi Neil,
When I tried the Process Documents from Files operator on more than one file, the results showed only one file being handled. How can I solve this problem? Thank you...

Peter Simard, August 31, 2011 at 12:10 PM
Hi;

I've gone through your videos on RapidMiner's text mining capabilities and found them very interesting. Agility is currently reviewing different systems that provide text analytics capabilities. We are reviewing a couple, one of them being the Calais system. This system has an example application (http://viewer.opencalais.com/) that demonstrates some of its capabilities. I was wondering if you are familiar with Calais and whether you feel RapidMiner is comparable to Calais with respect to the type of output generated by the Calais test application.

Peter

zahid saeed, September 23, 2011 at 10:17 PM
Please extend the videos from 10 to 15 minutes and go a little slower so that we beginners are able to follow you.

Gunjan, October 11, 2011 at 11:04 PM
Hi Neil,

Do you have any video on "Clustering" through Rapid Miner?

Regards
Gunjan

Neil McGuigan, October 23, 2011 at 10:16 AM
@Gunjan, video 4 briefly discusses clustering.

Anonymous, December 3, 2011 at 3:43 PM
Hi Neil, nice tutorial. I am working with the course co-training dataset (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/). But when I use the "Process Documents from Files" operator to load the pages that are in the folder, it shows me an error: 'The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.'

Do you have any suggestions?

sana, December 27, 2011 at 10:16 PM
hi neil,

I tried to enter some text using your suggestion for Create Document, but it shows only code. I don't know what I am doing incorrectly. I have just downloaded RapidMiner, so your videos are really a blessing. Keep up the good work, please.

Unknown, January 19, 2012 at 3:11 PM
Very useful videos for beginners. Cannot thank you enough

zx, February 26, 2012 at 6:46 AM
Hi, Neil

Your videos about text mining with RapidMiner are really good, but I am from China and it is difficult to access YouTube since it is blocked here. Would you please share these videos with me? My Gmail account is zhxupc@gmail.com.

Thanks,
Xiao

Neil Patel, March 8, 2012 at 7:11 PM
So I was trying to do the association learning example on Wikipedia with the bread, butter, milk and beer example set, and I got it to work, just not by using Excel as my imported data source. I had to create 5 different text files, one with each customer's grocery list. When I tried to create a table in Excel (first column = customer's name, second column = sale 1, third column = sale 2, fourth column = sale 3, fifth column = sale 4; first row = labels, rows 2-6 = each customer's sales), I could not create a binary table of true and false values in RapidMiner. I think I was able to import my data correctly, because when I put a breakpoint after the Process Documents operator and examined the result, it looked exactly like it does in Excel. I was wondering if you could perhaps tell me what I'm doing wrong. The only thing I can think of is that I'm splitting up the data in Excel across too many columns.
Kindest,

Neil P

Julian, April 5, 2012 at 7:00 PM
Hi, your videos are great. I was wondering if you have any idea how to combine Google Analytics reports with text mining techniques that you present here. Thanks!

Anonymous, April 18, 2012 at 12:59 PM
Hi, your videos are a great resource. I'm working on sentiment analysis, where I have text/sentences labeled as positive, negative, or mediocre, in CSV format. I'm applying Process Documents (transform case, tokenize, stopwords, stemming, and n-grams) --> X-Validation --> Naive Bayes --> Apply Model --> Performance (Classification).

My problem is that the above process gives me the prediction, whereas I'd like to get the probability distribution of the text given the category. Is there any other operator that I can use to calculate the probability distribution for each category (positive, negative, mediocre)? Also, the cross-entropy and log-loss performance measures come out as infinite.

Any pointers? I appreciate your help.

arnoldscc, June 10, 2012 at 6:40 AM
This comment has been removed by the author.

arnoldscc, June 10, 2012 at 6:41 AM
Hi Neil, a few weeks ago I tried to replicate the example you presented in video 3 (Association rules with text) from an Excel data source that contained people's comments, one per cell in the first column, but the binary table is not created when I apply the Tokenize operator. I also tried splitting each comment into its own Excel file, and there it works. I would greatly appreciate it if you could help me solve this complication.

Anonymous, July 19, 2012 at 11:20 AM
Much better than the training from the RapidMiner guys :-( Thanks a lot, Neil.

NOI, August 11, 2012 at 7:04 PM
This comment has been removed by the author.

NOI, August 11, 2012 at 7:07 PM
Hi Neil,

Thanks a lot for the videos. I watched them all and they are very concise, to the point, and really helpful for beginners to get started quickly with the tool. I was wondering if you can give me some additional hints regarding text clustering. This area is often sidelined in ML since it’s unsupervised. (I'm working with Weka at the moment.)
I am trying to cluster a whole set of text definitions using K-means, for example: flu - common symptom of cold, etc. I have a couple of thousand definitions like this. The idea is that I present a document or abstract and, based on the clusters I created from the definitions, it decides what the document is predominantly talking about by seeing which cluster it falls closest to. As you suggest in the video, the process is to use the bag of words, filter for stop words and create vectors. However, this being an unsupervised method, I’m not quite sure how to go about it.
Again your videos are great.

Thanks,

Daniel

Anonymous, September 10, 2012 at 7:56 AM
Hi Neil!
Thanks for the videos.
I'm new to RapidMiner, and I want to classify customers' reviews from online shops. I think RapidMiner fits well. What can you suggest? How can I start? I tried to copy the reviews in as text, but it didn't give the desired result. Please help! :)
Thanks.


Anonymous, October 5, 2012 at 3:41 AM
Hi Neil
Thanks for the videos as they are quite helpful.
I am quite new to RapidMiner and I am working on a project on text clustering with the k-means algorithm. Can you please suggest how to do this in RapidMiner? Can I use a database in Excel for text clustering with k-means?

Regards

Apache08, November 21, 2012 at 10:54 PM
Neil, God bless you, nicest video ever, very helpful... Neil, what about the WordNet extension? Could you help me figure it out? I need to use it to apply clustering with synonyms inside documents.

dany, November 28, 2012 at 3:42 AM
Hi Neil
I'm doing a project in RapidMiner and I'm trying to connect these three operators: read excel - get pages - database. It should be easy, but I'm getting errors like "Write Database....java.lang.NullPointerException: Identifier must not be null", or MySQL throws an exception. I'm quite new to this field, so I don't know where I went wrong: in my database setup or connection, or whether I get this error because all the data I'm trying to get is from www.daft.ie and their server is probably stopping me.
Can you help me with this? Do you have any ideas that could guide me?
Many thanks.

Anonymous, December 27, 2012 at 8:47 AM
Hi Neil,
Thanks for the great videos :). I have a question which I hope you can answer.
When reading the ID and MESSAGE columns from a CSV file, is it possible to keep the ID field when using the Process Documents from Data operator, so that after tokenizing, the relation between each word and its sentence is kept? Thanks!

Unknown, January 1, 2013 at 10:02 PM
Hello Neil,
I am trying to extract certain words on the basis of a wordlist dictionary I created.
But somehow I am unable to do it.
Can you please suggest how to do this in RapidMiner?
Thanks!!!!

Unknown, January 2, 2013 at 6:43 AM
Hello,
I am using Process documents from files operator.
Is there any way to extract the tokens which match one of the words in the list of words in a wordlist?

Unknown, February 25, 2013 at 6:25 AM
Neil,
After completing different processes of text mining, I get 10,000+ tokens in the output from 500 documents. I am trying to export them and create a database. Can you please help me by suggesting the steps to export the process results in a database / excel sheet?

Thanks a lot.

Unknown, February 27, 2013 at 1:03 PM
@NEIL
Hi
I am working on a project on the extraction and analysis of faculty performance in the management discipline from student feedback. Can you please help me by suggesting methods for extracting the data and tokenizing it? Please help me, I am in a hurry...

Unknown, March 18, 2013 at 9:15 AM
Hello Neil,

Thanks for all your videos and help. You're great! :) I've got a question concerning the "Process Documents from Data" operator. I want RapidMiner to open downloaded HTML files on my hard disk and process them. I have it read a CSV file that contains about 50 file paths of the HTML files I'd like to process. That works well, but it doesn't open the files listed in the CSV to process their content. Is there any way to make RapidMiner open multiple file paths (that are not in the same directory), read the HTML files, and process them?

I would be very thankful if you could give me some advice.

Best regards,
Enrico

Unknown, April 4, 2013 at 7:42 PM
Hello M. McGuigan,

Thanks for the great videos! Very, very, very helpful!!! One short question: I'm doing k-means clustering on 100 objects that were created using the «Cut Document» operator. When looking at the results in the ExampleSet sheet, the «text» column only shows the tokens used for the classification process, not the full text of the objects. Is there something I can do to get access to the full text of the items from each cluster?

Thanks for your support
John D.

Anonymous, April 9, 2013 at 3:26 AM
Hello, I want to check the similarity between 2 files. Which operator is best to use: the Data to Similarity operator or the Cross Distances operator?
Thanks

LIn~PeI LIn, April 14, 2013 at 9:25 AM
Hi Neil,

It's great to have your guidance on using RapidMiner to process text, but with the new version of RapidMiner, I'm not able to process the document data using Tokenize. May I know what the problem is?

The new version of the Process Documents from Data operator doesn't have "create word vector".

Anonymous, April 14, 2013 at 11:06 PM
Thank you so much Neil, this video is so helpful!

nate lindstedt, May 2, 2013 at 11:45 AM
Hi,

Thanks for this. Part 6: "Applying A Model To New Documents" on YouTube should be included in the text analytics playlist on that site. Right now, it is not. That means you have to either find it or stumble across it separately on your YouTube channel. I only now discovered that there even was a Part 6 to this tutorial series due to that. I think for ease of use, you should fix this. I know that Part 6 was done after the fact, but it naturally belongs with the rest of the series.

Other than that small issue, these tutorials have been great and very helpful! Thank you!

Anonymous, May 29, 2013 at 2:59 AM
Wow, thanks a lot for these tutorials. I'm new to RapidMiner and this was exactly what I was looking for on the internet.

jesus, June 7, 2013 at 2:37 AM

Thanks for your excellent guide man

Data Mining


Unknown, July 22, 2013 at 11:25 PM
How do I export the results in Excel format?

Marco, August 1, 2013 at 3:27 AM
Very, very useful! Thank you for the effort!

prabhashanth, November 20, 2013 at 9:48 PM
I am new to RapidMiner. I have installed RapidMiner on Windows 8, but I don't have the "Update RapidMiner" option that would let me update Text Processing and Web Mining; I only have "Update RapidMiner Marketplace". How can I update Text Processing and Web Mining?