
Monday, November 8, 2010

Text Analytics with RapidMiner Part 1 of 6 - Loading Text

I'll be releasing a new video on text mining with RapidMiner every day this week.

They're all about 10 minutes long, go into a fair amount of detail, and should be easy to understand. Your feedback is appreciated!

Here is the first one. It's about loading text into RapidMiner in a variety of ways, from copy and paste, to HTML files, to database reads.

*NOTE: You may need to use the Nominal To Text operator to turn your text field into a field that RapidMiner understands as "text". It's under Data Transformation, Type Conversion.

Later this week:

Tuesday: Processing Text in RapidMiner - tokenizing, stripping HTML, stemming, stopwords, n-grams, and word frequency tables.

Wednesday: Association rules with text in RapidMiner - making word vectors, finding frequent item-sets and high-confidence association rules in text documents.

Thursday: Finding similar documents: how to automatically calculate the similarity between documents. TF-IDF, cosine similarity and K-Means clustering are covered.

Friday: Automatic classification: How to classify documents into classes (like positive/negative reviews, or spam/not spam or sports/finance/leisure news), and which words are important.

NEW: Applying A Model To New Documents
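
For readers who would like to see what Thursday's topics look like outside of RapidMiner, here is a minimal sketch in plain Python of tokenizing, stopword removal, TF-IDF weighting, and cosine similarity. It is not how RapidMiner implements these steps. The function names, the tiny stopword list, the sample documents, and the tf * log(N / df) weighting are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "this", "to", "was"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents: two movie reviews and one finance headline.
docs = [
    "The movie was great and the acting was great",
    "Great acting in this movie",
    "The stock market fell sharply",
]
vecs = tfidf_vectors(docs)
sim_reviews = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
print(sim_reviews, sim_unrelated)
```

The two reviews share terms, so their similarity is above zero, while the review and the finance headline share none and score zero. A clustering step such as K-Means would then group documents using vectors like these.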

Hope you enjoy them.



See my other data mining videos here

56 comments:

  1. thank you very much for the tutorial!

  2. Good start to this series, Neil!

  3. WOW!!!
    I've been working with Rapid Miner for two months already to make a positive/negative classification... It's been so hard as there was no help in text mining :(
    And you just posted a full scenario!!!
    Can't wait till Friday!
    Thanks a lot!

  4. @Anonymous. No worries. I couldn't find a lot of in depth text mining videos either, which is why I decided to make 'em. Good luck with your project. Let us know more about it when you're done!

  5. We need a link to where the dataset can be found.

  6. Here is the sample data.

    It's in zipped xls format

    Neil

  7. Great Video!
    Can you tell me the process of exporting my rapidminer result into an excel file? I have more than 500,000 records in my result.

    I would greatly appreciate your response.

    Replies
    1. You can use Write Excel as Neil points out, but if you've already created the table and it took a while to run, you probably don't want to re-run with the Write Excel operator just to save the data. Unfortunately, there is no built-in support for exporting the table to Excel in the free version (there's a post to this effect somewhere on the RapidMiner forum), but you can copy and paste the entire table into Excel (you'll have to get the headers some other way, though, because I don't know how to copy them). --Pat

  8. Hi Neil,

    I have all the IT customer feedback information for various cases in one Excel sheet. If I want to do a few operations like tokenizing, stemming, etc., what would be the root operator that can read the data from Excel? I hope you understand my question. Please ask me if you do not. An early response would be really helpful.

    Thanks RK.

  9. @RK, sorry for the delay, was out of town.

    You should be able to use the Read Excel operator (use the search function to find it). You can only use .xls files, and not .xlsx.

  10. @Vani, you can use the Write Excel operator to output your results to Excel. The result may be too large for Excel to handle though, in which case you may want to consider a database, such as the free MS SQL Express.

    Replies
    1. Can you provide a simple guide on how to configure the Write Excel operator? I can't get it to work. Thanks.

  11. Thank you very much for these videos on text analytics! They are not only informative, but you have made them easy to follow! You were looking for input on videos for either web crawling or web scraping. I would like to put my vote in for web scraping, although I can see how web crawling would be useful first. Thank you again.
    Michael Kahler

  12. Thank you very much for this tutorial. I have been trying to extract information from texts and have not been able to do it. Specifically, I want to extract protein names from full-text biology articles. Can you give me a hint on how to do that? I would really appreciate it. Are there any text plugins for RapidMiner that extract words from full-text articles?

  13. @DMX check out the information extraction plugin for rapidminer. I believe there is a link to it on the RapidMiner forum.

  14. Hi neil,

    I would like to process Twitter messages, and my dataset is in Excel, but I could not find any video for further guidance. I tried to explore it myself but it doesn't work. Do you have any notes or a video on that? Any help is much appreciated. Thank you

  15. Hi Neil,
    When I tried the Process Documents from Files operator with more than one file, the results show only one file being handled. How can I solve this problem? Thank you...

  16. Hi;

    I've gone through your videos on RapidMiner's text mining capabilities and found them very interesting. Agility is currently reviewing different systems that provide text analytics capabilities. We are reviewing a couple, one of them being the Calais system. This system has an example application (http://viewer.opencalais.com/) that demonstrates some of its capabilities. I was wondering if you are familiar with Calais, and whether you feel RapidMiner is comparable to it with respect to the type of output generated by the Calais test application?

    Peter

  17. Please extend the video time from 10 to 15 minutes, but go a little slower so that we beginners are able to follow you.

  18. Hi Neil,

    Do you have any video on "Clustering" through Rapid Miner?

    Regards
    Gunjan

  19. @gunjan video 4 briefly discusses clustering

  20. Hi Neil, nice tut. I am working with the course co-training dataset (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/). But when I use the operator "Process document for files" to load the pages that are in the folder, it shows me an error 'The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen..
    '

    Do you have any suggestions?

  21. hi neil,

    I tried to enter some text using your suggestion for Create Document, but it shows only code. I don't know what I am doing incorrectly. I have just downloaded RapidMiner, so your videos are really a blessing. Keep up the good work, please.

  22. Very useful videos for beginners. Cannot thank you enough

  23. Hi, Neil

    Your videos about text mining with RapidMiner are really good, but I am from China and it is difficult to access YouTube since it is blocked in China. Would you please share these videos with me? My Gmail account is zhxupc@gmail.com.

    Thanks,
    Xiao

  24. So I was trying to do the association learning example on Wikipedia with the bread, butter, milk and beer example set, and I got it to work... just not by using Excel as my imported data source. I had to create 5 different text files, one with each customer's grocery list. When I tried to create a table in Excel (first column = customer's name, second column = sale 1, third column = sale 2, fourth column = sale 3, fifth column = sale 4 | first row = labels, rows 2-6 are each customer's sales), I could not create a binary table of true and false values in RapidMiner. I think I was able to import my data correctly, because when I put a breakpoint after the Process Documents operator and examined the data, it looked exactly like it does in Excel. I was wondering if you could perhaps tell me what I'm doing wrong. The only thing I can think of is that I'm splitting up the data in Excel across too many columns.
    Kindest,

    Neil P

  25. Hi, your videos are great. I was wondering if you have any idea how to combine Google Analytics reports with text mining techniques that you present here. Thanks!

  26. Hi, your videos are a great resource.. I'm working on Sentiment Analysis, where I have a text/sentence which is labeled as positive, negative, mediocre, in CSV format.. I'm applying Process document(transform case, tokenize, stopwords, stemming, and n-gram) --> X-validation -->Using Naive Bayes -->Apply model -->Performance(classification)..

    My problem is that the above process gives me the prediction, whereas I'd like to get the probability distribution of text given category.. Is there any other operator that I can use to calculate probability distribution for each category(positive, negative, mediocre).. Also, the cross entropy and log loss measures of performance, I'm getting infinite..

    Any pointers?? I appreciate your help..

    Replies
    1. Use the Naive Bayes Process

  28. Hi Neil, a few weeks ago I tried to replicate the example you presented in video 3 (Association rules with text) from an Excel data source that contained people's comments, one per cell in the first column, but the binary table is not created when the Tokenize operator is applied. I also tried splitting each comment into a separate Excel file, and that works. I would greatly appreciate your help in solving this.

  29. Much better than the training from the RapidMiner guys :-(.. Thanks a lot Neil..

  31. Hi Neil,

    Thanks a lot for the videos, watched them all and they are very concise, to the point and really helpful for beginners to get started quickly with the tool. I was wondering if you can give me some additional hints regarding text clustering. This area is often side-lined in ML since it’s unsupervised. (working with Weka at the moment. )
    I am trying to cluster a whole set of text definitions using K-means, for example: flu - common symptom of cold, etc. I have a couple of thousand definitions like this. The idea is that I present a document or abstract and, based on the clusters I created from the definitions, it decides what the document is predominantly talking about by seeing which cluster it falls closest to. Like you suggest in the video, the process is to use the bag of words, filter for stop words and create vectors. However, this being an unsupervised method, I'm not quite sure how to go about it.
    Again your videos are great.

    Thanks,

    Daniel

    Replies
    1. Hi DMAN,

      Glad you liked the videos. If you check out video 4, it shows how to do text clustering. Also, the upcoming RapidMiner book has some good stuff on clustering as well.

      Cheers,

      Neil

    2. Hey Neil,

      Thank you for the reply.

      I looked at all the videos, you have no idea how much they helped me. Just asked the question, to get over the top the normal functionality, so to speak. Clustering techniques are a bit shadowed, especially evaluation measures that give you a real understanding of the quality of results.

      What book are you talking about please...

      regards,

      Daniel

  32. Hi Neil!
    Thanks for the videos.
    I'm new to RapidMiner, and I want to classify customers' reviews from online shops. I think RapidMiner fits well. What can you suggest? How can I start? I tried to copy the reviews in as text, but it didn't give the desired result. Please help! :)
    Thanks.

  33. Hi Neil
    Thanks for the videos as they are quite helpful.
    I am quite new to RapidMiner and I am working on a project on text clustering with the k-means algorithm. Can you please suggest how to do this in RapidMiner? Can I use data stored in Excel for text clustering with k-means?

    Regards

  34. Neil, God bless you, nicest video ever. Very helpful... Neil, what about the WordNet extension? Could you help me figure it out? I need to use it to apply clustering with synonyms inside documents.

  35. Hi Neil
    I'm doing a project in RapidMiner and I'm trying to connect these three operators: Read Excel, Get Pages, and Write Database. It should be easy, but I'm getting errors like "Write Database....java.lang.NullPointerException: Identifier must not be null", or MySQL throws an exception. I'm quite new to this field, so I don't know where I went wrong: whether it is my database setup or connection, or whether I get this error because all the data I'm trying to fetch is from www.daft.ie and their server is probably blocking me.
    Can you help me with this? Do you have any ideas that could guide me?
    Many thanks.

  36. Hi Neil,
    Thanks for the great videos :). I have a question which I hope you can answer.
    When reading the columns ID and MESSAGE from a CSV file, is it possible to keep the ID field when using the Process Documents from Data operator, so that after tokenizing, the relation between each word and its sentence is kept? Thanks!

  37. Hello Neil,
    I am trying to extract certain words on the basis of a wordlist dictionary I created.
    But somehow I am unable to do it.
    Can you please suggest how to do this in RapidMiner?
    Thanks!!!!

  38. Hello,
    I am using the Process Documents from Files operator.
    Is there any way to extract the tokens that match one of the words in a wordlist?

  39. Neil,
    After completing different processes of text mining, I get 10,000+ tokens in the output from 500 documents. I am trying to export them and create a database. Can you please help me by suggesting the steps to export the process results in a database / excel sheet?

    Thanks a lot.

  40. @NEIL
    Hi
    I am working on a project on the extraction and analysis of faculty performance in the management discipline from student feedback. Can you please help me by suggesting methods for extracting and tokenizing the data? Please help me, I am in a hurry....

  41. Hello Neil,

    Thanks for all your videos and help. You're great! :) I've got a question concerning the "Process Documents from Data" operator. I want RapidMiner to open downloaded HTML files on my hard disk and process them. I let it "Read a CSV" file that contains about 50 file paths of the HTML files I'd like to process. That works well, but it doesn't open the files listed in the CSV to process their content. Is there any way to make RapidMiner open multiple file paths (that are not in the same directory), read the HTML files and process them?

    I would be very thankful if you could give me some advice.

    Best regards,
    Enrico

  42. Hello M. McGuigan,

    Thanks for the great videos! Very, very, very helpful!!! One short question: I'm doing k-means clustering on 100 objects that were created using the «cut document» operator. When looking at the results in the example set sheet, the column «text» only shows the tokens used for the classification process, but not the full text of the objects. Is there something I can do to have access to the full text of the items in each cluster?

    Thanks for your support
    John D.

  43. Hello, I want to check the similarity between 2 files. Which operator is best to use: the Data to Similarity operator or the Cross Distances operator?
    Thanks

  44. Hi Neil,

    It's great to have your guidance on using RapidMiner to process text, but with the new version of RapidMiner I am not able to process the document data using Tokenize. May I know what the problem is?

    The new version of the Process Documents from Data operator doesn't have "create word vector".

  45. Thank you so much Neil, this video is so helpful!

  46. Hi,

    Thanks for this. Part 6: "Applying A Model To New Documents" on YouTube should be included in the text analytics playlist on that site. Right now, it is not. That means you have to either find it or stumble across it separately on your YouTube channel. I only now discovered that there even was a Part 6 to this tutorial series due to that. I think for ease of use, you should fix this. I know that Part 6 was done after the fact, but it naturally belongs with the rest of the series.

    Other than that small issue, these tutorials have been great and very helpful! Thank you!

  47. Wow, thanks a lot for these tutorials. I'm new to RapidMiner and this was exactly what I was looking for on the internet.



  48. Thanks for your excellent guide man





  49. How to export the results in excel format.....

  50. Very, very useful! Thank you for the effort!

  51. I am new to RapidMiner. I have installed RapidMiner on Windows 8, but I don't have the "Update RapidMiner" option that I would use to install Text Processing and Web Mining; I only have "Update RapidMiner Marketplace". How can I install Text Processing and Web Mining?
