mining

Adding CURE clustering algorithm to WEKA

I have written a Java program that performs CURE clustering. I wish to add this program to Weka as a clustering algorithm and visualize the clustering. Has anyone already implemented it in Weka? Any links to that would be very helpful. How do I proceed? Regards, JSpider


Source: (StackOverflow)

Which is the best clustering algorithm to find outliers?

Basically I have some hourly and daily data, like:

Day 1

Hours, Measure: (1,21) (2,22) (3,27) (4,24)

Day 2

Hours, Measure: (1,23) (2,26) (3,29) (4,20)

Now I want to find outliers in the data by considering both the hourly and the daily variations, using bivariate analysis on the (hour, measure) pairs.

So which clustering algorithm is best suited to finding outliers in this scenario?
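
For what it's worth, density-based clustering is a common fit here, because points that fall in no dense region are flagged as noise. A minimal sketch using scikit-learn's DBSCAN (the library choice and the eps/min_samples values are assumptions, not part of the question):

from sklearn.cluster import DBSCAN
import numpy as np

# each row is an (hour, measure) pair, pooled across days
X = np.array([[1, 21], [2, 22], [3, 27], [4, 24],
              [1, 23], [2, 26], [3, 29], [4, 20]])

# eps and min_samples are guesses and must be tuned on the real data
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(X)

# DBSCAN labels points belonging to no dense region as -1
print(X[labels == -1])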


Source: (StackOverflow)


R text file and text mining...how to load data

I am using the R package tm and I want to do some text mining. The input is one document, treated as a bag of words.

I don't understand from the documentation how to load a text file and create the necessary objects to start using features such as:

stemDocument(x, language = map_IETF(Language(x)))

So assume this is my document: "this is a test for R load".

How do I load the data for text processing and create the object x?


Source: (StackOverflow)

Algorithm to calculate similarity between texts

I am trying to score similarity between posts from social networks, but I haven't found any good algorithms for it. Any thoughts?

I have tried Levenshtein, Jaro-Winkler, and others, but those are better suited to comparing texts without sentiment. In posts we can get one text saying "I really love dogs" and another saying "I really hate dogs"; we need to classify this pair as totally different.

Thanks
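
A minimal sketch of one possible direction, combining plain string similarity with a sentiment check (Python's standard-library difflib plus NLTK's VADER analyzer; both library choices are assumptions beyond what the question mentions):

from difflib import SequenceMatcher
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')

def post_similarity(a, b):
    # surface similarity of the two strings, in [0, 1]
    surface = SequenceMatcher(None, a, b).ratio()
    sia = SentimentIntensityAnalyzer()
    # compound score lies in [-1, 1]; opposite signs mean opposite polarity
    sent_a = sia.polarity_scores(a)["compound"]
    sent_b = sia.polarity_scores(b)["compound"]
    if sent_a * sent_b < 0:
        return 0.0  # opposite sentiment: treat the posts as totally different
    return surface

print(post_similarity("I really love dogs", "I really hate dogs"))  # prints 0.0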


Source: (StackOverflow)

Web mining - classification algorithms

My senior project is determining the dominant category of a web page. I crawled DMOZ, and now I am trying to build an ARFF file. After that I will use some feature extraction methods and classification algorithms. Do you know which feature extraction method performs well with which classification algorithm for web mining?
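
One commonly used combination is TF-IDF features with chi-squared feature selection feeding a linear classifier. A minimal sketch with scikit-learn (the library and the toy data are assumptions; the question itself names no tools):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

pages = ["python pandas tutorial", "premier league football scores",
         "machine learning course", "champions league results"]
labels = ["tech", "sports", "tech", "sports"]

# TF-IDF vectorizer -> keep the k terms most correlated with the labels -> linear SVM
clf = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=5), LinearSVC())
clf.fit(pages, labels)
print(clf.predict(["deep learning tutorial"]))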


Source: (StackOverflow)

Starting with Data Mining

I have started learning data mining and wish to create a small project in C++/Java that allows me to use a database, say from Twitter, and then publish a particular set of results (for example, all the news items on a feed). I want to know how to go about it. Where should I start?


Source: (StackOverflow)

Converting a Document Term Matrix into a Matrix with lots of data causes overflow

Let's do some Text Mining

Here I stand with a term-document matrix (from the tm package):

dtm <- TermDocumentMatrix(
     myCorpus,
     control = list(
         weighting = weightTfIdf,
         tolower = TRUE,
         removeNumbers = TRUE,
         minWordLength = 2,
         removePunctuation = TRUE,
         stopwords = stopwords("german")
      ))

When I do a

typeof(dtm)

I see that it is a "list" and the structure looks like

Docs
Terms        1 2 ...
  lorem      0 0 ...
  ipsum      0 0 ...
  ...        .......

So I try a

wordMatrix = as.data.frame( t(as.matrix(  dtm )) ) 

That works for 1,000 documents, but when I try 40,000 it doesn't work anymore.

I get this error:

Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein
Zusätzlich: Warnmeldung:
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt

Translation: Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA. In addition: Warning message: In nr * nc : NAs produced by integer overflow.

So I looked at as.matrix, and it turns out that the function converts the object to a vector with as.vector and then to a matrix. The conversion to a vector works, but the conversion from the vector to the matrix doesn't.

Do you have any suggestions as to what the problem could be?

Thanks, The Captain
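
For what it's worth, the underlying problem is that densifying a 40,000-document matrix overflows R's integer-based indexing; the usual cure, in any language, is to keep the matrix sparse. A sketch of that idea in Python with scipy.sparse, purely to illustrate the principle (the question's code is R):

import numpy as np
from scipy.sparse import coo_matrix

# toy triplet (i, j, v) representation, analogous to tm's internal slots
rows = np.array([0, 1, 2])        # term indices
cols = np.array([0, 0, 1])        # document indices
vals = np.array([1.0, 2.0, 3.0])  # weights

# 40,000 columns stay cheap because only nonzero entries are stored
tdm = coo_matrix((vals, (rows, cols)), shape=(3, 40000))
print(tdm.shape, tdm.nnz)  # (3, 40000) 3

# slice rows in CSR form instead of densifying the whole matrix
print(tdm.tocsr()[0].sum())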


Source: (StackOverflow)

Text Mining with SVM Classifier

I want to apply SVM classification for a text-mining task using Python and NLTK, and get precision, recall, and accuracy measurements. To do this, I preprocessed the dataset and split it into two text files, pos_file.txt (positive label) and neg_file.txt (negative label). Now I want to apply an SVM classifier with random sampling: 70% of the data for training and 30% for testing. I saw some scikit-learn documentation, but I am not exactly sure how to apply it.

Both pos_file.txt and neg_file.txt can be considered bags of words.

Sample file: pos_file.txt

stackoverflowerror restor default properti page string present
multiprocess invalid assert fetch process inform
folderlevel discoveri option page seen configur scope select project level

Sample file: neg_file.txt

class wizard give error enter class name alreadi exist
unabl make work linux
eclips crash
semant error highlight undeclar variabl doesnt work

Furthermore, it would be interesting to apply the same approach to unigrams, bigrams, and trigrams. Thanks, and I am looking forward to your suggestions or sample code.
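
A minimal sketch of the requested 70/30 split with scikit-learn (the file names follow the question; the vectorizer settings are assumptions, and ngram_range extends to bigrams and trigrams):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# one preprocessed document per line, as in the sample files
pos = open("pos_file.txt").read().splitlines()
neg = open("neg_file.txt").read().splitlines()
texts = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

# ngram_range=(1, 1) is unigrams; use (1, 2) or (1, 3) for bi-/trigrams
vec = CountVectorizer(ngram_range=(1, 1))
X = vec.fit_transform(texts)

# random 70/30 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=42)

clf = LinearSVC().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # precision, recall, F1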


Source: (StackOverflow)

PL/SQL issue concerning Frequent Itemset

I'm trying to build a PL/SQL application to mine frequent item sets out of a set of given data and I've run into a bit of a snag. My PL/SQL skills aren't as good as I'd like them to be, so perhaps one of you can help me understand this a bit better.

So to begin, I'm using the Oracle data mining procedure DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL.

While reading the documentation, I came across the following example, which I have adapted to query over my data set:

CREATE OR REPLACE TYPE FI_VARCHAR_NT AS TABLE OF NUMBER;
/

CREATE TYPE fi_res AS OBJECT (
  itemset      FI_VARCHAR_NT,
  support      NUMBER,
  length       NUMBER,
  total_tranx  NUMBER
);
/

CREATE TYPE fi_coll AS TABLE OF fi_res;
/

CREATE OR REPLACE PROCEDURE freq_itemset_test IS
  CURSOR freqC IS
    SELECT itemset
    FROM table(
      CAST(DBMS_FREQUENT_ITEMSET.FI_TRANSACTIONAL(
             CURSOR(SELECT sale.customerid, sale.productid
                    FROM Sale
                    INNER JOIN Customer ON customer.customerid = sale.customerid
                    WHERE customer.region = 'Canada'),
             0, 2, 2, NULL, NULL) AS fi_coll));
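  -- Hedged aside: to also reference support programmatically, one option is to
  -- select it alongside itemset in the cursor above, e.g.
  --   SELECT itemset, support FROM table(...)
  -- declare a second variable (supp NUMBER;), and FETCH freqC INTO coll_nt, supp;
  -- individual items are then reachable as coll_nt(i) FOR i IN 1..coll_nt.count.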
  coll_nt  FI_VARCHAR_NT;
  num_rows INT;
  num_itms INT;
BEGIN
  num_rows := 0;
  num_itms := 0;
  OPEN freqC;
  LOOP
    FETCH freqC INTO coll_nt;
    EXIT WHEN freqC%NOTFOUND;
    num_rows := num_rows + 1;
    num_itms := num_itms + coll_nt.count;
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('Rows: ' || num_rows || ' Columns: ' || num_itms);
  CLOSE freqC;
END;

My reasoning for using Oracle's FI_TRANSACTIONAL over straight SQL is that I will need to repeat this analysis for multiple dynamic values of K, so why reinvent the wheel? Ultimately, my goal is to reference each individual item set returned by the procedure and return the set with the highest support, based on some query logic. I will be incorporating this block of PL/SQL into another one that changes the literal in the query from 'Canada' to other regions, based on the content of the data.

My question is: how can I actually get a programmatic reference to the data returned by the cursor (freqC)? Obviously I do not need to count the rows and columns, but that was part of the example. I'd like to print out the item sets with DBMS_OUTPUT.PUT_LINE after I've found the most frequently occurring item set. When I view this in a debugger, I see that each fetch of the cursor returns an item set (in this case k=2, so two items). But how do I actually touch them programmatically? I'd like to grab the sets themselves as well as fi_res.support.

As always, thanks to everyone for sharing their brilliance!


Source: (StackOverflow)

Example to train a hidden Markov model using MALLET (machine learning for language engineering)

I need an HMM library for sequence modeling, to label sentences in text. For this I explored the MALLET source-code example "TrainHMM", but I am stuck because I have no reference for, or description of, the training and testing files. Please help.

Regards, Ashish


Source: (StackOverflow)

Mining/crawling the web console with PhantomJS or something else?

I want to create an application whose behavior is directly related to that of another web application. Essentially, there is an application that runs within Gmail and dynamically interacts with the interface based on the actions of the user.

The problem I am running into is that I want to make an application that interacts with that web application, but they do not offer an open API. As such, I can't just call an API for the data I need.

When I open the developer console in Chrome, I can see the application running, along with the debugging messages it prints as the user acts.

Is there any way I can capture that dynamic activity, using something like PhantomJS, and base the behavior of another application on it?

"If the console says "X", run script "Z" in this other application."

I am clearly not an engineer, but I want to get an idea of whether something like this is possible.

It is a very hacky way to deal with a closed API. I can't see your code or use it, but if I can watch it work, doesn't it seem logical that I could record that in real time and interact with it from another application?
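
For the record, PhantomJS exposes page console output through its onConsoleMessage callback. A sketch of the same watch-the-console idea in Python with Selenium driving Chrome (an assumed substitute setup, not something named in the question; run_z_script is a hypothetical placeholder):

from selenium import webdriver

def run_z_script():
    # hypothetical hook into the other application
    print("console trigger seen, running Z")

# ask chromedriver to record what the page writes to the console
options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"browser": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://mail.google.com")  # the page whose console we want to watch

# read the browser console and react to a specific message
for entry in driver.get_log("browser"):
    if "X" in entry["message"]:
        run_z_script()
driver.quit()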


Source: (StackOverflow)

poclbm not reporting hashes to deepbit or slush

I run poclbm on my system, but for some reason both deepbit and slush don't "see" the work being performed. My system reports about 200 megahashes per second. I tried mining with my CPU using the same settings, and then both deepbit and slush recognized that work was being performed.

These are the errors I am getting from the respective mining programs (every minute or so):

poclbm error: pit.deepbit.net:8332 22/02/2013 21:50:59, Verification failed, check hardware! (0:0:Cypress, d47b7ba0)

cgminer error: [2013-02-22 22:18:51] GPU0: invalid nonce - HW error

I am using Ubuntu 12.10 (Quantal Quetzal) with the 12.10 version of poclbm and an ATI 5800-series video card. The video drivers are installed and work as far as I can tell. When I run "aticonfig --odgc --adapter=all", the GPU does seem to be utilized by poclbm (around 70% utilization or so).


Source: (StackOverflow)

Algorithms for mapping data in data mining

I need to scrape some web pages and extract content from them. I'm planning to select some specific keywords and map the data that has some relationship between them, but I have no idea how I could do that. Could anyone suggest some algorithms for doing it?

For example, I need to download some web pages about apples, map the relevant data about apples, and store it in a database, so that if someone needs specific information about it, I can provide it quickly and accurately.

It would also be helpful to point out useful libraries. I'm planning to do it in Python.
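
A minimal sketch of the fetch-and-extract step with requests and BeautifulSoup (common Python choices, but assumptions here; the URL and keyword are placeholders), keeping sentences that mention the keyword so they can later be stored in a database:

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/apples"  # placeholder
KEYWORD = "apple"

html = requests.get(URL, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ")

# naive sentence split; keep sentences that mention the keyword
hits = [s.strip() for s in text.split(".") if KEYWORD in s.lower()]

# these (url, sentence) pairs could go into a database keyed by keyword
for sentence in hits:
    print(URL, "->", sentence)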


Source: (StackOverflow)

In-line ASM addressing in OSX

I am getting this error when compiling on OSX:

fatal error: error in backend: 32-bit absolute addressing is not supported in 64-bit mode

It is caused by this line of inline ASM:

movq    grsoT0(,%%rdi,8), %%mm1

OSX does not allow absolute addressing here; I need to convert this to relocatable addressing. Could you help me? I don't know how, and I don't really understand this line. (I am trying to port a piece of software to OSX.)
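
For context, the usual fix on 64-bit OS X is RIP-relative addressing: load the table's address into a spare register first, then index through that register. A hedged, untested sketch of what the replacement could look like (the scratch register would also need to be added to the asm clobber list):

leaq    grsoT0(%%rip), %%r11    /* take the table address PC-relative */
movq    (%%r11,%%rdi,8), %%mm1  /* then index through the register */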


Source: (StackOverflow)

Text mining using R to count frequency of words

I want to count the occurrences of the word "uncertainty", but only if "economic policy" or "legislation" or words pertaining to policies appear in the same text. Right now, I have come up with R code to count the frequency of all words in the text, but it does not discern whether the counted words occur in the right context. Do you have any suggestions on how to rectify this?

library(tm) #load text mining library
setwd('D:/3_MTICorpus') #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("D:/3_MTICorpus"),readerControl=list(reader=readPlain))
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "available", "via")
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords 
#library(SnowballC)
#ae.corpus <- tm_map(ae.corpus, stemDocument)

ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 3))
inspect(ae.tdm)
findFreqTerms(ae.tdm, lowfreq=2)
findAssocs(ae.tdm, "economic", 0.7)
d <- Dictionary(c("economic", "uncertainty", "policy"))
inspect(DocumentTermMatrix(ae.corpus, list(dictionary = d)))
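
The missing piece is conditioning the count on context. A sketch of that logic in Python, purely to illustrate the idea (the question's pipeline is R; the file pattern and context-term list are assumptions):

import glob

CONTEXT = {"economic policy", "legislation", "policy"}

for path in glob.glob("D:/3_MTICorpus/*.txt"):
    text = open(path, encoding="utf-8").read().lower()
    # count "uncertainty" only when a context term appears in the same text
    if any(term in text for term in CONTEXT):
        print(path, text.count("uncertainty"))
    else:
        print(path, 0)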

Source: (StackOverflow)