TextBlob
Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
TextBlob: Simplified Text Processing — TextBlob 0.10.0-dev documentation
I have been using TextBlob, a package for Python (https://pypi.python.org/pypi/textblob), for translating articles into different languages.
After reading the docs, I learned that TextBlob makes use of Google Translate. Since Google Translate is not a free service, I wanted to know whether there is any usage limit on translating articles through TextBlob.
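For reference, a minimal sketch of the translate call being asked about, based on the quickstart usage (translate() sends the text to Google's translation endpoint under the hood; in newer TextBlob releases it has been deprecated, so availability depends on the installed version):
from textblob import TextBlob
blob = TextBlob("Simple is better than complex.")
# requires network access and is subject to Google's own rate limiting
print(blob.translate(to="es"))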
Source: (StackOverflow)
I am working on a text search project and using TextBlob to search for sentences in a text.
TextBlob pulls all the sentences containing the keywords efficiently. However, for effective research I also want to pull out the sentence before and the sentence after each match, which I have been unable to figure out.
Below is the code I am using:
def extract_sents(text, word):
    search_words = set(word.split(','))
    sents = ''.join([s.lower() for s in text])
    blob = TextBlob(sents)
    matches = [str(s) for s in blob.sentences if search_words & set(s.words)]
    print(search_words)
    print(matches)
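One way to also capture the neighbouring sentences is to iterate over blob.sentences by index instead of filtering directly; a minimal sketch under the same assumptions as the code above (names are illustrative):
from textblob import TextBlob

def extract_sents_with_context(text, word):
    search_words = set(word.split(','))
    sentences = TextBlob(text.lower()).sentences
    matches = []
    for i, s in enumerate(sentences):
        if search_words & set(s.words):
            # keep the previous and next sentence when they exist
            start = max(i - 1, 0)
            end = min(i + 2, len(sentences))
            matches.append(' '.join(str(x) for x in sentences[start:end]))
    return matches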
Source: (StackOverflow)
I am using TextBlob's NaiveBayesClassifier for text analysis according to themes that I have chosen.
The data is huge (about 3,000 entries).
Though I was able to get a result, I'm not able to save it for future use without calling that function again and waiting hours until the processing completes.
I tried pickling by the following method
ab = NaiveBayesClassifier(data)
import pickle
object = ab
file = open('f.obj','w') #tried to use 'a' in place of 'w' ie. append
pickle.dump(object,file)
and I got an error, which is as follows:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\pickle.py", line 1370, in dump
Pickler(file, protocol).dump(obj)
File "C:\Python27\lib\pickle.py", line 224, in dump
self.save(obj)
File "C:\Python27\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python27\lib\pickle.py", line 419, in save_reduce
save(state)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Python27\lib\pickle.py", line 663, in _batch_setitems
save(v)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "C:\Python27\lib\pickle.py", line 615, in _batch_appends
save(x)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 562, in save_tuple
save(element)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Python27\lib\pickle.py", line 662, in _batch_setitems
save(k)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 501, in save_unicode
self.memoize(obj)
File "C:\Python27\lib\pickle.py", line 247, in memoize
self.memo[id(obj)] = memo_len, obj
MemoryError
I also tried with sPickle but it also resulted in errors such as:
#saving object with function sPickle.s_dump
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sPickle.py", line 22, in s_dump
for elt in iterable_to_pickle:
TypeError: 'NaiveBayesClassifier' object is not iterable
#saving object with function sPickle.s_dump_elt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sPickle.py", line 28, in s_dump_elt
pickled_elt_str = dumps(elt_to_pickle)
MemoryError: out of memory
Can anyone tell me what I have to do to save the object?
Or is there any way to save the results of the classifier for future use?
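As an aside, pickle files are normally opened in binary mode and written with a binary protocol; a minimal sketch of that pattern (it will not by itself cure a MemoryError, which usually means the trained classifier simply does not fit in the available memory):
import pickle

ab = NaiveBayesClassifier(data)
with open('f.obj', 'wb') as f:        # binary mode, not 'w'
    pickle.dump(ab, f, protocol=2)    # binary protocol is smaller and faster

with open('f.obj', 'rb') as f:        # reload later without retraining
    ab = pickle.load(f)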
Source: (StackOverflow)
My goal is to create a system that can take any random text, extract its sentences, remove punctuation, and then, on one of the bare sentences, randomly replace NN- or VB-tagged words with their meronym, holonym, or synonym, as well as with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use the pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want, but when I try to extract, say, a hyponym from this djidja variable, it proves impossible, since it is a Synset object and I can't manipulate it in any way.
Any idea how to extract the actual word reported in the hyponyms list (e.g. print(djidja[2]) displays Synset(u'bowler'), so how do I extract only 'bowler' from it)?
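If NLTK's WordNet interface is an option, the plain word forms can be read from a synset's lemmas; a minimal sketch (in NLTK 3.x lemmas() and name() are methods, older releases expose them as properties; pattern's own Synset carries similar data but under different attribute names):
from nltk.corpus import wordnet as wn

syn = wn.synsets('bowler')[0]
for hypo in syn.hyponyms():
    # each Synset holds Lemma objects; name() is the bare word form
    print([lemma.name() for lemma in hypo.lemmas()])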
Source: (StackOverflow)
I am trying to install this package: https://pypi.python.org/pypi/textblob-aptagger, and it says to use this command, but I do not know where to run it (the command line and the Python console do not work):
$ pip install -U textblob-aptagger
I installed easy_install and pip using exe files from
http://www.lfd.uci.edu/~gohlke/pythonlibs/
So when I use the command:
$ pip install -U textblob-aptagger
in the Python console I get this error:
File "<console>", line 1
$ pip install -U textblob-aptagger
^
SyntaxError: invalid syntax
Where should I use this installation command?
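For what it's worth, the leading $ is just the shell prompt; the command is meant for the operating-system command prompt (cmd.exe on Windows), not the Python console. A minimal sketch of driving the same install from inside Python instead, in case the command prompt is inconvenient:
import sys
import subprocess

# equivalent of running "pip install -U textblob-aptagger" at the command prompt
subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "textblob-aptagger"])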
Source: (StackOverflow)
I am trying to implement a Naive Bayes algorithm to read tweets from a CSV file and classify them into categories I define (for example: tech, science, politics).
I want to use NLTK's Naive Bayes classification algorithm, but the example is not anywhere close to what I need to do.
One of my biggest confusions is: how do we improve the classification accuracy of NB?
I am hoping to get some guidance on the detailed steps I need to take to do the classification.
- Do I have to create separate CSV files for each category, where I manually put the tweets?
- How do I train the algorithm if I do the above, and how does the algorithm test?
I have been researching online and found some brief examples like TextBlob, which makes use of NLTK's NB algorithm to do sentiment classification of tweets. It is simple to understand but difficult to tweak for beginners.
http://stevenloria.com/how-to-build-a-text-classification-system-with-python-and-textblob/
In his example from the link above, how does he implement the test when he has already put the sentiment next to the tweets? I thought that, to test, we should hide the second element of each pair.
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
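For reference, a minimal sketch of how train/test pairs like these are typically used with TextBlob's classifier: it learns only from train, and the labels in test are used purely to score its predictions:
from textblob.classifiers import NaiveBayesClassifier

cl = NaiveBayesClassifier(train)                    # fit on the labelled training pairs
print(cl.classify("This is an amazing library!"))   # predict a label for unseen text
print(cl.accuracy(test))                            # compare predictions against the held-out labels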
Source: (StackOverflow)
I'm writing a Python command line utility that involves converting a string into a TextBlob, which is part of a natural language processing module. Importing the module is very slow, ~300 ms on my system. For speediness, I created a memoized function that converts text to a TextBlob only the first time the function is called. Importantly, if I run my script over the same text twice, I want to avoid reimporting TextBlob and recomputing the blob, instead pulling it from the cache. That's all done and works fine, except, for some reason, the function is still very slow. In fact, it's as slow as it was before. I think this must be because the module is getting reimported even though the function is memoized and the import statement happens inside the memoized function.
The goal here is to fix the following code so that the memoized runs are as speedy as they ought to be, given that the result does not need to be recomputed.
Here's a minimal example of the core code:
@memoize
def make_blob(text):
    import textblob
    return textblob.TextBlob(text)

if __name__ == '__main__':
    make_blob("hello")
And here's the memoization decorator:
import os
import shelve
import functools
import inspect

def memoize(f):
    """Cache results of computations on disk in a directory called 'cache'."""
    path_of_this_file = os.path.dirname(os.path.realpath(__file__))
    cache_dirname = os.path.join(path_of_this_file, "cache")
    if not os.path.isdir(cache_dirname):
        os.mkdir(cache_dirname)
    cache_filename = f.__module__ + "." + f.__name__
    cachepath = os.path.join(cache_dirname, cache_filename)
    try:
        cache = shelve.open(cachepath, protocol=2)
    except:
        print 'Could not open cache file %s, maybe name collision' % cachepath
        cache = None

    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        argdict = {}
        # handle instance methods
        if hasattr(f, '__self__'):
            args = args[1:]
        tempargdict = inspect.getcallargs(f, *args, **kwargs)
        for k, v in tempargdict.iteritems():
            argdict[k] = v
        key = str(hash(frozenset(argdict.items())))
        try:
            return cache[key]
        except KeyError:
            value = f(*args, **kwargs)
            cache[key] = value
            cache.sync()
            return value
        except TypeError:
            call_to = f.__module__ + '.' + f.__name__
            print ('Warning: could not disk cache call to '
                   '%s; it probably has unhashable args' % call_to)
            return f(*args, **kwargs)
    return wrapped
And here's a demonstration that the memoization doesn't currently save any time:
❯ time python test.py
python test.py 0.33s user 0.11s system 100% cpu 0.437 total
~/Desktop
❯ time python test.py
python test.py 0.33s user 0.11s system 100% cpu 0.436 total
This is happening even though the function is correctly being memoized (print statements put inside the memoized function only give output the first time the script is run).
I've put everything together into a GitHub Gist in case it's helpful.
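One thing worth ruling out with a minimal timing sketch (assumed to be pasted into the same script): unpickling a cached TextBlob still has to import textblob to rebuild the object, so the ~300 ms import cost can reappear even on cache hits.
import time

t0 = time.time()
import textblob                      # time the import on its own
t1 = time.time()
blob = make_blob("hello")            # cache hit after the first run
t2 = time.time()
print("import: %.3fs, make_blob: %.3fs" % (t1 - t0, t2 - t1))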
Source: (StackOverflow)
Hello, it's my first question, so my deepest apologies if I didn't follow a rule or so; I'll try my best to make it as clear as possible.
I am trying to use TextBlob with a text file as input. All the examples I found online take input in this sense:
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
wiki.tags
whereas I tried something like this:
from textblob import TextBlob
file=open("1.txt");
t=file.read();
print(type(t))
bobo = TextBlob(t)
bobo.tags
I hope nothing's wrong, thanks a lot.
Edit: the code I tried did not work.
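A minimal sketch of the same idea that actually displays the tags (it assumes the corpora TextBlob relies on have been fetched once with python -m textblob.download_corpora; without them the tagger raises an error about missing corpora):
from textblob import TextBlob

with open("1.txt") as f:
    text = f.read()

blob = TextBlob(text)
# a bare "bobo.tags" in a script evaluates the property but prints nothing
print(blob.tags)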
Source: (StackOverflow)
I am currently working on PySpark for NLP processing and similar tasks, and I am using the TextBlob Python library.
Normally, in standalone mode, it is easy to install external Python libraries. In cluster mode I am facing problems installing these libraries on the worker nodes remotely; I cannot access each and every worker machine to install them on the Python path.
I tried to use the SparkContext pyFiles option to ship .zip files, but the problem is that these Python packages need to be installed on the worker machines.
Could anyone let me know the different ways of doing this so that TextBlob is available on the Python path?
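If installing on every worker is not possible, one workaround worth sketching is shipping a zip of the pure-Python package with the SparkContext so the executors can import it from the zip (this only helps for packages without compiled extensions, any corpora the library expects must still exist on the workers, and the zip path below is illustrative):
from pyspark import SparkContext

sc = SparkContext(appName="nlp")
sc.addPyFile("/path/to/textblob.zip")      # makes the zipped package importable on the executors

def polarity(text):
    from textblob import TextBlob          # imported on the worker, after the zip has been shipped
    return TextBlob(text).sentiment.polarity

print(sc.parallelize(["great stuff", "terrible stuff"]).map(polarity).collect())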
Source: (StackOverflow)
I've recently come across TextBlob, which seems like a very neat natural language processing library: http://textblob.readthedocs.org/en/dev/quickstart.html
However, I am concerned because it seems to act like a regular Python string. I have a very large text file, and, for example, calling blob.correct() on a very modest amount of text takes a very long time. Any feedback on how TextBlob scales, or any alternatives for natural language parsing?
Source: (StackOverflow)
How does TextBlob calculate polarity in sentiment analysis? What logic does it follow, and can we change the logic?
Thank you.
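For context, TextBlob's default analyzer (PatternAnalyzer, built on the pattern library's hand-scored lexicon) roughly averages the polarity of the words it finds, with rules for negation and intensifiers, and it can be swapped for a corpus-trained NaiveBayesAnalyzer; a minimal sketch of both:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

text = "The movie was not that great."
# default: lexicon-based PatternAnalyzer, polarity in [-1.0, 1.0]
print(TextBlob(text).sentiment)
# alternative: Naive Bayes model trained on a movie-review corpus (needs the NLTK movie_reviews data)
print(TextBlob(text, analyzer=NaiveBayesAnalyzer()).sentiment)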
Source: (StackOverflow)
I have been trying to install Python TextBlob, but I am getting this error:
Now downloading textblob packages
[localhost] run: python -m textblob.download_corpora
[localhost] out: /home/naren/VirtualEnvironment/bin/python: No module named textblob
[localhost] out:
Fatal error: run() received nonzero return code 1 while executing!
Requested: python -m textblob.download_corpora
Executed: /bin/bash -l -c "cd /home/naren/VirtualEnvironment && source bin/activate && python -m textblob.download_corpora"
Aborting.
Disconnecting from localhost... done.
run() received nonzero return code 1 while executing!
Source: (StackOverflow)
For a project for my lab, I'm analyzing Twitter data. The tweets we've captured all contain the word 'sex'; that's the keyword we filtered the TwitterStreamer to capture on.
I converted the CSV where all of the tweet data (JSON meta tags) is housed into a pandas DataFrame and saved the 'text' column to isolate the tweet text.
import pandas as pd
import csv
df = pd.read_csv('tweets_hiv.csv')
saved_column4 = df.text
print saved_column4
Out comes the correct output:
0 Some example tweet text
1 Oh hey look more tweet text @things I hate #stuff
...a bunch more lines
Name: text, Length: 8540, dtype: object
But, when I try this
from textblob import TextBlob
tweetstr = str(saved_column4)
tweets = TextBlob(tweetstr).upper()
print tweets.words.count('sex', case_sensitive=False)
My output is 22.
There should be at least as many occurrences of the word 'sex' as there are lines in the CSV, and likely more. I can't figure out what's happening here. Is TextBlob not handling the dtype: object column correctly?
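One thing to check: str() on a pandas Series returns the truncated preview shown above (note the elided middle and the Length: 8540 footer), not all 8,540 tweets, which alone would explain a low count. A minimal sketch that counts over the whole column instead:
from textblob import TextBlob

# join every tweet into one string rather than stringifying the Series preview
all_text = ' '.join(df['text'].astype(str))
print(TextBlob(all_text).words.count('sex', case_sensitive=False))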
Source: (StackOverflow)
I have this code. I have two features. How do I train the two features together?
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger
import re
import nltk
def get_word_before_you_feature(mystring):
    keyword = 'you'
    before_keyword, keyword, after_keyword = mystring.partition(keyword)
    before_keyword = before_keyword.rsplit(None, 1)[-1]
    return {'word_before_you': before_keyword}

def get_word_after_you_feature(mystring):
    keyword = 'you'
    before_keyword, keyword, after_keyword = mystring.partition(keyword)
    after_keyword = after_keyword.split(None, 1)[0]
    return {'word_after_you': after_keyword}

classifier = nltk.NaiveBayesClassifier.train(train)
lang_detector = NaiveBayesClassifier(train, feature_extractor=get_word_after_you_feature)
lang_detector = NaiveBayesClassifier(train, feature_extractor=get_word_before_you_feature)
print(lang_detector.accuracy(test))
print(lang_detector.show_informative_features(5))
This is the output I get.
word_before_you = 'do' refere : generi = 2.2 : 1.0
word_before_you = 'when' generi : refere = 1.1 : 1.0
It only seems to use the last feature. How do I get the classifier to train on both features instead of just one?
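One common approach is to return both features from a single extractor, since each NaiveBayesClassifier(...) call above builds a separate model and the second assignment simply replaces the first; a minimal sketch, assuming the same (not shown) train and test lists:
def get_words_around_you(mystring):
    before, keyword, after = mystring.partition('you')
    return {
        # fall back to '' when 'you' starts or ends the text, or is absent
        'word_before_you': before.rsplit(None, 1)[-1] if before.strip() else '',
        'word_after_you': after.split(None, 1)[0] if after.strip() else '',
    }

lang_detector = NaiveBayesClassifier(train, feature_extractor=get_words_around_you)
print(lang_detector.accuracy(test))
print(lang_detector.show_informative_features(5))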
Source: (StackOverflow)
I'm trying to run through the TextBlob tutorial in Windows (using Git Bash shell) with Python 3.3.
I've installed textblob and nltk, as well as any dependencies.
The Python code is:
from text.blob import TextBlob
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
tags = wiki.tags
I'm getting the following error
Traceback (most recent call last):
File "textblob.py", line 4, in <module>
tags = wiki.tags
File "c:\Python33\lib\site-packages\text\decorators.py", line 18, in __get__
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "c:\Python33\lib\site-packages\text\blob.py", line 357, in pos_tags
for word, t in self.pos_tagger.tag(self.raw)
File "c:\Python33\lib\site-packages\text\taggers.py", line 40, in tag
return pattern_tag(sentence, tokenize)
File "c:\Python33\lib\site-packages\text\en.py", line 115, in tag
for sentence in parse(s, tokenize, True, False, False, False, encoding).split():
File "c:\Python33\lib\site-packages\text\en.py", line 99, in parse
return parser.parse(unicode(s), *args, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1213, in parse
s[i] = self.find_tags(s[i], **kwargs)
File "c:\Python33\lib\site-packages\text\en.py", line 49, in find_tags
return _Parser.find_tags(self, tokens, **kwargs)
File "c:\Python33\lib\site-packages\text\text.py", line 1161, in find_tags
map = kwargs.get( "map", None))
File "c:\Python33\lib\site-packages\text\text.py", line 967, in find_tags
tagged.append([token, lexicon.get(token, i==0 and lexicon.get(token.lower()) or None)])
File "c:\Python33\lib\site-packages\text\text.py", line 98, in get
return self._lazy("get", *args)
File "c:\Python33\lib\site-packages\text\text.py", line 79, in _lazy
self.load()
File "c:\Python33\lib\site-packages\text\text.py", line 367, in load
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 367, in <genexpr>
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
File "c:\Python33\lib\site-packages\text\text.py", line 346, in _read
for line in f:
File "c:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>
Any idea what is wrong here? Adding a 'u' before the string didn't help.
Source: (StackOverflow)