
happybase

A developer-friendly Python library to interact with Apache HBase (HappyBase 0.9 documentation)

Spark can't pickle method_descriptor

I get this weird error message:

15/01/26 13:05:12 INFO spark.SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "/home/user/inverted-index.py", line 78, in <module>
    print sc.wholeTextFiles(data_dir).flatMap(update).top(10)#groupByKey().map(store)
  File "/home/user/spark2/python/pyspark/rdd.py", line 1045, in top
    return self.mapPartitions(topIterator).reduce(merge)
  File "/home/user/spark2/python/pyspark/rdd.py", line 715, in reduce
    vals = self.mapPartitions(func).collect()
  File "/home/user/spark2/python/pyspark/rdd.py", line 676, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/home/user/spark2/python/pyspark/rdd.py", line 2107, in _jrdd
    pickled_command = ser.dumps(command)
  File "/home/user/spark2/python/pyspark/serializers.py", line 402, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 816, in dumps
    cp.dump(obj)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 133, in dump
    return pickle.Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 249, in save_function
    self.save_function_tuple(obj, modList)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 309, in save_function_tuple
    save(f_globals)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 650, in save_reduce
    save(state)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 650, in save_reduce
    save(state)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 650, in save_reduce
    save(state)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 547, in save_inst
    self.save_inst_logic(obj)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 537, in save_inst_logic
    save(stuff)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 547, in save_inst
    self.save_inst_logic(obj)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 537, in save_inst_logic
    save(stuff)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 616, in save_reduce
    save(cls)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 467, in save_global
    d),obj=obj)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 631, in save_reduce
    save(args)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 616, in save_reduce
    save(cls)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 442, in save_global
    raise pickle.PicklingError("Can't pickle builtin %s" % obj)
pickle.PicklingError: Can't pickle builtin <type 'method_descriptor'>

My update function returns a list of tuples of the form (key, (value1, value2)), all of which are strings, as seen below:

def update(doc):
    doc_id  = doc[0][path_len:-ext_len]  # actual file name
    content = doc[1].lower()

    new_fi = regex.split(content)   # regex, path_len, ext_len come from module scope
    old_fi = fi_table.row(doc_id)   # fi_table is a happybase Table from module scope

    fi_table.put(doc_id, {'cf:col': ",".join(new_fi)})

    if not old_fi:
        return [(term, ('add', doc_id)) for term in new_fi]
    else:
        new_fi = set(new_fi)
        old_fi = set(old_fi['cf:col'].split(','))
        return [(term, ('add', doc_id)) for term in new_fi - old_fi] + \
               [(term, ('del', doc_id)) for term in old_fi - new_fi]

EDIT: The problem lies in these two HBase calls, row and put. When I comment out both of them (setting old_fi to an empty dictionary) the code works, but if either of them runs, it produces the above error. I use happybase to operate HBase from Python. Can someone explain to me what goes wrong?
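For what it's worth, the traceback shows cloudpickle serializing the function's globals, which is where the happybase Table (and its Thrift connection, whose methods are method_descriptor objects) lives. A minimal sketch of the usual workaround, not from the original post: open the connection inside the function on the worker, so nothing unpicklable has to cross the closure. HBASE_HOST, the table name 'fi', and the update_logic helper (the body above refactored to take the table as a parameter) are placeholders:

def update(doc):
    # Opened on the worker, per call; nothing unpicklable is captured by the closure.
    connection = happybase.Connection(HBASE_HOST)  # HBASE_HOST is a placeholder
    fi_table = connection.table('fi')              # table name is an assumption
    try:
        return update_logic(doc, fi_table)         # the body above, with the table passed in
    finally:
        connection.close()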


Source: (StackOverflow)

does happybase table.put accept non-string values?

I have been testing table.put using Java and Python.

In Java, you can write int or float values into a column. Using happybase,

table.put(line_item_key, {'allinone:quantity': quantity})

it bombs out with:

TypeError: object of type 'int' has no len()

Can it be true that happybase does not support writing out anything other than strings?
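HappyBase sends values over Thrift as byte strings, so non-string values need to be encoded first. A minimal sketch of two common encodings, not from the original post; struct.pack('>i', ...) mirrors what Bytes.toBytes(int) produces on the Java side:

import struct

# Human-readable text encoding:
table.put(line_item_key, {'allinone:quantity': str(quantity)})

# Fixed-width binary encoding (4-byte big-endian int, like Java's Bytes.toBytes):
table.put(line_item_key, {'allinone:quantity': struct.pack('>i', quantity)})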


Source: (StackOverflow)


happybase: connecting to HBase and getting table information fails

I am new to HBase and want to use happybase. I followed the tutorial here: https://happybase.readthedocs.org/en/latest/user.html#establishing-a-connection. The code is as follows:

connection = happybase.Connection(host='10.0.0.11', port=16000)
connection.open()
table = connection.table('users')
list(table.scan())

but I always get this Thrift problem:

thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

Does anyone know how to solve this problem? I am on Linux. Thanks.
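A likely culprit, offered as an assumption rather than a confirmed diagnosis: port 16000 is the HBase master's port, while happybase talks to the separate HBase Thrift server, which listens on port 9090 by default. A minimal sketch:

# On the HBase host, start the Thrift server first:
#   $ hbase thrift start
connection = happybase.Connection(host='10.0.0.11', port=9090)
connection.open()
print connection.tables()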


Source: (StackOverflow)

Closing connection to hbase database using happybase in python

def hbasePopulate(self, table="abc", MachineIP="xx.xx.xx.xx"):
    # Uses the happybase and ast modules; Reptype is presumably an
    # attribute of the surrounding class.
    connection = happybase.Connection(MachineIP, autoconnect=True)
    tablename = Reptype.lower() + 'rep'
    print "Connecting to table"
    print tablename
    try:
        table = connection.table(tablename)
        for key, data in table.scan():
            print key, data
        print table
    # except IOError as e:
    except Exception:
        print "Table does not exist, creating"
        self.createTable(table=table, MachineIP=MachineIP)

    with table.batch() as b:
        with open('xxx.csv', 'r') as queryFile:
            for lines in queryFile:
                lines = lines.strip("\n")
                splitRecord = lines.split(",")
                key = splitRecord[0].replace("'", "")
                val = ",".join(splitRecord[1:])
                val = ast.literal_eval(val)  # the CSV remainder is a dict literal
                b.put(key, val)              # queued in the batch, sent when the with-block exits

    for key, data in table.scan():
        print key, data

def createTable(self, table="abc", MachineIP=""):
    connection = happybase.Connection(MachineIP, autoconnect=True)
    print "Connection Handle", connection
    tablename = str(table.lower())
    print "Creating table : " + tablename + ", On Hbase machine : " + MachineIP
    families = {"cf": {}}  # using a single default column family
    connection.create_table(tablename, families=families)
    print "Creating table done"

Every time I run this script it populates the HBase table, but it leaves a connection open. When I check with netstat -an, I see that the connection count has increased, and the connections persist even after the script completes.

Am I missing something? Do we need to explicitly close the connection?

Thanks for helping.
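Yes, a happybase Connection holds its Thrift socket open until it is closed. A minimal sketch of explicit cleanup, an assumption about the fix rather than part of the original post:

connection = happybase.Connection(MachineIP, autoconnect=True)
try:
    table = connection.table(tablename)
    # ... puts and scans ...
finally:
    connection.close()  # releases the Thrift socket so it doesn't linger in netstat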


Source: (StackOverflow)

happybase suddenly stops working with hbase


I'm using happybase to pass data from Twitter to my HBase setup. Initially it was working fine; I could create a connection to HBase and get a table too. But now I can't put, scan, or delete any data in HBase through happybase, whether from a script or from the Python prompt:

>>> import happybase
>>> cn = happybase.Connection('localhost')
>>> v = cn.table('test')
>>> v
<happybase.table.Table name='test'>
>>> n = v.scan(row_prefix='0001')
>>> for key, data in n:
...     print key, data

When I try to put or scan data, the system doesn't do anything; it just keeps loading, for up to 8 hours.

Please give me a suggestion.
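One hedged suggestion, an assumption rather than a confirmed fix: by default the Thrift socket has no timeout, so a dead or wedged Thrift server makes every call hang indefinitely. Setting a timeout makes the failure visible immediately; restarting the HBase Thrift server is then the usual remedy:

cn = happybase.Connection('localhost', timeout=30000)  # socket timeout in milliseconds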


Source: (StackOverflow)

Output separated HBase columns using happybase

I have an HBase table like this:

total date1:tCount1 date2:tCount2 ...
url1 date1:clickCount1 date2:clickCount2 ...
url2 date1:clickCount1 date2:clickCount2 ...
...

url1, url2, ... are row keys. The table has only one column family.

I have a date range (from datei to datej) as input. I need to output each url's share of clicks per day.

The output must have the following format:

datei url1:share1 url2:share1...
...
datej url1:share1 url2:share1...

where

share1 for url1 on datei = (url1 row's datei:clickCount1) / (total row's datei:tCount1)

I started to write a happybase script, but I don't know how to select specific columns from a row using happybase. My script is below:

import argparse
import calendar
import getpass
import happybase
import logging
import random
import sys

USAGE = """

To query daily data for a year, run:
  $ {0} --action query --year 2014

To query daily data for a particular month, run:
  $ {0} --action query --year 2014 --month 10

To query daily data for a particular day, run:
  $ {0} --action query --year 2014 --month 10 --day 27

To compute totals add `--total` argument.

""".format(sys.argv[0])

logging.basicConfig(level="DEBUG")

HOSTS = ["bds%02d.vdi.mipt.ru" % i for i in xrange(7, 10)]
TABLE = "VisitCountPy-" + getpass.getuser()

def connect():
    host = random.choice(HOSTS)
    conn = happybase.Connection(host)

    logging.debug("Connecting to HBase Thrift Server on %s", host)
    conn.open()

    if TABLE not in conn.tables():
        # Create a table with column family `cf` with default settings.
        conn.create_table(TABLE, {"cf": dict()})
        logging.debug("Created table %s", TABLE)
    else:
        logging.debug("Using table %s", TABLE)
    return happybase.Table(TABLE, conn)

def query(args, table):
    r = list(get_time_range(args))
    t = 0L
    for key, data in table.scan(row_start=min(r), row_stop=max(r)):
        if args.total:
            t += long(data["cf:value"])
        else:
            print "%s\t%s" % (key, data["cf:value"])
    if args.total:
        print "total\t%s" % t

def get_time_range(args):
    cal = calendar.Calendar()
    years = [args.year]
    months = [args.month] if args.month is not None else range(1, 1+12)

    for year in years:
        for month in months:
            if args.day is not None:
                days = [args.day]
            else:
                days = cal.itermonthdays(year, month)
            for day in days:
                if day > 0:
                    yield "%04d%02d%02d" % (year, month, day)

def main():
    parser = argparse.ArgumentParser(description="An HBase example", usage=USAGE)
    parser.add_argument("--action", metavar="ACTION", choices=("generate", "query"), required=True)
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, default=None)
    parser.add_argument("--day", type=int, default=None)
    parser.add_argument("--total", action="store_true", default=False)

    args = parser.parse_args()
    table = connect()

    if args.day is not None and args.month is None:
        raise RuntimeError("Please, specify a month when specifying a day.")
    if args.day is not None and (args.day < 0 or args.day > 31):
        raise RuntimeError("Please, specify a valid day.")

    query(args, table)

if __name__ == "__main__":
    main()

So, how should I change my script (specifically, the query() function) to get the separate columns in the given date range?
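Not a verified solution, but a sketch of the direction: happybase's Table.row() and Table.scan() both accept a columns argument that restricts results to named columns, and here each date in the range maps to one qualifier in the single column family (assumed to be named 'cf', as in the script's create_table call):

def query(args, table):
    # One column per date in the requested range, e.g. "cf:20141027".
    cols = ["cf:%s" % d for d in get_time_range(args)]
    totals = table.row("total", columns=cols)   # the totals row from the question
    for key, data in table.scan(columns=cols):
        if key == "total":
            continue
        for col, count in sorted(data.items()):
            date = col.split(":", 1)[1]
            share = float(count) / float(totals[col])
            print "%s\t%s:%s" % (date, key, share)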


Source: (StackOverflow)

How to install happybase

I cannot install happybase on Ubuntu Linux 12.04 with Python 2.7. I've tried pip install happybase, but I get an error. Can anyone tell me what I am doing wrong?

The error is:

error: invalid command 'egg_info'
Complete output from command python setup.py egg_info:
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 
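A common cause, offered as an assumption since the output above is truncated: the "invalid command 'egg_info'" failure usually means the setuptools/distribute shipped with Ubuntu 12.04 is too old for the package's setup requirements. Upgrading it first often resolves the install:

$ sudo pip install --upgrade setuptools
$ sudo pip install happybase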

Source: (StackOverflow)

Create HBase on Amazon EC2 and use it from Python

I want to set up an HBase database on Amazon EC2 and write some test data using the Python happybase library. How do I do it? Please point me to links where I can read about it. Thanks.
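Setup guides aside, here is a minimal happybase smoke test for once HBase and its Thrift server are running on the instance; the hostname and table name are placeholders:

import happybase

# The EC2 host must have the HBase Thrift server running (default port 9090)
# and its security group must allow inbound traffic on that port.
connection = happybase.Connection('ec2-xx-xx-xx-xx.compute-1.amazonaws.com')
connection.create_table('test', {'cf': dict()})
table = connection.table('test')
table.put('row1', {'cf:greeting': 'hello from EC2'})
print table.row('row1')
connection.close()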


Source: (StackOverflow)

Get only the first 10 columns of a row using happybase

Is it possible to get only a limited number of columns from a column family for a row? Let's say I just want to fetch the first 10 values of ['cf1': 'col1'] for a particular row.
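One possible route, stated as an assumption rather than a confirmed answer: happybase passes HBase filter strings through on scans, and HBase's ColumnPaginationFilter caps the number of columns returned per row:

# Return at most 10 columns, starting at column offset 0, for a single row.
for key, data in table.scan(
        row_start=row_key, row_stop=row_key + '\x00',  # single-row scan; row_key is a placeholder
        filter="ColumnPaginationFilter(10, 0)"):
    print key, data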


Source: (StackOverflow)

difference between happybase table.scan() and hbase thrift scannerGetList()

I have two versions of a Python script that scan an HBase table 1000 rows at a time in a while loop. The first one uses happybase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows:

while variable:
    for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
        print key
    new_key = key

The second one uses the HBase Thrift interface directly, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/:

scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
data = hbase.scannerGetList(scanner_id, 1000) 
while len(data):
    for dbpost in data:
        print dbpost
    data = hbase.scannerGetList(scanner_id, 1000)

The row keys in the database are numbers. My problem is that at a certain row something weird happens:

happybase prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest) 
193622937692155904 
193623435597983745...

and the Thrift scanner prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest)
100162267416506368 
10016241167 
10016296927 ...

And this is happening not at the boundary of the next 1000 rows (row_start=new_key or the next data=scannerGetList call), but in the middle of a batch. And it happens every time.

I would say that the second script, with scannerGetList, is doing it right.

Why is happybase doing it differently? Is it considering timestamps or some other internal happybase/hbase logic? Will it eventually scan the whole table, just in a different order?

PS: I do know that the happybase version will scan and print the 1000th row twice, and that scannerGetList will ignore the first row of the next batch. That is not the point; the magic is happening in the middle of a 1000-row batch.
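A hedged observation rather than a full explanation: the happybase loop reopens a brand-new scanner at row_start=new_key on every iteration, while the Thrift version keeps one scanner open for the whole table, so any difference in scanner state shows up as divergent orderings. Letting happybase drive a single continuous scanner removes the repeated reopening from the comparison:

# One scanner for the whole table; happybase fetches 1000 rows per Thrift
# round-trip, but the iteration never restarts.
for key, data in hbase.table(tablename).scan(batch_size=1000):
    print key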


Source: (StackOverflow)

Thrift error while generating python client file

I'm new to HBase and I would like to communicate with it through a Python API that works with Thrift. I've followed this tutorial to install it properly on my machine, and everything seemed to work fine. Then I fetched a .thrift file with the following command:

wget http://svn.apache.org/viewvc/hbase/trunk/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift\?view\=markup 
-O hbase.thrift

Then I tried to generate my client as shown here, but I get the following error message:

[ERROR:/home/tests/hbase/hbase.thrift:12] (last token was '<')
syntax error
[FAILURE:/home/tests/hbase/hbase.thrift:12] Parser error during include pass.

I searched the internet for the cause of this error and found this paper. I looked in thriftl.ll to see if I could correct the error, but I found that the correction was already present in the file.

What more can I do to make this work?

Thank you!

EDIT: I'm using thrift 0.9.0
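One plausible cause, flagged as an assumption: a ViewVC URL with ?view=markup returns an HTML page rather than the raw IDL, so the Thrift parser chokes on the HTML (the '<' in the error message is suggestive). Fetching the raw file and regenerating may help:

wget 'http://svn.apache.org/viewvc/hbase/trunk/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift?view=co' -O Hbase.thrift
thrift --gen py Hbase.thrift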


Source: (StackOverflow)

Filtering integers with HBase + Python

I am trying to filter rows from an HBase table (I am using HappyBase); specifically, I am trying to get the rows whose 'id' is less than 1000:

for key, data in graph_table.scan(filter="SingleColumnValueFilter('cf', 'id', <, 'binary:1000')"):
    print key, data

The results are the following:

<http://ieee.rkbexplorer.com/id/publication-d2a6837e67d808b41ffe6092db50f7cc> {'cf:type': 'v', 'cf:id': '100', 'cf:label': '<http://www.aktors.org/ontology/portal#Proceedings-Paper-Reference>'}
<http://www.aktors.org/ontology/date#1976> {'cf:type': 'v', 'cf:id': '1', 'cf:label': '<http://www.aktors.org/ontology/support#Calendar-Date>'}
<http://www.aktors.org/ontology/date#1985> {'cf:type': 'v', 'cf:id': '10', 'cf:label': '<http://www.aktors.org/ontology/support#Calendar-Date>'}

The table contains rows with 'id' from 1 to 1000. If I code this in Java using the HBase Java library it works fine, encoding the integer value with the Bytes.toBytes() function.

Thank you.
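A hedged reading of the output above: the stored cf:id values are ASCII strings ('1', '10', '100'), so the 'binary' comparator compares them lexicographically rather than numerically. One workaround, assuming the ids can be rewritten, is to store them zero-padded to a fixed width so that byte order matches numeric order:

# Store ids as fixed-width strings, e.g. '0042' instead of '42'.
table.put(row_key, {'cf:id': '%04d' % 42})  # row_key is a placeholder

# Lexicographic comparison now agrees with numeric comparison:
for key, data in graph_table.scan(
        filter="SingleColumnValueFilter('cf', 'id', <, 'binary:1000')"):
    print key, data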


Source: (StackOverflow)

Passing a regex to the columns attribute in a happybase scan

I am trying to pass a list of regexes to the columns attribute in my happybase scan calls. This is because my column names are made by dynamically appending ids that I don't have access to at scan time.

Is this possible?
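To the best of my knowledge the columns argument takes literal family:qualifier names only. A sketch of an alternative, assuming the qualifiers share a recognizable pattern ('item_<id>' here is hypothetical): express the match as an HBase filter string with a regex comparator:

# Match qualifiers like 'item_123' with a QualifierFilter.
for key, data in table.scan(filter="QualifierFilter(=, 'regexstring:^item_[0-9]+$')"):
    print key, data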


Source: (StackOverflow)

HappyBase - Is there an equivalent of find_one or scan_one?

All the rows in a particular HBase table that I am making a UI for happen to have the same columns, and will do so for the foreseeable future. I would like my HTML data-visualizer application to simply query for a single arbitrary row, take note of the column names, and put this list of column names into a variable to refer to throughout the program.

I didn't see any equivalent to find_one or scan_one in the docs for HappyBase.

What is the best way to accomplish this?
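A minimal sketch, assuming any row will do: Table.scan() returns a generator, so limiting it to one row and calling next() behaves like a scan_one:

# Fetch one arbitrary row and keep its column names.
key, data = next(table.scan(limit=1))
column_names = data.keys()  # e.g. ['cf:col1', 'cf:col2', ...]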


Source: (StackOverflow)

Sharing HappyBase connections through Spark maps

I am working with Spark and HBase (using the HappyBase library), and everything goes fine when working with small datasets. But when working with big datasets, the connection to HBase Thrift is lost after many calls to the map function. I am working on a single pseudo-distributed node at the moment.

Concretely, the following error takes place in the map function:

TTransportException: Could not connect to localhost:9090

Map function:

def save_triples(triple, ac, table_name, ac_vertex_id, graph_table_name):
    connection = happybase.Connection(HBASE_SERVER_IP, compat='0.94')
    table = connection.table(table_name)
    [...]
    connection.close()

This is the call to the map function:

counts = lines.map(lambda x: save_triples(x, ac, table_name, ac_vertex_id, graph_table_name))
output = counts.collect()

I suspect that this is happening because too many connections are being opened. I have tried creating the 'connection' object in the main function and passing it to the map function as a parameter (something like this works with the HBase libraries in Java), but I get the following error:

pickle.PicklingError: Can't pickle builtin <type 'method_descriptor'>

Any help would be appreciated.
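A minimal sketch of a middle ground, an assumption rather than a verified fix: the connection object cannot be pickled into the closure, but mapPartitions lets each partition open exactly one connection on the worker, cutting the connection churn from one per record to one per partition:

def save_partition(triples):
    # One connection per partition, opened on the worker.
    connection = happybase.Connection(HBASE_SERVER_IP, compat='0.94')
    table = connection.table(table_name)
    try:
        for triple in triples:
            # ... same per-triple logic as save_triples(), using this table ...
            yield triple
    finally:
        connection.close()

counts = lines.mapPartitions(save_partition)
output = counts.collect()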


Source: (StackOverflow)