happybase
A developer-friendly Python library to interact with Apache HBase
I get this weird error message:
15/01/26 13:05:12 INFO spark.SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
File "/home/user/inverted-index.py", line 78, in <module>
print sc.wholeTextFiles(data_dir).flatMap(update).top(10)#groupByKey().map(store)
File "/home/user/spark2/python/pyspark/rdd.py", line 1045, in top
return self.mapPartitions(topIterator).reduce(merge)
File "/home/user/spark2/python/pyspark/rdd.py", line 715, in reduce
vals = self.mapPartitions(func).collect()
File "/home/user/spark2/python/pyspark/rdd.py", line 676, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/home/user/spark2/python/pyspark/rdd.py", line 2107, in _jrdd
pickled_command = ser.dumps(command)
File "/home/user/spark2/python/pyspark/serializers.py", line 402, in dumps
return cloudpickle.dumps(obj, 2)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 816, in dumps
cp.dump(obj)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 133, in dump
return pickle.Pickler.dump(self, obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 254, in save_function
self.save_function_tuple(obj, [themodule])
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
save(x)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 254, in save_function
self.save_function_tuple(obj, [themodule])
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
save(x)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 254, in save_function
self.save_function_tuple(obj, [themodule])
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 249, in save_function
self.save_function_tuple(obj, modList)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 309, in save_function_tuple
save(f_globals)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 650, in save_reduce
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 650, in save_reduce
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 650, in save_reduce
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 547, in save_inst
self.save_inst_logic(obj)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 537, in save_inst_logic
save(stuff)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 547, in save_inst
self.save_inst_logic(obj)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 537, in save_inst_logic
save(stuff)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 616, in save_reduce
save(cls)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 467, in save_global
d),obj=obj)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 631, in save_reduce
save(args)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 616, in save_reduce
save(cls)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/user/spark2/python/pyspark/cloudpickle.py", line 442, in save_global
raise pickle.PicklingError("Can't pickle builtin %s" % obj)
pickle.PicklingError: Can't pickle builtin <type 'method_descriptor'>
My update function returns a list of tuples of the form (key, (value1, value2)),
where all elements are strings, as shown below:
def update(doc):
    doc_id = doc[0][path_len:-ext_len]  # actual file name
    content = doc[1].lower()
    new_fi = regex.split(content)
    old_fi = fi_table.row(doc_id)
    fi_table.put(doc_id, {'cf:col': ",".join(new_fi)})
    if not old_fi:
        return [(term, ('add', doc_id)) for term in new_fi]
    else:
        new_fi = set(new_fi)
        old_fi = set(old_fi['cf:col'].split(','))
        return [(term, ('add', doc_id)) for term in new_fi - old_fi] + \
               [(term, ('del', doc_id)) for term in old_fi - new_fi]
EDIT:
The problem lies in these two HBase calls, row and put. When I comment both of them out, the code works (with old_fi set to an empty dictionary), but if either of them runs, it produces the error above. I use HappyBase to access HBase from Python. Can someone explain to me what is going wrong?
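For reference, a minimal sketch of one possible workaround, assuming the Thrift server host and the table name (here HBASE_HOST and 'fi') are adjusted to the actual setup: open the HappyBase connection inside the mapped function, so the unpicklable connection/table objects never become part of the closure that Spark has to serialize.

import happybase

HBASE_HOST = 'localhost'  # assumption: your Thrift server host

def update(doc):
    # Create the connection inside the worker function so only plain
    # values are captured by the closure, not Thrift socket objects.
    connection = happybase.Connection(HBASE_HOST)
    fi_table = connection.table('fi')  # hypothetical table name
    try:
        doc_id = doc[0][path_len:-ext_len]
        content = doc[1].lower()
        new_fi = regex.split(content)
        old_fi = fi_table.row(doc_id)
        fi_table.put(doc_id, {'cf:col': ",".join(new_fi)})
        if not old_fi:
            return [(term, ('add', doc_id)) for term in new_fi]
        new_fi = set(new_fi)
        old_fi = set(old_fi['cf:col'].split(','))
        return ([(term, ('add', doc_id)) for term in new_fi - old_fi] +
                [(term, ('del', doc_id)) for term in old_fi - new_fi])
    finally:
        connection.close()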
Source: (StackOverflow)
I have been testing table.put using Java and Python.
In Java, you can write int or float values into a column. Using HappyBase,
table.put(line_item_key, {'allinone:quantity': quantity})
bombs out with
TypeError: object of type 'int' has no len()
Could it be true that HappyBase does not support writing anything other than strings?
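For what it's worth, a minimal sketch of the usual workaround, under the assumption that HappyBase/Thrift only accepts byte strings as cell values: encode the number yourself before the put, either as a decimal string or as packed bytes.

import struct

# Assumption: `table` is a happybase.Table and `quantity` is an int.
# Option 1: store the decimal representation (human-readable).
table.put(line_item_key, {'allinone:quantity': str(quantity)})

# Option 2: store a fixed-width big-endian encoding, matching what
# Bytes.toBytes(int) produces on the Java side.
table.put(line_item_key, {'allinone:quantity': struct.pack('>i', quantity)})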
Source: (StackOverflow)
def hbasePopulate(self, table="abc", MachineIP="xx.xx.xx.xx"):
    connection = happybase.Connection(MachineIP, autoconnect=True)
    tablename = Reptype.lower() + 'rep'
    print "Connecting to table "
    print tablename
    try:
        table = connection.table(tablename)
        for key, data in table.scan():
            print key, data
        print table
    #except IOError as e:
    except:
        print "Table does not exists,creating"
        self.createTable(table=table, machineIP=machineIP)
    with table.batch() as b:
        with open('xxx.csv', 'r') as queryFile:
            for lines in queryFile:
                lines = lines.strip("\n")
                splitRecord = lines.split(",")
                key = splitRecord[0]
                key = key.replace("'", "")
                val = ",".join(splitRecord[1:])
                val = ast.literal_eval(val)
                table.put(splitRecord[0], val)
    for key, data in table.scan():
        print key, data

def createTable(self, table="abc", MachineIP=""):
    connection = happybase.Connection(MachineIP, autoconnect=True)
    print "Connection Handle", connection
    tname = table.lower()
    tablename = str(tname)
    print "Creating table : " + table + ", On Hbase machine : " + MachineIP
    families = {"cf": {}, }  # using default column family
    connection.create_table(table, families=families)
    print "Creating table done "
Every time I run this script it populates data into the HBase table, but it leaves a connection open. When I check using netstat -an
I see that the connection count has increased, and the connections persist even after the script completes.
Am I missing something? Do we need to explicitly close the connection?
Thanks for helping.
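A minimal sketch of closing the connection explicitly once the work is done (assuming MachineIP reaches the Thrift server); happybase.Connection exposes a close() method, so wrapping the work in try/finally keeps the socket from lingering:

import happybase

connection = happybase.Connection(MachineIP, autoconnect=True)
try:
    table = connection.table(tablename)
    # ... scan / put work goes here ...
finally:
    # Close the Thrift socket so netstat no longer shows the connection.
    connection.close()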
Source: (StackOverflow)
I'm using HappyBase to pass data from Twitter to my HBase setup. Initially it was working fine; I could create the connection with HBase and the table too. But now I can't put, scan, or delete any data in HBase through HappyBase, either from a script or from the Python prompt:
>>> import happybase
>>> cn = happybase.Connection('localhost')
>>> v = cn.table('test')
>>> v
<happybase.table.Table name='test'>
>>> n = v.scan(row_prefix='0001')
>>> for key, data in n:
...     print key, data
When I try to put or scan data, the system doesn't do anything; it just keeps loading, for up to 8 hours.
Please give me a suggestion.
Source: (StackOverflow)
I have the following HBase table:
total date1:tCount1 date2:tCount2 ...
url1 date1:clickCount1 date2:clickCount2 ...
url2 date1:clickCount1 date2:clickCount2 ...
...
url1, url2, ... are row keys. The table has only one column family.
I have a date range (from datei to datej) as input.
I need to output the share of clicks per day for each url.
The output must have the following format:
datei url1:share1 url2:share1...
...
datej url1:share1 url2:share1...
where
datei.url1:share1 = url1.datei:clickCount1 / total datei:tCount1
I have started writing a HappyBase script, but I don't know how to select specific columns from a row using HappyBase.
My script is below:
import argparse
import calendar
import getpass
import happybase
import logging
import random
import sys

USAGE = """
To query daily data for a year, run:
$ {0} --action query --year 2014
To query daily data for a particular month, run:
$ {0} --action query --year 2014 --month 10
To query daily data for a particular day, run:
$ {0} --action query --year 2014 --month 10 --day 27
To compute totals add `--total` argument.
""".format(sys.argv[0])

logging.basicConfig(level="DEBUG")

HOSTS = ["bds%02d.vdi.mipt.ru" % i for i in xrange(7, 10)]
TABLE = "VisitCountPy-" + getpass.getuser()

def connect():
    host = random.choice(HOSTS)
    conn = happybase.Connection(host)
    logging.debug("Connecting to HBase Thrift Server on %s", host)
    conn.open()
    if TABLE not in conn.tables():
        # Create a table with column family `cf` with default settings.
        conn.create_table(TABLE, {"cf": dict()})
        logging.debug("Created table %s", TABLE)
    else:
        logging.debug("Using table %s", TABLE)
    return happybase.Table(TABLE, conn)

def query(args, table):
    r = list(get_time_range(args))
    t = 0L
    for key, data in table.scan(row_start=min(r), row_stop=max(r)):
        if args.total:
            t += long(data["cf:value"])
        else:
            print "%s\t%s" % (key, data["cf:value"])
    if args.total:
        print "total\t%s" % t

def get_time_range(args):
    cal = calendar.Calendar()
    years = [args.year]
    months = [args.month] if args.month is not None else range(1, 1 + 12)
    for year in years:
        for month in months:
            if args.day is not None:
                days = [args.day]
            else:
                days = cal.itermonthdays(year, month)
            for day in days:
                if day > 0:
                    yield "%04d%02d%02d" % (year, month, day)

def main():
    parser = argparse.ArgumentParser(description="An HBase example", usage=USAGE)
    parser.add_argument("--action", metavar="ACTION", choices=("generate", "query"), required=True)
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, default=None)
    parser.add_argument("--day", type=int, default=None)
    parser.add_argument("--total", action="store_true", default=False)
    args = parser.parse_args()
    table = connect()
    if args.day is not None and args.month is None:
        raise RuntimeError("Please, specify a month when specifying a day.")
    if args.day is not None and (args.day < 0 or args.day > 31):
        raise RuntimeError("Please, specify a valid day.")
    query(args, table)

if __name__ == "__main__":
    main()
So, how should I change my script (specifically, the query() function) to get the separate columns in the given date range?
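A minimal sketch of one possible approach, assuming the column qualifiers are the dates themselves (e.g. cf:20141027): table.scan() accepts a columns argument listing the qualifiers to return, so that list can be built from the date range.

def query(args, table):
    dates = list(get_time_range(args))
    # Only ask HBase for the qualifiers inside the date range.
    wanted = ["cf:%s" % d for d in dates]  # assumption: qualifiers are dates
    for key, data in table.scan(columns=wanted):
        for column, value in sorted(data.items()):
            # Strip the family prefix to get the date back.
            print "%s\t%s\t%s" % (column[len("cf:"):], key, value)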
Source: (StackOverflow)
I cannot install happybase on Ubuntu Linux 12.04 with Python 2.7. I've tried pip install happybase,
but I get an error. Can anyone tell me what I am doing wrong?
The error is:
error: invalid command 'egg_info'
Complete output from command python setup.py egg_info:
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option:
Source: (StackOverflow)
I want to create an HBase database on Amazon EC2 and write some test data to it using Python and HappyBase. How do I do that?
Please point me to links where I can read about it.
Thanks
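Once HBase and its Thrift server are running on the EC2 instance, a minimal sketch of writing and reading test data with HappyBase could look like this (the host, table name, and column are assumptions):

import happybase

# Assumption: the HBase Thrift server runs on this EC2 host, default port 9090.
connection = happybase.Connection('ec2-xx-xx-xx-xx.compute.amazonaws.com')

if 'test' not in connection.tables():
    connection.create_table('test', {'cf': dict()})

table = connection.table('test')
table.put('row-1', {'cf:greeting': 'hello from happybase'})
print table.row('row-1')

connection.close()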
Source: (StackOverflow)
Is it possible to get only a limited number of columns from a column family for a row? Let's say I just want to fetch the first 10 values for ['cf1': 'col1'] for a particular row.
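One possible approach, sketched under the assumption that the HBase ColumnPaginationFilter is available through the Thrift scanner: pass the filter string to table.scan(), or restrict to explicitly named qualifiers with the columns argument of table.row(). The connection and table names below are hypothetical.

import happybase

connection = happybase.Connection('localhost')  # assumption
table = connection.table('mytable')             # hypothetical table

# Ask HBase for at most 10 columns per row, starting at offset 0.
for key, data in table.scan(row_prefix='myrow',
                            filter="ColumnPaginationFilter(10, 0)"):
    print key, data

# Alternatively, fetch only explicitly named qualifiers for one row.
row = table.row('myrow', columns=['cf1:col1'])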
Source: (StackOverflow)
I have two versions of a Python script that scan a table in HBase 1000 rows at a time in a while loop.
The first one uses HappyBase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows:
while variable:
    for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
        print key
    new_key = key
The second one uses the HBase Thrift interface directly, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/:
scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
data = hbase.scannerGetList(scanner_id, 1000)
while len(data):
    for dbpost in data:
        print row_of_dbpost
    data = hbase.scannerGetList(scanner_id, 1000)
The row keys in the database are numbers. My problem is that at a certain row something weird happens.
HappyBase prints (rows):
... 100161632382107648
10016177552
10016186396
10016200693
10016211838
100162138374537217 (point of interest)
193622937692155904
193623435597983745...
and the Thrift scanner prints (rows):
... 100161632382107648
10016177552
10016186396
10016200693
10016211838
100162138374537217 (point of interest)
100162267416506368
10016241167
10016296927 ...
This happens not at the boundary of the next 1000 rows (the next row_start=new_key or the next data = scannerGetList call), but in the middle of a batch, and it happens every time.
I would say that the second script, using scannerGetList, is doing it right.
Why does HappyBase do it differently? Is it considering timestamps or some other internal HappyBase/HBase logic? Will it eventually scan the whole table, just in a different order?
P.S. I do know that the HappyBase version will scan and print the 1000th row twice, and that scannerGetList will ignore the first row of the next batch. That is not the point; the magic is happening in the middle of a 1000-row batch.
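As an aside, a small sketch of how the HappyBase loop can resume without re-reading the last row, assuming plain byte-string row keys: since row_start is inclusive, appending a zero byte to the last key makes the next scan start strictly after it.

table = hbase.table(tablename)
new_key = ''
while True:
    rows = list(table.scan(row_start=new_key, batch_size=1000, limit=1000))
    if not rows:
        break
    for key, data in rows:
        print key
    # row_start is inclusive, so start the next scan just past the last key.
    new_key = rows[-1][0] + '\x00'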
Source: (StackOverflow)
I'm new to HBase and I would like to communicate with it through a Python API which works with Thrift. I've followed this tutorial in order to install it properly on my machine, and everything seemed to work fine. Then I generated a .thrift file with the following command:
wget http://svn.apache.org/viewvc/hbase/trunk/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift\?view\=markup
-O hbase.thrift
Then I tried to generate my client as shown here, but I get the following error message:
[ERROR:/home/tests/hbase/hbase.thrift:12] (last token was '<')
syntax error
[FAILURE:/home/tests/hbase/hbase.thrift:12] Parser error during include pass.
I tried to look up on the internet what the cause of this error was and found this paper. I tried to look in thriftl.ll to see if I could correct the error, but I found that the correction was already present in the file.
What more can I do to make this work?
Thank you!
EDIT:
I'm using thrift 0.9.0
Source: (StackOverflow)
I am trying to filter rows from an HBase table (I am using HappyBase); specifically, I am trying to get rows whose 'id' is less than 1000:
for key, data in graph_table.scan(filter="SingleColumnValueFilter('cf', 'id', <, 'binary:1000')"):
    print key, data
The results are the following:
<http://ieee.rkbexplorer.com/id/publication-d2a6837e67d808b41ffe6092db50f7cc> {'cf:type': 'v', 'cf:id': '100', 'cf:label': '<http://www.aktors.org/ontology/portal#Proceedings-Paper-Reference>'}
<http://www.aktors.org/ontology/date#1976> {'cf:type': 'v', 'cf:id': '1', 'cf:label': '<http://www.aktors.org/ontology/support#Calendar-Date>'}
<http://www.aktors.org/ontology/date#1985> {'cf:type': 'v', 'cf:id': '10', 'cf:label': '<http://www.aktors.org/ontology/support#Calendar-Date>'}
In the table there are rows with 'id' from 1 to 1000. If I code this in Java using the HBase client library it works fine, encoding the integer value with the Bytes.toBytes() function.
Thank you.
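For context, a possible explanation and sketch, assuming the ids were written from Python as decimal strings: the binary comparator compares the stored bytes lexicographically, so '2' sorts after '1000'. One workaround is to store the ids zero-padded so that lexicographic and numeric order agree.

# Write ids with a fixed width so byte-wise comparison matches numeric order.
graph_table.put(row_key, {'cf:id': '%010d' % 100})   # stored as '0000000100'

# Then the same filter works against a padded literal.
scan_filter = "SingleColumnValueFilter('cf', 'id', <, 'binary:%010d')" % 1000
for key, data in graph_table.scan(filter=scan_filter):
    print key, data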
Source: (StackOverflow)
I am trying to pass a list of regexes to the columns argument in my HappyBase scan calls. This is because my column names are built by dynamically appending ids which I don't have access to at scan time.
Is this possible?
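As far as I know, the columns argument takes literal family or family:qualifier names, but a sketch of a possible alternative, assuming the HBase filter language is available through Thrift: match qualifiers by prefix or by regex with a filter string (the patterns below are hypothetical).

# Match all qualifiers in the family that start with a known prefix.
for key, data in table.scan(filter="ColumnPrefixFilter('user_')"):
    print key, data

# Or match qualifiers against a regular expression.
regex_filter = "QualifierFilter(=, 'regexstring:^user_[0-9]+$')"
for key, data in table.scan(filter=regex_filter):
    print key, data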
Source: (StackOverflow)
All the rows in a particular HBase table that I am building a UI for happen to have the same columns, and will for the foreseeable future. I would like my HTML data-visualizer application to simply query a single arbitrary row, take note of the column names, and put that list of column names into a variable to refer to throughout the program.
I didn't see any equivalent of find_one or scan_one in the HappyBase docs.
What is the best way to accomplish this?
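A small sketch of one way to do it with the existing API (the connection details and table name are assumptions): a scan with limit=1 yields at most one row, from which the column names can be read.

import happybase

connection = happybase.Connection('localhost')  # assumption
table = connection.table('mytable')             # hypothetical table name

# Take the first row the scanner returns and remember its column names.
# (Raises StopIteration if the table is empty.)
key, data = next(table.scan(limit=1))
column_names = sorted(data.keys())
print column_names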
Source: (StackOverflow)
I am working with Spark and HBase (using the HappyBase library), and everything goes fine when working with small datasets. But when working with big datasets, the connection to HBase Thrift is lost after many calls to the map function. I am working on a single pseudo-distributed node at the moment.
Concretely, the following error occurs in the map function:
TTransportException: Could not connect to localhost:9090
Map function:
def save_triples(triple, ac, table_name, ac_vertex_id, graph_table_name):
    connection = happybase.Connection(HBASE_SERVER_IP, compat='0.94')
    table = connection.table(table_name)
    [...]
    connection.close()
This is the call to the map function:
counts = lines.map(lambda x: save_triples(x, ac, table_name, ac_vertex_id, graph_table_name))
output = counts.collect()
I suspect that this is happening because too many connections are being opened. I have tried creating the 'connection' object in the main function and passing it to the map function as a parameter (something like this works with the HBase libraries in Java), but then I get the following error:
pickle.PicklingError: Can't pickle builtin <type 'method_descriptor'>
Any help would be appreciated.
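A minimal sketch of a common way to reduce the number of connections (the per-record work and return value below are assumptions): use mapPartitions so that each partition opens one HappyBase connection, reuses it for all of its records, and closes it at the end.

def save_partition(triples):
    # One connection per partition instead of one per record.
    connection = happybase.Connection(HBASE_SERVER_IP, compat='0.94')
    table = connection.table(table_name)
    results = []
    try:
        for triple in triples:
            # ... write `triple` with table.put(...) here ...
            results.append(triple)
    finally:
        connection.close()
    return results

counts = lines.mapPartitions(save_partition)
output = counts.collect()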
Source: (StackOverflow)