EzDevInfo.com

joblib

Python function as pipeline jobs. Joblib: running Python functions as pipeline jobs — joblib 0.9.0b4 documentation

Get it on Github
Language : Python

http://packages.python.org/joblib/

Multiprocessing backed parallel loops cannot be nested below threads

What is the reason of such issue in joblib? 'Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1' What should I do to avoid such issue?

Actually I need to implement XMLRPC server which run heavy computation in background thread and report current progress through polling from UI client. It uses scikit-learn which are based on joblib.

P.S.: I've simply changed name of the thread to "MainThread" to avoid such warning and everything looks working good (run in parallel as expected without issues). What might be a problem in future for such workaround?

Source: (StackOverflow)

parallel for loop python

I would like to parallelize a for loop in python.

The loop gets fed by a generator and I expect 1 billion items.

It turned out, that joblib has a giant memory leak

Parallel(n_jobs=num_cores)(delayed(testtm)(tm) for tm in powerset(all_turns))

I do not want to store data in this loop, just print sometimes something out, but the main thread grows in seconds to 1 GB size.

Are there any other frameworks for a large number of iterations?

Source: (StackOverflow)

Advertisements

Memoizing SQL queries

Say I have a function that runs a SQL query and returns a dataframe:

import pandas.io.sql as psql
import sqlalchemy

query_string = "select a from table;"

def run_my_query(my_query):
    # username, host, port and database are hard-coded here
    engine = sqlalchemy.create_engine('postgresql://{username}@{host}:{port}/{database}'.format(username=username, host=host, port=port, database=database))

    df = psql.read_sql(my_query, engine)
    return df

# Run the query (this is what I want to memoize)
df = run_my_query(my_query)

I would like to:

Be able to memoize my query above with one cache entry per value of query_string (i.e. per query)
Be able to force a cache reset on demand (e.g. based on some flag), e.g. so that I can update my cache if I think that the database has changed.

How can I do this with joblib, jug?

Source: (StackOverflow)

Saving Random Forest

I want to save and load a fitted Random Forest Classifier, but I get an error.

forest = RandomForestClassifier(n_estimators = 100, max_features = mf_val)
forest = forest.fit(L1[0:100], L2[0:100])
joblib.dump(forest, 'screening_forest/screening_forest.pkl')
forest2 = joblib.load('screening_forest/screening_forest.pkl')

The error is:

  File "C:\Users\mkolarek\Documents\other\TrackerResultAnalysis\ScreeningClassif
ier\ScreeningClassifier.py", line 67, in <module>
    forest2 = joblib.load('screening_forest/screening_forest.pkl')
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py",
 line 425, in load
    obj = unpickler.load()
  File "C:\Python27\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py",
 line 285, in load_build
    Unpickler.load_build(self)
  File "C:\Python27\lib\pickle.py", line 1217, in load_build
    setstate(state)
  File "_tree.pyx", line 2280, in sklearn.tree._tree.Tree.__setstate__ (sklearn\
tree\_tree.c:18350)
ValueError: Did not recognise loaded array layout
Press any key to continue . . .

Do I have to initialize forest2 or something?

Source: (StackOverflow)

Decorators for selective caching / memoization

I am looking for a way of building a decorator @memoize that I can use in functions as follows:

@memoize
my_function(a, b, c):
    # Do stuff 
    # result may not always be the same for fixed (a,b,c)
return result

Then, if I do:

result1 = my_function(a=1,b=2,c=3)
# The function f runs (slow). We cache the result for later

result2 = my_function(a=1, b=2, c=3)
# The decorator reads the cache and returns the result (fast)

Now say that I want to force a cache update:

result3 = my_function(a=1, b=2, c=3, force_update=True)
# The function runs *again* for values a, b, and c. 

result4 = my_function(a=1, b=2, c=3)
# We read the cache

At the end of the above, we always have result4 = result3, but not necessarily result4 = result, which is why one needs an option to force the cache update for the same input parameters.

How can I approach this problem?

Note on `joblib`

As far as I know joblib supports .call, which forces a re-run, but it does not update the cache.

Follow-up on using `klepto`:

Is there any way to have klepto (see @Wally's answer) cache its results by default under a specific location? (e.g. /some/path/) and share this location across multiple functions? E.g. I would like to say

cache_path = "/some/path/"

and then @memoize several functions in a given module under the same path.

Source: (StackOverflow)

Selective Re-Memoization of DataFrames

Say I setup memoization with Joblib as follows (using the solution provided here):

from tempfile import mkdtemp
cachedir = mkdtemp()

from joblib import Memory
memory = Memory(cachedir=cachedir, verbose=0)

@memory.cache
def run_my_query(my_query)
    ...
    return df

And say I define a couple of queries, query_1 and query_2, both of them take a long time to run.

I understand that, with the code as it is:

The second call with either query, would use the memoized output, i.e:

run_my_query(query_1)
run_my_query(query_1) # <- Uses cached output

run_my_query(query_2)
run_my_query(query_2) # <- Uses cached output

I could use memory.clear() to delete the entire cache directory

But what if I want to re-do the memoization for only one of the queries (e.g. query_2) without forcing a delete on the other query?

Source: (StackOverflow)

Memoizing an entire block in Python

Say I have some code that creates several variables:

# Some code

# Beginning of the block to memoize
a = foo()
b = bar()
...
c =
# End of the block to memoize

# ... some more code

I would like to memoize the entire block above without having to be explicit about every variable created/changed in the block or pickle them manually. How can I do this in Python?

Ideally I would like to be able to wrap it with something (if/else or with statement) and have a flag that forces a refresh if I want.

Conceptually speaking, it woul dbe like:

# Some code

# Flag that I can set from outside to save or force a reset of the chache 
refresh_cache = True

if refresh_cache == False
   load_cache_of_block()
else:      
   # Beginning of the block to memoize
   a = foo()
   b = bar()
   ...
   c = stuff()
   # End of the block to memoize

   save_cache_of_block()

# ... some more code

Is there any way to do this without having to explicitly pickle each variable defined or changed in the code? (i.e. at the end of the first run we save, and we later just reuse the values)

Source: (StackOverflow)

No module named numpy_pickle when executing script under a different user

I have a python script that uses sklearn joblib to load a persistent model and perform prediction. The script runs fine when I run it under my username and when some other user tries to run the same script they get the error "ImportError: No module named numpy_pickle"

I also copied the script to the other user home directory and run it from there and still same error and I also ran it from python shell and nothing changed. Here is what I run in the Python shell:

from sklearn.externals import joblib
joblib.load("model_filename.pkl")

The second line above works under my username and gives the following error under all other users:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/joblib/numpy_pickle.py", line 424, in load
    obj = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named numpy_pickle

This is all running one a server with Ubuntu 14.04.1 LTS.

Any ideas why this is happening?

Thank you

Source: (StackOverflow)

Multiple processes sharing a single Joblib cache

I'm using Joblib to cache results of a computationally expensive function in my python script. The function's input arguments and return values are numpy arrays. The cache works fine for a single run of my python script. Now I want to spawn multiple runs of my python script in parallel for sweeping some parameter in an experiment. (The definition of the function remains same across all the runs).

Is there a way to share the joblib cache among multiple python scripts running in parallel? This would save a lot of function evaluations which are repeated across different runs but do not repeat within a single run. I couldn't find if this is possible in Joblib's documentation

Source: (StackOverflow)

ImportError: No module named util

I'm using sklearn.externals.joblib to persist a classifier model to the disk which in reality uses pickle module at lower level.

I create a custom CountVectorizer class named StemmedCountVectorizer and saved it in util.py, then used it in the script for persisting the model

import util

from sklearn.externals import joblib

vect = util.StemmedCountVectorizer(stop_words='english', ngram_range=(1,1))

bow = vect.fit_transform(sentences)

joblib.dump(vect, 'vect.pkl')

This my project structure using Flask:

   |- sentiment/
     |- run.py
     |- my_app/
       |- analytic/
         |- views.py
         |- util. py
         |- vect.pkl

I run the app with python run.py and try to load the persisted object with joblib.load in views.py but it does not work, I imported the util module but I receive the error:

ImportError: No module named util

can anybody give a solution to this? thanks

Source: (StackOverflow)

joblib.Parallel for nested list comprehension

I have a nested list comprehension that looks something like this:

>>> nested = [[1, 2], [3, 4, 5]]
>>> [[sqrt(i) for i in j] for j in nested]
[[1.0, 1.4142135623730951], [1.7320508075688772, 2.0, 2.23606797749979]]

Is it possible to parellelize this using the standard joblib approach for embarrassingly parallel for loops? If so, what is the proper syntax for delayed?

As far as I can tell, the docs don't mention or give any example of nested inputs. I've tried a few naive implementations, to no avail:

>>> #this syntax fails:
>>> Parallel(n_jobs = 2) (delayed(sqrt)(i for i in j) for j in nested)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\joblib\parallel.py", line 660, in __call__
    self.retrieve()
  File "C:\Python27\lib\site-packages\joblib\parallel.py", line 512, in retrieve
    self._output.append(job.get())
  File "C:\Python27\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
pickle.PicklingError: Can't pickle <type 'generator'>: it's not found as __builtin__.generator
>>> #this syntax doesn't fail, but gives the wrong output:
>>> Parallel(n_jobs = 2) (delayed(sqrt)(i) for i in j for j in nested)
[1.7320508075688772, 1.7320508075688772, 2.0, 2.0, 2.23606797749979, 2.23606797749979]

If this is impossible, I can obviously restructure the list before and after passing it to Parallel. However, my actual list is long and each item is enormous, so doing so isn't ideal.

Source: (StackOverflow)

Bringing a classifier to production

I've saved my classifier pipeline using joblib:

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train,y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)

Now i'm trying to use it in a production env:

def classify(title):

  #load classifier and predict
  classifier = joblib.load('class.pkl')

  #vectorize/transform the new title then predict
  vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
  X_test = vectorizer.transform(title)
  predict = classifier.predict(X_test)
  return predict

The error i'm getting is: ValueError: Vocabulary wasn't fitted or is empty! I guess i should load the Vocabulary from te joblid but i can't get it to work

Source: (StackOverflow)

Tracking progress of joblib.Parallel execution

Is there a simple way to track the overall progress of a joblib.Parallel execution?

I have a long-running execution composed of thousands of jobs, which I want to track and record in a database. However, to do that, whenever Parallel finishes a task, I need it to execute a callback, reporting how many remaining jobs are left.

I've accomplished a similar task before with Python's stdlib multiprocessing.Pool, by launching a thread that records the number of pending jobs in Pool's job list.

Looking at the code, Parallel inherits Pool, so I thought I could pull off the same trick, but it doesn't seem to use these that list, and I haven't been able to figure out how else to "read" it's internal status any other way.

Source: (StackOverflow)

Saving a PyNeural model to disk

I trained a model using PyNeural.

How do I save this trained model? I tried pickle and sklearn.externals.joblib but did not work. As the later, said TypeError: can't pickle NeuralNet objects

Source: (StackOverflow)

Python multiprocessing (joblib) best way for argument passing

I've noticed a huge delay when using multiprocessing (with joblib). Here is a simplified version of my code:

import numpy as np
from joblib import Parallel, delayed

class Matcher(object):
    def match_all(self, arr1, arr2):
        args = ((elem1, elem2) for elem1 in arr1 for elem2 in arr2)

        results = Parallel(n_jobs=-1)(delayed(_parallel_match)(self, e1, e2) for e1, e2 in args)
        # ...

    def match(self, i1, i2):
        return i1 == i2

def _parallel_match(m, i1, i2):
    return m.match(i1, i2)

matcher = Matcher()
matcher.match_all(np.ones(250), np.ones(250))

So if I run it like shown above, it takes about 30 secs to complete and use almost 200Mb. If I just change the parameter n_jobs in Parallel and set it to 1 it only takes 1.80 secs and barely use 50Mb...

I suppose it has to be something related to the way I pass the arguments, but haven't found a better way to do it...

I'm using Python 2.7.9

Source: (StackOverflow)