joblib
Python function as pipeline jobs.
Joblib: running Python functions as pipeline jobs — joblib 0.9.0b4 documentation
What is the reason of such issue in joblib?
'Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1'
What should I do to avoid such issue?
Actually I need to implement XMLRPC server which run heavy computation in background thread and report current progress through polling from UI client. It uses scikit-learn which are based on joblib.
P.S.:
I've simply changed name of the thread to "MainThread" to avoid such warning and everything looks working good (run in parallel as expected without issues). What might be a problem in future for such workaround?
Source: (StackOverflow)
I would like to parallelize a for loop in python.
The loop gets fed by a generator and I expect 1 billion items.
It turned out, that joblib has a giant memory leak
Parallel(n_jobs=num_cores)(delayed(testtm)(tm) for tm in powerset(all_turns))
I do not want to store data in this loop, just print sometimes something out, but the main thread grows in seconds to 1 GB size.
Are there any other frameworks for a large number of iterations?
Source: (StackOverflow)
Say I have a function that runs a SQL query and returns a dataframe:
import pandas.io.sql as psql
import sqlalchemy
query_string = "select a from table;"
def run_my_query(my_query):
# username, host, port and database are hard-coded here
engine = sqlalchemy.create_engine('postgresql://{username}@{host}:{port}/{database}'.format(username=username, host=host, port=port, database=database))
df = psql.read_sql(my_query, engine)
return df
# Run the query (this is what I want to memoize)
df = run_my_query(my_query)
I would like to:
- Be able to memoize my query above with one cache entry per value of
query_string
(i.e. per query)
- Be able to force a cache reset on demand (e.g. based on some flag), e.g. so that I can update my cache if I think that the database has changed.
How can I do this with joblib, jug?
Source: (StackOverflow)
I want to save and load a fitted Random Forest Classifier, but I get an error.
forest = RandomForestClassifier(n_estimators = 100, max_features = mf_val)
forest = forest.fit(L1[0:100], L2[0:100])
joblib.dump(forest, 'screening_forest/screening_forest.pkl')
forest2 = joblib.load('screening_forest/screening_forest.pkl')
The error is:
File "C:\Users\mkolarek\Documents\other\TrackerResultAnalysis\ScreeningClassif
ier\ScreeningClassifier.py", line 67, in <module>
forest2 = joblib.load('screening_forest/screening_forest.pkl')
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py",
line 425, in load
obj = unpickler.load()
File "C:\Python27\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py",
line 285, in load_build
Unpickler.load_build(self)
File "C:\Python27\lib\pickle.py", line 1217, in load_build
setstate(state)
File "_tree.pyx", line 2280, in sklearn.tree._tree.Tree.__setstate__ (sklearn\
tree\_tree.c:18350)
ValueError: Did not recognise loaded array layout
Press any key to continue . . .
Do I have to initialize forest2 or something?
Source: (StackOverflow)
I am looking for a way of building a decorator @memoize
that I can use in functions as follows:
@memoize
my_function(a, b, c):
# Do stuff
# result may not always be the same for fixed (a,b,c)
return result
Then, if I do:
result1 = my_function(a=1,b=2,c=3)
# The function f runs (slow). We cache the result for later
result2 = my_function(a=1, b=2, c=3)
# The decorator reads the cache and returns the result (fast)
Now say that I want to force a cache update:
result3 = my_function(a=1, b=2, c=3, force_update=True)
# The function runs *again* for values a, b, and c.
result4 = my_function(a=1, b=2, c=3)
# We read the cache
At the end of the above, we always have result4 = result3
, but not necessarily result4 = result
, which is why one needs an option to force the cache update for the same input parameters.
How can I approach this problem?
Note on joblib
As far as I know joblib
supports .call
, which forces a re-run, but it does not update the cache.
Follow-up on using klepto
:
Is there any way to have klepto
(see @Wally's answer) cache its results by default under a specific location? (e.g. /some/path/
) and share this location across multiple functions? E.g. I would like to say
cache_path = "/some/path/"
and then @memoize
several functions in a given module under the same path.
Source: (StackOverflow)
Say I setup memoization with Joblib as follows (using the solution provided here):
from tempfile import mkdtemp
cachedir = mkdtemp()
from joblib import Memory
memory = Memory(cachedir=cachedir, verbose=0)
@memory.cache
def run_my_query(my_query)
...
return df
And say I define a couple of queries, query_1
and query_2
, both of them take a long time to run.
I understand that, with the code as it is:
The second call with either query, would use the memoized output, i.e:
run_my_query(query_1)
run_my_query(query_1) # <- Uses cached output
run_my_query(query_2)
run_my_query(query_2) # <- Uses cached output
I could use memory.clear()
to delete the entire cache directory
But what if I want to re-do the memoization for only one of the queries (e.g. query_2
) without forcing a delete on the other query?
Source: (StackOverflow)
Say I have some code that creates several variables:
# Some code
# Beginning of the block to memoize
a = foo()
b = bar()
...
c =
# End of the block to memoize
# ... some more code
I would like to memoize the entire block above without having to be explicit about every variable created/changed in the block or pickle them manually. How can I do this in Python?
Ideally I would like to be able to wrap it with something (if/else
or with
statement) and have a flag that forces a refresh if I want.
Conceptually speaking, it woul dbe like:
# Some code
# Flag that I can set from outside to save or force a reset of the chache
refresh_cache = True
if refresh_cache == False
load_cache_of_block()
else:
# Beginning of the block to memoize
a = foo()
b = bar()
...
c = stuff()
# End of the block to memoize
save_cache_of_block()
# ... some more code
Is there any way to do this without having to explicitly pickle each variable defined or changed in the code? (i.e. at the end of the first run we save, and we later just reuse the values)
Source: (StackOverflow)
I have a python script that uses sklearn joblib to load a persistent model and perform prediction. The script runs fine when I run it under my username and when some other user tries to run the same script they get the error "ImportError: No module named numpy_pickle"
I also copied the script to the other user home directory and run it from there and still same error and I also ran it from python shell and nothing changed. Here is what I run in the Python shell:
from sklearn.externals import joblib
joblib.load("model_filename.pkl")
The second line above works under my username and gives the following error under all other users:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.7/joblib/numpy_pickle.py", line 424, in load
obj = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named numpy_pickle
This is all running one a server with Ubuntu 14.04.1 LTS.
Any ideas why this is happening?
Thank you
Source: (StackOverflow)
I'm using Joblib to cache results of a computationally expensive function in my python script. The function's input arguments and return values are numpy arrays. The cache works fine for a single run of my python script. Now I want to spawn multiple runs of my python script in parallel for sweeping some parameter in an experiment. (The definition of the function remains same across all the runs).
Is there a way to share the joblib cache among multiple python scripts running in parallel? This would save a lot of function evaluations which are repeated across different runs but do not repeat within a single run. I couldn't find if this is possible in Joblib's documentation
Source: (StackOverflow)
I'm using sklearn.externals.joblib
to persist a classifier model to the disk which in reality uses pickle
module at lower level.
I create a custom CountVectorizer
class named StemmedCountVectorizer
and saved it in util.py
, then used it in the script for persisting the model
import util
from sklearn.externals import joblib
vect = util.StemmedCountVectorizer(stop_words='english', ngram_range=(1,1))
bow = vect.fit_transform(sentences)
joblib.dump(vect, 'vect.pkl')
This my project structure using Flask:
|- sentiment/
|- run.py
|- my_app/
|- analytic/
|- views.py
|- util. py
|- vect.pkl
I run the app with python run.py
and try to load the persisted object with joblib.load
in views.py
but it does not work, I imported the util
module but I receive the error:
ImportError: No module named util
can anybody give a solution to this? thanks
Source: (StackOverflow)
I have a nested list comprehension that looks something like this:
>>> nested = [[1, 2], [3, 4, 5]]
>>> [[sqrt(i) for i in j] for j in nested]
[[1.0, 1.4142135623730951], [1.7320508075688772, 2.0, 2.23606797749979]]
Is it possible to parellelize this using the standard joblib approach for embarrassingly parallel for loops? If so, what is the proper syntax for delayed
?
As far as I can tell, the docs don't mention or give any example of nested inputs. I've tried a few naive implementations, to no avail:
>>> #this syntax fails:
>>> Parallel(n_jobs = 2) (delayed(sqrt)(i for i in j) for j in nested)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\joblib\parallel.py", line 660, in __call__
self.retrieve()
File "C:\Python27\lib\site-packages\joblib\parallel.py", line 512, in retrieve
self._output.append(job.get())
File "C:\Python27\lib\multiprocessing\pool.py", line 558, in get
raise self._value
pickle.PicklingError: Can't pickle <type 'generator'>: it's not found as __builtin__.generator
>>> #this syntax doesn't fail, but gives the wrong output:
>>> Parallel(n_jobs = 2) (delayed(sqrt)(i) for i in j for j in nested)
[1.7320508075688772, 1.7320508075688772, 2.0, 2.0, 2.23606797749979, 2.23606797749979]
If this is impossible, I can obviously restructure the list before and after passing it to Parallel
. However, my actual list is long and each item is enormous, so doing so isn't ideal.
Source: (StackOverflow)
I've saved my classifier pipeline using joblib:
vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train,y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)
Now i'm trying to use it in a production env:
def classify(title):
#load classifier and predict
classifier = joblib.load('class.pkl')
#vectorize/transform the new title then predict
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
X_test = vectorizer.transform(title)
predict = classifier.predict(X_test)
return predict
The error i'm getting is: ValueError: Vocabulary wasn't fitted or is empty!
I guess i should load the Vocabulary from te joblid but i can't get it to work
Source: (StackOverflow)
Is there a simple way to track the overall progress of a joblib.Parallel execution?
I have a long-running execution composed of thousands of jobs, which I want to track and record in a database. However, to do that, whenever Parallel finishes a task, I need it to execute a callback, reporting how many remaining jobs are left.
I've accomplished a similar task before with Python's stdlib multiprocessing.Pool, by launching a thread that records the number of pending jobs in Pool's job list.
Looking at the code, Parallel inherits Pool, so I thought I could pull off the same trick, but it doesn't seem to use these that list, and I haven't been able to figure out how else to "read" it's internal status any other way.
Source: (StackOverflow)
I trained a model using PyNeural.
How do I save this trained model?
I tried pickle
and sklearn.externals.joblib
but did not work.
As the later, said TypeError: can't pickle NeuralNet objects
Source: (StackOverflow)
I've noticed a huge delay when using multiprocessing (with joblib). Here is a simplified version of my code:
import numpy as np
from joblib import Parallel, delayed
class Matcher(object):
def match_all(self, arr1, arr2):
args = ((elem1, elem2) for elem1 in arr1 for elem2 in arr2)
results = Parallel(n_jobs=-1)(delayed(_parallel_match)(self, e1, e2) for e1, e2 in args)
# ...
def match(self, i1, i2):
return i1 == i2
def _parallel_match(m, i1, i2):
return m.match(i1, i2)
matcher = Matcher()
matcher.match_all(np.ones(250), np.ones(250))
So if I run it like shown above, it takes about 30 secs to complete and use almost 200Mb.
If I just change the parameter n_jobs in Parallel and set it to 1 it only takes 1.80 secs and barely use 50Mb...
I suppose it has to be something related to the way I pass the arguments, but haven't found a better way to do it...
I'm using Python 2.7.9
Source: (StackOverflow)