EzDevInfo.com

fuzzywuzzy

FuzzyWuzzy: Fuzzy String Matching in Python (ChairNerd). SeatGeek open sourced seatgeek/fuzzywuzzy, a fuzzy string matching library for Python. "We've made it our mission to pull in event tickets from every corner of …"

Python fuzzywuzzy error string or buffer expect

I'm using fuzzywuzzy to find near matches in a csv of company names. I'm comparing manually matched strings with the unmatched strings in the hope of finding some useful proximity matches, however, I'm getting a string or buffer error within fuzzywuzzy. My code is:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]  
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()


    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

This creates the following error:

Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

This is something to do with importing into pandas with an encoding specified, which I added to prevent UnicodeDecodeErrors but which had the knock-on effect of causing this error. I've tried to force the object to a string using str(sssf), but that doesn't work.

So, I've isolated a line that is causing the error: #N/A,,,,,, (line 29 in the data pasted below). I assumed it was the # that was causing the error but, strangely, it's not; it's the A character that causes the problem, because the file works when it is removed. What is strange to me is that the string two rows below is N/A, which parses fine, yet row 29 won't parse when I delete the # symbol, even though the field then appears identical to the field below.

sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,
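
A possible explanation, offered as a sketch rather than a confirmed diagnosis: pandas' read_csv treats strings such as #N/A and N/A as missing values by default, so those cells arrive as float NaN instead of text, and the regex inside fuzzywuzzy then rejects them. Passing keep_default_na=False to read_csv is one option, although it also changes how empty cells are read, which would affect the isnull()/notnull() split above. Another option is to filter out non-strings before matching. The helper name clean_choices is hypothetical, and the snippet is Python 3 with no pandas dependency.

```python
def clean_choices(values):
    """Keep only non-empty strings; drop NaN floats, None and other non-text values."""
    return [v for v in values if isinstance(v, str) and v.strip()]


# float("nan") stands in for what a missing cell looks like after loading.
raw = ["N9 DESIGN", float("nan"), "N A", None, "N2 CHECK LTD"]
print(clean_choices(raw))  # ['N9 DESIGN', 'N A', 'N2 CHECK LTD']
```

The same filter would apply to both the list of choices and the queries before they are handed to process.extractOne.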

Source: (StackOverflow)

Import Error: No module named 'utils'

Please excuse me, I'm a newbie. I'm trying to use the fuzzywuzzy module from SeatGeek. I am using Python 3.

Initially, I was getting this error:

  from fuzzywuzzy import fuzz
ImportError: cannot import name fuzz

I changed the import statement to import fuzzywuzzy.fuzz and now I'm getting this error:

  File "test.py", line 4, in <module>
     import fuzzywuzzy.fuzz
  File "C:\Python33\lib\site-packages\fuzzywuzzy\fuzz.py", line 31, in <module>
     from utils import *
ImportError: No module named 'utils'
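
The second traceback points at "from utils import *" inside the package itself. That is an implicit relative import, which Python 2 allowed and Python 3 does not, so the installed copy of fuzzywuzzy is probably a release that predates Python 3 support, and upgrading the package is the likely fix. The sketch below reproduces the difference with a throwaway package; every file and module name in it is made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk (hypothetical name: demo_rel_pkg).
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "demo_rel_pkg")
os.mkdir(pkg_dir)

files = {
    "__init__.py": "",
    "_demo_helpers_328.py": "def shout(s):\n    return s.upper()\n",
    # Explicit relative import: works on Python 3.
    "good.py": "from ._demo_helpers_328 import shout\n",
    # Implicit relative import: Python 2 style, fails on Python 3.
    "bad.py": "from _demo_helpers_328 import shout\n",
}
for name, body in files.items():
    with open(os.path.join(pkg_dir, name), "w") as fh:
        fh.write(body)

sys.path.insert(0, pkg_root)

good = importlib.import_module("demo_rel_pkg.good")
print(good.shout("ok"))  # OK

try:
    importlib.import_module("demo_rel_pkg.bad")
    bad_import_failed = False
except ImportError:
    # Python 3 looks for a top-level module and does not find one.
    bad_import_failed = True
print(bad_import_failed)  # True
```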

Source: (StackOverflow)

Python Fuzzy Matching (FuzzyWuzzy)

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):

    with open(path,'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            yield row


if __name__ == "__main__":
    ## Create lookup dictionary by parsing the products csv
    data = {}
    for row in parse_csv('names_1.csv'):
        data[row[0]] = row[0]

    ## For each row in the lookup compute the partial ratio
    for row in parse_csv("names_2.csv"):
        #print(process.extract(row,data, limit = 100))
        for found, score, matchrow in process.extract(row, data, limit=100):
            if score >= 60:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [row, score, found]
                writer.writerow(Digi_Results)


save_file.close()

The output is as follows:

Name11 , 90 , Name25 
Name11 , 85 , Name24 
Name11 , 65 , Name29

The script works fine. The output is as expected. But what I am looking for is only the best match.

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22

So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2. It should be fairly straightforward, but I can't seem to figure it out. Any help would be appreciated.
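
One way to keep only the top-scoring row per name is a single pass that remembers the best score seen so far. This is a minimal sketch in plain Python 3; the function name best_per_name is hypothetical and the rows are the sample output from the question. Asking the matcher for one result per query (for example with process.extractOne, which another question on this page uses) may avoid the problem altogether.

```python
def best_per_name(rows):
    """Keep, for each name in column 1, the row with the highest score in column 2."""
    best = {}
    for name, score, match in rows:
        if name not in best or score > best[name][0]:
            best[name] = (score, match)
    return [(name, score, match) for name, (score, match) in best.items()]


rows = [
    ("Name11", 90, "Name25"),
    ("Name11", 85, "Name24"),
    ("Name11", 65, "Name29"),
]
print(best_per_name(rows))  # [('Name11', 90, 'Name25')]
```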


Source: (StackOverflow)

Python Comparing two lists of strings for similarities

I'm very new to Python, but I thought it would be fun to write a program to sort all my downloads, and I'm having a little trouble with it. It works perfectly if my destination has only one word in it, but if the destination has two or more words it goes wrong and the program gets stuck in a loop. Does anybody have a better idea than mine for comparing the lists?

>>>for i in dstdir:
>>>    print i.split()

['CALIFORNICATION']
['THAT', "'70S", 'SHOW']
['THE', 'BIG', 'BANG', 'THEORY']
['THE', 'OFFICE']
['DEXTER']
['SPAWN']
['SCRUBS']
['BETTER', 'OF', 'TED']

>>>for i in srcdir:
>>>    print i.split()
['Brooklyn.Nine-Nine.S01E16.REAL.HDTV.x264-EXCELLENCE.mp4']
['Revolution', '2012', 'S02E12', 'HDTV', 'x264-LOL[ettv]']
['Inequality', 'for', 'All', '(2013)', '[1080p]']

This is an example of the lists output.

I have a destination directory with only folders in it and a download directory. I want the program to look at the source file name and then at the destination name automatically. If the destination name is in the source name, then I have the go-ahead to copy the downloaded file so it is sorted into my collection.

destination = '/media/mediacenter/SAMSUNG/SERIES/'
source = '/home/mediacenter/Downloads/'
dstdir = os.listdir(destination)
srcdir = os.listdir(source)

for i in srcdir:
    source = list(i.split())
    for j in dstdir:
        count = 0
        succes = 0
        destination = list(j.split())
        if len(destination) == 1:
            while (count < len(source)):
                if destination[0].upper() == source[count].upper():
                    print 'succes ', destination, ' ', source
                count = count + 1
        elif len(destination) == 2:
            while(count < len(source)):
                if (destination[0].upper() == source[count].upper()):
                    succes = succes + 1
                    count = len(source)
            count = 0
            while(count < len(source)):
                if (destination[1].upper() == source[count].upper()):
                    succes = succes + 1
                    count = len(source)
            count = 0
            if succes == 2:
                print 'succes ', destination, ' ', source

For now I'm happy with only "success" as an output; I will figure out how to copy files as it will be a totally different problem for me in the near future
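
In the two-word branch above, count only changes when a word matches, so a file name that does not contain the word leaves the while loop spinning forever. A set comparison avoids the counters and handles any number of words. The sketch below is Python 3 and makes one assumption not stated in the question: names are split on anything that is not a letter or digit, so dots and apostrophes are ignored. The function names are hypothetical.

```python
import re


def tokens(name):
    """Upper-case a name and split it into alphanumeric words."""
    return set(re.findall(r"[A-Z0-9]+", name.upper()))


def matching_folder(filename, folders):
    """Return the first folder whose words all appear in the file name, else None."""
    file_tokens = tokens(filename)
    for folder in folders:
        folder_tokens = tokens(folder)
        if folder_tokens and folder_tokens <= file_tokens:
            return folder
    return None


folders = ["DEXTER", "THE OFFICE", "THE BIG BANG THEORY", "THAT '70S SHOW"]
print(matching_folder("The.Big.Bang.Theory.S07E01.HDTV.x264.mp4", folders))
print(matching_folder("Inequality for All (2013) [1080p]", folders))
```

The first call finds THE BIG BANG THEORY and the second finds nothing, which is the "success" signal the loop was trying to produce.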


Source: (StackOverflow)

Python Pandas fuzzy merge/match with duplicates

I currently have 2 dataframes, 1 for donors and 1 for fundraisers. Ideally, I want to find out whether any fundraisers also gave donations and, if so, copy some of that information into my fundraiser data set (donor name, email and their first donation). The problems with my data are:

1) I need to match by name and email, but a user might have slightly different names (e.g. Kat and Kathy).
2) There are duplicate names for donors and fundraisers.
2a) With donors I can get unique name/email combinations, since I just care about the first donation date.
2b) With fundraisers, though, I need to keep both rows and not lose data like the date.

Sample code I have right now:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['a@a.ca','a@a.ca','b@b.ca','c@c.ca','something@a.ca','d@d.ca']),"Date": (["27/03/2013  10:00:00 AM","1/03/2013  10:39:00 AM","2/03/2013  10:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:39:00 AM","27/03/2013  10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['a@a.ca','a@a.ca','d@d.ca','asdf@asdf.ca','something@a.ca']),"Date": pd.Series(["2/03/2013  10:39:00 AM","27/03/2013  11:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:40:00 AM","27/03/2013  10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)

So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).

Ideally, I want my fundraiser dataset to look like:

Date                Email       name        Donor Name  Donor Email Donor Date
2013-03-27 10:00:00     a@a.ca      John Doe    John Doe    a@a.ca      2013-03-27 10:00:00 
2013-01-03 10:39:00     a@a.ca      John Doe    John Doe    a@a.ca      2013-03-27 10:00:00 
2013-02-03 10:39:00     d@d.ca      Kathy test  Kat test    d@d.ca      2013-03-27 10:39:00 
2013-03-03 10:39:00     asdf@asdf.ca    Tes Ester   
2013-04-03 10:39:00     something@a.ca  Jane Doe    Jane Doe    something@a.ca  2013-04-03 10:39:00

I tried following this thread: "is it possible to do fuzzy match merge with python pandas?", but I keep getting index out of range errors (I'm guessing it doesn't like the duplicated names in fundraisers) :( So, any ideas how I can match/merge these datasets?

Doing it with for loops works, but it is super slow and I feel there has to be a better way:

fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    max = 75
    for funindex in range(len(fundraisers.index)):
        aname = donors["name"][donindex]
        comp = fundraisers["name"][funindex]
        ratio = fuzz.ratio(aname, comp)
        if ratio > max:
            if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
                ratio *= 2
            max = ratio
            fundraisers["donor name"][funindex] = aname
            fundraisers["donor email"][funindex] = donors["Email"][donindex]
            fundraisers["donor date"][funindex] = donors["Date"][donindex]
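
One way to cut the work in the nested loops is to compare names only between rows that already share an email, and to look up each fundraiser row independently so duplicates are kept. The sketch below is a simplification under stated assumptions: it uses plain dictionaries instead of dataframes, it requires the email to match exactly (the loops above only use the email as a bonus), and difflib's SequenceMatcher stands in for fuzz.ratio, so scores may differ from fuzzywuzzy's. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """0-100 similarity score; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_donor(fundraiser, donors, threshold=75):
    """Return the best-scoring donor with the same email, or None."""
    best, best_score = None, threshold
    for donor in donors:
        if donor["Email"] != fundraiser["Email"]:
            continue  # assumption: only rows sharing an email are candidates
        score = ratio(donor["name"], fundraiser["name"])
        if score > best_score:
            best, best_score = donor, score
    return best


donors = [
    {"name": "John Doe", "Email": "a@a.ca"},
    {"name": "Kat test", "Email": "d@d.ca"},
]
print(match_donor({"name": "Kathy test", "Email": "d@d.ca"}, donors))
print(match_donor({"name": "Tes Ester", "Email": "asdf@asdf.ca"}, donors))
```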

Source: (StackOverflow)

Inspect a large dataframe for errors arising during merge/combine in python

I hope this is an appropriate question for here. If not, let me know, and I will remove it immediately.

Question:

How can I use python to inspect (visually?) a large dataset for errors that arise during combination?

Background:

I am working with several large (but not, you know "Big") datasets that I combine to form one larger dataset. This new set is ~2.5G in size, so it does not fit in most spreadsheet programs, or at least not in the ones I've tried (MS Excel, OpenOffice).

The process to create the final dataset uses fuzzy matching (via fuzzywuzzy), and I want to inspect the results of the matching to see if there are any errors introduced.

As of now, I have tried importing the entire set into a pandas dataframe. This DF has 64 columns, so when I simply do something like df.head() the resulting displayed info obviously does not show all the columns; I thus ruled out just iterating through multiple .head() calls.

There is a similar question about visualizing specific aspects of a dataframe here. My question is different, I think, because I don't need to visualize anything about the underlying structure or types. I just want to visually inspect areas I suspect might have errors.
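
One low-tech way to inspect a file that is too large for a spreadsheet is to draw a random sample of rows while streaming and print each sampled record vertically, one field per line, so that all 64 columns stay readable. The sketch below uses only the standard library; the function names and the tiny in-memory CSV are illustrative assumptions, not part of the original workflow.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def show_vertical(record):
    """Print one record as 'field: value' lines."""
    for field, value in record.items():
        print("%-12s %s" % (field + ":", value))
    print("-" * 30)


# Stand-in for the real 2.5 GB file.
fake_file = io.StringIO("name,matched_name,score\nAcme Ltd,ACME LIMITED,91\nFoo,Bar,12\n")
for record in reservoir_sample(csv.DictReader(fake_file), k=2, seed=0):
    show_vertical(record)
```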


Source: (StackOverflow)

python module local installation - How to access methods using zipimport

I am trying to implement a string matching program on Hadoop using fuzzywuzzy. I understand from this link that fuzzywuzzy needs to be installed on the Hadoop nodes. So I was testing this on my local machine without installing fuzzywuzzy into my Python.

The procedure I followed is:

1. pip installed fuzzywuzzy to a different directory.
2. Zipped all .pyc files created in that directory into "fuzzywuzzy.zip".
3. Used the code below to load the fuzzywuzzy package.

import zipfile 
zf = zipfile.PyZipFile('fuzzywuzzy.zip')
for name in zf.namelist():
   print name

import zipimport
import sys
importer = zipimport.zipimporter('fuzzywuzzy.zip')
print importer.find_module('fuzzywuzzy')
fw=importer.load_module('fuzzywuzzy')

Output shows that

 fuzzywuzzy/
 fuzzywuzzy/.DS_Store
 fuzzywuzzy/__init__.pyc
 fuzzywuzzy/fuzz.pyc
 fuzzywuzzy/process.pyc
 fuzzywuzzy/string_processing.pyc
 fuzzywuzzy/StringMatcher.pyc
 fuzzywuzzy/utils.pyc
 <zipimporter object "fuzzywuzzy.zip">

However, when I add below line to above code:

print fw.fuzz.ratio("this is a test", "this is a test!")

I get below error:

AttributeError: 'module' object has no attribute 'fuzz'

If I had installed fuzzywuzzy on my python, I could have used below code to call methods in fuzz,process, utils

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from fuzzywuzzy import utils
from fuzzywuzzy.string_processing import StringProcessor

My question is, how can I access methods in fuzz, process, utils from zipimport?
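
Loading a package does not automatically import its submodules, which would explain why fw.fuzz is missing after load_module('fuzzywuzzy'). Putting the zip file on sys.path and then using ordinary import statements usually sidesteps this. The sketch below demonstrates the idea with a made-up package (demo_zip_pkg and its toy ratio function are assumptions for illustration), written for Python 3.

```python
import os
import sys
import tempfile
import zipfile

# Build a small zip containing a hypothetical package with one submodule.
zip_path = os.path.join(tempfile.mkdtemp(), "demo_bundle.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_zip_pkg/__init__.py", "")
    zf.writestr(
        "demo_zip_pkg/fuzz.py",
        "def ratio(a, b):\n    return 100 if a == b else 0\n",
    )

# With the archive on sys.path, the normal import machinery reads from it.
sys.path.insert(0, zip_path)
from demo_zip_pkg import fuzz

print(fuzz.ratio("this is a test", "this is a test"))  # 100
```

With the real archive, the equivalent would be inserting 'fuzzywuzzy.zip' into sys.path and then running the four import lines shown in the question.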


Source: (StackOverflow)

Python module returning errors in bash but not from IDLE

I'm a newbie programmer posting here for the first time. Any suggestions or advice would be greatly appreciated! I am working on a project that compares the contents of, say, test.csv to ref.csv (both single columns containing strings of 3-4 words) and assigns a score to each string from test.csv based on its similarity to the most similar string in ref.csv. I am using the fuzzywuzzy string matching module to assign the similarity score.

The following code snippet takes the two input files, converts them into arrays, and prints out the arrays:

import csv

# Load text matching module
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


# Import reference and test lists and assign to variables
ref_doc = csv.reader(open('ref.csv', 'rb'), delimiter=",", quotechar='|')
test_doc = csv.reader(open('test.csv', 'rb'), delimiter=",", quotechar='|')

# Define the arrays into which to load these lists
ref = []
test = []

# Assign the reference and test docs to arrays
for row in ref_doc:
    ref.append(row)

for row in test_doc:
    test.append(row)

# Print the arrays to make sure this all worked properly
# before we proceed to run matching operations on the arrays
print ref, "\n\n\n", test

The problem is that this script works as expected when I run it in IDLE, but returns the following error when I call it from bash:

['one', 'two']
Traceback (most recent call last):
  File "csvimport_orig.py", line 4, in <module>
    from fuzzywuzzy import fuzz
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/fuzzywuzzy/fuzz.py", line 32, in <module>
    import utils
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 6, in <module>
    table_from=string.punctuation+string.ascii_uppercase
AttributeError: 'module' object has no attribute 'punctuation'

Is there something I need to configure in bash for this to work properly? Or is there something fundamentally wrong that IDLE is not catching? For simplicity's sake, I don't call the fuzzywuzzy module in this snippet, but it works as expected in IDLE.

Eventually, I'd like to use pylevenshtein but am trying to see if my use for this script has value before I put the extra time in making that work.

Thanks in advance.
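
One thing worth checking, as a guess rather than a diagnosis: the ['one', 'two'] line printed before the traceback does not come from the posted script, which suggests that some other file is being executed on import. A file named string.py in the directory bash runs from would shadow the standard library module and would lack string.punctuation. A quick way to see which file a module was loaded from is sketched below (Python 3 syntax; module_origin is a hypothetical helper name).

```python
import importlib


def module_origin(name):
    """Return the file a module was loaded from, or 'built-in' if it has none."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None) or "built-in"


# A path outside the Python installation here would point to a shadowing file.
print(module_origin("string"))
print(module_origin("csv"))
```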


Source: (StackOverflow)

Fuzzy compare two column

I have a CSV file with search terms (numbers and text) that I would like to compare against a list of other terms (numbers and text) to determine if there are any matches or potential matches. Then I would like to have all results written to a new CSV for manual review. I am using the fuzzywuzzy plug-in to create a 'score' to determine how close of a match there is between the terms. Ideally, I would be able to filter on the ratio.

My current code compares the file rows one to one instead of comparing one row in the first file to all the rows in the second, which is what I need.

How do I perform a fuzzy lookup for each row in file1 against all the rows in file2?

from fuzzywuzzy import fuzz
import csv
from itertools import zip_longest

f = open('FuzzyMatch2.csv', 'wt')
writer = csv.writer(f, lineterminator = '\n')


file1_loc = 'LookUp.csv'
file2_loc = 'Prod.csv'

file1 = csv.DictReader(open(file1_loc, 'r'), delimiter=',', quotechar='"')
file2 = csv.DictReader(open(file2_loc, 'r'), delimiter=',', quotechar='"')

for row in file1:
    for l1, l2 in zip_longest(file1, file2):
        if all((l1, l2)):
            partial_ratio = fuzz.token_sort_ratio(str(l1['SearchTerm']), str(l2['Description']))       

        a = [l1,l2,partial_ratio]
        writer.writerow(a)

f.close()

Below is a much cleaner version of the above code, but it still has issues. The code gives an error

IndexError: list index out of range

Any idea how to get the list within range and the code working?

from fuzzywuzzy import process
import csv

save_file = open('FuzzyResults.csv', 'wt')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):
    with open(path,'r') as f:
        for row in f:
            row = row.split()
            yield row


if __name__ == "__main__":
    ## Create lookup dictionary by parsing the products csv
    data = {}
    for row in parse_csv('Prod.csv'):
        data[row[0]] = row[1]

    ## For each row in the lookup compute the partial ratio
    for row in parse_csv("LookUp.csv"):

        for found, score in process.extract(row, data, limit=100):
            if score >= 10:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [score, row, found]
                writer.writerow(Digi_Results)


save_file.close()
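
Two observations may help here. First, csv.DictReader objects are iterators: once a loop has consumed them there is nothing left for a second pass, so the second file needs to be loaded into a list before the nested loop. Second, in the shorter version row.split() splits on whitespace rather than commas, which is probably why row[1] is out of range. The sketch below shows the one-to-many loop in Python 3 using in-memory files and difflib as a stand-in scorer, so the numbers will not match fuzz.token_sort_ratio; the sample rows and function names are invented, while the column names come from the question.

```python
import csv
import io
from difflib import SequenceMatcher


def score(a, b):
    """0-100 similarity; a stdlib stand-in for a fuzzywuzzy scorer."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def cross_match(lookup_file, prod_file, min_score=0):
    """Compare every SearchTerm against every Description."""
    products = list(csv.DictReader(prod_file))  # materialise once, reuse per row
    matches = []
    for l1 in csv.DictReader(lookup_file):
        for l2 in products:
            s = score(l1["SearchTerm"], l2["Description"])
            if s >= min_score:
                matches.append((l1["SearchTerm"], l2["Description"], s))
    return matches


lookup_csv = io.StringIO("SearchTerm\nred widget\nblue gadget\n")
prod_csv = io.StringIO("Description\nred widget\nlarge blue gadget\n")
results = cross_match(lookup_csv, prod_csv)
for row in results:
    print(row)
```

Each (term, description, score) tuple can be passed to writer.writerow, and min_score gives the filter on the ratio.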

Source: (StackOverflow)

How can I improve performance on my apply() with fuzzy matching statement

I've written a function called muzz that leverages the fuzzywuzzy module to 'merge' two pandas dataframes. Works great, but the performance is pretty bad on larger frames. Please take a look at my apply() that does the extracting/scoring and let me know if you have any ideas that could speed it up.

import pandas as pd
import numpy as np
import fuzzywuzzy as fw

Create a frame of raw data

dfRaw = pd.DataFrame({'City': {0: u'St Louis',
                      1: 'Omaha',
                      2: 'Chicogo',
                      3: 'Kansas  city',
                      4: 'Des Moine'},
                      'State' : {0: 'MO', 1: 'NE', 2 : 'IL', 3 : 'MO', 4 : 'IA'}})

Which yields

City    State
0   St Louis    MO
1   Omaha   NE
2   Chicogo IL
3   Kansas city MO
4   Des Moine   IA

Then a frame that represents the good data that we want to look up

dfLocations = pd.DataFrame({'City': {0: 'Saint Louis',
                          1: u'Omaha',
                          2: u'Chicago',
                          3: u'Kansas City',
                          4: u'Des Moines'},
                         'State' : {0: 'MO', 1: 'NE', 2 : 'IL', 
                                   3 : 'KS', 4 : 'IA'},
                          u'Zip': {0: '63201', 1: '68104', 2: '60290', 
                                   3: '68101', 4: '50301'}})

Which yields

    City    State   Zip
0   Saint Louis MO  63201
1   Omaha   NE  68104
2   Chicago IL  60290
3   Kansas City KS  68101
4   Des Moines  IA  50301

And now the muzz function. EDIT: Added the choices = right[match_col_name] line and used choices in the apply, per Brenbarn's suggestion. Also per Brenbarn's suggestion, I ran some tests with extractOne() without the apply, and it appears to be the bottleneck. Maybe there's a faster way to do the fuzzy matching?

def muzz(left, right, on, match_col_name='match_on', score_col_name='score_match',
         right_suffix='_match', score_cutoff=80):

    right[match_col_name] = np.sum(right[on], axis=1)
    choices = right[match_col_name]

    ### The offending statement ###
    # Parentheses let the assignment span several lines
    left[[match_col_name, score_col_name]] = (
        pd.Series(np.sum(left[on], axis=1)).apply(lambda x: pd.Series(
            fw.process.extractOne(x, choices, score_cutoff=score_cutoff))))

    dfTemp = pd.merge(left, right, how='left', on=match_col_name, suffixes=('', right_suffix))
    return dfTemp.drop(match_col_name, axis=1)

Calling muzz

muzz(dfRaw.copy(),dfLocations,on=['City','State'], score_cutoff=85)

Which yields

    City        State   score_match City_match  State_match Zip
0   St Louis    MO      87          Saint Louis MO          63201
1   Omaha       NE      100         Omaha       NE          68104
2   Chicogo     IL      89          Chicago     IL          60290
3   Kansas city MO      NaN         NaN         NaN         NaN
4   Des Moine   IA      96          Des Moines  IA          50301
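
Since extractOne is the bottleneck, one cheap saving is to score each distinct input only once and reuse the result for repeated rows. The sketch below shows that idea with difflib.get_close_matches as a stand-in matcher, so the cutoff is on a 0 to 1 scale instead of fuzzywuzzy's 0 to 100; match_unique is a hypothetical name. How much this helps depends on how many duplicates the real data contains. Installing python-Levenshtein is another thing to try, since fuzzywuzzy falls back to the slower pure-Python SequenceMatcher without it.

```python
from difflib import get_close_matches


def match_unique(values, choices, cutoff=0.8):
    """Match each distinct value once, then map the results back onto every row."""
    cache = {}
    for value in set(values):
        hits = get_close_matches(value, choices, n=1, cutoff=cutoff)
        cache[value] = hits[0] if hits else None
    return [cache[value] for value in values]


cities = ["Chicogo", "Omaha", "Chicogo", "Des Moine"]
good = ["Saint Louis", "Omaha", "Chicago", "Kansas City", "Des Moines"]
print(match_unique(cities, good))
```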

Source: (StackOverflow)

Program fails on AWS EMR with hadoop (OK on local machine)

I am trying to use Python's fuzzywuzzy package in a mapper program for computing edit distance. My program runs fine on my local machine but fails on an AWS EMR cluster. I tried the two approaches below (on both the local machine and the AWS EMR cluster):

1. By installing fuzzywuzzy:

I installed fuzzywuzzy using pip on both the master and slave nodes. If I comment out the last 4 lines of the code below, I do not get any error. But I want to use fuzzywuzzy in my program.

#!/usr/bin/python
import re
import sys
import os
import csv

desc_dict = {}
with open('Keys.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
            query_set = row

for line in sys.stdin:
  line = line.strip() 
  row = line.split(',')
  if(len(row)>2):
      desc_dict[(int(row[0]), row[1])] = (row[2].lower()).encode('utf-8')
from fuzzywuzzy import *
import fuzzywuzzy.fuzz
import fuzzywuzzy.utils
print fuzzywuzzy.fuzz.partial_ratio("this is a test", "this is a test!")

I get below error:

 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

2. Without installing fuzzywuzzy

I could run above map-reduce program without installing fuzzywuzzy on local machine. When I tried the same on AWS EMR it failed.

I zipped fuzzywuzzy package ("temp.zip") and called it in my map program. I copied temp.zip file to slave nodes also.

#!/usr/bin/python
import re
import sys
import os
import csv

desc_dict = {}
with open('Keys.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
            query_set = row

for line in sys.stdin:
  line = line.strip() 
  row = line.split(',')
  if(len(row)>2):
      desc_dict[(int(row[0]), row[1])] = (row[2].lower()).encode('utf-8')

sys.path.insert(0,'temp.zip')
from fuzzywuzzy import *
import fuzzywuzzy.fuzz
print fuzzywuzzy.fuzz.partial_ratio("this is a test", "this is a test!")

I get below error:

 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Can someone tell me what is wrong with my code, or how to run fuzzywuzzy on Hadoop?
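
"subprocess failed with code 1" only says that the mapper script exited with an error, not which one. A first step is to make the script report the real exception on stderr, which Hadoop streaming typically keeps in the task logs. The sketch below wraps an import in that kind of reporting; safe_import is a hypothetical helper, written so that it runs on Python 3 as well as Python 2.7.

```python
import sys
import traceback


def safe_import(name):
    """Import a module by name; on failure, log details to stderr and return None."""
    try:
        return __import__(name)
    except ImportError:
        traceback.print_exc(file=sys.stderr)
        sys.stderr.write("interpreter: %s\n" % sys.executable)
        sys.stderr.write("sys.path: %r\n" % (sys.path,))
        return None


module = safe_import("json")              # present everywhere, so this succeeds
missing = safe_import("no_such_module_xyz_123")  # logs the failure, returns None
```

In the mapper, a None result for "fuzzywuzzy" would show whether the nodes are running a different interpreter or cannot see the package or the zip file.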


Source: (StackOverflow)

python highest fuzzy ratio to print line from list

I have a list consisting of some lines. I want to print the line that matches the word 'good' with the highest fuzzy ratio.

Problem: It's only printing a word instead of a line from the list

Coding:

from fuzzywuzzy import fuzz
c = ['I am python', 'python is good', 'Everyone are humans']
print(max(c, key=lambda a: fuzz.ratio(a, 'good')))

Expected Output:

python is good

I get a single word instead of the line with the highest fuzzy ratio from the list. Please help me fix my code! Answers will be appreciated!
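
max() with a key function returns one element of whatever it iterates over, so the code as posted should return a whole line when c is a list of lines. A single word would come back if the real data had been split into words first. The sketch below shows both cases in Python 3, with difflib standing in for fuzz.ratio (an assumption made so it runs without fuzzywuzzy; the scores may differ).

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """0-100 similarity; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


c = ['I am python', 'python is good', 'Everyone are humans']

# Iterating over a list of lines yields lines.
best_line = max(c, key=lambda line: ratio(line, 'good'))
print(best_line)  # python is good

# Iterating over the words of one line yields words.
best_word = max('python is good'.split(), key=lambda word: ratio(word, 'good'))
print(best_word)  # good
```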


Source: (StackOverflow)

Fuzzywuzzy import error weirdness

I have installed fuzzywuzzy via pip install into a virtual environment [fuzzywuzzy==0.3.1].

In the python interpreter (via ipython) I do the following

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

a = "my name is lena"
b = "my name is Elena"

fuzz.ratio(a,b)

Which works fine and gives me a result.

Next, I write the following into a file (using Sublime Text):

#!/Users/InNov8/Projects/datamine/denv/bin/python
# -*- coding: utf-8 -*-

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

'''
Fuzzy Logic Test
'''
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

a = "my name is lena"
b = "my name is Elena"

print fuzz.ratio(a,b)

When I run this in terminal I get the following error:

  File "/Users/InNov8/Projects/datamine/_MiningScripts/fuzz-test2.py", line 4, in <module>
    from fuzzywuzzy import fuzz
ImportError: No module named fuzzywuzzy

Is there any reason why a module does import successfully into the interpreter, but wouldn't import when executed from a script?

I am using the same version of python in both, i.e, via the virtualenv

#!/Users/InNov8/Projects/datamine/denv/bin/python

Thanks for any advice!
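
When the same import works in one place and fails in another, the two are usually running different interpreters, whatever the shebang line says (running the file as "python script.py" ignores the shebang, for instance). A small diagnostic run in both places makes the difference visible. This sketch uses Python 3's importlib.util.find_spec; describe_environment is a hypothetical name, and "json" is used only so the example runs anywhere.

```python
import importlib.util
import sys


def describe_environment(module_name):
    """Report which interpreter is running and where it would load a module from."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "module_location": spec.origin if spec else None,
    }


# Replace "json" with "fuzzywuzzy" and compare the output from ipython and the script.
print(describe_environment("json"))
```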


Source: (StackOverflow)

Python missing module v 2.7.3 and Windows 7: Installed fuzzywuzzy, imports in powershell, not in IDLE

I'm betting there's a simple solution to this problem that I don't know, and from googling and stackoverflowing around it seems to have something to do with setting a path.

I have anaconda installed on my computer and it seems to use python 2.7.4. I also have python 2.7.3 installed, which seems to be the version being used when I open up IDLE. When I installed fuzzywuzzy using 'python setup.py install' it's installed in the anaconda folder and using python in powershell, the command 'from fuzzywuzzy import fuzz' works fine, but when doing the same thing in IDLE I get a missing module error.

Is there a way to reconcile the two versions of Python? Can I get them to share packages, or delete one of the versions without ruining everything?

I tried doing this:

''' Setting the PYTHONPATH / PYTHONHOME variables

Right click the Computer icon in the start menu, go to properties. On the left tab, go to Advanced system settings. In the window that comes up, go to the Advanced tab, then at the bottom click Environment Variables. Click in the list of user variables and start typing Python, and repeat for System variables, just to make certain that you don't have mis-set variables for PYTHONPATH or PYTHONHOME. Next, add new variables (I did in System rather than User, although it may work for User too): PYTHONPATH, set to C:\Python27\Lib. PYTHONHOME, set to C:\Python27. '''

then reinstalled fuzzywuzzy, and it installed in the C:\Python27 folder and works in IDLE, but now Kivy doesn't work!

Do I need to reinstall that too? Or is there a Path sharing fix?


Source: (StackOverflow)

FuzzyWuzzy String Matching - Case Sensitivity

I'm using the FuzzyWuzzy String Matching module from SeatGeek.

I find that when using the token_set_ratio search algorithm, small differences in case give wildly differing results.

For example, if I am looking for the phrase "I am eating" in a file, I get a 100% match. But if the phrase is "i am eating", a change in the case of just ONE letter, I get a 65% match.

Is there any way to make the algorithm case insensitive?
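
A scorer-independent way to get case-insensitive results is to normalise both strings to the same case before scoring. The sketch below shows the effect with difflib as a stand-in scorer, so the numbers are not those of token_set_ratio; the same lower() calls can be applied to the arguments of any fuzzywuzzy scorer. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """0-100 similarity; a stdlib stand-in for a fuzzywuzzy scorer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def ci_ratio(a, b):
    """Case-insensitive variant: lower-case both inputs first."""
    return ratio(a.lower(), b.lower())


print(ratio("I am eating", "i am eating"))     # below 100: the case differs
print(ci_ratio("I am eating", "i am eating"))  # 100
```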


Source: (StackOverflow)