EzDevInfo.com

chardet

Python 2/3 compatible character encoding detector.

juniversalchardet is defective on www.wikipedia.org

I'm trying to use juniversalchardet to auto-detect the encoding of a saved webpage. My first test uses www.wikipedia.org, which uses UTF-8 encoding according to the HTTP response header (information that is lost once the page is saved to disk).

This is my Scala code for doing so:

    val content = <...load Wikipedia.html from disk...>
    val charsetD = new UniversalDetector(null)
    charsetD.handleData(content, 0, content.length)
    val charset = charsetD.getDetectedCharset

However, regardless of what I load, the detected charset is always null. Is that because the juniversalchardet library is defective, or am I using it wrong?
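
For comparison, the Python chardet library that this page covers uses the same incremental pattern, and its detector expects close() to be called before the result is read; a minimal sketch, assuming the page has been saved locally as Wikipedia.html:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open('Wikipedia.html', 'rb') as f:  # hypothetical path to the saved page
    for chunk in iter(lambda: f.read(4096), b''):
        detector.feed(chunk)
        if detector.done:
            break
detector.close()        # the result is only reliable after close()
print(detector.result)  # e.g. {'encoding': 'utf-8', 'confidence': ...}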


Source: (StackOverflow)

How do I encode files to UTF-8 for Rails 3?

I've been working on Outlook imports (LinkedIn exports to Outlook format), but I'm having trouble with encoding. The Outlook-format CSV I get from exporting my LinkedIn contacts is not in UTF-8. Letters like ñ cause an exception in the mongoid_search gem when calling str.to_s.mb_chars.normalize; I think encoding is the issue, because of what happens when I call mb_chars. I am not sure if this is a bug in the gem, but I was advised to sanitize the data nonetheless.

With File Picker, I tried using their new, community-supported gem to upload the CSV data. I tried three encoding detectors and transcoders:

  1. Ruby port of the Python lib chardet
    • Didn't work as expected
    • The port still contained Python code, preventing it from running in my app
  2. rchardet19 gem
    • Detected ISO-8859 with 0.8/1 confidence
    • Tried to transcode with Iconv, but it crashed on "illegal characters" at ñ
  3. charlock_holmes gem
    • Detected Windows-1252 with 33/100 confidence
    • I assume that's the actual encoding, and rchardet reported ISO-8859 because Windows-1252 is based on it
    • This gem uses ICU and has a maintained branch, "bundle-icu", which supports Heroku. When I try to transcode using charlock, I get the error U_FILE_ACCESS_ERROR, an ICU error code meaning "could not open file".

Anybody know what to do here?
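
For reference, the detect-then-transcode step being attempted above looks roughly like the following with the Python chardet library this page covers; a minimal sketch, where the file names and the windows-1252 fallback are assumptions:

import chardet

with open('linkedin_contacts.csv', 'rb') as f:  # hypothetical export file
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1252', 'confidence': ...}
text = raw.decode(guess['encoding'] or 'windows-1252', errors='replace')

with open('linkedin_contacts_utf8.csv', 'w', encoding='utf-8') as f:
    f.write(text)  # same data, re-saved as UTF-8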


Source: (StackOverflow)


Italian detected as iso-8859-2

I am using chardet to detect the encoding of text files, including Italian ones. The problem is that it consistently detects their encoding as iso-8859-2, while the correct detection would be iso-8859-1. Does anybody know a fix? My local language is set to Polish; could that influence the detection?
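
One workaround (a sketch of a general technique, not a chardet option) is to post-process the detection for a corpus that is known to be Western European, mapping the statistically similar Central European guesses back to iso-8859-1:

import chardet

# Assumption for this sketch: the corpus is Western European, so an
# ISO-8859-2 or windows-1250 guess is treated as the confusable iso-8859-1.
CONFUSABLE = {'iso-8859-2', 'windows-1250'}

def detect_for_italian(data):
    guess = chardet.detect(data)
    encoding = (guess['encoding'] or 'iso-8859-1').lower()
    return 'iso-8859-1' if encoding in CONFUSABLE else encoding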


Source: (StackOverflow)

failure in installing chardet

I downloaded the chardet module, placed it in d:\, and want to install it into Python, so I used this command:

c:\Python27\python.exe d:\chardet\setup.py

The Windows command prompt says:

Traceback (most recent call last):
  File "d:\chardet\setup.py", line 13, in <module>
    long_description=open('README.rst').read(),
IOError: [Errno 2] No such file or directory: 'README.rst'

But I am sure that the file 'README.rst' is in the directory d:\chardet.

I don't know how to deal with this, and I am hoping for your help.
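
Judging from the traceback, setup.py opens README.rst with a relative path, so Python looks for it in the current working directory (c:\) rather than in d:\chardet. The usual fix, sketched here with the paths from the question, is to change into that directory first and run setup.py with the install command:

d:
cd \chardet
c:\Python27\python.exe setup.py install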


Source: (StackOverflow)

rchardet gem support for ISO-8859-1 and Windows-1252

I would like to know whether rchardet supports the ISO-8859-1 and Windows-1252 encodings. I have looked at the documentation, but I didn't find proper information on this.


Source: (StackOverflow)

Python: fix a broken encoding

I have a small icecast2 home server with Django playlist management. I also have a lot of MP3s with broken tag encodings. First, I tried to find an encoding repair tool for Python, but I haven't found anything that works for me (python-ftfy, nltk; they do not support Unicode input).

I use the beets package as a Swiss Army knife for parsing media tags; it's quite simple and, I think, almost enough for most cases.

For character set detection I use chardet, but it has some issues with short strings, so I apply some coercion tweaks for the encodings it reports. I presume that if the encoding is wrong, it is wrong in all tags, so I collect all detected encodings first.

class MostFrequentEncoding(dict):
    def from_attrs(self, obj):
        for attr in dir(obj):
            val = getattr(obj, attr)
            self.feed(val)

    def feed(self, obj):
        if obj and isinstance(obj, basestring):
            guess = chardet.detect(obj)
            encoding = guess['encoding']

            if encoding not in self:
                self.setdefault(encoding, {'confidence': 0.0, 'total': 0})

            self[encoding]['confidence'] += guess['confidence']
            self[encoding]['total'] += 1

    def encodings(self):
        return sorted(self, key=lambda x: self[x]['total'], reverse=True)

Here are the tweaks:

charset_coercing = {
    ('MacCyrillic', 'windows-1251'): {'MacCyrillic': -0.1},
}

That means that if MacCyrillic and windows-1251 are both candidates at the same time, we should prefer windows-1251.

def fix_encoding(src, possible_encodings):
    if not isinstance(src, basestring) or not src:
        return src

    guess = chardet.detect(src)
    first_encoding = guess['encoding']

    encodings = list(possible_encodings)        # copy possible encodings
    if first_encoding in encodings:             # we believe chardet, so first tested
        encodings.remove(first_encoding)        # encoding will be the one, detected by chardet
    encodings.insert(0, first_encoding)
    encodings_set = set(encodings)

    tested_encodings = { k:{'string': '', 'confidence': -1.0} for k in encodings }

    try:
        lat = src.encode('latin-1') if isinstance(src, unicode) else src # make latin string
    except UnicodeEncodeError:
        lat = src.encode('utf-8') # may be not necessary, should return src?

    while encodings:
        candidate = encodings.pop(0)
        if not candidate:
            continue

        if candidate not in tested_encodings:
            tested_encodings.setdefault(candidate, {'string': '', 'confidence': -1.0})

        try:
            fixed_string = lat.decode(candidate)
        except UnicodeDecodeError:
            continue

        # try to detect charset again
        fixed_confidence = chardet.detect(fixed_string)['confidence']
        # it seems, that new confidence is usually higher, if the previous detection was right

        tested_encodings[candidate]['string'] = fixed_string
        tested_encodings[candidate]['confidence'] = fixed_confidence

    # perform charset coercing
    for subset, coercing_encodings in charset_coercing.items():
        if set(subset).issubset(encodings_set):
            for enc, penalty in coercing_encodings.items():
                tested_encodings[enc]['confidence'] += penalty


    result = tested_encodings.get(first_encoding)
    if result['confidence'] >= 0.99: # if confidence value for first detection is high, use it
        return result['string']

    max_confidence_charset = max(tested_encodings, key=lambda x: tested_encodings[x]['confidence'])
    return tested_encodings[max_confidence_charset]['string']

Media file parsing:

def extract_tags(media_file):
    try:
        mf = MediaFile(media_file)
    except:
        return {}

    mfe = MostFrequentEncoding()
    mfe.from_attrs(mf)

    encodings = mfe.encodings()
    tags = {}

    for attr in sorted(dir(mf)):
        val = getattr(mf, attr)
        if not val or callable(val) or \
        attr in ['__dict__', '__doc__', '__module__', '__weakref__', 'mgfile', 'art']:
            continue

        fixed = fix_encoding(val, encodings)
        tags[attr] = remove_extra_spaces(fixed) if isinstance(fixed, basestring) else fixed

    if mf.art:
        tags['art'] = { 'data': mf.art, 'mime': imghdr.what(None, h=mf.art) }

    return tags

And an example of its usage:

f = '/media/Media/Music/Jason Becker/Jason Becker - Perpetual Burn/02__1.mp3'
pprint(extract_tags(f))

Here is the full script. It can show ASCII-art album covers during parsing.

It seems to work, but is there any maintained Swiss-Army-knife encoding repair library for Python?


Source: (StackOverflow)

Chardet detects no encoding

I want to get some data from a website with the following URL: http://www.iex.nl/Ajax/ChartData/interday.ashx?id=360113249&callback=ChartData

I think the data is JSON. Going to the URL in my browser, I can read the data.

In Python I have the following code:

import urllib
import re
import json
import chardet
url = "http://www.iex.nl/Ajax/ChartData/interday.ashx?id=360113249&callback=ChartData"
htmlfile = urllib.urlopen(url).read()
chardet.detect(htmlfile)

This gives the following output: {'confidence': 0.0, 'encoding': None}

When I print htmlfile, it looks like it is UTF-8.

What could be the reason for this output of chardet?
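
One way to narrow this down (a diagnostic sketch, not an explanation specific to this URL) is to look at the first raw bytes and check whether they decode as UTF-8 at all before asking chardet:

import urllib
import chardet

url = "http://www.iex.nl/Ajax/ChartData/interday.ashx?id=360113249&callback=ChartData"
raw = urllib.urlopen(url).read()

print(repr(raw[:60]))  # inspect the leading bytes directly
try:
    raw.decode('utf-8')
    print('decodes cleanly as UTF-8')
except UnicodeDecodeError as e:
    print('not valid UTF-8: %s' % e)
print(chardet.detect(raw))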


Source: (StackOverflow)

chardet runs incorrectly in Python 3


I am using chardet 2.0.1 in Python 3.2; the source code is the one from this site: http://getpython3.com/diveintopython3/case-study-porting-chardet-to-python-3.html

It can be downloaded here:
http://jaist.dl.sourceforge.net/project/cygwin-ports/release-2/Python/python3-chardet/python3-chardet-2.0.1-2.tar.bz2

I use lxml to parse the HTML to get some strings, and use the code below to detect the encoding:

chardet.detect(name)

But an error occurs:

Traceback (most recent call last):
  File "C:\python\test.py", line 125, in <module>
    print(chardet.detect(str(name)))
  File "E:\Python32\lib\site-packages\chardet\__init__.py", line 24, in detect
    u.feed(aBuf)
  File "E:\Python32\lib\site-packages\chardet\universaldetector.py", line 98, in feed
    if self._highBitDetector.search(aBuf):
TypeError: can't use a bytes pattern on a string-like object

name is a string object. Converting the string to bytes means encoding it with an encoding such as 'utf-8' or 'big5', and then chardet would detect the encoding I chose, not the original string's encoding. I have no idea how to deal with this problem...
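
chardet.detect expects bytes, so the usual pattern in Python 3 is to run detection on the raw HTML before lxml decodes it into str, rather than on a string extracted afterwards; a minimal sketch (the URL and the title lookup are placeholders):

import urllib.request
import chardet
import lxml.html

raw = urllib.request.urlopen('http://example.com/').read()  # bytes
print(chardet.detect(raw))  # detect on the original bytes

tree = lxml.html.fromstring(raw)   # lxml decodes the bytes while parsing
name = tree.findtext('.//title')   # already str; nothing left to detect here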


Source: (StackOverflow)

Error while parsing a page with BeautifulSoup4, Chardet and Python 3.3 in Windows

I get the following error when I try to call BeautifulSoup(page):

Traceback (most recent call last):
  File "error.py", line 10, in <module>
    soup = BeautifulSoup(page)
  File "C:\Python33\lib\site-packages\bs4\__init__.py", line 169, in __init__
    self.builder.prepare_markup(markup, from_encoding))
  File "C:\Python33\lib\site-packages\bs4\builder\_htmlparser.py", line 136, in prepare_markup
    dammit = UnicodeDammit(markup, try_encodings, is_html=True)
  File "C:\Python33\lib\site-packages\bs4\dammit.py", line 223, in __init__
    u = self._convert_from(chardet_dammit(self.markup))
  File "C:\Python33\lib\site-packages\bs4\dammit.py", line 30, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "C:\Python33\lib\site-packages\chardet\__init__.py", line 21, in detect
    import universaldetector
ImportError: No module named 'universaldetector'

I am running Python 3.3 on Windows 7. I installed bs4 via setup.py after downloading the .tar.gz, installed pip, and then installed chardet with pip.exe install chardet. My chardet version is 2.2.1. bs4 works fine for other URLs.

Here's the code:

import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import chardet

url = "http://www.edgar-online.com/brand/yahoo/search/?cik=1400810"
page = urlopen(url).read()
#print(page)
soup = BeautifulSoup(page)

I look forward to your answers


Source: (StackOverflow)

In Python, how do I begin with the chardet module?

I would like to try some code that uses the chardet module. This is the code I found on the web:

import urllib2
import chardet

def fetch(url):
    try:
        result = urllib2.urlopen(url)
        rawdata = result.read()
        encoding = chardet.detect(rawdata)
        return rawdata.decode(encoding['encoding'])
    except urllib2.URLError, e:
        handleError(e)

But to try this code, I have to get the chardet module, and I have two choices: https://pypi.python.org/pypi/chardet#downloads

  • chardet-2.2.1-py2.py3-none-any.whl (md5) Python Wheel
  • chardet-2.2.1.tar.gz (md5) Python source

I chose the Python wheel and put the file in my Python27 directory, but it still does not work.

So my problems are: which type of chardet file should I download, and where should I put it so that Python does not print this error:

Traceback (most recent call last):
  File "C:/Python27/s7/test5.py", line 2, in <module>
    import chardet
ImportError: No module named chardet

Note: I'm on Python 2.7.

Thanks in advance for any help or suggestions! :D

EDIT 1: Sorry for being such a beginner, but in fact it is the Python source that must be chosen, in particular installing it with setup.py, but that does not work for me! I opened the Windows command prompt, changed to the path of the unzipped chardet-2.2.1, and then ran python setup.py install, but it does not work... :S

I think it's better to open a new question for that.


Source: (StackOverflow)

chardet apparently wrong on Big5

I'm decoding a large (about a gigabyte) flat-file database, which mixes character encodings willy-nilly. The Python module chardet has been doing a good job so far of identifying the encodings, but I've hit a stumbling block...

In [428]: badish[-3]
Out[428]: '\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)}  [Kaz\xc4\xb1m]\n'

In [429]: chardet.detect(badish[-3])
Out[429]: {'confidence': 0.98999999999999999, 'encoding': 'Big5'}

In [430]: unicode(badish[-3], 'Big5')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)

~/src/imdb/<ipython console> in <module>()

UnicodeDecodeError: 'big5' codec can't decode bytes in position 11-12: illegal multibyte sequence

chardet reports a very high confidence in its choice of encoding, but it doesn't decode... Are there any other sensible approaches?
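
One sensible fallback (a sketch of a general technique, nothing built into chardet) is to treat a guess that fails to decode as wrong and walk a short list of candidate encodings instead; the candidate list below is an assumption about the corpus:

import chardet

def decode_line(raw, candidates=('utf-8', 'windows-1252', 'latin-1')):
    guess = chardet.detect(raw)['encoding']
    for enc in ([guess] if guess else []) + list(candidates):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte, so this is a last resort, not a guarantee
    return raw.decode('latin-1', 'replace')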


Source: (StackOverflow)

Encoding detection in Python, use the chardet library or not?

I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, or can't trust, the character encoding that comes declared with the data (if any).

For a while I have used Python's chardet library, http://pypi.python.org/pypi/chardet, to detect the original character encoding, but I ran into some problems lately when I noticed that it doesn't support Scandinavian encodings (for example iso-8859-1). Apart from that, it takes a huge amount of time/CPU/memory to get results: ~40 s for a 2 MB text file.

I tried just using the standard Linux file command:

file -bi name.txt

With all my files so far it has given me a 100% correct result, and in ~0.1 s for a 2 MB file. It supports Scandinavian character encodings as well.

So I guess the advantages of using file are clear. What are the downsides? Am I missing something?
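
If file does the job, it is also easy to call from Python; a minimal sketch wrapping the same file -bi invocation with subprocess (the output format shown in the comment is the typical one):

import subprocess

def detect_with_file(path):
    out = subprocess.check_output(['file', '-bi', path])
    # typical output: b'text/plain; charset=utf-8\n'
    for part in out.decode('ascii', 'replace').split(';'):
        part = part.strip()
        if part.startswith('charset='):
            return part[len('charset='):]
    return None

# usage, with a hypothetical file name:
# print(detect_with_file('name.txt'))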


Source: (StackOverflow)

Encoding error while parsing RSS with lxml

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError:

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree = etree.parse(response, parser)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
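
For what it's worth, the _parseDocumentFromURL frame in the traceback suggests that etree.parse is treating the downloaded string as a filename or URL; parse() expects a path or file-like object, while fromstring() parses content directly. A minimal sketch of that variant, keeping the parser options from the question:

import urllib2
import chardet
from lxml import etree

response = urllib2.urlopen('http://wiadomosci.onet.pl/kraj/rss.xml').read()
encd = chardet.detect(response)['encoding']

parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.fromstring(response, parser)  # parses the bytes themselves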

Source: (StackOverflow)