chardet
Python 2/3 compatible character encoding detector.
I'm trying to use juniversalchardet to auto-detect the encoding of a saved webpage. My first test uses www.wikipedia.org, which is served as UTF-8 according to its HTTP response header (information that is lost once the page is saved to disk).
This is my Scala code for doing so:
val content = <...load Wikipedia.html from disk...>
val charsetD = new UniversalDetector(null)
charsetD.handleData(content, 0, content.length)
val charset = charsetD.getDetectedCharset
However, regardless of what I load, the detected charset is always null. Is the juniversalchardet library defective, or am I using it wrong?
Source: (StackOverflow)
I've been working on Outlook imports (LinkedIn exports to Outlook format), but I'm having trouble with encoding. The Outlook-format CSV I get from exporting my LinkedIn contacts is not in UTF-8. Letters like ñ cause an exception in the mongoid_search gem when calling str.to_s.mb_chars.normalize. I think encoding is the issue, because the failure occurs when I call mb_chars (see first code example). I am not sure if this is a bug in the gem, but I was advised to sanitize the data nonetheless.
From File Picker, I tried using their new, community-supported gem to upload CSV data. I tried three encoding detectors and transcoders:
- Ruby port of the Python lib chardet
  - Didn't work as expected
  - The port still contained Python code, preventing it from running in my app
- rchardet19 gem
  - Detected iso-8859 with 0.8/1 confidence
  - Tried to transcode with Iconv, but it crashed on "illegal characters" at ñ
- charlock_holmes gem
  - Detected windows-1252 with 33/100 confidence
  - I assume that's the actual encoding, and rchardet got iso-8859 because it is based on that
  - This gem uses ICU and has a maintained branch "bundle-icu" which supports Heroku. When I try to transcode using charlock, I get the error U_FILE_ACCESS_ERROR, an ICU error code meaning "could not open file"
Anybody know what to do here?
Source: (StackOverflow)
I am using chardet to detect the encoding of text files, including Italian ones. The problem is that it consistently detects their encoding as iso-8859-2, while the correct detection would be iso-8859-1. Does anybody know a fix?
My locale is set to Polish; could that influence the detection?
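chardet often cannot reliably distinguish single-byte Latin encodings such as iso-8859-1 and iso-8859-2, since almost any byte sequence decodes under either. If the corpus is known to be Italian, one pragmatic workaround is to override suspicious detections with the encoding you expect; a minimal sketch, where the OVERRIDES table and the file name are illustrative assumptions, not part of the question:

import chardet

# Minimal sketch: map detections that are likely confusions for this corpus
# onto the encoding the files are known to use. The override table is an
# assumption for Italian text, not a general rule.
OVERRIDES = {'ISO-8859-2': 'iso-8859-1'}

def detect_for_corpus(raw_bytes):
    guess = chardet.detect(raw_bytes)
    encoding = guess['encoding'] or 'utf-8'   # fall back if chardet returns None
    return OVERRIDES.get(encoding, encoding)

with open('italian.txt', 'rb') as f:          # hypothetical file name
    data = f.read()
print(data.decode(detect_for_corpus(data)))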
Source: (StackOverflow)
I downloaded the chardet module, placed it in d:\, and want to install it into Python, so I ran this command:
c:\Python27\python.exe d:\chardet\setup.py
The Windows command prompt says:
Traceback (most recent call last):
File "d:\\chardet\setup.py", line 13, in <module>
long_description=open('README.rst').read(),
IOError: [Errno 2] No such file or directory: 'README.rst'
but I am sure that the file 'README.rst' is in the directory d:\chardet. I don't know how to deal with this and am hoping for your help.
Source: (StackOverflow)
I would like to know whether rchardet supports the ISO-8859-1 and Windows-1252 encodings. I have looked at the documentation but didn't find proper information on this.
Source: (StackOverflow)
I have a small icecast2 home server with Django playlist management. I also have a lot of MP3s with broken encodings. First, I tried to find some encoding-repair tool for Python, but haven't found anything that works for me (python-ftfy, nltk: it does not support unicode input).
I use the beets pip package like a Swiss Army knife for parsing media tags; it's quite simple, and I think it's almost enough for most cases.
For character set detection I use chardet, but it has some issues with short strings, so I apply some coercion tweaks for the encodings encountered. I presume that if the encoding is wrong, it's wrong in all tags, so I collect all the detected encodings first.
class MostFrequentEncoding(dict):
    def from_attrs(self, obj):
        for attr in dir(obj):
            val = getattr(obj, attr)
            self.feed(val)

    def feed(self, obj):
        if obj and isinstance(obj, basestring):
            guess = chardet.detect(obj)
            encoding = guess['encoding']
            if encoding not in self:
                self.setdefault(encoding, {'confidence': 0.0, 'total': 0})
            self[encoding]['confidence'] += guess['confidence']
            self[encoding]['total'] += 1

    def encodings(self):
        return sorted(self, key=lambda x: self[x]['total'], reverse=True)
Here are the tweaks:
charset_coercing = {
    ('MacCyrillic', 'windows-1251'): {'MacCyrillic': -0.1},
}
That means that if MacCyrillic and windows-1251 are both candidates at the same time, we should prefer windows-1251.
def fix_encoding(src, possible_encodings):
    if not isinstance(src, basestring) or not src:
        return src

    guess = chardet.detect(src)
    first_encoding = guess['encoding']

    encodings = list(possible_encodings)  # copy possible encodings
    if first_encoding in encodings:       # we believe chardet, so the first tested
        encodings.remove(first_encoding)  # encoding will be the one detected by chardet
    encodings.insert(0, first_encoding)

    encodings_set = set(encodings)
    tested_encodings = {k: {'string': '', 'confidence': -1.0} for k in encodings}

    try:
        lat = src.encode('latin-1') if isinstance(src, unicode) else src  # make latin string
    except UnicodeEncodeError:
        lat = src.encode('utf-8')  # may not be necessary, should return src?

    while encodings:
        candidate = encodings.pop(0)
        if not candidate:
            continue
        if candidate not in tested_encodings:
            tested_encodings.setdefault(candidate, {'string': '', 'confidence': -1.0})
        try:
            fixed_string = lat.decode(candidate)
        except UnicodeDecodeError:
            continue
        # try to detect charset again
        fixed_confidence = chardet.detect(fixed_string)['confidence']
        # it seems that the new confidence is usually higher if the previous detection was right
        tested_encodings[candidate]['string'] = fixed_string
        tested_encodings[candidate]['confidence'] = fixed_confidence

    # perform charset coercing
    for subset, coercing_encodings in charset_coercing.items():
        if set(subset).issubset(encodings_set):
            for enc, penalty in coercing_encodings.items():
                tested_encodings[enc]['confidence'] += penalty

    result = tested_encodings.get(first_encoding)
    if result['confidence'] >= 0.99:  # if confidence for the first detection is high, use it
        return result['string']

    max_confidence_charset = max(tested_encodings, key=lambda x: tested_encodings[x]['confidence'])
    return tested_encodings[max_confidence_charset]['string']
Media file parsing:
def extract_tags(media_file):
    try:
        mf = MediaFile(media_file)
    except:
        return {}

    mfe = MostFrequentEncoding()
    mfe.from_attrs(mf)
    encodings = mfe.encodings()
    tags = {}

    for attr in sorted(dir(mf)):
        val = getattr(mf, attr)
        if not val or callable(val) or \
                attr in ['__dict__', '__doc__', '__module__', '__weakref__', 'mgfile', 'art']:
            continue
        fixed = fix_encoding(val, encodings)
        tags[attr] = remove_extra_spaces(fixed) if isinstance(fixed, basestring) else fixed

    if mf.art:
        tags['art'] = {'data': mf.art, 'mime': imghdr.what(None, h=mf.art)}

    return tags
And an example of the usage:
f = '/media/Media/Music/Jason Becker/Jason Becker - Perpetual Burn/02__1.mp3'
pprint(extract_tags(f))
Here is a full script. It can show ASCII covers for albums during parsing.
It seems to work, but is there any maintained swiss-army-knife encoding-repair library for Python?
Source: (StackOverflow)
I want some data from a website with the following URL:
http://www.iex.nl/Ajax/ChartData/interday.ashx?id=360113249&callback=ChartData
I think the data is JSON. Going to the URL in my browser, I can read the data.
In python I have the following code:
import urllib
import re
import json
import chardet
url = "http://www.iex.nl/Ajax/ChartData/interday.ashx?id=360113249&callback=ChartData"
htmlfile = urllib.urlopen(url).read()
chardet.detect(htmlfile)
This gives the following output:
{'confidence': 0.0, 'encoding': None}
When I print htmlfile, it looks like it is UTF-8.
What could be the reason for this chardet output?
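Before blaming chardet, it is worth checking what was actually downloaded; chardet typically returns {'encoding': None, 'confidence': 0.0} for an empty byte string. A small diagnostic sketch reusing the names from the snippet above (nothing else is assumed):

import urllib
import chardet

url = "http://www.iex.nl/Ajax/ChartData/interday.ashx?id=360113249&callback=ChartData"
htmlfile = urllib.urlopen(url).read()

# Inspect the raw response before detection.
print(len(htmlfile))           # an empty body would explain encoding=None
print(repr(htmlfile[:200]))    # JSONP usually looks like ChartData([...])
print(chardet.detect(htmlfile))

# If the body really is plain ASCII/UTF-8 JSONP, decoding directly may suffice.
try:
    print(htmlfile.decode('utf-8')[:200])
except UnicodeDecodeError:
    pass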
Source: (StackOverflow)
I get the following error when I try to call BeautifulSoup(page)
Traceback (most recent call last):
File "error.py", line 10, in <module>
soup = BeautifulSoup(page)
File "C:\Python33\lib\site-packages\bs4\__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "C:\Python33\lib\site-packages\bs4\builder\_htmlparser.py", line 136, in
prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "C:\Python33\lib\site-packages\bs4\dammit.py", line 223, in __init__
u = self._convert_from(chardet_dammit(self.markup))
File "C:\Python33\lib\site-packages\bs4\dammit.py", line 30, in chardet_dammit
return chardet.detect(s)['encoding']
File "C:\Python33\lib\site-packages\chardet\__init__.py", line 21, in detect
import universaldetector
ImportError: No module named 'universaldetector'
I am running Python 3.3 on Windows 7. I installed bs4 from setup.py after downloading the .tar.gz, then installed pip and installed chardet by running pip.exe install chardet. My chardet version is 2.2.1. bs4 works fine for other URLs.
Here's the code
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import chardet
url = "http://www.edgar-online.com/brand/yahoo/search/?cik=1400810"
page = urlopen(url).read()
#print(page)
soup = BeautifulSoup(page)
I look forward to your answers.
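The traceback shows chardet's own __init__.py executing a Python-2-style import (import universaldetector), which fails on Python 3, so it is worth confirming which chardet package the Python 3.3 interpreter actually imports; a small diagnostic sketch using only the standard library and chardet itself:

import sys
import chardet

# Confirm which interpreter and which on-disk chardet package are in use.
# A copy installed for Python 2, or an old source tree earlier on sys.path,
# would explain the Python-2-style import in the traceback.
print(sys.version)
print(getattr(chardet, '__version__', 'unknown'))
print(chardet.__file__)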
Source: (StackOverflow)
I would like to try some code that uses the chardet module.
This is the code I found on the web:
import urllib2
import chardet
def fetch(url):
    try:
        result = urllib2.urlopen(url)
        rawdata = result.read()
        encoding = chardet.detect(rawdata)
        return rawdata.decode(encoding['encoding'])
    except urllib2.URLError, e:
        handleError(e)
But to try this code, I have to get the chardet module, and there are two choices at https://pypi.python.org/pypi/chardet#downloads:
- chardet-2.2.1-py2.py3-none-any.whl (md5) Python Wheel
- chardet-2.2.1.tar.gz (md5) Python source
I chose the Python Wheel and put the file in my Python27 directory, but it still does not work.
So my problem is: which type of chardet file should I download, and where should I put it so that Python does not print this error:
Traceback (most recent call last):
File "C:/Python27/s7/test5.py", line 2, in
import chardet
ImportError: No module named chardet
Note: I'm on Python 2.7.
Thanks in advance for any help or suggestions!
EDIT 1: Sorry for being such a beginner, but in fact it is the Python source that must be chosen, installed with setup.py; however, that does not work for me either!
I opened the Windows command prompt, changed to the directory of the unzipped chardet-2.2.1, and then ran python setup.py install, but it still does not work...
I think it's better to open a new question.
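Whichever download is used, the import only succeeds once chardet ends up on sys.path (normally inside site-packages after pip install chardet or python setup.py install); a quick sketch to see where this interpreter actually looks:

import sys

# A wheel or source archive merely copied into C:\Python27 is not importable
# by itself; the installed package must live in one of these directories.
print(sys.executable)
for p in sys.path:
    print(p)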
Source: (StackOverflow)
I'm decoding a large (about a gigabyte) flat-file database, which mixes character encodings willy-nilly. The Python module chardet has been doing a good job so far of identifying the encodings, but I've hit a stumbling block...
In [428]: badish[-3]
Out[428]: '\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
In [429]: chardet.detect(badish[-3])
Out[429]: {'confidence': 0.98999999999999999, 'encoding': 'Big5'}
In [430]: unicode(badish[-3], 'Big5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~/src/imdb/<ipython console> in <module>()
UnicodeDecodeError: 'big5' codec can't decode bytes in position 11-12: illegal multibyte sequence
chardet reports a very high confidence in its choice of encoding, but it doesn't decode... Are there any other sensible approaches?
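chardet's confidence is only a statistical score, so one sensible fallback (a sketch, not a definitive answer; the candidate list below is an assumption about the data, which looks Turkish) is to keep the guess only if it actually decodes and otherwise walk a list of likely encodings:

import chardet

def best_effort_decode(raw, fallbacks=('utf-8', 'iso-8859-9', 'windows-1254', 'latin-1')):
    # Try chardet's guess first, but keep it only if it really decodes.
    guess = chardet.detect(raw)['encoding']
    candidates = ([guess] if guess else []) + list(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 accepts any byte sequence, so this line is only reached if it
    # was removed from the fallback list.
    return raw.decode('latin-1', 'replace'), 'latin-1'

text, used = best_effort_decode('\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n')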
Source: (StackOverflow)
I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, or can't trust, the character encoding that comes defined with the data (if any).
I have for a while used Python's chardet library (http://pypi.python.org/pypi/chardet) to detect the original character encoding, but I ran into some problems lately when I noticed that it doesn't support Scandinavian encodings (for example iso-8859-1).
Apart from that, it takes a huge amount of time/CPU/memory to get results: ~40 s for a 2 MB text file.
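If chardet itself must stay in the pipeline, its incremental UniversalDetector API can cut the cost considerably: feed the file in chunks and stop as soon as the detector reports it is done, instead of handing it the whole 2 MB at once. A minimal sketch (the file name and chunk size are placeholders):

from chardet.universaldetector import UniversalDetector

def detect_encoding(path, chunk_size=64 * 1024):
    detector = UniversalDetector()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            detector.feed(chunk)
            if detector.done:        # confident enough, stop reading
                break
    detector.close()
    return detector.result           # e.g. {'encoding': 'utf-8', 'confidence': 0.99}

print(detect_encoding('name.txt'))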
I tried just using the standard Linux file command:
file -bi name.txt
With all my files so far it gives me a 100% correct result, and it does so in ~0.1 s for a 2 MB file. It also supports Scandinavian character encodings.
So I guess the advantages of using file are clear. What are the downsides? Am I missing something?
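One obvious downside is that file lives outside Python, but it is trivial to wrap; a minimal sketch that shells out to file for just the charset (it assumes a Unix-like system with file installed, and note that file can also report values such as binary or unknown-8bit that are not codec names):

import subprocess

def detect_with_file(path):
    # 'file -b --mime-encoding' prints only the charset, e.g. 'utf-8' or 'iso-8859-1'.
    out = subprocess.check_output(['file', '-b', '--mime-encoding', path])
    return out.strip().decode('ascii')

encoding = detect_with_file('name.txt')
text = open('name.txt', 'rb').read().decode(encoding)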
Source: (StackOverflow)
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError.
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)
But I get an error:
tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67
740)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etr
ee.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
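One detail worth noting: etree.parse() expects a file name, URL, or file-like object, so passing the downloaded bytes makes lxml treat the XML text itself as a path, which is a plausible source of this error. A minimal sketch of the same flow using fromstring(), which takes the bytes directly (only the names from the question are reused):

import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()

encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)

# fromstring() parses the already-downloaded bytes; parse() would try to
# open its first argument as a file name or URL.
root = etree.fromstring(response, parser)
print(etree.tostring(root, pretty_print=True)[:300])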
Source: (StackOverflow)