utf-8 interview questions
Top utf-8 frequently asked interview questions
Let's suppose I have just used a BufferedInputStream
to read the bytes of a UTF-8 encoded text file into a byte array. I know that I can use the following routine to convert the bytes to a string, but is there a more efficient/smarter way of doing this than just iterating through the bytes and converting each one?
public String openFileToString(byte[] _bytes)
{
String file_string = "";
for(int i = 0; i < _bytes.length; i++)
{
file_string += (char)_bytes[i];
}
return file_string;
}
Source: (StackOverflow)
I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.
My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear a distinction between UTF-8 General CI and UTF-8 Unicode CI.
- Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
- What type of data would UTF-8 Binary be applicable to?
Source: (StackOverflow)
I need to get UTF-8 working in my Java webapp (servlets + JSP, no framework used) to support äöå
etc. for regular Finnish text and Cyrillic alphabets like ЦжФ
for special cases.
My setup is the following:
- Development environment: Windows XP
- Production environment: Debian
Database used: MySQL 5.x
Users mainly use Firefox2 but also Opera 9.x, FF3, IE7 and Google Chrome are used to access the site.
How to achieve this?
Source: (StackOverflow)
I wonder why most modern solutions built using Perl don't enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.
Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?
Commenting @tchrist got too long, so I'm adding it here.
It seems that I did not make myself clear. Let me try to add some things.
tchrist and I see situation pretty similarly, but our conclusions are completely in opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.
tchrist pointed so many aspects to cover, I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I have not so much knowledge to argue with that. So, I stick to live examples.
I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems, it just worked. Maybe there are some limitation somewhere deeper, but at start, all I tested worked as I expected.
Shouldn't that be a goal in modern Perl 5 too? I stress it more: I'm not suggesting UTF-8 as the default character set for core Perl, I suggest the possibility to trigger it with a snap for those who develop new projects.
Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean?. So, it is pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!
I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. Some way tchrist is right and answers my questions: new projects don't attract UTF-8 because it is too complicated in Perl 5.
Source: (StackOverflow)
What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Everything goes: one-liners in your favorite scripting language, command-line tools or other utilities for OS, web sites, etc.
Best solutions so far:
On Linux/UNIX/OS X/cygwin:
Gnu iconv suggested by Troels Arvin is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
Gnu recode (manual) suggested by Cheekysoft will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (Dos):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64 encoded UTF8 file with Unix line endings to Base64 encoded Latin 1 file with Dos line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
On Windows with Powershell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Edit: Do you mean iso-8859-1 support? Using "String" does this e.g. for vice versa
gc -en string in.txt | Out-File -en utf8 out.txt
Note: Th
e possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii"
- CsCvt - Kalytta's Character Set Converter (http://www.cscvt.de) is another great command line based conversion tool for Windows.
Source: (StackOverflow)
What's the basis for Unicode and why the need for UTF-8 or UTF-16?
I have researched this on Google and searched here as well but it's not clear to me.
In VSS when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?
Please explain in simple terms.
Source: (StackOverflow)
How do I properly set the default character encoding used by the JVM (1.5.x) programmatically?
I have read that -Dfile.encoding=whatever
used to be the way to go for older JVMs... I don't have that luxury for reasons I wont get into.
I have tried:
System.setProperty("file.encoding", "UTF-8");
And the property gets set, but it doesn't seem to cause the final getBytes call below to use UTF8:
System.setProperty("file.encoding", "UTF-8");
byte inbytes[] = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
fis.read(inbytes);
FileOutputStream fos = new FileOutputStream("response-2.txt");
String in = new String(inbytes, "UTF8");
fos.write(in.getBytes());
Source: (StackOverflow)
What are the differences between UTF-8, UTF-16, and UTF-32. I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
Source: (StackOverflow)
I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO-8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1) The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ß". Then it is displayed wrongly, of course.
3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb-detect-encoding()? Can I write a function for this? So my problems are:
1) How to find out what encoding the text uses
2) How to convert it to UTF-8 - whatever the old encoding is
EDIT:
Would a function like this work?
function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}
I've tested it but it doesn't work. What's wrong with it?
Source: (StackOverflow)
I have heard conflicting opinions from people - according to wikipedia, see here
They are the same thing, aren't they? Can someone clarify?
Source: (StackOverflow)
This question already has an answer here:
$ cat bla.py
u = unicode('d…')
s = u.encode('utf-8')
print s
$ python bla.py
File "bla.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file bla.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How can I declare utf-8 strings in source code?
Source: (StackOverflow)
I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# the string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n
into my favorite editor, in file f2.
then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions.
Edit: What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ascii representation of this unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Source: (StackOverflow)
I need to use UTF-8 in my resource properties using Java's ResourceBundle
. When I enter the text directly into the properties file, it displays as mojibake.
My app runs on Google App Engine.
Can anyone give me an example? I can't get this work.
Source: (StackOverflow)