character-encoding interview questions
Charset issues are confusing and complicated by themselves, but on top of that you have to remember the exact names of your charsets. Is it "utf8"? Or "utf-8"? Or maybe "UTF-8"? When searching the internet for code samples you will see all of the above. Why not just make them named constants and use Charset.UTF8?
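The question above is about Java, where Java 7 and later provide constants such as StandardCharsets.UTF_8. The naming confusion itself can be illustrated in any language; the following is a minimal Python sketch, not taken from the original post, showing that the common spellings resolve to a single codec. The helper name canonical_name and the SPELLINGS list are made up for illustration.

```python
import codecs

# Spellings commonly seen in code samples (hypothetical list for illustration).
SPELLINGS = ["utf8", "utf-8", "UTF-8", "UTF8"]


def canonical_name(label):
    """Return the canonical codec name Python registers for a charset label."""
    return codecs.lookup(label).name


# Map every spelling to the name of the codec it resolves to.
canonical = {label: canonical_name(label) for label in SPELLINGS}
print(canonical)
```

Resolving the label once and passing the codec name around avoids scattering string literals, which is the same motivation as a named constant.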
Source: (StackOverflow)
How do I properly set the default character encoding used by the JVM (1.5.x) programmatically?
I have read that -Dfile.encoding=whatever used to be the way to go for older JVMs... I don't have that luxury, for reasons I won't get into.
I have tried:
System.setProperty("file.encoding", "UTF-8");
And the property gets set, but it doesn't seem to cause the final getBytes call below to use UTF8:
System.setProperty("file.encoding", "UTF-8");
byte inbytes[] = new byte[1024];
FileInputStream fis = new FileInputStream("response.txt");
fis.read(inbytes);
FileOutputStream fos = new FileOutputStream("response-2.txt");
String in = new String(inbytes, "UTF8");
fos.write(in.getBytes());
Source: (StackOverflow)
Currently we are using the following commands in PHP to set the character set to UTF-8 in our application:
SET NAMES 'utf8'
SET CHARACTER SET utf8
Since this is a bit of overhead, we'd like to set this as the default setting in MySQL. Can we do this in /etc/my.cnf or in another location?
I've looked for a default charset in /etc/my.cnf, but there's nothing there about charsets.
At this point, I did the following to set the MySQL charset and collation variables to UTF-8:
skip-character-set-client-handshake
character_set_client=utf8
character_set_server=utf8
Is that a correct way to handle this?
Source: (StackOverflow)
I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO-8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1) The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.
3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?
Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
1) How to find out what encoding the text uses
2) How to convert it to UTF-8 - whatever the old encoding is
EDIT:
Would a function like this work?
function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}
I've tested it but it doesn't work. What's wrong with it?
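The two sub-problems listed above (guess the encoding, then convert to UTF-8) can be sketched as follows. This is a heuristic in Python rather than the PHP of the question, and it assumes the feeds are either UTF-8 or ISO-8859-1, as the question states; the function name to_utf8 and its fallback parameter are made up for illustration. It is not a general encoding detector: some ISO-8859-1 byte sequences also happen to be valid UTF-8 and would be misclassified.

```python
def to_utf8(raw, fallback="iso-8859-1"):
    """Return UTF-8 bytes for input that is either UTF-8 or the fallback encoding."""
    try:
        text = raw.decode("utf-8")    # strict: raises on invalid UTF-8 sequences
    except UnicodeDecodeError:
        text = raw.decode(fallback)   # ISO-8859-1 maps every byte, so this cannot fail
    return text.encode("utf-8")


word = "Fu\u00dfball"                      # "Fußball"
utf8_feed = word.encode("utf-8")           # ß becomes the two bytes C3 9F
latin1_feed = word.encode("iso-8859-1")    # ß becomes the single byte DF

# Both feeds normalise to the same UTF-8 bytes.
print(to_utf8(utf8_feed) == to_utf8(latin1_feed))

# Double encoding: UTF-8 bytes misread as ISO-8859-1 and then encoded again.
double_encoded = utf8_feed.decode("iso-8859-1").encode("utf-8")
print(double_encoded)
```

The last two lines show what happens when text that is already UTF-8 is converted a second time: the two bytes of "ß" turn into four, which is the kind of damage blind re-encoding produces.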
Source: (StackOverflow)
$ cat bla.py
u = unicode('d…')
s = u.encode('utf-8')
print s
$ python bla.py
File "bla.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file bla.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How can I declare utf-8 strings in source code?
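The error message points to PEP 263, which describes a magic comment on the first or second line of the file. A minimal sketch of that declaration follows, assuming the file itself is saved as UTF-8. It uses a u'' literal instead of unicode('...'), on the assumption that the goal is a Unicode string; in Python 2, unicode() applied to a byte string decodes it as ASCII by default.

```python
# -*- coding: utf-8 -*-
# The declaration above (PEP 263) must be on the first or second line of the file.

u = u"d…"                 # Unicode literal: 'd' followed by U+2026 (horizontal ellipsis)
s = u.encode("utf-8")     # text to UTF-8 bytes
print(repr(s))
```

Python 3 treats source files as UTF-8 by default, so there the declaration is only needed for other encodings.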
Source: (StackOverflow)
I'm looking to prevent a line break after a hyphen (-) on a case-by-case basis that is compatible with all browsers.
Example:
I have this text: 3-3/8"
which in HTML is this: 3-3/8”
The problem is that near the end of a line, because of the hyphen, it breaks and wraps to the next line instead of treating it like a full word...
3-
3/8"
I've tried inserting the "zero width no break character", with no luck...
3-3/8”
I'm seeing this in Safari and thinking it will be the same in all browsers.
The following is my doctype and character encoding...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
Is there any way I can prevent these from line-breaking after the hyphen? I do not need any solution that applies to the whole page... just something I can insert as needed, like a "zero width no break character", except one that works.
Here is a Demo. Simply make the frame narrower until the line breaks at the hyphen.
http://jsfiddle.net/RagKH/
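One character often suggested for this situation is U+2011 NON-BREAKING HYPHEN. The sketch below, in Python, shows how a value could be rewritten before it is placed into the page; the function names are made up for illustration, and whether every browser and font renders the character as intended is not something this sketch verifies.

```python
NON_BREAKING_HYPHEN = "\u2011"   # U+2011 NON-BREAKING HYPHEN


def no_break_hyphens(text):
    """Replace ordinary hyphens with non-breaking hyphens in one text fragment."""
    return text.replace("-", NON_BREAKING_HYPHEN)


def as_html_entities(text):
    """Write non-breaking hyphens as numeric character references for HTML source."""
    return text.replace(NON_BREAKING_HYPHEN, "&#8209;")


fragment = no_break_hyphens('3-3/8"')
print(as_html_entities(fragment))
```

Because the helper is applied per fragment, it matches the case-by-case requirement rather than changing the whole page.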
Source: (StackOverflow)
What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
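As a rough illustration of the distinction being asked about: Unicode assigns numbers (code points) to characters, while UTF-8, UTF-16 and UTF-32 are different ways of turning those numbers into bytes, and ASCII only covers code points below 128. The Python sketch below uses an arbitrary sample string chosen for illustration.

```python
text = "AB\u20ac"   # 'A', 'B' and the euro sign (U+20AC)

# Unicode: each character has a code point, independent of any byte encoding.
code_points = [hex(ord(ch)) for ch in text]

# The same three code points take a different number of bytes in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII cannot represent the euro sign at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(code_points)
print(sizes)
print(ascii_ok)
```

In UTF-8 the two ASCII letters take one byte each and the euro sign takes three; UTF-16 uses two bytes for each of these characters, and UTF-32 always uses four.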
Source: (StackOverflow)
In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?
Source: (StackOverflow)
I have to maintain a large number of classic ASP pages, many of which have tabular data with no sort capabilities at all. Whatever order the original developer used in the database query is what you're stuck with.
I want to tack on some basic sorting to a bunch of these pages, and I'm doing it all client side with javascript. I already have the basic script done to sort a given table on a given column in a given direction, and it works well as long as the table is limited by certain conventions we follow here.
What I want to do for the UI is just indicate sort direction with the caret character ( ^ ) and ... what? Is there a special character that is the direct opposite of a caret? The letter v won't quite cut it. Alternatively, is there another character pairing I can use?
Source: (StackOverflow)
When I make a POST request with a JSON body to my REST service I include Content-type: application/json; charset=utf-8 in the message header. Without this header, I get an error from the service. I can also successfully use Content-type: application/json without the ;charset=utf-8 portion.
What exactly does charset=utf-8 do? I know it specifies the character encoding but the service works fine without it. Does this encoding limit the characters that can be in the message body?
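A small Python sketch of what the parameter is for: it labels how the bytes of the body should be decoded, and with UTF-8 it does not restrict which characters the body may contain. The helper charset_of and the sample payload are made up for illustration, and falling back to UTF-8 when the parameter is absent is an assumption of this sketch about how a service might behave, not a statement about the service in the question.

```python
import json
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type value, or return the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


payload = {"sport": "Fu\u00dfball"}   # contains the non-ASCII character ß

# The sender turns text into bytes using the encoding it declares.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# The receiver reads the label (or falls back to its default) to decode the bytes.
with_param = charset_of("application/json; charset=utf-8")
without_param = charset_of("application/json")
decoded = json.loads(body.decode(with_param))

print(with_param, without_param, decoded == payload)
```

Under that assumption both header variants lead to the same decoding, which would be consistent with the service accepting the request either way.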
Source: (StackOverflow)