Unicode interview questions
Top frequently asked Unicode interview questions
Is there any way to convert an entire user-entered string from uppercase, or even partly uppercase, to lowercase?
E.g. Kilometers --> kilometers.
Source: (StackOverflow)
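The lowercase question above does not name a language. As a minimal sketch, assuming Python 3, the built-in `str.lower()` covers the `Kilometers` example, and `str.casefold()` is a more aggressive variant intended for caseless comparison. The helper name `to_lower` is only illustrative.

```python
def to_lower(text):
    """Return a copy of text with every cased character lowercased."""
    return text.lower()


print(to_lower("Kilometers"))   # kilometers
print(to_lower("KiLoMeTeRs"))   # kilometers

# casefold() also folds characters that lower() leaves alone, such as the German sharp s.
print(u"Stra\xdfe".casefold())  # strasse
```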
How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?
Source: (StackOverflow)
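What "a Python string" means in the question above depends on the Python version. A hedged sketch, assuming the goal is a byte string: encoding to UTF-8 keeps every character, while encoding to ASCII forces a choice about what to do with characters such as £. The variable names are illustrative.

```python
text = u"\xa3 and $"  # the pound sign followed by ASCII text

# UTF-8 can represent every Unicode character, so nothing is lost.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent the pound sign; an error handler decides what happens to it.
ascii_dropped = text.encode("ascii", "ignore")    # silently drops it
ascii_marked = text.encode("ascii", "replace")    # substitutes '?'

print(repr(utf8_bytes))
print(repr(ascii_dropped))
print(repr(ascii_marked))
```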
Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?
Source: (StackOverflow)
I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
- When should I use std::wstring over std::string?
- Can std::string hold the entire ASCII character set, including the special characters?
- Is std::wstring supported by all popular C++ compilers?
- What is exactly a "wide character"?
Source: (StackOverflow)
I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the Web an elegant way to do this in Java:
- convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
- remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
Source: (StackOverflow)
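The two steps described in the accent-removal question above can be sketched with only the standard library, assuming Python 3. The function name `strip_accents` is hypothetical; `unicodedata.normalize` and `unicodedata.category` are standard-library calls. Treating "diacritic" as the Unicode category `Mn` (nonspacing mark) is an assumption about what the question means.

```python
import unicodedata


def strip_accents(text):
    """Decompose text (NFD) and drop the nonspacing marks that carry the accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return u"".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents(u"Capit\xe1n"))      # Capitan
print(strip_accents(u"Fran\xe7ois"))     # Francois
```

Letters that have no canonical decomposition, such as ø or ł, pass through unchanged with this approach.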
I was baffled when a colleague showed me this line of JavaScript alerting 42.
It quickly turned out that what looks like a minus sign is actually an arcane Unicode character with clearly different semantics.
This left me wondering why that character doesn't produce a syntax error when the expression is parsed. I'd also like to know if there are more characters behaving like this.
Source: (StackOverflow)
In Ruby on Rails 3 (currently using Beta 4), I see that when using the form_tag or form_for helpers there is a hidden field named _snowman with the value of ☃ (U+2603, decimal 9731) showing up.
So, what is this for?
Source: (StackOverflow)
I have a problem. I have found the HTML code for the downwards arrow, &darr; (↓)
Cool. Now I need to use it in CSS like so:
nav a:hover {content:"&darr";}
That obviously won't work since &darr; is an HTML symbol. There seems to be less info about these "escaped Unicode" symbols that are used in CSS. There are other symbols like \2020 that I found but no arrows. What are the arrow codes?
Source: (StackOverflow)
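For the CSS arrow question above: a CSS escape is a backslash followed by the character's code point in hexadecimal, so the codes can be derived rather than looked up. A small sketch in Python 3; the helper name `css_escape` is hypothetical.

```python
def css_escape(char):
    """Return the CSS backslash escape (hex code point) for a single character."""
    return "\\{:x}".format(ord(char))


arrows = [("left", u"\u2190"), ("up", u"\u2191"), ("right", u"\u2192"), ("down", u"\u2193")]
for name, char in arrows:
    print(name, css_escape(char))

# The dagger mentioned in the question maps to the escape the asker already found.
print(css_escape(u"\u2020"))  # \2020
```

The down arrow comes out as `\2193`, which would be written as `content: "\2193";`. Note that `content` normally takes effect only on the `::before` and `::after` pseudo-elements.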
This code:
string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", a.Length);
Console.WriteLine("Length b = {0}", b.Length);
outputs:
Length a = 3
Length b = 4
Why? The only thing I could imagine is that the Chinese character is 2 bytes long and that the .Length method returns the byte count.
Source: (StackOverflow)
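The length-4 result in the question above is consistent with counting UTF-16 code units rather than bytes or characters: a character outside the Basic Multilingual Plane is stored as a surrogate pair of two code units. The sketch below reproduces that count in Python 3; `utf16_length` is an illustrative helper, not a .NET API.

```python
def utf16_length(text):
    """Count UTF-16 code units, which is what a UTF-16 based string length reports."""
    return len(text.encode("utf-16-le")) // 2


a = u"abc"
b = u"A𠈓C"  # the middle character lies outside the Basic Multilingual Plane

print(len(a), utf16_length(a))  # 3 3
print(len(b), utf16_length(b))  # 3 4
```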
I'm using Python 2 to parse JSON from (ASCII encoded) text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects.
The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.
Is it possible to get string objects instead of Unicode ones from json or simplejson?
Here's a small example:
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
Source: (StackOverflow)
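One possible approach to the JSON question above is to post-process the parsed result, walking it recursively and encoding every text string to a byte string. This is a sketch, not a feature of `json` or `simplejson`; the name `byteify` and the UTF-8 default are assumptions. It is written so that it runs on both Python 2 (where the result is `str`) and Python 3 (where the result is `bytes`).

```python
import json

text_type = type(u"")  # unicode on Python 2, str on Python 3


def byteify(value, encoding="utf-8"):
    """Recursively encode every text string inside parsed JSON data to bytes."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, list):
        return [byteify(item, encoding) for item in value]
    if isinstance(value, dict):
        return dict((byteify(k, encoding), byteify(v, encoding)) for k, v in value.items())
    return value  # numbers, booleans and None pass through unchanged


new_list = byteify(json.loads('["a", "b"]'))
print(new_list)
```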
Apparently, the following is valid syntax...
my_string = b'The string'
I would like to know...
- What does this b character in front of the string mean?
- What are the effects of using it?
- What are appropriate situations to use it?
I found a related question right here on SO, but it is about PHP. It states that the b indicates that the string is binary as opposed to Unicode, which was needed to keep code compatible when migrating from PHP < 6 to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Source: (StackOverflow)
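A short sketch of what the prefixes from the question above do, assuming Python 3 (in Python 2 the `b` prefix is accepted but produces an ordinary `str`). The variable names are illustrative.

```python
raw = b"The string"    # bytes: a sequence of integers in the range 0-255
text = u"The string"   # text: a sequence of Unicode code points

print(type(raw).__name__, type(text).__name__)  # bytes str

# Indexing bytes yields an integer, not a one-character string.
print(raw[0])  # 84

# Converting between the two always goes through an explicit encoding.
print(raw.decode("ascii") == text)  # True

# Another prefix: r disables backslash escapes, so this is two characters long.
print(len(r"\n"))  # 2
```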
I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# the string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2. Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
Edit: What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ascii representation of this unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Source: (StackOverflow)
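The distinction the question above runs into is between the UTF-8 bytes of a character and the escape sequences `repr` prints for those bytes. As a sketch, assuming Python 3 (or `io.open` on Python 2.6+, which postdates the 2.4 in the question): a file written with an explicit encoding round-trips cleanly, and the `unicode_escape` codec is one ASCII-only representation that Python can decode again. The temporary path is illustrative.

```python
import io
import os
import tempfile

ss = u"Capit\xe1n"
path = os.path.join(tempfile.mkdtemp(), "f1")

# Write real UTF-8 bytes: the a-acute becomes the two bytes 0xC3 0xA1.
with io.open(path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write(ss + u"\n")

with io.open(path, "rb") as handle:
    raw = handle.read()       # the bytes as stored on disk

with io.open(path, "r", encoding="utf-8") as handle:
    decoded = handle.read()   # decoded back to text

# An ASCII-only form that Python can turn back into the original text.
escaped = ss.encode("unicode_escape")
restored = escaped.decode("unicode_escape")

print(repr(raw), repr(decoded), repr(escaped), repr(restored))
```

Typing the characters `\xc3\xa1` into an editor stores backslashes and letters, not the two bytes they describe, which is why decoding `f2` as UTF-8 leaves them untouched.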
We have a project in TFS that has a non-English character (š) in its name. When trying to script a few build-related things we stumbled upon a problem: we can't pass the letter š to the command line tools. The command prompt, or something else along the way, mangles it, and the tf.exe utility can't find the specified project.
I've tried different formats for the .bat file (ANSI, UTF-8 with and without BOM) as well as scripting it in JavaScript (which is inherently Unicode), but no luck. Does anybody have an idea how to execute a program and pass it a Unicode command line?
Source: (StackOverflow)
What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Source: (StackOverflow)
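One way to make the encoding question above concrete is to encode the same characters several ways and compare the results. This is only an illustrative sketch in Python 3. Unicode itself assigns code points, and the encodings differ in how they turn those code points into bytes. Using `cp1252` as a stand-in for "ANSI" is an assumption; that term usually means whichever legacy Windows code page is active. The helper names are hypothetical.

```python
def encoded_sizes(char):
    """Return the number of bytes one character needs in each Unicode encoding form."""
    return {name: len(char.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


def can_encode(text, encoding):
    """Report whether text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for char in (u"A", u"\xe9", u"\u20ac", u"\U0001F600"):
    print(repr(char), encoded_sizes(char))

print(can_encode(u"A", "ascii"))          # True
print(can_encode(u"\xe9", "ascii"))       # False: ASCII stops at code point 127
print(repr(u"\u20ac".encode("cp1252")))   # the euro sign is the single byte 0x80 in cp1252
```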