
Unicode interview questions

Top frequently asked Unicode interview questions

How to convert string to lowercase in Python?

Is there a way to convert an entire user-entered string from uppercase, or even partly uppercase, to lowercase?

E.g. Kilometers --> kilometers.
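
A minimal sketch of one way to do this, assuming Python 3. str.lower() covers the case in the question; str.casefold() is a more aggressive variant meant for caseless comparison. The variable names are illustrative only.

```python
# Lowercasing a string (variable names are illustrative).
text = "Kilometers"
lowered = text.lower()

# casefold() targets caseless matching and also folds characters
# such as the German sharp s, which lower() leaves unchanged.
folded = "Straße".casefold()

print(lowered)  # kilometers
print(folded)   # strasse
```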

Source: (StackOverflow)

Convert a Unicode string to a string in Python (containing extra symbols)

How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?
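
The question can be read two ways, so the sketch below shows both, assuming Python 3 and a made-up sample string. Encoding to UTF-8 keeps every character, while normalizing and encoding to ASCII with errors ignored drops whatever has no ASCII equivalent.

```python
import unicodedata

text = "Caf\u00e9 \u00a35 $3"  # hypothetical sample: "Café £5 $3"

# Lossless: encode the text to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Lossy: decompose accented letters, then drop every non-ASCII character.
ascii_only = (
    unicodedata.normalize("NFKD", text)
    .encode("ascii", "ignore")
    .decode("ascii")
)

print(utf8_bytes)
print(ascii_only)  # Cafe 5 $3
```

Note that the pound sign has no ASCII decomposition, so the lossy route simply removes it.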

Source: (StackOverflow)


What's the difference between utf8_general_ci and utf8_unicode_ci?

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

Source: (StackOverflow)

std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What is exactly a "wide character"?

Source: (StackOverflow)

What's the difference between UTF-8 and UTF-8 without BOM?

What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?
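
To make the byte-level difference concrete, here is a small sketch assuming Python 3, whose "utf-8-sig" codec writes and strips the BOM. The sample text is arbitrary.

```python
import codecs

text = "hello"
plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# The only difference is the three-byte signature at the start.
prefix_matches = with_bom == codecs.BOM_UTF8 + plain

# Decoding with "utf-8" keeps the BOM as U+FEFF; "utf-8-sig" strips it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

print(prefix_matches, repr(kept), repr(stripped))
```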

Source: (StackOverflow)

What is the best way to remove accents in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
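
The two steps described above can be sketched with the standard library alone, assuming Python 3. The function name strip_accents is made up for this illustration, and "Mn" (nonspacing mark) is used here as the stand-in for the "diacritic" type.

```python
import unicodedata

def strip_accents(text):
    # Step 1: decompose each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop the combining marks (general category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Fran\u00e7ois M\u00fcller, Capit\u00e1n"))
```

Letters that have no decomposition, such as "ø", pass through unchanged, so this is not a full transliteration.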

Source: (StackOverflow)

Why does 2+ 40 equal 42?

I was baffled when a colleague showed me this line of JavaScript alerting 42.

alert(2+ 40);

It quickly turns out that what looks like a minus sign is actually an arcane Unicode character with clearly different semantics.

This left me wondering why that character doesn't produce a syntax error when the expression is parsed. I'd also like to know if there are more characters behaving like this.
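
The question does not name the character. Assuming it is U+1680 OGHAM SPACE MARK, which some fonts draw as a dash, the sketch below (Python 3) inspects its Unicode properties and lists the other characters in the same "space separator" category (Zs), which ECMAScript treats as whitespace.

```python
import sys
import unicodedata

suspect = "\u1680"  # assumption: the character from the question
name = unicodedata.name(suspect)
category = unicodedata.category(suspect)
print(name, category)

# Every code point whose general category is Zs (space separator).
space_separators = [
    "U+%04X %s" % (cp, unicodedata.name(chr(cp), "<unnamed>"))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]
for entry in space_separators:
    print(entry)
```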

Source: (StackOverflow)

What is the _snowman param in Ruby on Rails 3 forms for?

In Ruby on Rails 3 (currently using Beta 4), I see that when using the form_tag or form_for helpers there is a hidden field named _snowman with the value of ☃ (U+2603, decimal 9731) showing up.

So, what is this for?

Source: (StackOverflow)

Placing Unicode character in CSS content value

I have a problem. I have found the HTML code for the downwards arrow, &darr; (↓)

Cool. Now I need to use it in CSS like so:

nav a:hover {content:"&darr";}

That obviously won't work since &darr; is an HTML entity. There seems to be little info about the "escaped unicode" symbols that are used in CSS. I found other symbols like \2020, but no arrows. What are the arrow codes?
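
A CSS escape is a backslash followed by the character's code point in hexadecimal, so the codes can be computed rather than looked up. A rough sketch in Python 3; the helper name css_escape is invented for this example.

```python
def css_escape(char):
    # Backslash plus the code point in hex, e.g. the dagger becomes \2020.
    return "\\{:X}".format(ord(char))

arrows = {"left": "\u2190", "up": "\u2191", "right": "\u2192", "down": "\u2193"}
escapes = {name: css_escape(ch) for name, ch in arrows.items()}

for name, escape in sorted(escapes.items()):
    print('{}: content: "{}";'.format(name, escape))
```

The downwards arrow comes out as \2193. If an escape is directly followed by a character that could be read as another hex digit, a single space after the escape terminates it.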

Source: (StackOverflow)

Why is the length of this string longer than the number of characters in it?

This code:

string a = "abc";
string b = "A𠈓C";
Console.WriteLine("Length a = {0}", a.Length);
Console.WriteLine("Length b = {0}", b.Length);

outputs:

Length a = 3
Length b = 4

Why? The only thing I could imagine is that the Chinese character is 2 bytes long and that the .Length method returns the byte count.
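
The three ways of counting can be compared side by side. A sketch in Python 3, where len() counts code points; the UTF-16 figure is the one that matches the Length value reported in the question.

```python
text = "A𠈓C"

code_points = len(text)                           # Unicode code points
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit code units
utf8_bytes = len(text.encode("utf-8"))            # bytes in UTF-8

print(code_points, utf16_units, utf8_bytes)  # 3 4 6
```

The middle character lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (two code units) for it.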

Source: (StackOverflow)

How to get string objects instead of Unicode ones from JSON in Python?

I'm using Python 2 to parse JSON from (ASCII encoded) text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects.

The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.

Is it possible to get string objects instead of Unicode ones from json or simplejson?

Here's a small example:

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
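
One possible approach is to convert the parsed structure after loading. The sketch below is written to run on both Python 2.7 and Python 3; the helper name byteify and the default ASCII encoding are assumptions for this illustration, not part of json or simplejson.

```python
import json

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3

def byteify(value, encoding="ascii"):
    # Recursively encode text to byte strings (str on Python 2, bytes on 3).
    if isinstance(value, TEXT_TYPE):
        return value.encode(encoding)
    if isinstance(value, list):
        return [byteify(item, encoding) for item in value]
    if isinstance(value, dict):
        return {
            byteify(key, encoding): byteify(item, encoding)
            for key, item in value.items()
        }
    return value  # numbers, booleans and None pass through unchanged

new_list = byteify(json.loads(json.dumps(["a", "b"])))
print(new_list)
```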

Source: (StackOverflow)

What does the 'b' character do in front of a string literal?

Apparently, the following is valid syntax...

my_string = b'The string'

I would like to know...

  1. What does this b character in front of the string mean?
  2. What are the effects of using it?
  3. What are appropriate situations to use it?

I found a related question right here on SO, but it is about PHP. It states that the b indicates a binary string as opposed to a Unicode one, which was needed to keep code compatible when migrating from PHP < 6 to PHP 6. I don't think this applies to Python.

I did find this documentation on the Python site about using a u character in the same syntax to specify a string as unicode. Unfortunately it doesn't mention the b character anywhere in that document.

Also, just out of curiosity, are there more symbols than the b and u that do other things?
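
For illustration, this is how the prefix behaves in Python 3, where b'...' creates a bytes object rather than text. The variable names follow the question where possible; the rest are illustrative.

```python
my_string = b"The string"  # bytes: a sequence of integers 0..255
text = "The string"        # str: a sequence of Unicode code points

print(type(my_string), type(text))

# Bytes and text never compare equal; convert explicitly with decode/encode.
decoded = my_string.decode("ascii")
first_byte = my_string[0]  # indexing bytes yields an integer

# Other prefixes: r (raw, backslashes kept literally) and rb (raw bytes).
raw = r"C:\new"
raw_bytes = rb"\x00"

print(decoded == text, first_byte, len(raw), len(raw_bytes))
```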

Source: (StackOverflow)

Unicode (utf8) reading and writing to files in python

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

# the string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

print ss, ss8    
print >> open('f1','w'), ss8

>>> file('f1').read() 

So I type in Capit\xc3\xa1n into my favorite editor, in file f2.


>>> open('f1').read()
>>> open('f2').read()
>>> open('f1').read().decode('utf8')
>>> open('f2').read().decode('utf8')

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

Edit: What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ascii representation of this unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
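
A sketch of a round trip that lets the file layer handle the encoding, using io.open, which is available from Python 2.6 onward (so not on the 2.4 mentioned above). The temporary path is a stand-in for the f1 file. The last lines show an ASCII-only escaped form that decodes back to the same text.

```python
import io
import os
import tempfile

ss = u"Capit\xe1n"
path = os.path.join(tempfile.mkdtemp(), "f1")  # hypothetical location for f1

# Write text; the file object encodes it to UTF-8.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(ss)

# Read it back as text, and once more as raw bytes for comparison.
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()
with io.open(path, "rb") as handle:
    raw_bytes = handle.read()

# ASCII-only representation that can be turned back into the text.
escaped = ss.encode("unicode_escape")
restored = escaped.decode("unicode_escape")

print(repr(raw_bytes), repr(escaped), round_tripped == ss, restored == ss)
```

Typing the characters \xc3\xa1 into an editor stores literal backslashes and letters, not the two UTF-8 bytes, which is why that file reads back differently.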

Source: (StackOverflow)

Unicode characters in Windows command line - how?

We have a project in TFS that has a non-English character (š) in it. When trying to script a few build-related things we've stumbled upon a problem: we can't pass the š letter to the command line tools. The command prompt or something else mangles it, and the tf.exe utility can't find the specified project.

I've tried different formats for the .bat file (ANSI, UTF-8 with and without BOM) as well as scripting it in JavaScript (which is inherently Unicode), but no luck. Does anybody have an idea how to execute a program and pass it a Unicode command line?

Source: (StackOverflow)

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?

In what way are these helpful for programmers?
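
A quick way to see the practical difference is to encode one character with several codecs and compare the results. A sketch in Python 3; cp1252 stands in for "ANSI" here, which is an assumption, since on Windows that term means whatever the system code page happens to be.

```python
sample = "\u0161"  # the letter š

codecs_to_try = ["utf-8", "utf-16-le", "utf-32-le", "utf-7", "cp1252"]
encoded = {codec: sample.encode(codec) for codec in codecs_to_try}
sizes = {codec: len(data) for codec, data in encoded.items()}

# ASCII only covers code points 0..127, so it cannot represent this letter.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

for codec in codecs_to_try:
    print(codec, sizes[codec], encoded[codec])
print("ascii can encode it:", ascii_ok)
```

Unicode itself is the character set that assigns the code point; the UTF variants are different byte serializations of that same code point.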

Source: (StackOverflow)