Homework 8
Due Monday, November 27
This homework deals with Unicode and UTF-8. You will need to
understand the unicodedata module. I also suggest you read The
Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!). There is also a
personal history of the development of UTF-8 by Rob Pike here.
You can find UTF-8 test pages for browsers here.
If your browser shows funny squiggles for a given page, it supports
that script. If it shows question marks or hex codes, it does not.
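As a warm-up, here is a minimal sketch of the unicodedata calls these exercises lean on. It assumes Python 3, where str is already Unicode and UTF-8 data arrives as bytes; the helper name describe and the sample string are made up for illustration and are not part of the assignment.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name) for one character."""
    # unicodedata.name raises ValueError for code points without a name
    # unless a default is supplied, so we pass one.
    return ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")

# Bytes as they might appear in a UTF-8 file (hypothetical sample text).
raw = "\u03a9mega".encode("utf-8")

# Decode to a Unicode string before inspecting individual characters.
text = raw.decode("utf-8")

for ch in text:
    print(describe(ch))
```

The general category is a two-letter code: "Lu" marks an uppercase letter, "Ll" a lowercase letter, and "Cn" a code point with no assigned character.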
-
For this exercise, consider the Unicode code points from 0 to 2^16-1
(the Basic Multilingual Plane); not all of
those code points are valid characters. Write a function that uses
functionality from the unicodedata module to figure out which code
points are valid. Your function should return a list of
numbers corresponding to valid code points.
-
Write a similar function that returns a list of
numbers corresponding to code points that are capital
letters.
-
Write a function that takes a list of code points and returns the
names of (valid) code points in that list. Run
this on the list from the previous exercise and use the returned list
to show which scripts have case distinction.
-
Write a function that takes an arbitrary UTF-8 string and downcases
it. You will need to decode it to Unicode, downcase it there, and then
re-encode the result as UTF-8.
-
In /home/rws/HomeworkExercises/HW8 you'll find eight small
text files, each in a different script. Tell me what script each one is
in. No cheating, such as reading them into a browser and using your
knowledge of writing systems to figure it out: I want you to do it
using the Python tools, and show me what you did.
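One possible line of attack for exercises like the last one, sketched under assumptions: Unicode character names usually begin with the script name (for example "CYRILLIC SMALL LETTER A"), so tallying the first word of each letter's name gives a rough guess. The function name script_guess is hypothetical, the code assumes Python 3, and the first-word heuristic is only a heuristic, not an official script property lookup.

```python
import unicodedata
from collections import Counter

def script_guess(data):
    """Guess a script from UTF-8 bytes by tallying the first word
    of each letter's Unicode name. Returns None if no letters are found."""
    text = data.decode("utf-8")
    tally = Counter()
    for ch in text:
        # Only letters (general categories starting with "L") say much
        # about the script; digits and punctuation are shared widely.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                tally[name.split()[0]] += 1
    return tally.most_common(1)[0][0] if tally else None
```

To try it on a file, read the file in binary mode and pass the bytes in. Printing the whole tally rather than only the top entry is worth doing too, since mixed-script text will show up there.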