Homework 8

Due Monday, November 27

This homework deals with Unicode and UTF-8. You will need to understand the unicodedata module. I also suggest you read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". There is also a personal history of the development of UTF-8 by Rob Pike here.
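
As a quick orientation to the unicodedata module, here is a minimal sketch of two of its lookups, name() and category(). It assumes Python 3, where str holds Unicode text; the helper describe() is a made-up name for illustration and is not part of the module.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name or None, general category) for one character."""
    # name() raises ValueError for characters without a name unless a
    # default is supplied; passing None as the default avoids the exception.
    name = unicodedata.name(ch, None)
    # category() returns a two-letter general category such as "Lu"
    # (uppercase letter), "Ll" (lowercase letter) or "Cc" (control).
    return ord(ch), name, unicodedata.category(ch)

print(describe("A"))        # (65, 'LATIN CAPITAL LETTER A', 'Lu')
print(describe("\u03a9"))   # (937, 'GREEK CAPITAL LETTER OMEGA', 'Lu')
print(describe("\n"))       # (10, None, 'Cc')
```

Note that a character can have a general category without having a name, as the newline shows, so the two lookups answer slightly different questions.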

You can find UTF-8 test pages for browsers here. If your browser shows funny squiggles for a given page, it supports that script. If it shows question marks or hex codes, it does not.

  1. For this exercise, consider the Unicode code points from 0 to 2^16-1 (the Basic Multilingual Plane). Not all of those code points are valid characters. Write a function that uses functionality from the unicodedata module to figure out which code points are valid. Your function should return a list of numbers corresponding to valid code points.

  2. Write a similar function that returns a list of numbers corresponding to code points that are capital letters.

  3. Write a function that takes a list of code points and returns the names of (valid) code points in that list. Run this on the list from the previous exercise and use the returned list to show which scripts have case distinction.

  4. Write a function that takes an arbitrary UTF-8 string and downcases it. You will need to do this in Unicode and then re-encode back to UTF-8.

  5. In /home/rws/HomeworkExercises/HW8 you'll find eight small text files, each in a different script. Tell me what script each one is in. No cheating like reading them into a browser and using your knowledge of writing systems to figure it out: I want you to do it using the Python tools, and show me what you did.
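
The exercises above share a few building blocks, which can be sketched as follows. This is only one possible starting point and not a reference solution. It assumes Python 3 (bytes for UTF-8 data, str for Unicode text), and the helper names code_points_in_category, transform_utf8 and name_prefix_counts are invented for illustration. Treating the first word of a character's name as a hint about its script is likewise an assumption that you should check against real data.

```python
import unicodedata
from collections import Counter

def code_points_in_category(prefix, limit=2**16):
    """List code points below limit whose general category starts with prefix."""
    return [cp for cp in range(limit)
            if unicodedata.category(chr(cp)).startswith(prefix)]

def transform_utf8(data, func):
    """Decode UTF-8 bytes, apply func to the resulting str, and re-encode."""
    return func(data.decode("utf-8")).encode("utf-8")

def name_prefix_counts(text):
    """Count the first word of each named character's Unicode name."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, None)  # None for unnamed characters
        if name is not None:
            counts[name.split()[0]] += 1
    return counts

uppercase = code_points_in_category("Lu")
print(65 in uppercase, 97 in uppercase)   # True False

sample = "\u03a9MEGA".encode("utf-8")     # capital omega followed by "MEGA"
print(transform_utf8(sample, str.lower) == "\u03c9mega".encode("utf-8"))  # True

print(name_prefix_counts("\u03a9mega"))
```

The last call tallies one GREEK and four LATIN characters. Whitespace and punctuation contribute their own prefixes (for example SPACE), so real text needs some filtering before the counts say anything about script.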