Unicode

From VistrailsWiki

This page discusses the steps and issues with getting proper unicode support in VisTrails.

Goal

VisTrails should be able to handle any kind of unicode string. This means any string should be accepted as input from the console, the API, or GUI settings, and should be serialized and deserialized correctly. Basically, every string in the program should be a unicode object. In the few places where we can accept a bytestring (e.g. from the API or modules), it should be converted to unicode.

Mentions of str should mainly disappear in favor of unicode and bytes.
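
A minimal sketch of converting at the boundary, as described above. The helper name to_unicode and the UTF-8 default are assumptions made for illustration; they are not existing VisTrails code.

```python
def to_unicode(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a bytestring.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):
        # Bytestring from the API or a module: decode once, at the boundary.
        return value.decode(encoding)
    return value


label = to_unicode(b"caf\xc3\xa9")  # UTF-8 bytes in, text out
same = to_unicode(u"caf\u00e9")     # text passes through unchanged
```

Decoding in one place keeps the rest of the code free of casts, so everything past the boundary can assume it is working with unicode.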

About Python 3

Proper unicode support is a prerequisite to Python 3 support. Once VisTrails is fully unicode-safe and using unicode_literals, replacing unicode with six.text_type is very easy.
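
To give an idea of what that replacement looks like, here is a small sketch. It defines a local text_type alias the same way six does (str on Python 3, unicode on Python 2) so that it runs without the third-party six package; version_label is a made-up function used only to show the cast.

```python
import sys

# Same idea as six.text_type: the text type of the running interpreter.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only defined on Python 2)


def version_label(number):
    # Explicit cast to text; Python 2-only code would write unicode(number).
    return u"version " + text_type(number)
```

In real code the alias would simply be imported from six instead of being defined by hand.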

Issues

The GUI has no problem with unicode: PyQt is fully unicode-safe. Since we are using API v2, PyQt already returns (and accepts) native unicode objects. It used to return QString objects, which (I assume) is why we have str() casts everywhere in the code. There are probably too many of these to deal with manually, which is why I propose to replace all of them with unicode() automatically and deal with special cases afterwards.

Serialization to and deserialization from XML is not a problem, since XML documents are already unicode. Properly encoding filenames in them might require some work (using proper URL-encoded URLs instead is probably a good idea); see Locators.
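
As a rough illustration of the URL-encoding idea, the sketch below percent-encodes a unicode filename as UTF-8 and decodes it again. The helper names encode_filename and decode_filename are hypothetical, and choosing UTF-8 as the byte encoding is an assumption; this is not how locators are currently written.

```python
try:
    from urllib.parse import quote, unquote_to_bytes  # Python 3
except ImportError:
    from urllib import quote, unquote as unquote_to_bytes  # Python 2


def encode_filename(name):
    """Percent-encode a unicode filename as UTF-8 (hypothetical helper)."""
    return quote(name.encode("utf-8"))


def decode_filename(encoded):
    """Reverse encode_filename, returning a unicode string."""
    return unquote_to_bytes(encoded).decode("utf-8")


encoded = encode_filename(u"r\u00e9sum\u00e9 1.vt")
# encoded == "r%C3%A9sum%C3%A9%201.vt"
```

The encoded form is pure ASCII, so it can be stored in any XML attribute without depending on the document's encoding.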

Guidelines

  • Please familiarize yourself with unicode and encodings, and do not put casts in the code unless they are necessary. Be aware that some Python modules only accept str (like tarfile, zipfile, ...), so some compatibility code might be needed around them.
  • Stop using str() when unnecessary. Document why you use str(), unicode() or bytes() when you do.
  • Stop using type() for type checks; isinstance() is the usual replacement.
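
A sketch of what such compatibility code could look like, with the cast documented as the guidelines ask. The shim to_native_str is hypothetical, the UTF-8 default is an assumption, and zipfile member names are treated as a str-only boundary purely for illustration.

```python
import io
import zipfile


def to_native_str(value, encoding="utf-8"):
    """Return `value` as the interpreter's native str (hypothetical shim).

    On Python 2 this encodes unicode to a bytestring; on Python 3 it
    decodes bytes to text. The UTF-8 default is an assumption.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)  # Python 3: bytes -> str
    return value.encode(encoding)      # Python 2: unicode -> str


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    # Native str needed here: this sketch assumes the archive API wants
    # str member names, so the cast is confined to this call.
    archive.writestr(to_native_str(u"donn\u00e9es.txt"), b"payload")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as archive:
    names = archive.namelist()
```

Note that the shim uses isinstance() rather than type() and keeps the conversion next to the one call that needs it.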

Tasks

Mass-replacing str() casts

From (regex) To
(?<!def )(?<![a-z_.])str\(         unicode(
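
To check what the pattern does and does not touch, it can be tried on a few lines with Python's re module. The sample lines below are invented for this check and are not taken from the VisTrails source.

```python
import re

# The pattern from the table above, unchanged.
STR_CAST = re.compile(r"(?<!def )(?<![a-z_.])str\(")

samples = [
    "name = str(value)",
    "print(repr(str(x)))",
    "def str(self):",
    "text = obj.str(1)",
    "text = to_str(value)",
]

rewritten = [STR_CAST.sub("unicode(", line) for line in samples]
for before, after in zip(samples, rewritten):
    print("%-24s -> %s" % (before, after))
```

The first two lines are rewritten. The last three are left alone because the lookbehinds skip definitions (def str), method calls (.str) and names that merely end in str (to_str).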

Locators

Tests

We need to make sure everything works with unicode, so it's probably a good idea to insert non-ASCII characters here and there in the tests (filenames, versions, parameters, ...).
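
A sketch of what such a test could look like for parameter values. The roundtrip_parameter function is only a stand-in for real VisTrails serialization, and the element and attribute names are made up; the point is the non-ASCII test data.

```python
import unittest
import xml.etree.ElementTree as ET


def roundtrip_parameter(value):
    """Serialize `value` as an XML attribute and read it back.

    Stand-in for real VisTrails serialization; the element and
    attribute names are made up for this sketch.
    """
    elem = ET.Element("parameter")
    elem.set("value", value)
    data = ET.tostring(elem, encoding="utf-8")  # UTF-8 encoded bytes
    return ET.fromstring(data).get("value")


class TestUnicodeParameters(unittest.TestCase):
    def test_non_ascii_values_survive(self):
        # Accented Latin, CJK, and a symbol outside Latin-1.
        for value in (u"caf\u00e9", u"\u65e5\u672c\u8a9e", u"\u2603 snowman"):
            self.assertEqual(roundtrip_parameter(value), value)
```

Mixing scripts in the test data helps catch code paths that only happen to work for Latin-1 characters.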