2018-05-22

Python encoding

Documentation

https://docs.python.org/2/howto/unicode.html

ddd

Functions for querying the system's encoding settings in Python

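System default encoding (ascii on Python 2, utf-8 on Python 3): sys.getdefaultencoding()    (this is the encoding used for implicit decoding)
Current system locale and encoding: locale.getdefaultlocale()
Locale as temporarily changed in code (via locale.setlocale(locale.LC_ALL, "zh_CN.UTF-8")): locale.getlocale()
Filesystem encoding: sys.getfilesystemencoding()
Terminal input encoding: sys.stdin.encoding   (similar to the filesystem encoding)
Terminal output encoding: sys.stdout.encoding
Default encoding of the source code: the declaration at the top of the file, # -*- coding: utf-8 -*-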
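
To compare these values side by side on a given machine, a small sketch like the one below may help. The name `encoding_report`, the helper `_safe`, and the dict keys are made up for illustration, and `locale.getpreferredencoding()` is an extra call that is not in the list above. The sketch targets Python 3, where `locale.getdefaultlocale()` is deprecated, so that call is guarded.

```python
import locale
import sys
import warnings


def _safe(func):
    """Call func(); return None if the locale cannot be parsed."""
    try:
        return func()
    except ValueError:
        return None


def encoding_report():
    """Collect the encoding-related settings listed above into a dict."""
    report = {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        "locale": _safe(locale.getlocale),
        # stdin/stdout can be None or replaced by objects without `.encoding`
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }
    # getdefaultlocale() is deprecated in recent Python 3 releases,
    # so only call it when it still exists, and silence the warning.
    getdefault = getattr(locale, "getdefaultlocale", None)
    if getdefault is not None:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            report["default_locale"] = _safe(getdefault)
    return report


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%-15s %r" % (name, value))
```

Running it once in cmd, once in an IDE console, and once in a Linux terminal should make it clear which values depend on the environment and which do not.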

Source encoding declaration

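The encoding declaration at the top of a file determines how Python 2 interprets str literals in the source. For example, if the header declares utf-8, then for s="中文" Python treats the literal as UTF-8, and repr(s) shows the bytes \xe4\xb8\xad\xe6\x96\x87. If the header declares gbk, Python treats s as GBK and the result is \xd6\xd0\xce\xc4.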
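
The two byte sequences can be reproduced without touching any file header by encoding the same text explicitly. This is only a sketch of the relationship between the text and its bytes; the variable names are illustrative, and the characters are written as escapes so the snippet does not depend on how it is saved.

```python
text = u"\u4e2d\u6587"  # the two characters of the example string s

utf8_bytes = text.encode("utf-8")  # what a utf-8 source file stores
gbk_bytes = text.encode("gbk")     # what a gbk source file stores

# Same text, different bytes, different lengths.
assert utf8_bytes == b"\xe4\xb8\xad\xe6\x96\x87"
assert gbk_bytes == b"\xd6\xd0\xce\xc4"
assert (len(text), len(utf8_bytes), len(gbk_bytes)) == (2, 6, 4)

print(repr(utf8_bytes))
print(repr(gbk_bytes))
```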

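Note that the file's actual encoding must match the encoding declared in its header, otherwise things break. On Linux, the actual encoding can be checked in vim with the command :set fenc. If the file is actually saved as gbk while the header declares utf-8, any Chinese text in the source causes trouble: the Chinese str is stored as GBK bytes, but Python assumes UTF-8 when parsing it and reports SyntaxError: (unicode error) 'utf8' codec can't decode byte.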
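
The mismatch can be simulated directly on bytes: GBK bytes are generally not valid UTF-8, which is what the parser trips over. A minimal sketch, with `decodes_as` as a hypothetical helper name:

```python
gbk_bytes = u"\u4e2d\u6587".encode("gbk")  # what a GBK-saved file contains


def decodes_as(data, encoding):
    """Return True if `data` is valid in `encoding`, False otherwise."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(gbk_bytes, "gbk"))    # the encoding the file was saved in
print(decodes_as(gbk_bytes, "utf-8"))  # the encoding the header claims
```

The second check fails for the same reason the SyntaxError is raised: the bytes were written with one codec and read with another.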

###File encoding (text file encoding)

In Eclipse: Project / Resource / Text file encoding

###Default encoding issues

###Encoding when reading and writing files

string byte

An immutable sequence of Unicode characters is called a string. An immutable sequence of integers between 0 and 255 is called a bytes object.
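
Both properties are easy to check in Python 3. The helper `is_immutable` below is a made-up name for this sketch; it simply tries an item assignment and reports whether the object refuses it.

```python
text = "\u4e2d\u56fd abc"   # str: a sequence of Unicode characters
data = text.encode("utf-8")  # bytes: a sequence of integers in 0..255

assert isinstance(text, str) and isinstance(data, bytes)
assert all(0 <= value <= 255 for value in data)


def is_immutable(obj, replacement):
    """Return True if item assignment on `obj` raises TypeError."""
    try:
        obj[0] = replacement
    except TypeError:
        return True
    return False


print(is_immutable(text, "x"))            # str
print(is_immutable(data, 0))              # bytes
print(is_immutable(bytearray(data), 0))   # bytearray is the mutable variant
```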

###Encoding examples

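>>> a=u'中国 abc'    #  u'\u4e2d\u56fd abc'        type: unicode    len(a)=6   the space counts as 1, each Chinese character counts as 1
>>> a='中国 abc'      #  '\xd6\xd0\xb9\xfa abc'    type: str   len(a)=8   (GBK terminal) the space counts as 1, each Chinese character counts as 2
>>> a=u'中国abc'.encode('utf8')     #'\xe4\xb8\xad\xe5\x9b\xbdabc'  type: str   len(a)=9   each Chinese character counts as 3 (this is the encoding typically used in URLs)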
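
The session above is Python 2. A rough Python 3 equivalent, where the text/bytes split is explicit, might look like this; the variable names are illustrative, and `urllib.parse.quote` is used to show the URL form of the UTF-8 bytes.

```python
from urllib.parse import quote

text = "\u4e2d\u56fd abc"                    # Unicode text
as_gbk = text.encode("gbk")                  # 2 bytes per Chinese character
as_utf8 = "\u4e2d\u56fdabc".encode("utf-8")  # 3 bytes per Chinese character

lengths = (len(text), len(as_gbk), len(as_utf8))
assert lengths == (6, 8, 9)

# quote() percent-encodes the UTF-8 bytes by default.
quoted = quote("\u4e2d\u56fd")
assert quoted == "%E4%B8%AD%E5%9B%BD"
print(lengths, quoted)
```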

###Open questions

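Which encoding does the change described in http://jingyan.baidu.com/article/e75aca85440f01142edac636.html actually modify?
Which encoding does the setting in mobaxterm modify?
Eclipse has many encoding-related options; which encoding does each one modify?
Which encodings do the settings in textstudio modify?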

###Windows

The default display encoding of cmd is GBK (on a Chinese-locale Windows).
To switch it to UTF-8, run the command CHCP 65001 and change the font.
See http://jingyan.baidu.com/article/e75aca85440f01142edac636.html

Decode str using the file encoding or the terminal's input encoding.

###Linux

Default encoding on my Ubuntu

Decode str using the file encoding or the terminal's input encoding.

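Because s is a str, Python 2 automatically decodes s to unicode first and then encodes it to gb18030.
Since the decoding is done implicitly and we never said how to decode, Python uses the encoding reported by sys.getdefaultencoding().
The decoding should really use the file encoding or the terminal's input encoding, so an error occurs.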
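
Python 3 no longer performs this implicit step, but it can be imitated to show where the error comes from. In the sketch below, `implicit_reencode` and `explicit_reencode` are hypothetical helper names; the first mimics what Python 2 does when encode() is called on a str, the second names the source encoding explicitly.

```python
# Bytes as they would sit in a Python 2 str read from a UTF-8 source.
s = u"\u4e2d\u56fd".encode("utf-8")


def implicit_reencode(data, target):
    """Mimic Python 2's str.encode(): decode as ascii first, then encode."""
    return data.decode("ascii").encode(target)


def explicit_reencode(data, source, target):
    """Decode with the encoding the bytes were really written in."""
    return data.decode(source).encode(target)


try:
    implicit_reencode(s, "gb18030")
except UnicodeDecodeError as exc:
    print("implicit path failed:", exc)

print(repr(explicit_reencode(s, "utf-8", "gb18030")))
```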

Platform-independent

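When Python 2 is installed, the default encoding is ascii (see sys.getdefaultencoding()). When a program runs into non-ASCII data, Python often fails with UnicodeDecodeError: 'ascii' codec can't decode byte 0x?? in position 1: ordinal not in range(128), because the ascii codec cannot handle non-ASCII bytes. The usual workaround is to set Python's default encoding yourself, typically to utf8.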
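
An alternative that leaves the default encoding alone is to decode at the boundary with an explicit list of candidate encodings. The sketch below is only a heuristic: `to_text` is a made-up helper, the candidate order ("utf-8" before "gb18030") is an assumption that suits mixed UTF-8/GBK data, and guessing can still pick the wrong codec for ambiguous bytes.

```python
def to_text(data, encodings=("utf-8", "gb18030")):
    """Decode bytes by trying each candidate encoding in order.

    Non-bytes input (already text) is returned unchanged.
    """
    if not isinstance(data, bytes):
        return data
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the input" % (encodings,))


expected = u"\u4e2d\u56fd"
assert to_text(expected.encode("utf-8")) == expected
assert to_text(expected.encode("gbk")) == expected
assert to_text(u"abc") == u"abc"
```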

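Why is sys.getdefaultencoding() = ascii in both cmd and Eclipse? (Python 2 fixes this value itself at startup, so it does not depend on the terminal or the IDE.)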

Transcoding examples

BERT example

# https://github.com/google-research/bert/blob/master/tokenization.py#L27
import six
def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text.decode("utf-8", "ignore")
    elif isinstance(text, unicode):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")

def printable_text(text):
  """Returns text encoded in a way suitable for print or `tf.logging`."""

  # These functions want `str` for both Python2 and Python3, but in one case
  # it's a Unicode string and in the other it's a byte string.
  if six.PY3:
    """
    Python 3 distinguishes str (Unicode text) from bytes.
    Python 2 has str (a byte string) and unicode (Unicode text).
    """
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text
    elif isinstance(text, unicode):
      return text.encode("utf-8")  # only UTF-8 is supported here, which is limiting
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")
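
One thing worth knowing about the `"ignore"` error handler used in `convert_to_unicode` above: when the input is not actually UTF-8, undecodable bytes are dropped silently instead of raising. The following sketch feeds GBK bytes through the same `decode` call with different error handlers; the variable names are illustrative.

```python
# One Chinese character plus ASCII, encoded as GBK rather than UTF-8.
gbk_bytes = u"\u4e2d abc".encode("gbk")

ignored = gbk_bytes.decode("utf-8", "ignore")    # invalid bytes are dropped
replaced = gbk_bytes.decode("utf-8", "replace")  # invalid bytes become U+FFFD

try:
    gbk_bytes.decode("utf-8")                    # default handler is "strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The Chinese character is lost without any error being reported.
assert ignored == u" abc"
assert u"\ufffd" in replaced and replaced.endswith(u" abc")
assert strict_failed
```

So the function is safe to call on any bytes, at the price of hiding encoding mistakes in the input.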

tensor2tensor example

Language detection

# https://github.com/google-research/bert/blob/master/tokenization.py#L201
def _is_chinese_char(self, cp):
  """Checks whether CP is the codepoint of a CJK character."""
  # This defines a "chinese character" as anything in the CJK Unicode block:
  #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
  #
  # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
  # despite its name. The modern Korean Hangul alphabet is a different block,
  # as is Japanese Hiragana and Katakana. Those alphabets are used to write
  # space-separated words, so they are not treated specially and handled
  # like the all of the other languages.
  if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
      (cp >= 0x3400 and cp <= 0x4DBF) or  #
      (cp >= 0x20000 and cp <= 0x2A6DF) or  #
      (cp >= 0x2A700 and cp <= 0x2B73F) or  #
      (cp >= 0x2B740 and cp <= 0x2B81F) or  #
      (cp >= 0x2B820 and cp <= 0x2CEAF) or
      (cp >= 0xF900 and cp <= 0xFAFF) or  #
      (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
    return True

  return False
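
As a usage sketch, the same ranges can be wrapped in standalone functions and applied to whole strings. `CJK_RANGES`, `is_cjk_codepoint`, and `contains_cjk` are names invented here, not part of BERT. Note that this detects CJK ideographs only; it does not tell Chinese apart from Japanese kanji, and it deliberately ignores kana and Hangul.

```python
# Code point ranges copied from the BERT function above.
CJK_RANGES = (
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
)


def is_cjk_codepoint(cp):
    """Return True if the integer code point falls in one of the ranges."""
    return any(low <= cp <= high for low, high in CJK_RANGES)


def contains_cjk(text):
    """Return True if any character of `text` is a CJK ideograph."""
    return any(is_cjk_codepoint(ord(ch)) for ch in text)


print(contains_cjk(u"abc \u4e2d"))       # Latin text plus one ideograph
print(contains_cjk(u"\u3042\ud55c abc"))  # Hiragana and Hangul only
```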
