Python 3 study notes: the str type and the string module

Environment: Windows 10 Home (Chinese edition), Python 3.6.4.

Python 3.7 official documentation:

Text Sequence Type — str

string — Common string operations

The str type

Python (here meaning Python 3) has a built-in string type, str. A string is an immutable sequence of Unicode code points.
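A minimal sketch of what "immutable sequence of code points" means in practice. The variable names are made up for illustration:

```python
word = "中文abc"

# Length and indexing count Unicode code points, not bytes.
length = len(word)          # 5
first = word[0]             # '中'
code_point = ord(first)     # 20013, i.e. 0x4E2D

# Item assignment is rejected because str is immutable.
try:
    word[0] = "英"
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string really means building a new one.
changed = "英" + word[1:]   # '英文abc'
```

The original object bound to word is never modified; changed is a separate string.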

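String literals can be written in three ways:

single quotes, double quotes, and triple quotes (three single or three double quotes).

Single and double quotes can be nested inside each other, and a triple-quoted string can contain both kinds of quote and span multiple lines. To put a quote of the same kind inside a string, escape it with a backslash.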
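The quoting rules above can be sketched as follows (the sample sentences are invented for illustration):

```python
single = 'He said "hi"'     # double quotes nested inside single quotes
double = "It's fine"        # a single quote nested inside double quotes
escaped = 'It\'s fine'      # the same quote type, escaped with a backslash

# Triple quotes may contain both quote types and real line breaks.
triple = """line one
line two with 'single' and "double" quotes"""

lines = triple.splitlines()
```

Here double and escaped are equal strings; only the way they are spelled in source code differs.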

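A string can also be created with the str constructor:

class str(object='')
class str(object=b'', encoding='utf-8', errors='strict')

Note that the second form builds a string from a bytes-like object (e.g. bytes or bytearray), i.e. it decodes bytes into a string, so the encoding argument must match the encoding the bytes were actually written in.

Note also that str(bytes, encoding, errors) is equivalent to bytes.decode(encoding, errors).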
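A small sketch of the two constructor forms, assuming UTF-8 encoded input; the sample text is arbitrary:

```python
data = "中文".encode("utf-8")          # b'\xe4\xb8\xad\xe6\x96\x87'

# Both spellings decode the bytes with the given encoding.
via_constructor = str(data, encoding="utf-8")
via_decode = data.decode("utf-8")

# Without an encoding, str() does not decode at all; it returns the
# printable representation of the bytes object instead.
not_decoded = str(data)

# The errors argument controls what happens with undecodable bytes.
replaced = str(b"\xff", encoding="utf-8", errors="replace")  # '\ufffd'
```

Forgetting the encoding argument is an easy mistake: the result is still a str, but it starts with b' rather than holding the decoded text.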

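Update:

When two string literals are separated only by whitespace, they are automatically joined into a single string literal.

>>> "sdfs" "www"
'sdfswww'
>>> ("sdfs" "www")
'sdfswww'
>>> "sdfs"         "www" # several spaces
'sdfswww'

Reference: the lexical grammar of string literals (fairly complex and not obvious at first glance; worth digging into later, skipped here).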

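Strings are immutable and there is no mutable string type, but a string can be assembled from many fragments with the str.join() method or with the io.StringIO class. Their signatures are:

str.join(iterable)

class io.StringIO(initial_value='', newline='\n')

The latter I still need to dig into; the former I understand reasonably well.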
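Both approaches can be sketched side by side; the list of fragments is an assumption made up for this example:

```python
import io

parts = ["alpha", "beta", "gamma"]

# str.join: the separator string is placed between the items.
joined = ", ".join(parts)            # 'alpha, beta, gamma'

# io.StringIO: an in-memory text buffer that is written to like a file.
buffer = io.StringIO()
for part in parts:
    buffer.write(part)
    buffer.write("\n")
built = buffer.getvalue()            # 'alpha\nbeta\ngamma\n'
```

join suits the case where all fragments are already in a list; StringIO is convenient when the pieces are produced step by step or when an API expects a file-like object.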

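For a long time I did not understand what the r and u prefixes in front of a string do. Now I do:

-r marks a raw string: every character stands for itself, so a backslash is just a backslash rather than the start of an escape sequence. '\n' is a single character (a newline), while r'\n' is two characters: a backslash and a lowercase n.

-u marks a Unicode string. Python 3 keeps it only for compatibility with Python 2; since every Python 3 string is already Unicode, the prefix is unnecessary, and it cannot be combined with r.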
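A short sketch of the two prefixes; the Windows path is a made-up example:

```python
newline = "\n"        # one character: a line feed
raw = r"\n"           # two characters: a backslash and the letter n
same_as_raw = "\\n"   # the non-raw spelling of the same two characters

# Raw strings are handy for Windows paths and regular expressions,
# where \n or \t would otherwise be read as escape sequences.
windows_path = r"C:\new\table.txt"

legacy = u"abc"       # identical to "abc" in Python 3
```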

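The repr() function: when should it be used? I am not entirely sure. The RUNOOB tutorial explains that repr() converts an object into a form the interpreter can read back, that is, a string that can be passed to eval(). This round trip holds for many built-in types, though not for every object.

Some tests:

>>> x = 'n'
>>> y = '\n'
>>> d1 = 123
>>> f1 = 999.87
>>> repr(x), repr(y), repr(d1), repr(f1)
("'n'", "'\\n'", '123', '999.87')
>>> len(repr(x)), len(repr(y)), len(repr(d1)), len(repr(f1))
(3, 4, 3, 6)
>>> eval(repr(x)), eval(repr(y)), eval(repr(d1)), eval(repr(f1))
('n', '\n', 123, 999.87)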
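One way to see when repr() is useful is to compare it with str() on an object where the two differ. This is only a sketch using datetime.date from the standard library; the date itself is arbitrary:

```python
import datetime

day = datetime.date(2018, 1, 2)

readable = str(day)        # '2018-01-02', meant for people
unambiguous = repr(day)    # 'datetime.date(2018, 1, 2)', meant for developers

# The repr text is valid Python, so eval() can rebuild an equal object
# as long as the name "datetime" is available in the namespace.
round_trip = eval(unambiguous, {"datetime": datetime})
```

In practice repr() is mostly used for debugging and logging, where it matters that '123' (a string) and 123 (an integer) look different.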

Attributes and methods of str objects: dir(str) lists them all (in Python 3.6.4, 44 of them are public methods that can be called directly):

>>> for attr in dir(str):
	print(attr, type(getattr(str, attr)))

__add__ <class 'wrapper_descriptor'>
__class__ <class 'type'>
__contains__ <class 'wrapper_descriptor'>
__delattr__ <class 'wrapper_descriptor'>
__dir__ <class 'method_descriptor'>
__doc__ <class 'str'>
__eq__ <class 'wrapper_descriptor'>
__format__ <class 'method_descriptor'>
__ge__ <class 'wrapper_descriptor'>
__getattribute__ <class 'wrapper_descriptor'>
__getitem__ <class 'wrapper_descriptor'>
__getnewargs__ <class 'method_descriptor'>
__gt__ <class 'wrapper_descriptor'>
__hash__ <class 'wrapper_descriptor'>
__init__ <class 'wrapper_descriptor'>
__init_subclass__ <class 'builtin_function_or_method'>
__iter__ <class 'wrapper_descriptor'>
__le__ <class 'wrapper_descriptor'>
__len__ <class 'wrapper_descriptor'>
__lt__ <class 'wrapper_descriptor'>
__mod__ <class 'wrapper_descriptor'>
__mul__ <class 'wrapper_descriptor'>
__ne__ <class 'wrapper_descriptor'>
__new__ <class 'builtin_function_or_method'>
__reduce__ <class 'method_descriptor'>
__reduce_ex__ <class 'method_descriptor'>
__repr__ <class 'wrapper_descriptor'>
__rmod__ <class 'wrapper_descriptor'>
__rmul__ <class 'wrapper_descriptor'>
__setattr__ <class 'wrapper_descriptor'>
__sizeof__ <class 'method_descriptor'>
__str__ <class 'wrapper_descriptor'>
__subclasshook__ <class 'builtin_function_or_method'>
capitalize <class 'method_descriptor'>
casefold <class 'method_descriptor'>
center <class 'method_descriptor'>
count <class 'method_descriptor'>
encode <class 'method_descriptor'>
endswith <class 'method_descriptor'>
expandtabs <class 'method_descriptor'>
find <class 'method_descriptor'>
format <class 'method_descriptor'>
format_map <class 'method_descriptor'>
index <class 'method_descriptor'>
isalnum <class 'method_descriptor'>
isalpha <class 'method_descriptor'>
isdecimal <class 'method_descriptor'>
isdigit <class 'method_descriptor'>
isidentifier <class 'method_descriptor'>
islower <class 'method_descriptor'>
isnumeric <class 'method_descriptor'>
isprintable <class 'method_descriptor'>
isspace <class 'method_descriptor'>
istitle <class 'method_descriptor'>
isupper <class 'method_descriptor'>
join <class 'method_descriptor'>
ljust <class 'method_descriptor'>
lower <class 'method_descriptor'>
lstrip <class 'method_descriptor'>
maketrans <class 'builtin_function_or_method'>
partition <class 'method_descriptor'>
replace <class 'method_descriptor'>
rfind <class 'method_descriptor'>
rindex <class 'method_descriptor'>
rjust <class 'method_descriptor'>
rpartition <class 'method_descriptor'>
rsplit <class 'method_descriptor'>
rstrip <class 'method_descriptor'>
split <class 'method_descriptor'>
splitlines <class 'method_descriptor'>
startswith <class 'method_descriptor'>
strip <class 'method_descriptor'>
swapcase <class 'method_descriptor'>
title <class 'method_descriptor'>
translate <class 'method_descriptor'>
upper <class 'method_descriptor'>
zfill <class 'method_descriptor'>

'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs',
'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit',
'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper',
'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex',
'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip',
'swapcase', 'title', 'translate', 'upper', 'zfill'

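They cover searching, stripping leading and trailing whitespace, classifying characters, splitting (splitting on several different delimiters at once needs the re module), case conversion, conversion to bytes (encode), string formatting (briefly introduced later in this post), centering, left and right justification, replacement (replace), and so on.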
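A quick tour of a few of these method families, as a sketch with invented sample strings:

```python
line = "  Hello, World  "

stripped = line.strip()                      # 'Hello, World'
position = stripped.find("World")            # 7 (find returns -1 if absent)
upper = stripped.upper()                     # 'HELLO, WORLD'
words = stripped.split(", ")                 # ['Hello', 'World']
replaced = stripped.replace("World", "Python")
centered = "hi".center(6, "*")               # '**hi**'
encoded = "é".encode("utf-8")                # b'\xc3\xa9'
is_digit = "2018".isdigit()                  # True
```

Every call returns a new object; line itself keeps its surrounding spaces.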

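The string module

The string module contains some string constants, plus the Formatter class, the Template class, and one helper function, capwords (string.capwords(s, sep=None)).

The Formatter class handles string formatting, and subclassing it lets you build custom formatters. The Template class provides simple string substitution; its main use case is internationalization (i18n).

The string constants (which do not seem all that useful to me) are: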

string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase
string.digits
string.hexdigits
string.octdigits
string.punctuation
string.printable
string.whitespace

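Below is a test. It raised an error that at first looked like a conflict with the rule described earlier, because the two strings were not joined. There is no real conflict: implicit concatenation applies only to adjacent string literals, not to names or other expressions, which must be joined with + instead:

>>> import string
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.ascii_letters string.digits
SyntaxError: invalid syntax
>>> '2323' 'sdsds'
'2323sdsds'
>>> 
>>> type(string.digits)
<class 'str'>
>>> type(string.ascii_letters)
<class 'str'>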
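A sketch of how the constants are typically combined and used. The helper is_hex is a hypothetical function written for this example, not part of the string module:

```python
import string

# Names are expressions, so they are joined with + (or with str.join).
alphabet = string.ascii_letters + string.digits
also_alphabet = "".join([string.ascii_letters, string.digits])

def is_hex(text):
    """Return True if text is non-empty and contains only hex digits."""
    return bool(text) and all(ch in string.hexdigits for ch in text)

# capwords splits on whitespace, capitalizes each word, and rejoins
# the words with single spaces.
title = string.capwords("hello   python  world")   # 'Hello Python World'
```

The constants are ordinary str objects, so membership tests such as ch in string.hexdigits work without any extra machinery.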

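Study notes:

After going through both str and string, I found that the string module is rarely needed; most string functionality lives in the str type. The exception is the Template class, and even that can be replaced by str's own formatting features, although Template is more convenient because its syntax is simpler.

As for the Formatter class, the string module documentation says it uses the same syntax as str.format(), but developers can subclass Formatter to implement their own formatting behavior (most developers probably never need to). Both syntaxes are related to formatted string literals but differ in places; see the official documentation.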
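To make the subclassing idea concrete, here is a minimal sketch. UpperFormatter and its "upper" format spec are hypothetical names invented for this example; only string.Formatter and its format_field hook come from the standard library:

```python
import string

class UpperFormatter(string.Formatter):
    """Hypothetical formatter that understands an extra 'upper' spec."""

    def format_field(self, value, format_spec):
        if format_spec == "upper":
            return str(value).upper()
        # Everything else falls back to the standard behaviour.
        return super().format_field(value, format_spec)

formatter = UpperFormatter()
shouted = formatter.format("{0:upper}, {name:>6}!", "hello", name="tim")
# 'HELLO,    tim!'
```

Overriding format_field is only one of the available hooks; get_value and convert_field can be overridden in the same way.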

A brief introduction to string formatting

From the above, Python has four string formatting syntaxes:

1.printf-style String Formatting

format % values
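A few printf-style calls, as a sketch with made-up values:

```python
name, age, ratio = "Tom", 3, 0.456

sentence = "%s is %d years old" % (name, age)      # 'Tom is 3 years old'
percent = "%5.1f%%" % (ratio * 100)                # ' 45.6%' (%% is a literal %)
by_key = "%(who)s likes %(what)s" % {"who": "tim", "what": "tea"}
padded = "%05d|%-5s|%x" % (42, "ab", 255)          # '00042|ab   |ff'
```

A tuple supplies positional values, while a dict is used together with the %(key)s form.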

2.the newer formatted string literals

A formatted string literal or f-string is a string literal that is prefixed with 'f' or 'F'.

Explanation: prefix a string literal with f or F and it can be formatted with names from the current namespace, without having to pass the format string and the variables together as the other syntaxes require. Quite advanced indeed!

Usage examples:

>>> import decimal
>>> from datetime import datetime
>>> name = "Fred"
>>> f"He said his name is {name!r}."
"He said his name is 'Fred'."
>>> f"He said his name is {repr(name)}."  # repr() is equivalent to !r
"He said his name is 'Fred'."
>>> width = 10
>>> precision = 4
>>> value = decimal.Decimal("12.34567")
>>> f"result: {value:{width}.{precision}}"  # nested fields
'result:      12.35'
>>> today = datetime(year=2017, month=1, day=27)
>>> f"{today:%B %d, %Y}"  # using date format specifier
'January 27, 2017'
>>> number = 1024
>>> f"{number:#0x}"  # using integer format specifier
'0x400'

3.the str.format() interface / string.Formatter

str.format(*args, **kwargs)

str.format_map(mapping) does much the same as str.format(**mapping) (similar rather than identical; the documentation says: Similar to str.format(**mapping), except that mapping is used directly and not copied to a dict.)
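The difference only shows up when the mapping is a dict subclass with its own behaviour. In the sketch below, Default is a hypothetical subclass that supplies a placeholder for missing keys:

```python
class Default(dict):
    """Hypothetical dict subclass that fills in missing keys."""

    def __missing__(self, key):
        return "<" + key + ">"

template = "{name} was born in {country}"

# format_map uses the mapping object directly, so __missing__ is honoured.
filled = template.format_map(Default(name="Guido"))
# 'Guido was born in <country>'

# format(**mapping) copies the items into a plain dict first,
# so the missing key raises KeyError.
try:
    template.format(**Default(name="Guido"))
    raised = False
except KeyError:
    raised = True
```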

Examples:

>>> "The sum of 1 + 2 is {0}".format(1+2)
'The sum of 1 + 2 is 3'

>>> '{0}, {1}, {2}'.format('a', 'b', 'c')
'a, b, c'
>>> '{}, {}, {}'.format('a', 'b', 'c')  # 3.1+ only
'a, b, c'
>>> '{2}, {1}, {0}'.format('a', 'b', 'c')
'c, b, a'
>>> '{2}, {1}, {0}'.format(*'abc')      # unpacking argument sequence
'c, b, a'
>>> '{0}{1}{0}'.format('abra', 'cad')   # arguments' indices can be repeated
'abracadabra'

>>> 'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'
>>> coord = {'latitude': '37.24N', 'longitude': '-115.81W'}
>>> 'Coordinates: {latitude}, {longitude}'.format(**coord)
'Coordinates: 37.24N, -115.81W'

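For more examples, see Format Examples in the string module documentation; it has many advanced or more complex usages, suitable for further study.

4.template strings (PEP 292)

Substitution is based on the dollar sign $, with two placeholder forms, $identifier and ${identifier}; two dollar signs ($$) are an escape that stands for a single $. Simple indeed.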

class string.Template(template)

-substitute(mapping, **kwds)

-safe_substitute(mapping, **kwds)

Developers can subclass Template to implement custom template classes.
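As a sketch of such a subclass, the hypothetical PercentTemplate below changes the placeholder delimiter from $ to % by overriding the documented delimiter class attribute:

```python
from string import Template

class PercentTemplate(Template):
    """Hypothetical template class that uses % instead of $."""

    delimiter = "%"

text = PercentTemplate("%who owes $5 and is 100%% sure")
result = text.substitute(who="tim")
# 'tim owes $5 and is 100% sure'
```

With this delimiter, $ becomes an ordinary character and %% is the escape for a literal percent sign.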

Usage examples from the official documentation:

>>> from string import Template
>>> s = Template('$who likes $what')
>>> s.substitute(who='tim', what='kung pao')
'tim likes kung pao'
>>> d = dict(who='tim')
>>> Template('Give $who $100').substitute(d)
Traceback (most recent call last):
...
ValueError: Invalid placeholder in string: line 1, col 11
>>> Template('$who likes $what').substitute(d)
Traceback (most recent call last):
...
KeyError: 'what'
>>> Template('$who likes $what').safe_substitute(d)
'tim likes $what'

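Study notes:

Syntax 1 is similar to the format strings of the printf function in C;

Syntax 2: see reference link 2; I have not read it closely yet;

Syntax 3 is used by str.format() and by the Formatter class in the string module. It is related to syntax 2: f-strings arrived later (Python 3.6) and reuse the same format specification mini-language;

Syntax 4 is the simple string substitution facility provided by the string module.

I now know the basic usage of all of them. The harder part is understanding their grammar, so the grammars for syntax 2 and syntax 3 are reproduced below from the official documentation (see it for the detailed explanation). These two are the most difficult:

Syntax 2: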

f_string          ::=  (literal_char | "{{" | "}}" | replacement_field)*
replacement_field ::=  "{" f_expression ["!" conversion] [":" format_spec] "}"
f_expression      ::=  (conditional_expression | "*" or_expr)
                         ("," conditional_expression | "," "*" or_expr)* [","]
                       | yield_expression
conversion        ::=  "s" | "r" | "a"
format_spec       ::=  (literal_char | NULL | replacement_field)*
literal_char      ::=  <any code point except "{", "}" or NULL>

Syntax 3:

format_spec     ::=  [[fill]align][sign][#][0][width][grouping_option][.precision][type]
fill            ::=  <any character>
align           ::=  "<" | ">" | "=" | "^"
sign            ::=  "+" | "-" | " "
width           ::=  digit+
grouping_option ::=  "_" | ","
precision       ::=  digit+
type            ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"
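Reading the format_spec grammar is easier with concrete specs next to it. The following sketch uses the built-in format() function with arbitrary sample values; each comment names the grammar pieces involved:

```python
# fill='*', align='^', sign='+', width=12, precision=3, type='f'
centered = format(3.14159, "*^+12.3f")    # '***+3.142***'

# the 0 flag, width=8, precision=3, type='f'
zero_padded = format(42, "08.3f")         # '0042.000'

# grouping_option=','
grouped = format(1234567, ",")            # '1,234,567'

# '#' adds the 0x prefix, type='x'
hex_prefixed = format(255, "#x")          # '0xff'

# precision=1, type='%' multiplies by 100 and appends a percent sign
percent = format(0.25, ".1%")             # '25.0%'

# the same spec syntax goes after the colon inside str.format fields
left = "{:<6}|".format("ab")              # 'ab    |'
```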

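Questions:

How did people come up with this way of describing a grammar, and what is the notation called in computer science? I have seen it in other documents too without understanding it. It is a variant of Backus-Naur Form (BNF), the notation the Python language reference uses for its lexical and syntax descriptions, so yes, it is related to compiler theory. Whoever writes such grammars must be very clever or highly accomplished in computer science, though they are surely standing on the shoulders of earlier pioneers such as the creators of C, and the lineage can be traced further back still.

That is it for this post. It covers nearly every aspect of Python's str type and string module, which will do for now. The str methods still need focused practice, and string formatting needs practice in more scenarios (the official examples deserve careful study).

Going further

While studying, I found that str.split did not split a Chinese sentence the way I wanted. The cause is not the Chinese text: str.split treats its argument as one whole separator string, and '了的' never occurs in the sentence as a substring. Splitting on any one of several characters needs the split function of the re module. This, along with related Chinese-text topics (word segmentation? word clouds? natural language processing?), still needs digging into: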

>>> cnstr = '姑娘还晒出了自己的辞职信，引发众多网友关注。姑娘说，她在这家公司上班6年，一个月3.5k左右。在辞职信中，她列出了7条离职原因。没想好做什么，但是不能继续这样下去了’'
>>> cnstr.split('了的')
['姑娘还晒出了自己的辞职信，引发众多网友关注。姑娘说，她在这家公司上班6年，一个月3.5k左右。在辞职信中，她列出了7条离职原因。没想好做什么，但是不能继续这样下去了’']
>>> len(cnstr.split('了的')) # no split happened; the returned list has length 1
1

>>> import re
>>> re.split('[了的]', cnstr) # '[了的]' is a regular expression (a character class)
['姑娘还晒出', '自己', '辞职信，引发众多网友关注。姑娘说，她在这家公司上班6年，一个月3.5k左右。在辞职信中，她列出', '7条离职原因。没想好做什么，但是不能继续这样下去', '’']
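The same behaviour can be reproduced on a shorter sentence, which makes it clear that the separator, not the language, is what matters. The sample text is invented for this sketch:

```python
import re

text = "春天，夏天。秋天，冬天"

# A separator that really occurs in the text works fine with str.split.
by_comma = text.split("，")            # ['春天', '夏天。秋天', '冬天']

# A two-character separator is searched for as a whole substring;
# "，。" never occurs, so nothing is split.
unsplit = text.split("，。")           # ['春天，夏天。秋天，冬天']

# A regex character class splits on either character.
by_either = re.split("[，。]", text)   # ['春天', '夏天', '秋天', '冬天']
```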

Reference links

1.Python3 字符串 (Python 3 strings), from RUNOOB.COM

2.Formatted string literals

3.Python中文本分割的具体方式 (Concrete ways to split text in Python)
