tongsiying

Reading | Exercise | Self-discipline

Part 34: Python Regular Expressions

I. Common methods of the re module:

1. match()

match(pattern, string, flags=0)

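match() only checks whether the pattern matches at the start of the string. It returns a match object on success and None otherwise.

  • pattern: the regular expression
  • string: the string to match
  • flags: flags that control how the pattern is matched

import re

obj = re.match(r'\d+', '123uuasf')
if obj:
    print(obj.group())  # 123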
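
The flags argument is easier to picture with a small example. The sketch below uses re.IGNORECASE, one of the standard flags in the re module; the sample text is made up for illustration.

```python
import re

text = "Hello\nhello world"

# Without flags the match is case-sensitive, so 'hello' does not match 'Hello'.
no_flag = re.match(r'hello', text)

# re.IGNORECASE (alias re.I) makes the pattern case-insensitive.
with_flag = re.match(r'hello', text, flags=re.IGNORECASE)

print(no_flag)             # None
print(with_flag.group())   # Hello
```

Several flags can be combined with |, for example re.I | re.M.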

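2. search()

search(pattern, string, flags=0)

search() scans the whole string for the pattern and stops at the first match, returning an object that holds the match information. Calling group() on that object returns the matched string. If nothing in the string matches, search() returns None.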

import re

obj = re.search(r'\d+', 'u123uu888asf')
if obj:
    print(obj.group())  # 123
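
To make the difference between match() and search() concrete, here is a minimal side-by-side sketch on the same input. re.fullmatch() is not covered above; it is included only as a related standard function that requires the entire string to match.

```python
import re

text = 'u123uu888asf'

at_start = re.match(r'\d+', text)      # None: the string starts with 'u'
anywhere = re.search(r'\d+', text)     # first run of digits anywhere in the string
whole = re.fullmatch(r'\d+', '123')    # the pattern must cover the entire string

print(at_start)            # None
print(anywhere.group())    # 123
print(anywhere.span())     # (1, 4)
print(whole.group())       # 123
```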

3. group and groups

import re

a = "123abc456"
m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", a)

print(m.group())    # 123abc456 (the whole match, same as group(0))
print(m.group(0))   # 123abc456
print(m.group(1))   # 123
print(m.group(2))   # abc

print(m.groups())   # ('123', 'abc', '456')
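
A match object offers a little more than group() and groups(). The sketch below reuses the same string and pattern and shows a few other standard match-object methods; treat it as an illustration rather than a complete tour.

```python
import re

a = "123abc456"
m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", a)

# Passing several indexes returns a tuple of those groups.
print(m.group(1, 3))   # ('123', '456')

# start(), end() and span() report where a group matched in the string.
print(m.start(2))      # 3
print(m.end(2))        # 6
print(m.span(2))       # (3, 6)
```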

4. findall()

findall(pattern, string, flags=0)

The two methods above return a single match, i.e. only one occurrence in the string. To get every occurrence that matches the pattern, use findall.

import re

obj = re.findall(r'\d+', 'fa123uu888asf')
print(obj)
# ['123', '888']
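
One detail of findall that often surprises people: when the pattern contains capture groups, the result holds the groups rather than the full matches. A short sketch, with re.finditer shown as the standard alternative when match objects are needed:

```python
import re

text = 'fa123uu888asf'

# With no capture group, findall returns the full matches.
print(re.findall(r'\d+', text))            # ['123', '888']

# With capture groups, findall returns one tuple of groups per match.
pairs = re.findall(r'([a-z]+)(\d+)', text)
print(pairs)                               # [('fa', '123'), ('uu', '888')]

# finditer yields match objects, so positions are available too.
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(spans)                               # [('123', (2, 5)), ('888', (7, 10))]
```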

5. sub()

sub(pattern, repl, string, count=0, flags=0)

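Replaces the parts of the string that match the pattern. count limits the number of replacements; the default 0 replaces every match.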

import re

content = "123abc456"
new_content = re.sub(r'\d+', 'sb', content)
# new_content = re.sub(r'\d+', 'sb', content, count=1)  # replace only the first match
print(new_content)
# sbabcsb

It is more powerful than str.replace.
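
As a rough illustration of what sub() can do that str.replace cannot, the sketch below uses a function as the replacement and a backreference in the replacement string. The variable names are only for this example.

```python
import re

content = "123abc456"

# A callable replacement receives each match object and returns the new text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), content)
print(doubled)    # 246abc912

# Backreferences in the replacement string reuse captured groups.
swapped = re.sub(r'(\d+)([a-z]+)', r'\2\1', content)
print(swapped)    # abc123456

# subn also reports how many replacements were made.
print(re.subn(r'\d+', 'N', content))   # ('NabcN', 2)
```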

6. split()

split(pattern, string, maxsplit=0, flags=0)

Splits the string at every match of the pattern.

import re

content = "'1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"
new_content = re.split(r'\*', content)
# new_content = re.split(r'\*', content, maxsplit=1)
print(new_content)
# ["'1 - 2 ", ' ((60-30+1', '(9-2', '5/3+7/3', '99/4', '2998+10', '568/14))-(-4', '3)/(16-3', "2) )'"]

content = "'1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"
new_content = re.split(r'[+\-*/]+', content)
# new_content = re.split(r'[+\-*/]+', content, maxsplit=1)
print(new_content)
# ["'1 ", ' 2 ', ' ((60', '30', '1', '(9', '2', '5', '3', '7', '3', '99', '4', '2998', '10', '568', '14))', '(', '4', '3)', '(16', '3', "2) )'"]

inpp = '1-2*((60-30 +(-40-5)*(9-2*5/3 + 7 /3*99/4*2998 +10 * 568/14 )) - (-4*3)/ (16-3*2))'
inpp = re.sub(r'\s+', '', inpp)  # strip all whitespace
# Split once at the first parenthesised expression of the form (a op b);
# the capture group keeps that expression in the result.
new_content = re.split(r'\(([+\-*/]?\d+[+\-*/]?\d+)\)', inpp, maxsplit=1)
print(new_content)
# ['1-2*((60-30+', '-40-5', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))']

It is more powerful than str.split.
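
The last example above relies on a detail worth isolating: when the pattern contains a capture group, re.split keeps the separators in the result. A smaller sketch of the same behaviour, using a made-up expression string:

```python
import re

expr = '9-2*5/3+7'

# Without a capture group the operators are discarded.
print(re.split(r'[-+*/]', expr))      # ['9', '2', '5', '3', '7']

# With a capture group the separators are kept in the result.
tokens = re.split(r'([-+*/])', expr)
print(tokens)   # ['9', '-', '2', '*', '5', '/', '3', '+', '7']

# maxsplit limits the number of splits.
print(re.split(r'[-+*/]', expr, maxsplit=1))   # ['9', '2*5/3+7']
```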

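II. Single-character matching

Character Function
. Matches any single character except \n. Because . means "any character", a literal '.' has to be written with an escape, as \.
[ ] Matches any one of the characters listed inside the brackets. For example, [a-zA-Z0-9] matches any letter or digit and [a-zA-Z] matches any letter. Note that there are no spaces between the ranges.
\d Matches a digit, i.e. 0-9
\D Matches a non-digit
\s Matches whitespace, e.g. space, tab, newline
\S Matches non-whitespace
\w Matches a word character, i.e. a-z, A-Z, 0-9, _ (for str patterns in Python 3 it also matches other Unicode letters and digits unless re.ASCII is set)
\W Matches a non-word character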
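
The table can be checked quickly with findall on a short sample string. This is only a sketch; the sample string is made up and contains a letter, a digit, a space, an underscore, a tab and a dot.

```python
import re

sample = 'a1 _B\t.'

print(re.findall(r'\d', sample))      # ['1']
print(re.findall(r'\D', sample))      # ['a', ' ', '_', 'B', '\t', '.']
print(re.findall(r'\s', sample))      # [' ', '\t']
print(re.findall(r'\w', sample))      # ['a', '1', '_', 'B']
print(re.findall(r'[a-z]', sample))   # ['a']
print(re.findall(r'\.', sample))      # ['.']

# . matches everything except the newline.
print(re.findall(r'.', 'x\ny'))       # ['x', 'y']
```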

Single-character matching examples

In [8]: ma = re.match(r'.','b')

In [9]: ma.group()
Out[9]: 'b'

In [10]: ma = re.match(r'.','0')

In [11]: ma.group()
Out[11]: '0'

In [13]: ma = re.match(r'{.}','{a}')

In [14]: ma.group()
Out[14]: '{a}'

In [15]: ma = re.match(r'{.}','{0}')

In [16]: ma.group()
Out[16]: '{0}'

In [17]: ma = re.match(r'{..}','{01}')

In [18]: ma.group()
Out[18]: '{01}'

In [19]: ma = re.match(r'{[abc]}','{a}')

In [20]: ma.group()
Out[20]: '{a}'

In [21]: ma = re.match(r'{[a-z]}','{d}')

In [22]: ma.group()
Out[22]: '{d}'

In [23]: ma = re.match(r'{[a-zA-Z]}','{A}')

In [24]: ma.group()
Out[24]: '{A}'

In [25]: ma = re.match(r'{[a-zA-Z0-9]}','{0}')

In [26]: ma.group()
Out[26]: '{0}'

In [27]: ma = re.match(r'{[\w]}','{ }')

In [28]: ma

In [29]: ma = re.match(r'{[\W]}','{ }')

In [30]: ma
Out[30]: <_sre.SRE_Match object; span=(0, 3), match='{ }'>

In [31]: ma.group()
Out[31]: '{ }'

In [32]: ma = re.match(r'{[\W]}','{9}')

In [33]: ma.group()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-33-7c62fc675aee> in <module>()
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [34]: ma

In [35]: ma = re.match(r'[[\w]]','[a]')

In [36]: ma

In [37]: ma = re.match(r'\[[\w]\]','[a]')

In [38]: ma.group()
Out[38]: '[a]'

In [39]: ma = re.match(r'\[[\w]\]','[0]')

In [40]: ma.group()
Out[40]: '[0]'

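III. Quantifiers

Character Function
* Matches the preceding character zero or more times, i.e. it is optional and may repeat
+ Matches the preceding character one or more times, i.e. it appears at least once
? Matches the preceding character zero or one time, i.e. it is either present once or absent
{m} Matches the preceding character exactly m times
{m,} Matches the preceding character at least m times
{m,n} Matches the preceding character from m to n times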
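
Each row of the table can be tried out with a tiny helper. The function matches below is a hypothetical helper written for this sketch (it is not part of re); it uses re.fullmatch so that the quantifier has to account for the whole string.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'ab*', 'a'))           # True  (* allows zero b's)
print(matches(r'ab*', 'abbb'))        # True
print(matches(r'ab+', 'a'))           # False (+ needs at least one b)
print(matches(r'ab?', 'abb'))         # False (? allows at most one b)
print(matches(r'\d{3}', '123'))       # True
print(matches(r'\d{2,}', '1'))        # False
print(matches(r'\d{2,4}', '12345'))   # False
```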

Multi-character matching examples

In [1]: import re

In [2]: ma = re.match(r'[A-Z][a-z]','Aa')

In [3]: ma.group()
Out[3]: 'Aa'

In [4]: ma = re.match(r'[A-Z][a-z]','A')

In [6]: ma

In [8]: ma = re.match(r'[A-Z][a-z]*','A')

In [9]: ma
Out[9]: <_sre.SRE_Match object; span=(0, 1), match='A'>

In [10]: ma.group()
Out[10]: 'A'

In [12]: ma = re.match(r'[A-Z][a-z]*','Asdsdwqass')

In [14]: ma.group()
Out[14]: 'Asdsdwqass'

In [15]: ma = re.match(r'[A-Z][a-z]*','1Asdsdwqass')

In [16]: ma

In [17]: ma = re.match(r'[A-Z][a-z]*','Asd1sdwqass')

In [18]: ma.group()
Out[18]: 'Asd'

In [19]: ma = re.match(r'[_a-zA-Z]+[_\w]*','10')

In [20]: ma

In [21]: ma = re.match(r'[_a-zA-Z]+[_\w]*','_ht11')

In [22]: ma.group()
Out[22]: '_ht11'

In [23]: ma = re.match(r'[1-9]?[0-9]','99')

In [24]: ma.group()
Out[24]: '99'

In [25]: ma = re.match(r'[1-9]?[0-9]','90')

In [26]: ma.group()
Out[26]: '90'

In [27]: ma = re.match(r'[1-9]?[0-9]','9')

In [28]: ma.group()
Out[28]: '9'

In [29]: ma = re.match(r'[1-9]?[0-9]','0')

In [30]: ma.group()
Out[30]: '0'

In [31]: ma = re.match(r'[1-9]?[0-9]','09')

In [32]: ma.group()
Out[32]: '0'

In [33]: ma = re.match(r'[a-zA-Z0-9]{6}','abc123')

In [34]: ma.group()
Out[34]: 'abc123'

In [35]: ma = re.match(r'[a-zA-Z0-9]{6}','abc1234')

In [36]: ma.group()
Out[36]: 'abc123'

In [37]: ma = re.match(r'[a-zA-Z0-9]{6}','abc1__')

In [38]: ma

In [39]: ma = re.match(r'[a-zA-Z0-9]{6}@163\.com','abc123@163.com')

In [40]: ma.group()
Out[40]: 'abc123@163.com'

In [41]: ma = re.match(r'[a-zA-Z0-9]{6,10}@163\.com','abc1234@163.com')

In [42]: ma.group()
Out[42]: 'abc1234@163.com'

In [43]: ma = re.match(r'[0-9][a-z]*?','1bc')

In [44]: ma.group()
Out[44]: '1'

In [45]: ma = re.match(r'[0-9][a-z]*','1bc')

In [46]: ma.group()
Out[46]: '1bc'

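IV. Boundaries

Character Function
^ Matches the start of the string
$ Matches the end of the string
\b Matches a word boundary, i.e. the position between a word character and a non-word character such as a space (or the start or end of the string). For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".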
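
The "never" and "verb" cases from the table can be run directly. The sketch below uses raw strings, which matters here because in a normal string '\b' is the backspace character; the other sample strings are made up.

```python
import re

# \b: 'er' must end at a word boundary.
ends_word = re.search(r'er\b', 'never') is not None
inside_word = re.search(r'er\b', 'verb') is not None
print(ends_word, inside_word)     # True False

# \B: 'er' must NOT be followed by a word boundary.
print(re.search(r'er\B', 'verb') is not None)    # True

# \b on both sides selects whole words only.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
print(whole_words)                # ['cat', 'cat']

# ^ and $ anchor the pattern to both ends of the string.
print(re.search(r'^\d+$', '123abc'))           # None
print(re.search(r'^\d+$', '123').group())      # 123
```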

Boundary examples

In [48]: ma = re.match(r'[a-zA-Z0-9]{6,10}@163\.com','abc1234@163.comabc')

In [49]: ma.group()
Out[49]: 'abc1234@163.com'

In [50]: ma = re.match(r'[a-zA-Z0-9]{6,10}@163\.com$','abc1234@163.comabc')

In [51]: ma

In [52]: ma = re.match(r'^[a-zA-Z0-9]{6,10}@163\.com$','abc1234@163.com')

In [53]: ma.group()
Out[53]: 'abc1234@163.com'

In [54]: ma = re.match(r'\Aimooc[\w]*','imoocpython')

In [55]: ma.group()
Out[55]: 'imoocpython'

In [56]: ma = re.match(r'\Aimooc[\w]*','iimooc')

In [57]: ma.group()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-57-7c62fc675aee> in <module>()
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

# Match a mobile phone number
result = re.match(r'1[35678]\d{9}$','15735177116')

result
# <_sre.SRE_Match object; span=(0, 11), match='15735177116'>

result.group()
# '15735177116'

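V. Grouping

Character Function
| Matches either the expression on its left or the one on its right
(ab) Treats the characters inside the parentheses as a group
\num Refers back to the string matched by group number num
(?P<name>) Gives the group the alias name
(?P=name) Refers back to the string matched by the group whose alias is name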
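
Here is a compact sketch that exercises every row of the table once. The addresses and tags are made-up sample data, and the patterns are simplified for illustration rather than meant for real email or HTML validation.

```python
import re

# | chooses between alternatives; parentheses limit its scope.
m = re.match(r'(\w+)@(163|126)\.com$', 'imooc@126.com')
print(m.groups())            # ('imooc', '126')

# \1 must repeat exactly what group 1 captured.
tag = re.match(r'<(\w+)>\w+</\1>$', '<book>python</book>')
print(tag.group(1))          # book
print(re.match(r'<(\w+)>\w+</\1>$', '<book>python</page>'))   # None

# (?P<name>...) names a group; (?P=name) refers back to it.
named = re.match(r'<(?P<mark>\w+)>\w+</(?P=mark)>$', '<h1>title</h1>')
print(named.group('mark'))   # h1
```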

Grouping examples

In [59]: ma = re.match(r'abc|d','abc')

In [60]: ma.group()
Out[60]: 'abc'

In [61]: ma = re.match(r'abc|d','d')

In [62]: ma.group()
Out[62]: 'd'

In [63]: ma = re.match(r'[1-9]?\d$','9')

In [64]: ma.group()
Out[64]: '9'

In [65]: ma = re.match(r'[1-9]?\d$','99')

In [66]: ma.group()
Out[66]: '99'

In [67]: ma = re.match(r'[1-9]?\d$','09')

In [68]: ma

In [69]: ma = re.match(r'[1-9]?\d$','100')

In [70]: ma.group()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-70-7c62fc675aee> in <module>()
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [71]: ma = re.match(r'[1-9]?\d$|100','100')

In [72]: ma.group()
Out[72]: '100'

In [73]: ma = re.match(r'[1-9]?\d$|100','99')

In [74]: ma.group()
Out[74]: '99'

In [75]: ma = re.match(r'[\w]{4,6}@163.com','imooc@163.com')

In [76]: ma.group()
Out[76]: 'imooc@163.com'

In [77]: ma = re.match(r'[\w]{4,6}@(163,123).com','imooc@163.com')

In [78]: ma = re.match(r'[\w]{4,6}@(163,123).com','imooc@123.com')

In [79]: ma.group()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-79-7c62fc675aee> in <module>()
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [80]: ma = re.match(r'[\w]{4,6}@(163|123).com','imooc@123.com')

In [81]: ma.group()
Out[81]: 'imooc@123.com'

In [82]: ma = re.match(r'<[\w]+>','<book>')

In [83]: ma.group()
Out[83]: '<book>'

In [84]: ma = re.match(r'<([\w]+>)','<book>')

In [85]: ma.group()
Out[85]: '<book>'

In [86]: ma = re.match(r'<([\w]+>)\1','<book>')

In [87]: ma.groups()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-87-f4e4ca66607d> in <module>()
----> 1 ma.groups()

AttributeError: 'NoneType' object has no attribute 'groups'

In [88]: ma = re.match(r'<([\w]+>)\1','<book>book>')

In [89]: ma.groups()
Out[89]: ('book>',)

In [90]: ma.group()
Out[90]: '<book>book>'

In [91]: ma = re.match(r'<([\w]+>\1','<book>book>')   # raises re.error: the group is never closed

In [3]: ma = re.match(r'<([\w]+>)[\w]+</\1','<book>python</book>')

In [4]: ma.group()
Out[4]: '<book>python</book>'

In [5]: ma = re.match(r'<([\w]+>)[\w]+</\1','<book>python</book1>')

In [6]: ma


In [9]: ma = re.match(r'<(?P<mark>[\w]+>)[\w]+</(?P=mark)','<book>python</book>')

In [10]: ma.group()
Out[10]: '<book>python</book>'

# Match an email address
p = r'(\w+)@(163|126|gmail|qq)\.(com|cn|net)$'

r = re.match(p,'zhang@qq.com')

r
# <_sre.SRE_Match object; span=(0, 12), match='zhang@qq.com'>

r.group()
# 'zhang@qq.com'

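VI. Greedy and non-greedy matching in Python

Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible.

Adding ? after "*", "?", "+" or "{m,n}" turns a greedy quantifier into a non-greedy one.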

s = 'this is a number 234-235-22-432'
r = re.match(r'.+(\d+-\d+-\d+-\d+)',s)
r.group(1)
Out[32]: '4-235-22-432'

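Huh? Why is this different from what we expected? Python quantifiers are greedy by default, so the leading .+ consumes as much as it can and leaves only a single digit for the first \d+. The fix is to add ? after * or +.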

r = re.match(r'(.+?)(\d+-\d+-\d+-\d+)',s)
r.groups()
Out[33]: ('this is a number ', '234-235-22-432')
r = re.match(r'(.+?)(\d+-\d+-\d+-\d+)',s)
r.group(1)
Out[34]: 'this is a number '
r = re.match(r'(.+?)(\d+-\d+-\d+-\d+)',s)
r.group(2)
Out[35]: '234-235-22-432'
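
One more illustration of greedy versus non-greedy quantifiers, on a made-up HTML-like string where the difference is easy to see. It is only a sketch; regular expressions are not a robust way to parse real HTML.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.+>', html)    # .+ runs to the last '>'
lazy = re.findall(r'<.+?>', html)     # .+? stops at the first '>'

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# The same applies to {m,n}: the non-greedy form stops at the minimum.
print(re.match(r'\d{2,4}', '12345').group())    # 1234
print(re.match(r'\d{2,4}?', '12345').group())   # 12
```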