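I. Common methods of the re module
1. match()
match(pattern, string, flags=0)
match() only checks whether the pattern matches at the beginning of the string. It returns a match object on success and None otherwise.
pattern: the regular expression
string: the string to match against
flags: flags that control how the regular expression is matched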
    import re

    obj = re.match(r'\d+', '123uuasf')
    if obj:
        print(obj.group())  # 123
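The flags argument is not exercised in the example above. As a small sketch, re.IGNORECASE (one of the standard flags in the re module) makes the match case-insensitive; the sample strings here are made up for illustration.

```python
import re

# Without flags, matching is case-sensitive, so 'hello' does not match 'Hello'.
strict = re.match(r'hello', 'Hello world')

# re.IGNORECASE (alias re.I) lets letters match regardless of case.
relaxed = re.match(r'hello', 'Hello world', flags=re.IGNORECASE)

print(strict)           # None
print(relaxed.group())  # Hello
```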
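2. search()
search(pattern, string, flags=0)
search() scans the whole string for the pattern and stops at the first match, returning a match object; calling group() on that object gives the matched text. If nothing in the string matches, it returns None.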
    import re

    obj = re.search(r'\d+', 'u123uu888asf')
    if obj:
        print(obj.group())  # 123
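To make the difference between match() and search() concrete, the sketch below runs both on the same string, whose digits do not start at position 0. The variable names are only illustrative.

```python
import re

text = 'u123uu888asf'  # the digits do not start at position 0

at_start = re.match(r'\d+', text)   # only tries position 0
anywhere = re.search(r'\d+', text)  # scans forward to the first match

print(at_start)          # None
print(anywhere.group())  # 123
print(anywhere.span())   # (1, 4)
```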
3. group and groups

    import re

    a = "123abc456"
    m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", a)
    print(m.group())    # 123abc456 (the whole match)
    print(m.group(0))   # 123abc456 (same as group())
    print(m.group(1))   # 123
    print(m.group(2))   # abc
    print(m.groups())   # ('123', 'abc', '456')
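4. findall()
findall(pattern, string, flags=0)
The two functions above each return a single match, that is, only the first occurrence in the string. To get every non-overlapping match in the string, use findall.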
    import re

    obj = re.findall(r'\d+', 'fa123uu888asf')
    print(obj)  # ['123', '888']
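One detail that is easy to miss: when the pattern contains capturing groups, findall returns the group contents instead of the whole match. The sketch below shows the three cases on a made-up sample string.

```python
import re

text = 'width=30, height=45'  # made-up sample

# No groups: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One group: each item is the text of that group.
values = re.findall(r'\w+=(\d+)', text)

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r'(\w+)=(\d+)', text)

print(whole)   # ['width=30', 'height=45']
print(values)  # ['30', '45']
print(pairs)   # [('width', '30'), ('height', '45')]
```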
5. sub()
sub(pattern, repl, string, count=0, flags=0)
Replaces the substrings that match the pattern.
    import re

    content = "123abc456"
    new_content = re.sub(r'\d+', 'sb', content)
    # new_content = re.sub(r'\d+', 'sb', content, count=1)  # replace only the first match
    print(new_content)  # sbabcsb
It is more powerful than str.replace, because the text to replace is described by a pattern rather than a fixed string.
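As a sketch of what re.sub can do that str.replace cannot: the replacement may refer back to captured groups, or it may be a function that is called with each match object. The sample date and the helper name double are made up for illustration.

```python
import re

date = '2024-01-15'  # made-up sample

# Backreferences: reorder the captured groups in the replacement string.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)

# Callable replacement: called once per match, returns the new text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, '123abc456')

print(swapped)  # 15/01/2024
print(doubled)  # 246abc912
```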
6. split()
split(pattern, string, maxsplit=0, flags=0)
Splits the string at every match of the pattern.
    import re

    # Split on every '*'
    content = "'1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"
    new_content = re.split(r'\*', content)
    # new_content = re.split(r'\*', content, maxsplit=1)
    print(new_content)
    # ["'1 - 2 ", ' ((60-30+1', '(9-2', '5/3+7/3', '99/4', '2998+10', '568/14))-(-4', '3)/(16-3', "2) )'"]

    # Split on any run of arithmetic operators
    content = "'1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'"
    new_content = re.split(r'[\+\-\*\/]+', content)
    # new_content = re.split(r'\*', content, maxsplit=1)
    print(new_content)
    # ["'1 ", ' 2 ', ' ((60', '30', '1', '(9', '2', '5', '3', '7', '3', '99', '4', '2998', '10', '568', '14))', '(', '4', '3)', '(16', '3', "2) )'"]

    # Strip whitespace, then split once at the first innermost parenthesised expression
    inpp = '1-2*((60-30+(-40-5)*(9-2*5/3 + 7/3*99/4*2998+10 * 568/14)) - (-4*3)/ (16-3*2))'
    inpp = re.sub(r'\s+', '', inpp)
    new_content = re.split(r'\(([\+\-\*\/]?\d+[\+\-\*\/]?\d+){1}\)', inpp, maxsplit=1)
    print(new_content)
    # ['1-2*((60-30+', '-40-5', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))']
It is more powerful than str.split, because the separator is a pattern rather than a fixed string.
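A minimal side-by-side sketch of str.split and re.split, using made-up sample strings. It also shows that a capturing group in the pattern keeps the separators in the result, which is handy when tokenising an arithmetic expression.

```python
import re

line = 'a, b;c   d'  # made-up sample with mixed separators

# str.split accepts one fixed separator.
by_comma = line.split(',')

# re.split can split on any of several separators at once.
fields = re.split(r'[,;\s]+', line)

# A capturing group in the pattern keeps the separators in the result.
tokens = re.split(r'([+\-*/])', '1+2*3')

print(by_comma)  # ['a', ' b;c   d']
print(fields)    # ['a', 'b', 'c', 'd']
print(tokens)    # ['1', '+', '2', '*', '3']
```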
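II. Single-character matching
Character    Meaning
.            Matches any single character except \n. Because . stands for any character, a literal '.' must be written with an escape, as \.
[ ]          Matches any one of the characters listed inside the brackets. For example, [a-zA-Z0-9] matches any letter or digit and [a-zA-Z] matches any letter; note that there are no spaces inside the brackets.
\d           Matches a digit, i.e. 0-9
\D           Matches a non-digit
\s           Matches whitespace, e.g. a space or a tab
\S           Matches non-whitespace
\w           Matches a word character, i.e. a-z, A-Z, 0-9, _
\W           Matches a non-word character
Single-character matching examples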
    In [8]: ma = re.match(r'.', 'b')

    In [9]: ma.group()
    Out[9]: 'b'

    In [10]: ma = re.match(r'.', '0')

    In [11]: ma.group()
    Out[11]: '0'

    In [13]: ma = re.match(r'{.}', '{a}')

    In [14]: ma.group()
    Out[14]: '{a}'

    In [15]: ma = re.match(r'{.}', '{0}')

    In [16]: ma.group()
    Out[16]: '{0}'

    In [17]: ma = re.match(r'{..}', '{01}')

    In [18]: ma.group()
    Out[18]: '{01}'

    In [19]: ma = re.match(r'{[abc]}', '{a}')

    In [20]: ma.group()
    Out[20]: '{a}'

    In [21]: ma = re.match(r'{[a-z]}', '{d}')

    In [22]: ma.group()
    Out[22]: '{d}'

    In [23]: ma = re.match(r'{[a-zA-Z]}', '{A}')

    In [24]: ma.group()
    Out[24]: '{A}'

    In [25]: ma = re.match(r'{[a-zA-Z0-9]}', '{0}')

    In [26]: ma.group()
    Out[26]: '{0}'

    In [27]: ma = re.match(r'{[\w]}', '{ }')

    In [28]: ma

    In [29]: ma = re.match(r'{[\W]}', '{ }')

    In [30]: ma
    Out[30]: <_sre.SRE_Match object; span=(0, 3), match='{ }'>

    In [31]: ma.group()
    Out[31]: '{ }'

    In [32]: ma = re.match(r'{[\W]}', '{9}')

    In [33]: ma.group()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-33-7c62fc675aee> in <module>()
    ----> 1 ma.group()

    AttributeError: 'NoneType' object has no attribute 'group'

    In [34]: ma

    In [35]: ma = re.match(r'[[\w]]', '[a]')

    In [36]: ma

    In [37]: ma = re.match(r'\[[\w]\]', '[a]')

    In [38]: ma.group()
    Out[38]: '[a]'

    In [39]: ma = re.match(r'\[[\w]\]', '[0]')

    In [40]: ma.group()
    Out[40]: '[0]'
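The session above does not cover the escaped dot or the \d, \s and \w shorthands on their own. The following is a small sketch that checks them; the sample strings and variable names are made up for illustration.

```python
import re

# An unescaped dot matches any character except a newline.
loose_abc = re.match(r'a.c', 'abc')
loose_dot = re.match(r'a.c', 'a.c')

# An escaped dot matches only a literal '.'.
strict_abc = re.match(r'a\.c', 'abc')
strict_dot = re.match(r'a\.c', 'a.c')

print(loose_abc is not None, loose_dot is not None)    # True True
print(strict_abc is not None, strict_dot is not None)  # False True

# The shorthand classes, collected with findall.
sample = 'a1 _\tB'
digits = re.findall(r'\d', sample)
spaces = re.findall(r'\s', sample)
words = re.findall(r'\w', sample)

print(digits)  # ['1']
print(spaces)  # [' ', '\t']
print(words)   # ['a', '1', '_', 'B']
```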
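III. Quantifiers
Character    Meaning
*            Matches the preceding character zero or more times, i.e. it is optional and may repeat
+            Matches the preceding character one or more times, i.e. at least once
?            Matches the preceding character zero or one time, i.e. it is either present once or absent
{m}          Matches the preceding character exactly m times
{m,}         Matches the preceding character at least m times
{m,n}        Matches the preceding character from m to n times
Multi-character matching examples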
    In [1]: import re

    In [2]: ma = re.match(r'[A-Z][a-z]', 'Aa')

    In [3]: ma.group()
    Out[3]: 'Aa'

    In [4]: ma = re.match(r'[A-Z][a-z]', 'A')

    In [6]: ma

    In [8]: ma = re.match(r'[A-Z][a-z]*', 'A')

    In [9]: ma
    Out[9]: <_sre.SRE_Match object; span=(0, 1), match='A'>

    In [10]: ma.group()
    Out[10]: 'A'

    In [12]: ma = re.match(r'[A-Z][a-z]*', 'Asdsdwqass')

    In [14]: ma.group()
    Out[14]: 'Asdsdwqass'

    In [15]: ma = re.match(r'[A-Z][a-z]*', '1Asdsdwqass')

    In [16]: ma

    In [17]: ma = re.match(r'[A-Z][a-z]*', 'Asd1sdwqass')

    In [18]: ma.group()
    Out[18]: 'Asd'

    In [19]: ma = re.match(r'[_a-zA-Z]+[_\w]*', '10')

    In [20]: ma

    In [21]: ma = re.match(r'[_a-zA-Z]+[_\w]*', '_ht11')

    In [22]: ma.group()
    Out[22]: '_ht11'

    In [23]: ma = re.match(r'[1-9]?[0-9]', '99')

    In [24]: ma.group()
    Out[24]: '99'

    In [25]: ma = re.match(r'[1-9]?[0-9]', '90')

    In [26]: ma.group()
    Out[26]: '90'

    In [27]: ma = re.match(r'[1-9]?[0-9]', '9')

    In [28]: ma.group()
    Out[28]: '9'

    In [29]: ma = re.match(r'[1-9]?[0-9]', '0')

    In [30]: ma.group()
    Out[30]: '0'

    In [31]: ma = re.match(r'[1-9]?[0-9]', '09')

    In [32]: ma.group()
    Out[32]: '0'

    In [33]: ma = re.match(r'[a-zA-Z0-9]{6}', 'abc123')

    In [34]: ma.group()
    Out[34]: 'abc123'

    In [35]: ma = re.match(r'[a-zA-Z0-9]{6}', 'abc1234')

    In [36]: ma.group()
    Out[36]: 'abc123'

    In [37]: ma = re.match(r'[a-zA-Z0-9]{6}', 'abc1__')

    In [38]: ma

    In [39]: ma = re.match(r'[a-zA-Z0-9]{6}@163\.com', 'abc123@163.com')

    In [40]: ma.group()
    Out[40]: 'abc123@163.com'

    In [41]: ma = re.match(r'[a-zA-Z0-9]{6,10}@163\.com', 'abc1234@163.com')

    In [42]: ma.group()
    Out[42]: 'abc1234@163.com'

    In [43]: ma = re.match(r'[0-9][a-z]*?', '1bc')

    In [44]: ma.group()
    Out[44]: '1'

    In [45]: ma = re.match(r'[0-9][a-z]*', '1bc')

    In [46]: ma.group()
    Out[46]: '1bc'
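As a compact check of the quantifiers, the sketch below tests each one against strings of zero to four 'a' characters. The helper count_ok is a hypothetical name introduced here; it relies on re.fullmatch, which requires the whole string to match.

```python
import re

def count_ok(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ['', 'a', 'aa', 'aaa', 'aaaa']

star = count_ok(r'a*', candidates)
plus = count_ok(r'a+', candidates)
opt = count_ok(r'a?', candidates)
exact = count_ok(r'a{2}', candidates)
at_least = count_ok(r'a{2,}', candidates)
between = count_ok(r'a{2,3}', candidates)

print(star)      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(plus)      # ['a', 'aa', 'aaa', 'aaaa']
print(opt)       # ['', 'a']
print(exact)     # ['aa']
print(at_least)  # ['aa', 'aaa', 'aaaa']
print(between)   # ['aa', 'aaa']
```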
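IV. Boundaries
Character    Meaning
^            Matches the start of the string
$            Matches the end of the string
\b           Matches a word boundary, i.e. the position between a word character and a non-word character (such as a space) or an end of the string. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B           Matches a position that is not a word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
Boundary examples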
    In [48]: ma = re.match(r'[a-zA-Z0-9]{6,10}@163\.com', 'abc1234@163.comabc')

    In [49]: ma.group()
    Out[49]: 'abc1234@163.com'

    In [50]: ma = re.match(r'[a-zA-Z0-9]{6,10}@163\.com$', 'abc1234@163.comabc')

    In [51]: ma

    In [52]: ma = re.match(r'^[a-zA-Z0-9]{6,10}@163\.com$', 'abc1234@163.com')

    In [53]: ma.group()
    Out[53]: 'abc1234@163.com'

    In [54]: ma = re.match(r'\Aimooc[\w]*', 'imoocpython')

    In [55]: ma.group()
    Out[55]: 'imoocpython'

    In [56]: ma = re.match(r'\Aimooc[\w]*', 'iimooc')

    In [57]: ma.group()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-57-7c62fc675aee> in <module>()
    ----> 1 ma.group()

    AttributeError: 'NoneType' object has no attribute 'group'
    >>> result = re.match(r'1[35678]\d{9}$', '15735177116')
    >>> result
    <_sre.SRE_Match object; span=(0, 11), match='15735177116'>
    >>> result.group()
    '15735177116'
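The examples above use ^, $ and \A but not \b or \B. Here is a minimal sketch of the "never" / "verb" case from the table; the variable names are only illustrative.

```python
import re

# \b: the 'er' must be followed by a word boundary.
b_never = re.search(r'er\b', 'never')
b_verb = re.search(r'er\b', 'verb')

# \B: the 'er' must NOT be followed by a word boundary.
nb_verb = re.search(r'er\B', 'verb')
nb_never = re.search(r'er\B', 'never')

print(b_never.span())  # (3, 5)
print(b_verb)          # None
print(nb_verb.span())  # (1, 3)
print(nb_never)        # None
```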
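V. Groups
Character    Meaning
|            Matches either the expression on its left or the one on its right
(ab)         Treats the characters inside the parentheses as a group
\num         Refers back to the string matched by group number num
(?P<name>)   Gives the group a name
(?P=name)    Refers back to the string matched by the group named name
Group matching examples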
    In [59]: ma = re.match(r'abc|d', 'abc')

    In [60]: ma.group()
    Out[60]: 'abc'

    In [61]: ma = re.match(r'abc|d', 'd')

    In [62]: ma.group()
    Out[62]: 'd'

    In [63]: ma = re.match(r'[1-9]?\d$', '9')

    In [64]: ma.group()
    Out[64]: '9'

    In [65]: ma = re.match(r'[1-9]?\d$', '99')

    In [66]: ma.group()
    Out[66]: '99'

    In [67]: ma = re.match(r'[1-9]?\d$', '09')

    In [68]: ma

    In [69]: ma = re.match(r'[1-9]?\d$', '100')

    In [70]: ma.group()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-70-7c62fc675aee> in <module>()
    ----> 1 ma.group()

    AttributeError: 'NoneType' object has no attribute 'group'

    In [71]: ma = re.match(r'[1-9]?\d$|100', '100')

    In [72]: ma.group()
    Out[72]: '100'

    In [73]: ma = re.match(r'[1-9]?\d$|100', '99')

    In [74]: ma.group()
    Out[74]: '99'

    In [75]: ma = re.match(r'[\w]{4,6}@163\.com', 'imooc@163.com')

    In [76]: ma.group()
    Out[76]: 'imooc@163.com'

    In [77]: ma = re.match(r'[\w]{4,6}@(163,123)\.com', 'imooc@163.com')

    In [78]: ma = re.match(r'[\w]{4,6}@(163,123)\.com', 'imooc@123.com')

    In [79]: ma.group()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-79-7c62fc675aee> in <module>()
    ----> 1 ma.group()

    AttributeError: 'NoneType' object has no attribute 'group'

    In [80]: ma = re.match(r'[\w]{4,6}@(163|123)\.com', 'imooc@123.com')

    In [81]: ma.group()
    Out[81]: 'imooc@123.com'

    In [82]: ma = re.match(r'<[\w]+>', '<book>')

    In [83]: ma.group()
    Out[83]: '<book>'

    In [84]: ma = re.match(r'<([\w]+>)', '<book>')

    In [85]: ma.group()
    Out[85]: '<book>'

    In [86]: ma = re.match(r'<([\w]+>)\1', '<book>')

    In [87]: ma.groups()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-87-f4e4ca66607d> in <module>()
    ----> 1 ma.groups()

    AttributeError: 'NoneType' object has no attribute 'groups'

    In [88]: ma = re.match(r'<([\w]+>)\1', '<book>book>')

    In [89]: ma.groups()
    Out[89]: ('book>',)

    In [90]: ma.group()
    Out[90]: '<book>book>'

    In [91]: ma = re.match(r'<([\w]+>\1', '<book>book>')  # invalid: the group is never closed, so this raises re.error

    In [3]: ma = re.match(r'<([\w]+>)[\w]+</\1', '<book>python</book>')

    In [4]: ma.group()
    Out[4]: '<book>python</book>'

    In [5]: ma = re.match(r'<([\w]+>)[\w]+</\1', '<book>python</book1>')

    In [6]: ma

    In [9]: ma = re.match(r'<(?P<mark>[\w]+>)[\w]+</(?P=mark)', '<book>python</book>')

    In [10]: ma.group()
    Out[10]: '<book>python</book>'
    # Match an email address
    >>> p = r'(\w+)@(163|126|gmail|qq)\.(com|cn|net)$'
    >>> r = re.match(p, 'zhang@qq.com')
    >>> r
    <_sre.SRE_Match object; span=(0, 12), match='zhang@qq.com'>
    >>> r.group()
    'zhang@qq.com'
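The same email pattern can be written with named groups, so that each part is retrieved by name instead of by position. This is only a sketch: the group names user, host and tld are made up here and do not come from the original example.

```python
import re

# Same pattern as above, with hypothetical group names added.
pattern = r'(?P<user>\w+)@(?P<host>163|126|gmail|qq)\.(?P<tld>com|cn|net)$'
m = re.match(pattern, 'zhang@qq.com')

print(m.group('user'))  # zhang
print(m.group(2))       # qq (named groups are still numbered)
print(m.groups())       # ('zhang', 'qq', 'com')
print(m.groupdict())    # {'user': 'zhang', 'host': 'qq', 'tld': 'com'}
```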
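VI. Greedy and non-greedy matching in Python
Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible.
Adding ? after "*", "?", "+" or "{m,n}" turns a greedy quantifier into a non-greedy one.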
    s = 'this is a number 234-235-22-432'
    r = re.match(r'.+(\d+-\d+-\d+-\d+)', s)
    r.group(1)
    Out[32]: '4-235-22-432'
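Why is this not what we expected? Because quantifiers are greedy by default, .+ takes as many characters as it can and leaves only a single digit for the first \d+. The fix is to add ? after * or +.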
    r = re.match(r'(.+?)(\d+-\d+-\d+-\d+)', s)
    r.groups()
    Out[33]: ('this is a number ', '234-235-22-432')

    r = re.match(r'(.+?)(\d+-\d+-\d+-\d+)', s)
    r.group(1)
    Out[34]: 'this is a number '

    r = re.match(r'(.+?)(\d+-\d+-\d+-\d+)', s)
    r.group(2)
    Out[35]: '234-235-22-432'
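Another common place where greediness matters is matching delimited text such as tags. The sketch below compares the greedy and non-greedy forms on a made-up sample string, and also shows the non-greedy form of {m,n}.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # made-up sample

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r'<.+?>', html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# The same applies to {m,n}: greedy takes up to n, non-greedy takes m.
most = re.match(r'\d{2,4}', '12345').group()
least = re.match(r'\d{2,4}?', '12345').group()

print(most)   # 1234
print(least)  # 12
```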