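Shuimu Tsinghua BBS (水木清华站): Digest Area

From: starw (化缘道人), Board: Linux
Subject: Python Regular Expression HOWTO 4.3
Site: Shuimu Tsinghua BBS (Tue Nov 21 23:45:46 2000)

Heh, this is only one part... it took me quite a while to figure it out.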

4.3 Non-capturing and Named Groups

Elaborate REs may use many groups, both to capture substrings of interest,
and to group and structure the RE itself. In complex REs, it becomes
difficult to keep track of the group numbers. There are two features which
help with this problem. Both of them use a common syntax for regular
expression extensions, so we'll look at that first.

Perl 5 added several additional features to standard regular expressions,
and the Python re module supports most of them. It would have been difficult
to choose new single-keystroke metacharacters or new special sequences
beginning with "\" to represent the new features, without making Perl's
regular expressions confusingly different from standard REs. If you chose
"&" as a new metacharacter, for example, old expressions would be assuming
that "&" was a regular character and wouldn't have escaped it by writing \&
or [&]. The solution chosen was to use (?...) as the extension syntax. "?"
immediately after a parenthesis was a syntax error, because the "?" would
have nothing to repeat, so this doesn't introduce any compatibility problems.
The characters immediately after the "?" indicate what extension is being
used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo)
is something else (a non-capturing group containing the subexpression foo).
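
To make the difference between those two extensions concrete, here is a
small sketch. The sample string "foobar" and the variable names are made up
for illustration; only re.match and the two extension forms come from the
text above.

```python
import re

text = "foobar"

# (?=bar) is a lookahead assertion: it checks that "bar" follows,
# but consumes nothing, so the match itself is only "foo".
lookahead = re.match(r"foo(?=bar)", text)

# (?:bar) is a non-capturing group: "bar" is consumed as part of
# the match, but no numbered group is created for it.
noncapturing = re.match(r"foo(?:bar)", text)

print(lookahead.group())      # foo
print(noncapturing.group())   # foobar
print(noncapturing.groups())  # ()
```

Both patterns start with "(?", and it is the character after the "?" that
selects which behaviour you get.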

Python adds an extension syntax to Perl's extension syntax. If the first
character after the question mark is a "P", you know that it's an extension
that's specific to Python. Currently there are two such extensions:
(?P<name>...) defines a named group, and (?P=name) is a backreference to a
named group. If future versions of Perl 5 add similar features using a
different syntax, the re module will be changed to support the new syntax,
while preserving the Python-specific syntax for compatibility's sake.

Now that we've looked at the general extension syntax, we can return to the
features that simplify working with groups in complex REs. Since groups are
numbered from left to right, and a complex expression may use many groups,
it can become difficult to keep track of the correct numbering, and modifying
such a complex RE is annoying. Insert a new group near the beginning, and you
change the numbers of everything that follows it.
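
The renumbering problem can be seen in a short sketch. The log-style input
line and both patterns are invented for this example and are not from the
HOWTO.

```python
import re

line = "2000-11-21 starw"

# Original pattern: group 1 is the date, group 2 is the user name.
before = re.match(r"(\d{4}-\d\d-\d\d) (\w+)", line)

# Revised pattern: a new capturing group for the year is added near
# the front, so the user name silently moves from group 2 to group 3.
after = re.match(r"((\d{4})-\d\d-\d\d) (\w+)", line)

print(before.group(2))  # starw
print(after.group(2))   # 2000
print(after.group(3))   # starw
```

Any code that still asks for group 2 after the change keeps running, but
now gets the year instead of the user name.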

First, sometimes you'll want to use a group to collect a part of a regular
expression, but aren't interested in retrieving the group's contents. You can
make this fact explicit by using a non-capturing group: (?:...), where you
can put any other regular expression inside the parentheses.

>>> import re
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()

Except for the fact that you can't retrieve the contents of what the group
matched, a non-capturing group behaves exactly the same as a capturing group;
you can put anything inside it, repeat it with a repetition metacharacter
such as "*", and nest it within other groups (capturing or non-capturing).
(?:...) is particularly useful when modifying an existing pattern, since you
can add new groups without changing how all the other groups are numbered.
It should be mentioned that there's no performance difference in searching
between capturing and non-capturing groups; neither form is any faster than
the other.
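
As a rough illustration of that point, the sketch below extends a pattern
with an optional prefix. The key=value patterns and the "set"/"let" keywords
are hypothetical, chosen only to show that the group numbers do not move.

```python
import re

# Existing pattern: group 1 is the key, group 2 is the value.
old = re.compile(r"(\w+)=(\w+)")

# Modified pattern: an optional "set " or "let " prefix is grouped with
# (?:...), nested one inside the other, so no group number is added.
new = re.compile(r"(?:(?:set|let) )?(\w+)=(\w+)")

m_old = old.match("depth=3")
m_new = new.match("set depth=3")

print(m_old.groups())  # ('depth', '3')
print(m_new.groups())  # ('depth', '3')
```

Code written against the old pattern can keep using group(1) and group(2)
unchanged.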

The second, and more significant, feature is named groups: instead of
referring to them by numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions:
(?P<name>...). name is, obviously, the name of the group. Except for
associating a name with a group, named groups also behave identically to
capturing groups. The MatchObject methods that deal with capturing groups
all accept either integers, to refer to groups by number, or a string
containing the group name. Named groups are still given numbers, so you
can retrieve information about a group in two ways:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
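
The same holds for the other match object methods that take a group
argument. The sketch below reuses the pattern from the session above; note
that groupdict() is not discussed in this section and is shown only as a
convenient way to see the name-to-text mapping.

```python
import re

p = re.compile(r'(?P<word>\b\w+\b)')
m = p.search('(((( Lots of punctuation )))')

# Position methods accept a name or a number, just like group().
print(m.start('word'), m.end('word'))  # 5 9
print(m.span(1))                       # (5, 9)

# groupdict() returns the named groups as a dictionary.
print(m.groupdict())                   # {'word': 'Lots'}
```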

Named groups are handy because they let you use easily-remembered names,
instead of having to remember numbers. Here's an example RE from the imaplib
module:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
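
To see the names at work, the sketch below runs that pattern against a
sample string. The sample is invented for this example and is not a real
server response; the pattern itself is copied from above.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input, made up for illustration.
sample = 'INTERNALDATE "21-Nov-2000 23:45:46 +0800"'
m = InternalDate.search(sample)

print(m.group('day'), m.group('mon'), m.group('year'))  # 21 Nov 2000
print(m.group('zonen') + m.group('zoneh') + m.group('zonem'))  # +0800
print(m.group('zonem') == m.group(9))  # True
```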

It's obviously much easier to retrieve m.group('zonem') than to remember to
retrieve group 9.

Since the syntax for backreferences, in an expression like (...)\1, refers
to the number of the group, there's naturally a variant that uses the group
name instead of the number. This is also a Python extension: (?P=name)
indicates that the contents of the group called name should again be found
at the current point. The regular expression for finding doubled words,
(\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):

>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
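
As a quick check that the numbered and named forms really are
interchangeable, the following sketch runs both on the same sentence. The
variable names numbered and named are chosen here for illustration.

```python
import re

numbered = re.compile(r'(\b\w+)\s+\1')
named = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')

text = 'Paris in the the spring'
m1 = numbered.search(text)
m2 = named.search(text)

print(m1.group())       # the the
print(m2.group())       # the the

# The named group still has a number, so both lookups agree.
print(m2.group('word'), m2.group(1))  # the the
```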


--

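        Copper and iron are cast into the great furnace; ants climb the whitewashed wall.
        Yin and yang have no second meaning; between heaven and earth, I am the center.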

※ Source:·Shuimu Tsinghua BBS smth.org·[FROM: 202.117.27.35]