Functions for tokenizing a text, based on a regular expression which
matches tokens or gaps.
_display(tokens)
A helper function for demo() that displays a list of tokens.
_remove_group_identifiers(parsed_re)
Modifies the given parsed regular expression, replacing all groupings
(as indicated by parentheses in the regular expression string) with
non-grouping variants (indicated with '(?:...)').
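A likely motivation for this rewriting can be seen with Python's standard re module: when a pattern contains capturing groups, re.findall returns the group contents rather than the whole match. The sketch below only illustrates that behaviour. It does not reproduce this function, and the sample pattern and text are assumptions chosen for the example.

```python
import re

# Sample input; both the text and the patterns are illustrative assumptions.
text = "well-known fact"

# With a capturing group, findall reports only what the group matched.
capturing = re.findall(r"\w+(-\w+)*", text)

# With the non-grouping variant '(?:...)', findall reports whole matches.
non_capturing = re.findall(r"\w+(?:-\w+)*", text)

print(capturing)      # group contents only
print(non_capturing)  # complete tokens
```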
demo()
A demonstration that shows the output of several different tokenizers
on the same string.
regexp(text, pattern, gaps=True, advanced=True)
Tokenize the text according to the regular expression pattern.
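The two readings of the pattern (matching tokens, or matching the gaps between tokens) can be sketched with the standard re module. This is a minimal illustration under stated assumptions, not this module's implementation: regexp_sketch is a hypothetical name, the advanced parameter is left out, and the patterns are assumed to contain no capturing groups.

```python
import re

def regexp_sketch(text, pattern, gaps=False):
    """Hypothetical sketch: tokenize text with a regular expression.

    If gaps is true, the pattern is taken to match the separators
    between tokens; otherwise it is taken to match the tokens.
    Assumes the pattern has no capturing groups.
    """
    if gaps:
        # Split on the separators and drop empty strings at the edges.
        return [piece for piece in re.split(pattern, text) if piece]
    # Collect every non-overlapping match as a token.
    return re.findall(pattern, text)

by_tokens = regexp_sketch("a, b,c", r"\w+")
by_gaps = regexp_sketch("a, b,c", r",\s*", gaps=True)
```

In this sample both calls produce the same three tokens, because the token pattern and the gap pattern describe complementary parts of the string.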
shoebox(s)
Tokenize a Shoebox entry into its fields (separated by backslash
markers).
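As a rough sketch of field splitting, the following treats a backslash at the start of a line as the gap between fields. The name shoebox_sketch, the sample entry, and the choice to strip whitespace and drop the backslash are all assumptions made for illustration; the function documented here may differ on each point.

```python
import re

def shoebox_sketch(entry):
    """Hypothetical sketch: split an entry at line-initial backslashes."""
    # '^' combined with re.MULTILINE matches at the start of every line.
    pieces = re.split(r"^\\", entry, flags=re.MULTILINE)
    # Drop the empty piece before the first marker and trim line endings.
    return [piece.strip() for piece in pieces if piece.strip()]

# Sample entry invented for this example.
fields = shoebox_sketch("\\lx kaa\n\\ps N\n\\ge gag\n")
```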
token_split(text, pattern, advanced=True)
Returns:
An iterator that generates tokens and the gaps between them
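One way such an iterator could be written is as a generator over re.finditer, sketched below. The name token_split_sketch and the ('token', ...) / ('gap', ...) labelling are assumptions for illustration; the form of the items produced by the documented function is not stated here, and the advanced parameter is omitted.

```python
import re

def token_split_sketch(text, pattern):
    """Hypothetical sketch: yield tokens and the gaps between them.

    Each item is a (kind, string) pair, where kind is 'token' for text
    matched by the pattern and 'gap' for the text between matches.
    """
    position = 0
    for match in re.finditer(pattern, text):
        if match.start() > position:
            # Unmatched text since the previous token is a gap.
            yield ("gap", text[position:match.start()])
        if match.end() > match.start():
            # Skip empty matches so that no empty tokens are produced.
            yield ("token", match.group())
        position = match.end()
    if position < len(text):
        # Trailing text after the last token is a final gap.
        yield ("gap", text[position:])

pieces = list(token_split_sketch("Hello, world", r"\w+"))
```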
word(s)
Tokenize the text into sequences of word characters (a-zA-Z0-9).
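A minimal sketch of this behaviour, using the character class given in the description. The name word_sketch and the sample sentence are assumptions, and the function documented here may use a different pattern internally.

```python
import re

def word_sketch(s):
    """Hypothetical sketch: return runs of the characters a-z, A-Z, 0-9."""
    return re.findall(r"[a-zA-Z0-9]+", s)

words = word_sketch("It's 9am, OK?")
```

Note that the apostrophe is not a word character under this definition, so "It's" is split into two tokens.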
wordpunct(s)
Tokenize the text into sequences of alphabetic and non-alphabetic
characters.
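The split into alphabetic and non-alphabetic runs can be sketched as below. The name wordpunct_sketch and the pattern are assumptions, as is the decision to discard whitespace rather than return it as a token.

```python
import re

def wordpunct_sketch(s):
    """Hypothetical sketch: alternate alphabetic and non-alphabetic runs."""
    # First alternative: a run of letters.
    # Second alternative: a run of anything that is neither a letter
    # nor whitespace (digits and punctuation stay together).
    return re.findall(r"[a-zA-Z]+|[^a-zA-Z\s]+", s)

tokens = wordpunct_sketch("Good muffins cost $3.88.")
```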