Tcl Regular Expression Cheat Sheet

Posted : admin On 1/29/2022

On -all and -inline. From the documentation:

-all: Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.

-inline: Causes the command to return, as a list, the data that would otherwise be placed in match variables.
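The two switches described above can be tried together. This is a minimal sketch; the sample string and the variable names text, count, and matches are made up for illustration.

```tcl
# Hypothetical sample text; any string with several matches would do.
set text "a1 b22 c333"

# -all alone: regexp returns how many times the RE matched.
set count [regexp -all {[0-9]+} $text]

# -all -inline: regexp returns the matched substrings as a list.
set matches [regexp -all -inline {[0-9]+} $text]

puts "count   = $count"      ;# count   = 3
puts "matches = $matches"    ;# matches = 1 22 333
```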

  1. A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is «.*\.txt».
  2. In Tcl 8.0 and before, you had to surround the regular expression with double quotes so the Tcl backslash processor could convert the \t to a literal tab character. The square brackets had to be hidden from the backslash processor by adding backslashes before them, which made code harder to read and possibly more error-prone.

Regular expressions (regex or regexp) are extremely useful for extracting information from text by searching for one or more matches of a specific search pattern.

Tcl 8.1 now handles advanced regular expressions (REs). Previous regular expression handling is almost unchanged, except that clumsy handling of escapes like \n has been much improved, and a few escapes that were previously legal (but useless) now won't work.

Note that a few advanced features aren't useful yet but are ready for future Tcl releases. That's because Tcl 8.1 (apart from the regular expression engine) implements only the Unicode locale (where all characters sort in Unicode order, there are no multi-character collating elements, and there are no equivalence classes).

This document has an overview of the new regular expression features. For exact semantics and more details, see the new re_syntax(n) reference page. (The re_syntax(n) page was split from the 8.1 regexp(n) reference page, which used to cover RE syntax for all Tcl commands.) This howto document covers:

1. Regular Expression Overview

2. Regular Expressions in Tcl 8.1

  • Backslash Escapes (\xxx)
  • Bounds ({})
  • Character Classes ([: :])
  • Collating Elements ([. .])
  • Equivalence Classes ([= =])
  • Noncapturing Subpatterns ((?:re))
  • Lookahead Assertions ((?=re) and (?!re))

3. Summary: Regular Expression changes in Tcl 8.1

Part 1. Regular Expression Overview

This Part describes regular expressions (REs), explains REs from Tcl 8.0 and before, and describes the Tcl regexp and regsub commands. Part Two describes the new Tcl 8.1 REs.

What are Regular Expressions?

A regular expression, or RE, describes strings of characters (words or phrases or any arbitrary text). It's a pattern that matches certain strings and doesn't match others. For example, you could write an RE to tell you if a string contains a URL (World Wide Web Uniform Resource Locator, such as http://somehost/somefile.html). Regular expressions can be either broad and general or focused and precise.

A regular expression uses metacharacters (characters that assume special meaning for matching other characters) such as *, [], $, and the period (.). For example, the RE [Hh]ello!* would match Hello and hello and Hello! (and hello!!!!!). The RE [Hh](ello|i)!* would match Hello and Hi and Hi! (and so on). A backslash (\) disables the special meaning of the following character, so you could match the string [Hello] with the RE \[Hello\].
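The REs in the paragraph above can be checked with regexp, which returns 1 on a match and 0 otherwise. This is only a sketch; the test strings (including Jello and xyz) and the variable names are illustrative additions.

```tcl
# Each sample string is tested against [Hh]ello!*; Jello should fail.
foreach s {Hello hello Hello! hello!!!!! Jello} {
    puts [format "%-12s %d" $s [regexp {[Hh]ello!*} $s]]
}
set jello [regexp {[Hh]ello!*} "Jello"]          ;# 0

# Alternation inside a group: matches Hello, Hi, Hi!, and so on.
set hi [regexp {[Hh](ello|i)!*} "Hi!"]           ;# 1

# Unescaped, [Hello] is a bracket expression (one of H, e, l, o).
set bracket [regexp {[Hello]} "xyz"]             ;# 0

# Escaped, the square brackets are literal characters.
set literal [regexp {\[Hello\]} {say [Hello]}]   ;# 1
```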

Regular Expressions in Tcl 8.0 and Before

Regular expressions in Tcl 8.0 and before had the following metacharacters:

.     Match any single character (e.g., m.d matches mad, mod, m3d, etc.)
[]    Bracket expression: Match any one of the enclosed characters (e.g., [a-z0-9_] matches a lowercase ASCII letter, a digit, or an underscore)
^     Start-of-string anchor: Match only at the start of a string (e.g., ^hi matches hi and his but not this)
$     End-of-string anchor: Match only at the end of a string (e.g., hi$ matches hi and chi but not this)
*     Zero-or-more quantifier: makes the previous part of the RE match zero or more times (e.g., M.*D matches MD, MAD, MooD, M.D, etc.)
?     Zero-or-one quantifier: makes the previous part of the RE match zero or one time (e.g., hi!? matches hi or hi!)
+     One-or-more quantifier: makes the previous part of the RE match one or more times (e.g., hi!+ matches hi! or hi!! or hi!!! or ...)
|     Alternation (vertical bar): Match just one alternative (e.g., this|that matches this or that)
()    Subpattern: Group part of the RE. Many uses, such as:
  • Makes a quantifier apply to a group of text (e.g., ([0-9A-F][0-9A-F])+ matches groups of two hexadecimal digits: A9 or AB03 or 8A6E00, but not A or A2C).
  • Set limits for alternation (e.g., 'Eat (this|that)!' matches 'Eat this!' or 'Eat that!').
  • Used for subpattern matching in the regexp and regsub commands.
\     Escape: Disables meaning of the following metacharacter (e.g., a\.* matches a or a. or a.. or etc.). Note that \ also has special meaning to the Tcl interpreter (and to applications, such as C compilers).
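A few rows of the table can be exercised as follows. This is a sketch under one assumption worth flagging: regexp succeeds when any part of the string matches, so the hexadecimal example is wrapped in ^ and $ here to test the whole string. The variable names are arbitrary.

```tcl
set r1 [regexp {^hi} "his"]                         ;# 1: starts with hi
set r2 [regexp {^hi} "this"]                        ;# 0: hi is not at the start
set r3 [regexp {hi$} "chi"]                         ;# 1: ends with hi

# Anchored so that the entire string must be pairs of hex digits.
set r4 [regexp {^([0-9A-F][0-9A-F])+$} "8A6E00"]    ;# 1
set r5 [regexp {^([0-9A-F][0-9A-F])+$} "A2C"]       ;# 0: odd number of digits

# Parentheses limit the alternation and capture the chosen word.
regexp {Eat (this|that)!} "Eat that!" whole choice
puts $choice                                        ;# that

# The backslash makes the period literal; * then repeats it.
regexp {a\.*} "a.." dots
puts $dots                                          ;# a..
```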

The syntax above is supported in Tcl 8.1. Tcl 8.1 also supports advanced regular expressions (AREs). These powerful expressions are introduced in more detail in Part Two. Briefly, though, AREs support backreferences, lookahead, non-greedy matching, many escapes, features that are useful for internationalization (handling collation elements, equivalence classes and character classes), and much more.

The Tcl 8.1 regular expression engine almost always interprets 8.0-style REs correctly. In the few cases that it doesn't, and when the problem is too difficult to fix, the 8.1 engine has an option to select 8.0 ('ERE') interpretation.

Overview of regexp and regsub

The Tcl commands regexp and regsub use regular expressions:
  • regexp compares a string to an RE. It returns a value of 1 if the RE matches part or all of the string, or 0 if there's no match. Optionally, it stores the matched part of the string in a variable (and can also store subparts of the string in additional variables). For example, you can compare the string in $line against the RE [Hh]ello!* and name match as the variable that receives the matched text.

    If part or all of the line variable matches the RE, regexp stores the matching part in the match variable and returns a value of 1.

  • regsub substitutes part of a string that matches an RE. For instance, regsub can edit the string from $in_line to replace all space or tab characters with a single space character, using a bracket expression that contains a space and the escape \t; the edited line is stored in the out_line variable.

    Please also read the following section about backslash processing.
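The two commands described in the list above might look like the sketch below. The sample values of line and in_line are invented, and the + quantifier after the bracket expression (so that a whole run of blanks collapses to one space) is an assumption rather than something the text states.

```tcl
# regexp: test $line and store the matching part in the variable match.
set line "Well, hello!!! Nice to see you."
if {[regexp {[Hh]ello!*} $line match]} {
    puts "matched: $match"      ;# matched: hello!!!
}

# regsub: replace runs of space or tab characters with one space.
# The escape \t inside braces is interpreted by the RE engine (8.1 syntax).
set in_line "one \t two\t\tthree"
regsub -all {[ \t]+} $in_line " " out_line
puts $out_line                  ;# one two three
```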

Backslash Processing

If you've used Tcl, you probably recognize the \t in the previous example as a character-entry escape that stands for a tab character.

We actually used the 8.1 syntax above; the example wouldn't have worked under 8.0! In Tcl 8.0 and before, you had to surround the regular expression with double quotes so the Tcl backslash processor could convert the \t to a literal tab character. The square brackets had to be hidden from the backslash processor by adding backslashes before them, which made code harder to read and possibly more error-prone.
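As a rough illustration of the quoting difference just described, the sketch below writes a blank-collapsing regsub both ways. The input string and the variable names old_style and new_style are hypothetical; the double-quoted form follows the rules given above (backslashes before the square brackets, \t converted by Tcl itself).

```tcl
set in_line "one \t two\t\tthree"

# 8.0 style: Tcl's own backslash processing builds the RE text.
regsub -all "\[ \t\]+" $in_line " " old_style

# 8.1 style: braces pass \t through for the RE engine to interpret.
regsub -all {[ \t]+} $in_line " " new_style

puts $old_style    ;# one two three
puts $new_style    ;# one two three
```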

For more about the simplified 8.1 syntax, see the section Backslash Escapes.

Part 2. Regular Expressions in Tcl 8.1

Tcl 8.1 regular expressions are basically a superset of 8.0 REs. This howto document has an overview of the new features. Please see the re_syntax(n) reference page for exact semantics and more details.

Non-Greedy Quantifiers

A quantifier specifies 'how many.' For example, the quantifier * in the RE z* matches zero or more zs. By default, regular expression quantifiers are greedy: they match as much text as they can. Tcl 8.1 REs also have non-greedy quantifiers, which match the least text they can. To make a non-greedy quantifier, add a question mark (?) at the end.

Let's start by storing some HTML text in a variable, x, then using two regexp commands to match it. The first RE is greedy, and the second is non-greedy.

The first RE <EM>.*</EM> is 'greedy.' It matches from the first <EM> to the last </EM>. The second RE <EM>.*?</EM>, with a question mark (?) after the * quantifier, is non-greedy: it matches as little text as possible after the first <EM>.

Could you write a greedy RE that works like the non-greedy version? It isn't easy! A greedy RE like <EM>[^<]*</EM> would do it in this case -- but it wouldn't work if there were other HTML tags (with a < character) between the pair of <EM> tags in the $x string.
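The comparison above can be reproduced with a short script. The HTML sentence stored in x is a made-up sample, and the variable names greedy, lazy, and workaround are arbitrary.

```tcl
# Hypothetical HTML text with two pairs of <EM> tags.
set x {<EM>He</EM> sits, but <EM>she</EM> stands.}

# Greedy: runs from the first <EM> to the last </EM>.
regexp {<EM>.*</EM>} $x greedy
puts $greedy        ;# <EM>He</EM> sits, but <EM>she</EM>

# Non-greedy: stops at the first </EM>.
regexp {<EM>.*?</EM>} $x lazy
puts $lazy          ;# <EM>He</EM>

# Greedy workaround that only works when no other tag intervenes.
regexp {<EM>[^<]*</EM>} $x workaround
puts $workaround    ;# <EM>He</EM>
```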

Now consider a new string that contains a 3 followed by three zs, and another pair of REs to match it. The greedy RE 3z* matches all the zs it can (three) under its 'zero or more' rule. The non-greedy RE 3z*? matches just 3 because it matches the fewest zs it can under its 'zero or more' rule.
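A sketch of that second comparison, assuming a sample string of our own choosing (123zzz456):

```tcl
set s "123zzz456"

# Greedy: takes every z that follows the 3.
regexp {3z*} $s greedy
puts $greedy    ;# 3zzz

# Non-greedy: zero zs is enough, so only the 3 is matched.
regexp {3z*?} $s lazy
puts $lazy      ;# 3
```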

To review, the greedy quantifiers from Tcl 8.0 are: *, +, and ?. So the non-greedy quantifiers (added in Tcl 8.1) are: *?, +?, and ??. Tcl 8.1 also has the new quantifiers {m}, {m,}, and {m,n}, as well as the non-greedy versions {m}?, {m,}?, and {m,n}?. The section on bounds explains -- and has more examples of non-greedy matching.

Backslash Escapes

A backslash (\) disables the metacharacter after it. For example, a\* matches the character a followed by a literal asterisk (*) character. In Tcl 8.0 and before, it was legal to put a backslash before a non-metacharacter -- for instance, regexp {\p} matched the character p. (Note that regexp {\n} matched the character n, which was a source of confusion. To get a newline character into an RE before version 8.1, you had to write regexp "\n" so Tcl processing inside double quotes would convert the \n to a newline.)

The Tcl 8.1 regular expression engine interprets backslash escapes itself. So now regexp {\n} matches a newline, not the character n. REs are simpler to write in 8.1 because of this. (You can still write regexp "\n" -- and let Tcl conversion happen inside the double quotes -- so most old code will still work.)

One of the most important changes in 8.1 is that a backslash inside a bracket expression is treated as the start of an escape. In 8.0 and before, a backslash inside brackets was treated as a literal backslash character. For example, in 8.0 and before, regexp {[a\n]} would match the characters a, \, or n. But in 8.1, regexp {[a\n]} would match the characters a or newline (because \n is the backslash escape for 'newline').

Tcl 8.1 has also added many new backslash escapes. For instance, \d matches a digit. Some of these are listed below, and the re_syntax(n) reference page has the whole list.

In Tcl 8.1 regular expressions (but not in other parts of the language), it's illegal to use a backslash before a non-metacharacter unless it makes a valid escape. So regexp {\p} is now an error. If you have code that (for some bizarre reason) has regular expressions with a backslash before a non-metacharacter, like regexp {\p}, you'll need to fix it.
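The 8.1 behavior described in the last few paragraphs can be probed with the sketch below. The test strings and variable names are illustrative, and catch is used so that the invalid escape does not abort the script; the exact wording of the error message is not assumed.

```tcl
# In braces, \n reaches the RE engine, which reads it as a newline.
set nl  [regexp {\n} "one\ntwo"]    ;# 1: the string contains a newline
set n   [regexp {\n} "n"]           ;# 0: a plain letter n does not match

# Inside a bracket expression, \n is still the newline escape.
set br1 [regexp {[a\n]} "\n"]       ;# 1
set br2 [regexp {[a\n]} "n"]        ;# 0

# A backslash before a non-metacharacter that forms no valid escape
# is rejected when the RE is compiled.
set failed [catch {regexp {\p} "p"} msg]
puts "failed=$failed: $msg"
```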

As explained above, the Tcl 8.1 regular expression engine now interprets backslash sequences like \n to mean 'newline'. It also has four new kinds of escapes: character entry escapes, class shorthand escapes, constraint escapes, and back references. Here's an introduction. (The re_syntax(n) page has full details.)

  • A character entry escape is a convenient way to enter a non-printing or other difficult character. For instance, \n represents a newline character. \uwxyz (where wxyz is hexadecimal) represents the Unicode character U+wxyz.
  • Class shorthand escapes are shorthand for common character classes. For example, \d stands for [[:digit:]], which means 'any single digit.'
  • A constraint escape constrains an RE to match only at a certain place. For example, the constraint escape \m matches only at the start of a word -- so the RE \mhi will match the third word in the string he said hi but won't match he said thigh.
  • A back reference matches the same string that was matched by a previous parenthesized subexpression. (This works like subexpressions in regsub, but it's used for matching instead of extracting.) For example, (X.*Y)\1 matches any doubled string that starts with X and ends with Y, such as XYXY, XabcYXabcY, X--YX--Y, etc.
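One example of each of the four escape kinds listed above, as a minimal sketch. The strings "room 101" and "XabcYXabd" and the variable names are our own additions.

```tcl
# Character entry escape: \u0041 is the Unicode character U+0041 (A).
set u  [regexp {\u0041} "A"]                 ;# 1

# Class shorthand escape: \d+ picks out a run of digits.
regexp {\d+} "room 101" num
puts $num                                    ;# 101

# Constraint escape: \m matches only at the start of a word.
set w1 [regexp {\mhi} "he said hi"]          ;# 1
set w2 [regexp {\mhi} "he said thigh"]       ;# 0

# Back reference: \1 must repeat whatever (X.*Y) matched.
set b1 [regexp {(X.*Y)\1} "XabcYXabcY"]      ;# 1
set b2 [regexp {(X.*Y)\1} "XabcYXabd"]       ;# 0
```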

Finally, remember that (as in Tcl 8.0 and before) some applications, such as C compilers, interpret these backslash sequences themselves before the regular expression engine sees them. You may need to double (or quadruple, etc.) the number of backslashes for these applications. Still, in straight Tcl 8.1 code, writing backslash escapes is now both simpler and more powerful than in 8.0 and before.

Bounds

You've seen the quantifiers *, +, and ?. They specify 'how many' (respectively, zero or more, one or more, and zero or one). Tcl 8.1 added new quantifiers that let you choose exactly how many matches: the bounds operators, {}.

These operators come in three greedy forms: {m}, {m,}, and {m,n}. The corresponding non-greedy forms are {m}?, {m,}?, and {m,n}?.

  • The {m} quantifier matches exactly m occurrences. So does {m}?. For example, either #{70} or #{70}? matches a string of exactly 70 # characters.
  • The {m,} quantifier matches at least m occurrences. For example, if a string starts with two number signs (##), neither the greedy #{4,} nor the non-greedy #{4,}? matches them, because there aren't at least four of them.
  • The {m,n} quantifier matches at least m but no more than n occurrences.

    For example, the RE http://([^/]+/?){1,3} would match Web URLs that have 3 components (like http://xyz.fr/euro/billets.htm), 2 components (like http://xyz.fr/euro/), or just 1 component (like http://xyz.fr). The RE matches a final slash (/) if there is one. As always, a greedy match will match as long a string as possible: it would try for 3 matches.

    A non-greedy quantifier would try to match the least (1 match). But be careful: http://([^/]+/?){1,3}? won't match all the way to a possible slash because it matches the fewest characters possible! (With input http://xyz.fr/, that RE would match just http://x.) This brings up one of the many subtleties in these advanced regular expressions: that the outer non-greedy quantifier overrides the inner greedy quantifiers and makes all quantifiers non-greedy! There's an explanation in the re_syntax(n) reference page section named Matching.
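The bounds described in the list above can be tried as in this sketch. The string "## ####### end" is an invented sample, and the ^ and $ anchors on the {70} test are added here so that the whole string is checked.

```tcl
# {m}: exactly 70 number signs (anchored to test the whole string).
set exact [regexp {^#{70}$} [string repeat "#" 70]]    ;# 1

# {m,}: the leading ## is skipped because it is shorter than four.
set s "## ####### end"
regexp {#{4,}}  $s greedy
regexp {#{4,}?} $s lazy
puts $greedy    ;# #######
puts $lazy      ;# ####

# {m,n}: a greedy match takes up to three URL components.
regexp {http://([^/]+/?){1,3}} "http://xyz.fr/euro/billets.htm" url
puts $url       ;# http://xyz.fr/euro/billets.htm

# Non-greedy outer bound: per the text above, this matches only http://x.
regexp {http://([^/]+/?){1,3}?} "http://xyz.fr/" short
puts $short
```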

Character Classes

A character class is a name for one or more characters. For example, punct stands for the 'punctuation' characters. A character class is always written as part of a bracket expression, which is a list of characters enclosed in [].

For instance, the character class named digit stands for any of the digits 0-9 (zero through nine). The character class is written with the class name inside a set of brackets and colons, like this: [[:digit:]]. The old familiar expression for digits is written as a range: [0-9]. When you compare the new character class to the old range version, you can see that the outer square brackets are the same in both. So a character class is written [:classname:].

The table below describes the Tcl 8.1 character classes.

alpha     A letter (includes many non-ASCII characters).
upper     An upper-case letter.
lower     A lower-case letter.
digit     A decimal digit.
xdigit    A hexadecimal digit.
alnum     An alphanumeric (letter or digit).
print     An alphanumeric. (Same as alnum.)
blank     A space or tab character.
space     A character producing white space in displayed text. (Includes en-space, hair space, many others.)
punct     A punctuation character.
graph     A character with a visible representation.
cntrl     A control character.

You can use more than one character class in a bracket expression. You can also mix character classes with ranges and single characters. For instance, [[:digit:]a-cx-z] would match a digit (0-9), a, b, c, x, y, or z -- and [^[:digit:]a-cx-z] would match any character except those. This syntax can take some time to get familiar with! The key is to look for the character class (here, [:digit:]) inside the bracket expression.

The advantage of character classes (like [:alpha:]) over explicit ranges in brackets (like [a-z]) is that character classes include characters that aren't easy to type on ASCII keyboards. For example, the Spanish language includes the character ñ. It doesn't fall into the range [a-z], but it is in the Tcl 8.1 character class [:alpha:]. In the same way, the Spanish punctuation character ¡ isn't in a list of punctuation characters like [.!?,], but it is part of [:punct:].
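A small sketch of the points above. The word "mañana" is our own sample; ñ is written as \u00f1 inside double quotes so that the script does not depend on the encoding of the source file, and the ^...$ anchors are added to test whole strings.

```tcl
# A class mixed with ranges in one bracket expression.
set mix1 [regexp {^[[:digit:]a-cx-z]+$} "7abxz"]    ;# 1: every character is in the set
set mix2 [regexp {[^[:digit:]a-cx-z]} "7abxz"]      ;# 0: nothing falls outside the set

# The word "mañana": \u00f1 is the letter n with tilde.
set word "ma\u00f1ana"
set cls   [regexp {^[[:alpha:]]+$} $word]           ;# 1: the class includes ñ
set range [regexp {^[a-z]+$} $word]                 ;# 0: the ASCII range does not
```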

Tcl 8.1 has a standard set of character classes that are defined in the source code file generic/regc_locale.c. Tcl 8.1 has one locale defined: the Unicode locale. It may support other locales (and other character classes) in the future.

Collating Elements

A collating symbol lets you represent other characters unambiguously. A collating symbol is written surrounded by brackets and dots, like [.number-sign.]. Collating symbols must be written in a bracket expression (inside []). So [[.number-sign.]] will match the character #.

Tcl 8.1 has a standard set of collating symbols that are defined in the source code file generic/regc_locale.c.

Note: Tcl 8.1 does not implement multi-character collating elements like ch (which is the fourth character in the Spanish alphabet: a, b, c, ch, d, e, f, g, h, i...). So the examples below are not supported in Tcl 8.1, but are here for completeness. (Future versions of Tcl may have multi-character collating elements.)
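The single-character case, [[.number-sign.]], is supported and can be tried directly. This sketch uses a made-up input string and assumes that number-sign is among the symbol names your Tcl build defines, as the text states.

```tcl
# [[.number-sign.]] stands for the character #; + repeats it.
set found [regexp {[[.number-sign.]]+} "123###456" m]
puts "$found $m"    ;# 1 ###
```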

Suppose ch and c sort next to each other in your dialect, and ch is treated as an atomic character. A bracket expression that uses two collating symbols, one for ch and one for c, followed by +, matches one or more of ch and c. But it doesn't match an h standing alone.

Here's one tricky and surprising thing about collating symbols. A caret at the start of a bracket expression ([^...) means that, in a locale with multi-character collating elements, the symbol can match more than one character. For instance, the RE [^c]b matches any character other than c, followed by the character b. So the expression matches all of chb.

Again, the two previous examples are not supported in Tcl 8.1, but are here for completeness.

Equivalence Classes

An equivalence class is written as part of a bracket expression, like [[=c=]]. It's any collating element that has the same relative order in the collating sequence as c.

Note: Tcl 8.1 only implements the Unicode locale. It doesn't define any equivalence classes. So, although the Tcl regular expression engine supports equivalence classes, the examples below are not supported in Tcl 8.1. (Future versions of Tcl may define equivalence classes.)

Let's imagine that both of the characters A and a fall at the same place in the collating sequence; they belong to the same equivalence class. In that case, both of the bracket expressions [[=A=]b] and [[=a=]b] are equivalent to writing [Aab]. As another example, if o and ô are members of an equivalence class, then all of the bracket expressions [[=o=]], [[=ô=]], and [oô] match those same two characters.

Noncapturing Subpatterns

There are two reasons to put parentheses around all or part of an RE. One is to make a quantifier (like * or +) apply to the parenthesized part. For instance, the RE Oh,( no!)+ would match Oh, no! as well as Oh, no! no! and so on. The other reason to use parentheses is that they capture the matched text. Captured text is used in back references, in 'matching' variables in the regexp command, as well as in the regsub command.

If you don't want parentheses to capture text, add ?: after the opening parenthesis. For instance, in an RE that starts with (?:http|ftp) and then has a second, capturing set of parentheses for the rest of a URL, the subexpression (?:http|ftp) matches either http or ftp but doesn't capture it. So the back reference \1 will hold the end of the URL (from the second set of parentheses).
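A sketch of a noncapturing group in both regsub and regexp. The URL, the replacement text, and the variable names are illustrative choices, not taken from the original.

```tcl
set x "http://www.example.com/index.html"

# (?:http|ftp) groups without capturing, so \1 is the second group.
regsub {(?:http|ftp)://(.*)} $x {The rest of the URL is \1} y
puts $y       ;# The rest of the URL is www.example.com/index.html

# With regexp, the first subexpression variable gets the same text.
regexp {(?:http|ftp)://(.*)} $x whole rest
puts $rest    ;# www.example.com/index.html
```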

Lookahead Assertions

There are times you'd like to be able to test for a pattern without including that text in the match. For instance, you might want to match the protocol in a URL (like http or ftp), but only if that URL ends with .com. Or maybe you want to match the protocol only if the URL does not end with .edu. In cases like those, you'd like to 'look ahead' and see how the URL ends. A lookahead assertion is handy here.

A positive lookahead has the form (?=re). It matches at any place ahead where there's a substring like re. A negative lookahead has the form (?!re). It matches at any point where the regular expression re does not match.

Consider three regexp commands run against a URL, stored in $x, that ends with .com. The regular expressions may seem complicated, but they're really not bad! The lookahead expression in the first regexp command starts with (?= and ends at the corresponding parenthesis. The 'guts' of this lookahead expression is .*\.com$, which stands for 'a string that ends with .com'. So the first regexp command matches any string containing non-colon (:) characters, as long as the rest of the string ends with .com. The second regexp is similar but looks for a string ending with .edu. Because regexp returns 0, you can see that this doesn't match. The third regexp looks for a string not ending with .edu. It matches because $x ends with .com.
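The three lookahead tests discussed above might be written as follows. This is a sketch: the URL is a placeholder, and the leading ^ anchor and the variable names are assumptions made for the illustration.

```tcl
set x "http://www.example.com"

# Positive lookahead: protocol, but only if the rest ends with .com.
set r1 [regexp {^[^:]+(?=.*\.com$)} $x proto1]
puts "$r1 $proto1"    ;# 1 http

# Positive lookahead for .edu: no match for this URL.
set r2 [regexp {^[^:]+(?=.*\.edu$)} $x]
puts $r2              ;# 0

# Negative lookahead: protocol, as long as the rest does not end with .edu.
set r3 [regexp {^[^:]+(?!.*\.edu$)} $x proto3]
puts "$r3 $proto3"    ;# 1 http
```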

Tcl 8.1 lets you document complex regular expressions by embedding comments. See the next section.

Switches

Tcl 8.1 added command switches to regexp and regsub. For a complete list, see the commands' reference pages. Let's look at two of the most important changes.

Complex REs can be difficult to document. The -expanded switch sets expanded syntax, which lets you add comments within a regular expression. Comments start with a # character; whitespace is ignored. This is mostly for scripting -- but you can also use it on a command line. The same RE can be written in the standard compact syntax or in expanded syntax. In expanded syntax, you can use space and tab characters to indent and make your code clear. To enter actual space and tab characters into your RE, use the escapes \s and \t, respectively.
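Here is one RE written twice, first in compact syntax and then in expanded syntax. The RE itself (a simple decimal number with two captured parts) and the sample string are invented for the illustration.

```tcl
set x "price: 3.14 each"

# Compact syntax.
regexp {([0-9]+)\.([0-9]+)} $x all whole frac

# Expanded syntax: whitespace is ignored and # starts a comment.
regexp -expanded {
    ([0-9]+)    # integer part
    \.          # literal decimal point
    ([0-9]+)    # fractional part
} $x all2 whole2 frac2

puts "$all $whole $frac"       ;# 3.14 3 14
puts "$all2 $whole2 $frac2"    ;# 3.14 3 14
```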

The other important new switch we'll cover here is -line. It enables newline-sensitive matching. By default (without -line), Tcl regular expressions have always treated newlines as an ordinary character. For example, if a string contains several lines (separated by newline characters), the end-of-string anchor $ wouldn't match at any of the embedded newlines. To write code that matched line-by-line, you had to read input lines one by one and do separate matches against each line.

With the -line switch, the metacharacters ^, $, ., and [] treat a newline as the end of a 'line.' So, for example, the regular expression ^San Jose matches a line of input that starts with San Jose, even if it is the second line of the string.

The -line switch actually enables two other switches. You can set part of the features from -line by choosing one of these switches instead: The -lineanchor switch makes ^ and $ match at the beginning and end of a line. The -linestop switch makes . and [] stop matching at a newline character.
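The effect of -line, -lineanchor, and -linestop can be seen with a three-line string. The city names beyond San Jose and the variable names are hypothetical sample data.

```tcl
set text "Los Angeles\nSan Jose\nSan Diego"

# Without -line, ^ matches only at the very start of the string.
set plain [regexp {^San Jose} $text]              ;# 0

# With -line, ^ also matches just after each newline.
set line  [regexp -line {^San Jose} $text]        ;# 1

# -lineanchor alone is enough for ^ and $.
set anchor [regexp -lineanchor {^San Jose} $text] ;# 1

# -linestop alone keeps . from running past a newline.
regexp {San.*} $text across
regexp -linestop {San.*} $text stopped
puts $stopped                                     ;# San Jose

# All lines that start with San, one list element per line.
set lines [regexp -all -line -inline {^San.*$} $text]
puts $lines                                       ;# {San Jose} {San Diego}
```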


Options, Directors

This section introduces two more features from Tcl 8.1. Details are in the re_syntax(n) reference page.

An 8.1 RE can start with embedded options. These look like (?xyz), where xyz are one or more option letters. For instance, (?i)ouch matches OUCH because i is the 'case-insensitive matching' option. Other options include (?e), which marks the rest of the regular expression as an 8.0-style RE -- to let you avoid confusion with the new 8.1 syntax.

An RE can also start with three asterisks, which is a director. For example, ***= is the director that says the rest of the regular expression is literal text. So the RE ***=(?i)ouch matches exactly (?i)ouch; the (?i) isn't treated as an option.
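Both features can be checked in a few lines. This is a sketch; the test strings and variable names are our own.

```tcl
# Embedded option (?i): case-insensitive matching.
set opt   [regexp {(?i)ouch} "OUCH"]              ;# 1
set noopt [regexp {ouch} "OUCH"]                  ;# 0

# Director ***=: everything after it is literal text.
set lit1 [regexp {***=(?i)ouch} {say (?i)ouch}]   ;# 1: the literal text is present
set lit2 [regexp {***=(?i)ouch} "OUCH"]           ;# 0: (?i) is not an option here
```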

Part 3. Summary: Regular Expression Changes in Tcl 8.1

Tcl 8.1 added advanced regular expression syntax. The new re_syntax(n) reference page has details.

The table below summarizes the new syntax:

{m}       Matches m instances of the previous pattern item.
{m}?      Matches m instances of the previous pattern item. Non-greedy.
{m,}      Matches m or more instances of the previous pattern item.
{m,}?     Matches m or more instances of the previous pattern item. Non-greedy.
{m,n}     Matches m through n instances of the previous pattern item.
{m,n}?    Matches m through n instances of the previous pattern item. Non-greedy.
*?        Matches zero or more of the previous pattern item. Non-greedy.
+?        Matches one or more of the previous pattern item. Non-greedy.
??        Matches zero or one of the previous pattern item. Non-greedy.
(?:re)    Groups a subpattern, re, but does not capture the result.
(?=re)    Positive lookahead. Matches the point where re begins.
(?!re)    Negative lookahead. Matches any point where re does not begin.
\c        One of many backslash escapes.
[. .]     Delimits a collating element within a bracketed expression.
[= =]     Delimits an equivalence class within a bracketed expression.
[: :]     Delimits a character class within a bracketed expression.
(?abc)    Embedded options a, b, and c.
***       Director.

Some of the new switches for regexp and regsub are:

-expanded      Enable expanded syntax (for comments).
-line          Enable newline-sensitive matching.
-linestop      Make [] and . stop at newlines.
-lineanchor    Make ^ and $ match the start and end of a line.