🦋 Regex Subrules

2026-03-28

One of the biggest improvements Raku's regex engine has over traditional regex is the ability to compose patterns from named, reusable components. These are called subrules, and they turn messy regex into readable, maintainable code.

Built-in Character Classes

Raku ships with a rich set of predefined subrules that you can use inside any regex:

say "abc123" ~~ / <alpha>+ /;    # abc -- alphabetic characters
say "abc123" ~~ / <digit>+ /;    # 123 -- digits
say "abc123" ~~ / <alnum>+ /;    # abc123 -- alphanumeric
say "  hi  " ~~ / <ws> /;        # whitespace
say "abc_123" ~~ / <ident> /;    # abc_123 -- identifier
say "Hello!" ~~ / <upper> /;     # H
say "Hello!" ~~ / <lower>+ /;    # ello
say "café"   ~~ / <print>+ /;    # café -- printable characters

These are more readable than hand-written character classes like [a-zA-Z], and more correct: they are Unicode-aware, so <alpha> also matches the é in café.
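The difference shows up as soon as the input leaves ASCII. A small sketch comparing a hand-written range with the built-in subrule (the variable names are only illustrative):

```raku
my $word = "café";

# A hand-written ASCII range stops at the first non-ASCII letter.
my $ascii = $word ~~ / <[a..zA..Z]>+ /;

# The built-in <alpha> subrule also accepts accented letters.
my $unicode = $word ~~ / <alpha>+ /;

say ~$ascii;     # caf
say ~$unicode;   # café
```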

Defining Your Own Subrules

Use my regex, my token, or my rule to define reusable patterns:

my token number { '-'? \d+ ['.' \d+]? }
my token word   { <alpha> <alnum>* }
my token email  { <[\w . -]>+ '@' <[\w . -]>+ }

say "Price: 42.50" ~~ / <number> /;     # 42.50
say "Hello World"  ~~ / <word> /;        # Hello
say "user@host.com" ~~ / <email> /;      # user@host.com

Once defined, these can be called by name inside angle brackets < >.
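The three declarators differ in how they match: regex backtracks, token does not, and rule behaves like token but also treats whitespace in the pattern as significant. A minimal sketch of the first difference, using two made-up names (ends-x-regex and ends-x-token) that are not part of the examples above:

```raku
# The same pattern, declared two ways.
my regex ends-x-regex { \w+ 'x' }   # regex: may backtrack
my token ends-x-token { \w+ 'x' }   # token: never backtracks

# \w+ first swallows all of "abcx", including the final x.
# Only the regex version can give that x back so that 'x' still matches.
my $with-regex = ("abcx" ~~ / ^ <ends-x-regex> $ /).so;
my $with-token = ("abcx" ~~ / ^ <ends-x-token> $ /).so;

say $with-regex;   # True
say $with-token;   # False
```

For most building blocks token is a reasonable default, since it is faster and more predictable; reach for regex when a pattern really does need to backtrack.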

Subrules as Building Blocks

The power comes from combining subrules:

my token year   { \d ** 4 }
my token month  { \d ** 2 }
my token day    { \d ** 2 }
my token date   { <year> '-' <month> '-' <day> }
my token time   { \d ** 2 ':' \d ** 2 ':' \d ** 2 }
my token datetime { <date> 'T' <time> }

my $stamp = "2026-03-28T14:30:00";
if $stamp ~~ / <datetime> / {
    say $<datetime><date><year>;   # 2026
    say $<datetime><date><month>;  # 03
    say $<datetime><time>;         # 14:30:00
}

Each subrule call creates a named capture, so you get structured data for free.
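Because the captures are already split by name, turning them into real values takes very little code. A sketch, assuming you want a Date object out of the match (the input string and the $next-day variable are invented for this example):

```raku
my token year  { \d ** 4 }
my token month { \d ** 2 }
my token day   { \d ** 2 }
my token date  { <year> '-' <month> '-' <day> }

my $next-day;
if "Released on 2026-03-28" ~~ / <date> / {
    # Prefix + turns each captured substring into a number.
    my $date = Date.new(+$<date><year>, +$<date><month>, +$<date><day>);
    $next-day = $date.later(days => 1);
    say $next-day;   # 2026-03-29
}
```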

Subrules in Grammars

Grammars are essentially collections of subrules. Every token, rule, and regex in a grammar is a subrule:

grammar URL {
    token TOP      { <scheme> '://' <authority> <path>? ['?' <query>]? }
    token scheme    { <alpha>+ }
    token authority { <host> [':' <port>]? }
    token host      { <[\w . -]>+ }
    token port      { \d+ }
    token path      { '/' <[\w . / -]>* }
    token query     { <[\w = & . -]>+ }
}

my $m = URL.parse("https://example.com:8080/api/data?format=json");

say $m<scheme>;             # https
say $m<authority><host>;    # example.com
say $m<authority><port>;    # 8080
say $m<path>;               # /api/data
say $m<query>;              # format=json
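Since every token in a grammar is a subrule in its own right, you can also start parsing from any of them instead of TOP by passing the :rule named argument to parse. A small sketch with a hypothetical HostPort grammar (not the URL grammar above, just a cut-down stand-in):

```raku
grammar HostPort {
    token TOP  { <host> ':' <port> }
    token host { <[\w . \-]>+ }
    token port { \d+ }
}

# Parse a full string through TOP.
my $full = HostPort.parse("example.com:8080");

# Parse with a single subrule as the entry point.
my $port-only = HostPort.parse("8080", :rule<port>);

# parse must consume the whole string, so this one fails.
my $bad = HostPort.parse("80a", :rule<port>);

say ~$full<port>;    # 8080
say ~$port-only;     # 8080
say $bad.defined;    # False
```

This is handy for testing one token at a time while a grammar is still taking shape.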

Calling Subrules with Arguments

Subrules can be parameterized:

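my regex bracketed($open, $close) {
    # Variables are not interpolated inside <-[...]>, so use a
    # negative lookahead to stop at the closing delimiter instead.
    $open <( [ <!before $close> . ]* )> $close
}

say "Hello (world) there" ~~ / <bracketed('(', ')')> /;  # captures "world" as $<bracketed>
say "Hello [world] there" ~~ / <bracketed('[', ']')> /;  # captures "world" as $<bracketed>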

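Note: A parameter is interpolated as literal text in the regex body, but not inside a character class such as <-[...]>, where $close would be read as the individual characters $, c, l, o, s, and e. Grammars provide a more robust way to handle parameterized rules.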
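As a rough sketch of the grammar-based approach, the same idea can be written as a token with parameters. The grammar name Delimited and the inner capture are assumptions made for this example; the arguments are supplied either in the subrule call or through the :args option of parse:

```raku
grammar Delimited {
    token TOP { <bracketed('(', ')')> }

    # $open and $close are interpolated as literal strings.
    token bracketed($open, $close) {
        $open
        $<inner>=[ [ <!before $close> . ]* ]
        $close
    }
}

# Arguments fixed inside TOP.
my $round = Delimited.parse("(world)");

# Arguments chosen by the caller, starting directly at the subrule.
my $square = Delimited.parse("[hello]", :rule<bracketed>, :args(\('[', ']')));

say ~$round<bracketed><inner>;   # world
say ~$square<inner>;             # hello
```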

Lookahead and Lookbehind Subrules

Raku supports zero-width assertions:

# Lookahead: match only if followed by pattern
say "foobar" ~~ / foo <?before bar> /;    # foo

# Negative lookahead
say "foobar" ~~ / foo <!before baz> /;    # foo

# Lookbehind: match only if preceded by pattern
say "foobar" ~~ / <?after foo> bar /;     # bar
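"Zero-width" means the assertion checks the surrounding text without consuming it. A short sketch that makes this visible (the price string is made up for illustration):

```raku
# Lookahead is zero-width: the match ends right after "foo",
# even though "bar" had to be present for it to succeed.
my $ahead = "foobar" ~~ / foo <?before bar> /;
say $ahead.to;    # 3

# Lookbehind: take digits only when a dollar sign comes right before them.
# The "3" is skipped because it is preceded by a space, not by "$".
my $price = 'qty 3, cost $45' ~~ / <?after '$'> \d+ /;
say ~$price;      # 45
```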

The <.subrule> Non-capturing Call

Prefix a subrule call with . to use it without capturing:

my $log = "2026-03-28 ERROR disk full";

# <.ws> matches whitespace but does not capture it
if $log ~~ / (\S+) <.ws> (\S+) <.ws> (.*) / {
    say $0;  # 2026-03-28
    say $1;  # ERROR
    say $2;  # disk full
}

This keeps your match object clean, containing only the data you care about.
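To see the effect, compare the named captures left behind by the capturing and non-capturing forms. A minimal sketch (the input string and variable names are illustrative):

```raku
my $pair = "key = value";

# <ws> captures: the match object gains a "ws" entry.
my $capturing = $pair ~~ / (\w+) <ws> '=' <ws> (\w+) /;
my @with-capture = $capturing.hash.keys;

# <.ws> matches the same text but records nothing by name.
my $quiet = $pair ~~ / (\w+) <.ws> '=' <.ws> (\w+) /;
my @without-capture = $quiet.hash.keys;

say @with-capture;      # [ws]
say @without-capture;   # []
```

The positional captures $0 and $1 are the same in both cases; only the named entries differ.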

Alternation with Proto Tokens

In grammars, you can use proto token to create extensible subrules:

grammar Literal {
    token TOP { <value> }

    proto token value {*}
    token value:sym<integer> { '-'? \d+ }
    token value:sym<float>   { '-'? \d+ '.' \d+ }
    token value:sym<string>  { '"' <( <-["]>* )> '"' }
    token value:sym<bool>    { 'true' | 'false' }
}

say Literal.parse('42');        # Matches value:sym<integer>
say Literal.parse('3.14');      # Matches value:sym<float>
say Literal.parse('"hello"');   # Matches value:sym<string>
say Literal.parse('true');      # Matches value:sym<bool>

Each sym variant is a separate subrule that the proto dispatches to.
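"Extensible" here means new variants can be added without touching the original grammar. A sketch, assuming a hypothetical subclass LiteralWithNull that adds one more candidate to a trimmed-down copy of the grammar:

```raku
grammar Literal {
    token TOP { <value> }

    proto token value {*}
    token value:sym<integer> { '-'? \d+ }
    token value:sym<bool>    { 'true' | 'false' }
}

# Hypothetical extension: the subclass contributes one more variant
# to the inherited proto.
grammar LiteralWithNull is Literal {
    token value:sym<null> { 'null' }
}

my $base-null     = ?Literal.parse('null');
my $extended-null = ?LiteralWithNull.parse('null');
my $extended-int  = ?LiteralWithNull.parse('42');

say $base-null;       # False
say $extended-null;   # True
say $extended-int;    # True
```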

Composing with Role Grammars

Grammars can compose rules from roles, just like classes:

role NumberRules {
    token integer { '-'? \d+ }
    token decimal { '-'? \d+ '.' \d+ }
}

role StringRules {
    token single-quoted { "'" <( <-[']>* )> "'" }
    token double-quoted { '"' <( <-["]>* )> '"' }
}

grammar MyLang does NumberRules does StringRules {
    token TOP { <integer> | <decimal> | <single-quoted> | <double-quoted> }
}

say MyLang.parse("42");
say MyLang.parse("3.14");
say MyLang.parse("'hello'");
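The payoff of putting tokens in a role is that several grammars can share them. A sketch with two hypothetical grammars, Temperature and Offset, that both reuse a cut-down NumberRules role:

```raku
role NumberRules {
    token integer { '-'? \d+ }
}

# Two unrelated grammars composing the same role.
grammar Temperature does NumberRules {
    token TOP { <integer> <[CF]> }
}

grammar Offset does NumberRules {
    token TOP { <integer> 'h' }
}

my $temp   = Temperature.parse("-5C");
my $offset = Offset.parse("12h");

say ~$temp<integer>;     # -5
say ~$offset<integer>;   # 12
```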

Practical Example: Log Line Parser

my token ip-addr {
    [\d ** 1..3] ** 4 % '.'
}

my token timestamp {
    \d ** 4 '-' \d ** 2 '-' \d ** 2
    ' '
    \d ** 2 ':' \d ** 2 ':' \d ** 2
}

my token log-level {
    'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL'
}

my token log-line {
    '[' <timestamp> ']'
    \s+
    <log-level>
    \s+
    <ip-addr>?
    \s*
    $<message>=[\N+]
}

my $line = "[2026-03-28 14:30:00] ERROR 10.0.0.1 Connection refused";
if $line ~~ / <;log-line>; / {
    say "Time:  {$<log-line><timestamp>}";
    say "Level: {$<log-line><log-level>}";
    say "IP:    {$<log-line><ip-addr>}";
    say "Msg:   {$<log-line><message>}";
}
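In practice a log has many lines, and some of them lack the optional IP address or do not parse at all. A sketch of how the same tokens might be applied to several lines; the sample lines, the no-ip placeholder, and the @summary array are invented for this example:

```raku
my token ip-addr   { [\d ** 1..3] ** 4 % '.' }
my token timestamp { \d ** 4 '-' \d ** 2 '-' \d ** 2 ' ' \d ** 2 ':' \d ** 2 ':' \d ** 2 }
my token log-level { 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL' }
my token log-line  {
    '[' <timestamp> ']' \s+ <log-level> \s+ <ip-addr>? \s* $<message>=[\N+]
}

# Sample input invented for this sketch.
my @lines =
    "[2026-03-28 14:30:00] ERROR 10.0.0.1 Connection refused",
    "[2026-03-28 14:30:05] INFO Service restarted",
    "not a log line";

my @summary;
for @lines -> $line {
    if $line ~~ / ^ <log-line> $ / {
        # An optional subrule that did not match leaves no capture behind.
        my $ip = $<log-line><ip-addr> // 'no-ip';
        @summary.push: ~$<log-line><log-level> ~ ' ' ~ $ip;
    }
    else {
        @summary.push: 'unparsed';
    }
}

.say for @summary;
# ERROR 10.0.0.1
# INFO no-ip
# unparsed
```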

Subrules transform regexes from write-only strings into modular, self-documenting patterns. Use them anywhere patterns start getting complex.