Web Scraping with Raku
2026-04-16
Web scraping involves fetching web pages and extracting structured data from HTML. While Raku does not have the massive ecosystem of scraping libraries that Python enjoys, its regex and grammar capabilities make it surprisingly effective for the task.
Fetching Web Pages
The simplest way to fetch a page is with curl via run:
sub fetch(Str $url --> Str) {
my $proc = run 'curl', '-sS', '-L', '--max-time', '30', $url, :out;
$proc.out.slurp(:close)
}
my $html = fetch('https://example.com');
say $html.chars ~ " characters fetched";
The -L flag follows redirects, and -sS keeps it quiet while still showing errors.
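Assigned to a variable like this, a failed Proc does not throw, so a bad URL just hands back an empty string. A sketch of a stricter variant that captures stderr and checks the exit code (the name fetch-or-die is illustrative):

```raku
# Like fetch, but dies with curl's own error message on failure.
sub fetch-or-die(Str $url --> Str) {
    my $proc = run 'curl', '-sS', '-L', '--max-time', '30', $url, :out, :err;
    my $body = $proc.out.slurp(:close);
    my $err  = $proc.err.slurp(:close);
    die "curl failed for $url: {$err.trim}" if $proc.exitcode != 0;
    $body
}
```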
Using HTTP::Tiny
For more control, install the HTTP::Tiny module. Its get method returns a hash describing the response; the body in content comes back as a Blob of bytes, so call .decode on it when you need text:
use HTTP::Tiny;
my $http = HTTP::Tiny.new;
my %response = $http.get('https://example.com');
if %response<success> {
say "Status: %response<status>";
say "Content length: {%response<content>.decode.chars} characters";
} else {
say "Failed: %response<reason>";
}
Extracting Data with Regex
For simple extraction, Raku regex works well:
Extract All Links
my $html = fetch('https://example.com');
my @links = $html.comb(/ '<a' .+? 'href="' <( <-["]>+ )> '"' /);
.say for @links;
Extract All Image URLs
my @images = $html.comb(/ '<img' .+? 'src="' <( <-["]>+ )> '"' /);
.say for @images;
Extract Page Title
if $html ~~ / '<title>' (.*?) '</title>' / {
say "Title: $0";
}
Extract Meta Description
if $html ~~ / '<meta' .+? 'name="description"' .+? 'content="' (<-["]>+) '"' / {
say "Description: $0";
}
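The two patterns above can be bundled into one pure helper that takes raw HTML and returns whatever it finds; page-meta is an illustrative name, not a standard function:

```raku
# Collect basic page metadata from raw HTML using the patterns above.
sub page-meta(Str $html --> Hash) {
    my %meta;
    %meta<title> = $0.Str.trim
        if $html ~~ / '<title>' (.*?) '</title>' /;
    %meta<description> = $0.Str.trim
        if $html ~~ / '<meta' .+? 'name="description"' .+? 'content="' (<-["]>+) '"' /;
    %meta
}

say page-meta('<head><title>Demo</title><meta name="description" content="A test page"></head>');
```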
A Simple HTML Tag Stripper
sub strip-html(Str $html --> Str) {
$html
.subst(/ '<script' .+? '</script>' /, '', :g)
.subst(/ '<style' .+? '</style>' /, '', :g)
.subst(/ '<' .+? '>' /, '', :g)
.subst('&nbsp;', ' ', :g)
.subst('&lt;', '<', :g)
.subst('&gt;', '>', :g)
.subst('&amp;', '&', :g)  # decode &amp; last so it cannot create new entities
.subst(/ \s+ /, ' ', :g)
.trim
}
my $text = strip-html('<p>Hello <b>World</b> &amp; friends!</p>');
say $text;
Grammar-Based HTML Parser
For structured extraction, a grammar is more robust:
grammar HTMLTable {
regex TOP { .*? <table> .* }  # regex, not token: the frugal .*? needs backtracking to find the table
token table { '<table' <-[>]>* '>' <row>+ '</table>' }
token row { \s* '<tr' <-[>]>* '>' <cell>+ '</tr>' \s* }
token cell { \s* '<t' <[dh]> <-[>]>* '>' <( <-[<]>* )> '</t' <[dh]> '>' \s* }
}
class HTMLTable-Actions {
method TOP($/) { make $<table>.made }
method table($/) { make $<row>.map(*.made).list }
method row($/) { make $<cell>.map(*.made).list }
method cell($/) { make $/.Str.trim }
}
my $html = q:to/END/;
<table>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><td>Alice</td><td>30</td><td>Toronto</td></tr>
<tr><td>Bob</td><td>25</td><td>New York</td></tr>
</table>
END
my $result = HTMLTable.parse($html, actions => HTMLTable-Actions.new);
if $result {
for $result.made -> @row {
say @row.join(" | ");
}
}
Output:
Name | Age | City
Alice | 30 | Toronto
Bob | 25 | New York
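Since the first parsed row holds the column headers, a small follow-on step (illustrative, not part of the grammar) can zip each data row into a hash:

```raku
# Turn a list of rows (first row = headers) into one hash per data row.
sub rows-to-hashes(@rows --> List) {
    my @headers = @rows[0].list;
    @rows[1..*].map(-> @row { %( @headers Z=> @row ) }).list
}

my @table = (<Name Age City>, <Alice 30 Toronto>, <Bob 25 Montreal>);
for rows-to-hashes(@table) -> %rec {
    say "%rec<Name> lives in %rec<City>";
}
```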
Practical Example: Scrape Headlines
sub scrape-headlines(Str $url --> List) {
my $html = fetch($url);
my @headlines;
for $html.comb(/ '<h' (<[1..3]>) <-[>]>* '>' (.*?) '</h' $0 '>' /) -> $match {
my $text = $match.subst(/ '<' <-[>]>* '>' /, '', :g).trim;
@headlines.push($text) if $text.chars > 0;
}
@headlines
}
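Separating the parsing from the fetch makes the extraction testable against a fixed string; same regex, different packaging (headlines-from is an illustrative name):

```raku
# The extraction half of scrape-headlines, runnable on any string.
sub headlines-from(Str $html --> List) {
    $html.comb(/ '<h' (<[1..3]>) <-[>]>* '>' (.*?) '</h' $0 '>' /)
         .map(*.subst(/ '<' <-[>]>* '>' /, '', :g).trim)
         .grep(*.chars)
         .list
}

.say for headlines-from('<h1>Top Story</h1><p>body</p><h2 class="sub">Also News</h2>');
```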
Parallel Scraping
Fetch multiple pages concurrently:
sub scrape-all(@urls --> List) {
my @promises = @urls.map: -> $url {
start {
my $html = fetch($url);
my $title = '';
if $html ~~ / '<title>' (.*?) '</title>' / {
$title = $0.Str;
}
{ url => $url, title => $title, size => $html.chars }
}
};
await @promises
}
my @urls = (
'https://example.com',
'https://raku.org',
);
for scrape-all(@urls) -> %info {
say "{%info<url>}: {%info<title>} ({%info<size>} chars)";
}
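Unbounded start blocks will happily open one connection per URL, which is rude to the target site. hyper with a degree cap keeps only a few requests in flight at a time; the fetcher is injected here so the sketch runs without a network, but in practice you would pass in the fetch sub from earlier:

```raku
# Process URLs with at most $degree concurrent workers, preserving order.
sub scrape-bounded(@urls, :&fetcher!, Int :$degree = 4 --> List) {
    @urls.hyper(:batch(1), :$degree).map(-> $url {
        %( url => $url, size => fetcher($url).chars )
    }).list
}

# For real use: scrape-bounded(@urls, fetcher => &fetch);
```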
Handling Pagination
sub scrape-paginated(Str $base-url, Int :$max-pages = 10 --> List) {
my @all-items;
for 1..$max-pages -> $page {
my $url = "{$base-url}?page={$page}";
my $html = fetch($url);
my @items = $html.comb(/ '<div class="item">' <( .*? )> '</div>' /).map(*.trim);
last unless @items.elems;
@all-items.append(@items);
sleep 1;
}
@all-items
}
Saving Results
Save as CSV
sub save-csv(Str $file, @headers, @rows) {
    # Quote any field containing a comma, quote, or newline,
    # doubling embedded quotes per CSV convention.
    my sub field(Str() $f) {
        $f ~~ /<[",\n]>/ ?? '"' ~ $f.subst('"', '""', :g) ~ '"' !! $f
    }
    my @lines = (@headers, |@rows).map(*.map(&field).join(","));
    $file.IO.spurt(@lines.join("\n") ~ "\n");
    say "Saved {+@rows} rows to $file";
}
Save as JSON
use JSON::Fast;
sub save-json(Str $file, @data) {
$file.IO.spurt(to-json(@data, :pretty));
say "Saved {+@data} items to $file";
}
Respecting Robots.txt
Always check robots.txt before scraping:
sub check-robots(Str $domain, Str $path --> Bool) {
my $robots = fetch("$domain/robots.txt");
my $disallowed = False;
my $in-user-agent = False;
for $robots.lines -> $line {
if $line ~~ /^ 'User-agent:' \s* '*' / {
$in-user-agent = True;
next;
}
if $line ~~ /^ 'User-agent:' / {
$in-user-agent = False;
next;
}
if $in-user-agent && $line ~~ /^ 'Disallow:' \s* (.*) / {
    my $rule = $0.Str.trim;
    # An empty Disallow rule means "allow everything", so ignore blanks.
    $disallowed = True if $rule && $path.starts-with($rule);
}
}
!$disallowed
}
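check-robots couples fetching with parsing, which makes it hard to test. A pure variant that takes the robots.txt body directly, also matching directives case-insensitively as real crawlers do (robots-allows is an illustrative name):

```raku
# Pure parser: decide whether $path is allowed under the User-agent: * rules.
sub robots-allows(Str $robots, Str $path --> Bool) {
    my $in-star = False;
    for $robots.lines -> $line {
        if $line ~~ / :i ^ 'user-agent:' \s* (.*) / {
            $in-star = $0.Str.trim eq '*';
            next;
        }
        if $in-star && $line ~~ / :i ^ 'disallow:' \s* (.*) / {
            my $rule = $0.Str.trim;
            return False if $rule && $path.starts-with($rule);
        }
    }
    True
}

say robots-allows("User-agent: *\nDisallow: /private\n", '/private/data');  # False
```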
Best Practices
- Respect robots.txt and the site's terms of service
- Add delays between requests (at least 1 second)
- Set a User-Agent header identifying your scraper
- Handle errors gracefully: sites go down, pages change, connections time out
- Cache responses during development to avoid hammering the server
- Use APIs when available: many sites offer structured data through APIs, which is always better than scraping HTML
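The caching advice deserves a sketch: turn the URL into a filename and only call out on a cache miss. The fetcher is injected so the example runs without a network; cached-fetch and the .cache directory name are illustrative:

```raku
# Serve repeat requests from disk; only call &fetcher on a cache miss.
sub cached-fetch(Str $url, :&fetcher!, IO() :$cache-dir = '.cache' --> Str) {
    $cache-dir.mkdir unless $cache-dir.d;
    my $key  = $url.subst(/<-[a..zA..Z0..9]>+/, '_', :g);
    my $file = $cache-dir.add($key);
    return $file.slurp if $file.e;
    my $body = fetcher($url);
    $file.spurt($body);
    $body
}
```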
A fetch helper that bakes the delay and User-Agent advice in:
sub polite-fetch(Str $url, Real :$delay = 1.0 --> Str) {
sleep $delay;
my $proc = run 'curl', '-sS', '-L',
'-A', 'RakuScraper/1.0 (educational)',
'--max-time', '30',
$url, :out;
$proc.out.slurp(:close)
}
Raku may not be the most popular choice for web scraping, but its regex and grammar capabilities give it unique advantages for parsing complex HTML structures. For quick extraction tasks, Raku one-liners with .comb are hard to beat.