Web Scraping with Raku
2026-04-16
Web scraping involves fetching web pages and extracting structured data from HTML. While Raku does not have the massive ecosystem of scraping libraries that Python enjoys, its regex and grammar capabilities make it surprisingly effective for the task.
Fetching Web Pages
The simplest way to fetch a page is with curl via run:
sub fetch(Str $url --> Str) {
my $proc = run 'curl', '-sS', '-L', '--max-time', '30', $url, :out;
$proc.out.slurp(:close)
}
my $html = fetch('https://example.com');
say $html.chars ~ " characters fetched";
The -L flag follows redirects, and -sS keeps it quiet while still showing errors.
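Assigned to a variable like this, a failed Proc does not throw, so a bad URL just hands back an empty string. A sketch of a stricter variant that captures stderr and checks the exit code (the name fetch-or-die is illustrative):

```raku
# Like fetch, but dies with curl's own error message on failure.
sub fetch-or-die(Str $url --> Str) {
    my $proc = run 'curl', '-sS', '-L', '--max-time', '30', $url, :out, :err;
    my $body = $proc.out.slurp(:close);
    my $err  = $proc.err.slurp(:close);
    die "curl failed for $url: {$err.trim}" if $proc.exitcode != 0;
    $body
}
```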
Using HTTP::Tiny
For more control, install the HTTP::Tiny module. Its get method returns a hash describing the response; the body in content comes back as a Blob of bytes, so call .decode on it when you need text:
use HTTP::Tiny;
my $http = HTTP::Tiny.new;
my %response = $http.get('https://example.com');
if %response<success> {
say "Status: %response<status>";
say "Content length: {%response<content>.decode.chars} characters";
} else {
say "Failed: %response<reason>";
}
Extracting Data with Regex
For simple extraction, Raku regex works well:
Extract All Links
my $html = fetch('https://example.com');
my @links = $html.comb(/ '<a' .+? 'href="' <( <-["]>+ )> '"' /);
.say for @links;
Extract All Image URLs
my @images = $html.comb(/ '<img' .+? 'src="' <( <-["]>+ )> '"' /);
.say for @images;
Extract Page Title
if $html ~~ / '<title>' (.*?) '</title>' / {
say "Title: $0";
}
Extract Meta Description
if $html ~~ / '<meta' .+? 'name="description"' .+? 'content="' (<-["]>+) '"' / {
say "Description: $0";
}
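The two patterns above can be bundled into one pure helper that takes raw HTML and returns whatever it finds; page-meta is an illustrative name, not a standard function:

```raku
# Collect basic page metadata from raw HTML using the patterns above.
sub page-meta(Str $html --> Hash) {
    my %meta;
    %meta<title> = $0.Str.trim
        if $html ~~ / '<title>' (.*?) '</title>' /;
    %meta<description> = $0.Str.trim
        if $html ~~ / '<meta' .+? 'name="description"' .+? 'content="' (<-["]>+) '"' /;
    %meta
}

say page-meta('<head><title>Demo</title><meta name="description" content="A test page"></head>');
```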
A Simple HTML Tag Stripper
sub strip-html(Str $html --> Str) {
$html
.subst(/ '<script' .+? '</script>' /, '', :g)
.subst(/ '<style' .+? '</style>' /, '', :g)
.subst(/ '<' .+? '>' /, '', :g)
.subst('&nbsp;', ' ', :g)
.subst('&lt;', '<', :g)
.subst('&gt;', '>', :g)
.subst('&amp;', '&', :g)  # decode &amp; last so it cannot create new entities
.subst(/ \s+ /, ' ', :g)
.trim
}
my $text = strip-html('<p>Hello <b>World</b> &amp; friends!</p>');
say $text;
Grammar-Based HTML Parser
For structured extraction, a grammar is more robust:
grammar HTMLTable {
regex TOP { .*? <table> .* }  # regex, not token: the frugal .*? needs backtracking to find the table
token table { '<table' <-[>]>* '>' <row>+ '</table>' }
token row { \s* '<tr' <-[>]>* '>' <cell>+ '</tr>' \s* }
token cell { \s* '<t' <[dh]> <-[>]>* '>' <( <-[<]>* )> '</t' <[dh]> '>' \s* }
}
class HTMLTable-Actions {
method TOP($/) { make $<table>.made }
method table($/) { make $<row>.map(*.made).list }
method row($/) { make $<cell>.map(*.made).list }
method cell($/) { make $/.Str.trim }
}
my $html = q:to/END/;
<table>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><td>Alice</td><td>30</td><td>Toronto</td></tr>
<tr><td>Bob</td><td>25</td><td>New York</td></tr>
</table>
END
my $result = HTMLTable.parse($html, actions => HTMLTable-Actions.new);
if $result {
for $result.made -> @row {
say @row.join(" | ");
}
}
Output:
Name | Age | City
Alice | 30 | Toronto
Bob | 25 | New York
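Since the first parsed row holds the column headers, a small follow-on step (illustrative, not part of the grammar) can zip each data row into a hash:

```raku
# Turn a list of rows (first row = headers) into one hash per data row.
sub rows-to-hashes(@rows --> List) {
    my @headers = @rows[0].list;
    @rows[1..*].map(-> @row { %( @headers Z=> @row ) }).list
}

my @table = (<Name Age City>, <Alice 30 Toronto>, <Bob 25 Montreal>);
for rows-to-hashes(@table) -> %rec {
    say "%rec<Name> lives in %rec<City>";
}
```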
Practical Example: Scrape Headlines
sub scrape-headlines(Str $url --> List) {
my $html = fetch($url);
my @headlines;
for $html.comb(/ '<h' (<[1..3]>) <-[>]>* '>' (.*?) '</h' $0 '>' /) -> $match {
my $text = $match.subst(/ '<' <-[>]>* '>' /, '', :g).trim;
@headlines.push($text) if $text.chars > 0;
}
@headlines
}
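Separating the parsing from the fetch makes the extraction testable against a fixed string; same regex, different packaging (headlines-from is an illustrative name):

```raku
# The extraction half of scrape-headlines, runnable on any string.
sub headlines-from(Str $html --> List) {
    $html.comb(/ '<h' (<[1..3]>) <-[>]>* '>' (.*?) '</h' $0 '>' /)
         .map(*.subst(/ '<' <-[>]>* '>' /, '', :g).trim)
         .grep(*.chars)
         .list
}

.say for headlines-from('<h1>Top Story</h1><p>body</p><h2 class="sub">Also News</h2>');
```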
Parallel Scraping
Fetch multiple pages concurrently:
sub scrape-all(@urls --> List) {
my @promises = @urls.map: -> $url {
start {
my $html = fetch($url);
my $title = '';
if $html ~~ / '<title>' (.*?) '</title>' / {
$title = $0.Str;
}
{ url => $url, title => $title, size => $html.chars }
}
};
await @promises
}
my @urls = (
'https://example.com',
'https://raku.org',
);
for scrape-all(@urls) -> %info {
say "{%info<url>}: {%info<title>} ({%info<size>} chars)";
}
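Unbounded start blocks will happily open one connection per URL, which is rude to the target site. hyper with a degree cap keeps only a few requests in flight at a time; the fetcher is injected here so the sketch runs without a network, but in practice you would pass in the fetch sub from earlier:

```raku
# Process URLs with at most $degree concurrent workers, preserving order.
sub scrape-bounded(@urls, :&fetcher!, Int :$degree = 4 --> List) {
    @urls.hyper(:batch(1), :$degree).map(-> $url {
        %( url => $url, size => fetcher($url).chars )
    }).list
}

# For real use: scrape-bounded(@urls, fetcher => &fetch);
```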
Handling Pagination
sub scrape-paginated(Str $base-url, Int :$max-pages = 10 --> List) {
my @all-items;
for 1..$max-pages -> $page {
my $url = "{$base-url}?page={$page}";
my $html = fetch($url);
my @items = $html.comb(/ '<div class="item">' <( .*? )> '</div>' /).map(*.trim);
last unless @items.elems;
@all-items.append(@items);
sleep 1;
}
@all-items
}
Saving Results
Save as CSV
sub save-csv(Str $file, @headers, @rows) {
    # Quote any field containing a comma, quote, or newline,
    # doubling embedded quotes per CSV convention.
    my sub field(Str() $f) {
        $f ~~ /<[",\n]>/ ?? '"' ~ $f.subst('"', '""', :g) ~ '"' !! $f
    }
    my @lines = (@headers, |@rows).map(*.map(&field).join(","));
    $file.IO.spurt(@lines.join("\n") ~ "\n");
    say "Saved {+@rows} rows to $file";
}
Save as JSON
use JSON::Fast;
sub save-json(Str $file, @data) {
$file.IO.spurt(to-json(@data, :pretty));
say "Saved {+@data} items to $file";
}
Respecting Robots.txt
Always check robots.txt before scraping:
sub check-robots(Str $domain, Str $path --> Bool) {
my $robots = fetch("$domain/robots.txt");
my $disallowed = False;
my $in-user-agent = False;
for $robots.lines -> $line {
if $line ~~ /^ 'User-agent:' \s* '*' / {
$in-user-agent = True;
next;
}
if $line ~~ /^ 'User-agent:' / {
$in-user-agent = False;
next;
}
if $in-user-agent && $line ~~ /^ 'Disallow:' \s* (.*) / {
    my $rule = $0.Str.trim;
    # An empty Disallow rule means "allow everything", so ignore blanks.
    $disallowed = True if $rule && $path.starts-with($rule);
}
}
!$disallowed
}
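check-robots couples fetching with parsing, which makes it hard to test. A pure variant that takes the robots.txt body directly, also matching directives case-insensitively as real crawlers do (robots-allows is an illustrative name):

```raku
# Pure parser: decide whether $path is allowed under the User-agent: * rules.
sub robots-allows(Str $robots, Str $path --> Bool) {
    my $in-star = False;
    for $robots.lines -> $line {
        if $line ~~ / :i ^ 'user-agent:' \s* (.*) / {
            $in-star = $0.Str.trim eq '*';
            next;
        }
        if $in-star && $line ~~ / :i ^ 'disallow:' \s* (.*) / {
            my $rule = $0.Str.trim;
            return False if $rule && $path.starts-with($rule);
        }
    }
    True
}

say robots-allows("User-agent: *\nDisallow: /private\n", '/private/data');  # False
```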
Best Practices
- Respect robots.txt and the site's terms of service
- Add delays between requests (at least 1 second)
- Set a User-Agent header identifying your scraper
- Handle errors gracefully: sites go down, pages change, connections time out
- Cache responses during development to avoid hammering the server
- Use APIs when available: many sites offer structured data through APIs, which is always better than scraping HTML
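The caching advice deserves a sketch: turn the URL into a filename and only call out on a cache miss. The fetcher is injected so the example runs without a network; cached-fetch and the .cache directory name are illustrative:

```raku
# Serve repeat requests from disk; only call &fetcher on a cache miss.
sub cached-fetch(Str $url, :&fetcher!, IO() :$cache-dir = '.cache' --> Str) {
    $cache-dir.mkdir unless $cache-dir.d;
    my $key  = $url.subst(/<-[a..zA..Z0..9]>+/, '_', :g);
    my $file = $cache-dir.add($key);
    return $file.slurp if $file.e;
    my $body = fetcher($url);
    $file.spurt($body);
    $body
}
```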
A fetch helper that bakes the delay and User-Agent advice in:
sub polite-fetch(Str $url, Real :$delay = 1.0 --> Str) {
sleep $delay;
my $proc = run 'curl', '-sS', '-L',
'-A', 'RakuScraper/1.0 (educational)',
'--max-time', '30',
$url, :out;
$proc.out.slurp(:close)
}
Raku may not be the most popular choice for web scraping, but its regex and grammar capabilities give it unique advantages for parsing complex HTML structures. For quick extraction tasks, Raku one-liners with .comb are hard to beat.