CSV Parser with Grammars
CSV looks simple until you encounter quoted fields, embedded commas, escaped quotes, and newlines inside values. A proper CSV parser needs to handle all of these. Raku grammars make this surprisingly approachable.

The CSV Spec (Simplified)
A CSV file consists of:

- Rows separated by newlines
- Fields separated by commas
- Fields optionally enclosed in double quotes
- Quoted fields can contain commas, newlines, and escaped quotes (`""`)
A Minimal CSV Grammar
Let us start with the basics:

```raku
grammar CSV {
    token TOP            { <record>+ % \n \n? }
    token record         { <field>+ % ',' }
    token field          { <quoted-field> | <plain-field> }
    token plain-field    { <-[,\n"]>* }
    token quoted-field   { '"' <( <quoted-content> )> '"' }
    token quoted-content { [ <-["]> | '""' ]* }
}

my $csv = q:to/END/.trim;
    name,age,city
    Alice,30,Toronto
    Bob,25,"New York"
    END

say CSV.parse($csv);
```

The `<(` and `)>` markers in `quoted-field` capture only the content between the quotes, excluding the quotes themselves.
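To see the capture markers in isolation, here is a standalone sketch (not part of the grammar above): everything outside `<(` and `)>` must still match, but is excluded from what the match object stringifies to.

```raku
# The quotes are required for the match to succeed,
# but only the word between them ends up in the result.
say '"hello"' ~~ / '"' <( \w+ )> '"' /;   # 「hello」
```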
Adding Actions for Data Extraction
The grammar parses the structure, but we want usable data:

```raku
class CSV-Actions {
    method TOP($/)    { make $<record>.map(*.made).list }
    method record($/) { make $<field>.map(*.made).list }
    method field($/)  {
        make $<quoted-field> ?? $<quoted-field>.made !! $<plain-field>.made;
    }
    method plain-field($/)  { make ~$/ }
    method quoted-field($/) { make $<quoted-content>.made }
    method quoted-content($/) {
        make $/.subst('""', '"', :g);  # Unescape doubled quotes
    }
}

my $data = q:to/END/.trim;
    name,age,city
    Alice,30,Toronto
    Bob,25,"New York"
    "Carol ""CJ"" Jones",35,"San Francisco, CA"
    END

my $result = CSV.parse($data, actions => CSV-Actions.new);
for $result.made -> @row {
    say @row.join(' | ');
}
```

Output:

```
name | age | city
Alice | 30 | Toronto
Bob | 25 | New York
Carol "CJ" Jones | 35 | San Francisco, CA
```
Handling Edge Cases
Empty Fields
Empty fields are handled by plain-field matching zero characters:

```raku
my $tricky = q:to/END/.trim;
    a,,c
    ,b,
    ,,
    END

say CSV.parse($tricky, actions => CSV-Actions.new).made;
# ((a  c) ( b ) ( ))
```
Quoted Fields with Newlines
For CSV files where quoted fields span multiple lines, you might expect to need a new grammar:

```raku
grammar CSV-Multiline {
    token TOP            { <record>+ % \n \n? }
    token record         { <field>+ % ',' }
    token field          { <quoted-field> | <plain-field> }
    token plain-field    { <-[,\n"]>* }
    token quoted-field   { '"' <( <quoted-content> )> '"' }
    token quoted-content { [ <-["]> | '""' ]* }
}
```

Actually, the grammar above already handles embedded newlines, because `<-["]>` in quoted-content matches newlines too. The tricky part is that our TOP rule splits records on `\n`, which could break multi-line fields. Let us anchor the parse explicitly:

```raku
grammar CSV-Full {
    token TOP          { ^ <record>+ % \n $ }
    token record       { <field>+ % ',' }
    token field        { <quoted-field> | <plain-field> }
    token plain-field  { <-[,\n"]>* }
    token quoted-field { '"' <( <quoted-inner> )> '"' }
    token quoted-inner { [ <-["]> | '""' ]* }
}
```
A Complete CSV Toolkit
Let us wrap everything in a reusable module-style structure:

```raku
grammar CSV-Grammar {
    token TOP          { <record>+ % \n \n? }
    token record       { <field>+ % ',' }
    token field        { <quoted-field> | <plain-field> }
    token plain-field  { <-[,\n"]>* }
    token quoted-field { '"' <( <quoted-inner> )> '"' }
    token quoted-inner { [ <-["]> | '""' ]* }
}

class CSV-To-Arrays {
    method TOP($/)    { make $<record>.map(*.made).list }
    method record($/) { make $<field>.map(*.made).list }
    method field($/)  {
        make $<quoted-field> ?? $<quoted-field>.made !! $<plain-field>.made;
    }
    method plain-field($/)  { make ~$/ }
    method quoted-field($/) { make $<quoted-inner>.made }
    method quoted-inner($/) { make $/.subst('""', '"', :g) }
}

class CSV-To-Hashes {
    has @.headers;

    method TOP($/) {
        my @records = $<record>.map(*.made).list;
        @!headers = @records.shift.list;
        make @records.map(-> @row {
            my %hash;
            for @!headers.kv -> $i, $h {
                %hash{$h} = @row[$i] // '';
            }
            %hash
        }).list;
    }
    method record($/) { make $<field>.map(*.made).list }
    method field($/)  {
        make $<quoted-field> ?? $<quoted-field>.made !! $<plain-field>.made;
    }
    method plain-field($/)  { make ~$/ }
    method quoted-field($/) { make $<quoted-inner>.made }
    method quoted-inner($/) { make $/.subst('""', '"', :g) }
}
```
Using the Toolkit
```raku
my $csv-data = q:to/END/.trim;
    name,age,city,bio
    Alice,30,Toronto,"Software developer"
    Bob,25,"New York","Loves ""coding"" and coffee"
    Carol,35,"San Francisco, CA",Artist
    END

# As arrays
my $arrays = CSV-Grammar.parse($csv-data, actions => CSV-To-Arrays.new);
for $arrays.made -> @row {
    say @row.raku;
}

# As hashes (first row = headers)
my $hashes = CSV-Grammar.parse($csv-data, actions => CSV-To-Hashes.new);
for $hashes.made -> %row {
    say "{%row<name>} from {%row<city>}: {%row<bio>}";
}
```

Output (from the hash version):

```
Alice from Toronto: Software developer
Bob from New York: Loves "coding" and coffee
Carol from San Francisco, CA: Artist
```
Writing CSV
For completeness, here is a CSV writer:

```raku
sub to-csv(@rows --> Str) {
    @rows.map(-> @fields {
        @fields.map(-> $f {
            if $f ~~ / <[,"\n]> / {
                # Quote the field and double any embedded quotes
                '"' ~ $f.subst('"', '""', :g) ~ '"'
            }
            else {
                ~$f
            }
        }).join(',')
    }).join("\n")
}

my @data = (
    [<name age city>],
    ["Alice", 30, "Toronto"],
    ["Bob", 25, "New York"],
    ['Carol "CJ"', 35, "San Francisco, CA"],
);

say to-csv(@data);
```

Output:

```
name,age,city
Alice,30,Toronto
Bob,25,New York
"Carol ""CJ""",35,"San Francisco, CA"
```
Performance Considerations
For small to medium CSV files (up to a few megabytes), this grammar-based approach works well. For very large files (hundreds of megabytes), you might want a line-by-line approach or the Text::CSV module, which is optimized for throughput. But for correctness and readability, grammars are hard to beat.
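As a rough sketch of the line-by-line alternative (an assumption, not part of the toolkit above), you can stream a file and parse each physical line against the record rule instead of slurping the whole thing. Note the caveat: this cannot handle quoted fields that span multiple lines, since those cross line boundaries.

```raku
# Assumes the CSV-Grammar and CSV-To-Arrays definitions from above.
# Lazily yields one parsed row at a time instead of holding the
# whole file in memory. Breaks on quoted fields containing newlines.
sub csv-lines(Str $path) {
    gather for $path.IO.lines -> $line {
        my $m = CSV-Grammar.parse($line, :rule<record>,
                                  :actions(CSV-To-Arrays.new));
        take $m.made if $m;
    }
}
```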
This CSV parser demonstrates the real-world value of Raku grammars: they handle complex parsing rules that would be painful with regular expressions alone, and the action classes give you clean data transformation as a bonus.