The first part of building suru was creating a parser, which was the part with the most false starts. The syntax I had in mind was relatively poorly defined at the time. Unfortunately, this led me to pick up a parser only to find out that it inherently wouldn’t work.
Nom
I literally searched up parsers on lib.rs, and the first result I found was nom. Nom is a combinator parsing library intended for parsing binary formats, but it was also serviceable for parsing text. It boasted very fast speeds and seemed very popular, so I gave it a shot.
Unfortunately, nom is a non-backtracking parser. This means that nom does not support going back and trying a different rule once one has failed partway through. As an example, here’s a very simple task and recipe to be parsed. The task is written with a colon while the recipe is written with an arrow.
a.out: a.o
%.out < %.o
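Until the colon or arrow is parsed, it is ambiguous whether what is being parsed is a task or a recipe. Resolving that means attempting one form and, if it fails, going back to the start of the line to attempt the other. Since nom does not support backtracking, I couldn’t use it for this syntax.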
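To make the ambiguity concrete, here is a minimal hand-written sketch of the "try one form, then go back and try the other" behaviour I was after. It uses only the standard library and is not nom's API. The `Line` type and the `split_on` and `parse_line` helpers are hypothetical names invented for this illustration, and the separators are simply the `:` and `<` from the two lines above.

```rust
/// Hypothetical AST for the two kinds of line in the example above.
#[derive(Debug, PartialEq)]
enum Line<'a> {
    Task { target: &'a str, dep: &'a str },
    Recipe { target: &'a str, dep: &'a str },
}

/// Try to read `input` as `left SEP right`; returns None if SEP is absent.
fn split_on<'a>(input: &'a str, sep: char) -> Option<(&'a str, &'a str)> {
    let (left, right) = input.split_once(sep)?;
    Some((left.trim(), right.trim()))
}

/// Ordered choice with backtracking: attempt the task form first and,
/// if that fails, start over from the beginning of the same input and
/// attempt the recipe form.
fn parse_line(input: &str) -> Option<Line<'_>> {
    if let Some((target, dep)) = split_on(input, ':') {
        return Some(Line::Task { target, dep });
    }
    // The failed attempt consumed nothing, so we can retry from the start.
    let (target, dep) = split_on(input, '<')?;
    Some(Line::Recipe { target, dep })
}

fn main() {
    println!("{:?}", parse_line("a.out: a.o"));
    println!("{:?}", parse_line("%.out < %.o"));
}
```

The key point is that the second attempt starts from the original input rather than from wherever the first attempt gave up.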
Logos
I then tried logos, which also boasted fast speeds and made extensive use of proc macros in order to create state machines. Unfortunately, logos also forbids backtracking, so logos was also unviable as a solution.
Pest
That left pest, which I had avoided because it used a DSL. I wanted to avoid learning another DSL, but unfortunately none of the other parsers supported my use case. Learning pest wasn’t too difficult, although the loose typing was rather annoying.
Pitfalls
The documentation of pest is unfortunately rather out of date. This isn’t too big of a deal, but I spent a lot of time trying to find the isspace function, which didn’t exist.
The implicit WHITESPACE rule would also insert itself into my tokenization rule, and pest didn’t support making a rule silent and atomic at once. Because of that, I had to explicitly specify a whitespace rule everywhere, which was somewhat annoying.
Flaws
My major issue with pest is that the types it returns are not particularly structured. See, rules in pest match characters, and each match is returned as a Pair type. The Pair type contains a span of the matched characters as well as an enum specifying which rule it matched. Unfortunately, all rules are specified under the same enum, meaning that the resulting Rule could be anything. This means all pattern matching has to deal with extraneous cases even if they can’t occur in practice.
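Here is a rough sketch of what that looks like in practice. To keep it self-contained it does not depend on pest. The `Rule` enum and `Pair` struct below are simplified stand-ins I made up for illustration (real pest exposes this information through methods such as `as_rule` and `as_str`), and the variant names are hypothetical rather than taken from suru's grammar.

```rust
/// Stand-in for the single enum generated from a grammar: every rule
/// becomes a variant of the same type. (Hypothetical variant names.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum Rule {
    File,
    Task,
    Recipe,
    Target,
}

/// Simplified stand-in for a matched pair: which rule matched, the
/// matched text, and any nested matches.
#[derive(Debug)]
struct Pair<'a> {
    rule: Rule,
    text: &'a str,
    children: Vec<Pair<'a>>,
}

/// Collect target names from a `Task` pair. The grammar guarantees that
/// a task only contains targets, but the type system cannot see that.
fn task_targets<'a>(task: &Pair<'a>) -> Vec<&'a str> {
    task.children
        .iter()
        .map(|child| match child.rule {
            Rule::Target => child.text,
            // These arms can never run for a well-formed tree, yet they
            // still have to be written out (or hidden behind a wildcard).
            Rule::File | Rule::Task | Rule::Recipe => unreachable!(),
        })
        .collect()
}

fn main() {
    let task = Pair {
        rule: Rule::Task,
        text: "a.out: a.o",
        children: vec![
            Pair { rule: Rule::Target, text: "a.out", children: vec![] },
            Pair { rule: Rule::Target, text: "a.o", children: vec![] },
        ],
    };
    println!("{:?}", task_targets(&task));
}
```

Every function that walks the tree ends up with an arm like that `unreachable!()`, which is exactly the kind of extraneous case a more structured return type would rule out at compile time.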
There’s a crate that allegedly deals with this, but unfortunately it’s unmaintained. There is an issue discussing this, but currently it’s unresolved.