catwithtudou

🔬Build A Parser By Rust (Part 1)

This document was copied from Feishu for search indexing, so there may be formatting compatibility issues; viewing the original Feishu document is recommended.

Background#

Recently, I have been following mbrubeck's Robinson project to learn how to write a simple browser engine in Rust. Along the way, formats such as HTML and CSS must be parsed, so you need to write the corresponding parsers yourself.

Writing a parser from scratch is a tedious and error-prone task: besides the specific protocol rules to be parsed, you also have to think about error handling, extensibility, and parsing performance. For that reason, the author recommends building on existing third-party libraries such as pest instead.

Reflecting on my daily development work, I rarely need to build a parser myself. When a specific format or protocol such as JSON or CSV has to be parsed, I usually reach for a third-party library designed for it. However, not every protocol or format has a ready-made parser, especially the many network communication protocols, and existing parsers are often hard to customize. So this is a good opportunity to learn how parsers work and how to construct them, which will come in handy in similar scenarios later.

Note:

  • This document will not focus on deeply exploring the principles of parsers and parser libraries, but rather on "an introduction to parsers and their practice."
  • This series of documents is divided into two parts. The first part mainly discusses understanding parsers and using third-party libraries, while the second part mainly discusses specific practices.
  • The source code appearing in this series of documents can be viewed at https://github.com/catwithtudou/parser_toy

Prerequisite Knowledge#

Below are some background knowledge related to parsers to help understand the following content.

Parser#

The parser mentioned here is actually a broader definition, usually referring to a component that converts information of a certain format into a certain data structure.

It is like decoding information of a certain format into an organized data structure, making it easier to understand and process the information.

For example 🌰: given the text of an arithmetic expression, "1 + 2", we expect the program to compute its result.

To enable the program to recognize arithmetic expressions, a parser for arithmetic expressions can be used to convert it into a structure (left, op, right) for calculation.
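As a minimal sketch of this idea (hypothetical hand-written code, not tied to any library), a parser for this single-operator case can split the text into the (left, op, right) structure and evaluate it:

```rust
// Parse "1 + 2"-style text into (left, op, right), then evaluate it.
fn parse_expr(input: &str) -> Option<(i64, char, i64)> {
    let mut parts = input.split_whitespace();
    let left = parts.next()?.parse().ok()?;       // "1"  -> 1
    let op = parts.next()?.chars().next()?;       // "+"  -> '+'
    let right = parts.next()?.parse().ok()?;      // "2"  -> 2
    Some((left, op, right))
}

fn eval(expr: (i64, char, i64)) -> Option<i64> {
    match expr {
        (l, '+', r) => Some(l + r),
        (l, '-', r) => Some(l - r),
        _ => None, // unsupported operator
    }
}

fn main() {
    let expr = parse_expr("1 + 2").unwrap();
    assert_eq!(expr, (1, '+', 2));
    assert_eq!(eval(expr), Some(3));
    println!("ok");
}
```

The decoding step (text → structure) and the processing step (structure → result) stay cleanly separated, which is exactly the point of having a parser.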

In the field of computer science, parsers are essential in the process of handling data and can be applied in various data processing scenarios, such as the more common ones:

  • In a compiler or interpreter, the parser performs lexical and syntactic analysis of the source code, producing an abstract syntax tree (AST).
  • For the JSON text format commonly used in web applications, the corresponding parser deserializes the text into the data structures the program needs.
  • Others include parsing network communication protocols, scripting languages, database languages, etc. through parsers.

PEG#

Before introducing PEG (Parsing Expression Grammar), let's start from the more familiar regular expressions for easier understanding.

The connection between regular expressions and PEG is mainly that both can match and parse character text through specific syntax when processing character text, while the differences are:

  • In terms of syntax: The former uses a specific syntax to describe string patterns, usually for simple string matching and searching. The latter uses a more complex syntax to describe language structures, usually for handling complex language parsing and analysis needs.
  • In terms of application domains: The former is mainly used for simple text processing needs, such as finding specific patterns in text or validating input formats. The latter is mainly used for handling complex language structures, such as syntax analysis and interpreter construction for programming languages.
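A concrete case of that difference: nested structures such as balanced parentheses need recursion, which grammar rules have and classic regular expressions lack. A hand-written recursive-descent sketch (hypothetical code) of the rule `parens = "(" parens ")" | ""` looks like this:

```rust
// Recursive-descent check for the rule: parens = "(" parens ")" | ""
// (a hand-written sketch; classic regular expressions cannot express this rule)
fn balanced(input: &str) -> Option<&str> {
    match input.strip_prefix('(') {
        // first alternative: "(" parens ")"
        Some(rest) => balanced(rest)?.strip_prefix(')'),
        // empty alternative: succeed without consuming anything
        None => Some(input),
    }
}

fn main() {
    assert_eq!(balanced("((()))"), Some("")); // fully balanced, nothing left over
    assert_eq!(balanced("(()"), None);        // missing a closing paren
    println!("ok");
}
```

The rule refers to itself, so arbitrarily deep nesting is handled by the call stack rather than by an ever-longer pattern.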

Through this introduction, I believe everyone has a simple understanding of PEG.

The reason for introducing PEG is that tools built on top of PEG (called Parser Generators) can be used to create custom Parsers.

Next, let's formally introduce PEG, which is Parsing Expression Grammar:

  1. Introduction to PEG

Parsing Expression Grammar, abbreviated as PEG:

  • It is a type of analytical formal grammar. Introduced by Bryan Ford in 2004, it is related to the family of top-down parsing languages introduced in the 1970s.
  • As a grammar that describes language structures, it can handle more complex language structures compared to regular expressions, due to its recursive characteristics that can describe infinitely nested structures.
  • It uses a simple and flexible way to define grammar rules, which can be used to parse input strings and generate syntax trees.
  • It has advantages in ease of use, correctness, and performance, and provides features such as error reporting and reusable rule templates, making it widely used in parsing and analyzing text.
  2. Introduction to PEG Applications

The syntax of PEG is similar to programming languages, using operators and rules to describe language structures:

  • Operators include “|” (ordered choice), “&” (and-predicate), “?” (optional), etc., while rules describe the specific structure of the language.

  • For example, below is a simple PEG rule that describes the syntax of an integer:

    int := [0-9]+
    

Since it can be directly converted into efficient parser code, there are currently many parsers implemented using PEG, such as ANTLR, PEG.js, etc.
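Tying back to the recursion point above, here is a small hypothetical PEG-style fragment (assumed notation, not from the original article) in which a rule refers to itself to describe arbitrarily nested sums:

```pest
int  = { "-"? ~ ASCII_DIGIT+ }     // an optionally signed integer
expr = { int ~ (" + " ~ expr)? }   // recursion: matches "1", "1 + 2", "1 + 2 + 3", ...
```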

Parser Combinator#

With the previous understanding of Parsers, understanding Parser Combinators becomes relatively easy.

  1. Definition and Concept of Parser Combinator

Simply put, Parser Combinators are components built by combining various parser components.

The idea of Parser Combinators aligns well with software engineering; it is a technique for constructing parsers based on function composition, allowing for the construction of complex parsers by combining small, reusable, and testable parser components. This approach makes the construction of parsers more flexible and extensible, greatly improving development efficiency and facilitating subsequent maintenance.
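The core idea can be sketched in plain Rust without any library (an illustration with hypothetical helper names, not any real crate's API): a parser is just a function from input to an optional (rest, output) pair, and a combinator is a function that builds a new parser out of existing ones.

```rust
// A parser is just a function: input -> Option<(rest_of_input, output)>.
fn literal(expected: &'static str) -> impl Fn(&str) -> Option<(&str, &'static str)> {
    move |input| input.strip_prefix(expected).map(|rest| (rest, expected))
}

// A combinator: run two parsers in sequence and pair their outputs.
fn then<A, B>(
    p1: impl Fn(&str) -> Option<(&str, A)>,
    p2: impl Fn(&str) -> Option<(&str, B)>,
) -> impl Fn(&str) -> Option<(&str, (A, B))> {
    move |input| {
        let (rest, a) = p1(input)?;
        let (rest, b) = p2(rest)?;
        Some((rest, (a, b)))
    }
}

fn main() {
    // Small, individually testable pieces compose into a bigger parser.
    let open_tag = then(literal("<"), literal("html"));
    assert_eq!(open_tag("<html>"), Some((">", ("<", "html"))));
    assert_eq!(open_tag("<body>"), None);
    println!("ok");
}
```

Each piece (`literal`, `then`) can be unit-tested on its own, which is exactly the maintainability benefit described above.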

  2. Parser Combinator and Parser Generator

Parser Combinators are actually parallel concepts to the previously mentioned Parser Generators. Here’s an example:

  • If we consider the parser we want to implement (like a JSON Parser) as a large building,
  • Using a Parser Generator would mean starting from scratch to build that building each time, and similar parts between buildings (like doors and windows) cannot be reused.
  • However, using Parser Combinators is like building with LEGO blocks, continuously constructing small, reusable, and testable components, and then using these components to build the building. When we need to construct a new building, we can reuse the components created earlier, which is very convenient. At the same time, when a parser encounters issues, it is easy to locate a specific component, making subsequent maintenance easier.
  3. Parser Combinator and Parser Generator Based on PEG

In terms of expression, Parser Combinators are more flexible in their expressive power, as they can directly use the features of programming languages to combine and define parsers. In contrast, PEG-based Parser Generators describe parsers through specific syntax rules, and their expressive power is limited by the syntax rules, meaning you need to learn to use the Parser Generator's interface and also master the syntax rules of PEG.

In terms of performance, the performance comparison between Parser Combinators and Parser Generators depends on the specific implementation and usage scenarios. However, generally speaking, Parser Generators typically generate efficient parser code, which may perform better when handling large grammars and complex inputs. On the other hand, Parser Combinators usually incur some performance overhead because they dynamically combine parsers at runtime.

However, currently in Rust, the performance of nom, which is based on Parser Combinators, is somewhat higher than that of pest, which is based on PEG.

Rust Parser Library#

Next, we will introduce classic third-party libraries used in Rust to implement parsers, namely Pest based on PEG and Nom based on Parser Combinators.

pest#

Introduction#

Pest is a general-purpose parser written in Rust, focusing on accessibility, correctness, and performance. It uses PEG as input, providing enhanced expressive power needed to parse complex languages while defining and generating parsers in a concise and elegant way.

It also features automatic error reporting, automatically generating parser trait implementations through derive attributes, and the ability to define multiple parsers in a single file.

Usage Example#

  1. Add pest dependency in cargo.toml
[dependencies]
pest = "2.6"
pest_derive = "2.6"
  2. Create a new src/grammar.pest file to write parsing expression syntax

This grammar defines the parsing rule for field: each character is an ASCII digit, a decimal point, or a minus sign, and + means the pattern appears one or more times.

field = { (ASCII_DIGIT | "." | "-")+ }
  3. Create a new src/parser.rs file to define the parser

The following code defines a parser struct; each time the crate is compiled, pest automatically generates a Parser implementation that matches the rules in the grammar file.

use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "grammar.pest"] // path is relative to src/
pub struct FieldParser; // named to avoid colliding with the pest::Parser trait

// Each time this file is compiled, pest regenerates the parser from the grammar
#[cfg(test)]
mod test {
    use pest::Parser;

    use crate::{FieldParser, Rule};

    #[test]
    pub fn test_parse() {
        let successful_parse = FieldParser::parse(Rule::field, "-273.15");
        println!("{:?}", successful_parse);

        let unsuccessful_parse = FieldParser::parse(Rule::field, "China");
        println!("{:?}", unsuccessful_parse);
    }
}

Specific Usage#

Official Documentation

Parser API#

Pest provides various methods to access successfully parsed results. Below, we will introduce its methods according to the following syntax example:

number = { ASCII_DIGIT+ }                // one or more decimal digits
enclosed = { "(.." ~ number ~ "..)" }    // for instance, "(..1024..)"
sum = { number ~ " + " ~ number }        // for instance, "1024 + 12"
  1. Tokens

Pest uses tokens to represent success, generating two tokens each time a rule matches, representing the start and end of the match, such as:

"3130 abc"
 |   ^ end(number)
 ^ start(number)

Currently, RustRover has a pest plugin that can validate rules, preview tokens, and more.

  2. Nested Rules

If a named rule contains another named rule, tokens will be generated for both, as shown below:

"(..6472..)"
 |  |   |  ^ end(enclosed)
 |  |   ^ end(number)
 |  ^ start(number)
 ^ start(enclosed)

In some scenarios, tokens may not appear at different character positions:

"1773 + 1362"
 |   |  |   ^ end(sum)
 |   |  |   ^ end(number)
 |   |  ^ start(number)
 |   ^ end(number)
 ^ start(number)
 ^ start(sum)
  3. Interface

Tokens will be exposed in the form of a Token enum, which has Start and End variants, and you can call tokens on the parsing result to get an iterator:

let parse_result = DemoParser::parse(Rule::sum, "1773 + 1362").unwrap();
let tokens = parse_result.tokens();

for token in tokens {
    println!("{:?}", token);
}


  4. Pairs

If you consider matching pairs of tokens to explore the parsing tree, Pest provides a Pair type to represent a pair of matched tokens, with the main usage as follows:

  • Determine which rule produced the Pair
  • Use Pair as the original &str
  • Check the internal naming rules that generated the Pair
let pair = DemoParser::parse(Rule::enclosed, "(..6472..) and more text")
    .unwrap().next().unwrap();

assert_eq!(pair.as_rule(), Rule::enclosed);
assert_eq!(pair.as_str(), "(..6472..)");

let inner_rules = pair.into_inner();
println!("{}", inner_rules); // --> [number(3, 7)]

Pairs can have any number of internal rules, and you can return Pairs as an iterator through Pair::into_inner():

let pairs = DemoParser::parse(Rule::sum, "1773 + 1362")
    .unwrap().next().unwrap()
    .into_inner();

let numbers = pairs
    .clone()
    .map(|pair| str::parse(pair.as_str()).unwrap())
    .collect::<Vec<i32>>();
assert_eq!(vec![1773, 1362], numbers);

for (found, expected) in pairs.zip(vec!["1773", "1362"]) {
    assert_eq!(Rule::number, found.as_rule());
    assert_eq!(expected, found.as_str());
}
  5. Parse Method

The derived Parser provides a parse method that returns Result<Pairs, Error>. To access the underlying parsing tree, you need to match or unwrap the result:

// check whether parse was successful
match DemoParser::parse(Rule::enclosed, "(..6472..)") {
    Ok(mut pairs) => {
        let enclosed = pairs.next().unwrap();
        // ...
    }
    Err(error) => {
        // ...
    }
}

Parsing Expression Syntax#

The basic logic of PEG syntax is actually very simple and direct, which can be summarized in three steps:

  • Try to match the rule
  • If successful, try the next step
  • If failed, try another rule

The main characteristics of its syntax are as follows:

  1. Eagerness

When running repeated PEG expressions on an input string, it will greedily (as many times as possible) run the expression, resulting in the following:

  • If the match is successful, it will consume any content it matched and pass the remaining input to the next step of the parser.
  • If the match fails, it will not consume any characters, and if that failure propagates, it will ultimately lead to a parsing failure unless the failure is caught during propagation.
// Expression
ASCII_DIGIT+      // one or more characters from '0' to '9'

// Matching process
"42 boxes"
 ^ Running ASCII_DIGIT+

"42 boxes"
   ^ Successfully took one or more digits!

" boxes"
 ^ Remaining unparsed input.
  2. Ordered Choice

The syntax has an ordered choice operator |, for example, one|two means trying the former one first, and if it fails, then trying the latter two.

If there is a requirement for order, you need to pay attention to the placement of rules in the expression, for example:

  • In the expression "a"|"ab", when matching the string "abc", hitting the previous rule "a" will not parse the subsequent "bc".

Therefore, it is common practice when writing a choice parser to place the longest or most specific choices first and the shortest or most general choices last.

  3. Non-backtracking

In the parsing process, an expression either succeeds or fails.

If successful, it proceeds to the next step; if it fails, the expression fails, and the engine will not backtrack to try again, which is different from regular expressions that have backtracking.

For example, in the following case (where ~ indicates the next step after the preceding rule matches successfully):

word = {     // to recognize a word...
    ANY*     //   take any character, zero or more times...
    ~ ANY    //   followed by any character
}

"frumious"

When matching the string "frumious", ANY* will first consume the entire string, while the next step ANY will not match any content, causing it to fail parsing.

"frumious"
 ^ (word)

"frumious"
         ^ (ANY*) Success! Continue to ANY with remaining input "".
 
""
 ^ (ANY) Failure! Expected one character, but found end of string.

In this scenario, for systems with backtracking capabilities (like regular expressions), it would backtrack one step, "spitting out" a character and trying again.

  4. Unambiguous

Each rule in PEG runs on the remaining part of the input string, consuming as much input as possible. Once a rule completes, the remaining input will be passed to other parts of the parser. For example, the expression ASCII_DIGIT+ indicates matching one or more digits, which will always match the largest possible contiguous sequence of digits. There is no unexpected backtracking that allows subsequent rules to steal some digits in a non-intuitive and non-local manner.

This sharply contrasts with other parsing tools, such as regular expressions and CFG, where the results of rules often depend on some distant code.

Parser Syntax & Built-in Rules#

  1. Important Syntax

Pest's syntax is much smaller than that of regular expressions. Below is a brief overview of the main syntax and its meaning; for details, see the official documentation:

| Syntax | Meaning | Syntax | Meaning |
| --- | --- | --- | --- |
| `foo = { ... }` | regular rule | `baz = @{ ... }` | atomic |
| `bar = _{ ... }` | silent | `qux = ${ ... }` | compound-atomic |
| `#tag = ...` | tags | `plugh = !{ ... }` | non-atomic |
| `"abc"` | exact string | `^"abc"` | case insensitive |
| `'a'..'z'` | character range | `ANY` | any character |
| `foo ~ bar` | sequence | `baz \| qux` | ordered choice |
| `foo*` | zero or more | `bar+` | one or more |
| `baz?` | optional | `qux{n}` | exactly n |
| `qux{m, n}` | between m and n (inclusive) | `&foo` | positive predicate |
| `!bar` | negative predicate | `PUSH(baz)` | match and push |
| `POP` | match and pop | `PEEK` | match without pop |
| `DROP` | pop without matching | `PEEK_ALL` | match entire stack |
  2. Built-in Rules

In addition to ANY, Pest also provides many built-in rules to make parsing text more convenient. Here are a few commonly used ones; for details, you can look it up yourself:

| Built-in Rule | Equivalent |
| --- | --- |
| `ASCII_DIGIT` | `'0'..'9'` |
| `ASCII_ALPHANUMERIC` | any digit or letter, `ASCII_DIGIT \| ASCII_ALPHA` |
| `UPPERCASE_LETTER` | uppercase letter |
| `LOWERCASE_LETTER` | lowercase letter |
| `NEWLINE` | any line-feed format, `"\n" \| "\r\n" \| "\r"` |
| `SPACE_SEPARATOR` | space separator |
| `MATH_SYMBOL` | math symbol |
| `EMOJI` | emoji |

nom#

Introduction#

Nom is a parser combinator library written in Rust, which has the following features:

  • Build safe parsers without compromising speed or memory consumption.
  • Rely on Rust's powerful type system and memory safety to generate both correct and efficient parsers.
  • Provide functions, macros, and traits that abstract away the most error-prone plumbing, and make it easy to combine and reuse parsers when building complex ones.

Nom can support a wide range of application scenarios, such as the following common scenarios:

  • Binary format parsers: Nom's performance is as fast as parsers written in C, and it is not affected by buffer overflow vulnerabilities, with common processing patterns built-in.
  • Text format parsers: It can handle formats like CSV as well as more complex nested formats like JSON, and ships with a number of useful tools for managing the parsed data.
  • Programming language parsers: Nom can serve as a prototype parser for languages, supporting custom error types and reporting, automatically handling whitespace, and constructing AST in place.
  • In addition to the above scenarios, it also supports streaming formats (like HTTP network processing) and parser combinators that are more in line with software engineering.

Usage Example#

Here, we will introduce the "Hex Color Parser" example provided in the nom repo README:

The specific format of hex colors is:

  • Starts with "#", followed by six characters, with every two characters representing the values of the red, green, and blue color channels.

For example, "#2F14DF" means "2F" represents the value of the red channel, "14" represents the value of the green channel, and "DF" represents the value of the blue channel.
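Put differently, each two-character pair is just a base-16 number, which std can already convert (this is exactly what the parser below will delegate to):

```rust
fn main() {
    // "#2F14DF": each channel is a two-digit hexadecimal number
    assert_eq!(u8::from_str_radix("2F", 16), Ok(47));  // red
    assert_eq!(u8::from_str_radix("14", 16), Ok(20));  // green
    assert_eq!(u8::from_str_radix("DF", 16), Ok(223)); // blue
    println!("ok");
}
```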

  1. Add nom dependency in cargo.toml
[dependencies]
nom = "7.1.3"
  2. Create a new src/nom/hex_color.rs file to construct the parsing method hex_color
  • tag will match the starting character pattern, tag("#") returns a function whose return value is IResult<Input,Input,Error>
    • Here, Input is the type of the function's input parameter, the first value is the input value after removing the matched pattern, the second is the matched content, and the last is the error value.
  • The take_while_m_n method provided by nom has the first two parameters as the minimum and maximum match counts, and the last parameter as the matching rule, returning a result similar to the above.
  • The map_res method provided by nom can convert the result obtained from the first parameter according to the pattern of the second parameter.
  • The tuple method provided by nom accepts a group of combinators, applies them in order to the input, and then returns the parsing result as a tuple.
use nom::{AsChar, IResult};
use nom::bytes::complete::tag;
use nom::bytes::complete::take_while_m_n;
use nom::combinator::map_res;
use nom::sequence::tuple;

#[derive(Debug, PartialEq)]
pub struct Color {
    pub red: u8,
    pub green: u8,
    pub blue: u8,
}

// Check if it is a hex digit
pub fn is_hex_digit(c: char) -> bool {
    c.is_hex_digit()
}

// Convert string to decimal result
pub fn to_num(input: &str) -> Result<u8, std::num::ParseIntError> {
    u8::from_str_radix(input, 16)
}

// Match input by two digits according to is_hex_digit rule and convert the result to decimal
pub fn hex_primary(input: &str) -> IResult<&str, u8> {
    map_res(
        take_while_m_n(2, 2, is_hex_digit),
        to_num,
    )(input)
}

// Hex color parser
pub fn hex_color(input: &str) -> IResult<&str, Color> {
    let (input, _) = tag("#")(input)?;
    let (input, (red, green, blue)) = tuple((hex_primary, hex_primary, hex_primary))(input)?;

    Ok((input, Color { red, green, blue }))
}

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn test_hex_color() {
        assert_eq!(hex_color("#2F14DF"), Ok(("", Color {
            red: 47,
            green: 20,
            blue: 223,
        })))
    }
}

Specific Usage#

Parser Result#

In the previous example, the return of the nom parsing method is IResult, which is one of the core structures of nom, representing the return result of nom parsing.

First, the results defined by the parsers constructed by nom are as follows:

  • Ok(...) indicates the content found after successful parsing, while Err(...) indicates that no corresponding content was found during parsing.
  • If parsing is successful, it will return a tuple, where the first will contain all content not matched by the parser, and the second will contain all content matched by the parser.
  • If parsing fails, it may return multiple errors.
                                   ┌─► Ok(
                                   │      what the parser didn't touch,
                                   │      what matched the regex
                                   │   )
             ┌─────────┐           │
 my input───►│my parser├──►either──┤
             └─────────┘           └─► Err(...)

Thus, to represent this model, nom defines the structure IResult<Input, Output, Error>:

  • It can be seen that Input and Output can actually be defined as different types, while Error can be any type that implements the ParseError trait.

Tag & Character Classes#

  1. Tags

Nom refers to simple byte sets as tags. Since these are very common, it also has a built-in tag() function that returns a parser for the given string.

For example, to parse the string "abc", you can use tag("abc").

It should be noted that nom has multiple different definitions of tag, and unless otherwise specified, it is usually recommended to use the following definition to avoid unexpected errors:

pub use nom::bytes::complete::tag;

The signature of the tag function is shown below: tag returns a function, and that returned function is the actual parser, which takes the input and produces an IResult.

Note that a constructor function returning its parser function is a common pattern in nom.

pub fn tag<T, Input, Error: ParseError<Input>>(
    tag: T
) -> impl Fn(Input) -> IResult<Input, Input, Error> where
    Input: InputTake + Compare<T>,
    T: InputLength + Clone, 

Here is an example of a function that implements the use of tag:

use nom::bytes::complete::tag;
use nom::IResult;

pub fn parse_input(input: &str) -> IResult<&str, &str> {
    tag("abc")(input)
}

#[cfg(test)]
mod test {
    use super::*;
    #[test]
    fn test_parse_input() {
        let (leftover_input, output) = parse_input("abcWorld!").unwrap();
        assert_eq!(leftover_input, "World!");
        assert_eq!(output, "abc");
        assert!(parse_input("defWorld").is_err());
    }
}
  2. Character Classes

Considering that tag can only be used for characters at the beginning of sequences, nom provides pre-written parsers known as character classes, which allow accepting any one of a set of characters. Below are some commonly used built-in parsers:

| Parser | Functionality |
| --- | --- |
| `alpha0` / `alpha1` | Recognize zero or more lowercase and uppercase letters; the `1` variant requires at least one character |
| `alphanumeric0` / `alphanumeric1` | Recognize zero or more digits or letters; the `1` variant requires at least one character |
| `digit0` / `digit1` | Recognize zero or more digits; the `1` variant requires at least one character |
| `multispace0` / `multispace1` | Recognize zero or more spaces, tabs, carriage returns, and newlines; the `1` variant requires at least one character |
| `space0` / `space1` | Recognize zero or more spaces and tabs; the `1` variant requires at least one character |
| `newline` | Recognize the newline character |

Here’s a simple example of how to use them:

use nom::character::complete::alpha0;
use nom::IResult;

fn parse_alpha(input: &str) -> IResult<&str, &str> {
    alpha0(input)
}

#[test]
fn test_parse_alpha() {
    let (remaining, letters) = parse_alpha("abc123").unwrap();
    assert_eq!(remaining, "123");
    assert_eq!(letters, "abc");
}

Alternatives & Composition#

  1. Alternatives

Nom provides the alt() combinator to satisfy the choice of multiple parsers, which will execute each parser in the tuple until it finds one that succeeds.

If all parsers in the tuple fail to parse, then an error will be returned.

Below is a simple example to illustrate:

use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;

fn parse_abc_or_def(input: &str) -> IResult<&str, &str> {
    alt((
        tag("abc"),
        tag("def"),
    ))(input)
}

#[test]
fn test_parse_abc_or_def() {
    let (leftover_input, output) = parse_abc_or_def("abcWorld").unwrap();
    assert_eq!(leftover_input, "World");
    assert_eq!(output, "abc");
    let (_, output) = parse_abc_or_def("defWorld").unwrap();
    assert_eq!(output, "def");
    assert!(parse_abc_or_def("ghiWorld").is_err());
}
  2. Composition

In addition to choosing multiple parsers, combining parsers is also a very common requirement, so nom provides built-in combinators.

For example, tuple(), which takes a tuple of parsers and returns Ok with all successfully parsed results as a tuple, or returns the first failing Err parser.

use nom::branch::alt;
use nom::bytes::complete::tag_no_case;
use nom::IResult;
use nom::sequence::tuple;

fn parse_base(input: &str) -> IResult<&str, &str> {
    alt((
        tag_no_case("a"), // Case-insensitive tag
        tag_no_case("t"),
        tag_no_case("c"),
        tag_no_case("g"),
    ))(input)
}

fn parse_pair(input: &str) -> IResult<&str, (&str, &str)> {
    tuple((
        parse_base, parse_base
    ))(input)
}

#[test]
fn test_parse_pair() {
    let (remaining, parsed) = parse_pair("aTcG").unwrap();
    assert_eq!(parsed, ("a", "T"));
    assert_eq!(remaining, "cG");
    assert!(parse_pair("Dct").is_err());
}

In addition to the above, nom also provides the following combinators with similar sequencing behavior:

| Combinator | Usage | Input | Output |
| --- | --- | --- | --- |
| `delimited` | `delimited(char('('), take(2), char(')'))` | `"(ab)cd"` | `Ok(("cd", "ab"))` |
| `preceded` | `preceded(tag("ab"), tag("XY"))` | `"abXYZ"` | `Ok(("Z", "XY"))` |
| `terminated` | `terminated(tag("ab"), tag("XY"))` | `"abXYZ"` | `Ok(("Z", "ab"))` |
| `pair` | `pair(tag("ab"), tag("XY"))` | `"abXYZ"` | `Ok(("Z", ("ab", "XY")))` |
| `separated_pair` | `separated_pair(tag("hello"), char(','), tag("world"))` | `"hello,world!"` | `Ok(("!", ("hello", "world")))` |

Parsers With Custom Return Types#

As mentioned, the Input and Output in IResult can be different types. If we want to convert matched tags into specific values, we can use nom's value combinator to map a successful parse to a fixed value, as in the following example:

use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::combinator::value;
use nom::IResult;

fn parse_bool(input: &str) -> IResult<&str, bool> {
    alt((
        value(true, tag("true")),    // Convert to bool type
        value(false, tag("false")),
    ))(input)
}

#[test]
fn test_parse_bool() {
    let (remaining, parsed) = parse_bool("true|false").unwrap();
    assert_eq!(parsed, true);
    assert_eq!(remaining, "|false");
    assert!(parse_bool(remaining).is_err());
}
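Under the hood, such a combinator simply discards what the inner parser matched and substitutes the fixed value. A plain-Rust sketch of that contract (hypothetical helper names, not nom's implementation):

```rust
// A plain-Rust sketch of a value()-style combinator (not nom's code):
// run the inner parser, then replace whatever it matched with a fixed value.
fn value<T: Clone, O>(
    val: T,
    parser: impl Fn(&str) -> Option<(&str, O)>,
) -> impl Fn(&str) -> Option<(&str, T)> {
    move |input| parser(input).map(|(rest, _)| (rest, val.clone()))
}

// A tag-like helper (hypothetical, mirrors nom's tag behavior).
fn tag(expected: &'static str) -> impl Fn(&str) -> Option<(&str, &'static str)> {
    move |input| input.strip_prefix(expected).map(|rest| (rest, expected))
}

fn main() {
    let parse_true = value(true, tag("true"));
    assert_eq!(parse_true("true|false"), Some(("|false", true)));
    assert_eq!(parse_true("false"), None);
    println!("ok");
}
```

The input stays `&str` while the output becomes `bool`, which is exactly the Input/Output decoupling described above.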

Repeating Predicates and Parsers#

  1. Repeating With Predicates

The predicates here are boolean functions over characters, playing the same role as the condition of a while loop. To support repeatedly consuming input under such a condition, nom provides several categories of predicate parsers, mainly take_till, take_while, and take_until:

| Combinator | Functionality | Usage | Input | Output |
| --- | --- | --- | --- | --- |
| `take_till` | Consumes input until a character satisfies the predicate | `take_till(is_alphabetic)` | `"123abc"` | `Ok(("abc", "123"))` |
| `take_while` | Consumes input while characters satisfy the predicate | `take_while(is_alphabetic)` | `"abc123"` | `Ok(("123", "abc"))` |
| `take_until` | Consumes input until the first occurrence of the given substring | `take_until("world")` | `"Hello world"` | `Ok(("world", "Hello "))` |

Here, we can also add that:

  • The above combinators have "twins" with 1 at the end of their names (e.g. take_while1), which require at least one matching character; otherwise they return an error.
  • The take_while_m_n used earlier is a special case of take_while that consumes between m and n characters.
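The contract behind these predicate combinators is easy to see in plain Rust (an illustrative sketch, not nom's code): scan forward while the predicate holds, then split the input at that point.

```rust
// A plain-Rust sketch of take_while's contract (not nom's implementation):
// consume characters while the predicate holds, return (remaining, matched).
fn take_while(pred: impl Fn(char) -> bool) -> impl Fn(&str) -> (&str, &str) {
    move |input| {
        let end = input.find(|c: char| !pred(c)).unwrap_or(input.len());
        (&input[end..], &input[..end])
    }
}

fn main() {
    let digits = take_while(|c| c.is_ascii_digit());
    assert_eq!(digits("123abc"), ("abc", "123")); // stops at the first non-digit
    assert_eq!(digits("abc"), ("abc", ""));       // zero matches still succeed
    println!("ok");
}
```

The `1`-suffixed twins would additionally return an error when the matched part is empty.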
  2. Repeating Parsers

In addition to repeating single parsers with predicates, nom also provides combinators for repeating parsers, such as many0, which can apply a parser as many times as possible and return a vector of these parsing results, as shown in the following example:

use nom::bytes::complete::tag;
use nom::IResult;
use nom::multi::many0;

fn repeat_parser(s: &str) -> IResult<&str, Vec<&str>> {
    many0(tag("abc"))(s)
}

#[test]
fn test_repeat_parser() {
    assert_eq!(repeat_parser("abcabc"), Ok(("", vec!["abc", "abc"])));
    assert_eq!(repeat_parser("abc123"), Ok(("123", vec!["abc"])));
    assert_eq!(repeat_parser("123123"), Ok(("123123", vec![])));
    assert_eq!(repeat_parser(""), Ok(("", vec![])));
}

Below are some commonly used combinators:

| Combinator | Usage | Input | Output |
| --- | --- | --- | --- |
| `count` | `count(take(2), 3)` | `"abcdefgh"` | `Ok(("gh", vec!["ab", "cd", "ef"]))` |
| `many0` | `many0(tag("ab"))` | `"abababc"` | `Ok(("c", vec!["ab", "ab", "ab"]))` |
| `many_m_n` | `many_m_n(1, 3, tag("ab"))` | `"ababc"` | `Ok(("c", vec!["ab", "ab"]))` |
| `many_till` | `many_till(tag("ab"), tag("ef"))` | `"ababefg"` | `Ok(("g", (vec!["ab", "ab"], "ef")))` |
| `separated_list0` | `separated_list0(tag(","), tag("ab"))` | `"ab,ab,ab."` | `Ok((".", vec!["ab", "ab", "ab"]))` |
| `fold_many0` | `fold_many0(be_u8, \|\| 0, \|acc, item\| acc + item)` | `[1, 2, 3]` | `Ok(([], 6))` |
| `fold_many_m_n` | `fold_many_m_n(1, 2, be_u8, \|\| 0, \|acc, item\| acc + item)` | `[1, 2, 3]` | `Ok(([3], 3))` |
| `length_count` | `length_count(number, tag("ab"))` | `"2ababab"` | `Ok(("ab", vec!["ab", "ab"]))` |
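The behavior of `many0` (apply until failure, collect every result, and succeed even on zero repetitions) can be sketched in plain Rust; this is an illustration of the contract with hypothetical helper names, not nom's actual implementation:

```rust
// A plain-Rust sketch of many0's behavior (not nom's code):
// apply a parser repeatedly until it fails, collecting every result.
fn many0<T>(
    parser: impl Fn(&str) -> Option<(&str, T)>,
) -> impl Fn(&str) -> (&str, Vec<T>) {
    move |mut input| {
        let mut results = Vec::new();
        while let Some((rest, item)) = parser(input) {
            input = rest;
            results.push(item);
        }
        (input, results)
    }
}

// A single-digit parser to repeat.
fn digit(input: &str) -> Option<(&str, u32)> {
    let c = input.chars().next()?;
    let d = c.to_digit(10)?;
    Some((&input[c.len_utf8()..], d))
}

fn main() {
    let digits = many0(digit);
    assert_eq!(digits("123abc"), ("abc", vec![1, 2, 3]));
    assert_eq!(digits("abc"), ("abc", vec![])); // zero repetitions still succeed
    println!("ok");
}
```

Note that the inner parser must consume input on success, otherwise this kind of repetition would loop forever.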

Error Management#

Nom's errors are designed with multiple needs in mind:

  • Indicate which parser failed and the position in the input data.
  • Accumulate more context as errors propagate up the parser chain.
  • Have very low overhead since calling parsers typically discards errors.
  • Be modifiable according to user needs, as some languages require more information.

To meet these needs, the result type of nom parsers is designed as follows:

pub type IResult<I, O, E=nom::error::Error<I>> = Result<(I, O), nom::Err<E>>;

pub enum Err<E> {
    Incomplete(Needed),    // Indicates that the parser did not have enough data to make a decision, usually encountered in streaming scenarios
    Error(E),              // Normal parser error, for example, if a child parser of the alt combinator returns Error, it will try another child parser
    Failure(E),            // Unrecoverable error, for example, if a child parser returns Failure, the alt combinator will not try other branches
}
  1. Common Error Types in nom::Err<E>
  • The default error type nom::error::Error, which will return the specific parser's error and the position of the error in the input.

      #[derive(Debug, PartialEq)]
      pub struct Error<I> {
        /// position of the error in the input data
        pub input: I,
        /// nom error code
        pub code: ErrorKind,
      }
    
    • This error type is fast and has low overhead, suitable for parsers that are called repeatedly, but its functionality is also limited, as it does not return call chain information.
  • To get more information, use nom::error::VerboseError, which returns more information about the chain of parsers that encountered the error, such as the parser types.

      #[derive(Clone, Debug, PartialEq)]
      pub struct VerboseError<I> {
        /// List of errors accumulated by `VerboseError`, containing the affected
        /// part of input data, and some context
        pub errors: crate::lib::std::vec::Vec<(I, VerboseErrorKind)>,
      }
      
      #[derive(Clone, Debug, PartialEq)]
      /// Error context for `VerboseError`
      pub enum VerboseErrorKind {
        /// Static string added by the `context` function
        Context(&'static str),
        /// Indicates which character was expected by the `char` function
        Char(char),
        /// Error kind given by various nom parsers
        Nom(ErrorKind),
      }
    
    • By viewing the original input and error chain, more user-friendly error messages can be constructed. The nom::error::convert_error function can be used to build such messages.
  2. Custom Error Types via the ParseError Trait

You can define your own error types by implementing the ParseError<I> trait.

Since all nom combinators are generic over their error type, you only need to specify it in your parser's result type, and it will be used throughout the chain.

pub trait ParseError<I>: Sized {
    // Indicate which parser encountered an error based on the input position and ErrorKind enum
    fn from_error_kind(input: I, kind: ErrorKind) -> Self;
    // Allow creating a series of errors when backtracking the parser tree (various combinators will add more context)
    fn append(input: I, kind: ErrorKind, other: Self) -> Self;
    // Create an error indicating which character was expected
    fn from_char(input: I, _: char) -> Self {
        Self::from_error_kind(input, ErrorKind::Char)
    }
    // Allow choosing (or accumulating) between errors in combinators like alt
    fn or(self, other: Self) -> Self {
        other
    }
}

You can also implement the ContextError trait to support VerboseError<I> using the context() combinator.

Below is a simple example to introduce its usage, where a debug error type is defined to print additional information each time an error is generated:

use nom::error::{ContextError, ErrorKind, ParseError};

#[derive(Debug)]
struct DebugError {
    message: String,
}

impl ParseError<&str> for DebugError {
    // Print the specific parser type of the error
    fn from_error_kind(input: &str, kind: ErrorKind) -> Self {
        let message = format!("【{:?}】:\t{:?}\n", kind, input);
        println!("{}", message);
        DebugError { message }
    }

    // If encountering multiple errors in combinators, print additional context information
    fn append(input: &str, kind: ErrorKind, other: Self) -> Self {
        let message = format!("【{}{:?}】:\t{:?}\n", other.message, kind, input);
        println!("{}", message);
        DebugError { message }
    }

    // Print the specific expected character
    fn from_char(input: &str, c: char) -> Self {
        let message = format!("【{}】:\t{:?}\n", c, input);
        print!("{}", message);
        DebugError { message }
    }

    fn or(self, other: Self) -> Self {
        let message = format!("{}\tOR\n{}\n", self.message, other.message);
        println!("{}", message);
        DebugError { message }
    }
}

impl ContextError<&str> for DebugError {
    fn add_context(_input: &str, _ctx: &'static str, other: Self) -> Self {
        let message = format!("【{}「{}」】:\t{:?}\n", other.message, _ctx, _input);
        print!("{}", message);
        DebugError { message }
    }
}
  3. Debugging Parsers

During the process of writing parsers, if you need to track the execution process of the parsers, you can use the dbg_dmp function to print the input and output of the parsers:

use nom::bytes::complete::tag;
use nom::error::dbg_dmp;
use nom::IResult;

fn f(i: &[u8]) -> IResult<&[u8], &[u8]> {
    dbg_dmp(tag("abcd"), "tag")(i)
}

#[test]
fn test_dbg_dmp() {
    let a = &b"efghijkl"[..];

    // Will print the following message:
    // tag: Error(Error(Error { input: [101, 102, 103, 104, 105, 106, 107, 108], code: Tag })) at:
    // 00000000        65 66 67 68 69 6a 6b 6c         efghijkl
    assert!(f(a).is_err());
}

Summary#

Through this article, we have covered the prerequisite knowledge of parsers (Parser, PEG, and Parser Combinator) and how to implement parsers in Rust with the classic third-party libraries pest and nom. Whether based on PEG (pest) or parser combinators (nom), both can handle the common scenarios for custom parsers, eliminating the need to write one from scratch and significantly reducing implementation cost. For more demanding situations (performance, implementation effort, documentation quality, etc.), choose the library that best fits the specific scenario.

In the next article, we will use both pest and nom to implement several common parsers to consolidate our understanding.

References#

https://zhuanlan.zhihu.com/p/427767002

https://zh.wikipedia.org/wiki/%E8%A7%A3%E6%9E%90%E8%A1%A8%E8%BE%BE%E6%96%87%E6%B3%95

https://zhuanlan.zhihu.com/p/355364928

https://ohmyweekly.github.io/notes/2021-01-20-pest-grammars/#

https://pest.rs/book/parser_api.html

https://rustmagazine.github.io/rust_magazine_2021/chapter_4/nom_url.html

https://tfpk.github.io/nominomicon/chapter_1.html
