Background#
In the previous article (Build A Parser By Rust (Part 1)), we covered the basic concepts behind parsers and how to use Rust's parser libraries. In this article, we will put that knowledge into practice with a concrete case, to help you better understand the concepts and master their practical application. Along the way, we will also look at some design aspects of parser implementations, providing ideas for building custom parsers in the future.
We will walk through implementing a parser for a common and relatively simple format, JSON, using both nom and pest.
Note:
- The versions of the dependencies used below for nom or pest are the latest versions.
[dependencies]
pest = "2.6"
pest_derive = "2.6"
nom = "7.1.3"
- In the following practice, only some important or previously unmentioned API meanings will be explained. For more information, please refer to the previous text or the library documentation.
- The parser constructed in this document only verifies the correctness of the parsing and does not conduct related performance testing. Interested readers can try it themselves.
The specific source code can be found at:
JSON Standard#
JSON (JavaScript Object Notation) is a lightweight data interchange format. It represents data in an easily readable text format, making it easy for humans to read and write. The JSON format consists of key-value pairs and uses a syntax similar to JavaScript objects, hence the name JavaScript Object Notation. JSON is commonly used for transmitting data over networks, storing configuration information, and exchanging data between different systems. It is very common in web development, such as for data transmission in APIs and front-end and back-end data interaction. JSON is also widely used in mobile application development and big data processing. Due to its simplicity and readability, JSON has become a standard format for data exchange in many applications.
If we want to implement a JSON parser, we first need to understand the JSON standard protocol, which is mainly divided into the following six parts:
JSON Element | Description
---|---
Whitespace | Whitespace can be inserted between any pair of tokens.
Number | A number is much like a C or Java number, except that octal and hexadecimal formats are not used.
String | A string is a sequence of zero or more Unicode characters wrapped in double quotes, using backslash escapes. A character is represented as a single-character string.
Value | A value can be a string in double quotes, a number, true or false, null, an object, or an array. These structures can be nested.
Array | An array is an ordered collection of values. An array begins with a left bracket [ and ends with a right bracket ], with values separated by commas ,.
Object | An object is an unordered collection of name/value pairs. An object begins with a left brace { and ends with a right brace }. Each name is followed by a colon :, and name/value pairs are separated by commas ,.
As we can see, the JSON standard defines its data types and their parsing rules clearly, and they are relatively simple.
Next, we will implement the parser using nom and pest based on this standard. The specific code paths are in nom/json and pest/json.
Implementation Based on Nom#
JSON Model#
Here we use an enum to represent JSON Value excluding whitespace:
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum JsonValue {
Str(String),
Boolean(bool),
Num(f64),
Array(Vec<JsonValue>),
Object(HashMap<String, JsonValue>),
Null,
}
Specific Type Parsing#
- Whitespace
From the standard above, a whitespace element can be any of the following characters; during parsing we consume such characters until some other element is encountered:
- space -> " "
- linefeed -> "\n"
- carriage return -> "\r"
- horizontal tab -> "\t"
Here, nom gives us two ways to implement this: either use the built-in function multispace0 directly, or build the parsing function with take_while.
As mentioned earlier, take_while is a predicate parser: it keeps consuming input until the input no longer satisfies the predicate.
// whitespace JSON space parsing (equivalent to nom built-in function multispace0)
fn whitespace(i: &str) -> IResult<&str, &str> {
let chars = " \t\r\n";
take_while(move |c| chars.contains(c))(i)
}
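As a quick sanity check, here is a small test (a minimal sketch, assuming the whitespace function above and nom's IResult are in scope):
#[test]
fn test_whitespace() {
    // consumes leading whitespace and returns (remaining, matched)
    assert_eq!(whitespace(" \t\r\n{}"), Ok(("{}", " \t\r\n")));
    // like multispace0, a zero-length match also succeeds
    assert_eq!(whitespace("abc"), Ok(("abc", "")));
}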
- Number
From the standard, JSON supports positive and negative numbers, decimals, and scientific notation. Although we could parse numbers with combinations of parsers such as alt and be_f64, it is more common in this scenario to use nom's built-in double function, as in the following example:
use nom::number::complete::double;
use nom::error::ErrorKind;
use nom::Err;
let parser = |s| {
    double(s)
};
assert_eq!(parser("1.1"), Ok(("", 1.1)));
assert_eq!(parser("123E-02"), Ok(("", 1.23)));
assert_eq!(parser("123K-01"), Ok(("K-01", 123.0)));
assert_eq!(parser("abc"), Err(Err::Error(("abc", ErrorKind::Float))));
- String
Here we need to handle both the quoted string as a whole and the content inside the quotes:
- First, for the string as a whole, there are three possible cases after the opening quote. Apart from the empty-string case, the quotes can be stripped with combinators to obtain the content between them. There are many ways to combine combinators; here are two common approaches:
  - alt + delimited: parse according to the overall structure of the string
  - preceded + cut + terminated: parse the characters in order
// string entire string parsing
fn string(i: &str) -> IResult<&str, &str> {
context(
"string",
preceded(char('\"'), cut(terminated(parse_str, char('\"')))))(i)
// parse_str will be described later
}
fn string(i: &str) -> IResult<&str, &str> {
context(
"string",
alt((tag("\"\""), delimited(tag("\""), parse_str, tag("\"")))),
)(i)
}
The cut combinator prevents backtracking: once the inner parser fails, cut turns the error into a failure, so combinators such as alt stop immediately instead of trying other parsing paths. This is very useful for avoiding unnecessary work and for producing clearer parse errors. Here is the official example for better understanding:
use nom::combinator::{cut, rest};
use nom::branch::alt;
use nom::character::complete::{digit1, one_of};
use nom::sequence::preceded;
use nom::error::{Error, ErrorKind};
use nom::{Err, IResult};
fn parser(input: &str) -> IResult<&str, &str> {
    alt((
        preceded(one_of("+-"), cut(digit1)),
        rest
    ))(input)
}
assert_eq!(parser("+10 ab"), Ok((" ab", "10")));
assert_eq!(parser("ab"), Ok(("", "ab")));
assert_eq!(parser("+"), Err(Err::Failure(Error { input: "", code: ErrorKind::Digit })));
- Then, once we have the content between the quotes, we need to handle escape characters to get the actual content. nom provides a dedicated function for handling escapes, escaped, invoked as escaped(normal, control, escapable), where the parameters mean:
  - normal: a parser matching ordinary characters; it must not consume the control character
  - control: the control character that introduces an escape (e.g. the \ used in most languages)
  - escapable: a parser matching the characters that may follow the control character
// Official example
use nom::bytes::complete::escaped;
use nom::character::complete::{digit1, one_of};
fn esc(s: &str) -> IResult<&str, &str> {
    // digit1: built-in parser function, matches one or more digits
    // '\\': the backslash character '\' used as the control character
    // r#""n\"#: a raw string literal containing the characters ", n, \,
    //          i.e. the characters allowed to follow a backslash
    escaped(digit1, '\\', one_of(r#""n\"#))(s)
}
assert_eq!(esc("123;"), Ok((";", "123")));
assert_eq!(esc(r#"12\"34;"#), Ok((";", r#"12\"34"#)));
- Finally, based on the escaped function and the JSON standard, we construct the parse_str function. In this scenario the three arguments mean:
  - normal: matches any codepoint except ", \, or control characters, as the standard puts it
  - '\\': in JSON the escape character is likewise the backslash
  - escapable: matches the escape characters described in the standard, such as ", \, /, b, etc. Note that hexadecimal escapes (\uXXXX) need to be handled separately.
- Note in particular that peek, the built-in function used for the hexadecimal case, does not consume any input, which allows the subsequent parsing to proceed normally.
// parse_str single string parsing
fn parse_str(i: &str) -> IResult<&str, &str> {
escaped(normal, '\\', escapable)(i)
}
// normal ordinary character parsing
fn normal(i: &str) -> IResult<&str, &str> {
take_till1(|c: char| c == '\\' || c == '"' || c.is_ascii_control())(i)
}
// escapable escape character parsing
fn escapable(i: &str) -> IResult<&str, &str> {
context(
"escaped",
alt((
tag("\""),
tag("\\"),
tag("/"),
tag("b"),
tag("f"),
tag("n"),
tag("r"),
tag("t"),
hex
)))(i)
}
// hex hexadecimal escape parsing (e.g. u00e9)
fn hex(i: &str) -> IResult<&str, &str> {
    context(
        "hex",
        preceded(
            // peek checks that a 'u' follows without consuming it
            peek(tag("u")),
            // then consume exactly 5 characters: the 'u' plus 4 hex digits
            take_while_m_n(5, 5, |c: char| c.is_ascii_hexdigit() || c == 'u'),
        ))(i)
}
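With these pieces in place, string and parse_str behave as follows (a minimal sketch; note that escaped only recognizes escape sequences, it does not decode them, so the backslashes remain in the output):
#[test]
fn test_string() {
    // plain content between quotes
    assert_eq!(string("\"hello\" rest"), Ok((" rest", "hello")));
    // escape sequences (including \u00e9) are accepted but kept verbatim
    assert_eq!(string(r#""a\"b\u00e9""#), Ok(("", r#"a\"b\u00e9"#)));
}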
- Value
Having implemented the parsers for whitespace, numbers, and strings, we can now complete the basic types boolean and null:
// boolean boolean data type parsing
fn boolean(i: &str) -> IResult<&str, bool> {
alt((
value(true, tag("true")),
value(false, tag("false"))
))(i)
}
// null Null parsing
fn null(i: &str) -> IResult<&str, JsonValue> {
map(tag("null"), |_| JsonValue::Null)(i)
}
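These two parsers can be checked directly (a minimal sketch):
#[test]
fn test_boolean_null() {
    assert_eq!(boolean("true, ..."), Ok((", ...", true)));
    assert_eq!(boolean("false"), Ok(("", false)));
    assert_eq!(null("null"), Ok(("", JsonValue::Null)));
}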
Based on the type parsers implemented so far, we can now construct the value parser (the composite types are implemented later):
In the implementation below there is one piece of syntax that may be hard to read, so here is a brief explanation:
- The map function takes a nom parser plus a function implementing FnMut(O1) -> O2.
- A tuple enum variant such as JsonValue::Num is itself a constructor function, so it can be passed to map directly in place of a closure.
// json_value JsonValue parsing
fn json_value(i: &str) -> IResult<&str, JsonValue> {
context(
"json value",
delimited(
whitespace,
alt((
map(string, |s| JsonValue::Str(String::from(s))),
map(double, JsonValue::Num),
map(boolean, JsonValue::Boolean),
null,
map(array, JsonValue::Array),
map(object, JsonValue::Object)
)),
whitespace,
),
)(i)
}
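For the scalar cases this already works end to end (a minimal sketch; array and object are implemented below):
#[test]
fn test_json_value_scalars() {
    assert_eq!(json_value(" true "), Ok(("", JsonValue::Boolean(true))));
    assert_eq!(json_value("\t42 "), Ok(("", JsonValue::Num(42.0))));
    assert_eq!(json_value("\"hi\""), Ok(("", JsonValue::Str(String::from("hi")))));
    assert_eq!(json_value("null"), Ok(("", JsonValue::Null)));
}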
- Array
According to the standard description of arrays:
- First, use delimited to strip the left and right brackets, making it easier to parse the content in between.
- Then use the built-in function separated_list0 to parse the comma-separated content within the brackets into a Vec<JsonValue>:
// array array parsing
fn array(i: &str) -> IResult<&str, Vec<JsonValue>> {
context(
"array",
delimited(
tag("["),
separated_list0(tag(","), delimited(whitespace, json_value, whitespace)),
tag("]"),
),
)(i)
}
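A quick check of the array parser (a minimal sketch):
#[test]
fn test_array() {
    let expected = vec![
        JsonValue::Num(1.0),
        JsonValue::Str(String::from("two")),
        JsonValue::Null,
    ];
    assert_eq!(array(r#"[ 1 , "two", null ]"#), Ok(("", expected)));
    // separated_list0 allows zero elements, so empty arrays also work
    assert_eq!(array("[]"), Ok(("", vec![])));
}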
- Object
For complex parsers like object, we can implement them by splitting off sub-parsers:
- First, parse the name/value pair format inside the object using separated_pair + preceded:
// key_value kv format parsing
fn key_value(i: &str) -> IResult<&str, (&str, JsonValue)> {
separated_pair(preceded(whitespace, string), cut(preceded(whitespace, char(':'))), json_value)(i)
}
- Then, for the overall structure of the object, the parsing idea is:
- left brace -> content within the braces -> parse according to the (previously implemented) key-value pair format to construct an array -> convert the array to a HashMap type -> right brace
// object object format parsing
fn object(i: &str) -> IResult<&str, HashMap<String, JsonValue>> {
context(
"object",
preceded(
char('{'),
cut(terminated(
map(
separated_list0(preceded(whitespace, char(',')), key_value),
|tuple_vec| {
tuple_vec.into_iter().map(|(k, v)| (String::from(k), v)).collect()
},
),
preceded(whitespace, char('}')),
)),
),
)(i)
}
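And a quick check of the object parser (a minimal sketch; HashMap is already imported for the JsonValue model):
#[test]
fn test_object() {
    let mut expected = HashMap::new();
    expected.insert(String::from("a"), JsonValue::Num(1.0));
    assert_eq!(object(r#"{ "a" : 1 }"#), Ok(("", expected)));
}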
Top-Level Parsing Function#
We have now implemented all the types annotated in the JSON standard. Finally, we just need a top-level function that ties the parser together.
The outermost value of a JSON document can be either an object or an array, so our top-level function is:
fn root(i: &str) -> IResult<&str, JsonValue> {
delimited(
whitespace,
alt((
map(object, JsonValue::Object),
map(array, JsonValue::Array),
)),
opt(whitespace),
)(i)
}
Finally, you can run the following test function to see if the final return result is normal:
#[cfg(test)]
mod test_json {
use crate::nom::json::json::root;
#[test]
fn test_parse_json() {
let data = " { \"a\"\t: 42,
\"b\": [ \"x\", \"y\", 12 ] ,
\"c\": { \"hello\" : \"world\"}
} ";
println!("will try to parse valid JSON data:\n\n**********\n{}\n**********\n", data);
//
// will try to parse valid JSON data:
//
// **********
// { "a" : 42,
// "b": [ "x", "y", 12 ] ,
// "c": { "hello" : "world"}
// }
// **********
println!(
"parsing a valid file:\n{:#?}\n",
root(data)
);
// parsing a valid file:
// Ok(
// (
// "",
// Object(
// {
// "c": Object(
// {
// "hello": Str(
// "world",
// ),
// },
// ),
// "b": Array(
// [
// Str(
// "x",
// ),
// Str(
// "y",
// ),
// Num(
// 12.0,
// ),
// ],
// ),
// "a": Num(
// 42.0,
// ),
// },
// ),
// ),
// )
}
}
Thus, the JSON parser implemented using nom is complete. No specific performance testing has been conducted; interested readers can perform stress tests themselves.
Implementation Based on Pest#
JSON Model#
Similar to the previous implementation with nom, here we use an enum to construct JSON Value excluding whitespace.
#[derive(Debug, PartialEq)]
pub enum JsonValue<'a> {
Number(f64),
String(&'a str),
Boolean(bool),
Array(Vec<JsonValue<'a>>),
Object(Vec<(&'a str, JsonValue<'a>)>),
Null,
}
In fact, declaring a lifetime is not strictly necessary; you could use String directly. We introduce &str so the values borrow from the input, which saves type conversions later.
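If you would rather avoid the lifetime entirely, an owned variant is a straightforward alternative (a sketch; the trade-off is that every parsed string and key is copied into a String):
// Owned alternative: no lifetime parameter, at the cost of
// allocating a String for every parsed string and key.
#[derive(Debug, PartialEq)]
pub enum OwnedJsonValue {
    Number(f64),
    String(String),
    Boolean(bool),
    Array(Vec<OwnedJsonValue>),
    Object(Vec<(String, OwnedJsonValue)>),
    Null,
}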
To make the parsed JSON values easier to display and process, we add a serializer for JsonValue:
pub fn serialize_json_value(val: &JsonValue) -> String {
use JsonValue::*; // bring the variants into scope for the match below
match val {
Number(n) => format!("{}", n),
String(s) => format!("\"{}\"", s),
Boolean(b) => format!("{}", b),
Array(a) => {
let contents: Vec<_> = a.iter().map(serialize_json_value).collect();
format!("[{}]", contents.join(","))
}
Object(o) => {
let contents: Vec<_> = o
.iter()
.map(|(key, value)| format!("\"{}\":{}", key, serialize_json_value(value)))
.collect();
format!("{{{}}}", contents.join(","))
}
Null => "null".to_string(),
}
}
Note that arrays and objects are composite types, so the serializer must recurse to reach the values nested inside them.
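A quick usage example of the serializer (a minimal sketch):
#[test]
fn test_serialize() {
    use JsonValue::*;
    let v = Array(vec![Number(1.0), String("x"), Null]);
    assert_eq!(serialize_json_value(&v), r#"[1,"x",null]"#);
}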
Pest Grammar Parsing#
Here we create a json.pest file and express the JSON standard we need to parse in pest grammar.
- Whitespace
According to the standard description, we implement it using the choice operator |:
WHITESPACE = _{ " " | "\t" | "\r" | "\n" }
There are a few special pieces of syntax that need to be explained in advance:
- If a rule is prefixed with _, it creates a silent rule. Unlike ordinary rules, it produces no token pairs during parsing and does not report errors; only the outermost enclosing rule yields token pairs.
- In pest, if a WHITESPACE rule is defined, it is implicitly inserted between the elements of every sequence or repetition (except inside atomic rules).
  - Keep the "except inside atomic rules" part in mind; it becomes relevant for rules introduced later.
- A similar implicit convention exists for the COMMENT rule, which handles comments in the same way WHITESPACE handles whitespace.
In summary, in the subsequent file, all sequences or repetitions will ignore whitespace during parsing except for atomic rules.
- Number
Based on the standard description, we use the sequence operator ~ to chain the different parsing conditions of the expression, together with pest's built-in digit rules:
// 2. number
number = @{
"-"?
~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
~ ("." ~ ASCII_DIGIT*)?
~ (^"e" ~ ("+"|"-")? ~ ASCII_DIGIT+)?
}
Here, too, some special syntax needs explaining:
- If a rule is prefixed with @, it creates an atomic rule, which has the following characteristics:
  - It is not affected by the WHITESPACE handling mentioned earlier: internal whitespace is not hidden, and no characters are skipped between the elements of a ~ sequence.
  - Other rules called from within an atomic rule are also treated as atomic.
  - All rules matched inside an atomic rule are treated as silent, so only the outermost parsing result of the whole rule is produced.
- The suffix operators ?, *, and + mean zero or one match, zero or more matches, and one or more matches, respectively.
- The prefix operator ^ on a string literal makes the match case-insensitive.
- String
Based on the standard description, we split string parsing across three rules to keep it clear:
// 3. string
string = ${ "\"" ~ inner ~ "\"" }
inner = @{ char* }
char = {
!( "\"" | "\\") ~ ANY
| "\\" ~ ("\"" | "\\" | "/" | "b" | "f" | "n" | "r" | "t")
| "\\" ~ ("u" ~ ASCII_HEX_DIGIT{4})
}
Pest does not provide a built-in helper for escape characters like nom's escaped, so we spell the escape forms out in the grammar ourselves.
Some special syntax used here:
- If a rule is prefixed with $, it creates a compound atomic rule. It is similar to an atomic rule, but with differences worth noting:
  - Like an atomic rule, it is not affected by the WHITESPACE handling mentioned earlier.
  - Unlike an atomic rule, it does not turn called rules into atomic rules or silence inner matches; in every other respect it behaves like an ordinary rule.
- !(...) ~ ANY matches any single character except those listed in the parentheses (! is a negative lookahead).
- Value
Similar to the previous implementation, we first complete the basic types boolean and null:
// 4. boolean
boolean = {"true" | "false"}
// 5. null
null = {"null"}
Combining the parsing rules of the individual data types, we construct the value rule. Since we do not care about the parsing process of the individual rules inside a value, we mark it with _ as a silent rule to reduce nesting:
value = _{ number | string | boolean | array | object | null}
For arrays and objects, we will describe them below.
- Array
According to the standard description, we separate empty arrays from non-empty arrays, using ~ and * to allow multiple values:
// 6. array
array = {
"[" ~ "]"|
"[" ~ value ~ ("," ~ value)* ~ "]"
}
- Object
Here, we separately split the object values into pairs, utilizing the previous string and value rules, and handle them similarly to arrays, distinguishing between empty objects and non-empty objects:
// 7. object
object = {
"{" ~ "}"|
"{" ~ pair ~ ("," ~ pair)* ~ "}"
}
pair = { string ~ ":" ~ value }
- Final Rule
Finally, we need a top-level rule representing the entire JSON document; the only valid top-level content is an object or an array.
Since we only care about the parsed object or array itself (not the json wrapper or the EOI marker), we mark the rule as silent. SOI and EOI anchor the match to the start and end of the input:
// 9. json
json = _{ SOI ~ (object | array) ~ EOI}
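For reference, here is the complete json.pest assembled from the rules above:
// json.pest
WHITESPACE = _{ " " | "\t" | "\r" | "\n" }
number = @{
    "-"?
    ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
    ~ ("." ~ ASCII_DIGIT*)?
    ~ (^"e" ~ ("+" | "-")? ~ ASCII_DIGIT+)?
}
string = ${ "\"" ~ inner ~ "\"" }
inner = @{ char* }
char = {
    !("\"" | "\\") ~ ANY
    | "\\" ~ ("\"" | "\\" | "/" | "b" | "f" | "n" | "r" | "t")
    | "\\" ~ ("u" ~ ASCII_HEX_DIGIT{4})
}
boolean = { "true" | "false" }
null = { "null" }
value = _{ number | string | boolean | array | object | null }
array = {
    "[" ~ "]" |
    "[" ~ value ~ ("," ~ value)* ~ "]"
}
object = {
    "{" ~ "}" |
    "{" ~ pair ~ ("," ~ pair)* ~ "}"
}
pair = { string ~ ":" ~ value }
json = _{ SOI ~ (object | array) ~ EOI }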
Thus, we have completed the pest rules we need to write. Next, we will generate the AST based on the rules.
AST Generation and Parsing#
- Define the Structure Bound to Pest Rules
Pest requires binding the grammar file to a Rust struct through the derive macro:
use pest::Parser;
use pest_derive::Parser;
#[derive(Parser)]
#[grammar = "pest/json/json.pest"] // according to your project file
pub struct JsonParser;
- Construct the AST Generation Function
With the pest rules bound to JsonParser, we can call pest::Parser's parse method to obtain the AST generated by the JSON rules defined earlier:
pub fn root(content: &str) -> Result<JsonValue, Error<Rule>> {
let json = JsonParser::parse(Rule::json, content)?.next().unwrap();
// ......
}
Since json is a silent rule, it produces no token pair of its own; the first pair returned is the matched object or array (a trailing EOI pair follows it), so a single call to next() is enough to obtain the Pair<Rule> we care about.
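A small sketch of what this looks like in practice (assuming the JsonParser defined above):
let mut pairs = JsonParser::parse(Rule::json, r#"{"a": 1}"#).expect("parse failed");
// the first pair is the matched object; a Rule::EOI pair follows it
let object = pairs.next().unwrap();
assert_eq!(object.as_rule(), Rule::object);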
Our goal is to parse the AST to obtain the final JsonValue, so we need another method to parse this Pair<Rule>
.
- Parsing AST Function
We add the parse_json_value function:
- Using the obtained AST result, we assign values to each type in JsonValue according to the previous pest rules.
- The handling of arrays and objects as composite types requires recursive functions to search for nested values.
- If the matched Rule is not one of the types in JsonValue, it directly reports an error and exits.
pub fn parse_json_value(pair: Pair<Rule>) -> JsonValue {
match pair.as_rule() {
Rule::number => JsonValue::Number(pair.as_str().parse().unwrap()),
Rule::string => JsonValue::String(pair.into_inner().next().unwrap().as_str()),
Rule::boolean => JsonValue::Boolean(pair.as_str().parse().unwrap()),
Rule::null => JsonValue::Null,
Rule::array => JsonValue::Array(pair.into_inner().map(parse_json_value).collect()),
Rule::object => JsonValue::Object(
pair.into_inner()
.map(|pair| {
let mut inner_rules = pair.into_inner();
let key = inner_rules
.next() // get the pair rule
.unwrap()
.into_inner()
.next() // get the first token pair of the pair rule, i.e., key
.unwrap()
.as_str();
let value = parse_json_value(inner_rules.next().unwrap());
(key, value)
})
.collect()
),
_ => unreachable!()
}
}
- Final Top-Level Function and Function Testing
Based on the parsing function above, we can complete the top-level function root to obtain the final parsing result:
pub fn root(content: &str) -> Result<JsonValue, Error<Rule>> {
let json = JsonParser::parse(Rule::json, content)?.next().unwrap();
Ok(parse_json_value(json))
}
Finally, you can run the following test function to verify the parsing results:
#[cfg(test)]
mod test {
use crate::pest::json::json::{JsonValue, root, serialize_json_value};
#[test]
fn test_parse_json_by_pest() {
let data = " { \"a\"\t: 42,
\"b\": [ \"x\", \"y\", 12 ] ,
\"c\": { \"hello\" : \"world\"}
} ";
println!("will try to parse valid JSON data:\n\n**********\n{}\n**********\n", data);
// will try to parse valid JSON data:
//
// **********
// { "a" : 42,
// "b": [ "x", "y", 12 ] ,
// "c": { "hello" : "world"}
// }
// **********
let json_result: JsonValue = root(data).expect("unsuccessful JSON");
println!("{}", serialize_json_value(&json_result))
// {"a":42,"b":["x","y",12],"c":{"hello":"world"}}
}
}
Thus, we have completed the JSON parser implemented based on pest.
Summary#
We have now practiced building a JSON parser with both nom and pest. Both are classic parser libraries; they approach parsing with different but equally solid designs, and either one can meet most parsing needs.
After reading these two articles, you should be able to quickly build custom parsers with Rust's parser libraries, freeing yourself from the pain of hand-writing parsers from scratch. Along the way, we have also covered concepts such as parsers, parser combinators, and the JSON standard.
Finally, thank you for reading; I hope this helps anyone interested in parsers or with similar parsing needs.
In the future, the repo will also add a Redis protocol parser, but for reasons of space it is not covered here.