Compiling

The parser library is used in a similar way to Parsec, a Haskell library for writing parsers. It is based on higher-order parser combinators, so a complicated parser can be built up from many smaller ones.

This is a two stage process:

  • Lexer - This takes the input string and turns it into a list of Tokens.
  • Parser - This takes the list of tokens and builds a structured representation of the code.

[diagram: parser top level]

The advantage of using two stages like this is that things such as whitespace and comments don't need to be considered in every parser rule.

The Idris parser library differs from Parsec in that you need to say, in the Recogniser's type, whether a rule is guaranteed to consume input. This means the Idris type system can rule out recognisers/rules that could loop forever without consuming any input.

The Lexer is similar but works at the level of characters rather than tokens, and you need to provide a TokenMap which says how to build a token representation from a string when a rule is matched.

Test Code

Since this code is from Idris 2, I have adapted it to run under Idris 1:

mjb@localhost:~> cd Idris-dev/libs/algebra
mjb@localhost:~/Idris-dev/libs/algebra> idris --clean algebra.ipkg
Removed: outputFormat.ibc
Removed: Core2.ibc
Removed: Quantity2.ibc
Removed: Lexer2.ibc
Removed: ParserLexer2.ibc
Removed: ParserCore2.ibc
Removed: Parser2.ibc
Removed: Support2.ibc
Removed: IDEModeCommands2.ibc
Removed: IDEModeParser2.ibc
Removed: 00algebra-idx.ibc
mjb@localhost:~/Idris-dev/libs/algebra> idris --build algebra.ipkg
Type checking ./outputFormat.idr
Type checking ./Core2.idr
Type checking ./Quantity2.idr
Type checking ./Lexer2.idr
Type checking ./ParserLexer2.idr
Type checking ./ParserCore2.idr
Type checking ./Parser2.idr
Type checking ./Support2.idr
Type checking ./IDEModeCommands2.idr
Type checking ./IDEModeParser2.idr
mjb@localhost:~/Idris-dev/libs/algebra> idris --install algebra.ipkg
Installing outputFormat.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing Core2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing Quantity2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing Lexer2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing ParserLexer2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing ParserCore2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing Parser2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing Support2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing IDEModeCommands2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing IDEModeParser2.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
Installing 00algebra-idx.ibc to /home/mjb/.cabal/share/x86_64-linux-ghc-8.6.4/idris-1.3.2/libs/algebra
mjb@localhost:~/Idris-dev/libs/algebra> idris -p algebra
     ____    __     _                                          
    /  _/___/ /____(_)____                                     
    / // __  / ___/ / ___/     Version 1.3.2
  _/ // /_/ / /  / (__  )      http://www.idris-lang.org/      
 /___/\__,_/_/  /_/____/       Type :? for help               

Idris is free software with ABSOLUTELY NO WARRANTY.            
For details type :warranty.
Idris> :module Core2
*Core2> :doc lex
Core2.lex : TokenMap a ->
    String -> (List (TokenData a), Int, Int, String)
    Given a mapping from lexers to token generating functions (the
    TokenMap a) and an input string, return a list of recognised tokens,
    and the line, column, and remainder of the input at the first point in
    the string where there are no recognised tokens.
    
    The function is: Total & public export
*Core2>  

You should now be able to run the examples:

Lexer

[diagram: tokenise]

To lex a string into a list of tokens we define the lexical structure using recognisers:

[diagram: recognisers]

There are constructors and combinators to allow the construction of the lexer definition:

A simple recogniser is 'Pred' which uses a predicate (Char -> Bool) to test whether to accept the character. It can be constructed using the 'is' function:

Idris> :module Lexer2
*Lexer2> is 'a'
Pred (\ARG =>
           intToBool (prim__eqChar ARG 'a'))
                              : Recognise True

Recognisers can be combined, for example,

<+> means sequence two recognisers. If either consumes a character, the sequence is guaranteed to consume a character.

*Lexer2> is 'a' <+> is 'b'
SeqEat (Pred (\ARG => intToBool (prim__eqChar ARG 'a')))
       (Delay (is 'b')) : Recognise True
*Lexer2> 
 

<|> means try one recogniser or the other. If both consume, the combination is guaranteed to consume a character:

*Lexer2> is 'a' <|> is 'b'
Alt (Pred (\ARG => intToBool (prim__eqChar ARG 'a')))
    (Pred (\ARG => intToBool (prim__eqChar ARG 'b'))) : Recognise True
*Lexer2> 
 

Use of <+> and <|> allows more complicated lexers to be built up. Complex lexers can be split over many functions, for example, here is the lexer for Idris2 from src/Parser/Lexer.idr:

comment : Lexer
comment = is '-' <+> is '-' <+> many (isNot '\n')

toEndComment : (k : Nat) -> Recognise (k /= 0)
toEndComment Z = empty
toEndComment (S k) 
             = some (pred (\c => c /= '-' && c /= '{')) 
                      <+> toEndComment (S k)
           <|> is '{' <+> is '-' <+> toEndComment (S (S k))
           <|> is '-' <+> is '}' <+> toEndComment k
           <|> is '{' <+> toEndComment (S k)
           <|> is '-' <+> toEndComment (S k)

blockComment : Lexer
blockComment = is '{' <+> is '-' <+> toEndComment 1
              
docComment : Lexer
docComment = is '|' <+> is '|' <+> is '|' <+> many (isNot '\n')

-- An identifier starts with alpha character followed by alpha-numeric
ident : Lexer
ident = pred startIdent <+> many (pred validIdent)
  where
    startIdent : Char -> Bool
    startIdent '_' = True
    startIdent x = isAlpha x

    validIdent : Char -> Bool
    validIdent '_' = True
    validIdent '\'' = True
    validIdent x = isAlphaNum x

holeIdent : Lexer
holeIdent = is '?' <+> ident

doubleLit : Lexer
doubleLit 
    = digits <+> is '.' <+> digits <+> opt
           (is 'e' <+> opt (is '-' <|> is '+') <+> digits)

-- Do this as an entire token, because the contents will be processed by
-- a specific back end
cgDirective : Lexer
cgDirective
    = exact "%cg" <+> 
      ((some space <+> 
           some (pred isAlphaNum) <+> many space <+>
           is '{' <+> many (isNot '}') <+> 
           is '}') 
         <|> many (isNot '\n'))

mkDirective : String -> Token
mkDirective str = CGDirective (trim (substr 3 (length str) str))

-- Reserved words
keywords : List String
keywords = ["data", "module", "where", "let", "in", "do", "record",
            "auto", "implicit", "mutual", "namespace", "parameters",
            "with", "impossible", "case", "of",
            "if", "then", "else", "forall", "rewrite",
            "using", "interface", "implementation", "open", "import",
            "public", "export", "private",
            "infixl", "infixr", "infix", "prefix",
            "Type", "Int", "Integer", "String", "Char", "Double",
            "total", "partial", "covering",
            "Lazy", "Inf", "Delay", "Force"]

-- Reserved words for internal syntax
special : List String
special = ["%lam", "%pi", "%imppi", "%let"]

-- Special symbols - things which can't be a prefix of another symbol, and
-- don't match 'validSymbol'
export
symbols : List String
symbols 
    = [".(", -- for things such as Foo.Bar.(+)
       "@{",
       "(", ")", "{", "}", "[", "]", ",", ";", "_", 
       "`(", "`"]

export
opChars : String
opChars = ":!#$%&*+./<=>?@\\^|-~"

validSymbol : Lexer
validSymbol = some (oneOf opChars)

-- Valid symbols which have a special meaning so can't be operators
export
reservedSymbols : List String
reservedSymbols
    = symbols ++ 
      ["%", "\\", ":", "=", "|", "|||", "<-", "->", "=>", "?", "&", "**", ".."]

So far this is static code that defines the lexical structure. To lex a given text we need to pass this to the runtime code.

However, this alone can only cut the string up into a list of substrings; these must be converted into tokens, so we need a way to construct them. This will also depend on the lexical structure we require.

TokenMap

We then need to generate a TokenMap. This is a mapping from lexers to the tokens they produce, given as a list of pairs:
(Lexer, String -> tokenType)
For each Lexer in the list, if a substring of the input matches, the associated function is run to produce a token of type `tokenType`.

from Core2:


TokenMap : (tokenType : Type) -> Type
TokenMap tokenType = List (Lexer, String -> tokenType)
  

Here is the code that generates the TokenMap for Idris2 lexer from src/Parser/Lexer.idr:

rawTokens : TokenMap Token
rawTokens =
    [(comment, Comment),
     (blockComment, Comment),
     (docComment, DocComment),
     (cgDirective, mkDirective),
     (holeIdent, \x => HoleIdent (assert_total (strTail x)))] ++
    map (\x => (exact x, Symbol)) symbols ++
    [(doubleLit, \x => DoubleLit (cast x)),
     (digits, \x => Literal (cast x)),
     (stringLit, \x => StrLit (stripQuotes x)),
     (charLit, \x => CharLit (stripQuotes x)),
     (ident, \x => if x `elem` keywords then Keyword x else Ident x),
     (space, Comment),
     (validSymbol, Symbol),
     (symbol, Unrecognised)]
  where
    stripQuotes : String -> String
    -- ASSUMPTION! Only total because we know we're getting quoted strings.
    stripQuotes = assert_total (strTail . reverse . strTail . reverse)

 

So, for our code, we can turn this into a TokenMap using a function like this:

myTokenMap : TokenMap Token
myTokenMap = [(is 'a', CharLit)]

We then call 'lex' (from Core); this gives a list of tokens with their positions in the input string:

*MyParser> :doc lex
Core.lex : TokenMap a ->
    String -> (List (TokenData a), Int, Int, String)
    Given a mapping from lexers to token generating
    functions (the TokenMap a) and an input string,
    return a list of recognised tokens, and the line,
    column, and remainder of the input at the first
    point in the string where there are no recognised
    tokens.
    
    The function is: Total & export
Parser.Lexer.lex : String ->
    Either (Int, Int, String) (List (TokenData Token))
    
    
    The function is: Total & export

We should then be able to lex a string to tokens. If it is successful it will generate a list of TokenData like this:

Idris> :module Core2
*Core2> lex myTokenMap "a"
([MkToken 0 0 (CharLit "a")], 0, 1, "") :
                 (List (TokenData Token),
                 Int,
                 Int,
                 String)
TokenData is just a token together with the line and column where it occurred in the input:

||| A token, and the line and column where it was
||| in the input
record TokenData a where
  constructor MkToken
  line : Int
  col : Int
  tok : a

If lexing does not succeed on the whole input, the unconsumed remainder is returned along with the position where lexing stopped:
*Core2> lex myTokenMap "b"
([], 0, 0, "b") :
       (List (TokenData Token), Int, Int, String)

Again, for a more complicated example, we can look at the lexer for Idris2 from src/Parser/Lexer.idr:

public export
data Token = Ident String
           | HoleIdent String
           | Literal Integer
           | StrLit String
           | CharLit String
           | DoubleLit Double
           | Symbol String
           | Keyword String
           | Unrecognised String
           | Comment String
           | DocComment String
           | CGDirective String
           | EndInput

 

The reason for doing it this way is to separate defining the lexer from running it: the TokenMap can be constructed in code and then used at runtime:

[diagram: lex runtime]
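
For example (a minimal sketch; runMyLexer is just an illustrative name, not part of the library), the myTokenMap value defined above can be built once and then applied to any number of input strings at runtime:

-- Partially applying 'lex' to a TokenMap gives a reusable lexing function
runMyLexer : String -> (List (TokenData Token), Int, Int, String)
runMyLexer = lex myTokenMap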

'Pred' is not the only recogniser. Here is the full data structure:
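
This is a sketch of Recognise, reconstructed from Idris2/src/Text/Lexer/Core.idr and the constructors that the scan function (shown later) pattern-matches on:

data Recognise : (consumes : Bool) -> Type where
     Empty : Recognise False                    -- match no input
     Fail : Recognise c                         -- always fail
     Lookahead : (positive : Bool) ->
                 Recognise c -> Recognise False -- test without consuming
     Pred : (Char -> Bool) -> Recognise True    -- match one char by predicate
     SeqEat : Recognise True ->
              Inf (Recognise e) ->
              Recognise True                    -- sequence, first must consume
     SeqEmpty : Recognise e1 -> Recognise e2 ->
                Recognise (e1 || e2)            -- sequence, first may be empty
     Alt : Recognise e1 -> Recognise e2 ->
           Recognise (e1 && e2)                 -- alternative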

 

So far we have just looked at a trivially simple lexer. To create something more useful we need to combine simple recognisers to build up more complex ones.

Recognisers can be combined using <+> and <|> operators as follows:

<|> operator

Alternative recognisers. If both consume, the combination is guaranteed to consume a character.

(<|>) : Recognise c1 ->
        Recognise c2 ->
        Recognise (c1 && c2)
(<|>) = Alt

Examples

This recognises optional characters:

||| Recognise no input
||| (doesn't consume any input)
empty : Recognise False
empty = Empty

||| Recognise a lexer or recognise no input.
||| This is not guaranteed to consume input.
||| /`l`?/
opt : (l : Lexer) -> Recognise False
opt l = l <|> empty

This recognises a CR and LF together or on their own:

||| Recognise a single newline sequence.
||| Understands CRLF, CR, and LF
||| /\\r\\n|[\\r\\n]/
newline : Lexer
newline = let crlf = "\r\n" in
              exact crlf <|> oneOf crlf
  

<+> operator

Sequence two recognisers. If either consumes a character, the sequence is guaranteed to consume a character.

If the first operand consumes, the second operand can be infinite.

(<+>) : {c1 : Bool} ->
        Recognise c1 -> 
        inf c1 (Recognise c2) ->
        Recognise (c1 || c2)
(<+>) {c1 = False} = SeqEmpty
(<+>) {c1 = True} = SeqEat

Examples

This recognises a sequence like this:

start <+> middle <+> end

||| Recognise all input between `start` and `end` lexers.
||| Supports balanced nesting.
|||
||| For block comments that don't support 
||| nesting (such as C-style comments),
||| use `surround`
blockComment : (start : Lexer) -> (end : Lexer) -> Lexer
blockComment start end = start <+> middle <+> end
  where
    middle : Recognise False
    middle = manyUntil end (blockComment start end <|> any)

Another example

This recognises a sequence of digits which may optionally have a minus sign at the start:

||| Recognise a single digit 0-9
||| /[0-9]/
digit : Lexer
digit = pred isDigit

||| Recognise one or more digits
||| /[0-9]+/
digits : Lexer
digits = some digit

||| Recognise an integer literal (possibly with a '-' prefix)
||| /-?[0-9]+/
intLit : Lexer
intLit = opt (is '-') <+> digits

Since Idris is eager by default, it would otherwise try to evaluate all branches eagerly. If combinators are recursive then they need to occur in a lazy context, hence the use of Inf in <+>.

For example, the code to recognise an identifier requires an alphabetic character followed by any number of alpha-numeric characters.

This is implemented by 'many', which requires recursion, and so <+> needs to handle infinite data.

ident : Lexer
ident = pred startIdent <+> many (pred validIdent)
  where
    startIdent : Char -> Bool
    startIdent '_' = True
    startIdent x = isAlpha x

    validIdent : Char -> Bool
    validIdent '_' = True
    validIdent '\'' = True
    validIdent x = isAlphaNum x
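
The 'many' used here may match zero times, so it has type Recognise False, while 'some' must match at least once. Their definitions (a sketch based on Text/Lexer.idr) are mutually recursive, and it is the Inf argument of <+> that makes the recursion productive:

mutual
  ||| Recognise one or more matches of l
  ||| /`l`+/
  some : Lexer -> Lexer
  some l = l <+> many l

  ||| Recognise zero or more matches of l
  ||| (not guaranteed to consume input)
  ||| /`l`*/
  many : Lexer -> Recognise False
  many l = opt (some l)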

Lexer Code

Given a TokenMap and an input String we can create a list of Tokens.

This is done here: Idris2/src/Text/Lexer/Core.idr

 
||| Given a mapping from lexers to token generating functions (the
||| TokenMap a) and an input string, return a list of recognised tokens,
||| and the line, column, and remainder of the input at the first point in the
||| string where there are no recognised tokens.
export
lex : TokenMap a -> String -> (List (TokenData a), (Int, Int, String))
lex tmap str 
   = let (ts, (l, c, str')) = tokenise (const False) 0 0 [] tmap (mkStr str) in
         (ts, (l, c, getString str'))

The TokenMap is a mapping from lexers to the tokens they produce: a list of pairs `(Lexer, String -> tokenType)`. For each Lexer in the list, if a substring of the input matches, the associated function is run to produce a token of type `tokenType`.

 
||| A mapping from lexers to the tokens they produce.
||| This is a list of pairs `(Lexer, String -> tokenType)`
||| For each Lexer in the list, if a substring in the input matches, run
||| the associated function to produce a token of type `tokenType`
public export
TokenMap : (tokenType : Type) -> Type
TokenMap tokenType = List (Lexer, String -> tokenType)

The TokenMap depends on the language being lexed; for Idris2 it is defined by rawTokens in src/Parser/Lexer.idr:

export
rawTokens : TokenMap Token
rawTokens = 
    [(comment, Comment),
     (blockComment, Comment),
     (docComment, DocComment),
     (cgDirective, mkDirective),
     (holeIdent, \x => HoleIdent (assert_total (strTail x)))] ++
    map (\x => (exact x, Symbol)) symbols ++
    [(doubleLit, \x => DoubleLit (cast x)),
     (digits, \x => Literal (cast x)),
     (stringLit, \x => StrLit (stripQuotes x)),
     (charLit, \x => CharLit (stripQuotes x)),
     -- An identifier starts with alpha character followed by alpha-numeric
     (ident, \x => if x `elem` keywords then Keyword x else Ident x),
     (space, Comment),
     (validSymbol, Symbol),
     (symbol, Unrecognised)]
  where
    stripQuotes : String -> String
    -- ASSUMPTION! Only total because we know we're getting quoted strings.
    stripQuotes = assert_total (strTail . reverse . strTail . reverse)

So we can try running the lexer with an input string of "test 3" like this:

 
*MyParser> lex rawTokens "test 3"
([MkToken 0 0 (Ident "test"),
  MkToken 0 4 (Comment " "),
  MkToken 0 5 (Literal 3)],
 0,
 6,
 "") : (List (TokenData Token), Int, Int, String)
*MyParser> 

Note: in order for this to be displayed properly I needed to change the calls to fspan in Core.idr into the standard span function, otherwise it displays the lazy code that produces the output rather than the output itself.

We can see from the example above that the input string "test 3" is lexed into an Ident, a Comment (for the space) and a Literal. These tokens are defined by the following structures:

Token

Idris2/src/Parser/Lexer.idr

public export
data Token = Ident String
           | HoleIdent String
           | Literal Integer
           | StrLit String
           | CharLit String
           | DoubleLit Double
           | Symbol String
           | Keyword String
           | Unrecognised String
           | Comment String
           | DocComment String
           | CGDirective String
           | EndInput

also here:

Idris2/src/Text/Token.idr

||| A token of a particular kind and the text that was recognised.
record Token k where
  constructor Tok
  kind : k
  text : String

TokenMap

Idris2/src/Text/Lexer/Core.idr

||| A mapping from lexers to the tokens they produce.
||| This is a list of pairs `(Lexer, String -> tokenType)`
||| For each Lexer in the list, if a substring in the input matches, run
||| the associated function to produce a token of type `tokenType`
public export
TokenMap : (tokenType : Type) -> Type
TokenMap tokenType = List (Lexer, String -> tokenType)

tokenise

Idris2/src/Text/Lexer/Core.idr

tokenise : (TokenData a -> Bool) ->
           (line : Int) -> (col : Int) ->
           List (TokenData a) -> TokenMap a ->
           StrLen -> (List (TokenData a), (Int, Int, StrLen))
tokenise pred line col acc tmap str
    = case getFirstToken tmap str of
           Just (tok, line', col', rest) =>
           -- assert total because getFirstToken must consume something
               if pred tok
                  then (reverse acc, (line, col, MkStrLen "" 0))
                  else assert_total (tokenise pred line' col' (tok :: acc) tmap rest)
           Nothing => (reverse acc, (line, col, str))
  where
    countNLs : List Char -> Nat
    countNLs str = List.length (filter (== '\n') str)

    getCols : String -> Int -> Int
    getCols x c
         = case fspan (/= '\n') (reverse x) of
                (incol, "") => c + cast (length incol)
                (incol, _) => cast (length incol)

    getFirstToken : TokenMap a -> StrLen -> Maybe (TokenData a, Int, Int, StrLen)
    getFirstToken [] str = Nothing
    getFirstToken ((lex, fn) :: ts) str
        = case takeToken lex str of
               Just (tok, rest) => Just (MkToken line col (fn tok),
                                         line + cast (countNLs (unpack tok)),
                                         getCols tok col, rest)
               Nothing => getFirstToken ts str

This uses:

 
-- If the string is recognised, returns the index at which the token
-- ends
scan : Recognise c -> Nat -> StrLen -> Maybe Nat
scan Empty idx str = pure idx
scan Fail idx str = Nothing
scan (Lookahead positive r) idx str
    = if isJust (scan r idx str) == positive
         then Just idx
         else Nothing
scan (Pred f) idx (MkStrLen str len)
    = if cast {to = Integer} idx >= cast len
         then Nothing
         else if f (assert_total (prim__strIndex str (cast idx)))
                 then Just (idx + 1)
                 else Nothing
scan (SeqEat r1 r2) idx str
    = do idx' <- scan r1 idx str
         -- TODO: Can we prove totality instead by showing idx has increased?
         assert_total (scan r2 idx' str)
scan (SeqEmpty r1 r2) idx str
    = do idx' <- scan r1 idx str
         scan r2 idx' str
scan (Alt r1 r2) idx str
    = maybe (scan r2 idx str) Just (scan r1 idx str)

takeToken : Lexer -> StrLen -> Maybe (String, StrLen)
takeToken lex str
    = do i <- scan lex 0 str -- i must be > 0 if successful
         pure (substr 0 i (getString str), strTail i str)

Example of use:

Here is an example using 'digits', a lexer that recognises numeric digits:

*MyParser> takeToken digits (mkStr "23 test")
Just ("23", MkStrLen " test" 5) : Maybe (String, StrLen)
*MyParser> takeToken digits (mkStr "test 23")
Nothing : Maybe (String, StrLen)
*MyParser> 

 

In src/Parser/Lexer.idr we have:

lexTo : (TokenData Token -> Bool) ->
        String -> Either (Int, Int, String) (List (TokenData Token))
lexTo pred str 
    = case lexTo pred rawTokens str of
           -- Add the EndInput token so that we'll have a line and column
           -- number to read when storing spans in the file
           (tok, (l, c, "")) => Right (filter notComment tok ++ 
                                      [MkToken l c EndInput])
           (_, fail) => Left fail
    where
      notComment : TokenData Token -> Bool
      notComment t = case tok t of
                          Comment _ => False
                          DocComment _ => False -- TODO!
                          _ => True

Idris2/src/Text/Lexer/Core.idr

lexTo : (TokenData a -> Bool) ->
        TokenMap a -> String -> (List (TokenData a), (Int, Int, String))
lexTo pred tmap str 
   = let (ts, (l, c, str')) = tokenise pred 0 0 [] tmap (mkStr str) in
         (ts, (l, c, getString str'))

Top level view

[diagram: lexer top-level view]

Why we need Inf in 'Recognise' and <+>

Take the example of an arithmetic expression; we might have an addition operator:

a + b

but 'a' might itself be an expression containing an addition, and so on to any depth. If we represent this in a BNF-like form it might look like this:

expression ::= expression + expression

which requires lazy evaluation.
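
Jumping ahead to the parser for a moment, such a grammar can be written with the combinators described below, provided the left recursion is factored out. Here is a minimal sketch (assuming the intLit and symbol rules from Parser/Support.idr and the Rule type, all introduced later); the recursive call to expr appears only after input has been consumed, so the Inf argument of the sequencing combinator delays it:

expr : Rule Integer
expr = do x <- intLit
          -- either an addition continues, or the literal stands alone
          (do symbol "+"; y <- expr; pure (x + y)) <|> pure x

Note that this makes '+' right-associative: a + b + c is parsed as a + (b + c).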

Parser

[diagram: parser overview]

The language we are parsing is defined by a grammar.

Our grammar is defined by this data structure:
||| Description of a language's grammar. The `tok` parameter is the type
||| of tokens, and the `consumes` flag is True if the language is guaranteed
||| to be non-empty - that is, successfully parsing the language is guaranteed
||| to consume some input.
public export
data Grammar : (tok : Type) -> (consumes : Bool) -> Type -> Type where
     Empty : (val : ty) -> Grammar tok False ty
     Terminal : String -> (tok -> Maybe a) -> Grammar tok True a
     NextIs : String -> (tok -> Bool) -> Grammar tok False tok
     EOF : Grammar tok False ()

     Fail : Bool -> String -> Grammar tok c ty
     Commit : Grammar tok False ()
     MustWork : Grammar tok c a -> Grammar tok c a

     SeqEat : Grammar tok True a -> Inf (a -> Grammar tok c2 b) ->
              Grammar tok True b
     SeqEmpty : {c1, c2 : Bool} ->
                Grammar tok c1 a -> (a -> Grammar tok c2 b) ->
                Grammar tok (c1 || c2) b
     Alt : {c1, c2 : Bool} ->
           Grammar tok c1 ty -> Grammar tok c2 ty ->
           Grammar tok (c1 && c2) ty

We can construct simple grammar elements and then combine them in more complicated ways, in a similar way to combining the recognisers in the lexer.

First we need a function of type (tok -> Maybe a):

myTokenPred : Token -> Maybe String
myTokenPred t = Just "a"

Then we can construct a simple grammar element:
*MyParser> Terminal "a" myTokenPred
Terminal "a" myTokenPred : Grammar Token True String

There are some functions in Idris2/src/Parser/Support.idr that construct grammars.

For example: 'intLit' recognises a number.

*MyParser> Support2.intLit
Terminal "Expected integer literal"
         (\x =>
            case tok x of
              Literal i => Just i
              x => Nothing) :
                 Grammar (TokenData Token) True Integer

If we parse with a grammar containing only intLit, it succeeds on "1" but fails on "a":

*MyParser> runParser "1" Support2.intLit
Right 1 : Either ParseError Integer
*MyParser> runParser "a" Support2.intLit
Left (ParseFail "Expected integer literal"
                (Just (0, 0))
                [Ident "a", EndInput])
                  : Either ParseError Integer

There are more built-in grammar elements:

*MyParser> constant
Terminal "Expected constant"
         (\x =>
            case tok x of
              Literal i => Just (BI i)
              StrLit s => case escape s of
                            Nothing => Nothing
                            Just s' => Just (Str s')
              CharLit c => case getCharLit c of
                             Nothing => Nothing
                             Just c' => Just (Ch c')
              DoubleLit d => Just (Db d)
              Keyword "Int" => Just IntType
              Keyword "Integer" => Just IntegerType
              Keyword "String" => Just StringType
              Keyword "Char" => Just CharType
              Keyword "Double" => Just DoubleType
              x => Nothing) :
                  Grammar (TokenData Token) True Constant
 
*MyParser> symbol "a"
Terminal "Expected 'a'"
         (\x =>
            case tok x of
              Symbol s => if s == req
                     then Just ()
                     else Nothing
              req => Nothing)
                : Grammar (TokenData Token) True ()
 
*MyParser> strLit
Terminal "Expected string literal"
         (\x =>
            case tok x of
              StrLit s => Just s
              x => Nothing) :
                 Grammar (TokenData Token) True String
 
*MyParser> keyword "data"
Terminal "Expected 'data'"
         (\x =>
            case tok x of
              Keyword s => if s == req
                      then Just ()
                      else Nothing
              req => Nothing) :
                  Grammar (TokenData Token) True ()
 
*MyParser> operator
Terminal "Expected operator"
         (\x =>
            case tok x of
              Symbol s => if elem s reservedSymbols
                   then Nothing
                          else Just s
              x => Nothing)
                : Grammar (TokenData Token) True String
 
*MyParser> runParser "Int" (keyword "Int")
Right () : Either ParseError ()
*MyParser> runParser "Hello" (keyword "Int")
Left (ParseFail "Expected 'Int'"
                (Just (0, 0))
                [Ident "Hello", EndInput])
                     : Either ParseError ()

I'm not quite sure why 'symbol' always fails here. The token list in the error suggests why: "a" is lexed as an Ident token rather than a Symbol (it matches the ident lexer), so the Symbol case never matches:

*MyParser> runParser "a" (symbol "a")
Left (ParseFail "Expected 'a'"
                (Just (0, 0))
                [Ident "a", EndInput])
                    : Either ParseError ()
*MyParser> runParser "a" (symbol "b")
Left (ParseFail "Expected 'b'"
                (Just (0, 0))
                [Ident "a", EndInput])
                  : Either ParseError ()
 
 


We can now combine these simple grammars using >>= and <|> to build more complicated grammars.

[diagram: grammar]

When the grammar uses (TokenData Token), we can use 'Rule' as a simpler type synonym:

public export
Rule : Type -> Type
Rule ty = Grammar (TokenData Token) True ty
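
For example, a minimal sketch of a Rule built by sequencing the grammar elements shown above with do-notation (the name pair is just for illustration):

pair : Rule (Integer, Integer)
pair = do symbol "("        -- each step must succeed in order
          x <- intLit
          symbol ","
          y <- intLit
          symbol ")"
          pure (x, y)       -- combine the two literals into a pair

Since "(", "," and ")" all appear in the lexer's symbols list, runParser "(1, 2)" pair should give Right (1, 2).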

 

Other parser combinators:

||| Sequence a grammar with value type `a -> b` and a grammar
||| with value type `a`. If both succeed, apply the function
||| from the first grammar to the value from the second grammar.
||| Guaranteed to consume if either grammar consumes.
export
(<*>) : Grammar tok c1 (a -> b) ->
        Grammar tok c2 a ->
        Grammar tok (c1 || c2) b
(<*>) x y = SeqEmpty x (\f => map f y)

||| Sequence two grammars. If both succeed, use the value of the first one.
||| Guaranteed to consume if either grammar consumes.
export
(<*) : Grammar tok c1 a ->
       Grammar tok c2 b ->
       Grammar tok (c1 || c2) a
(<*) x y = map const x <*> y

||| Sequence two grammars. If both succeed, use the value of the second one.
||| Guaranteed to consume if either grammar consumes.
export
(*>) : Grammar tok c1 a ->
       Grammar tok c2 b ->
       Grammar tok (c1 || c2) b
(*>) x y = map (const id) x <*> y
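
These combinators are useful for dropping delimiters while keeping the value between them. A minimal sketch, assuming the symbol and intLit rules from above (bracketed is just an illustrative name):

-- the () results of the two symbol rules are discarded;
-- only the integer in the middle is returned
bracketed : Rule Integer
bracketed = (symbol "[" *> intLit) <* symbol "]"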

Parser Runtime

At runtime the function 'runParser' in Idris2/src/Parser/Support.idr gets called.

 
runParserTo : (TokenData Token -> Bool) ->
              String -> Grammar (TokenData Token) e ty -> Either ParseError ty
runParserTo pred str p 
    = case lexTo pred str of
           Left err => Left $ LexFail err
           Right toks => 
              case parse p toks of
                   Left (Error err []) => 
                          Left $ ParseFail err Nothing []
                   Left (Error err (t :: ts)) => 
                          Left $ ParseFail err (Just (line t, col t))
                                               (map tok (t :: ts))
                   Right (val, _) => Right val

export
runParser : String -> Grammar (TokenData Token) e ty -> Either ParseError ty
runParser = runParserTo (const False)

This uses 'lexTo' to convert the string into tokens.

There are different lexers depending on what language we are parsing.

*MyParser> :doc lexTo
Core.lexTo : (TokenData a -> Bool) ->
    TokenMap a -> String -> (List (TokenData a), Int, Int, String)
    
    
    The function is: Total & export
Parser.Lexer.lexTo : (TokenData Token -> Bool) ->
    String -> Either (Int, Int, String) (List (TokenData Token))

Yaffle

See Idris2/src/Yaffle/REPL.idr, which calls runParser : String -> Grammar (TokenData Token) e ty -> Either ParseError ty from Idris2/src/Parser/Support.idr:

 
repl : {auto c : Ref Ctxt Defs} ->
       {auto m : Ref MD Metadata} ->
       {auto u : Ref UST UState} ->
       Core ()
repl
    = do coreLift (putStr "Yaffle> ")
         inp <- coreLift getLine
         case runParser inp command of
              Left err => do coreLift (printLn err)
                             repl
              Right cmd =>
                  do if !(processCatch cmd)
                        then repl
                        else pure ()

 

It also uses 'command' from Idris2/src/TTImp/Parser.idr.

 
-- TTImp REPL commands
export
command : Rule ImpREPL
command
    = do symbol ":"; exactIdent "t"
         tm <- expr "(repl)" init
         pure (Check tm)
  <|> do symbol ":"; exactIdent "s"
         n <- name
         pure (ProofSearch n)
  <|> do symbol ":"; exactIdent "es"
         n <- name
         pure (ExprSearch n)
  <|> do symbol ":"; exactIdent "gd"
         l <- intLit
         n <- name
         pure (GenerateDef (fromInteger l) n)
  <|> do symbol ":"; exactIdent "missing"
         n <- name
         pure (Missing n)
  <|> do symbol ":"; keyword "total"
         n <- name
         pure (CheckTotal n)
  <|> do symbol ":"; exactIdent "di"
         n <- name
         pure (DebugInfo n)
  <|> do symbol ":"; exactIdent "q"
         pure Quit
  <|> do tm <- expr "(repl)" init
         pure (Eval tm)

Grammar does a similar job in the parser as Recognise does in the lexer. The corresponding definitions are:

In the lexer:

data Recognise : (consumes : Bool) -> Type

Lexer : Type
Lexer = Recognise True

TokenMap : (tokenType : Type) -> Type
TokenMap tokenType = List (Lexer, String -> tokenType)

In the parser:

data Grammar : (tok : Type) ->
               (consumes : Bool) -> Type -> Type

Rule : Type -> Type
Rule ty = Grammar (TokenData Token) True ty

see also:
  • Monadic parsers are discussed on a separate page.

Copyright (c) 1998-2023 Martin John Baker - All rights reserved - privacy policy.