Did you know that the parser used in Castalia is open source?
Well, it is. You can find it on GitHub.
It’s a fork of the original parser which was written by Martin Waldenburg in the late 1990s, and abandoned sometime before 2003, when I found the code and began working on it.
It’s been used in a few projects you may know of:
- Castalia, my collection of Delphi IDE plugins
- Anders Ohlsson’s Unicode Statistics Tool, which helps in migrating older Delphi applications to newer Delphi versions which support Unicode
- The Delphi Mock Wizard, a mock object framework for Delphi
This post is a brief introduction to the parser and a short tutorial on how you might use it.
The Castalia Delphi Parser isn’t a component that you install. It’s a collection of units that you use in your project, and write classes that descend from the parser’s “generic” classes.
There are 4 units in the Delphi Parser:
The interesting parts in CastaliaPasLex.pas and CastaliaSimplePasPar.pas. The other two units are just convenient places to declare the many types required for the lexer and parser.
…Which brings us to the basic structure of the parser.
The parser consists of two parts: The lexer, and the parser. The lexer breaks the input into tokens, which are then analyzed by the parser.
Here’s where people who to try to user the parser get confused: The parser doesn’t produce any output. You have to do that part yourself.
The lexer is a fast, hand-written lexical analysis class that takes a string and breaks it into tokens. A token is the basic building block of source code. A token might represent an identifier, a keyword, a punctuation mark, etc…
For example, consider the code ShowMessage(‘hello, world!’);
This code would split into 5 tokens:
- Identifier (ShowMessage)
- String (hello, world!)
If you’re interested in the hardcore computer science of tokenization, read Chapter 3 of The Dragon Book.
(Note: The Castalia Delphi Parser lexer is not generated with Lex or any similar tool. It is hand-written.)
The parser takes that stream of tokens produced by the lexer and analyzes them according to Delphi syntax.
The Castalia Delphi Parser is a recursive-descent parser, which means it analyzes the tokens starting at the beginning, and makes decisions as it encounters new tokens, calling procedures recursively as it decides which grammatical element it has encountered. It’s also hand written, very fast, and efficient.
As it comes from GitHub, the parser doesn’t produce any output, but it will validate whether some code is syntactically correct or not.
Again, if you’re interested in the theory and hardcore computer science behind recursive-descent parsers, read Chapter 4 of The Dragon Book.
Using the Lexer
If all you care about is the individual tokens, and not the grammar, you can use the lexer by itself without the parser:
First, include CastaliaPasLex in the uses clause of your code.
When you want to use the Lexer, declare a variable of type TmwPasLex:
Create the lexer like any other object:
Lexer := TmwPasLex.Create;
Now, assuming you have the source code you want to examine in a string, you’ll need to tell the lexer where to find the code. This is done with the TmwPasLex.Origin property, which is a PChar. Simply point the Origin to the first character of the string:
Lexer.Origin := PChar(SourceString);
Finally, initialize the lexer by calling Init:
At this point, the Lexer will be ready to produce a token stream from the source code in SourceString. To move to the first token, call Next:
The lexer will now be able to give you information about the first token. You can get the type of the token from Lexer.TokenID (see CastaliaPasLexTypes for all the various token types), or the actual text of the token from Lexer.Token. You can also find the token’s location in the source code with properties like Lexer.RunPos and Lexer.PosXY.
Continue calling Lexer.Next; for each token until Lexer.TokenID is ptNull. When the TokenID is ptNull, the lexer has reached the end of the string. Don’t forget to free the Lexer when you’re finished with it.
As a demonstration, here is a simple function that takes a string containing Delphi source code and returns the number of identifier tokens in the string:
function CountTokens(ASource: string): Integer; var Lexer: TmwPasLex; Count: Integer; begin Count := 0; Lexer := TmwPasLex.Create; try Lexer.Origin := PChar(ASource); Lexer.Init; Lexer.Next; while Lexer.TokenID <> ptNull do begin if Lexer.TokenID = ptIdentifier then Inc(Count); Lexer.Next; end; finally Lexer.Free; end; Result := Count; end;
Using the Parser
The parser is only a little more complicated than the lexer. To use the parser, you must create your own parser class that descends from TmwSimplePasPar:
TMyParser = class(TmwSimplePasPar)
TmwSimplePasPar declares a virtual procedure for every grammar rule in the Delphi grammar. The grammar is based first on the Delphi grammar as published in the “Delphi Language Guide” that used to come with older versions of Delphi, with additions by me as the language has been expanded.
Speaking of Delphi Grammar….
…Since Embarcadero no longer publishes the Delphi Language Guide, Joe White has published a reverse-engineered Delphi grammar at dgrok.excastle.com. The Castalia Delphi Parser is not related to this grammar in any way (in fact, I just found it a few minutes ago via Google), but it may help you understand how Delphi’s grammar works. The procedure names in the Castalia Delphi Parser are probably different from Joe’s rule names.
In order to utilize the parser, you’ll need to override the procedures for the grammar rules you want to work with. For example, if you wanted to get a list of all of the units listed in a uses clause, you would override the UsedUnitName procedure.
To get information about the source code during parsing, you access the parser’s built in lexer with the Lexer property. Here is an example of an overridden UsedUnitName procedure that writes the name of a used unit to the console:
procedure TMyParser.UsedUnitName; begin Writeln(Lexer.Token); inherited; end;
(Always ensure that your overridden method calls inherited, or parsing will stop and you won’t get the information you want).
There are about 250 procedures that can be overridden. Here are a few useful examples:
- ProcedureMethodName, FunctionMethodName: the name of a procedure or function
- WithStatement: a with..do clause
- Block: A begin..end block
Take a look at the definition of TmwSimplePasPar to see the complete list of procedures that you can override. If you’re lucky enough to still have an old Delphi Language Guide lying around, the names of the procedures pretty closely parallel the names of those grammar rules.
Now that you’ve got your own parser class, you need to run it on your source code. Again, we’ll assume you have your source code in a string called SourceString.
Unlike the lexer, the parser expects the code to be in a TMemoryStream (really, any descendant of TCustomMemoryStream). I’ll show below how to write your string to a TMemoryStream, but once you have a TMemoryStream with the right data, you call the parser’s Run method, passing in the Unit name (which may be blank, depending on your usage, and the memory stream object:
To wrap up, here’s an example of a complete procedure that takes a source string as input and outputs all of the used units to the console using our TMyParser as defined above:
procedure ListUsedUnits(ASourceCode: string); var Parser: TMyParser; SourceStream: TMemoryStream; begin SourceStream := TMemoryStream.Create; try SourceStream.WriteBuffer(Pointer(ASourceCode)^, Length(ASourceCode)); SourceStream.Position := 0; Parser := TMyParser.Create; try Parser.Run('', SourceStream); finally Parser.Free; end; finally SourceStream.Free; end; end;
This has been a very basic overview of how to use the Castalia Delphi Parser. I love hearing about the creative things people have done with the parser, so if you find it useful, please drop me a line and let me know what you’ve done with it.