Lexical Scanner

From GPWiki

Files:GUITutorial_warn.gif The Game Programming Wiki has moved! Files:GUITutorial_warn.gif

The wiki is now hosted by GameDev.NET at wiki.gamedev.net. All gpwiki.org content has been moved to the new server.

However, the GPWiki forums are still active! Come say hello.

A lexical scanner is used to sort out a character stream into tokens. The tokens represent different types of groups and/or specific representation.
All compilers use some sort of lexical scanner for reading the file/character-stream.

An example problem that needs to be solved with a lexical scanner:

  A 2D computer-game with a dynamic environment that has a lot of events in it.
  We need to avoid hardcoding the events into the source and separate the actions
  in an event from the actual engine.

A normal solution is to have files on the local filesystem which is loaded by the software and then executed at a later stage or at loadtime (during parsing). The first thing that allows us to parse the file is the lexical scanner (can be done without it but it will become more static). The scanner is designed to take either a filepointer or a filestream into account and then wait for commands to start reading from it.

The key components of a lexical scanner is:

Scanning
Scans the current stream and groups together a set of characters (if more then one)
Error-handling
The scanner can throw an exception or exit the software where a certian token was expected
Matches
Match a certain token and/or specific value which can throw an error if needed.

Example

A simple lexical-scanner draft. This scanner uses only 4 tokens for strings, numerics, characters and one for end-of-file (which is used for visual aid). This design is not the best to use since one should use reuse code where possible. One could create an interface and a superclass that implements most common scanner-functionallity. This allows additions for reading direct from compressed streams where the data is uncompressed on-the-fly and the compressed file/stream is returned as tokens.

A scanner can handle all the tokens which is the default. Such tokens as operators, keywords and so on. This scanner on the other hand only groups numerics, strings and characters.

Note that this interface is written with a pseudo-C++ code showing how inheritance and interfaces can play a part of the scanner structure. You can write the scanner in any language you want where you could use the builtin functions in the language instead of your own isAlpha(), isNumerc(), etc, functions.

Languages with builtin/extensions for character recognition:

  • C/C++: <ctype.h> (ISO C99)
  • PHP: ctype extension (see php manual)
/* alpha characters: T_STRING */
#define CHARS_ALPHA "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
/* numeric characters: T_NUMERIC */
#define CHARS_DIGIT "0123456789"
 
/* tokens */
#define T_EOF       0
#define T_STRING    1
#define T_NUMERIC   2
#define T_CHARACTER 3
 
/* lexical scanner interface */
class Scanner_Interface
    /* end-of-stream ? */
    bool isEof()
    /* check for alpha character */
    bool isAlpha()
    /* check for numeric character */
    bool isNumeric()
    /* check for alpha-numeric character */
    bool isAlphaNumeric()
    /* match current value with a string, throws exception */
    void match(char *)
    /* match current token, throws exception */
    void matchToken(token_t)
    /* get an alpha-value, throw exception if the current character
     * was'nt an alpha character. */
    void getAlpha()
    /* get a numeric-value, throw exception if the current character
     * was'nt a numeric character. */
    void getNumeric()
    /* get the current token */
    token_t getToken()
    /* write the value to the pointer given.
     * could also return a const char * to local value, */
    void getAlphaValue(char *)
    /* return the value as a signed integer */
    int getNumericValue()
    /* read next character in stream */
    void readNext()
    /* perform a scan */
    void scan()
 
/* a lexical scanner that uses a characterstream.
 * fast since it has a buffer in memory to read from at the expense of memory.
 */
class Scanner_CharStream extends Scanner_Interface
    /* make a local copy of the data passed to us */
    char * mStream
    /* current character being processed */
    char mCurrent
    /* last value found in a scan or getAlpha/Numeric() */
    char mValue [ 512 ]
    /* token from the last scan or getAlpha/Numeric() */
    token_t mToken
    /* constructor */
    Scanner( const char * data );
    /* interface implementation */
 
/* a lexical scanner that uses a filepointer.
 * saves memory at the expense of speed.
 */
class Scanner_FileStream extends Scanner_Interface
    FILE * mFilePointer
    char mCurrent
    char mValue [ 512 ]
    token_t mToken
    Scanner( FILE * fp )
    /* interface implementation */

Now for some internal workings with the use of pseudo-code:

function Scanner :: match (matchString)
    if not mString equal matchString then
        throw ScannerException ("expected: '" + matchString + "'")
 
'' getAlpha and getNumeric is similar
'' getNumeric calls isNumeric() only instead
function Scanner :: getAlpha ()
    '' check first character that it is an alphacharacter
    if not isAlpha () then
        throw ScannerException ("expected T_STRING")
    '' in VB and/or PHP clear the valuestring first
    '' read alphanumerics until not found anymore
    while not isAlphaNumeric () do
        '' build the string
        mValue = mValue + mCurrent
        readNext ()
    '' in C/C++ end the string with NULL-character
    mToken = T_STRING
 
function Scanner :: scan ()
    '' get alpha
    if isAlpha () then
        getAlpha ()
    '' get numeric
    else if isNumeric() then
        getNumeric()
    '' otherwise just set it as a character
    '' readNext checks if eof has occured
    else
        '' check for eof incase the isEof() is'nt used
        if isEof () then
            throw ScannerException ("unexpected end-of-file")
        mValue = mCharacter
        '' C/C++: add NULL-char
        readNext ()
 
function Scanner :: readNext ()
    '' check for eof, if so then reset variables so they cant be used
    if isEof () then
        mCharacter = ""
        mToken = T_EOF
        mValue = NULL
    '' read the character at the current position in stream
    mCharacter = mValue [ current position in stream ]
    '' increment position
    current position in stream += 1

This covers most of the basics for setting up a lexical scanner.

You dont have to write your own lexical scanner though. There are some splendid scanner-generators out there which can be used for free. See the links list.

External links