A programming language is low level when its programs require attention to the irrelevant.

Alan Perlis




Introduction

Frequently in CS projects we face situations where it is necessary to break some sort of data into tokens, and I'm not just talking about projects to design programming languages. Sometimes, prior to taking an action or performing a task, an application needs to tokenize configuration files, user input, database records, and so on.

The process of converting data which is part (for example) of a text into a sequence of meaningful identifiers is called lexical analysis. The tool used to perform a lexical analysis is usually called a "tokenizer" (a "lexer" is also a good name for it).

In this article I will take as an example the tokenizer used to create the Meeseeks BASIC language, turning it into an independent class that can be used as a general-purpose tokenizer in Delphi applications.

Strings

The tokenizer presented here can identify and extract tokens from a Delphi string. The rules to identify the tokens will be hardcoded in the class methods, but it will be very easy for a Delphi programmer to remove or customize these rules, giving them different functionality or even adding new rules.

Tokens

Prior to constructing our tokenizer, we have to define what kinds of tokens can be identified in the Delphi input string. I will keep to the basics necessary to build a programming language, in which case the tokenizer needs to cope with:

reserved words
identifiers
numbers
quoted text
symbols

Note that the set of possible tokens described above actually constitutes the first step toward constructing a lexer for a programming language.

Reserved Words

Let's make our tokenizer recognize the following words as reserved words:

new
run
load
edit
help
save
list
quit

These names are treated by the tokenizer as special words; they can be used by the Delphi application as instructions to perform some sort of action when found amid the input string. Later in this article, you will see how to make the tokenizer identify your own set of reserved words.

Identifiers

Identifiers are names which are not reserved words. They must start with an alphabetic character or an underscore ( _ ), which can be followed by zero or more alphanumeric characters and underscores. When we are talking about programming languages, identifiers generally denote variable and function names. Identifiers cannot contain spaces, but their parts can be separated by a dot ( . ). Below is a list of valid identifiers that can be recognized by our tokenizer:

a
b
id
_id
var1
this.is.valid
this.is.als0.valid

I think it is also important to show identifiers that are NOT valid:

1a
.c
this.1is.not.valid

Note that any identifier separated by dots can have numbers amid or at the end of each part, but a number cannot appear just after a dot.
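The identifier rule above can be sketched as a standalone Delphi function. This is a hedged illustration only; the function name IsValidIdentifier is mine and is not part of the tokenizer class we are about to build (it assumes System.SysUtils is in the uses clause, for CharInSet).

```pascal
uses
  System.SysUtils; //for CharInSet

//Hedged sketch of the identifier rule; not part of the tokenizer class.
function IsValidIdentifier(const s: String): Boolean;
var
  i: Integer;
begin
  Result := False;
  //must start with an alphabetic char or an underscore
  if (s = '') or not CharInSet(s[1], ['A'..'Z', 'a'..'z', '_']) then
    Exit;
  for i := 2 to Length(s) do
    if s[i] = '.' then
    begin
      //a dot must be followed by an alphabetic char or an underscore
      if (i = Length(s)) or
         not CharInSet(s[i + 1], ['A'..'Z', 'a'..'z', '_']) then
        Exit;
    end
    else if not CharInSet(s[i], ['A'..'Z', 'a'..'z', '0'..'9', '_']) then
      Exit;
  Result := True;
end;
```

With this rule, 'this.is.valid' passes, while '1a', '.c' and 'this.1is.not.valid' are all rejected, matching the lists above.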

Numbers

A number is the representation of a numeric literal. The tokenizer we are about to develop will distinguish integer and floating point literals. For example:

Integer numbers
40
132
78
1000

Floating point numbers
.5
100.15
20.1
1000.001
1.1e-8

Quoted text

Any text surrounded by quotes (single or double) will be treated as quoted text. The tokenizer will identify all the text between the quotes as a single token.

"This is a string"

'So is this'

Symbols

Symbols are characters that do not fit any of the definitions mentioned above but still have a meaning to the tokenizer. Generally, the keyboard's special characters, like ! @ & * / + - ( ) [ ], are treated as symbols by the tokenizer.

The Tokenizer Project

I used Delphi 10 Seattle to develop the tokenizer, and I was also able to compile it with RAD Studio XE7. I prefer these newer Delphi versions to older ones because I can take advantage of the char and string helpers; the tokenizer can also be easily compiled for all platforms supported by the new Delphi compilers.

So, in the Delphi IDE, select "File" -> "New" -> "VCL Forms Application - Delphi".

[Screenshot: creating a new VCL Forms Application]

This will open a fresh VCL application in the Delphi IDE. Note that the tokenizer class we are about to develop is platform independent; I chose to create it in a VCL application, but it could be a FireMonkey application without any problems.

Put two memo controls and one panel control on the form; the screen should look similar to the one below.

[Screenshot: the form with two memo controls and a panel]

Let's organize this form to make it more presentable and useful. Select the "Memo1" control and change its "Align" property to "alTop", so the control will occupy the top of the form.

[Screenshot: "Memo1" aligned to the top of the form]

Proceed in the same manner regarding the "Panel1" control. This panel must be aligned on top of the form, just below the "Memo1" control.

[Screenshot: "Panel1" aligned just below "Memo1"]

Now, select the "Memo2" control and change its "Align" property to "alClient", so it will occupy all the remaining space left on the form.

[Screenshot: "Memo2" filling the remaining client area]

Now, clear the contents of the "Memo1" and "Memo2" controls by selecting the "Lines" property and deleting everything in it; also select the "Panel1" control and set its "ShowCaption" property to "False".

Go back to the "Lines" property of "Memo1" and change its contents to:

10
7891.63 1.1e-8
load new ident 'This is a string'
"So is this" ok?
var1 var2 ident_with_underlines
< > = + - * / % $ @ &
obj.var
obj.func()
this.identifier.1is.NOT.valid
this.is.a.valid.identifier

Also, go to the Delphi Tool Palette and drag a button onto the panel on the form; change its "Align" property to "alLeft" so it will be positioned on the left side of the panel. The final aspect of the form should be similar to the one below.

[Screenshot: the final layout of the main form]

OK. Now we have the application's main form ready. I saved it using "uMain.pas" as the unit name, but you can choose another name if you prefer; I kept "Form1" as the form name. Let's put the main unit aside for the moment, because at this point we are about to start developing the tokenizer class. To proceed, create a new unit that will hold the tokenizer class and all the types necessary to make it work.

In the Delphi main menu, select "File" -> "New" -> "Unit - Delphi"

[Screenshot: creating a new unit]

At this point it's a good idea to save the project. I chose "uTokenizer.pas" as the new unit's filename; again, you can name it differently if you prefer.

The Tokenizer Class

I mentioned a few paragraphs above that our tokenizer would be able to identify certain tokens as special (reserved) words. In fact, the first step in building a tokenizer is to define which tokens it will process. The enumeration type TToken holds the labels that identify all the token types recognized by our tokenizer.

TToken = (tokNew, tokRun, tokLoad, tokEdit, tokHelp, tokSave, tokList,
    tokQuit, tokInteger, tokFloat, tokString, tokIdentifier,
    tokPipe, tokAt, tokAmpersand, tokEqual, tokRoundOpen, tokRoundClose,
    tokStar, tokSlash, tokPlus, tokComma, tokColon, tokSquareOpen,
    tokSquareClose, tokPower, tokSemiColon, tokMinus, tokQuestion,
    tokLower, tokGreater, tokSymbol, tokNone, tokCRLF, tokNull, tokUnknown);

This means that each time a token is identified in the source text, the tokenizer will store information about it: its size, its start position within the source text, the string that represents the token, and the label that identifies the token type.

The TTokenInfo record type will be used to keep this token information. To store the information for the multiple tokens that may be present in the source text, an array of TTokenInfo is defined as the TTokenArray type.

TTokenInfo = record
  tok: TToken;
  token: String;
  pos, len: Integer;
  n: Double;
end;
TTokenArray = array of TTokenInfo;

The meaning of each field in the structure above is described below.

tok: the TToken label that identifies the token type
token: the string representation of the token
pos: the start position of the token within the source text
len: the length of the token string
n: the numeric value for number tokens, or the string code otherwise

All the types presented above are necessary to identify and store the tokens extracted from the source input. These types are actually auxiliary and will be used by some of the tokenizer methods during the processing of the input text.

Now we can create the structure for the tokenizer class.

TTokenizer = class
private
public
end;

Not much, I admit, but that's just the class skeleton. Now we can create the elements (fields and methods) that will define our tokenizer.

TTokenizer = class
private
  source: PChar;
  tokens: TTokenArray;
  IP, idx: integer;
public
end;

As a starting point, let's look at the private fields we just added to our class skeleton.

source: a PChar pointer to the input text being tokenized
tokens: the dynamic array that stores the information for each extracted token
IP: the "instruction pointer", the index of the current token in the tokens array
idx: the index of the current character in the source text

Before starting to construct the tokenizer itself, let's look at two auxiliary methods which are very important during the token identification process.

TTokenizer = class
private
  source: PChar;
  tokens: TTokenArray;
  IP, idx: integer;

  function StringCode(s: String): integer;
  function StrToFloat2(const s: String; out ok: boolean): double;
public
end;

implementation

...

//Calculate string code
function TTokenizer.StringCode(s: String): integer;
var
  i: integer;
begin
  Result := 0;
  for i := 0 to Pred(s.Length) do
    Result := Result + Ord(s.Chars[i]);
end;

//Customized conversion from string to float.
function TTokenizer.StrToFloat2(const s: String; out ok: boolean): double;
var
  d: double;
  i: integer;
  s2: String;
begin
  s2 := s;
  if Length(s2) > 0 then
    for i := 1 to Length(s2) do
      if s2[i] = FormatSettings.DecimalSeparator then s2[i] := '.';
  Val(s2, d, i);
  ok := i = 0;
  result := d;
end;

The StringCode method returns the sum of the character codes of all the characters in the string passed in the parameter "s". For example, suppose the method's parameter is the string 'INPUT'. In that case we would have:

I: char code 73
N: char code 78
P: char code 80
U: char code 85
T: char code 84

The total sum of all these codes is 400.

So, the StringCode for the string 'INPUT' is 400. That's useful when the tokenizer is about to check whether an extracted token is a reserved word or an identifier. If the tokenizer knows the string codes beforehand, it can first check whether the extracted code is within the range of the reserved words' minimum and maximum values; if the extracted code is out of that range, the tokenizer can immediately assume the token is an identifier. This will become clearer when we see the tokenizer working.

The StrToFloat2 method is used to correctly convert floating point numbers from their original string representation to their numerical representation. Our tokenizer will treat as a numeric literal any extracted value made only of digits; if these digits are separated by a dot ( . ) it is a floating point number, otherwise it is an integer. The problem is that in some countries the decimal separator is not a point but, for example, a comma. The StrToFloat2 method scans the parameter string for the Delphi-defined decimal separator and, if that character is found, replaces it with the '.' char, making sure the returned result is the decimal number converted from the parameter string.
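To illustrate why Val is used here, the fragment below is a standalone sketch, not part of the class: Val is locale independent and always expects '.' as the decimal separator, and it returns the position of the first invalid character in its third parameter (0 when the whole string is a valid number).

```pascal
var
  d: Double;
  errPos: Integer;
begin
  Val('1.1e-8', d, errPos);
  //errPos = 0: the whole string is a valid number, regardless of locale
  Val('1,5', d, errPos);
  //errPos <> 0: Val stops at the ',', which is why StrToFloat2
  //normalizes the locale separator to '.' before calling Val
end;
```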

TTokenizer = class
private
  source: PChar;
  tokens: TTokenArray;
  IP, idx: integer;

  function StringCode(s: String): integer;
  function StrToFloat2(const s: String; out ok: boolean): double;
public
  constructor Create;
  destructor Destroy; override;
end;

implementation

...

constructor TTokenizer.Create;
begin
  inherited Create;

  //Set length for 1 single token during object creation.
  SetLength(tokens, 1);
end;

destructor TTokenizer.Destroy;
begin
  inherited Destroy;
end;

The class destructor just calls the inherited destructor, since that's mandatory, while the class constructor initializes the tokens dynamic array with space for one element; more space is allocated during the processing of the input data.

TTokenizer = class
private
  source: PChar;
  tokens: TTokenArray;
  IP, idx: integer;

  function StringCode(s: String): integer;
  function StrToFloat2(const s: String; out ok: boolean): double;
public
  constructor Create;
  destructor Destroy; override;
  procedure Load(Code: String);
  function IdentKind(tokStr: String): TToken;
  procedure GetToken(var TokenStr: String; var tokenPos, tokenLen: Integer; var tok: TToken);
end;

implementation

...
procedure TTokenizer.Load(Code: String);
var
  ok: boolean;
  data: String;
  id: TToken;
  tokPos, tokLen: integer;
begin
  source := PChar(Code);
  idx := 0; //source code pointer index
  IP := 0; //Instruction pointer

  repeat
    //Get next token
    GetToken(data, tokPos, tokLen, id);

    //Configure token informations
    tokens[IP].tok := id;
    tokens[IP].token := data; //token data (string representation)
    tokens[IP].pos := tokPos; //token start position
    tokens[IP].len := tokLen; //token length

    if (id = tokInteger) or (id = tokFloat) then
      tokens[IP].n := StrToFloat2(data, ok) //If a number, keep the value
    else
      tokens[IP].n := StringCode(UpperCase(data)); //it must be in "uppercase"

    Inc(IP); //Next instruction pointer
    SetLength(tokens, Length(tokens)+1); //allocate space for a new token
  until (id = tokNull) or (IP = MAXINSTR);

  //Keep end of program representation
  tokens[IP].tok := tokNull;
  tokens[IP].token := '';
  tokens[IP].pos := tokPos;

  GotoToken(0); //Position back to the first token after the array is built
end;

The Load method takes the input string passed in the Code parameter and tries to tokenize it. First, it makes the source class variable point to the parameter string, and the IP and idx indexes are initialized to zero. The GetToken method is called inside a loop that ends when the end of the input data is reached (MAXINSTR is a constant, not shown here, that caps the number of tokens extracted). After the call to GetToken, the local variables data, tokPos, tokLen and id hold, respectively, the token's string representation, the token's position in the source input, the length of the token string and the TToken label that identifies it.

After the token information is collected, it is stored as a new entry in the tokens array. If the token is a numeric literal, a conversion is made to also store its equivalent numeric value; otherwise, its string code is stored, computed with the token characters converted to uppercase since our tokenizer is not case sensitive.

The last operation inside the loop is to increment the IP index and allocate space for a new item into the tokens array.

When the loop finishes, the last entry allocated in tokens is filled as a tokNull token and GotoToken(0) is called to make sure the IP index points to the first token in the array.

function TTokenizer.IdentKind(tokStr: String): TToken;
var
  HashCode: integer;
begin
  Result := tokIdentifier;
  tokStr := UpperCase(tokStr);
  HashCode := StringCode(tokStr);

  if (HashCode < 234) or (HashCode > 323) then Exit;

  case HashCode of
    234: if tokStr = 'NEW' then Result := tokNew;
    245: if tokStr = 'RUN' then Result := tokRun;
    288: if tokStr = 'LOAD' then Result := tokLoad;
    294: if tokStr = 'EDIT' then Result := tokEdit;
    297: if tokStr = 'HELP' then Result := tokHelp;
    303: if tokStr = 'SAVE' then Result := tokSave;
    316: if tokStr = 'LIST' then Result := tokList;
    323: if tokStr = 'QUIT' then Result := tokQuit;
  end;
end;

The IdentKind method checks whether the parameter string is a reserved word or an identifier and returns the appropriate TToken label for it. First, the method converts all the parameter's characters to uppercase (remember, our tokenizer is not case sensitive) and checks whether the parameter's string code is in the range of the reserved word codes; if it is not, the method assumes the parameter is an identifier and returns tokIdentifier.

If the parameter's string code is within the reserved word range, it is compared with each reserved word's code; if a match is found, the method then checks whether the parameter text is equal to the reserved word's name. If it is, the reserved word's TToken label is returned; otherwise, tokIdentifier is returned. This final text comparison is necessary because two words can have the same string code without sharing the same text.
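As a quick illustration of that last point (a hedged usage sketch, not part of the article's demo app): 'SILT' is an anagram of the reserved word 'LIST', so both share the string code 316 (83 + 73 + 76 + 84), yet only the exact text is classified as the reserved word.

```pascal
var
  t: TTokenizer;
begin
  t := TTokenizer.Create;
  try
    Assert(t.IdentKind('list') = tokList); //exact match: reserved word
    Assert(t.IdentKind('silt') = tokIdentifier); //same code 316, but not 'LIST'
  finally
    t.Free;
  end;
end;
```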

procedure TTokenizer.GetToken(var TokenStr: String; var tokenPos, tokenLen: Integer; var tok: TToken);
var
  d: Double;
  ok: Boolean;
  Ch, Ch2: Char;
begin
  //Skip blanks
  Ch := source[idx];
  while Ch.IsInArray([#8,#9,#32]) do
  begin
    Inc(idx);
    Ch := source[idx];
  end;
  tokenLen := 1; //token initial size
  tokenPos := idx; //token initial position
  case source[idx] of
    'A' .. 'Z', 'a' .. 'z', '_': //identifier
    begin
      tokenPos := idx;
      Inc(idx);

      //do while current char is valid to build an identifier.
      Ch := source[idx];
      while Ch.IsInArray(
        ['A','B','C','D','E','F','G','H','I','J','K','L','M',
         'N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
         'a','b','c','d','e','f','g','h','i','j','k','l','m',
         'n','o','p','q','r','s','t','u','v','w','x','y','z',
         '0','1','2','3','4','5','6','7','8','9','_','.']) do
      begin
        if Ch = '.' then //if '.' notation is being used
        begin //'.' must be succeeded by alpha or '_'.
          Ch2 := source[idx+1];
          if not(Ch2.IsInArray(['A','B','C','D','E','F','G','H','I','J','K','L',
                                'M','N','O','P','Q','R','S','T','U','V','W','X',
                                'Y','Z','a','b','c','d','e','f','g','h','i','j',
                                'k','l','m','n','o','p','q','r','s','t','u','v',
                                'w','x','y','z','_'])) then
          begin //otherwise it's an error
            tokenLen := idx - tokenPos; //calculate token length
            SetString(tokenStr, source+tokenPos, tokenLen); //update tokenStr
            tok := tokUnknown; //As already said, it's an error.
            Exit;
          end;
        end;

        Inc(idx); //inc idx
        Ch := source[idx]; //update Ch
      end;

      tokenLen := idx - tokenPos; //calculate token length
      SetString(tokenStr, source+tokenPos, tokenLen); //update tokenStr
      tok := IdentKind(tokenStr); //it's either an identifier or a reserved word

      //Skip blanks
      while Ch.IsInArray([#8,#9,#32]) do
      begin
        Inc(idx);
        Ch := source[idx];
      end;
    end;
    '0' .. '9', '.': //number
    begin
      tokenPos := idx;
      Inc(idx);
      tok := tokInteger; //An integer, at first
      Ch := source[idx];
      while Ch.IsInArray(['0','1','2','3','4','5','6','7','8','9','.','e','E']) do
      begin
        case source[idx] of
          '.': tok := tokFloat; //if there is a '.' ...
          'e', 'E':
          begin
            Inc(idx);
            Ch := source[idx]; //...update Ch...
            tok := tokFloat; //...it's a floating point number
          end;
        end;
        Inc(idx);
        Ch := source[idx]; //update Ch
      end;
      tokenLen := idx - tokenPos;
      SetString(tokenStr, source + tokenPos, tokenLen);
      if source[tokenPos] = '.' then
      begin
        tok := tokFloat;
        //Allows interpretation of floating numbers started by the dot, like:
        // '.5' or '.9'
        tokenStr := '0' + tokenStr;
      end;
      d := StrToFloat2(tokenStr, ok); //conversion
      if not ok then tok := tokUnknown; //not ok, returns unknown type
      if d > 2147483647.0 then tok := tokFloat; //too large for an Integer, so it's a float
      if source[idx] = '#' then Inc(idx);
    end;
    //If you want to port this project to different platforms, note:
    //
    // end of line in MS-Windows: #13#10
    // in UNIX: #10
    // in OS-X: #13
    #10: //CRLF
    begin
      tok := tokCRLF;
      tokenStr := System.sLineBreak;
      tokenPos := idx;
      Inc(idx);
    end;
    #13: //CRLF
    begin
      tok := tokCRLF;
      tokenStr := System.sLineBreak;
      tokenPos := idx;
      Inc(idx);
      if source[idx] = #10 then
        Inc(idx);
    end;
    //Symbols
    '!', '&', '(' .. '-', '/', ':' .. '@', '[' .. '^', '{'..'~':
    begin
      tokenPos := idx;
      case source[idx] of
        '|': tok := tokPipe;
        '@': tok := tokAt;
        '&': tok := tokAmpersand;
        '=': tok := tokEqual;
        '(': tok := tokRoundOpen;
        ')': tok := tokRoundClose;
        '*': tok := tokStar;
        '/': tok := tokSlash;
        '+': tok := tokPlus;
        ',': tok := tokComma;
        ':': tok := tokColon;
        '[': tok := tokSquareOpen;
        ']': tok := tokSquareClose;
        '^': tok := tokPower;
        ';': tok := tokSemiColon;
        '-': tok := tokMinus;
        '?': tok := tokQuestion;
        '<': tok := tokLower;
        '>': tok := tokGreater;
        else tok := tokSymbol;
      end;
      Inc(idx); //increment idx
      tokenLen := idx - tokenPos; //calculate token size
      SetString(tokenStr, source + tokenPos, tokenLen);
    end;
    #34: //It's a string
    begin
      tok := tokString;
      repeat
        case source[idx] of //end of line not allowed as string content
          #0, #10, #13:
          begin
            Dec(idx);
            tok := tokUnknown;
            Break;
          end;
        end;
        Inc(idx); //increment idx
      until source[idx] = #34;

      Inc(idx); //go after the last "

      tokenPos := tokenPos + 1; //skip the opening quote
      tokenLen := idx - tokenPos - 1; //length of the text between the quotes
      SetString(tokenStr, source + tokenPos, tokenLen);
    end;
    #39: //It's a string
    begin
      tok := tokString;
      repeat
        case source[idx] of //end of line not allowed as string content
          #0, #10, #13:
          begin
            Dec(idx);
            tok := tokUnknown;
            Break;
          end;
        end;
        Inc(idx); //increment idx
      until source[idx] = #39;

      Inc(idx); //go after the last '

      tokenPos := tokenPos + 1; //skip the opening quote
      tokenLen := idx - tokenPos - 1; //length of the text between the quotes
      SetString(tokenStr, source + tokenPos, tokenLen);
    end;
    #0: //null = end of program
    begin
      tok := tokNull;
      tokenStr := #0;
      tokenPos := idx;
    end;
    else
    begin //if none of the above tests were satisfied...
      tokenPos := idx; //...token unknown
      Inc(idx); //increment idx
      tok := tokUnknown; //obvious
      tokenLen := idx - tokenPos; //size of "unknown" token
      SetString(tokenStr, source + tokenPos, tokenLen);
    end;
  end;
end;

The GetToken method is the heart of the tokenizer. This method searches for the next token among the characters of the input text pointed to by the source class variable. If a valid token is recognized, the parameter variables are updated with the token information, and the idx index is updated to point to the beginning of the next token so it can be processed when the method is called again. If the end of the input string is reached, GetToken updates the token variables with the tokNull label type.

The method works by first skipping any blanks from the current position in the input text. When a non-blank character is found, a case statement is used to test it, and the method tries to extract a token according to that character. For example, if a numeric character is found, it tries to extract a number from the input text; if an alphabetic or underscore character is found, it tries to identify a reserved word or an identifier; and so on.

Those who are familiar with compiler construction techniques will recognize this method as a finite state automaton, which in fact it really is.

Now it's time to finish our tokenizer.

TTokenizer = class
private
  source: PChar;
  tokens: TTokenArray;
  IP, idx: integer;

  function StringCode(s: String): integer;
  function StrToFloat2(const s: String; out ok: boolean): double;
public
  constructor Create;
  destructor Destroy; override;
  procedure Load(Code: String);
  function IdentKind(tokStr: String): TToken;
  procedure GetToken(var TokenStr: String; var tokenPos, tokenLen: Integer; var tok: TToken);
  procedure Advance; //Advance one token
  procedure PutBack; //go back to the previous token
  function CurrS: String; //String representation of token
  function CurrN: double; //Numeric representation of token
  function CurrPos: integer; //Current position
  function CurrTok: TToken; //Current token
  function PrevTok: TToken; //Previous token
  function NextTok: TToken; //Next token
  procedure GotoToken(n: integer); //Goto a specific token index
  function TotalTokens: Cardinal; //Total of processed tokens
end;

We have our tokenizer practically finished. Now it's just a matter to create a few more methods to give it extra functionalities which are commonly necessary in applications that need to tokenize data.

procedure TTokenizer.Advance; //Advance to the next token
begin
  Inc(IP);
end;

The Advance method just increments the value at the IP index in order to point to the next token stored in the tokens array.

procedure TTokenizer.PutBack; //Move back to the previous token
begin
  Dec(IP);
end;

The PutBack method is just the opposite of Advance, it decrements the IP index in order to point to the previous token in the tokens array.

function TTokenizer.CurrS: String;
begin
  result := Tokens[IP].token;
end;

The CurrS method returns the string representation of the current token in the array.

function TTokenizer.CurrN: double;
begin
  Result := Tokens[IP].n;
end;

The CurrN method returns the numerical value stored for the current token. It is the numeric representation if the token is an integer or floating point literal; otherwise, it is the string code.
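A hedged sketch of this dual meaning, assuming a TTokenizer instance named tokenizer already created:

```pascal
tokenizer.Load('10 run');
tokenizer.GotoToken(0);
Assert(tokenizer.CurrN = 10); //numeric literal: CurrN holds its value
tokenizer.Advance;
//'run' is not numeric, so CurrN holds the string code of 'RUN'
//(82 + 85 + 78 = 245)
Assert(tokenizer.CurrN = 245);
```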

function TTokenizer.CurrPos: integer;
begin
  Result := Tokens[IP].pos; //token position
end;

The CurrPos method returns the position of the current token in relation to the input source text.

function TTokenizer.CurrTok: TToken;
begin
  Result := TToken(Tokens[IP].tok); //token id
end;

The CurrTok method returns the TToken label that identifies the type for the current token.

function TTokenizer.NextTok: TToken; //next token
begin
  if IP < High(Tokens) then
    Result := TToken(Tokens[IP + 1].tok)
  else
    Result := tokNone; //guard: there is no token after the last one
end;

The NextTok method returns the TToken label that identifies the next token in the array, but does not change the value at the IP index.

function TTokenizer.PrevTok: TToken; //previous token
begin
  if IP > 0 then
    Result := TToken(Tokens[IP - 1].tok)
  else
    Result := tokNone;
end;

The PrevTok method returns the TToken label that identifies the previous token in the array, but does not change the value at the IP index.

procedure TTokenizer.GotoToken(n: integer); //move index to a specific token
begin
  //ignore indexes outside the bounds of the tokens array
  if (n < 0) or (n > High(tokens)) then Exit;

  IP := n;
end;

The GotoToken method accepts an integer value as a parameter and uses it to change the IP index, so we can use it to move the index to any token in the array.

function TTokenizer.TotalTokens: Cardinal;
begin
  Result := Length(Tokens);
end;

TotalTokens is an auxiliary method that just returns the total number of tokens currently stored in the tokens array. It can be used as a delimiter when iterating over the array elements.
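As a hedged sketch of how these navigation methods might be combined in an application (DoLoad is a hypothetical routine of your own, not part of this article's code):

```pascal
//Consume a 'load' command followed by a quoted filename,
//e.g. the input: load 'program.bas'
if (tokenizer.CurrTok = tokLoad) and (tokenizer.NextTok = tokString) then
begin
  tokenizer.Advance; //IP now points to the string token
  DoLoad(tokenizer.CurrS); //hypothetical application routine
  tokenizer.Advance; //move past the filename
end;
```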

With these auxiliary methods we have just completed our tokenizer class. Let's put it to work.

Tokenizer App

Move back to the main form of the tokenizer application we created at the beginning of this article. Use the key combination <ALT>+<F11> to include the tokenizer class unit "uTokenizer.pas" in the main form's uses clause (assuming you did not change the original filename).

Double click the "Button1" button in order to create the Delphi method that will run when this button is clicked. Add the following lines to this method.

procedure TForm1.Button1Click(Sender: TObject);
var
  i: Integer;
  tokenizer: TTokenizer; //local instance of our tokenizer class
begin
  tokenizer := TTokenizer.Create;
  try
    tokenizer.Load(Memo1.Text);
    for i := 0 to Pred(tokenizer.TotalTokens) do
    begin
      tokenizer.GotoToken(i);
      if (tokenizer.CurrTok <> tokCRLF) and (tokenizer.CurrTok <> tokNull) then
        //GetEnumName requires System.TypInfo in the uses clause
        Memo2.Lines.Add(tokenizer.CurrS+' : '+GetEnumName(TypeInfo(TToken), integer(tokenizer.CurrTok)))
      else
      begin
        if tokenizer.CurrTok = tokCRLF then
          Memo2.Lines.Add('<CRLF> : tokCRLF');
        if tokenizer.CurrTok = tokNull then
          Memo2.Lines.Add('<Null> : tokNull');
      end;
    end;
  finally
    tokenizer.Free; //release the tokenizer even if an exception occurs
  end;
end;

So, when the application is running and the button control is pressed, the Button1Click method is called. Inside this method a TTokenizer object is created and its Load method is called with the Memo1 text as the parameter.

The Load method of the TTokenizer object tokenizes the text given as a parameter and builds the array with the information for all the tokens identified in it. After this process is complete, we can traverse the entire array and show the information it holds.

If you kept the suggested text at the "Memo1" control, you should see a screen like the one below after you press the button.

[Screenshot: the token list produced in "Memo2"]

The demo application just iterates over all the array elements, showing the string representation of the token at the current position and its TToken label properly converted to a string.

The original token sequence I proposed for the Memo1 control at the beginning of this article purposely includes some tokens which are not handled by the tokenizer, and even an invalid token formation. In this demo I just keep showing the tokens, but in a real application it would be a good idea to stop the process and send an error message to the user: the tokenizer can get confused after dealing with an invalid token, and the token just following the erroneous one can be misinterpreted.

Enough writing; here is what you want... the complete source code for the tokenizer project. Cheers.