Do the simplest thing possible, but no simpler.

Albert Einstein




The C++ compiler

The original BASIC interpreter project was created to run under the Linux 14.04 LTS 32-bit operating system. It was compiled using Code::Blocks (svn build rev 10122) configured for the g++ 4.8.4 32-bit compiler. Later, the project was successfully ported to Windows 7 Professional (32-bit, Service Pack 1), using Code::Blocks 13.12 configured for the mingw32-g++ (tdm-1) 4.7.1 compiler.

Other C++ compilers or operating systems may require additional changes to the project source code.


The Interpreter Source Code

The source code is divided using the following structure:

File                      Description
common.hpp                Common features
basic.hpp, basic.cpp      All classes and types related to the interpreter functionality
baslib.hpp, baslib.cpp    Extension libraries for the BASIC interpreter
utils.hpp, utils.cpp      Utility functions

Note: The complete project structure is now available for download.


The Interpreter files

The header files are enough to summarize what each interpreter module does.


common.hpp

This small file contains a definition named _LINUX_ that must be active when compiling under Linux. If the interpreter is part of a Windows project, this definition must be commented out, otherwise MinGW will have trouble with the source code.

The version of MinGW identified above, used to compile the interpreter under Windows, presented some annoying bugs. I had to create a small template to convert between types, because the compiler does not recognize functions like stoi, stol, stoul, stoll, stof, stod, stold, to_string, etc. as part of the std namespace.

A namespace called "patch" was implemented to hold the necessary methods; to use it, just follow a syntax similar to:

patch::to_string(<value>)

or

patch::to_number<type>(<value>)
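
A workaround of this kind is usually built on string streams. The listing below is only a minimal sketch of how such a namespace could look, assuming the two templates do nothing more than stream the value in or out; the real code in common.hpp may differ in names and details.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for the "patch" namespace; the real one lives in common.hpp.
namespace patch {

// Convert any streamable value to its textual form.
template <typename T>
std::string to_string(const T &value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// Parse a value of type T from text; yields T() (zero for numbers) if parsing fails.
template <typename T>
T to_number(const std::string &text)
{
    std::istringstream in(text);
    T value = T();
    in >> value;
    return value;
}

} // namespace patch

// Small self-check showing the intended usage.
inline void CheckPatchSketch()
{
    assert(patch::to_string(42) == "42");
    assert(patch::to_number<int>("123") == 123);
    assert(patch::to_number<double>("2.5") == 2.5);
}
```

Because everything goes through the stream operators, the same two templates cover all the numeric types that the missing std functions would handle one by one.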


basic.hpp

This C++ header file keeps all the types and class definitions of the BASIC interpreter. The class implementations are in the basic.cpp file.

There are six class headers in this file:

BASIC Interpreter classes

Class CBasic

This is the only class that has to be present in the host application. Some of its relevant methods are:

int Compile(string source, TProgramFunctionsDictionary funcs);

This method compiles a BASIC source program passed as a string. A dictionary with binding information for C++ functions is also passed, to be integrated later into the stack machine (this dictionary can be empty). If the program compiles without errors, the method returns zero; otherwise it returns the error position.

void ExecuteProgram();

If a call to Compile returned zero, the BASIC program can be executed with the ExecuteProgram method.

The two methods described above are, in my humble opinion, the minimum necessary to make the BASIC interpreter useful, but one more is worth mentioning.

bool ExecuteUserFunction(string fnsignature, TAsmData *params, TStackType &rettype, TAsmData &retval);

This method is used to execute a specific user function in the compiled BASIC code.

The first two arguments are the BASIC function signature and a dynamically allocated vector holding its parameters; the number of elements in the vector must match the number of arguments of the BASIC function.

The last two arguments have their values updated if the function call is successful: rettype is the type returned by the function, and retval holds the value returned by it.

Class CBasicLexer

A BASIC program must be correctly tokenized before the parser performs the syntactic analysis on it. The CBasicLexer class has all the methods necessary to break a BASIC program into tokens.

Some of the relevant methods in this class are:

void LoadProg(string source);

This method reads the BASIC source program passed as a string and breaks it into a list of tokens.

void Advance();

Advance to the next token in the list.

string CurrS();

Gets the string representation of the current token.

double CurrN();

If the current token is a number, this method returns its value.

void GotoToken(int n);

Sets the current token to the one with the index given by n.
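
To make the cursor-style interface more concrete, here is a toy lexer that mirrors the five method names above. It is just a sketch under simple assumptions (tokens are numbers, words or single symbols), and it is not the CBasicLexer implementation, which handles much more than this.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Toy stand-in for CBasicLexer: same method names, greatly simplified behaviour.
class ToyLexer {
public:
    ToyLexer() : pos_(0) {}

    // Break the source into tokens: numbers, words and single symbols.
    void LoadProg(const std::string &source)
    {
        tokens_.clear();
        pos_ = 0;
        std::size_t i = 0;
        while (i < source.size()) {
            if (IsSpace(source[i])) { ++i; continue; }
            const std::size_t start = i;
            if (IsDigit(source[i])) {
                while (i < source.size() && (IsDigit(source[i]) || source[i] == '.')) ++i;
            } else if (IsAlpha(source[i])) {
                while (i < source.size() && (IsAlpha(source[i]) || IsDigit(source[i]))) ++i;
            } else {
                ++i; // any other character is a one-character token
            }
            tokens_.push_back(source.substr(start, i - start));
        }
    }

    // Move the cursor to the next token (it stops one past the last token).
    void Advance() { if (pos_ < tokens_.size()) ++pos_; }

    // Text of the current token, or an empty string past the end.
    std::string CurrS() const { return pos_ < tokens_.size() ? tokens_[pos_] : std::string(); }

    // Numeric value of the current token (0 if it is not a number).
    double CurrN() const { return std::strtod(CurrS().c_str(), nullptr); }

    // Jump to the token with index n.
    void GotoToken(int n) { pos_ = static_cast<std::size_t>(n); }

private:
    static bool IsSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    static bool IsDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool IsAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

    std::vector<std::string> tokens_;
    std::size_t pos_;
};

// Small self-check: walk over "PRINT 5 + 5" and jump back to the start.
inline void CheckToyLexer()
{
    ToyLexer lexer;
    lexer.LoadProg("PRINT 5 + 5");
    assert(lexer.CurrS() == "PRINT");
    lexer.Advance();
    assert(lexer.CurrN() == 5.0);
    lexer.GotoToken(0);
    assert(lexer.CurrS() == "PRINT");
}
```

A method like GotoToken lets the caller return to an earlier token by its index, as the last lines of the self-check do.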

Class CBasicParser

This can be considered the main interpreter class. All the operations performed by the interpreter after a BASIC program is tokenized are managed inside CBasicParser.

Some of the relevant methods in this class are:

int Compile(string input, TStringItems &ASMOutput, TProgramFunctionsDictionary &libfuncs);

The Compile method in CBasic is actually a wrapper around this one.

From this method the parser coordinates all the steps necessary to compile a BASIC program. It calls the lexer to tokenize the BASIC source code, parses the list of tokens, sends the intermediate code produced to the preprocessor and receives back from it the final assembly code.

After a successful call to Compile, the parameter ASMOutput holds the list of final assembly instructions, and libfuncs is updated with the signatures and entry points of all the functions available to the program.

int ErrorPos();

int ErrorLine();

string LastError();

If an error occurs during a call to Compile, the methods above can be called to retrieve, respectively, the error position, the error line number and a descriptive message.

Class CBasicAsmLexer

Used by the preprocessor and stack machine to tokenize the postfix code.

Relevant methods:

void LoadLine(string p);

Load and tokenize a line in postfix format.

void Advance(string &data, int *tokenpos, TAsmToken *tok);

Advances through the tokens in the loaded line.

After its execution, the parameters are updated, respectively, with the textual representation of the token, its start position and its enum code.

Class CPreProcessor

Tries to process the intermediate code produced by the parser, in order to create the assembly instructions executed by the stack machine.

The relevant methods in this class are:

TPreResult PreProcessing(TStringItems &source, TProgramFunctionsDictionary &funcs, vector<string> aliases);

Preprocess the intermediate code.

The source parameter initially contains the intermediate code; it is updated with the final assembly instructions if all the preprocessing steps are successful. The parameter funcs is updated with information about the functions available to the program; aliases is a list of strings that helps the preprocessor locate functions.

The argument aliases is useful, for example, when using the '.' character in function names. A good test case is the set of BASIC interpreter libraries defined in the baslib.cpp file.

There is, among many other functions, a set of math related functions whose names start with the prefix 'math.', like:

math.abs(value)
math.sgn(value)
math.sin(value)
math.cos(value)
math.tan(value)
...

Normally, to access such functions we should use the complete function name, like:

REM returns the absolute value 10
PRINT math.abs(-10)

The preprocessor always tries to find the entry point for a function by first searching for the original function name; if that attempt fails, it tries again by prepending each string in the aliases list to the original function name. So, if the string 'math.' is part of the aliases list, any function whose name starts with 'math.' can be called using just the part of the name that follows the alias.

For example:

REM returns the absolute value 10
PRINT abs(-10)

This is a valid call to the math.abs() function, if the string 'math.' is in the aliases list.

Use the method AddAlias in the CBasic class to add a new alias to the aliases list.
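
The lookup order described above can be sketched in a few lines. This is only an illustration: the function name ResolveFunction and the use of a std::map with integer values are my assumptions for the example, not the preprocessor's actual code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Returns the dictionary key that matches `name`, trying the plain name first
// and then each alias prefix in order; returns "" if nothing matches.
// (Hypothetical helper; the dictionary value type is irrelevant here.)
std::string ResolveFunction(const std::map<std::string, int> &dictionary,
                            const std::vector<std::string> &aliases,
                            const std::string &name)
{
    if (dictionary.count(name)) return name;          // exact name wins
    for (const std::string &alias : aliases) {
        const std::string candidate = alias + name;   // e.g. "math." + "abs@n"
        if (dictionary.count(candidate)) return candidate;
    }
    return std::string();
}

// Small self-check: "abs@n" is only found once 'math.' is in the aliases list.
inline void CheckResolveFunction()
{
    std::map<std::string, int> dictionary;
    dictionary["math.abs@n"] = 1;

    std::vector<std::string> aliases;
    assert(ResolveFunction(dictionary, aliases, "abs@n") == "");

    aliases.push_back("math.");
    assert(ResolveFunction(dictionary, aliases, "abs@n") == "math.abs@n");
    assert(ResolveFunction(dictionary, aliases, "math.abs@n") == "math.abs@n");
}
```

Since the exact name is always tried first, adding an alias never changes the meaning of a call that already worked with the full name.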

TUserFunctionsDictionary ReturnRegisteredFunctions();

This method returns a dictionary with information about the user defined functions. These are the functions declared by the user in the BASIC source code; C++ functions imported into the interpreter engine are not part of it.

This method is useful if, after compiling the BASIC source code, we want to execute just a specific function instead of the entire program.

Class CExec

Implements the stack machine which interprets the compiled BASIC instructions.

The main methods are:

void LoadSource(TStringItems &ls);

Load the assembly instructions to execute.

void ExecuteFunction(unsigned int entry, unsigned int argc, TStackType *argt, TAsmData *params, TStackType &rettype, TAsmData &retvalue);

Internally called by the stack machine to execute a function.

void ExecuteProgram();

Internally called by the stack machine to reset all its registers and start the execution of the loaded instructions from the first one.


At this point, I think it's important to explain how the stack machine works. Let's take as a first example the simple code displayed below:

PRINT 5 + 5

The program above just prints 10 as output. All too easy.

But the actual code run by the stack machine will be similar to this listing:

, 1
PUSHC 5
PUSHC 5
ADD
PUSHC 1
PRINT
, 2
END

This is the assembly code produced by the preprocessor after all the steps to "compile" the BASIC program are complete.

We could read this simple program as follows:

;a semicolon starts a comment until the end of the line
, 1 ;a comma is used to identify the line in the BASIC program that generates the next instructions
PUSHC 5 ;push the integer 5 onto the stack
PUSHC 5 ;push another integer 5 onto the stack
ADD ;pop the two highest values from the stack, add these two values, push the result back onto the stack.
PUSHC 1 ;push the integer 1 onto the stack
PRINT ;call the PRINT instruction
, 2 ;line two in the BASIC source program
END ;finish execution
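
To see the listing above actually run, here is a tiny stack machine that understands only the instructions used in it. It is a sketch and not the CExec implementation. I am assuming that the value pushed right before PRINT is the number of items to print, which is how I read the PUSHC 1 in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Runs a listing made of ", n", PUSHC, ADD, PRINT and END, one instruction per
// line, and returns everything PRINT produced.
// Assumption: the value on top of the stack when PRINT runs is the item count.
std::string RunToyMachine(const std::string &listing)
{
    std::vector<double> stack;
    std::ostringstream output;
    std::istringstream lines(listing);
    std::string line;

    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string op;
        if (!(fields >> op) || op == ",") continue;  // blank line or source-line marker
        if (op == "END") break;

        if (op == "PUSHC") {
            double value = 0;
            fields >> value;
            stack.push_back(value);
        } else if (op == "ADD" && stack.size() >= 2) {
            const double right = stack.back(); stack.pop_back();
            const double left = stack.back(); stack.pop_back();
            stack.push_back(left + right);
        } else if (op == "PRINT" && !stack.empty()) {
            std::size_t count = static_cast<std::size_t>(stack.back());
            stack.pop_back();
            if (count > stack.size()) count = stack.size();
            // Print the `count` topmost values in the order they were pushed.
            for (std::size_t i = stack.size() - count; i < stack.size(); ++i)
                output << stack[i];
            stack.resize(stack.size() - count);
            output << '\n';
        }
    }
    return output.str();
}

// Small self-check: the listing for PRINT 5 + 5 prints 10.
inline void CheckToyMachine()
{
    const std::string listing =
        ", 1\nPUSHC 5\nPUSHC 5\nADD\nPUSHC 1\nPRINT\n, 2\nEND\n";
    assert(RunToyMachine(listing) == "10\n");
}
```

Note that the instructions never name their operands: everything an instruction needs is already on the stack when it runs.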

Now, let's check the substitutions the preprocessor does in the intermediate code. Take a look at the program below:

a = -10
PRINT abs(a)

First, the program sets the variable a to the negative integer value -10, then calls the function abs with the variable a as the parameter, then prints the result.

If we could see the intermediate code produced by the parser, we would see something like:

,
PUSHC 10
INV
POPSTORE @a
,
PUSH @a
PUSHC 1
CALLFAR 'abs@n'
PUSHC 1
PRINT
,
END

After the preprocessor finishes processing the intermediate code, we would see something like:

, 1
PUSHC 10
INV
POPSTORE 0
, 2
PUSH 0
PUSHC 1
CALLFAR 134717472
PUSHC 1
PRINT
, 3
END

Note that the source line numbers are now placed after the commas, the identifier of the variable a was replaced by its address, and the abs function signature was replaced by its entry point in memory. The interesting part is that the number displayed was the actual memory entry point of the C++ function on my machine, because the abs function is imported from a C++ library.
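
The three substitutions can be imitated with a small function. This is a sketch under my own assumptions: variables receive slot numbers in order of first appearance, and the entry points come from a map filled in by the caller. The name ResolveSymbols is hypothetical, and the real preprocessor performs more steps than these.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Rewrites intermediate code: numbers the "," markers, replaces "@name" with a
// slot number and replaces a quoted signature with its entry point.
// (Hypothetical helper; with the MinGW version mentioned earlier,
// patch::to_string would be needed instead of std::to_string.)
std::vector<std::string> ResolveSymbols(const std::vector<std::string> &code,
                                        const std::map<std::string, unsigned long> &entryPoints)
{
    std::vector<std::string> result;
    std::map<std::string, std::size_t> slots;  // variable name -> slot number
    int sourceLine = 0;

    for (const std::string &line : code) {
        if (line == ",") {                     // source-line marker
            result.push_back(", " + std::to_string(++sourceLine));
            continue;
        }
        const std::size_t space = line.find(' ');
        if (space == std::string::npos) {      // instruction without operand
            result.push_back(line);
            continue;
        }
        const std::string op = line.substr(0, space);
        std::string arg = line.substr(space + 1);
        if (!arg.empty() && arg[0] == '@') {
            // The first use of a variable reserves the next free slot.
            auto it = slots.insert(std::make_pair(arg.substr(1), slots.size())).first;
            arg = std::to_string(it->second);
        } else if (arg.size() >= 2 && arg[0] == '\'') {
            // Strip the quotes and look the signature up; unknown ones stay as-is.
            auto it = entryPoints.find(arg.substr(1, arg.size() - 2));
            if (it != entryPoints.end()) arg = std::to_string(it->second);
        }
        result.push_back(op + " " + arg);
    }
    return result;
}

// Small self-check using the intermediate code shown above.
inline void CheckResolveSymbols()
{
    const std::vector<std::string> code = {
        ",", "PUSHC 10", "INV", "POPSTORE @a",
        ",", "PUSH @a", "PUSHC 1", "CALLFAR 'abs@n'", "PUSHC 1", "PRINT",
        ",", "END"
    };
    std::map<std::string, unsigned long> entryPoints;
    entryPoints["abs@n"] = 134717472UL;        // value taken from the listing above

    const std::vector<std::string> out = ResolveSymbols(code, entryPoints);
    assert(out[0] == ", 1");
    assert(out[3] == "POPSTORE 0");
    assert(out[7] == "CALLFAR 134717472");
}
```

After this pass the stack machine no longer has to compare strings while running: variables and functions are plain numbers.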


Even though I find this subject very interesting, I have to stop here, otherwise this text will grow much longer than I can write here. A good way to fully understand the stack machine is to take a look at its source code in the file basic.cpp, in the implementation of the class CExec.


baslib.hpp

The baslib.hpp and baslib.cpp files hold, respectively, the headers and the implementations of the BASIC interpreter standard libraries.

void RegisterNumFuncs(TProgramFunctionsDictionary &);

Simple "math" library.

void RegisterStrFuncs(TProgramFunctionsDictionary &);

Simple "string" library.

void RegisterStdFuncs(TProgramFunctionsDictionary &);

Simple "standard" library.

void RegisterSysFuncs(TProgramFunctionsDictionary &);

Simple "system" library.

void RegisterVectorFuncs(TProgramFunctionsDictionary &);

This is the "vector" library. It's an experiment in handling C++ objects from the BASIC interpreter.

This library adds a series of "quick and dirty" functions to handle linear collections of values. It can be used as a starting point for a more elaborate library.


It is really simple to add new functions to the interpreter dictionary. There are just a few small details we have to take care of.

Any function that will be integrated must have the following format:

TAsmData <function name>(TAsmData *args) {
    ...
}

As a practical example, take a look at the source code of the "sgn" function from the standard math library:

inline TAsmData n_sgn(TAsmData *args)
{
    TAsmData dt;

    dt.n = 0;
    if (args[0].n > 0) dt.n = 1;
    else if(args[0].n < 0) dt.n = -1;
    return dt;
}

This function returns -1 if the value passed as parameter is lower than zero, 1 if it is greater than zero, and zero if it is zero. To add this function to the interpreter dictionary, we could use code like the following:

void RegisterSgnFunc(TProgramFunctionsDictionary &lib)
{
    TLinkBasicFunc fnData;

    fnData.farcall = true; //added C++ functions are always "far"
    fnData.entry = &n_sgn; //pointer to the C++ function
    lib["math.sgn@n"] = fnData; //dictionary entry
}

The TProgramFunctionsDictionary is the type for the BASIC interpreter functions dictionary.

The TLinkBasicFunc type holds individual data for each function in the dictionary.
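
To try the registration pattern outside the interpreter, the sketch below declares minimal stand-ins for the three types and registers a made-up function. The stand-in declarations (including the TFarFunc typedef), the function n_twice and the signature "demo.twice@n" are all my assumptions for the example; the real types are declared in basic.hpp and certainly carry more fields.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins for the real declarations in basic.hpp, reduced to what the
// example needs. They are assumptions, not copies of the real types.
struct TAsmData { double n; };
typedef TAsmData (*TFarFunc)(TAsmData *args);   // hypothetical typedef name
struct TLinkBasicFunc { bool farcall; TFarFunc entry; };
typedef std::map<std::string, TLinkBasicFunc> TProgramFunctionsDictionary;

// Hypothetical library function: returns twice its numeric argument.
inline TAsmData n_twice(TAsmData *args)
{
    TAsmData dt;
    dt.n = args[0].n * 2;
    return dt;
}

// Same three steps as RegisterSgnFunc: mark as far, set the pointer, store.
void RegisterTwiceFunc(TProgramFunctionsDictionary &lib)
{
    TLinkBasicFunc fnData;

    fnData.farcall = true;        // added C++ functions are always "far"
    fnData.entry = &n_twice;      // pointer to the C++ function
    lib["demo.twice@n"] = fnData; // dictionary entry, keyed by signature
}

// Looks the function up by signature and calls it through the stored pointer.
inline double CallTwice(TProgramFunctionsDictionary &lib, double x)
{
    TAsmData arg;
    arg.n = x;
    return lib["demo.twice@n"].entry(&arg).n;
}

// Small self-check of registration followed by a call.
inline void CheckRegistration()
{
    TProgramFunctionsDictionary lib;
    RegisterTwiceFunc(lib);
    assert(lib["demo.twice@n"].farcall);
    assert(CallTwice(lib, 21) == 42);
}
```

Every imported function has the same C++ signature, so the dictionary can store one plain function pointer type for all of them.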

That's almost everything we need to know. We are about to add to the BASIC interpreter a function called math.sgn(), which returns a numerical value and has only one numeric argument.

The key used to store and retrieve function data from the dictionary is a string. This string is called the function signature and is formed by the function name, followed by the character @ and the types of the function arguments. For example:

fnc@nnn

defines a numerical function named "fnc" with three numerical arguments.

fnc$@$n

defines a string function named "fnc$" with two arguments: the first is of type string, the second is numerical.

fnc#@#$n

defines a pointer function named "fnc#" with three arguments of types pointer, string and numerical, respectively.

fnc@

defines a numerical function named "fnc" with no arguments.
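
The four examples above follow a pattern regular enough to be decoded mechanically. The sketch below splits a signature into its parts, assuming (as the examples suggest) that a name ending in $ or # gives the return type and that n, $ and # are the only type characters. SignatureInfo and ParseSignature are hypothetical names and are not part of the interpreter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decoded form of a function signature such as "fnc$@$n".
struct SignatureInfo {
    std::string name;      // function name, including a trailing $ or #
    char returnType;       // 'n' numerical, '$' string, '#' pointer
    std::string argTypes;  // one type character per argument
};

// Splits `signature` at the '@' and validates the argument type characters.
// Returns false if there is no '@', no name, or an unknown type character.
bool ParseSignature(const std::string &signature, SignatureInfo &info)
{
    const std::size_t at = signature.find('@');
    if (at == std::string::npos || at == 0) return false;

    info.name = signature.substr(0, at);
    info.argTypes = signature.substr(at + 1);

    // Assumption: the last character of the name encodes the return type.
    const char last = info.name[info.name.size() - 1];
    info.returnType = (last == '$' || last == '#') ? last : 'n';

    for (char type : info.argTypes)
        if (type != 'n' && type != '$' && type != '#') return false;
    return true;
}

// Small self-check using two of the signatures shown above.
inline void CheckParseSignature()
{
    SignatureInfo info;
    assert(ParseSignature("fnc#@#$n", info));
    assert(info.name == "fnc#" && info.returnType == '#' && info.argTypes == "#$n");

    assert(ParseSignature("fnc@", info));
    assert(info.returnType == 'n' && info.argTypes.empty());
}
```

The length of argTypes is the number of arguments, which is the count that the params vector passed to ExecuteUserFunction has to match.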

Be careful never to duplicate a function signature in the dictionary; otherwise, if the duplicated function is called at runtime, the interpreter will always find the first entry for it, which could certainly cause unexpected results.

To finish adding the new function to the interpreter engine, it is just a matter of calling the Compile method in the CBasic class, passing the dictionary as its second parameter (funcs).

We will see practical examples later in this material.


utils.hpp

The files utils.hpp and utils.cpp are, respectively, the header and implementation files for a series of C++ utility functions used by the interpreter classes.


Embedding the interpreter in a C++ project...