Up until now, we talked about syntax analysis in general, using Jack as an example. In this unit and the next one, we are going to actually roll up our sleeves and build a Jack analyzer. And to remind you, our overall objective is developing a compiler. We decided that we will split the development of the compiler into two separate projects. In the current project, we're developing a syntax analyzer. In the next project and in the next module, we'll develop a code generator. And we faced this challenge of unit-testing the syntax analyzer in isolation from the rest of the system. And we decided that in order to do it, we are going to have the syntax analyzer generate an XML output file. And by looking at this file, we'll be able to tell whether or not the syntax analyzer is well behaved. And we'll have a more operational and specific definition of "well behaved" in the next unit, when we talk about project 10 in detail. So the syntax analyzer that we're going to develop is going to consist of three modules. At least, this is the implementation that we propose you use. So we're going to use a JackTokenizer module, a CompilationEngine, and an overall main program called JackAnalyzer, which is going to drive the show. And what we'll do in this unit is discuss each one of these modules and give its API, so that you'll be able to follow our implementation and develop your own analyzer. All right, I think it makes sense to start from the end and illustrate how we actually use the JackAnalyzer once we finish developing it. So if this program had been written, say, in Java, then from the shell I would write the name of the program, which is JackAnalyzer, and I would supply it with some input. And the input could be one of two things. 
It could either be the name of some .jack file, something like myprogram.jack, or it could be the name of a directory that contains zero or more such .jack files. In the first case, if the input is a single .jack file, the output would be an XML file that has the same file name with an .xml extension. And if the input were a directory name, the output would be one .xml file for every .jack file in the input directory. And we're going to store everything within the same directory that we were given. So this is the desired behavior of our JackAnalyzer. Now, here's an example of what this program will actually do. If we give it this particular input, it will end up producing this particular output. And how will it do it? Well, the JackAnalyzer is going to use the services of a JackTokenizer. And both the tokenizer and the compilation engine, which we haven't discussed yet, are going to be developed by you according to the Jack grammar and according to all the guidelines that we gave before in this module. In particular, this is the main document that guides our work: the Jack grammar. And if you recall, the first section of the Jack grammar describes the lexicon of the language, that is, the kinds of tokens that the language recognizes. And we said that we have five categories of tokens: keyword, symbol, and so on. So the JackTokenizer is going to be responsible for handling such tokens, and the rest of the grammar will be handled by the compilation engine. So what we'll do from now until the end of this unit is discuss these two modules. We'll first talk about how to develop a tokenizer, and then we'll talk about how to develop the compilation engine. Now, if we go back to this example that we had before, you have to realize that the tokenizer operates in the context of the compilation engine. So the compilation engine uses the services of the tokenizer. 
And therefore, for example, we can focus only on the tokens in the XML output, and these tokens will be handled by the tokenizer. So what we see here is the part of the output which is handled, once again, by the tokenizer. And let me shorten this example to make some space on the slide. And once again, the code that is going to create these tokens is going to be informed by the part of the grammar that deals with the lexical elements of the language, of which we have five, right? Keyword, symbol, integer constant, and so on. So the JackTokenizer, as we mentioned in earlier units, is some sort of a service that allows us to ignore the whitespace in the input, to advance through the input one token at a time, and to ask questions about the current token. So if you think about it, the tokenizer encapsulates the input. Once we have a tokenizer, we no longer have to worry about the actual input file. It is completely hidden from us. For us, the tokenizer is the input. We can advance the tokenizer, we can ask questions about the current token, and so on and so forth. So at the end of the day, if I fast-forward this discussion, the tokenizer will be some sort of software module that has a constructor and then a bunch of methods, whose API will be given in just a minute. And these methods have names like hasMoreTokens, advance (to the next token), tokenType, and so on. So writing this tokenizer is a matter of writing a program or a class. If you write in a language like Java, or C#, or C++, you write a class that has this bunch of methods, and the class uses the input file as its input. It goes through the characters and groups them into tokens, and so on and so forth. And I think that the most challenging method here is advance, because advance is the method that groups characters into tokens and also figures out what type of token we have. 
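To make the "tokenizer is the input" idea concrete, here is a minimal sketch of a test driver that walks the tokenizer and writes a token stream as XML, in the spirit of the output shown on the slide. The method names hasMoreTokens, advance, and tokenType follow the API proposed in this unit, but currentValue is a hypothetical stand-in for the five per-type value getters, and the tag mapping here is deliberately simplified:

```python
# A hypothetical test driver: prints the tokenizer's contents as an
# XML token stream. The four symbols below must be escaped in XML.
XML_ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}

def write_token_xml(tokenizer, out):
    out.write("<tokens>\n")
    while tokenizer.hasMoreTokens():
        tokenizer.advance()
        tag = tokenizer.tokenType().lower()    # e.g. "keyword", "symbol"
        value = str(tokenizer.currentValue())  # hypothetical getter name
        value = "".join(XML_ESCAPES.get(c, c) for c in value)
        out.write(f"<{tag}> {value} </{tag}>\n")
    out.write("</tokens>\n")
```

The point is that the driver never touches the input file; it only advances the tokenizer and asks questions about the current token.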
And this is not a terribly complex thing to do, but you do have to think about some contingencies. For example, the fact that you have a bunch of characters with no space in the middle does not necessarily mean that you have only one token. You can have several tokens there. For example, take a look at getx(). So we have g, e, t, x, (, ), and only then do we have a space in the input stream. And you know that what we have here is actually three tokens: getx is one token, left paren is another token, and right paren is a third token. The tokenizer has to be smart enough to distinguish between these three tokens. So one advance will give me getx, another advance will give me left paren, and the next advance will give me right paren, and so on. So how do I figure it out? Well, I use the grammar as my overarching guide. Let's assume that I start with the character g, and I know that I'm now starting to build a token. Then I have an e, and I append it to what I had previously. Then I have a t, which I happily append, and an x, which I append as well. And then I have a left paren. Well, a left paren is not a legitimate part of an identifier, right? In fact, I can check whether the left paren is a symbol, and it is. And therefore I know that I now have a new token. So I go back to the token that I constructed before, getx, and I look it up in my keywords list. It's not there, so it must be an identifier; now I can make this conclusion. So I have a token getx, and I know that it's an identifier. And then the next token is kind of easy: it's a left paren. I look it up in the symbols, and I know right away that it's a symbol, and so on and so forth. So as you see, it's not terribly complicated, but it requires some programming effort to come up with the actual implementation. 
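The character-grouping logic just described can be sketched like this. It is a simplified fragment, not a full implementation: it assumes comments have already been stripped, it omits string constants, and the function name next_token is our own choice rather than part of the proposed API. The symbol and keyword sets come straight from the lexical section of the Jack grammar:

```python
SYMBOLS = set("{}()[].,;+-*/&|<>=~")
KEYWORDS = {"class", "constructor", "function", "method", "field", "static",
            "var", "int", "char", "boolean", "void", "true", "false",
            "null", "this", "let", "do", "if", "else", "while", "return"}

def next_token(text, i):
    """Return (token, type, next_index) starting at position i.
    Simplified sketch: comments assumed stripped, strings omitted."""
    while i < len(text) and text[i].isspace():      # skip whitespace
        i += 1
    if i == len(text):
        return None, None, i
    c = text[i]
    if c in SYMBOLS:                                # a one-character symbol
        return c, "SYMBOL", i + 1
    if c.isdigit():                                 # an integer constant
        j = i
        while j < len(text) and text[j].isdigit():
            j += 1
        return text[i:j], "INT_CONST", j
    # Otherwise: a keyword or identifier. Append characters until a
    # symbol or whitespace ends the token -- e.g. the '(' after "getx".
    j = i
    while j < len(text) and not text[j].isspace() and text[j] not in SYMBOLS:
        j += 1
    word = text[i:j]
    return word, ("KEYWORD" if word in KEYWORDS else "IDENTIFIER"), j
```

Running this repeatedly on the string "getx()" yields exactly the three tokens from the example: the identifier getx, then the two paren symbols.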
And obviously you are free to use whatever your language is kind enough to give you in the way of string processing; regular expressions, whatever you have, you can use. All right, so here's the API of the JackTokenizer. We have a constructor that receives the name of an input file and opens it for processing. Then we have a hasMoreTokens method that has no arguments and returns a boolean, answering the question of whether or not we have more tokens. Then we have this advance method, which is the main beef of this module. It gets the next token from the input, and it should be called only if indeed we have more tokens to process. Then we have a tokenType method, which returns a constant, one of these five constants: KEYWORD, SYMBOL, IDENTIFIER, INT_CONST, or STRING_CONST. And this constant informs the user of the tokenizer what type of token we now have. And then we have five more detailed methods, if you will, that return the actual value of the current token. And you can read the documentation yourself, it's a little bit tedious, and realize that implementing these methods is rather easy. So you have to come up with a module that actually delivers this particular API. So this takes care of this part of the grammar, and now we are ready to go on and talk about the compilation engine. Now, the compilation engine is designed, just like the tokenizer, using the grammar as our lead guide. And to make a long story short, I'm going to develop a class or module called CompilationEngine. The compilation engine is going to get its input from the JackTokenizer, so I no longer have to worry about the original input file. I can happily ask questions about the current token and advance within the input and so on. It will have a constructor, and we'll skip its definition for now. And then I will have a sequence of compilexxx routines, one for almost every non-terminal rule in the grammar. 
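Since the unit mentions that regular expressions are fair game, here is a sketch of how the whole lexical section of the grammar can be captured in one pattern. This is one possible shortcut, not the proposed API: the TOKEN_RE and tokenize names are our own, and comment stripping is again omitted:

```python
import re

# One alternation per token category; keywords match as WORDs and are
# classified afterwards by a dictionary lookup.
TOKEN_RE = re.compile(r"""
      (?P<INT_CONST>\d+)
    | (?P<STRING_CONST>"[^"\n]*")
    | (?P<WORD>[A-Za-z_][A-Za-z0-9_]*)
    | (?P<SYMBOL>[{}()\[\].,;+\-*/&|<>=~])
""", re.VERBOSE)

KEYWORDS = {"class", "constructor", "function", "method", "field", "static",
            "var", "int", "char", "boolean", "void", "true", "false",
            "null", "this", "let", "do", "if", "else", "while", "return"}

def tokenize(text):
    """Yield (type, value) pairs for a comment-free Jack source string."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        value = m.group()
        if kind == "WORD":
            kind = "KEYWORD" if value in KEYWORDS else "IDENTIFIER"
        elif kind == "STRING_CONST":
            value = value[1:-1]       # drop the enclosing double quotes
        yield kind, value
```

A generator like this can then sit behind the hasMoreTokens/advance interface: the class simply buffers the yielded pairs and steps through them.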
And each one of these routines is going to be responsible for handling all the tokens that make up the right-hand side of the corresponding rule. Now, we dedicated a whole unit, unit 4.5, to simulating the logic and the operations of the compilation engine. And so if you feel you need some more guidance on how to implement the compilation engine, I highly recommend that you take a look at unit 4.5. And so, here's the API. We're going to have a constructor. The constructor will create a new compilation engine, and then it is going to invoke the compileClass method. The compileClass method is going to compile a complete class. If you take a look at the grammar, you will see that a complete class consists of zero or more class variable declarations, either static variables or fields, which have to be compiled, and then zero or more subroutines that have to be compiled. So I guess we're going to have two loops that handle these compilations. And then we'll have methods that compile a subroutine declaration, a parameter list, and a subroutine body. Each one of these methods simply follows the logic of the corresponding rule from the grammar. And I feel that we gave enough hints on how to develop them in the previous unit of this module. Then we're going to have five subroutines that deal with the compilation of the five statement types in Jack. And finally, we're going to have three subroutines that deal with the compilation of expressions and the components of expressions, which are terms and expression lists and so on. That's it, that's the API of the CompilationEngine. Now, I'd like to make one more endnote about this CompilationEngine. The following rules in the Jack grammar are not covered, are not handled, by specific compile methods. And indeed, if you look at the Jack grammar, I believe that it consists of 21 non-terminal rules. If you look at the API, I believe that it consists of only 15 compile methods. 
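The two loops described for compileClass might look like this in outline. This is a sketch under several assumptions: the engine here consumes a pre-built list of (type, value) token pairs rather than a live tokenizer, the peek/eat helpers are our own placeholder names rather than part of the proposed API, and compileSubroutine is left unimplemented:

```python
class CompilationEngine:
    """Outline of compileClass; output is collected as flat XML-ish
    lines purely for illustration."""
    def __init__(self, tokens):
        self.tokens = list(tokens)    # (type, value) pairs
        self.i = 0
        self.out = []

    def peek(self):
        return self.tokens[self.i][1] if self.i < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.i]
        self.i += 1
        self.out.append("<{0}> {1} </{0}>".format(tok[0].lower(), tok[1]))
        return tok[1]

    def compileClass(self):
        self.out.append("<class>")
        self.eat()                                   # 'class'
        self.eat()                                   # className
        self.eat()                                   # '{'
        while self.peek() in ("static", "field"):    # zero or more varDecs
            self.compileClassVarDec()
        while self.peek() in ("constructor", "function", "method"):
            self.compileSubroutine()                 # zero or more subroutines
        self.eat()                                   # '}'
        self.out.append("</class>")

    def compileClassVarDec(self):
        while self.eat() != ";":      # e.g. 'static int x ;'
            pass

    def compileSubroutine(self):
        ...                           # follows its grammar rule similarly
```

Note how each loop's guard is exactly the set of tokens that can start the corresponding rule, which is the lookahead idea discussed in unit 4.5.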
So there are six missing rules in this accounting, and these are the rules which don't have corresponding compile methods. The logic of these rules is handled by the logic of the rules that invoke them. We don't bother to deal with these rules directly, because they are kind of secondary, or shallow, in the role that they play in the overall grammar. Let me give you an example. Take a look, for example, at statements. Statements is covered by a method called compileStatements, and this method uses a loop to handle zero or more statement instances. And it determines which rule to invoke according to the first token in each one of these statements. So if the first token is if or while, it invokes compileIf, compileWhile, and so on. So I'm saying all this because I want to point out that there is no compileStatement method. This rule is not covered by a method, because it's shallow and not terribly interesting, so we bypass it. And we use a similar rationale to bypass some other rules and take care of their logic in the subroutines that actually use them. So we are done with the proposed implementation of the JackTokenizer and the CompilationEngine. And now let's talk about the JackAnalyzer module, which is relatively simple, because it drives the show by using the services of the far more challenging tokenizer and compilation engine that we discussed before. All right, so the JackAnalyzer is the top-most module in our program. It receives either a single .jack file name or the name of a directory that contains several .jack files. And for each such file, it goes through the following routine. First of all, it creates a JackTokenizer object from the input .jack file. Then it creates an output file whose name is the same as the input file, with an .xml extension, and prepares it for writing. And finally, it creates and uses a compilation engine that compiles the input coming from the JackTokenizer into the output file. 
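The compileStatements dispatch just described can be sketched as a small loop. For this fragment we assume a simplified calling convention of our own: tokens is a flat list of token values, and handlers maps each statement keyword to a compileLet/compileIf/compileWhile-style routine that consumes its statement and returns the next index:

```python
def compileStatements(tokens, i, out, handlers):
    """Handle zero or more statements, dispatching on the first token
    of each one; there is deliberately no compileStatement method."""
    out.append("<statements>")
    while i < len(tokens) and tokens[i] in ("let", "if", "while", "do", "return"):
        i = handlers[tokens[i]](tokens, i, out)   # e.g. invoke compileIf
    out.append("</statements>")
    return i
```

The loop exits as soon as the current token is not a statement keyword (typically at the closing '}'), which is exactly how the shallow statement rule gets absorbed into compileStatements.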
And that's it, that's what the JackAnalyzer is designed to do. So with that, we are done with the description of our proposed implementation of the syntax analyzer. We called it the JackAnalyzer. And in the next unit, we're going to actually get our hands dirty and build this analyzer using some programming language like Java or Python. So we'll see you in the next unit.
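The per-file routine of the JackAnalyzer described above can be sketched as follows. The Tokenizer and Engine classes are injected as parameters so the sketch stays self-contained; in a real analyzer they would be your JackTokenizer and CompilationEngine, and the engine's constructor signature shown here is an assumption of this sketch:

```python
import os

def analyze(path, Tokenizer, Engine):
    """Run the JackAnalyzer routine on a .jack file or a directory:
    for each input file, build a tokenizer, derive the .xml output
    name in the same directory, and run the compilation engine."""
    if os.path.isdir(path):
        files = [os.path.join(path, f) for f in sorted(os.listdir(path))
                 if f.endswith(".jack")]
    else:
        files = [path]
    for jack in files:
        xml = jack[:-len(".jack")] + ".xml"   # same name, .xml extension
        tokenizer = Tokenizer(jack)           # 1. encapsulate the input
        engine = Engine(tokenizer, xml)       # 2. prepare the output
        engine.compileClass()                 # 3. compile into the output
    return len(files)

# Real usage would be something like:
#   analyze(sys.argv[1], JackTokenizer, CompilationEngine)
```

Dependency injection here is only a convenience for unit testing the driver; a straightforward hard-coded main program is just as valid.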