Last week, we built a computer that can run machine language code. Two weeks ago, we actually programmed in machine language, though we didn't really program in binary code, but rather in assembly language, in a nicer symbolic syntax. So there is a gap in the middle, and the gap is supposed to be filled by a simple software program called an assembler. You've seen this slide twice already. Two weeks ago, you focused on the left-hand side, the assembly language format in which you were supposed to program. Last week, your perspective was the right-hand side, the machine language, since you had to construct a computer that can actually execute the 1s and 0s that you see on the right. Our perspective this week is going to be on the middle: how do you go from left to right? How do you do the assembly operation? That is, what is the assembler, the program that translates what you have on the left, the assembly language code, into what you have on the right, the machine language code? Now notice that, for the first time in this course, this is software. This is going to be a program that takes as input a file written in the left-hand side format, i.e., in assembly language, and produces another file written in the right-hand side format, i.e., as zeros and ones, which can be executed directly on a computer. Until now, it was always hardware. And this is really the first software layer that we have in every computer. There is much more software above it, but this is the first level, and we're doing it together with the hardware to give us a complete picture.
So we can have a computer for which we understand how you program it. And understanding how you program it does not mean programming it in zeros and ones, but rather in assembly language, which is still a very low-level format, essentially equivalent to machine language. But still, it's something that humans can relate to. So let's see what kind of software this assembler is going to be. In principle, we've only created a computer that can run machine language code. So, in principle, we should write our program in the only language that can be directly executed, i.e., in machine language, in zeros and ones, which is going to be pretty annoying. But the best way to think about it is that we're not constructing the first computer in the world, but rather the second one. So let us assume we already have a computer that can run some high-level language. In particular, I'm talking about your computer and the high-level languages that you already know. We are going to write the assembler in that language, and it will be a software program that runs on your computer. What it is going to produce is machine language for the Hack computer, a different computer, the one on which you will actually run the produced program. This is sometimes called cross-compiling: the program runs on one computer and produces code intended for another computer. This way, we don't have this loop, this bootstrap problem of having to write our assembler in machine code. Rather, we can write it in a high-level language that is already implemented on another computer. So the assembler is really a very simple program. It does the following basic loop and repeats it again and again. It reads one assembly language command from the input.
It breaks the command into its parts, and each one of the parts can be translated to binary in a unique way that's specified by our language. Then we take the binary codes of the different parts, put them together, and we get the machine code that is directly equivalent to the assembly language command that we've just read. We output that, and we move on to the next assembly language command. We keep on doing that, translating one command after another without having to remember anything that happened earlier. Very simple. So let's look at each one of these stages slightly more carefully. How do we read the next assembly language command? Well, it seems that we just read the next line from the input. This is almost exactly what we do. The only other difficulty is that we may need to skip white space and comments. We need to make sure that what we read is really the next command and not a comment, spaces, or blank lines around it. So when we read from the file, let's say it has a command Load R1, 18. This is just a fictional assembly language. In the next unit we'll start talking about the Hack language specifically, but in this unit I'm still focusing on a generic assembler program. So let's assume that our hypothetical assembly language has this type of command, Load R1, 18, which we expect would probably mean taking the value 18 and putting it into the first register. But this is not something that we need to know when we're actually doing the translation. We just want to read that command and put it into some kind of string variable, some kind of array of characters that we can later work on. And that is basically what's involved in reading line by line from the input.
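The reading step described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lecture's language is fictional, so the `//` comment marker and the name `read_commands` are hypothetical choices, not part of any specification.

```python
from typing import Iterable, Iterator

# Assumption: comments start with '//' and run to the end of the line.
COMMENT_MARKER = "//"

def read_commands(lines: Iterable[str]) -> Iterator[str]:
    """Yield each real command as a string, skipping comments and blank lines."""
    for line in lines:
        # Keep only the text before the comment marker, then trim spaces.
        code = line.split(COMMENT_MARKER, 1)[0].strip()
        if code:  # an empty result means a blank or comment-only line
            yield code

source = [
    "// load a constant",
    "",
    "Load R1, 18   // put 18 into R1",
    "   ",
]
commands = list(read_commands(source))
print(commands)  # ['Load R1, 18']
```

Each command comes out as a plain string, ready to be broken into its parts by the next stage.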
So the next step is taking this string of characters and breaking it into its different parts. When we look at it, we see that the different parts of this command are Load, the first part, R1, the second part, and 18, the third part. Each one of them has meaning. There is also a space and a comma, but these are not really the interesting parts of the command. They are just syntax that helps us break up and understand the important parts of the command. So the next thing we have to do is understand the syntax and break the original string into these three different substrings, the three interesting parts of this command. That involves some simple string manipulation until we get the different parts. Once we have the different parts, we need to translate each one of them to its binary counterpart in machine language. Now, how do we do that? That has to be part of the specification of the machine language and the assembly language, which tells us what the code is for each one of these commands. So for example, for the Load command, we would probably have some kind of table that tells us the machine language code for each command. We can look up in that table and see what the machine language code for the Load command is. Other parts of the input may be, for example, numbers, which we can translate into machine language directly just because they are numbers. For example, the number 18 may be translated into just the binary representation of the number 18. Again, this is part of the language specification.
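Here is a small sketch of these two steps, breaking the string into parts and translating each part. The tables `OPCODES` and `REGISTERS`, their bit patterns, and the 10-bit width for numbers are all invented for illustration, since the fictional language has no real specification.

```python
import re
from typing import List

# Hypothetical tables: a real language specification would define these codes.
OPCODES = {"Load": "0001", "Add": "0010", "JMP": "0011"}
REGISTERS = {"R1": "01", "R2": "10"}
NUMBER_WIDTH = 10  # assumed number of bits reserved for a constant

def parse(command: str) -> List[str]:
    """Split on spaces and commas, keeping only the meaningful parts."""
    return [part for part in re.split(r"[\s,]+", command.strip()) if part]

def translate_part(part: str) -> str:
    """Translate one part to bits by table lookup or binary conversion."""
    if part in OPCODES:
        return OPCODES[part]
    if part in REGISTERS:
        return REGISTERS[part]
    if part.isdigit():
        # A number becomes its zero-padded binary representation.
        return format(int(part), "0{}b".format(NUMBER_WIDTH))
    raise ValueError("unknown part: " + part)

parts = parse("Load R1, 18")
bits = [translate_part(p) for p in parts]
print(parts)  # ['Load', 'R1', '18']
print(bits)   # ['0001', '01', '0000010010']
```

Notice that the translation never needs to know what Load means. It only needs the tables.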
This translation step is where we need to understand exactly the mapping between assembly language and machine language, which is usually very simple and specified completely by a bunch of tables and a bunch of rules about where you put integer numbers in binary form. And that's about it. In the third step, we now have the translation of each part of the input, and we need to put them together, usually by some kind of concatenation. We may also need to add some other bits that are defined in the specification to pad the result and complete the instruction, because sometimes the translation of a command does not fill all the bits that are available in a machine instruction, and the remaining bits are specified to be, let's say, a constant 0 or 1. So now we have the binary number that we need to output, and we just need to print it out into some kind of file. How exactly we print it out is determined by the specification of the file format for machine language, which may be a binary format, or may be just the characters 0 and 1, although the latter is not something you would usually have. Either way, you need to translate the 1s and 0s that you now have in memory into a format that the computer can actually execute. So far, we've described the basic operation of the assembler. Now there's one extra complication that we need to worry about, and that is handling symbols. As you may recall, one of the major services that an assembly language gives a programmer is the ability to use symbols rather than direct numbers. We usually use them for two different types of things. One of them is labels in the program, for jumping to a certain part of the program. You give it a name rather than hard-coding the address.
And the other is variables: you want to give a variable a name rather than always referring to its exact address in memory. So for example, we can write, in most assembly languages, JMP loop, and loop is going to be translated automatically into some kind of address. Similarly, we can load into a register a variable called weight. Again, weight is going to be some location in memory. The assembler needs to figure out which location, and it will translate this access directly into an access to that location in memory. So what our assembler will have to do is replace each one of these symbols. For example, it will replace weight in the previous command with its equivalent address, and it will have to remember where exactly that address is in memory. Similarly, it will replace the label loop in a jump with the exact address inside the program and, again, remember where that is. How will that be done? Well, it will need to maintain some kind of table that holds the translation between symbols and actual addresses. And whenever it needs to do this translation, it will have to look up in the table. When it sees weight in the assembly language program, it will look up in the table and see what the correct address is to put in its place. Once it has replaced the symbol with an address, it can continue as we previously described. So let us now see how we maintain such a table, how we enter information into it, and how we look up information from it. Let's say our assembler is reading command by command, and it encounters this command with the variable weight in it. Maybe the variable is already in the table, so we need to look it up. If it's already in the table, then we know exactly how to translate it. This is the easy part. But what happens the first time we see this variable, weight?
Well, we look in the table, and we see that it's not there. So we know that we need to allocate a new memory location to hold this variable. This is one of the things that our assembler will have to do. It will find the next memory location that's available, and the exact definition of where it allocates memory addresses should be part of the assembly language specification. It will allocate this new memory location to the symbol. And now it has a pair, symbol and memory address, that it can put into the table and use from now on, including right now. This is how we put variable locations into the table and how we use them. The other kind of symbol that we have is labels. Let's say, for example, we have a piece of code that has in it a label, loop. What does this mean? When our assembler reads this line, the label, it knows that this is just a label and not a real command to execute. But it has to remember where exactly the label is, for the next time somebody wants to jump to this symbol, loop. Where is it? Well, it has to remember at what address in memory the current command is going to be put. That is going to be the address that's referred to by loop. So for example, suppose this piece of program is being written into locations 671, 672, 673, and so on, of the program memory. When we see this label, our assembler will have to note that the next command goes to location 673, so it needs to remember that loop from now on always refers to location 673. This is the kind of thing that it needs to do. And then, when we actually get to a place that uses this label, this symbol, we already have in our table the correct address.
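A minimal sketch of such a table, covering both kinds of symbols, might look like the following. The class name `SymbolTable`, its method names, and the starting address 1024 for variables are assumptions made for this example. In a real assembler, the allocation rule comes from the language specification.

```python
class SymbolTable:
    """Maps symbols to addresses. Names and the start address are hypothetical."""

    def __init__(self, first_variable_address=1024):
        self._addresses = {}
        # Assumption: variables are allocated consecutively from this address.
        self._next_free = first_variable_address

    def define_label(self, name, address):
        """Record that a label refers to the address of the next command."""
        self._addresses[name] = address

    def resolve(self, name):
        """Return the address of a symbol, allocating memory on first use."""
        if name not in self._addresses:
            # First time we see this variable: give it the next free location.
            self._addresses[name] = self._next_free
            self._next_free += 1
        return self._addresses[name]

table = SymbolTable()
table.define_label("loop", 673)   # the label sits before the command at 673
print(table.resolve("loop"))      # 673
print(table.resolve("weight"))    # 1024, allocated on first use
print(table.resolve("weight"))    # 1024 again, found in the table
print(table.resolve("height"))    # 1025, the next free location
```

Note that this sketch treats any unknown symbol as a new variable, which is only safe when every label has already been entered into the table.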
So that's how information about labels is put into the table and read from it when needed. Now, there's one extra complication that we didn't talk about. Sometimes we can jump to a label before the label is actually defined. This is called a forward reference. For example, it's very common in programs that I will need to jump to a label, let's say one called cont, before the label actually occurs in the program. In that case, I reach the jump instruction that uses the label before I've seen the place that defines the label. How can I handle that, given that the label is not in the table yet? Well, there are two ways to handle it. One way, which is usually a little more complicated, is to remember that I've seen the label and don't know where it is yet, and keep it in a side table. When I actually get to the definition and learn the correct address, I go back and fix it. Another option, which often turns out to be easier, is to do everything in two passes. In the first pass, I read everything, paying attention only to the labels and remembering which address each label refers to. That's when I build the table for the labels. Only on the second pass do I go and convert each and every label into its correct address, which is already in the table because I put it there in the first pass. That's usually slightly easier to do, but you may use either one of these two possibilities. So now we've finished talking about the general process of creating an assembler, the general things that an assembler must do. What we're going to do in the next few units is start talking specifically about the Hack assembler and about the different parts of how you actually construct it.
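To make the two-pass option concrete, here is a short sketch. It assumes a made-up label syntax, `name:` on its own line, and assumes that comments and blank lines have already been removed. To keep it short, the second pass only replaces label operands with addresses instead of producing binary.

```python
from typing import Dict, List

def first_pass(lines: List[str], base_address: int = 0) -> Dict[str, int]:
    """Record, for each label, the address of the command that follows it."""
    labels = {}
    address = base_address
    for line in lines:
        if line.endswith(":"):           # assumed label syntax: 'name:'
            labels[line[:-1]] = address  # a label occupies no memory itself
        else:
            address += 1                 # each real command takes one address
    return labels

def second_pass(lines: List[str], labels: Dict[str, int]) -> List[str]:
    """Replace every label operand with the address found in the first pass."""
    resolved = []
    for line in lines:
        if line.endswith(":"):
            continue                     # labels produce no output
        op, _, operand = line.partition(" ")
        operand = operand.strip()
        if operand in labels:
            operand = str(labels[operand])
        resolved.append((op + " " + operand).strip())
    return resolved

program = [
    "JMP cont",      # forward reference: cont is defined further down
    "Load R1, 18",
    "cont:",
    "Add R1, 1",
    "JMP cont",
]
labels = first_pass(program)
print(labels)                        # {'cont': 2}
print(second_pass(program, labels))  # ['JMP 2', 'Load R1, 18', 'Add R1, 1', 'JMP 2']
```

The forward reference in the first line resolves without any special handling, because the table is complete before the second pass starts.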