Hi everyone, and welcome to Python for Genomic Data Science. In this first lecture, I will give you an overview and a brief history of the Python programming language. First, let me tell you who this course is for. It has been designed for people who are not computer scientists but who need to learn Python from scratch, so no prior knowledge is expected from you. The course is also oriented towards programming tasks for molecular biology, and it will give you enough understanding of Python programming that you can take it from there to a higher level of expertise. Why would you want to program? Let's say you are a biologist. Do you need to learn how to program to do bioinformatics? On one level, the answer to this question is no, you don't. You can accomplish quite a lot by using existing tools. On another level, the answer changes. What if you need a tool that does something no existing tool does, and you cannot find somebody who can write it for you? Or let's say that you just want to write a small program to organize and summarize the data you already have. In these cases, learning to program can be incredibly useful. So what's the best way to learn how to program? I would say that taking a class like this one, which teaches you the basics, is a good way to start. It also helps to read a programming book or to try to understand a program written by somebody else. If you are stuck, just ask an expert for help. Speaking of experts, Stephen, what would you say are the best programming strategies?
>> Thanks, Ella. Let me give you some more advice on how to learn to program. I should say, though, that Ella is probably more of a programming expert than I am, but both of us know quite a bit about programming, having done it for more than 20 years each.
So there are many strategies for programming, but one of the first things you need to think about is the data you're going to be manipulating. Keep in mind that when you're programming, you're telling the computer exactly what to do with some kind of data. You have to provide the data to the program in a specific format, and you have to give the program all of your data and tell it what kind of data it is, whether it's numbers or strings or some other type of object. Before you start writing a program, it's good to have an overall design, which you can think of as a recipe. If you were going to bake a cake, you would first collect all the ingredients; that would be your data. Then you would think about the different steps: mixing the ingredients together, cooking them, and perhaps anything you have to do afterwards, like frosting the cake. So think about all the steps you are going to carry out once you've collected your data, and then decide what the output of your program is going to be. These days, for big data sets, we would usually have the output go to a file. If you have a small program and you just want a little bit of output, say a single number, the output might just show up on the screen. Once you've got everything written down, take a step back, look at the overall design, and fill in any details that you might have left out the first time. Then you're ready to write your program. All of this might only take you a few minutes, especially if your program is small. If your program is a little more complicated, one way to organize your thoughts is to write something called pseudocode. Pseudocode is essentially a computer-like language that isn't really a programming language, but that specifies the steps a little more precisely in some sort of written form.
So here's an example of a very simple pseudocode program that will compute the GC percentage of a DNA sequence. Remember that DNA consists of four letters: A, C, G and T. Genomes have a typical GC composition that we sometimes use to characterize them, namely what fraction of the genome is Gs and Cs. For example, the human genome is about 41% Gs and Cs, but other species, like Plasmodium falciparum, the malaria parasite, are less than 20% Gs and Cs. So that's a typical statistic you might want to look at in a genome. If you're going to compute the GC content of a genome, the first thing you've got to get is your data. That's going to be the DNA sequence, so you want to read that in somehow. Then you count the number of Cs in the sequence, and then the number of Gs. Then, of course, you need to know how long the sequence is. Then you simply add up the Gs and Cs, divide that by the length, and multiply by 100, and that gives you the percentage. Finally, you want to display that in some way, and in this case, because we are just looking at one number, we wouldn't save it in a file; we would just print it out. We're going to write this program in Python a little bit later. So, what is Python itself? Python is a programming language. Here's one of the official logos of the language. Now, it's a strange name for a programming language. You might be reminded of Monty Python's Flying Circus. Python is not a TV show or a humorous movie; however, the name Python was inspired by Monty Python's Flying Circus. The language itself has nothing to do with Monty Python, even though that's where the name came from. What Python really is, is an easy-to-learn and powerful programming language. It's very popular because it's so easy to learn. You can write very small programs that can do a lot with your data, and we're going to walk you through some examples of those over the course of the next few lectures.
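As a preview, the GC-percentage steps described above (read the sequence, count the Cs, count the Gs, get the length, divide, print) could be sketched in Python roughly as follows. This is only an illustrative sketch and not necessarily the version written later in the course: the function name `gc_percentage` and the example sequence are hypothetical, and the sequence is assumed to be already available as a string rather than read from a file.

```python
def gc_percentage(dna):
    """Return the percentage of G and C letters in a DNA string.

    Hypothetical helper for illustration; it mirrors the pseudocode steps.
    """
    dna = dna.upper()            # treat lowercase and uppercase letters alike
    c_count = dna.count("C")     # count the number of Cs
    g_count = dna.count("G")     # count the number of Gs
    length = len(dna)            # how long the sequence is
    if length == 0:
        raise ValueError("empty sequence: GC percentage is undefined")
    # add up the Gs and Cs, divide by the length, and scale to a percentage
    return 100.0 * (g_count + c_count) / length


# Made-up example sequence, assumed here only to show usage.
sequence = "ATGCGCTA"
print(gc_percentage(sequence))
```

Because the result is a single number, the sketch simply prints it to the screen instead of saving it to a file, as suggested in the lecture.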
It has efficient data structures and a simple but effective approach to what's called object-oriented programming, so Python is object-oriented, although for a lot of your programs you can ignore that feature. We are going to tell you about some of the data structures it has. They're built into the language, so you don't really have to worry about what goes on behind the scenes; they're stored efficiently, and you can compute with them efficiently. A very important feature of Python in terms of its usability is that it's an interpreted language. In computer science, what we mean by an interpreted language is that you don't have to compile it. You can write one line of code and the computer will instantly interpret it for you and execute it. That's in contrast to other programming languages, like C or C++, which are compiled languages. For those languages, there's an extra step, called compiling, that you have to go through each time you write a program. You take your code and run it through a program called a compiler, which creates a different object, a binary, that the computer can execute directly. And that's what gets run; you don't run your code line by line through an interpreter. So Python is much simpler. You can type one line of code and run it. You can put your code in a file and run it from there. Or, as you'll see, you can just start Python up and type commands one at a time at a command-line interface. One drawback of interpreted languages is that they're generally slower, often much slower. So if you want to get into really serious, hardcore programming and write big programs that deal with big data sets, you might eventually have to learn a language like C or C++, because in general those programs will run much faster, and if your data sets get really big, then speed will become an issue.
But the examples we'll cover in this course are quite small, and for them Python is just fine, if not optimal.
>> Let me tell you a little bit about the history of Python now. Python was conceived and developed in the late 1980s and early 1990s by Guido van Rossum at the National Research Institute for Mathematics and Computer Science in the Netherlands. The first version, Python 1.0, was released in January 1994. Python is mainly inspired by the ABC language, a general-purpose language also developed at the same research institute in the Netherlands. But Python also borrows concepts from other languages, such as Perl, Java, C, and C++. Python 2.0 was released in October 2000. It brought in many new features, but the most important change was to its development process, which shifted towards a more transparent and community-backed process. Python 3 was released in December 2008 and is not fully compatible with Python 2. Many of its major features were backported to the later versions of Python 2, starting with version 2.6. Today, both Python 2 and Python 3 are used for active development. So which version should you use? Well, it depends on what you want to get done. On the one hand, the latest improvements to the standard libraries are released by default in Python 3. On the other hand, there is slightly less library support in Python 3, and many Linux distributions and Macs come with Python 2 pre-installed. So I suggest you use whichever version you have installed on your computer. There are just very minor differences between the two versions, and I will make sure to point them out during this class. So what are the features that make Python such an easy-to-learn language, as Stephen said? First of all, it is simple to use. Python has a clearly defined syntax, which makes it easy to read and also easy to maintain.
Programs written in Python are typically much shorter than equivalent programs written in C, C++, or Java. Python is also interactive: you can write and test your programs directly from a terminal window. It has a large standard library, so Python offers many built-in functions that are already written for you. It is portable: you can use Python across platforms, on Unix, on Macs, or on Windows. It is also extensible. This means that you can write wrapper functions, for instance, to provide an interface to a program written in a more complex language, such as C or C++. It is also scalable, which means you can write small scripts or you can write larger applications. Here are some useful resources for learning Python. The official Python site has the latest news related to Python. It also has a comprehensive list of documents and tutorials for the latest versions of Python. I also show you here a few tutorials oriented towards bioinformatics. The first one is a programming course for biologists at the Pasteur Institute. The other one is a tutorial written by Patrick O'Brien. If you want to learn Python in an interactive way, you might want to check LearnPython.org. There is also a free online Python book written by Allen Downey, which teaches not only Python but also how to think like a computer scientist. If you want to get a more in-depth knowledge of Python, you might want to read the standard Python books, such as Learning Python or Python Programming. In fact, the Learning Python book has a foreword by Python's author himself, Guido van Rossum, who talks about the early history and background of Python. But if you are new to programming, you might want to start with Michael Dawson's book, Python Programming for the Absolute Beginner. Python Cookbook is a great resource because it has lots of programs written and tested in Python, but it's really geared towards the more advanced Python programmer.
So if you don't have Python already installed on your computer, you will need to get it first. Go to the official Python website and click the download link. Then you can install the latest Python version, either Python 2 or Python 3. I would suggest you get Python 3. Now you are ready to run Python. Let's start the Python interpreter by typing python in a terminal window. As soon as you do this, the Python interpreter will greet you with a welcome message stating its version and a copyright notice, followed by a prompt that usually consists of three greater-than signs. Now we are ready to write our first program in Python, which I'll teach you how to do in the next lecture.
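If you want to check which version you ended up with without reading the welcome message, the same information is available from inside Python through the standard `sys` module. The following is a minimal sketch written for Python 3; the helper name `describe_interpreter` is made up for illustration, and the fallback prompt string is an assumption, since `sys.ps1` is only defined when the interpreter runs interactively.

```python
import sys


def describe_interpreter():
    """Return a short version string, e.g. 'Python 3.x.y' (hypothetical helper)."""
    major, minor, micro = sys.version_info[:3]
    return "Python {}.{}.{}".format(major, minor, micro)


# Report the version of the interpreter that is running this code.
print(describe_interpreter())

# sys.ps1 holds the primary prompt in an interactive session;
# outside one it is not set, so fall back to the usual three greater-than signs.
print("Primary prompt:", getattr(sys, "ps1", ">>> "))
```

You can save these lines in a file and run it, or type them one at a time at the prompt.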