Hello and welcome to another walkthrough of some sample code in our Postgres class. Now we are playing with loading email data. This is probably the most complex bit of code that I'm going to use in this class. It's a bunch of Python code that does a lot of cleaning, and it's dealing with some real data. This data is actually from 2005, captured from a mailing list for my open source Sakai project. Now, mail is a weird format, a format called mbox, and the interesting thing is that it's a simple format: "From", space, is the delimiter. Literally, "From" in the first column followed by a space is the delimiter between messages. Then there's a series of headers, like From: and Message-ID:, all with colons, possibly some continuation lines, then a blank line, and then the body of the message. And I can retrieve these sort of in groups or one at a time. I'm going to write a bit of code to pull a whole series of these things into a database, and then play with some full-text stuff. So let's take a look at the code. Here we are at pg4e.com/lectures/06-Python.sql, and we are on this email corpus one. Again, it's going to pull the sakai.devel list from inbox.dr-chuck.net, and we're going to grab gmane.py, datecompat.py, and hidden.py from pg4e.com/code. I hope by now you've got hidden.py. The program is pretty self-contained. Make sure that when you start it you don't have a table named messages; drop it if you're just getting started. After that you just say python gmane.py, and I'll go ahead and run it. It just asks "How many?" This is a spider, a web-crawler-style thing, so it's asking me how many messages, and I'll just say, go get me 100 messages. And it just starts retrieving them, and it's pretty quick, you'll notice. Now if I go in here and say SELECT COUNT(*) FROM messages, there are 100. Let me get another 100. Oop, not 1000, 100.
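To make the mbox layout concrete, here is a minimal sketch of the structure just described: a "From " delimiter line, colon-separated headers, a blank line, then the body. The sample message text is invented for illustration; this is not the class code.

```python
# A tiny invented mbox message, illustrating the format described above.
raw = (
    "From news@gmane.org Fri Mar 11 12:34:56 2005\n"
    "From: Chuck <chuck@example.com>\n"
    "Message-ID: <abc123@example.com>\n"
    "Subject: Test message\n"
    "\n"
    "This is the body of the message.\n"
)

# Each message must start with the literal delimiter "From " (no colon).
assert raw.startswith("From ")

# Headers run from the top to the first blank line; the body follows it.
header, _, body = raw.partition("\n\n")
print(header.split("\n")[0])  # the "From " delimiter line
print(body.strip())           # This is the body of the message.
```

The split on the first blank line is exactly the header/body split the loader performs later.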
So you'll see that this count doesn't go up, and that's because, again, I'm not committing as I push rows. Still 100, still 100. Now it just went to 150: it's committing them 50 at a time, just like I did in similar examples. So this is running pretty well. You are supposed to pull in about 300 messages; I'll do about 200. This is a very fast server, it's highly cached, so you can beat the heck out of it. A lot of APIs have rate limits, but this particular one, inbox.dr-chuck.net, has no rate limit because it uses a really cool technology called CloudFlare, which is free, effectively infinitely scalable caching, so it's super fast. I use this in my Python for Everybody class as well. Okay, so that's running, and let's see, we've got 250. It'll run for a while, so let's take a look at the source code. The main code is in gmane.py, and there is a bit of complexity in here. The biggest complexity is that I am spending some time making this code clean the data up a little so that my work inside the database is less difficult. The problem with all this email stuff is that different servers are involved, and they do things slightly differently. Take this From address, which you'll see in a second: sometimes the From address has less-than and greater-than brackets around it, sometimes it doesn't. The format of the date is weird, with the time zone and all that stuff, and depending on whether a message came through a Linux box or some other box or went through Google, the headers are different. So one of the things I'm trying to do is clean this data up so that when I put it in my database, I've got it kind of nice. The other thing I want to do, because I'm going to do full-text stuff, is pull the subject out and split the headers and the body into different fields in my database. The headers are the part from the From line up to the first blank line.
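The commit-every-50 pattern explains why the count jumps in steps: inserted rows only become visible to other sessions when commit() runs. Here is a minimal, self-contained sketch of that pattern. I'm using sqlite3 so it runs anywhere; the real class code does the same thing with psycopg2 against PostgreSQL, and the table name follows the lecture but the data is invented.

```python
import sqlite3

# Sketch of committing in batches of 50, as the lecture's loader does.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")

for i in range(1, 201):
    cur.execute("INSERT INTO messages (id, body) VALUES (?, ?)", (i, "msg"))
    if i % 50 == 0:
        conn.commit()   # another session's COUNT(*) would now jump to i
        print("committed at", i)

conn.commit()           # commit any stragglers when the loop finishes
```

Committing in batches trades a little durability (you can lose up to 49 uncommitted rows on a crash) for much better insert throughput than committing every row.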
And if I'm doing full-text searches, I really don't want to be searching the headers, although the subject line is particularly interesting, and we might want to do searches on the subject line. So I'm pulling out the sent date, the subject, the headers, and the body. And I'm using a Python program because it's just too hard to write SQL to do all this stuff. Some people might tell you to write a stored procedure, but frankly, you might as well just do this in Python, in particular because we can connect to the database using the hidden.py trick, etc. There are these lines about "Ignore SSL certificate errors"; just include them. It's a long story that has to do with HTTPS and expired certificates, and you don't care: you're talking to an API and nothing here is sensitive, so we just do that. If we take a look at what's going on here, this is a little different from classic crawling in that I know what the order of these things is. The message number becomes the primary key in the database, so going from message three to message four is just retrieving one more message. Because I know the order, I just retrieve them in sequence and hope they don't blow up on me. This source is more reliable and doesn't have rate limits, so it's a little easier to handle. So what I'm going to do is decide where to pick up where we left off; let me show you how this code works. It's either going to start at message 0, which actually becomes message 1 because I say start = start + 1. Now I'm going to hit Enter here in the running gmane.py, and it goes back to the shell. And I'm going to start it back up again. This is a restartable process because the data is all in: there are 400 rows sitting in the database. The data is all there, so I run python gmane.py, and it knows that it's got to start at 401.
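The restart logic described above can be sketched like this: look up the highest id already stored, then begin one past it. This is a self-contained sketch using sqlite3 rather than the class's psycopg2, and it pre-loads 400 fake rows to mimic the state in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")

# Pretend a previous run already stored messages 1..400.
cur.executemany("INSERT INTO messages (id, body) VALUES (?, ?)",
                [(i, "old") for i in range(1, 401)])
conn.commit()

# Pick up where we left off: highest stored id (or 0 on a fresh table),
# then start = start + 1, just as the lecture describes.
row = cur.execute("SELECT max(id) FROM messages").fetchone()
start = (row[0] or 0) + 1
print("resuming at message", start)  # resuming at message 401
```

On an empty table max(id) comes back as None, which is why the `or 0` makes a fresh run begin at message 1.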
So if I just say, give me one message, it'll retrieve one message, number 401. So it doesn't restart from the beginning. Now, if I wanted to restart from scratch, I'd have to manually drop the table: just say DROP TABLE messages and it would work, okay. So let's go back to taking a look at this code. It's a while loop: it asks how many messages, then it retrieves those messages, then asks how many more you want to do. But it's still working its way down, you know, message 401, 402, 403. As it starts, it skips any messages that are already in there. So if there is a message in there, we do a SELECT, and at this point I want to show you this thing called myutils. Get out of there and take a look at myutils.py. I just made some library code, and the reason is that there are a few things I do over and over again. I want to run a query and fetch just the first row, like a LIMIT 1 query or something. And sometimes I'll do a COUNT(*) and I actually want the value itself, not a single row with a single item: just the zeroth element of that row tuple. These are common things I do over and over, so I put them in this file myutils, and if I go up to the top of the main file, I import myutils, and then I can use it. So this is just myutils.queryValue, which is what I call when I know that I want a single value. It just gives me one row: SELECT id FROM messages WHERE id = 12 or whatever, and away we go. This saves me typing about eight or nine lines of code, and it returns None if nothing matched. If the result is not None, the message is already there, so we skip it and continue. So that's the skipping through rows, and then I create the URL by just concatenating, and I get one message at a time.
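myutils.py itself isn't shown in this lecture, but a queryValue-style helper could look like the sketch below. The signature is my guess at the idea, not necessarily the exact function in the real file, and I'm demonstrating it with sqlite3 (which uses `?` placeholders where psycopg2 uses `%s`).

```python
import sqlite3

def queryValue(cur, sql, fields=()):
    """Run a query and return the first column of the first row,
    or None if the query matched nothing."""
    row = cur.execute(sql, fields).fetchone()
    return None if row is None else row[0]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY)")
cur.execute("INSERT INTO messages (id) VALUES (12)")

print(queryValue(cur, "SELECT id FROM messages WHERE id = ?", (12,)))  # 12
print(queryValue(cur, "SELECT id FROM messages WHERE id = ?", (99,)))  # None
```

The None return is what makes the "skip messages we already have" check a one-liner in the crawl loop.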
I could actually get multiple messages at a time, but then I'd need a loop inside the loop, so I'm just going to do one at a time. And then I have the retrieval code here. I of course put it in a try/except block; I use urllib and give myself a 30-second timeout. This context=ctx is a compensation for bad HTTPS certificates. I wish someday I could just go off on why that got messed up and warn security people. Arrgh. Then I read the document and check to see if I got a 200, which of course is the HTTP OK status. I specifically catch Control-C so I can blow it up, get out, and clean my mess up. So let me go ahead and do that. I'll run python gmane.py, retrieve 10 messages, and then hit Control-C, and you see I actually caught this abort. What I want to do when I'm aborting is break out of my loop, and then make sure that I commit all the records that I just got done doing, and close. You've seen other examples where I didn't do this, wasn't in a try/except, and it blew up and messed things up. For the other, generic exception, who knows what's wrong, so I just print the error message out. Now, one thing I should point out at this point: this gmane.py code took me probably two weeks to write and get right. So you don't necessarily have to put all this stuff in. I added all these things, some of this try/except stuff, as I realized, you know, I have this problem and I have that problem. These programs evolve as you run them, and you should write them in a way that lets them do that. You'll notice that my thing is eminently restartable: you just drop the table and it starts itself back up. That matters because you're going to be restarting this so many times as you're debugging it. And this code is pretty smooth at this point; it's been run by probably 20,000 or 30,000 students in Python for Everybody.
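The error-handling shape just described can be sketched as follows. The function names (fetch_url, fetch_many) are hypothetical; the real gmane.py inlines all of this in its main loop. The key idea is that KeyboardInterrupt is caught separately from other errors, so Control-C falls through to the point where the caller can commit what it has and close the connection.

```python
import ssl
import urllib.request

# Ignore SSL certificate errors: the "just do them" incantation
# mentioned in the lecture.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def fetch_url(url):
    """Retrieve one message with a 30-second timeout, checking for HTTP 200."""
    document = urllib.request.urlopen(url, None, 30, context=ctx)
    if document.getcode() != 200:
        raise RuntimeError("Error code " + str(document.getcode()))
    return document.read().decode()

def fetch_many(urls, fetch=fetch_url):
    """Sketch of the crawl loop's error handling: Control-C breaks out
    cleanly so the caller can commit what it has and close up."""
    retrieved = []
    try:
        for url in urls:
            retrieved.append(fetch(url))
    except KeyboardInterrupt:
        print("Program interrupted by user...")
    except Exception as err:
        print("Unable to retrieve or parse page:", err)
    return retrieved
```

Parameterizing the fetch function is my addition for testability; it lets you exercise the abort path without a network or a real Control-C.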
But the point is, when you start, your code isn't necessarily this sophisticated. So don't feel like everything you write has to have a bunch of trys and excepts, although when you're writing code, you can come back, take a look at this code, and say, oh, that's kind of nice. This will always be available on pg4e.com/code, so even if you're not taking a class, you can see all of it. Okay. So I retrieve the data and print out the length, and you'll notice that's what's coming out here: which URL I'm retrieving and how long it is, 3686 characters. Then I start parsing it, right? Because I'm retrieving one mail message at a time, each one is supposed to have "From " at the beginning. And I have this thing about failing: sometimes it's just fail, fail, fail, fail, so I have a little code that's smart about giving up. So then I check whether I got "From " at the beginning. You'll notice that the "From " at the beginning is not necessarily the person the message is really from; there's a From: header with a colon, which is different from the "From " delimiter line. Then I look for two newlines in a row, \n\n, and if I find them, I break the message into header and body. Again, those are going to be two columns in my database so that I can separately do stuff to the header and the body. Otherwise I trigger a failure and print it out so I can debug it. These little failure checks are a good idea, and you don't want the program to go too far when it's failing. Then I've got to pull out the email address, and I use a regular expression looking for From: at the beginning of a line, optionally followed by a less-than bracket. And let me check something else here... I don't know why that space was there; I just took it out. So the code handles From: with angle brackets and From: without them, and if I find one with brackets, I take the brackets out.
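Here is a sketch of the two From: shapes just described and a regular expression for each. The sample headers are invented, and the exact patterns are my illustration of the idea rather than the literal regexes in gmane.py.

```python
import re

# The two formats the lecture describes: with and without angle brackets.
with_brackets = "From: Chuck Severance <csev@umich.edu>\nSubject: hi\n"
without_brackets = "From: csev@umich.edu\nSubject: hi\n"

def extract_email(header):
    # First try "From: Name <addr>", then plain "From: addr".
    # Prepend "\n" so a From: header on the first line still matches.
    m = re.search(r"\nFrom: .*<(\S+@\S+)>\n", "\n" + header)
    if m is None:
        m = re.search(r"\nFrom: (\S+@\S+)\n", "\n" + header)
    if m is None:
        return None
    return m.group(1).strip("<>").lower()

print(extract_email(with_brackets))     # csev@umich.edu
print(extract_email(without_brackets))  # csev@umich.edu
```

Either way the address comes out bare and lowercased, so the database ends up with one consistent representation.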
And the next thing that's a little bit difficult is the processing of the date. So I go find the Date header: I look for a line that starts with Date:, and that's right here. Then I pull out the pieces. The day of the week, like Friday, I want to get rid of; that's what this dot-star-comma does. Then I skip a space, and then I grab the rest. This is a weirdly formatted date, so I had to build a separate function, parsemaildate. If you take a look at parsemaildate, it first tries parser.parse from the dateutil library. If that works, great; otherwise it falls back to my own date compatibility library. I just Googled something like "what if the mail date parser doesn't exist", and then I could write my own, but I really wanted to use the library one to convert everything to ISO format, because ISO format is what my INSERT statement is capable of handling. So the date part is probably the most complex, and it falls into the category of data cleaning, right? So then I get a sent_at date, and then I go find the subject. I look for a newline followed by the word Subject: at the beginning of a line, extract everything up to the end of that line, and then pull that out of the header data. So I pull the subject out of the header data, strip it, lowercase it, etc., etc. And then I just insert it all: INSERT INTO messages. I set the id explicitly in this one, which is kind of weird. Then every 50 messages I do a commit, and every 100 messages I sleep so that I have a chance to hit Control-C, and when the loop is all done, I do conn.commit(). And again, I can run this over and over and over and keep pulling messages. And of course I can watch: at this point I got 403, and again, it's cruising along, right? If I hit Control-C now... I think, I think, or I might not have got it. It might not have been in the right place. So let's see if the commit worked.
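The date-cleaning step can be sketched like this. The real gmane.py uses dateutil's parser.parse with a hand-written fallback (datecompat.py); here I use the standard library's RFC 2822 mail-date parser instead to show the idea, and the sample header is invented.

```python
from email.utils import parsedate_to_datetime

# An invented Date header in the typical mail format.
header_line = "Date: Fri, 11 Mar 2005 14:15:16 -0500"

# Drop the "Date: " label, then the "Fri," day-of-week prefix:
# that's what the .*, in the lecture's regex throws away.
text = header_line[len("Date: "):]
text = text.split(",", 1)[-1].strip()   # "11 Mar 2005 14:15:16 -0500"

# Convert to an ISO-format string, which is what the INSERT can accept.
sent_at = parsedate_to_datetime(text).isoformat()
print(sent_at)   # 2005-03-11T14:15:16-05:00
```

The payoff is that every message, whatever mangled date format it arrived with, lands in the database as one uniform timestamp-with-timezone value.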
Well, I got three more in there, so the commit worked, okay? So let me just pull 10 messages down. That one worked, so what I'm going to do is stop here and come back and talk about what we're going to do when we start creating these indexes.