The next Python sample we're going to look at works with some email data. I've got some code for you to download: gmane.py, myutils.py, and datecompat.py. We're going to play with some regular expressions, do some indexing, do some looking, and again make some full-text databases and do some clever queries. Most of it is actually in the demonstration, and I give you the SQL commands there.
The basic idea is that we're going to retrieve data from an online source again, mbox.dr-chuck.net/sakai.devel, and pull down a series of mail messages. Mail messages are in a format called Mbox, and Mbox is an interesting format. Each message starts with a From line, followed by a series of headers. The headers are key-value pairs: the key is everything up to the colon character, and the value is everything after it. In this case, I went from message four through message six, so I got more than one message back. The messages are delimited by "From " (From followed by a space). You'll notice that "From:" with a colon is not the delimiter; that's just one of the keys in the header. After the headers there's a blank line, and then there's the message text. It's as simple as that. Blank lines can also appear in the message text, so you know you're in a new message when you see "From " at the start of a line. If a line in the message body happens to start with From, the format escapes it so it isn't mistaken for a delimiter.
This sample code is pretty cool, and it's much more sophisticated. The key thing is that it's like a web crawler: it uses the database as scratch storage while it spiders the source and pulls the data in. And it can be stopped and restarted.
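
The Mbox layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in gmane.py: the function name split_mbox and the SAMPLE text are made up for this example, and it assumes the common convention that a body line starting with "From " is escaped as ">From ".

```python
def split_mbox(text):
    """Split raw mbox text into a list of (headers, body) pairs."""
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            # "From" + space at the start of a line begins a new message.
            # "From:" (with a colon) does not match; it is an ordinary header.
            if current is not None:
                messages.append(current)
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(current)

    parsed = []
    for lines in messages:
        # The first blank line separates the headers from the body.
        blank = lines.index("") if "" in lines else len(lines)
        headers = {}
        last_key = None
        for header_line in lines[:blank]:
            if header_line[:1] in (" ", "\t") and last_key is not None:
                # A line starting with whitespace continues the previous header.
                headers[last_key] += " " + header_line.strip()
                continue
            key, sep, value = header_line.partition(":")
            if sep:
                last_key = key.strip()
                headers[last_key] = value.strip()
        body_lines = []
        for body_line in lines[blank + 1:]:
            # Assumed escaping convention: ">From " in the body means "From ".
            if body_line.startswith(">From "):
                body_line = body_line[1:]
            body_lines.append(body_line)
        parsed.append((headers, "\n".join(body_lines)))
    return parsed


# Hypothetical two-message sample, only for illustration.
SAMPLE = (
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "From: alice@example.com\n"
    "Subject: first message\n"
    "\n"
    "Hello.\n"
    "\n"
    ">From here on the body continues.\n"
    "From bob@example.com Sat Jan  5 10:00:00 2008\n"
    "From: bob@example.com\n"
    "Subject: second message\n"
    "\n"
    "Bye.\n"
)

for headers, body in split_mbox(SAMPLE):
    print(headers.get("From"), "|", headers.get("Subject"))
```

For real archives, Python's standard mailbox module is probably a better starting point than a hand-written splitter; the sketch is only meant to show how the "From " delimiter, the headers, and the blank line fit together.
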
If the server starts having problems, you can stop your crawling process, figure out what's wrong with the server or with your network, and then start it back up, and it's pretty cool. The other thing it's doing is cleaning up the data. This is real email data that was archived right off of a real email server, and the header conventions and the way addresses are represented vary a little bit from message to message. So one of the things you will see in this code is the code that cleans up the different formats of dates and times in different formats of email messages, and so on, using regular expressions. These are things you can't really do in Postgres, but you can do pretty naturally in Python.
I'll let you know that when you look at this code, this gmane.py, it wasn't code that I thought through for a few hours and then wrote in a few hours. It took about two weeks of work, where I was talking to the server, having things blow up, and then evolving the code. The code you're going to use is pretty robust, and it's pretty resilient when it faces errors, but before I started I didn't anticipate every error I could possibly run into. When you write this kind of code, you start by writing a basic thing, and if nothing goes wrong, it works, you've got your data, and you do your analysis. If you find problems of data inconsistency, server unreliability, or rate limits, then you might have to adjust your code. Some of the programs I've written get a special code back when they hit the rate limit, and they're like, "Oh, saw a rate limit. Might as well stop, wait a day, and go get some more data." So it's an interesting kind of programming exercise, in that it's an evolutionary programming exercise where you don't always know what's going to happen when you start talking to external data sources.
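
To make the stop-and-restart idea concrete, here is a small sketch of a crawl loop that uses a database table as scratch storage and gives up cleanly on a rate limit. It is written under several assumptions: it uses SQLite from the standard library so it runs on its own (the course code talks to Postgres), and the names crawl, RateLimited, and fake_fetch are hypothetical stand-ins, not functions from gmane.py.

```python
import sqlite3


class RateLimited(Exception):
    """Raised by a fetch function when the server signals a rate limit."""


def crawl(conn, fetch, max_messages):
    """Fetch up to max_messages new messages, resuming after the last stored id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (id INTEGER PRIMARY KEY, raw TEXT)"
    )
    # The table is the scratch storage: the highest stored id says where to resume.
    row = conn.execute("SELECT MAX(id) FROM messages").fetchone()
    start = (row[0] or 0) + 1
    fetched = 0
    for msg_id in range(start, start + max_messages):
        try:
            raw = fetch(msg_id)
        except RateLimited:
            # Stop here; a later run picks up from the last committed message.
            break
        if raw is None:
            # Nothing more to retrieve from the source.
            break
        conn.execute(
            "INSERT INTO messages (id, raw) VALUES (?, ?)", (msg_id, raw)
        )
        conn.commit()  # commit per message so an interruption loses nothing
        fetched += 1
    return fetched


def fake_fetch(msg_id):
    """Hypothetical stand-in for a network call; pretends the archive has 5 messages."""
    if msg_id > 5:
        return None
    return f"From sender{msg_id}@example.com\n\nbody {msg_id}\n"


conn = sqlite3.connect(":memory:")
print(crawl(conn, fake_fetch, 2))   # first run fetches a couple of messages
print(crawl(conn, fake_fetch, 10))  # second run resumes where the first stopped
```

Committing after every message is a deliberate choice in this sketch: it is slower than batching, but it means the process can be killed at any point and restarted without refetching or losing data.
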
Part of the idea is to get all this data into a nice, consistent format, with the email addresses, dates, names, headers, body, and all that stuff stored in a way that lets you start your data analysis without worrying about all of those crazy vagaries. So take a look at me walking through the sample code. It's quite a long run through the sample code, and then through what we do with the data once we have it in a database.
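
As a rough illustration of what "consistent format" can mean for addresses and dates, the sketch below normalizes both. It is an assumption-laden example rather than the approach in gmane.py or datecompat.py: the helper names clean_address and clean_date are invented here, a regular expression does the address cleanup, and the date parsing leans on the standard library's email.utils instead of hand-written patterns.

```python
import re
from email.utils import parsedate_to_datetime


def clean_address(value):
    """Return a lowercased bare email address from a header value, or None."""
    # Matches the address whether it appears as "Name <a@b>" or "a@b (Name)".
    match = re.search(r"[^\s<>()]+@[^\s<>()]+", value)
    return match.group(0).lower() if match else None


def clean_date(value):
    """Return an ISO 8601 string for an email Date header, or None if unparseable."""
    # Drop trailing comments such as "(EST)" before parsing.
    stripped = re.sub(r"\s*\([^)]*\)", "", value).strip()
    try:
        return parsedate_to_datetime(stripped).isoformat()
    except (TypeError, ValueError):
        # Different Python versions signal a bad date with different exceptions.
        return None


print(clean_address("Alice Example <Alice@Example.COM>"))
print(clean_date("Sat, 5 Jan 2008 09:14:16 -0500 (EST)"))
```

Returning None for values that cannot be parsed keeps the bad rows visible, so they can be counted and inspected later instead of silently breaking the analysis.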