Hello and welcome back to our in-progress demonstration of handling email data. So what we've done so far is we've actually run the program and we've got 463 messages. You can just do however many you want, you just keep on going, there's actually thousands of them. And then what we're going to do is we're going to start doing our indexes. And so I'll just start this index because sometimes it takes a little while for the index to run. So I've created an index named message_gin in the messages using the Generalized Inverted Index, and then the computation that is producing the array that's being indexed is the to_tsvector, using the English dictionary on the body. So we're only going do an index on the body. And so that's running right now. And so let's play a little bit and do some review. Actually these'll run, it's the explains that take a while. So we're going to look at the ts_vector and that shows what the index is being made with. So SELECT to_tsvector english body FROM messages LIMIT 1, that is basically showing the stemming. So we see all of the stemmed words like notifications as notif, organization is organ, please has been stemmed, the product is stemmed. So you see this stemming. And so simply running the function on the body is very different, we get a very different thing. So if we look at the whole body, so if I say, SELECT the body FROM messages LIMIT 1. Oh yeah, the perceptive reader will realize that I mistyped from, which I do all the time. So there's a whole bunch of stuff, right? So there's a whole bunch of stuff in here and the stemming, and it threw away semicolons, it threw away parentheses, "ice storm" is somewhere in there. So the index is really built based on this greatly reduced stemmed set of words and that's the inverted index. And the GIN basically says, for every one of these words in here, the stemmed case-fixed words are going to point at, if I say SELECT id comma body FROM messages LIMIT 1, all those stemmed words are going to point at 1. So you can, if you ask where golden is, it'll say, "Well, golden is in Message 1, along with all the rest of the messages." So that shows our stemming. I can do the to_tsquery. That is the stem of easier, easier is, so can I change that to easy? So easy turns into ease and the ts_query is in effect taking the WHERE clause, the thing that we would do the comparing to, and making sure that it whatever word is in your WHERE clause or the thing you're looking for, matches whatever things. So like notifications. So notif, notifications should go down to notif, right? So that does it. So that makes sure that we do the stemming process before we create the thing that we're going to look up in the inverted index because notifications is not in the inverted index at all. And so now what we can say is we can find whether or not a query in the double at-sign. So here I'm saying SELECT id to_query english neon, double at-sign to_query english body. And so that's asking the question is neon, appropriately stemmed, etc., in the body in English? And I'm going to look at the first 10 messages. Now I'm going to guess that neon shouldn't show up anywhere there. Yeah. False, false, false, false, false, false, false, false, false. Right? So there we go. And so it's all falses because the word neon is not in there anywhere. But if I can look for the word english, and now I'm just printing whether it matches, so we can kind of understand how WHERE clauses work. And so if we see easier, that didn't show up. Let's look for something that we know in here that's going to be there, easier wasn't in any of the rows, but let's put notifications because it's in all of them. True, true, true, true, true. Let's look for golden to see if golden, if the ts_query of golden is in the ts_vector of body. And so we got a true. Message 1 is true and the rest of them are false and that's probably because Glen Golden wrote Message 1. And so that gives you a sense of how we use this in a WHERE clause. Now what I'm going to do is just so I can keep track of things, I'm going to actually add a column that wasn't in the original database, ALTER TABLE messages ADD COLUMN sender as a TEXT column. And so then what I'm going to do is I'm actually going to do a substring. Let's just do a SELECT on this, SELECT substring headers and then a complex regular expression FROM messages LIMIT 10. Okay? And so this is pulling out those email messages from names. And then I'm going to do an UPDATE statement so that I get a new column. I'm populating this new column, the sender column, and I'm populating it by running through and parsing all of the headers. And so that actually built the sender. So I can say SELECT now sender FROM messages LIMIT 10, and I'm just going put that on. So instead of having to put this substring headers everywhere, I put that. I'm not going to index it because I want to show you other techniques. Okay. So if I take a look now, this just allows me to throw the sender on various selects, SELECT subject comma sender FROM messages WHERE it's on body Monday. So where ts_query monday is in the ts_vector of the body. And so this is showing now I can easily see the messages and their subjects and who sent them, and so sender is just there so that these look a little prettier. So now we can take a few things. Let's check to see if our index is working by using an EXPLAIN ANALYZE on tsquery of monday into the tsquery of the body. And that was a sequential scan. So now we've waited a little while and we do the EXPLAIN ANALYSE and we do see we get a nice it uses the messages_gin scan, and away we go. Now you'll notice that, for example, we never made a Spanish index. So if we take a look at the EXPLAIN ANALYZE spanish, it actually can do this, right? It actually can run the Spanish select. It could actually pretend these are all Spanish, but it's a sequential scan, right? So if I just run the SELECT, you'll see that it's capable of understanding. I mean, this is really checking to see, I'm doing a sequential scan by looking in Spanish. Now, of course Spanish doesn't tell us much here. The key is to_tsquery spanish is just code that runs to_tsvector. So what it's going to do here, it's going to retrieve all the bodies. It's going to compute the to_tsvector of spanish of all the bodies sequentially, then it's going to do the to_tsquery of monday in Spanish with stemming and all that stuff, right? So we could see what to_tsquery of monday looks like, but it's probably going to look exactly like English, but you can kind of see that. Yeah. So the Spanish doesn't change anything. The point is it runs through all of them sequentially, does the computation, the WHERE clause works. The difference is that because I never made the GIN for Spanish, it's not fast. And so it's not like you can't do things, the index just really makes things faster, okay? And so the last thing I'm going to do here is I'm going to just switch between GIN and GIST. So I'm going to drop the GIN index, boom, and then I'm going to create a GIST index, which is using GIST in the same thing, to_tsvector. And so now we've created a GIST index. Now to review, the difference between a GIN index and a GIST index has nothing to do with the queries, it has to do with the size of the index, the cost of maintaining the index and the performance of the queries. And so GIN is a simpler, hash-based index and because it's hashing, it's somewhat lossy, but it keeps it smaller and it's easier to maintain. So the GIST is smaller, easier to maintain. But then when you do selects, you might get extra rows, but then it does actually check even, with a GIST index. If you do something, it's going to maybe under the covers pull too many rows out. Did I forgot my semicolon? There might have been rows that it found that don't actually match the to_tsquery of english of monday to the to_tsvector of english body. It might, but then it throws them away. And so it might actually retrieve but not hand to you. So before it hands it to you, it only sends you the exact messages. And so ultimately, GIN and GIST do the same thing, they just have different performances. And in everything in indexes, you are always trying to trade off insert performance, size of index, and performance of select. And so that's a trade-off. I just did this as GIST to show that it pretty much does the exact same thing. So we'll stop now and then come back for the third bit of this discussion.