In this lecture, we're going to talk about pattern matching in strings using regular expressions. Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern which you give to a regex processor along with some source data. The processor then parses that source data using the pattern and returns chunks of text back to the data scientist or programmer for further manipulation. There are really three main reasons you would want to do this: to check whether a pattern exists within some source data, to get all instances of a complex pattern from some source data, or to clean your source data using a pattern, generally through string splitting. Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications. A solid understanding of regexes will help you quickly and efficiently manipulate text data for further data science application.

Now, you could teach a whole course on regular expressions alone, especially if you wanted to demystify how the regex parsing engine works and the efficient mechanisms for parsing text. In this lecture, I want to give you a basic understanding of how regex works: enough knowledge that, with a little directed sleuthing, you'll be able to make sense of the regex patterns you see others using, and you can build up your own practical knowledge of how to use regexes to improve your data cleaning. By the end of this lecture, you'll understand the basics of regular expressions: how to define patterns for matching, how to apply these patterns to strings, and how to use the results of those patterns in data processing.

Finally, note that in order to best learn regexes, you need to write regexes. I encourage you to stop the video at any time and try out new patterns or syntaxes you learn. So first, we'll import the re module, which is where Python stores its regular expression library. So import re.
There are several main processing functions in re that you might use. The first, match, checks for a match at the beginning of the string and returns a result that can be evaluated as a Boolean. Similarly, search checks for a match anywhere in the string and returns a result that can be evaluated as a Boolean. Let's create some text for an example. So text = "This is a good day." Now, let's see if it's a good day or not. So we can do if re.search("good", text), where the first parameter is the pattern. Then we'll print "Wonderful!" if good is there, else we'll print "Alas" and a frowny face.

So in addition to checking for conditionals, we can segment a string. The work that regex does here is called tokenizing, where the string is separated into substrings based on patterns. Tokenizing is a core activity in natural language processing, which we won't talk much about here but you'll study in the future. The findall and split functions will parse the string for us and return chunks. Let's try an example. So text = "Amy works diligently. Amy gets good grades. Our student Amy is successful." It sounds like a very positive thing. This is a bit of a fabricated example, but let's split on all instances of Amy. So re.split("Amy", text), with Amy as the pattern we want to match and text as our text. You'll notice that split has returned an empty string followed by a number of statements about Amy, all as elements of a list. If we wanted to count how many times we've talked about Amy, we could use findall. So re.findall("Amy", text).

Okay, so we've seen that search looks for some pattern and returns a result we can treat as a Boolean, that split will use a pattern for creating a list of substrings, and that findall will look for a pattern and pull out all occurrences. Now that we know how the Python regex API works, let's talk a little bit more about complex patterns. The regex specification standard defines a markup language to describe patterns in text. Let's start with anchors.
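The three functions covered so far can be sketched roughly as follows. The sentences are the ones from the lecture; the variable names `mood`, `parts`, and `mentions` are just illustrative choices.

```python
import re

text = "This is a good day."

# re.search returns a match object if the pattern occurs anywhere in the
# string, and None otherwise, so the result works directly in a condition.
if re.search("good", text):
    mood = "Wonderful!"
else:
    mood = "Alas :("
print(mood)

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# re.split cuts the string at every occurrence of the pattern. Because the
# text starts with "Amy", the first element of the list is an empty string.
parts = re.split("Amy", text)
print(parts)

# re.findall returns every non-overlapping occurrence as a list of strings,
# so its length is the number of times Amy is mentioned.
mentions = re.findall("Amy", text)
print(len(mentions))
```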
Anchors specify the start and/or end of the string that you're trying to match. The caret character means start, and the dollar sign character means end. If you put a caret before a string, it means that the text the regex processor retrieves must start with the string you specify. For ending, you put the dollar sign character after the string, which means that the text the regex processor retrieves must end with the string you specified. Here's an example. So text = "Amy works diligently. Amy gets good grades, and our student Amy is successful." Let's see if this begins with Amy. So we could do re.search("^Amy", text), with caret Amy as our pattern, so we want to match Amy at the beginning. Notice that re.search actually returned to us a new object called an re.Match object. An re.Match object always has a Boolean value of True, since something was found; when nothing is found, re.search returns None instead. So you can always evaluate the result in an if statement like we did earlier. The rendering of the match object also tells you what pattern was matched, in this case the word Amy, and the location of the match, shown as the span.

Let's talk more about patterns and start with character classes. Let's create a string of a single learner's grades over a semester in one course across all of their assignments. So grades equals, and I'll just put a string of As, Bs, and Cs. No Ds for this student. If we want to answer the question, how many Bs were in the grade list, we could just use B as our pattern. So re.findall("B", grades), and we see that there are three here. If we wanted to count the number of As or Bs in the list, we can't use AB as the pattern, since that matches an A followed immediately by a B. Instead, we put the characters A and B inside square brackets. So we do re.findall("[AB]", grades). Here, we can see the whole list of As and Bs that are found. This is called the set operator.
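Here is a small sketch of the anchor and set-operator examples. The lecture does not spell out the grade string, so the value of `grades` below is an assumption, chosen only because it is consistent with the results described (three Bs, one streak of four As, one streak of two As).

```python
import re

text = "Amy works diligently. Amy gets good grades, and our student Amy is successful."

# The caret anchors the pattern to the start of the string.
start_match = re.search("^Amy", text)
print(start_match)         # rendering shows the span and the matched text
print(start_match.span())  # location of the match as (start, end)

# "works" occurs in the text, but not at the start, so this returns None.
print(re.search("^works", text))

# Assumed grade string (not given in the lecture).
grades = "ACAAAABCBCBAA"

# A plain literal pattern finds every B.
b_grades = re.findall("B", grades)
print(len(b_grades))

# The set operator [AB] matches a single character that is either A or B.
a_or_b = re.findall("[AB]", grades)
print(a_or_b)
```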
You can also include a range of characters, which are ordered alphanumerically. For instance, if we want to refer to all lowercase characters, we could use a-z in the square brackets. Let's build a simple regex to parse out all instances where the student received an A followed by a B or a C. So re.findall, and then we'll use the set operator with A and the set operator with B-C, so "[A][B-C]", and pass in the grades. Notice how the [AB] pattern describes a set of possible characters which could be either A or B, while the [A][B-C] pattern denotes two sets of characters which must be matched back to back. You can also write this pattern by using the pipe operator, which means OR. So we could do re.findall("AB|AC", grades), with AB or AC directly as our pattern.

We can use the caret with the set operator to negate our results. For instance, if we wanted to parse out only the grades which were not As, we would do re.findall, and then we'd use the set operator with, and this is important, caret A inside it, so "[^A]", and grades. Here we only have Bs and Cs. So note this carefully: the caret was previously used to match the beginning of a string as an anchor point. But inside the set operator, the caret and the other special characters we'll be talking about lose their usual meaning, and a leading caret means negation instead. This can be a bit confusing. What do you think the result of this operation would be? So we'll do re.findall with a caret, then the set operator containing another caret and A, so "^[^A]", and grades. Just take a minute, look at that, and reflect on what we just talked about. What do you think it would be? So it's an empty list, because the regex is saying we want to match any value at the beginning of the string which is not an A, and our string starts with an A, so there's no match found. Remember, when you're using the set operator, you're doing character-based matching. So you're matching individual characters in an OR fashion. Okay. So we've talked about anchors and matching to the beginning and end of patterns.
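The range, pipe, and negation patterns above might look like this in practice. As before, the `grades` string is an assumed stand-in consistent with the lecture's description, not the original data.

```python
import re

# Assumed grade string (not given in the lecture).
grades = "ACAAAABCBCBAA"

# Two sets back to back: an A followed by either a B or a C.
a_then_b_or_c = re.findall("[A][B-C]", grades)
print(a_then_b_or_c)

# The same idea written with the pipe (OR) operator.
same_with_pipe = re.findall("AB|AC", grades)
print(same_with_pipe)

# Inside the set operator a leading caret negates: anything that is not an A.
not_a = re.findall("[^A]", grades)
print(not_a)

# Outside the brackets the caret is the start anchor, so this asks for a
# first character that is not an A. The string starts with A, so no match.
anchored_not_a = re.findall("^[^A]", grades)
print(anchored_not_a)
```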
We've talked about character classes using the set notation. We've talked about character negation and how the pipe character allows us to use OR in our operations. Let's move on to quantifiers. Quantifiers are the number of times you want a pattern to be matched in order to actually count as a match. The most basic quantifier is the expression e{m,n}, where e is the expression or character we're matching, m is the minimum number of times you want it to be matched, and n is the maximum number of times the item could be matched. Let's use these grades as an example. How many times has this student been on a back-to-back A streak? So we do this with re.findall, where the character we're interested in is A, so that takes the place of e, then a curly brace, we want to have at least two of them, so that's our m value of two, and we'll just have some big value for n. We use ten here, so "A{2,10}", and we'll pass in grades. So we're using two as our min and ten as our max. We see that there were two streaks: one where the student had four As, and one where they only had two As.

We might try to do this using single values and just repeating the pattern. So we might try re.findall with a capital A with a min and max of one, followed by another capital A with a min and max of one, so "A{1,1}A{1,1}", against grades. As you can see, this is different from the first example. The first pattern is looking for any combination of two As up to ten As in a row, so it sees four As as a single streak. The second pattern is looking for exactly two As back to back, so it sees two As followed immediately by two more As. We say that the regex processor begins at the start of the string and consumes the characters which match the pattern as it goes. It's important to note that the regex quantifier syntax does not allow you to deviate from the {m,n} pattern. In particular, if you've got an extra space inside the braces, you'll get an empty result.
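To make the difference between the two quantifier patterns concrete, here is a minimal sketch. The `grades` value is again an assumption that fits the lecture's description (one streak of four As and one of two).

```python
import re

# Assumed grade string (not given in the lecture).
grades = "ACAAAABCBCBAA"

# Between two and ten As in a row: each run of As is reported once.
streaks = re.findall("A{2,10}", grades)
print(streaks)

# Exactly one A followed by exactly one A: the run of four As is consumed
# two characters at a time, so it is reported as two separate pairs.
pairs = re.findall("A{1,1}A{1,1}", grades)
print(pairs)
```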
So re.findall("A{2, 2}", grades), with a space after the comma, and we see that that's an empty result. As we've already seen, if we don't include a quantifier then the default is {1,1}. So re.findall("AA", grades), we can just use AA if we want to match each A once. If you have one number in the braces, it's considered to be both the m and the n value. So re.findall("A{2}", grades), where two is both the min and the max. So using this we could find a decreasing trend in the student's grades, for instance. So re.findall, and we'll look for As, any number of As, we'll use ten as the max here as well, then Bs, any number, then Cs, any number, so "A{1,10}B{1,10}C{1,10}", and then we'll pass in grades as our string. Now, that's a bit of a hack, because we included a maximum that was just arbitrarily large. There are three other quantifiers that are used as shorthand which we could think about using here: an asterisk to match zero or more times, a question mark to match zero or one time, or a plus sign to match one or more times.

Let's look at a more complex example and load some data scraped from Wikipedia. So we'll open a dataset called ferpa.txt. This is the FERPA article on Wikipedia as a file. We'll read this into a variable called wiki. Let's just print this out to the screen. So scanning through this document, one of the things we notice is that the headers all have the word edit in square brackets behind them, followed by a newline character. So if we wanted to get a list of all the headers in this article, we could do so using re.findall. So re.findall, and then we can say here, well, we're interested in all characters that are lowercase a through z or uppercase A through Z. We're interested in somewhere between 1 and 100 of those characters, as long as they're followed by [edit]. We have to escape the square brackets here, unfortunately, which makes it a little bit messier, but we don't want to move into the set notation, so the pattern is "[a-zA-Z]{1,100}\[edit\]". Okay. So that didn't quite work.
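The quantifier variations and the first attempt at the header pattern can be sketched as below. Both inputs are assumptions: `grades` is a stand-in consistent with the lecture's results, and `wiki` is a few made-up lines imitating the shape of the scraped article (headers ending in `[edit]`), not the contents of ferpa.txt.

```python
import re

# Assumed grade string (not given in the lecture).
grades = "ACAAAABCBCBAA"

# A space inside the braces breaks the {m,n} syntax; the lecture reports
# an empty result for this pattern.
spaced = re.findall("A{2, 2}", grades)
print(spaced)

# A single number is used as both the minimum and the maximum.
exactly_two = re.findall("A{2}", grades)
print(exactly_two)

# A decreasing trend: some As, then some Bs, then some Cs.
decreasing = re.findall("A{1,10}B{1,10}C{1,10}", grades)
print(decreasing)

# The plus shorthand (one or more) avoids the arbitrary maximum of ten.
decreasing_plus = re.findall("A+B+C+", grades)
print(decreasing_plus)

# Hypothetical stand-in for the scraped Wikipedia text.
wiki = (
    "Overview[edit]\n"
    "Some body text about the law.\n"
    "Access to public records[edit]\n"
    "More body text.\n"
    "Student medical records[edit]\n"
    "Closing body text.\n"
)

# Letters only, so the match cannot extend across the spaces in a header.
last_words = re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)
print(last_words)
```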
It got out the headers, but only the last word of each header, and it really was quite clunky. So let's iteratively improve on this. First, we can use backslash w to match any word character, which covers letters, digits, and the underscore. So if we do re.findall, we want to match \w. We're not going to worry about the a through z and the lowercase and uppercase, we just want any word character, somewhere between 1 and 100 of those, followed by [edit]. So this is something new. \w is a metacharacter, and it indicates a special pattern of any letter, digit, or underscore. There are actually a number of different metacharacters listed in the documentation. For instance, \s matches any whitespace character.

Next, there are the three shorthand quantifiers we can use to shorten up the curly brace syntax. We can use the asterisk to match zero or more times, so let's try that. So re.findall, we want to match any word character any number of times, so we'll just use an asterisk, followed by [edit]. This removes our upper limit of 100. Now that we've shortened the regex, let's improve it a little bit. We can add in spaces using the space character. So we could do re.findall, and in our set notation we want to match either word characters or space characters any number of times, followed by [edit], so "[\w ]*\[edit\]". Okay. So this gets us a list of section titles in the Wikipedia page.

You can now create a list of titles by iterating through this and applying another regex. This is quite common to do. For instance, for title in re.findall, where we put in our regex which is matching all of the titles, we take the intermediate result and split on the square bracket, and just take the first result. So I'll just print re.split, and the character we're looking for is a square bracket, so we have to escape it, so it looks a little nasty. Then we just want to take the first value of this list, which will be the actual title. Okay. This works, but it's a bit of a pain.
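The iterative improvement described above could look like the following sketch. The `wiki` string is a made-up stand-in for the scraped article, used only so the example runs on its own.

```python
import re

# Hypothetical stand-in for the scraped Wikipedia text.
wiki = (
    "Overview[edit]\n"
    "Some body text about the law.\n"
    "Access to public records[edit]\n"
    "More body text.\n"
    "Student medical records[edit]\n"
    "Closing body text.\n"
)

# \w with an asterisk: any number of word characters. Spaces are not word
# characters, so this still captures only the last word of each header.
word_only = re.findall(r"\w*\[edit\]", wiki)
print(word_only)

# Adding a space to the set lets the match cover multi-word headers.
headers = re.findall(r"[\w ]*\[edit\]", wiki)
print(headers)

# Second pass: split each header on the opening bracket and keep the part
# before it, which is the title itself.
titles = []
for title in headers:
    titles.append(re.split(r"[\[]", title)[0])
print(titles)
```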
To this point, we've been talking about a regex as a single pattern which is matched, but you can actually match different patterns, called groups, at the same time, and then refer to these groups later as you want to. To group patterns together you use parentheses, which is actually pretty natural. So let's rewrite our findall using groups. So re.findall, and then we start a pattern string, and we want the first group, so the first set of parentheses will be the first part of our pattern, which is any word character or space any number of times. The second group will actually just be this [edit] tag that denotes to us that it's a header. Nice. We see that the Python re module breaks out the result group by group, and we can actually refer to groups by number as well with the match objects that are returned.

But how do we get back a list of match objects? Thus far, we've seen that findall returns strings, and search and match return individual match objects, but what do we do if we want a list of match objects? In this case, we use the function finditer. So for item in re.finditer, and then we pass in the pattern that we're interested in. Then we can print item.groups(). So this is the groups for a specific match item, and we're iterating over all of them. We see here that the groups method returns a tuple of the groups. We can get an individual group using group(number), where group(0) is the whole match and each other number is the portion of the match that we're interested in. In this case, we want group(1). So we don't want the whole match, we want the first group in the match. So for item in re.finditer, we'll pass in our whitespace or word character any number of times as the first group, and [edit] as the second group, and we'll just print out item.group(1). One more piece of regex groups that I rarely use, but which is a good idea, is labeling or naming groups.
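A minimal sketch of positional groups with findall and finditer follows. The `wiki` text is a made-up stand-in for the article, and `first_groups` is just an illustrative name.

```python
import re

# Hypothetical stand-in for the scraped Wikipedia text.
wiki = (
    "Overview[edit]\n"
    "Some body text about the law.\n"
    "Access to public records[edit]\n"
    "More body text.\n"
    "Student medical records[edit]\n"
    "Closing body text.\n"
)

pattern = r"([\w ]*)(\[edit\])"

# With groups in the pattern, findall returns one tuple per match,
# holding the text captured by each group.
grouped = re.findall(pattern, wiki)
print(grouped)

# finditer yields match objects, which expose the groups individually.
first_groups = []
for item in re.finditer(pattern, wiki):
    print(item.groups())   # tuple of all captured groups
    print(item.group(0))   # group 0 is the whole match
    first_groups.append(item.group(1))  # group 1 is the title portion
print(first_groups)
```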
In the previous example, I showed you how you can use the position of the group, but giving groups a label and looking at the results as a dictionary is actually pretty useful. For that we use the syntax (?P<name>), where we have a parenthesis, then a question mark, a capital P, and then the name inside angle brackets. The parenthesis starts the group, the question mark and capital P indicate that this is an extension to basic regexes, and the name in the angle brackets is the dictionary key that we use. So for item in re.finditer, we could use as our pattern our first group, so parenthesis, and we want to name it, so question mark P. I'll name this title, and you see that we keep the angle brackets. Then we actually put our pattern. So we want the set operator, which indicates character matching, and we want any word character, so \w, as well as spaces. So we'll put a space in there, any number of times, so we've got an asterisk for that. Then the second group, we want to name this as well, question mark P, and we'll call this edit_link, for lack of a better word, and then our pattern here, for which we have to escape the square brackets.

So we can get the dictionary returned for the item using the .groupdict() method. So print item.groupdict(), and since this is a dictionary, we can just index it with title, which is our key. Of course, we can print out the whole dictionary for the item too and see that the [edit] string is still in there. So here's the dictionary kept for the very last match. So print item.groupdict(). There's our actual full dictionary.

Okay. So we've seen how we can match individual character patterns with the set operator, how we can group matches together using parentheses, and how we can use quantifiers such as the asterisk, the question mark, or {m,n} to describe patterns. Something I glossed over in the previous example was \w, which stands for any word character.
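Named groups might be used along these lines. The group names `title` and `edit_link` come from the lecture; the `wiki` text is a made-up stand-in for the article.

```python
import re

# Hypothetical stand-in for the scraped Wikipedia text.
wiki = (
    "Overview[edit]\n"
    "Some body text about the law.\n"
    "Access to public records[edit]\n"
    "More body text.\n"
    "Student medical records[edit]\n"
    "Closing body text.\n"
)

# (?P<name>...) starts a group and registers it under the given name.
pattern = r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])"

titles = []
for item in re.finditer(pattern, wiki):
    # groupdict() maps each group name to the text it captured.
    titles.append(item.groupdict()["title"])
print(titles)

# After the loop, item still refers to the last match, so this prints
# the full dictionary for it, including the edit_link entry.
last_dict = item.groupdict()
print(last_dict)
```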
So there are a number of shorthands that can be used with regexes for different kinds of characters, including a period for any single character which is not a newline, \d for any digit, and \s for any whitespace character like spaces and tabs. There are more, and a full list can be found in the Python documentation for regexes.

One more concept to be familiar with is called lookahead and lookbehind matching. In this case, the pattern being given to the regex engine is for the text either before or after the text that we're actually trying to isolate. For example, in our headers, we want to isolate the text which comes before the [edit] marker, but we actually don't care about the [edit] text itself. Thus far, we've been throwing the [edit] away. But if we want to use it to match and don't want to capture it, we could put it in a group and use lookahead instead, with the "?=" syntax. So let's see an example of this. For item in re.finditer, we're going to match two groups. The first group is going to be a named group called title. We're just copying that from before. But the second group is actually one of these throwaway groups that is a lookahead. So we're looking for [edit], but we're not actually including it in the match. So what this regex says is to match two groups. The first will be named title and will have any amount of whitespace or regular word characters. The second will be the characters [edit], but we don't actually want this item put in our output match objects. So we can print each item. We see that we have three match objects there, their spans, and the actual values that they matched.

Let's look at some more Wikipedia data. Here's some data on universities in the US which are Buddhist-based. So with open datasets/buddhist.txt as file, and we'll read this into a variable called wiki. For good measure, let's print this variable out to the screen.
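The lookahead on the section headers described above can be sketched like this. The `wiki` text here is a made-up stand-in for the FERPA article, and the plus quantifier on the title group is an assumption on my part so that empty titles are never reported.

```python
import re

# Hypothetical stand-in for the scraped FERPA text.
wiki = (
    "Overview[edit]\n"
    "Some body text about the law.\n"
    "Access to public records[edit]\n"
    "More body text.\n"
    "Student medical records[edit]\n"
    "Closing body text.\n"
)

# (?=...) is a lookahead: [edit] must follow the title, but it is not
# consumed and is not part of the match.
pattern = r"(?P<title>[\w ]+)(?=\[edit\])"

matched = []
for item in re.finditer(pattern, wiki):
    print(item)  # rendering shows the span and the matched text
    matched.append(item.group(0))
print(matched)
```

Because the lookahead text is not consumed, the whole match for each header is just the title, with no `[edit]` suffix to strip afterwards.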
We can see that each university follows a fairly similar pattern, with the name followed by an em dash, then the words "located in", followed by the city and the state. I'm actually going to use this example to show you the verbose mode of Python regexes. The verbose mode allows you to write multi-line regexes and increases readability. For this mode, we have to explicitly indicate all whitespace characters, either by prepending them with a backslash or by using the \s special value. However, this means that we can write our regex a bit more like code, and we can even include comments with the pound sign.

So pattern equals, and we'll use a triple-quoted string, which means we can go on multiple lines. The first group we want to match is the title. So here, I'm going to say I want it to be a named group called title, and I want any number of characters, so I'll just use ".*". We'll add a comment, the university title. The next group is going to be the em dash followed by a space, which I'm escaping, and then located, escaped space, in, escaped space. So this is just some indicator of the location. I don't care about this, so I'm not naming this group. The next one we'll make is the named group city, and this is any number of word characters. So this is the city that the university is in. Then, we have a separator, comma and space, that we see is a common pattern. I don't care about this, so I won't name this group. Then, we'll name another group, the state. It's just like city, any number of word characters. So this is the state the city is in. Then, we'll end the triple-quoted string.

So now, when we call finditer, we just pass the re.VERBOSE flag as the last parameter. This makes it much easier to understand large regexes. So we just use this like before. We say for item in re.finditer, we pass in the pattern, we pass in our source text, and because it's this multi-line verbose mode, we pass re.VERBOSE to let the processor know.
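A verbose pattern along the lines just described might look like the sketch below. The two university lines are invented placeholders, not entries from buddhist.txt, and the em dash is written as the escape `\u2014` on the assumption that this is the separator character in the data.

```python
import re

# Hypothetical stand-in lines in the same shape as the scraped data.
wiki = (
    "Example University \u2014 located in Springfield, Ohio\n"
    "Sample College \u2014 located in Boulder, Colorado\n"
)

# In verbose mode, unescaped whitespace is ignored and # starts a comment,
# so literal spaces must be written as "\ " (or \s).
pattern = r"""
(?P<title>.*)             # the university title
(\u2014\ located\ in\ )   # em dash, then the words located in
(?P<city>\w*)             # city the university is in
(,\ )                     # separator between city and state
(?P<state>\w*)            # state the city is located in
"""

rows = []
for item in re.finditer(pattern, wiki, re.VERBOSE):
    # Only the named groups appear in the dictionary.
    rows.append(item.groupdict())
print(rows)
```

Note that with this pattern the title keeps the space that precedes the dash, and `\w*` only handles single-word city and state names, so real data would likely need a more forgiving pattern.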
We can get the dictionary returned to us for the item with the .groupdict() method. So we print item.groupdict(). So there we see our output is a set of universities, their cities, and their states. All of the groups that we didn't name are still matched, so they're consumed by the processor, but they're left out of the dictionary.

Here's another example: tweets from the New York Times covering health news items. This data came out of the UC Irvine Machine Learning Repository, which is a great source of different kinds of data. So with open datasets/nytimeshealth.txt as file, and we'll read everything into a variable and take a look at it. So health equals file.read(). We'll look at health. That's a lot of tweet data. So here, we can see there are tweets, and the fields are separated by pipes. Let's try to get a list of all of the hashtags that are included in this data. Now, in Twitter, a hashtag begins with a pound sign, or hash mark, and continues until some whitespace is found. So let's create a pattern. We want to include the hash sign first, then any number of alphanumeric characters, and we end when we see some whitespace. So pattern is equal to the hash mark, followed by a set of word characters or digits with an asterisk, so any number of those. Then, we'll use a lookahead just looking for any whitespace. So we use \s in this case. We could've used a space if we weren't worried about tabs and so forth. Notice that the ending is a lookahead because we're not actually interested in matching this whitespace and returning the value. Also, notice that I used an asterisk instead of the plus for matching the alphabetical characters or digits, because a plus would require at least one character after the hash mark. So let's search and display all of the hashtags. So re.findall(pattern, health). We can see here that there are lots of Ebola-related tweets in this particular dataset.

This lecture has been an overview of regular expressions. Really, we've just scratched the surface of what you can do.
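The hashtag extraction described above can be sketched as follows. The two tweet lines are invented placeholders in a pipe-separated shape, not rows from the actual dataset.

```python
import re

# Hypothetical stand-in for the pipe-separated tweet data.
health = (
    "1|Mon Jan 05 2015|Tracking the spread of #Ebola in clinics http://example.com/a\n"
    "2|Tue Jan 06 2015|Tips on #fitness and #nutrition for winter http://example.com/b\n"
)

# A hash mark, then any number of word characters or digits, as long as
# whitespace follows. The lookahead checks for the whitespace without
# including it in the returned text.
pattern = r"#[\w\d]*(?=\s)"

hashtags = re.findall(pattern, health)
print(hashtags)
```

One consequence of requiring whitespace in the lookahead is that a hashtag sitting at the very end of the text, with nothing after it, would not be returned by this pattern.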
Now, I actually find regexes really frustrating. They're incredibly powerful, but if you don't use them for a while, you're left grasping for memory of some of the details, especially named groups and lookahead searches. But there are lots of great examples and reference guides on the web, including the Python documentation for regexes. With these in hand, you should be able to write concise and readable code which performs well too. Having basic regex literacy is a core skill for applied data scientists.