Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text

Hey there, how’s it going everybody? In this video, we’re going to be learning how to use regular expressions. We’re actually going to look at regular expressions as a standalone topic, because they aren’t specific to any one programming language. Now, there are some slightly different flavors here and there, but for the most part, whether you’re programming in Python or JavaScript or Java or whatever, if you learn how to use general regular expressions, then it should mostly carry over into your language of choice. It will also allow you to use them in text editors and on the command line and things like that. Now, I am going to do a follow-up video where I show how to use regular expressions specifically in Python, since that’s the language that I cover most on this channel, but for this video we’re going to be learning how to use regular expressions by themselves so that you can apply these to other areas. So with that said, let’s go ahead and get started.

Regular expressions basically allow us to search for specific patterns of text. They can look extremely complicated, but that’s mainly because there’s just so much that you can do with them. You can create a regular expression for just about any pattern of text that you can think of. So let’s see what some of these look like. I have a test file open here that we’re going to use to search for specific patterns, and I’m going to be using the regular expression tool in the Atom text editor to write these regular expressions and find what text matches our patterns. In order to open up this regular expression search tool, I’m just going to go to Find and then Find in Buffer. You could have also opened this up with Command-F on a Mac, and I believe that’s Ctrl-F on Windows. Now, within the options here, make sure that you have the dot-asterisk (.*) selected over here, because that’s going to tell our search tool to use regular expressions, and also select this Match Case option here as well.
That’s just going to give us behavior that is more common to how regular expressions usually behave. Okay, so let’s start writing some regular expressions, and first we’ll start off kind of simple. First of all, we can just search for literal characters. If I search for abc, then we can see here at the top that it highlighted abc, because it matched the abc in our lowercase alphabet. It didn’t match the capital ABC here because the search is case sensitive. Now, this search right now is looking specifically for a, b, and c in that order. If I was to type in something like bca, then we can see that there were no results found, because the order does matter.

Now if we look at this meta character section here, I have some examples of characters that need to be escaped. For example, say you wanted to search for a literal period. If I just type in a period here and hit enter for my search, then we can see that it does this weird thing where it matches everything, and that is because the dot is a special character in regular expressions. We’ll see more of this in just a second, but for now, if we want to actually search for a period or a dot, then we have to escape it, and to escape characters we can use the backslash. So if I do a backslash and then a period (\.) and search, now we can see that it only matches the actual literal dot or period within our document here. That goes for any of these meta characters that I’ve listed here. For example, the backslash is a special character also, so if you wanted to search specifically for a backslash, then you have to escape it with itself: a backslash to escape and then a backslash for the search (\\). If I search for that, then we can see that we matched a literal backslash. A practical example of this might be trying to match this URL right here. If we wanted to match that literal URL exactly, then we could just say coreyms, and then for the dot in the .com we have to escape that with a backslash and then a period, and then
com, and we can see that it matches our URL. Okay, so that’s how you match literal characters. But a literal search isn’t too exciting, because we’re used to that already. Really, we want to use regular expressions to search for patterns, and to do this we’re going to be using some of these meta characters that we were just escaping. I have a snippets file open here, so I’m going to switch over to this, and in here I have a list of values where we can see the types of characters that we can match. Just for now, I’m going to make this into a split screen as we’re walking down this list.

The first one I have listed here is the dot or period, and we can see that this matches any character except a newline. We’ve already seen this, but let’s take a look again. If we just do a dot and search for that, then we can see that it matches any character, except it does not match the newlines. Next on the list is backslash d (\d), and that matches any digit, 0 through 9. If I do a backslash d here and search for that, then you can see that this matches all of our digits, so anything 0 through 9. We also have an uppercase D here (\D), and that matches anything that is not a digit. So let’s search for an uppercase D.
Then we can see that our digits are not matched but everything else is highlighted, so it matched everything except the digits. You’ll notice that this is a common theme here: the uppercase versions of all of these are the ones that negate the search. Moving on down, we have backslash w (\w). That searches for any word character, and a word character is lowercase a through z, uppercase A through Z, 0 through 9, and an underscore. So let’s search for the word character, and we can see that it matches all these lowercase letters, uppercase letters, numbers, and things like that, but it doesn’t match these special meta characters here. And just like with the digit, the uppercase W (\W) will match anything that is not a word character, so anything that is not in this list here. Let’s go ahead and search for that uppercase W, and we can see that it picks up the spaces and this special punctuation and things like that, but it does not match the word characters that we saw before. If you’re not quite getting this just yet, we are going to look at a lot of examples where it’ll start to sink in.

Moving down the list, we have backslash s (\s), which will match any whitespace, and whitespace is a space, tab, or newline. If we search for backslash s, then we can see that it matches our newlines here and our spaces, but it doesn’t match any of these characters in here, so it’s only whitespace. And just like with the others, the capital S (\S) will search for anything that is not whitespace, so now you can see that we have all these lowercase letters, uppercase letters, digits, and also this punctuation: anything that isn’t a newline or a space or anything like that.

Now these bottom ones over here, the backslash b,
the caret, and the dollar sign, are a little bit different. These are called anchors, and they don’t actually match any characters; rather, they match invisible positions before or after characters. Let’s see what I mean by this. Let’s look at where we have these three Ha’s here, and let’s search for a word boundary and then Ha (\bHa) and match that. We can see that the first one matched, because there is a word boundary at the start of this line before this first one, and this space here is also a word boundary, so the second one gets matched as well. But this last one does not get matched, because there’s no word boundary between these two Ha’s here. Just to show what this would look like without the word boundary, if I search for that then you can see that it highlights all three of those. Now, just like with the other ones, the uppercase B (\B) matches anything that is not a word boundary. So if I do an uppercase B, then we can see that we match the one that didn’t match before, because there is no word boundary between these two here, and it doesn’t match these first two. Now if I put word boundaries on both sides of this (\bHa\b), then it should only match this first one, because this is the only one that has a word boundary both at the beginning and at the end. This one has a word boundary at the beginning but not at the end, because it’s in the middle of this word, and this one has a word boundary at the end but not at the beginning.

Okay, so our other two anchors here are pretty similar. The caret matches the position at the beginning of a string and the dollar sign matches the position at the end of a string. Let’s say, for example, that we only wanted to match a Ha if it was at the beginning of a string. If I do a caret and then Ha (^Ha) and match that, then we can see that it only matched this one, because it’s the only one that is at the
beginning of a line. Now if we wanted to only match it if it was at the end, then we could put that dollar sign at the end (Ha$). What we’re saying here is that we only want to match this if the end of the string is in the following position, so we can see that it now only matches this last one, because the end of the string is the next position in line.

Okay, so now that we’ve seen what we can match with these special characters, let’s go ahead and take a look at some practical examples. I’m going to move my snippets file back here, and we will keep referencing that later on, but for now let’s say that we wanted to match a couple of phone numbers, and let’s write some regular expressions to do this. With a phone number, we can’t just type in a literal search like we did before, because all of these are different. They have a similar pattern, but they’re not all the same digits, so in this case we need to use the meta characters instead of literal characters. We have a pattern here of three digits, then a dash or a period, then three more digits, then a dash or a period, and then four digits at the end. We saw before that we can match a digit with a backslash d, and that is going to match all of the digits in our file. We want to match this phone number here, so first we want to match three digits in a row. We can just put in three backslash d’s (\d\d\d), and that will match any three digits in a row. Now that we’re matching those first three digits, we get to where we’re either going to match a dash or a dot in our phone number. So for now,
let’s just match any character that’s in this position. From our snippets file, we saw that if we want to match any character then we can use a dot. We can see that for now our pattern is still matching some other stuff as well, but let’s just continue on. Now that we’re matching this dash or this dot, let’s go ahead and add in the next three digits, so I’ll do three backslash d’s. Now we’re going to want a dot to match any character, which should match that dash or that dot, and now we want four digits, so we can just do four backslash d’s (\d\d\d.\d\d\d.\d\d\d\d). Now we can see that this regular expression highlights both of our phone numbers and matches both of those, so we’re starting to see how this could be pretty useful. For example, I have a data file here. If I pull this up, I have a bunch of fake names and numbers and addresses and emails, and if I wanted to match all of the phone numbers in this file, then you can see that the regular expression that we just wrote matches all of the phone numbers here. So now we’re starting to get a sense of how this could be more useful than just a literal search, because now we’re actually searching for a specific pattern.

Now let me go back to our simple text file here, and let’s get a little bit more specific. Let’s say that we only wanted to match a phone number if it had a dash or a dot. Right now this pattern will match any separator, because we’re using the period, which will match any character. So if I put in another number here that doesn’t have a regular separator, let’s just say it’s an asterisk, then we can see that it matches this number as well, even though the asterisk isn’t really a phone number separator. To only match the dash or the dot, we’re going to have to use a character set, and a character set uses square brackets with the characters that we want to match. To create a character set, I’m going to replace our first dot here, and this is
going to be square brackets. Now this is a character set, and within this character set we put the characters that we want to match. We want to match either a dash or a dot ([-.]), and I will just copy that and replace this second dot here, which was matching any character, and put that in for that as well. Now you can see that it only matches our phone numbers here that have a dash or a dot separator, and it does not match this one with the weird asterisk. You probably also noticed that we didn’t need to escape our dot character within our character set, and that’s because character sets have some slightly different rules. You can escape these characters if you’d like, but it just makes it a lot more difficult to read if you do that.

Now, even though the character set has multiple characters in the set, it’s still only matching one character in our text. It’s matching one character that is either a dash or a period. If I was to put in, let’s say, two dashes here into one of these numbers, then you can see that now it doesn’t match that number, because it’s only matching the first dash or dot, and then it moves right on to looking for a digit in the next position. That’s something that can throw people off when they first start working with regular expressions. Even though we have four characters total here, counting the square brackets and all of the characters in the set, it’s still only searching for one literal character in the text, which is either a dash or a dot.

To show another example of this, let’s say that we only wanted to match 800 and 900 numbers. I’m going to create two different numbers here, an 800 number and a 900 number. If we only wanted to match 800 and 900 numbers, then for our first three digits we have to do something different. First, we want the first digit that we’re going to match to be either an 8 or a 9, so we can do a character set and
we can say that we’re looking to either start with an 8 or a 9 ([89]). The following two numbers are going to be 0 0, and that’s just a literal search, so now you can see that we’re finding the 800 and 900 numbers here.

Now, within our character set the dash is actually a special character as well. When it’s put at the beginning or the end of the character set, it will just match the literal dash, but when it’s placed between values, it specifies a range of values. For example, we know that the backslash d matches any digit, but what if we only wanted to match digits between, let’s say, 1 and 7? To do that we can use a character set, and instead of typing out 1 2 3 4 5 6 7, we can specify a range of those values and just say 1-7 ([1-7]). Now we can see that we’re matching all of the digits between 1 and 7, but the 8, the 9, and the 0 aren’t getting matched. You can do this with letters as well. If we only wanted to match the lowercase letters a through z, then we could just do a character set of a through z ([a-z]). Now you can see that the capital letters aren’t getting matched, but the lowercase ones are. If we wanted to match the uppercase and lowercase letters, then we could just put our ranges back-to-back. I could say a through z and then add on to this character set and say capital A through capital Z ([a-zA-Z]), and now we’re matching all letters regardless of whether they are uppercase or lowercase. You could keep adding to those ranges if you wanted to. You could do a 0 through 9 there as well to add in all digits.

Another special character in our character set is the caret. We saw before that outside of the character set it matches the beginning of a string, but as the first character within the character set it negates the set and matches everything that is not in the set. For example, if we wanted to match every character that is not a lowercase letter, then we could say this caret and then a through z ([^a-z]), so we can see that it
matches everything on our screen that isn’t a lowercase letter. It’s not matching these lowercase letters here, but it’s even matching these newlines and the spaces and everything. Just to show another example of this, let’s say that we had some words here: cat, mat, hat, and bat. Let’s say that we wanted to match every word that ends in at except bat. To do this, we can say that we want a character set of everything that is not b, followed by at ([^b]at). Now we can see that it matches all of these three-letter words that end in at except for bat, because our character set here negated that b.

Everything that we’ve looked at so far has involved single characters, like in this example right here, where we’re matching any single character that is not a b, followed by an a, and then followed by a t. But we can actually use these things called quantifiers to match more than one character at a time. Let’s go back to our original phone number example from earlier, and we’ll match any character like we did before, so I will do three digits, then a period for any character, then three digits again and a period for any character, and then four digits at the end. I’m just going to remove what we had there for an example and scroll those back up. To see what quantifiers we have available, I’m going to make my snippets half of my screen here again and scroll down to my quantifier section. The asterisk (*) will match zero or more of what we’re searching for, the plus sign (+) will match one or more, and the question mark (?) will match zero or one. To match exact numbers, we can use these curly braces with a number on the inside, so in this example ({3}) this would match exactly three of what it is we’re looking for. We can also specify a range of numbers, with the first number being the minimum and the last number being the max, so this ({3,4}) would look for three or four of whatever our pattern is. So let’s take a look at an
example of this to see how this works. You can see that with our phone number we are searching for one digit at a time. But we could change this. If I erase my digits here, then I can say that I’m searching for a digit, and then put in our quantifier for exactly three digits (\d{3}). We can do this after our separator as well, so we’re searching for three digits and then any character, and then here at the end we want to match four digits (\d{3}.\d{3}.\d{4}). So instead of writing out the same character over and over, we can see how these quantifiers allow us to specify exactly how many we want.

Here we’re matching exact numbers, but sometimes we don’t know the exact number and we’ll need to use one of these other quantifiers. For example, here at the bottom of this test file we have some lines where each starts with a prefix of Mr, Ms, or Mrs. Let’s say that we wanted to match these prefixes as well as the names after. Just to start, let’s match the names that start with Mr. We can see that some of these have a period after the prefix and some do not; some of them just have a space. So let’s start our regular expression by searching for lines that start with Mr, and then we’re going to put a backslash period to search for that literal period (Mr\.). Right now it isn’t matching this Mr
Smith, which doesn’t have a period after the prefix. To match that also, we can use this question mark quantifier, which tells our pattern that we want to match 0 or 1 of that character. If I put a question mark after that literal period (Mr\.?), then it’s saying that there can be 0 periods there or there can be 1. We can see that now it’s matching the ones with 1 period there and it’s also matching the one with no period. Now, to continue and match the entire line, we want to match a space after that, and after the space we want to match any uppercase letter. To do that we can use our character set, and we can match any uppercase letter by doing a range of uppercase letters there ([A-Z]). At this point, after that first uppercase letter, we’ve completely matched the name for Mr. T down here at the bottom, but we still need to match the rest of our other names. We could say that we will match any word character after that uppercase letter, so let’s put in a backslash w to match any word character. Now, we don’t know how many more characters are going to be in our name, so we’ll have to use a quantifier here. If we look over here, we could use the asterisk or the plus sign. The plus sign will match one or more of these word characters and the asterisk will match zero or more. If we use the plus sign, then we can see that it matches our two top names here, but now it’s not matching this Mr. T, because it’s searching for one or more word characters after our uppercase character. So a better solution in this case may be to use the asterisk, which matches zero or more word characters. If we use that asterisk (Mr\.? [A-Z]\w*), then we can see that it matches all three of our names that begin with Mr.

Now, I know that we’ve covered a lot so far, but we’ve got a couple more concepts to go, and then we’ll look at some examples that wrap everything together. We still haven’t matched our Ms or Mrs names here. So how would we do that?
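To recap the anchors, character sets, and quantifiers covered so far, here is a minimal sketch in Python. Python is only an assumption for illustration (the walkthrough itself uses Atom’s search box), and the sample strings are made up to resemble the test file rather than copied from it.

```python
import re

# Hypothetical sample text; the actual test file from the video is not reproduced here.
ha_line = "Ha HaHa"
phones = "321-555-4321 123.555.1234 123*555*1234"
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

# \b is a word boundary: only the first two "Ha" occurrences start at one.
boundary_starts = [m.start() for m in re.finditer(r"\bHa", ha_line)]

# [-.] matches exactly one separator character; {3} and {4} repeat \d.
phone_matches = re.findall(r"\d{3}[-.]\d{3}[-.]\d{4}", phones)

# \.? makes the period optional; \w* allows zero or more trailing word characters.
star_matches = re.findall(r"Mr\.? [A-Z]\w*", names)

# \w+ requires at least one more word character, so "Mr. T" is missed.
plus_matches = re.findall(r"Mr\.? [A-Z]\w+", names)

print(boundary_starts)  # [0, 3]
print(phone_matches)    # ['321-555-4321', '123.555.1234']
print(star_matches)     # ['Mr. Schafer', 'Mr Smith', 'Mr. T']
print(plus_matches)     # ['Mr. Schafer', 'Mr Smith']
```

The number with asterisk separators is skipped because an asterisk is not in the character set, and the difference between the last two results is the same asterisk-versus-plus difference described for Mr. T.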
You might think that we could use a character set that matches either an r or an s, and there are maybe some ways that we could get that to work, but it would probably be a bit ugly, since we’d have to match either an r or an s as the second character and then the optional s after that. I think a better solution here would be to use a group. We haven’t looked at groups yet, but groups allow us to match several different patterns, and to create a group we use parentheses. So after the M here, instead of just searching for Mr, I’m going to create a group with open and close parentheses, and within our group we can specify different matches. I can say that we want to match either an r, and then to specify an or we use the vertical bar character (|), so we can say that we want to match an r or an s. When we add that in, we can see that now we’re matching the Ms name here, but we’re still not matching this Mrs. To match the Mrs, we can put in another or and say that we want to match an rs (M(r|s|rs)). Okay, so now we can see that we are matching all of our names here.

Let’s do a quick walkthrough of this one more time to make sure we know what’s going on. We have a capital M to start, and then that capital M is followed by either an r, an s, or an rs. Then we are looking for a literal period, and this question mark says that we can have zero or one of those, so that is optional. It’s matching the ones that do have that period and the ones that don’t. After that we are matching a space, and after that space, for the first letter of the last name, we’re looking for any capital letter, so we have a character set here that is A through Z of capital letters. And then, for the rest of the last name,
we are matching zero or more word characters. Now, these groups can actually be used to capture sections of your matched regular expression, and that’s something that we’ll look at in just a minute. But for now, let’s do a quick recap of everything that we’ve learned so far and look at some examples that incorporate all of these things together. I’m going to move my snippets back into the group here and open up this file, emails.txt. I’ve got a file here with three fairly different email addresses, so let’s try to write a regular expression that will match all of these emails.

Let’s just match the first email address first and see what that looks like. In the first email address we have a mix of upper and lowercase letters before we hit this @ symbol, so let’s go ahead and match those first. To match any upper or lowercase letters, we can do a character set with a lowercase a through z and an uppercase A through Z. Right now this is only matching single characters, so we can use the plus quantifier to say that we want one or more of these upper or lowercase letters ([a-zA-Z]+). We’re still working on the first email address here. We have our upper and lowercase letters, and now we want to match that @ symbol, so I’ll just put in a literal @ symbol. For the domain name, I’ll do another search for any upper or lowercase letters, so I’ll do the same as we did before, with a plus sign for a quantifier to match any upper or lowercase letters after that @ symbol. That’s when we hit the end with the .com. To match the .com, we can do a backslash period for the dot and then fill in a literal com ([a-zA-Z]+@[a-zA-Z]+\.com). So now we’ve successfully matched that first address.

It looks like it’s not matching the second address, so let’s see why and see if we can mold this to match the second address as well. We can see that the second address has a dot in the first part of the name here, so
let’s add a dot to our first character set so that dots are included in that character set. Now it’s still not matching that second address, and that’s because at the end here we don’t have a .com but a .edu. In order to search for both, we can use a group like we saw before, using open and close parentheses, and search for either com or edu ((com|edu)). Okay, so we are building this up a little bit at a time, and we can see that we are now matching our second email address.

Now let’s see if we can change this to match our third email address. In our third email address, it looks like before the @ symbol we also have some hyphens and some numbers in the first part, so let’s add those to the character set as well. Back here, after our capital letters, I’m also going to add in digits by doing 0 through 9, and we also want to add a dash in there as well, so that should match everything before the @ symbol. It looks like we also have a dash in our domain, so we’ll have to add that to the character set after the @ symbol as well.
Right now that set is just lowercase and uppercase letters, but we can put a dash in there as well. Lastly, it’s still not matching, because unlike the other two we have a .net here instead. So we can just add in a second or at the end and also include net ((com|edu|net)). We can see that we built that up a little bit at a time to match all three of our email addresses.

Now, with something like email addresses it can be pretty tough writing your own regular expressions from scratch, but there are a lot of these available online, and once we learn how to write regular expressions, we should be able to read them and figure out how they’re matching as well. I’ve always found reading other people’s regular expressions to be a lot harder than writing them, but let’s take a look at one and see if we can do this. I have an expression here that I pulled from online that matches email addresses, so let’s paste this in and see if we can read through what this is matching. We can see that the one that I got online does match all three of my email addresses here. Now let’s look through this. We can see that it’s somewhat similar to what we had before. First we have a character set here, and it’s a pretty large character set: it matches lowercase letters, uppercase letters, any number, an underscore, a period, a plus sign, or a hyphen. Then the plus sign here says that we want to match one or more of any of those characters, and we match one or more of those characters all the way up until we hit an @ sign. After the @ sign we have another character set, and in this character set we have lowercase letters, uppercase letters, any digits, and also a hyphen. I don’t know a lot about email addresses, but I’m assuming that since they left out the underscore, the period, and the plus sign that were in the first part of the email address, those aren’t allowed in the domain. Then we have a plus sign after that character set, which
means that we’re matching one or more of any of those characters all the way up until we reach this literal dot, and that literal dot is escaped with a backslash. After the dot we have another character set, and this character set is any lowercase letter, any uppercase letter, any digit, a hyphen, or a period, and that is followed by a plus sign, which matches one or more of anything in that character set. Just like I did with the phone numbers, if we open up our data file with this regular expression typed in, then we can see that it does match all of the email addresses in this data file as well, so we’ve got an expression that will match email addresses fairly well. Doing what we just did, reading through a regular expression written by other people, is probably the hardest part of all this, but if you walk through it bit by bit, then you should be able to break down just about any pattern.

Okay, so the last thing that I’d like to look at in this video is how to capture information from groups. We’ve already seen how to match groups, but we can actually capture and use the information from those groups. To show an example of this, I’m going to open up a file here with some URLs.
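Before moving on to the URLs, the group-with-alternatives idea and the two email patterns can be sketched in Python. Treat this as an illustration under assumptions rather than the exact expressions from the video: Python is an assumed language choice, the sample names and addresses are made up, and both patterns are reconstructed from the spoken description, with the hyphen placed last in each character set so it is read as a literal.

```python
import re

# Hypothetical sample data modeled on the names and addresses described in the video.
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"
emails = [
    "CoreyMSchafer@gmail.com",
    "corey.schafer@university.edu",
    "corey-321-schafer@my-work.net",
]

# Group with alternatives: M followed by r, s, or rs, then an optional period.
title_pattern = re.compile(r"M(r|s|rs)\.? [A-Z]\w*")
# group(0) is the whole match; findall alone would return only the group.
title_matches = [m.group(0) for m in title_pattern.finditer(names)]

# Pattern built up step by step, limited to three top-level domains.
built_up = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)")

# More general pattern in the style of the one found online.
found_online = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

built_up_ok = [built_up.fullmatch(e) is not None for e in emails]
found_online_ok = [found_online.fullmatch(e) is not None for e in emails]

# An address outside com/edu/net shows the difference between the two patterns.
other = "someone@example.org"
built_up_other = built_up.fullmatch(other) is not None
found_online_other = found_online.fullmatch(other) is not None

print(title_matches)
print(built_up_ok, found_online_ok)
print(built_up_other, found_online_other)  # False True
```

The hand-built pattern only accepts the endings listed in its group, which is why the more general pattern is the better fit for a larger data file.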
Okay, so we can see here that some of the URLs are http and some are https. Also, some of these have www before the domain name and some do not. Let’s say that you had a list of a lot of different URLs within your document and you only wanted to grab the domain name and the top-level domain, which is the .com or .gov. For example, out of all these URLs you only wanted to grab google.com or coreyms.com or youtube.com or nasa.gov, and you wanted to ignore everything else. Let’s see how we can do this. First, let’s write an expression that actually matches these URLs, so let me get rid of the one that we currently have. To match this, we can say all of these start with http, and then the s is optional, so we can say s and then put in a question mark to say that we want to match 0 or 1 of the s. After that optional s we want a colon, forward slash, forward slash (https?://). At this point, some of these URLs have a www. and some do not, so we can put the www and an escaped dot in a group and follow that group with a question mark to make it optional ((www\.)?). Now you can see that on all of our URLs we’ve matched up to the domain name. To complete this, I’m just going to say any word character, so backslash w, and I will put in a plus sign to say one or more of those word characters. Then we get to the top-level domain. We want to match a literal dot, so we’ll do a backslash dot, and then for the rest of that top-level domain I will just do any word character one or more times, so a word character with a plus sign (https?://(www\.)?\w+\.\w+). Okay, so we can see that this matches all of our URLs. But the point here was to use our groups to capture some information from our URLs, so let’s capture the domain name and the top-level domain, which is the .com or the .gov and things like that. To capture these sections,
we can just put them in a group by surrounding them in parentheses. What we want to group here is our domain name, and the domain name is this part right here, this string of one or more word characters, so I’m just going to wrap those in parentheses and create a group like we’ve seen before. We also want to put the top-level domain in a group as well, that is the .com or the .gov, so we can put parentheses around that dot and the string of one or more word characters at the end (https?://(www\.)?(\w+)(\.\w+)). Okay, so we can see that we’re still matching all of our URLs here, but now we have three different groups. Our first group is just that optional www., our second group is the word characters that make up our domain name, and the third group is that top-level domain. There’s also an implicit group 0, and group 0 is everything that we matched, so in this case it’s the entire URL.

Now let’s get to the cool part about this, so let me show you what we can do now that we’ve captured these. We can use something called a back reference to reference our captured group. For example, here in Atom we have the ability to replace our matches; we can see down here that we can replace. So let’s replace all of our matches with just the literal text “group 1”, then a colon, and then a dollar sign 1 ($1). This dollar sign 1 is a reference to our first group. Sometimes this is written with a backslash, but in Atom they use a dollar sign. If I do a Replace All here, then we can see that it replaced our matches with this literal text “group 1”, but it also replaced the dollar sign 1 with our first captured group, and the first captured group is that optional
www. For the URLs that had it, we can see that it shows up, and for the ones that didn’t, it doesn’t have anything. Let me undo this, and now let’s replace our matches with the second group, which should be the domain name. If I do a Replace All now, then we can see that it says group 2 is google, coreyms, youtube, and nasa. If I undo that and replace this with group 3, then group 3 should give us our top-level domain, so our group 3 is the .com, .com, .gov, things like that. Let me undo this one more time. Now that we know how to use those back references, we can actually take our URLs and clean them up like we meant to before. We could convert these to a cleaned-up version without the http or the www just by replacing our matches with the domain name, which is group 2, followed by the top-level domain, which is group 3 ($2$3). If I replace all of our selections with those two, then we can see that it replaced all of our URLs with just the domain name and the top-level domain. So you can imagine, if you had a lot of information like this that you needed to clean up or modify in some way, then knowing how to match these groups with regular expressions could save you a ton of time.

Okay, so I think that’s going to do it for this video. There are a lot of advanced features that we could go over with regular expressions as well, so if anyone is interested in learning more, then I could put together an advanced video covering those topics in the near future. But hopefully now you feel comfortable with being able to read and write the regular expressions that we went over in this video. If anyone does have any questions about what we covered in this video, feel free to ask in the comment section below, and I’ll do my best to answer those. And if you enjoy these tutorials and would like to support them, there are several ways you can do that. The easiest way is to simply like the video and give it a thumbs up, and also
it’s a huge help to share these videos with anyone who you think would find them useful. If you have the means, you can contribute through Patreon, and there’s a link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
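As a recap of the capture-group cleanup from the URL example, here is a minimal sketch in Python. It is an illustration under assumptions: Python is an assumed language choice, the URL list is made up from the domains mentioned in the walkthrough, and Python’s re.sub writes back references with a backslash (\2) where Atom’s replace box uses a dollar sign ($2).

```python
import re

# Hypothetical URL list based on the domains mentioned in the video;
# which entries carry "www." or "https" is an assumption.
urls = """https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov"""

# Group 1: optional "www.", group 2: domain name, group 3: top-level domain.
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# Group 1 is None wherever the optional "www." was absent.
first_groups = [m.group(1) for m in pattern.finditer(urls)]

# Replace each whole match (group 0) with group 2 followed by group 3.
cleaned = pattern.sub(r"\2\3", urls)

print(first_groups)
print(cleaned)
```

Only groups 2 and 3 are used in the replacement, so the scheme and the optional www. prefix are dropped from every line.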
