Regular Expressions
Oct. 22nd, 2005 11:26 am
This is the kind of thing I never, ever post. But I wrote it for a few people at work, and if you've ever been scared by a regular expression, you might find this unscares you, just a little bit. As a note, the code is in C#, but that shouldn't really matter for most purposes.
If you're anything like me, then regular expressions look scary as hell - large blobs of text stuck together with no real rhyme or reason to them.
They're one of those things that took me a while to really get to grips with, despite being sure that they could make my life easier. Here are a couple of simple examples of how to extract data using them, so you can see how they can make your life easier.
In the following examples searchText is
[M K12345678]
You can search a string for a particular substring, like so:
Match m = Regex.Match(searchText, "[Kk][0-9]{8,10}");
which searches searchText for a K or a k, followed by between 8 and 10 digits (characters in the range 0 to 9).
So the square brackets surround the different possibilities for a single character (no separator is needed between them), and the curly brackets say how many times the preceding character or character group can repeat.
You can then follow it up with
Identifier = m.Value;
to get the string out.
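Putting those pieces together, a minimal self-contained sketch might look like the following. The class and method names (SubstringSearch, TryFindIdentifier) are made up for illustration and are not part of the original code. One thing worth knowing: inside square brackets a | is just another character rather than "or", so a class written as [K|k] would also accept a literal |; the sketch uses [Kk].

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class SubstringSearch
{
    // K or k, followed by 8 to 10 digits.
    public const string Pattern = "[Kk][0-9]{8,10}";

    // Returns true and sets identifier to the matched text when the pattern is found.
    // When nothing matches, Match.Value is an empty string.
    public static bool TryFindIdentifier(string searchText, out string identifier)
    {
        Match m = Regex.Match(searchText, Pattern);
        identifier = m.Value;
        return m.Success;
    }

    public static void Main()
    {
        string identifier;
        if (TryFindIdentifier("[M K12345678]", out identifier))
        {
            Console.WriteLine(identifier); // prints K12345678
        }

        // Inside [...] the | is a literal character, not "or".
        Console.WriteLine(Regex.IsMatch("|12345678", "[K|k][0-9]{8,10}")); // prints True
        Console.WriteLine(Regex.IsMatch("|12345678", Pattern));            // prints False
    }
}
```

Checking Success before using Value is a good habit, since a failed match quietly gives you an empty string rather than an error.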
If you know the structure of the text you're looking at, but not what you're looking for (i.e. you know that you want what lies after the space and before the closing ], but not what form it's in) then you can use something like the following:
GroupCollection gc = Regex.Match(searchText,@"(\[)(?<Type>.)( )(?<Key>.*)(\])").Groups;
Each 'group' is enclosed in round brackets (parentheses), so the first one is:
\[
And the \ is there because [ is a special character in regular expressions, and the \ means "treat it like a normal character".
The second one is
.
which means "any single character" - and the ?<Type> next to it means that the group is named "Type".
The third one is a space.
The fourth one is
.*
which means "any character", repeated as many times as possible (including not at all) - i.e. A* would match an empty string, "A", "AA" or "AAAAAAAAA", etc. This group is called "Key".
and finally
\]
which means ], the \ again meaning "treat this special character like an ordinary one"
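To see what each group actually captures, something like the following sketch can help. GroupWalkthrough and Capture are illustrative names, and the second input string is invented to show how greedy .* is: it grabs as much as it can and only gives characters back until the final \] can match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, named for illustration only.
public static class GroupWalkthrough
{
    public const string Pattern = @"(\[)(?<Type>.)( )(?<Key>.*)(\])";

    // Returns the text captured by the named group, or "" if it did not match.
    public static string Capture(string input, string groupName)
    {
        return Regex.Match(input, Pattern).Groups[groupName].Value;
    }

    public static void Main()
    {
        Regex regex = new Regex(Pattern);
        Match m = regex.Match("[M K12345678]");

        // List every group (numbered and named) with the text it captured.
        foreach (string name in regex.GetGroupNames())
        {
            Console.WriteLine("{0}: '{1}'", name, m.Groups[name].Value);
        }

        // Greedy .* runs to the LAST closing bracket, not the first.
        Console.WriteLine(Capture("[M K1] and [N K2]", "Key")); // prints K1] and [N K2
    }
}
```

Group "0" in the listing is always the whole match; the others are the bracketed groups described above.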
I can then extract the Key out of it with:
Identifier = gc["Key"].Value;
which gets the entry in the GroupCollection called "Key" - in this case
"K12345678".
So both of the regexes above have the same effect on this input - i.e. they'd each extract K12345678 out of searchText.
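They agree on this particular input, but they are not interchangeable. The sketch below (PatternComparison and its methods are illustrative names, and "[M ABC-42]" is an invented input) runs both on a key that does not look like a K followed by digits: the first pattern finds nothing, while the structural one still pulls out whatever sits between the space and the closing ].

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, named for illustration only.
public static class PatternComparison
{
    // Looks for text of a known form: K or k plus 8 to 10 digits.
    public static string BySubstring(string searchText)
    {
        return Regex.Match(searchText, "[Kk][0-9]{8,10}").Value;
    }

    // Looks for text in a known position: after the space, before the last ].
    public static string ByStructure(string searchText)
    {
        Match m = Regex.Match(searchText, @"(\[)(?<Type>.)( )(?<Key>.*)(\])");
        return m.Groups["Key"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(BySubstring("[M K12345678]")); // prints K12345678
        Console.WriteLine(ByStructure("[M K12345678]")); // prints K12345678

        Console.WriteLine(BySubstring("[M ABC-42]"));    // prints an empty line (no match)
        Console.WriteLine(ByStructure("[M ABC-42]"));    // prints ABC-42
    }
}
```

Which one you want depends on whether you trust the form of the value or the layout around it.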
no subject
Date: 2005-10-22 10:50 am (UTC)
I think I'll stick to Perl, if you don't mind. (All that stuff about needing to invoke methods to get matches out of target text is doing my head in. What's wrong with $target =~ /regexp/i; ...?)
no subject
Date: 2005-10-22 11:30 am (UTC)
Of course, you don't need to go through multiple steps - you can do what you're saying (I think - I don't know Perl) with:
target = Regex.Match(SearchText,Pattern).Value;
The advantage of the Match being an object is that it has various properties built in such as all the captures and groups, or the index in the original string where it was found.
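As a rough illustration of what the Match object carries around, here is a small sketch using the sample text from the post. MatchProperties and Find are illustrative names only; the properties themselves (Success, Value, Index, Length, Groups) and NextMatch() are standard members of Match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, named for illustration only.
public static class MatchProperties
{
    public static Match Find(string searchText)
    {
        return Regex.Match(searchText, "[Kk][0-9]{8,10}");
    }

    public static void Main()
    {
        Match m = Find("[M K12345678]");

        Console.WriteLine(m.Success);      // prints True
        Console.WriteLine(m.Value);        // prints K12345678
        Console.WriteLine(m.Index);        // prints 3 (zero-based position of the K)
        Console.WriteLine(m.Length);       // prints 9
        Console.WriteLine(m.Groups.Count); // prints 1 (group 0, the whole match)

        // NextMatch() carries on searching from where this match ended.
        Console.WriteLine(m.NextMatch().Success); // prints False
    }
}
```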
no subject
Date: 2005-10-22 04:08 pm (UTC)
PHP's regex functions are quite often a huge pain in the ass.
no subject
Date: 2005-10-22 08:14 pm (UTC)
Next :P.
no subject
Date: 2005-10-22 08:17 pm (UTC)
b) That tells whether the pattern's present. Sorting through the infinite arrays when you put in a $matches array for output can often be a huge pain in the ass to deal with, especially if you have multiple areas you're saving in the backreferences.
no subject
Date: 2005-10-22 08:28 pm (UTC)
It's not that awkward imo. You can easily ( preg_match($regex,$string,$matches); $echo = $matches[2]; echo $echo; ) use it to grep stuff or change it - yeah, it's awkward, and I LOATHE arrays too, but that's probably due to being a n00b at 16 :). There are loads of other ones, but yeah, it can be a pain.....
no subject
Date: 2005-10-23 05:40 am (UTC)
Understanding that namespaces were actually just a way of not having to type in the full name of the class was handy too. So that whereas the full name of the "Match" class is "System.Text.RegularExpressions.Match", if you have "using System.Text.RegularExpressions;" at the top of your code, it will automatically look for the Match class with that prepended, without you having to type the whole damn thing in each time.
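A tiny sketch of that point, assuming nothing beyond the standard library (NamespaceDemo and SameResult are illustrative names): the fully qualified and the short spellings refer to exactly the same classes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, named for illustration only.
public static class NamespaceDemo
{
    public static bool SameResult()
    {
        // Fully qualified names: these work even without the using directive.
        System.Text.RegularExpressions.Match full =
            System.Text.RegularExpressions.Regex.Match("K12345678", "[0-9]+");

        // Short names: resolved through "using System.Text.RegularExpressions;".
        Match brief = Regex.Match("K12345678", "[0-9]+");

        return full.GetType() == brief.GetType() && full.Value == brief.Value;
    }

    public static void Main()
    {
        Console.WriteLine(SameResult()); // prints True
    }
}
```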
no subject
Date: 2005-10-22 12:30 pm (UTC)
no subject
Date: 2005-10-23 05:14 pm (UTC)
http://www.codeproject.com/dotnet/expresso.asp
which is a version tailored to .Net conformant regexes.
Haven't checked out Regulator though - I'll give it a look at some point.
no subject
Date: 2005-10-22 06:48 pm (UTC)
no subject
Date: 2005-10-23 05:37 am (UTC)
Putting ?<MyName> at the start of a group in a Regex means "store the results for this group in an easily retrieved place, accessible as MyName".
So, once I told it to store the first "." as "Type", I could then say 'what is gc["Type"]?' to retrieve it. The word "Type" is purely arbitrary, as is "Key".
no subject
Date: 2005-10-23 05:08 pm (UTC)
no subject
Date: 2005-10-23 05:12 pm (UTC)
I've rectified it now, and if you look at the original example it should be clear.
no subject
Date: 2005-10-23 05:16 pm (UTC)
no subject
Date: 2005-10-22 08:17 pm (UTC)