andrewducker | Regular Expressions

This is the kind of thing I never, ever post. But I wrote it for a few people at work, and if you've ever been scared by a regular expression, then you might find this unscares you, just a little bit. Oh - as a note, the code is in C#, but that shouldn't really matter for most purposes.

If you're anything like me, then regular expressions look scary as hell - large blobs of text stuck together with no real rhyme or reason to them.
They're one of those things that took me a while to really get to grips with, despite being sure that they could make my life easier. Here are a couple of simple examples of how to extract data using them, so you can see how they can make your life easier.

In the following examples searchText is
[M K12345678]

You can search a string for a particular substring, like so:
Match m = Regex.Match(searchText,"[Kk][0-9]{8,10}");
which searches searchText for a K or a k, followed by between 8 and 10 digits (characters between 0 and 9).
So the square brackets surround different possibilities for a particular character (just list them; a | inside square brackets is treated as a literal character), and the curly brackets can be used to say how many times a character group can repeat.

You can then follow it up with
Identifier = m.Value;
to get the matched string out.
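
Put together, the steps above might look like the sketch below. The class name RegexDemo and the helper ExtractIdentifier are made up for illustration, and the character class is written as [Kk], since every character listed inside square brackets is already an alternative. It also checks m.Success, because m.Value is just an empty string when nothing was found.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Hypothetical helper: returns the first K-number found in the text,
    // or null if there isn't one.
    public static string ExtractIdentifier(string searchText)
    {
        // [Kk]         one character, either K or k
        // [0-9]{8,10}  between 8 and 10 digits
        Match m = Regex.Match(searchText, "[Kk][0-9]{8,10}");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractIdentifier("[M K12345678]")); // K12345678
    }
}
```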

If you know the structure of the text you're looking at, but not what you're looking for (i.e. you know that you want what lies after the space and before the closing ], but not what form it's in) then you can use something like the following:

GroupCollection gc = Regex.Match(searchText,@"(\[)(?<Type>.)( )(?<Key>.*)(\])").Groups;

Each 'group' is enclosed in round brackets (parentheses), so the first one is:
\[
And the \ is there because [ is a special character in regular expressions, and the \ means "treat it like a normal character"

The second one is
.
Which means "any single character" - and the ?<Type> next to it means that the group is named "Type".

The third one is a space.

The fourth one is
.*
which means "any character" - "as many times as possible, including not at all" - i.e. A* would match nothing at all, or "A" or "AA" or "AAAAAAAAA", etc. This group is called "Key".

and finally
\]
which means ], the \ again meaning "treat this special character like an ordinary one"
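
To see what "as many times as possible" means in practice, here is a small sketch. The input with two bracketed entries and the lazy .*? variant are assumptions added for illustration, and GreedyDemo and KeyOf are made-up names. With a single pair of square brackets it makes no difference, but with two pairs the .* runs on to the last ] while .*? stops at the first one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: runs a pattern and returns the group named "Key".
    public static string KeyOf(string text, string pattern)
    {
        return Regex.Match(text, pattern).Groups["Key"].Value;
    }

    public static void Main()
    {
        string greedy = @"(\[)(?<Type>.)( )(?<Key>.*)(\])";
        string lazy   = @"(\[)(?<Type>.)( )(?<Key>.*?)(\])";

        // One pair of brackets: both patterns find the same key.
        Console.WriteLine(KeyOf("[M K12345678]", greedy)); // K12345678

        // Two pairs: .* takes as much as it can, .*? as little as it can.
        Console.WriteLine(KeyOf("[M K1] [N K2]", greedy)); // K1] [N K2
        Console.WriteLine(KeyOf("[M K1] [N K2]", lazy));   // K1
    }
}
```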

I can then extract the Key out of it with:
Identifier = gc["Key"].Value;
which gets the entry in the GroupCollection called "Key" - in this case
"K12345678".

So both of the regexes above have the same effect - i.e. they'd extract K12345678 out of searchText.
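
As a rough end-to-end sketch of the named-group version, the following pulls out both "Type" and "Key" and then checks that the shape-based regex finds the same text. NamedGroupDemo and TryParseEntry are hypothetical names, not anything from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical helper: splits "[M K12345678]" into its type and key.
    // Returns false if the text doesn't have the expected shape.
    public static bool TryParseEntry(string searchText, out string type, out string key)
    {
        Match m = Regex.Match(searchText, @"(\[)(?<Type>.)( )(?<Key>.*)(\])");
        GroupCollection gc = m.Groups;
        type = gc["Type"].Value; // empty string when there is no match
        key = gc["Key"].Value;
        return m.Success;
    }

    public static void Main()
    {
        string searchText = "[M K12345678]";
        string type, key;
        if (TryParseEntry(searchText, out type, out key))
        {
            Console.WriteLine(type); // M
            Console.WriteLine(key);  // K12345678
        }
        // The regex that describes the key's shape finds the same text.
        Console.WriteLine(Regex.Match(searchText, "[Kk][0-9]{8,10}").Value == key); // True
    }
}
```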



autopope.livejournal.com

Wow, that's cumbersome!

I think I'll stick to Perl, if you don't mind. (All that stuff about needing to invoke methods to get matches out of target text is doing my head in. What's wrong with $target =~ /regexp/i; ...?)

andrewducker

Everything in C# is a class, and methods/properties are how you get things out of it. There are only 8 built in functions, mostly to do with operations on types. Everything else is provided by classes.

Of course, you don't need to go through multiple steps - you can do what you're saying (I think - I don't know Perl) with:

target = Regex.Match(SearchText,Pattern).Value;

The advantage of the Match being an object is that it has various properties built in such as all the captures and groups, or the index in the original string where it was found.
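
For instance, a minimal sketch of reading a few of those properties might look like this. MatchPropertiesDemo and Describe are made-up names; Success, Value, Index and Length are properties of the Match class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchPropertiesDemo
{
    // Hypothetical helper: describes what was matched and where.
    public static string Describe(string searchText, string pattern)
    {
        Match m = Regex.Match(searchText, pattern);
        if (!m.Success)
        {
            return "no match";
        }
        // Index is the zero-based position of the match in searchText.
        return m.Value + " at index " + m.Index + ", length " + m.Length;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("[M K12345678]", "[Kk][0-9]{8,10}"));
        // K12345678 at index 3, length 9
        Console.WriteLine(Describe("[M K12345678]", "[0-9]{20}"));
        // no match
    }
}
```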

theferrett.livejournal.com

That explains much of C#, which actually makes it less frightening.

PHP's regex functions are quite often a huge pain in the ass.

azalemeth.livejournal.com

$var = pregmatch($text,'\w{0,15}')

Next :P.

a) It's preg_match().

b) That only tells you whether the pattern's present. Sorting through the infinite arrays when you put in a $matches array for output can often be a huge pain in the ass to deal with, especially if you have multiple areas you're saving in the backreferences.

Yeah, preg_match (and co), sorry brain fart :P.

It's not that awkward imo. You can easily ( preg_match($regex,$string,$matches); $echo = $matches[2]; echo $echo; ) use it to grep stuff or change it - yeah, it's awkward, and I LOATHE arrays too, but that's probably due to being a n00b at 16 :). There are loads of other ones, but yeah, it can be a pain.....

Yeah - it took me a while to work out why all the C# commands were so long - and why everything looked so complex. Once I realised that it was just because every bit of code had to be stored in a class/object, it made more sense.

Understanding that namespaces were actually just a way of not having to type in the full name of the class was handy too. So that whereas the full name of the "Match" class is "System.Text.RegularExpressions.Match", if you have "using System.Text.RegularExpressions" at the top of your code, it will automatically look for the Match class with that prepended, without you having to type the whole damn thing in each time.
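
A small sketch of the difference, with NamespaceDemo and SameResult as made-up names: the short and the fully qualified class names refer to exactly the same classes, so both calls give the same result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamespaceDemo
{
    // Hypothetical helper: runs the same match via both spellings of the class names.
    public static bool SameResult(string text, string pattern)
    {
        // Short names, found through the using directive at the top.
        Match shortName = Regex.Match(text, pattern);

        // Fully qualified names; these work with or without the using directive.
        System.Text.RegularExpressions.Match fullName =
            System.Text.RegularExpressions.Regex.Match(text, pattern);

        return shortName.Value == fullName.Value;
    }

    public static void Main()
    {
        Console.WriteLine(SameResult("[M K12345678]", "[Kk][0-9]{8,10}")); // True
    }
}
```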

odheirre.livejournal.com

This reminds me - have you used the Regulator? It's a great tool to look up and validate regular expressions.

I recently grabbed a copy of Expresso
http://www.codeproject.com/dotnet/expresso.asp
which is a version tailored to .NET-conformant regexes.

Haven't checked out Regulator though - I'll give it a look at some point.

darkoshi

That hasn't unscared me, unfortunately. @ before the string? a group named type? a group called key? wha...? So the first question mark in the string always means it's a group called "Type", and the 2nd question mark always means it's the group called "Key"? Why are things in parentheses? What does it mean for a group to be called "Type"? EEEEKKK! ::runs away::

C# automatically translates certain special characters in strings. So \n is automatically translated into a newline character, for instance. Putting a @ before the string tells C# to not do any interpretation.
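
As a quick sketch (VerbatimDemo is a made-up name), here is the same pattern written both ways. Without the @, every backslash meant for the regex has to be doubled so that C# passes a single one through.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Ordinary string: each backslash meant for the regex must be doubled.
    public const string Escaped = "(\\[)(?<Type>.)( )(?<Key>.*)(\\])";

    // @ string: backslashes are passed through untouched.
    public const string Verbatim = @"(\[)(?<Type>.)( )(?<Key>.*)(\])";

    public static void Main()
    {
        Console.WriteLine(Escaped == Verbatim); // True
        Console.WriteLine(Regex.Match("[M K12345678]", Verbatim).Groups["Key"].Value); // K12345678
    }
}
```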

Putting ?<MyName> before a pattern in a Regex means "Store the results for this pattern in an easily retrieved place, accessible as "MyName"".

So, once I told it to store the first "." as "Type", I could then say 'what is gc["Type"]?' to retrieve it. The word "Type" is purely arbitrary, as is "Key".

Did your example leave out the lines where you specified that the groups are named "Type" and "Key" ?.... Oh. You have to look at the page source to see that part, since it's in angle brackets. Now it makes more sense.

Yes, I'm a complete idiot, and forgot to escape the < and >.

I've rectified it now, and if you look at the original example it should be clear.

Heck. You understand regular expressions. Ergo you are not a complete idiot. :P

I find that regexes are dangerously good if you get used to them - it's like Perl, dead easy to write, forms the duct tape of the universe, but impossible to read afterwards. I also don't like how you've had to do that :P. Perl, sed, bash, or php for me....:P
