Regular Expressions
First of all, this will be a bit painful, but as with vim, once you overcome the initial learning curve you start to see the potential regular expressions bring to the table.
To make matters even worse, there are multiple flavors of regexes.
An overview and comparison of the different flavors can be found on Wikipedia.
Don't see this as a reason not to learn some basic expressions though; a little experience goes a long way.
What are they?
A regular expression (shortened as regex or regexp;[1] also referred to as rational expression[2][3]) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.
You can see regular expressions as find (and replace) on steroids.
As a practical example, I used a lot of regular expressions to clean up the multiple choice LPI questionnaires.
This was done in vim, so I used the vim flavor of regex, but it's not too different from the main one you should know, grep.
From a practical system administrator's point of view you'll probably use regexes in this order:
- with grep
- with sed (when copy-pasting commands found online)
- with vim
- with a scripting language such as python3
How to learn them?
Some tips and pointers before we head into the actual syntax.
Vim
There is a setting in vim that is disabled by default but highly recommended while learning vim regexes.
By adding set incsearch to your ~/.vimrc, or by typing :set incsearch on the Ex command line, vim will highlight whatever matches the pattern while you're typing it.
This can be a tremendous help when building complex patterns.
Grep
By default grep only interprets basic regular expressions.
If you want, or more likely need, to use extended expressions you should use grep -E or egrep instead.
For completeness's sake I should mention there is a third version of grep, invoked with grep -P, that interprets the patterns as Perl regexes.
One of the advantages of Perl regexes is reverse matching.
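To make the difference between basic and extended expressions concrete, here is a minimal sketch. The three sample lines are made up for illustration and are not part of the course files. In a basic expression the + is just a literal plus sign, while grep -E treats it as "one or more of the preceding item".

```shell
# Hypothetical sample input, not one of the course files.
sample=$(printf '%s\n' 'ab' 'aab' 'a+b')

# Basic regex (the default): + is an ordinary character here.
bre_matches=$(printf '%s\n' "$sample" | grep 'a+b')

# Extended regex: + means "one or more of the preceding item".
ere_matches=$(printf '%s\n' "$sample" | grep -E 'a+b')

printf 'basic:\n%s\n\n' "$bre_matches"
printf 'extended:\n%s\n' "$ere_matches"
```

The basic run only finds the line with a literal plus sign, the extended run finds the two lines where one or more a is directly followed by b.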
The basics
Fundamental structure and anchoring
Download the following file which is an American English dictionary we will use to test out some basic patterns.
Fire up a terminal and find all words that contain the string abba.
You already know how to do this with grep, nothing really special here.
Running grep 'abba' regex_03.txt should give you a list containing 16 words.
Notice the color output in your terminal.
What if we only want the words containing abba that also start with s?
This can be done with the following line: grep 's.*abba' regex_03.txt.
You should have 6 matches left.
How about the same ones, but they have to end with an s?
You guessed it, grep 's.*abba.*s' regex_03.txt gets the job done.
Strictly speaking, without anchors these patterns only ask for an s somewhere before (and after) abba, not at the very start or end of the word.
Now, what are those special characters?
They are like wildcards in bash but on steroids.
In the example above, the . represents almost any single character, and the * means the preceding item repeated as many times as we want, including zero times.
Both are very powerful, and broad, matching patterns that are part of the fundamental structure of regex.
They work well but sometimes we need to be a bit more precise, which we'll get to in a bit.
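If you don't have the dictionary file at hand, the same three patterns can be tried on a tiny list. This is only a sketch: the six words below are a hypothetical stand-in for regex_03.txt, so the match counts differ from the ones mentioned above.

```shell
# Hypothetical mini word list standing in for the dictionary file.
words=$(printf '%s\n' abba sabbath scabbard rabbi sabbaths cabbage)

# Every word that contains the literal string abba.
contains=$(printf '%s\n' "$words" | grep 'abba')

# An s, then any characters (.*), then abba.
s_before=$(printf '%s\n' "$words" | grep 's.*abba')

# Same, plus any characters and another s after abba.
s_both=$(printf '%s\n' "$words" | grep 's.*abba.*s')

printf 'abba:\n%s\n\n' "$contains"
printf 's.*abba:\n%s\n\n' "$s_before"
printf 's.*abba.*s:\n%s\n' "$s_both"
```

Here rabbi never shows up because it has no abba, and cabbage drops out of the second search because no s comes before its abba.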
Second little example with the same list.
What if we want all words with the letter a in them?
Easy, grep 'a' regex_03.txt should do the trick, but 52849 results are a bit much.
Let's trim it back to only the words having aa in them.
grep 'aa' regex_03.txt brings it down to 65 matches.
Nice.
Now, how about only the ones that start with aa?
This can be done with another core concept of regex called anchoring: grep '^aa' regex_03.txt.
Here the ^ signifies the start of the line.
Last one, I promise: what if, of those, we only want the ones with s at the end?
Right, grep '^aa.*s$' regex_03.txt should do the trick, where $ signifies the end of the line.
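The effect of the two anchors is easy to see on a handful of words. The list below is a hypothetical sample, not the dictionary file, so treat it as a sketch of the idea only.

```shell
# Hypothetical mini word list standing in for the dictionary file.
words=$(printf '%s\n' aardvark aardvarks bazaar bazaars aah)

# aa anywhere in the line.
anywhere=$(printf '%s\n' "$words" | grep 'aa')

# ^ anchors the pattern to the start of the line.
at_start=$(printf '%s\n' "$words" | grep '^aa')

# ^ and $ together: starts with aa and ends with s.
start_and_end=$(printf '%s\n' "$words" | grep '^aa.*s$')

printf 'aa:\n%s\n\n' "$anywhere"
printf '^aa:\n%s\n\n' "$at_start"
printf '^aa.*s$:\n%s\n' "$start_and_end"
```

All five words contain aa, three of them start with it, and only aardvarks both starts with aa and ends with s.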
Character classes and bracket expansion
Let's find, in the following file, all words starting with a capital letter.
This can be done in multiple ways.
First we try one you should remember from your bandit days.
Does [a-zA-Z] ring a bell?
You used it to do ROT13 rotation somewhere around level 12.
grep '^[A-Z].*' regex_04.txt greps out all words starting with a capital.
Wonderful, but a bit too many words.
Let's limit it to only words that have punctuation in them.
This, grep '^[A-Z][[:alnum:]]*[[:punct:]].*' regex_04.txt, which is very cryptic, does the job.
Let's break it down a bit:
- ^[A-Z] matches one capital letter at the start of the line
- [[:alnum:]]* matches all letters and numbers, as many times as we want
- [[:punct:]] matches any punctuation, but just once
- .* matches almost any character as many times as we want
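The breakdown above can be checked on a few lines. This is a sketch on made-up animal names, not the contents of regex_04.txt, and it sets LC_ALL=C as an assumption so that [A-Z] only covers the 26 ASCII capitals.

```shell
# Assumption: force the C locale so [A-Z] means exactly A through Z.
export LC_ALL=C

# Hypothetical animal names, not the real regex_04.txt.
animals=$(cat <<'EOF'
African elephant
Przewalski's horse
red fox
Black-footed ferret
Steller sea-lion
EOF
)

# Capital letter, a run of letters or digits, then one punctuation character.
punct_first_word=$(printf '%s\n' "$animals" |
  grep '^[A-Z][[:alnum:]]*[[:punct:]].*')

printf '%s\n' "$punct_first_word"
```

Only the two names with punctuation inside their first word survive. The hypothetical Steller sea-lion line is skipped because a space, which is neither alphanumeric nor punctuation, comes before its hyphen.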
In the above pattern you see two different forms of character classes, the [A-Z] and the [[:alnum:]].
Worth noting is that you can negate a class by putting a ^ as the first character inside the brackets, as such: grep '^[^A-Z].*' regex_03.txt (note I used the first list because the animals list has no words starting with a lowercase letter).
For a list where every word starts with a letter this is basically the same as grep '^[a-z].*' regex_03.txt; the negated form would also match lines starting with a digit or punctuation.
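A small sketch of the difference between a negated class and its lowercase counterpart. The five lines are invented for illustration, and LC_ALL=C is set as an assumption to keep the ranges limited to ASCII.

```shell
# Assumption: force the C locale so the ranges only cover ASCII letters.
export LC_ALL=C

# Hypothetical sample lines, one of them starting with a digit.
lines=$(printf '%s\n' Aaron abbey 3D Zulu zebra)

# First character is anything except a capital letter.
not_capital=$(printf '%s\n' "$lines" | grep '^[^A-Z].*')

# First character is a lowercase letter.
lowercase=$(printf '%s\n' "$lines" | grep '^[a-z].*')

printf 'not capital:\n%s\n\n' "$not_capital"
printf 'lowercase:\n%s\n' "$lowercase"
```

The line 3D appears only in the first result, because a digit is "not a capital letter" but is not a lowercase letter either.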
The backslash character and special expressions
If we want to match only animals whose names contain multiple words we can use special expressions.
Have a look at the output of this grep pattern: grep '\s[[:alpha:]]*' regex_04.txt.
Here we notice we can select only parts of the lines too!
A handy flag we can add to the grep command is -o, which will only print the part that matches (in red).
If we only want single words coming out of the search we can do it like so: grep -o '\s[a-zA-Z]*\>' regex_04.txt.
Notice how each match starts with a space?
Let's break this one down.
- \s matches whitespace
- [a-zA-Z]* matches any letter of the alphabet, as many times as we want
- \> matches the end of the word (not really needed here)
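To see what -o does with this pattern, here is a minimal sketch on three made-up animal names (not the real regex_04.txt). It assumes GNU grep, since \s and \> are extensions that not every grep implementation provides.

```shell
# Hypothetical animal names, not the real regex_04.txt.
animals=$(printf '%s\n' 'African elephant' 'Zebra' 'great white shark')

# Without -o: whole lines that contain whitespace followed by letters.
multi_word=$(printf '%s\n' "$animals" | grep '\s[[:alpha:]]*')

# With -o: only the matching parts, one per output line.
# Assumption: GNU grep, which understands \s and \>.
parts=$(printf '%s\n' "$animals" | grep -o '\s[a-zA-Z]*\>')

printf 'lines:\n%s\n\n' "$multi_word"
printf 'parts:\n%s\n' "$parts"
```

The single-word line Zebra is dropped in both cases, and every part printed by -o begins with the space that \s matched.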
Beyond basic
The tests done above are a quick introduction to what regex are and how to read them. To master them there are only two things you need to do.
- use them
There is just no other way to wrap your head around it. You should see it as a puzzle you're trying to solve. To get you started I urge you to go to this website and start the introduction to regex section. It automatically highlights what your pattern is doing. Once you have completed the introduction you should try their exercises and the ones I made for you below.
Exercises
Below are some practical exercises and files to go with them. Use them to test out your grepping skills and as inspiration for personal challenges. I've tested all of the challenges myself, but there are a lot of different ways to get the same result with a different regex.
- configuration file
  - print only lines with actual configuration settings (ignore comments and empty lines)
- css file
  - extract all the hex color codes
- html file
  - html extract pictures
    - just jpg
    - jpg and png at the same time
- log file
  - extract all IP addresses
    - plus only the unique ones
  - extract all wrong logins for known users
  - extract all unknown users (this is tricky and requires backwards searching using grep -P)
  - extract all the dates and times for successful logins (might require multiple greps in a pipe)
- mail dump file
  - extract all unique email addresses
  - extract all web links
    - only the base link (https://www.example.co.uk)
    - both http and https links