
# Regular Expressions

Fair warning: this will be a bit painful at first, but as with `vim`, once you get past the initial learning curve you start to see the potential regular expressions bring to the table.
To make matters worse, there are multiple *flavors* of regexes.
An overview and comparison of the different flavors can be found on [wikipedia](https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines).
Don't see this as a reason *not* to learn some basic expressions though; a little experience goes a long way.

## What are they?

> A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

[wikipedia](https://en.wikipedia.org/wiki/Regular_expression)

You can see regular expressions as find (and replace) on steroids.
As a practical example, I used *a lot* of regular expressions to clean up the multiple choice LPI questionnaires.
This was done in `vim`, so I used the vim flavor of regex, but it's not very different from the main one you should know: the one used by `grep`.

From a practical system administrator's point of view you'll probably use regexes in this order:

1. with `grep`
2. with `sed` (when copy-pasting commands found online)
3. with `vim`
4. with a scripting language such as `python3`

## How to learn them?

Some tips and pointers before we head into the actual syntax.

### Vim

There is a setting in `vim` that is disabled by default but highly recommended while you learn vim regexes.
By putting `set incsearch` in your `~/.vimrc`, or typing `:set incsearch` on the **ex** command line, vim will highlight whatever matches the pattern while you are still typing it.
This can be a tremendous help when building complex patterns.

### Grep

By default `grep` only interprets basic regular expressions.
If you want, or more likely *need*, to use [extended](https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html) expressions you should use `grep -E` (or the older `egrep`, which recent GNU grep releases flag as obsolescent).
For completeness's sake I should mention there is a third *version* of `grep`, invoked with `grep -P`, that interprets the patterns as [Perl regex](https://perldoc.perl.org/perlre).
One of the advantages of Perl regexes is support for lookaround assertions such as lookbehind, which let a match depend on the text before or after it without including that text in the match.
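
To make the three flavors a bit more tangible, here is a small sketch that runs the same kind of search in each mode. The sample lines and the `sample` helper are made up for this illustration, and I'm assuming GNU `grep` built with `-P` support, which is what most Linux distributions ship.

```shell
# Made-up sample lines, generated inline so nothing has to be downloaded.
sample() { printf '%s\n' 'cat' 'dog' 'cat|dog' 'user=alice'; }

# Basic (default): a bare | is an ordinary character, so only the line
# that literally contains "cat|dog" matches.
bre_count=$(sample | grep -c 'cat|dog')

# Extended (-E): | means "or", so every line containing cat or dog matches.
ere_count=$(sample | grep -E -c 'cat|dog')

# Perl-compatible (-P): the lookbehind (?<=user=) demands that "user=" sits
# right before the match, without becoming part of what -o prints.
pcre_match=$(sample | grep -P -o '(?<=user=)[a-z]+')

echo "basic: $bre_count, extended: $ere_count, perl: $pcre_match"
```

The same pattern `cat|dog` gives a different number of matching lines depending on the flavor, which is exactly why it pays to know which mode you are in.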

## The basics

### Fundamental structure and anchoring

Download the following [file](./assets/regex_03.txt) which is an American English dictionary we will use to test out some basic patterns.
Fire up a terminal and find all words that contain the string `abba`.
You already know how to do this with `grep`, nothing really special here.
`grep 'abba' regex_03.txt` should give you a list containing 16 words.
Notice the color output in your terminal.

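What if we only want the words containing `abba` that *also* start with `s`?
A first attempt is `grep 's.*abba' regex_03.txt`.
You should have 6 matches left.
How about the same ones, but they have to end with an `s`?
You guessed it, `grep 's.*abba.*s' regex_03.txt` gets the job done, at least for this word list: strictly speaking these patterns only ask for an `s` *somewhere* before (and after) `abba`, and pinning it to the very start or end of the word needs anchoring, which we cover further down.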

Now, what are those special characters?
They are *like* [wildcards in bash](https://ryanstutorials.net/linuxtutorial/wildcards.php) but on steroids.
In the example above, the `.` represents almost any single character, and the `*` means the preceding item may repeat as **many** times as we want, including zero times.
They are both very powerful, and broad, matching patterns that are part of the [fundamental structure](https://www.gnu.org/software/grep/manual/html_node/Fundamental-Structure.html#Fundamental-Structure) of regex.
They work well but sometimes we need to be a bit more precise, which we'll get to in a bit.
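
If you don't have the dictionary at hand, the behaviour of `.` and `*` can be tried out on a handful of words. This is only a sketch: the word list and the `words` helper are made up for the occasion and are not taken from `regex_03.txt`.

```shell
# A tiny stand-in for the dictionary file (hypothetical sample words).
words() { printf '%s\n' 'abba' 'cabbages' 'rabbi' 'sabbath' 'sabbaths' 'scabbard'; }

# How many words contain the literal string abba?
contains=$(words | grep -c 'abba')

# An s, then any characters (possibly none at all), then abba.
s_before=$(words | grep 's.*abba')

# Same as above, followed by any characters and another s.
s_both=$(words | grep 's.*abba.*s')

echo "$contains words contain abba"
echo "$s_before"
echo "$s_both"
```

Note how `sabbath` matches `s.*abba` with `.*` matching nothing at all, while in `scabbard` it matches the single `c`.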

Second little example with the same list.
What if we want all words with the letter `a` in them?
Easy, `grep 'a' regex_03.txt` should do the trick but 52849 results are a bit much.
Let's trim it back to only the words having `aa` in them.
`grep 'aa' regex_03.txt` brings it down to 65 matches.
Nice.
Now, how about only the ones that *start* with `aa`?
This can be done with another core concept of regex called [anchoring](https://www.gnu.org/software/grep/manual/html_node/Anchoring.html#Anchoring): `grep '^aa' regex_03.txt`.
Here the `^` signifies the **start** of the line.
Last one, I promise: what if, of those three, we only want the ones with `s` at the end?
Right, `grep '^aa.*s$' regex_03.txt` should do the trick, where `$` signifies the **end** of the line.
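
The effect of the two anchors is easiest to see side by side. Below is a minimal sketch on a made-up word list (the `words` helper is hypothetical, not part of the course files).

```shell
# Hypothetical sample words, some with aa at the start, some in the middle.
words() { printf '%s\n' 'aah' 'aardvark' 'aardvarks' 'bazaar' 'bazaars' 'salaam'; }

# No anchors: aa may appear anywhere in the line.
anywhere=$(words | grep -c 'aa')

# ^ pins the match to the start of the line.
at_start=$(words | grep '^aa')

# $ pins the match to the end of the line; here without ^ ...
end_only=$(words | grep 'aa.*s$')

# ... and here with both anchors combined.
start_and_end=$(words | grep '^aa.*s$')

echo "$anywhere lines contain aa"
echo "$at_start"
echo "$end_only"
echo "$start_and_end"
```

Without the `^`, `bazaars` slips through because its `aa` sits in the middle of the word. With both anchors only `aardvarks` is left.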

### Character classes and bracket expressions

Let's find, in the following [file](./assets/regex_04.txt), all words starting with a capital letter.
This can be done in multiple ways.
First we try one you *should* remember from your bandit days.
Does `[a-zA-Z]` ring a bell?
You used it to do ROT13 rotation somewhere around level 12.
`grep '^[A-Z].*' regex_04.txt` greps out all words starting with a capital.
Wonderful, but a bit too many words.

Let's limit it to only words that have punctuation in them.
This, `grep '^[A-Z][[:alnum:]]*[[:punct:]].*' regex_04.txt`, which is *very* cryptic, does the job.

Let's break it down a bit:

* `^[A-Z]` matches one capital letter at the start of the line
* `[[:alnum:]]*` matches any letter or digit, as many times as we want (including none)
* `[[:punct:]]` matches any punctuation character, but just **once**
* `.*` matches almost any character as many times as we want

In the above pattern you see *two* different forms of character classes, the `[A-Z]` and the `[[:alnum:]]`.
Worth noting is that you can negate a bracket expression by putting a `^` as its first character, as such: `grep '^[^A-Z].*' regex_03.txt` (note I used the first list because the animals list has no entries that start with anything other than a capital letter).
This is roughly the same as `grep '^[a-z].*' regex_03.txt`, except that the negated version also matches lines starting with a digit or punctuation.
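
Here is a small sketch of both forms of character class, plus the negated one, on a made-up animal list. The entries and the `animals` helper are invented for illustration and are not the contents of `regex_04.txt`. I'm setting `LC_ALL=C` on the assumption that you want `[A-Z]` to mean plain ASCII capitals, since ranges can behave differently in other locales.

```shell
# Keep [A-Z] and [a-z] strictly ASCII, whatever the system locale is.
export LC_ALL=C

# Hypothetical sample entries.
animals() {
    printf '%s\n' 'African elephant' 'Black-footed cat' "Grevy's zebra" \
        'Lion' 'red fox' '42 llamas'
}

# Capital letter, then letters/digits, then one punctuation character.
capital_punct=$(animals | grep '^[A-Z][[:alnum:]]*[[:punct:]].*')

# Negated class: first character is anything BUT a capital letter.
not_capital=$(animals | grep '^[^A-Z]')

# Positive class: first character must be a lowercase letter.
lowercase=$(animals | grep '^[a-z]')

echo "$capital_punct"
echo "$not_capital"
echo "$lowercase"
```

The last two searches differ on `42 llamas`: a digit is "not a capital letter", but it is not a lowercase letter either.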

### The backslash character and special expressions

If we want to match only animals that contain multiple words we can use [special expressions](https://www.gnu.org/software/grep/manual/html_node/The-Backslash-Character-and-Special-Expressions.html#The-Backslash-Character-and-Special-Expressions).
Have a look at the output of this grep pattern `grep '\s[[:alpha:]]*' regex_04.txt`.
Here we notice we can select only parts of the lines too!
A handy flag we can add to the `grep` command is `-o`, which prints only the part that matches (the part highlighted in red).

If we only want the individual words coming out of the search we can do this like so: `grep -o '\s[a-zA-Z]*\>' regex_04.txt`.
Notice how each result starts with a space?
Let's break this one down.

* `\s` matches a single whitespace character
* `[a-zA-Z]*` matches any letter, upper or lower case, as many times as we want
* `\>` matches the end of the word (not really needed here)
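
The breakdown above can be tried on a few made-up lines. This is a sketch only: the `animals` helper and its entries are hypothetical, and `\s` and `\>` are GNU `grep` extensions, so I'm assuming GNU `grep` here.

```shell
# Hypothetical sample entries: two multi-word names and one single word.
animals() { printf '%s\n' 'African elephant' 'Black-footed cat' 'Lion'; }

# Count the lines that contain whitespace followed by letters.
multi_word=$(animals | grep -c '\s[[:alpha:]]*')

# -o prints only the matching part; the leading space is part of the match.
second_words=$(animals | grep -o '\s[a-zA-Z]*\>')

# One way to get rid of that leading space is to delete it afterwards.
trimmed=$(animals | grep -o '\s[a-zA-Z]*\>' | tr -d ' ')

echo "$multi_word lines have more than one word"
echo "$second_words"
echo "$trimmed"
```

`Lion` drops out because it has no whitespace, and `Black-footed cat` only yields ` cat` because the hyphen is not whitespace.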

## Beyond basic

The tests done above are a quick introduction to what regexes are and how to read them.
To master them there are only two things you need to do.

1. use them
2. google

There is just no other way to wrap your head around it.
You should see it as a puzzle you're trying to solve.
To get you started I urge you to go to [this](http://regextutorials.com/) website and start the **introduction to regex** section.
It highlights automatically what your pattern is doing.
Once you have completed the introduction you should try their exercises and the ones I made for you below.

## Exercises

Below are some practical exercises and files to go with them.
Use them to test out your grepping skills and as inspiration for personal challenges.
I've tested all of the challenges myself but there are a lot of different ways to get the same result with a different regex.

* configuration [file](./assets/sysctl.conf)
	* print only lines with actual configuration settings (ignore comments and empty lines)
* css [file](./assets/teddit.css)
	* extract all the hex color codes
* html [file](./assets/teddit.html)
	* extract pictures
		* just jpg
		* jpg and png at the same time
* log [file](./assets/auth.log)
	* extract all IP addresses
		* plus only the unique ones
	* extract all wrong logins for known users
	* extract all unknown users (this is tricky and requires a lookbehind, i.e. backwards searching, using `grep -P`)
	* extract all the dates and times for successful logins (might require multiple greps in a pipe)
* mail dump [file](./assets/dump.mail)
	* extract all unique email addresses
	* extract all web links
		* only the base link (https://www.example.co.uk)
		* both http and https links

## Extra challenges

Regex patterns on their own are nice but can get a bit boring.
Try to integrate them into a script to discover their true power.
An example would be to take the IP addresses from the [auth.log](./assets/auth.log) file and look up the region each one comes from.
You can save all the country codes to a file and do an analysis on them to see where all the attacks are coming from.
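
To give an idea of what the analysis step could look like, here is a minimal sketch that tallies country codes. Everything in it is an assumption for illustration: the codes are made up, the `codes` helper stands in for whatever file your lookup produces, and the lookup itself is left to you.

```shell
# Keep [A-Z] strictly ASCII, whatever the system locale is.
export LC_ALL=C

# Hypothetical lookup results, including two junk entries.
codes() { printf '%s\n' 'CN' 'US' 'CN' '??' 'NL' 'US' 'CN' 'unknown'; }

# -x makes the pattern match the whole line, so only clean two-letter
# codes survive; sort + uniq -c counts them, sort -rn puts the biggest first.
tally=$(codes | grep -E -x '[A-Z]{2}' | sort | uniq -c | sort -rn)

# The country code on the first line of the tally is the most frequent one.
top=$(printf '%s\n' "$tally" | head -n 1 | awk '{print $2}')

echo "$tally"
echo "most attempts come from: $top"
```

The regex is doing the boring but important part here: throwing out anything that isn't a well-formed code before the counting starts.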