# Regular Expressions

First of all, this will be a bit painful, but as with `vim`, once you overcome the initial learning curve you start to see the potential regular expressions bring to the table. To make matters even worse, there are multiple *flavors* of regexes. An overview and comparison of the different flavors can be found on [Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines). Don't see this as a reason *not* to learn some basic expressions though; a little experience goes a long way.

## What are they?

> A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory. [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)

You can see regular expressions as find (and replace) on steroids. As a practical example, I used *a lot* of regular expressions to clean up the multiple choice LPI questionnaires. This was done in `vim`, so I used the vim flavor of regex, but it's not too different from the main one you should know, `grep`.

From a practical system administrator's point of view you'll probably use regexes in this order:

1. with `grep`
2. with `sed` (when copy-pasting commands found online)
3. with `vim`
4. with a scripting language such as `python3`

## How to learn them?

Some tips and pointers before we head into the actual syntax.

### Vim

There is a setting in `vim` that is disabled by default but highly advised when learning vim regexes. By setting `set incsearch` in your `~/.vimrc` or in the **expert** command line, vim will highlight whatever matches the pattern you're searching for as you type it. This can be a tremendous help when building complex patterns.

### Grep

By default `grep` only interprets basic regular expressions. If you want, or more likely *need*, to use [extended](https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html) expressions you should use `grep -E` instead (or the older `egrep`, which recent GNU grep releases flag as obsolescent). For completeness's sake I should mention there is a third *version* of `grep`, invoked with `grep -P`, that interprets the patterns as [perl regex](https://perldoc.perl.org/perlre). One of the advantages of perl regexes is reverse matching, meaning lookbehind assertions that make a match depend on the text that comes before it.

## The basics

### Fundamental structure and anchoring

Download the following [file](./assets/regex_03.txt), which is an American English dictionary we will use to test out some basic patterns. Fire up a terminal and find all words that contain the string `abba`. You already know how to do this with `grep`, nothing really special here. `grep 'abba' regex_03.txt` should give you a list containing 16 words. Notice the color output in your terminal. What if we only want the words containing `abba` that *also* start with `s`? This can be done with the following line: `grep 's.*abba' regex_03.txt`. You should have 6 matches left. How about the same ones, but they have to end with an `s`? You guessed it, `grep 's.*abba.*s' regex_03.txt` gets the job done. Strictly speaking, these patterns match an `s` *anywhere* before (or after) `abba`, not only at the very start or end of the word; pinning a match to the start or end of a line requires anchoring, which is covered below.

Now, what are those special characters? They are *like* [wildcards in bash](https://ryanstutorials.net/linuxtutorial/wildcards.php) but on steroids. In the example above, the `.` matches almost any single character, and the `*` means the preceding item may repeat as **many** times as we want, including zero times. They are both very powerful, and broad, matching patterns that are part of the [fundamental structure](https://www.gnu.org/software/grep/manual/html_node/Fundamental-Structure.html#Fundamental-Structure) of regex. They work well, but sometimes we need to be a bit more precise, which we'll get to in a bit.

Here is a second little example with the same list. What if we want all words with the letter `a` in them?
Easy, `grep 'a' regex_03.txt` should do the trick, but 52849 results are a bit much. Let's trim it back to only the words with `aa` in them: `grep 'aa' regex_03.txt` brings it down to 65 matches. Nice. Now, how about only the ones that *start* with `aa`? This can be done with another core concept of regex called [anchoring](https://www.gnu.org/software/grep/manual/html_node/Anchoring.html#Anchoring), like so: `grep '^aa' regex_03.txt`. Here the `^` signifies the **start** of the line. Last one, I promise: what if, of those three, we only want the ones with `s` at the end? Right, `grep '^aa.*s$' regex_03.txt` should do the trick, where `$` signifies the **end** of the line.

### Character classes and bracket expressions

Let's find, in the following [file](./assets/regex_04.txt), all words starting with a capital letter. This can be done in multiple ways. First we try one you *should* remember from your bandit days. Does `[a-zA-Z]` ring a bell? You used it to do ROT13 rotation somewhere around level 12. `grep '^[A-Z].*' regex_04.txt` greps out all words starting with a capital. Wonderful, but a bit too many words. Let's limit it to only words that have punctuation in them. This, `grep '^[A-Z][[:alnum:]]*[[:punct:]].*' regex_04.txt`, which is *very* cryptic, does the job. Let's break it down a bit:

* `^[A-Z]` matches one capital letter at the start of the line
* `[[:alnum:]]*` matches any letter or digit, as many times as we want (including zero)
* `[[:punct:]]` matches any punctuation character, but just **once**
* `.*` matches almost any character as many times as we want

In the above pattern you see *two* different forms of character classes, the `[A-Z]` and the `[[:alnum:]]`. Worth noting is that you can negate a bracket expression by putting a `^` inside it, as such: `grep '^[^A-Z].*' regex_03.txt` (note I used the first list because every entry in the animals list starts with a capital letter, so the negated pattern would match nothing there). For a list of plain words this is basically the same as `grep '^[a-z].*' regex_03.txt`, although strictly speaking `[^A-Z]` also matches digits and punctuation.

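
To see the bracket patterns from this section in action without downloading anything, here is a minimal sketch that pipes a handful of made-up lines into `grep`. The sample entries and the variable names are assumptions for illustration only; they are not taken from `regex_03.txt` or `regex_04.txt`. `LC_ALL=C` is set so that the range `A-Z` means only the ASCII capitals, whatever your locale is.

```shell
# Hypothetical sample lines standing in for the word lists used above.
words=$(printf '%s\n' 'Aardvark' "Hartmann's Zebra" 'zebra' 'Ox' '42nd')

# Lines that start with a capital letter.
capitals=$(printf '%s\n' "$words" | LC_ALL=C grep '^[A-Z]')

# Capitalised lines where a run of letters/digits is followed by punctuation.
punctuated=$(printf '%s\n' "$words" | LC_ALL=C grep '^[A-Z][[:alnum:]]*[[:punct:]]')

# Negated bracket expression: the first character is anything but a capital.
not_capital=$(printf '%s\n' "$words" | LC_ALL=C grep '^[^A-Z]')

printf 'capitals:\n%s\n\n' "$capitals"
printf 'punctuated:\n%s\n\n' "$punctuated"
printf 'not_capital:\n%s\n' "$not_capital"
```

Note that the negated pattern picks up `42nd` as well as `zebra`: a digit is "not a capital letter" just as much as a lowercase letter is.
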
### The backslash character and special expressions

If we want to match only animals whose names contain multiple words we can use [special expressions](https://www.gnu.org/software/grep/manual/html_node/The-Backslash-Character-and-Special-Expressions.html#The-Backslash-Character-and-Special-Expressions). Have a look at the output of this grep pattern: `grep '\s[[:alpha:]]*' regex_04.txt`. Here we notice we can select only parts of the lines too! A handy flag we can add to the `grep` command is `-o`, which will only print the part that matches (the part shown in red). If we only want single words coming out of the search we can do this like so: `grep -o '\s[a-zA-Z]*\>' regex_04.txt`. Notice how each result starts with an empty space? Let's break this one down.

* `\s` matches a whitespace character
* `[a-zA-Z]*` matches any letter, upper or lower case, as many times as we want
* `\>` matches the end of the word (not really needed here)

## Beyond basic

TODO

## Exercises

Below are some practical exercises and files to go with them. Use them to test out your grepping skills and as inspiration for personal challenges.

* configuration [file](./assets/sysctl.conf)
    * print only lines with actual configuration settings (ignore comments and empty lines)
* css [file](./assets/teddit.css)
    * extract all the hex color codes
* html [file](./assets/teddit.html)
    * extract pictures
        * just jpg
        * jpg and png at the same time
* log [file](./assets/auth.log)
    * extract all IP addresses
        * plus only the unique ones
    * extract all wrong logins for known users
    * extract all unknown users (this is tricky and requires backwards searching using `grep -P`)
    * extract all the dates and times for successful logins (might require multiple greps in a pipe)
* mail dump [file](./assets/dump.mail)
    * extract all unique email addresses
    * extract all web links
        * only the base link (https://www.example.co.uk)
        * both http and https links

There are some very good regex exercises online as well. [This](http://regextutorials.com/) is a good starting point.
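
As a nudge for the configuration-file exercise above, here is a minimal sketch of one possible approach, run against a few made-up lines rather than the real `sysctl.conf`. The sample content and the variable names are assumptions for illustration; your file will look different, and other patterns (or `grep -v`) work just as well.

```shell
# Hypothetical stand-in for a few lines of a sysctl.conf-style file.
sample=$(printf '%s\n' \
  '# Controls IP packet forwarding' \
  '' \
  'net.ipv4.ip_forward = 0' \
  '   # indented comment' \
  '#kernel.sysrq = 1' \
  'kernel.sysrq = 0')

# Keep only lines whose first non-blank character is not '#'.
#   ^[[:space:]]*   optional leading whitespace
#   [^#[:space:]]   one character that is neither '#' nor whitespace
settings=$(printf '%s\n' "$sample" | grep '^[[:space:]]*[^#[:space:]]')

printf '%s\n' "$settings"
```

The empty line drops out because the negated bracket expression needs at least one real character to match, and the indented comment drops out because the only non-blank character it can start with is `#`.
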