Regular Expressions

First of all, this will be a bit painful, but as with vim, once you overcome the initial learning curve you start to see the potential regular expressions bring to the table. To make matters worse, there are multiple flavors of regex. An overview and comparison of the different flavors can be found on Wikipedia. Don't see this as a reason not to learn some basic expressions though; a little experience goes a long way.

What are they?

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

Wikipedia

You can think of regular expressions as find (and replace) on steroids. As a practical example, I used a lot of regular expressions to clean up the multiple-choice LPI questionnaires. This was done in vim, so I used the vim flavor of regex, but it is not too different from the main one you should know: grep.

From a practical system administrator's point of view, you'll probably use regexes in this order:

  1. with grep
  2. with sed (when copy-pasting commands found online)
  3. with vim
  4. with a scripting language such as python3

How to learn them?

Some tips and pointers before we head into the actual syntax.

Vim

There is a setting in vim that is disabled by default but highly recommended while learning vim regexes. With set incsearch in your ~/.vimrc, or typed on the Ex command line as :set incsearch, vim highlights whatever matches the pattern as you type it. This can be a tremendous help when building complex patterns.

Grep

By default grep only interprets basic regular expressions. If you want, or more likely need, to use extended expressions, use grep -E (the older egrep does the same thing but is deprecated in recent versions of GNU grep). For completeness's sake, there is a third mode, invoked with grep -P, that interprets patterns as Perl regexes. One of the advantages of Perl regexes is reverse matching, better known as look-around: matching on what comes before or after a position without including it in the result.
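To make the three modes concrete, here is a minimal sketch that runs them against a few made-up lines (the sample data and variable names are invented for illustration, and GNU grep is assumed). Support for -P depends on how grep was built, so that call has a fallback.

```shell
# Made-up sample lines, not taken from the course files.
lines='cat
dog
bird
user=alice'

# Basic regex (the default): | is a literal character, so nothing matches.
# grep exits non-zero when it finds nothing, hence the "|| true".
bre_count=$(printf '%s\n' "$lines" | grep -c 'cat|dog' || true)

# Extended regex: | means "or", so both cat and dog match.
ere_count=$(printf '%s\n' "$lines" | grep -cE 'cat|dog')

# Perl regex: (?<=user=) is a look-behind, so only the name itself is printed.
# Not every grep build supports -P, hence the fallback message.
pcre_name=$(printf '%s\n' "$lines" | grep -oP '(?<=user=)\w+' 2>/dev/null || echo 'grep -P unavailable')

echo "basic: $bre_count, extended: $ere_count, perl: $pcre_name"
```

The same alternation pattern gives zero matches in basic mode and two in extended mode, which is usually the first surprise people run into.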

The basics

Fundamental structure and anchoring

Download the following file, an American English dictionary we will use to test some basic patterns. Fire up a terminal and find all words that contain the string abba. You already know how to do this with grep, nothing really special here. grep 'abba' regex_03.txt should give you a list of 16 words. Notice the color output in your terminal.

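What if we only want the words containing abba that also start with s? This can be done with grep 's.*abba' regex_03.txt. You should have 6 matches left. How about the same ones, but they also have to end with an s? You guessed it, grep 's.*abba.*s' regex_03.txt gets the job done. Strictly speaking, these patterns only require an s somewhere before (and after) abba; pinning it to the very start or end of the word needs anchoring, which comes up shortly.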

Now, what are those special characters? They are like wildcards in bash, but on steroids. In the example above, the . stands for almost any single character, and the * means the preceding item may repeat as many times as we want, including zero times. Both are very powerful, broad matching patterns that are part of the fundamental structure of regex. They work well, but sometimes we need to be a bit more precise, which we'll get to in a bit.
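If you don't have the dictionary at hand, the same patterns can be tried on a tiny word list. This is a sketch with made-up sample data standing in for regex_03.txt, so the counts differ from the ones above.

```shell
# Hypothetical stand-in for regex_03.txt: a few words, one per line.
words='abba
sabbath
sabbaths
cabbage
scabbards
rabbi'

# Literal match: every line that contains abba.
abba_count=$(printf '%s\n' "$words" | grep -c 'abba')

# An s, then any characters (possibly none), then abba.
s_abba=$(printf '%s\n' "$words" | grep 's.*abba')

# Same, but another s has to follow somewhere after abba.
s_abba_s=$(printf '%s\n' "$words" | grep 's.*abba.*s')

echo "lines containing abba: $abba_count"
echo "$s_abba_s"
```

Note that cabbage drops out of the second search because no s comes before its abba, and sabbath drops out of the third because no s follows it.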

Second little example with the same list. What if we want all words with the letter a in them? Easy, grep 'a' regex_03.txt does the trick, but 52849 results are a bit much. Let's trim it back to only the words with aa in them: grep 'aa' regex_03.txt brings it down to 65 matches. Nice. Now, how about only the ones that start with aa? This uses another core concept of regex called anchoring: grep '^aa' regex_03.txt. Here the ^ signifies the start of the line. Last one, I promise: what if, of those three, we want only the ones with s at the end? Right, grep '^aa.*s$' regex_03.txt does the trick, where $ signifies the end of the line.
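The effect of the anchors is easiest to see side by side. A small sketch, again on an invented word list rather than the real dictionary:

```shell
# Hypothetical stand-in for regex_03.txt.
words='aardvark
aardvarks
aah
baas
bazaar
bazaars'

# No anchors: aa may appear anywhere in the line.
aa_count=$(printf '%s\n' "$words" | grep -c 'aa')

# ^ pins the match to the start of the line.
starts_aa=$(printf '%s\n' "$words" | grep '^aa')

# $ pins the s to the end of the line, aa can still be anywhere.
ends_s=$(printf '%s\n' "$words" | grep 'aa.*s$')

# Both anchors: starts with aa and ends with s.
both=$(printf '%s\n' "$words" | grep '^aa.*s$')

echo "aa anywhere: $aa_count"
echo "anchored at both ends: $both"
```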

Character classes and bracket expressions

Let's find all words starting with a capital letter in the following file. This can be done in multiple ways. First we try one you should remember from your bandit days. Does [a-zA-Z] ring a bell? You used it to do ROT13 rotation somewhere around level 12. grep '^[A-Z].*' regex_04.txt greps out all words starting with a capital. Wonderful, but a bit too many words.

Let's limit it to only words that have punctuation in them. The rather cryptic grep '^[A-Z][[:alnum:]]*[[:punct:]].*' regex_04.txt does the job.

Let's break it down a bit:

  • ^[A-Z] matches a single capital letter at the start of the line
  • [[:alnum:]]* matches any letter or digit, as many times as we want (including zero)
  • [[:punct:]] matches any punctuation character, but just once
  • .* matches almost any character as many times as we want

In the above pattern you see two different forms of character classes, the [A-Z] and the [[:alnum:]]. Worth noting is that you can negate a bracket expression by putting a ^ inside it, as in grep '^[^A-Z].*' regex_03.txt (note I used the first list because every entry in the animals list starts with a capital letter). For a word list this is basically the same as grep '^[a-z].*' regex_03.txt, although [^A-Z] also matches digits and punctuation.
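The following sketch tries these bracket expressions on a made-up animals list (the entries are invented, not taken from regex_04.txt). It sets the C locale first, since in some locales a range like [A-Z] can also match lowercase letters.

```shell
# Keep ranges such as [A-Z] ASCII-only, regardless of the system locale.
export LC_ALL=C

# Hypothetical stand-in for regex_04.txt.
animals='Aardvark
African elephant
Black-footed cat
Bat-eared fox
zebra
42 ants'

# Lines starting with a capital letter.
capital_count=$(printf '%s\n' "$animals" | grep -c '^[A-Z]')

# Capital, then letters or digits, then one punctuation character.
with_punct=$(printf '%s\n' "$animals" | grep '^[A-Z][[:alnum:]]*[[:punct:]].*')

# Negated: lines that start with anything except a capital letter.
not_capital=$(printf '%s\n' "$animals" | grep '^[^A-Z]')

# Not quite the same thing: lines that start with a lowercase letter.
lowercase=$(printf '%s\n' "$animals" | grep '^[a-z]')

echo "$with_punct"
```

The line starting with a digit shows where the negated expression and [a-z] part ways.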

The backslash character and special expressions

If we want to match only animals whose names contain multiple words, we can use special expressions. Have a look at the output of grep '\s[[:alpha:]]*' regex_04.txt. Here we notice we can select only parts of the lines too! A handy flag to add to the grep command is -o, which prints only the part that matches (the part shown in red). Note that \s is a GNU grep extension and not part of the POSIX basic syntax.

If we only want single words coming out of the search, we can do this like so: grep -o '\s[a-zA-Z]*\>' regex_04.txt. Notice how each result starts with a space? Let's break this one down.

  • \s matches a single whitespace character
  • [a-zA-Z]* matches any letter, as many times as we want
  • \> matches the end of a word (not really needed here)
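A short sketch of what -o does with these patterns, using invented animal names and assuming GNU grep for \s and \>. Piping through tr to drop the leading space is an addition for illustration, not part of the original commands.

```shell
# Hypothetical stand-in for regex_04.txt.
animals='Aardvark
African elephant
Grey wolf pup'

# Count the lines that contain whitespace followed by letters.
multi_word=$(printf '%s\n' "$animals" | grep -c '\s[[:alpha:]]*')

# -o prints each match on its own line; tr removes the space that \s matched.
later_words=$(printf '%s\n' "$animals" | grep -o '\s[a-zA-Z]*\>' | tr -d ' ')

echo "multi-word lines: $multi_word"
echo "$later_words"
```

A line with three words yields two separate matches, one for each word that follows a space.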

Beyond basic

The tests done above are a quick introduction to what regexes are and how to read them. To master them there are only two things you need to do:

  1. use them
  2. google

There is just no other way to wrap your head around it. You should see it as a puzzle you're trying to solve. To get you started, I urge you to go to this website and start the introduction to regex section. It automatically highlights what your pattern is doing. Once you have completed the introduction, you should try their exercises and the ones I made for you below.

Exercises

Below are some practical exercises and files to go with them. Use them to test your grepping skills and as inspiration for personal challenges. I've tested all of the challenges myself, but there are a lot of different ways to get the same result with a different regex.

  • configuration file
    • print only lines with actual configuration settings (ignore comments and empty lines)
  • css file
    • extract all the hex color codes
  • html file
    • extract pictures
      • just jpg
      • jpg and png at the same time
  • log file
    • extract all IP addresses
      • plus only the unique ones
    • extract all wrong logins for known users
    • extract all unknown users (this is tricky and requires a look-behind, which needs grep -P)
    • extract all the dates and times for successful logins (might require multiple greps in a pipe)
  • mail dump file
    • extract all unique email addresses
    • extract all web links
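As a nudge for the pipe-based exercises, here is one possible shape for an extract-then-deduplicate pipeline. The log lines are invented and only loosely resemble an sshd log, so treat the pattern as a starting point rather than the solution for the real file. It also accepts impossible addresses such as 999.1.1.1.

```shell
# Made-up log lines using documentation-only IP ranges.
log='Jan 10 10:01:02 host sshd[101]: Failed password for root from 192.0.2.10 port 4242 ssh2
Jan 10 10:01:05 host sshd[102]: Accepted password for alice from 198.51.100.7 port 5151 ssh2
Jan 10 10:02:00 host sshd[103]: Failed password for invalid user bob from 192.0.2.10 port 4243 ssh2'

# -o prints only the matched part, -E enables {1,3} and (...) without backslashes.
# sort -u then removes the duplicates.
unique_ips=$(printf '%s\n' "$log" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | LC_ALL=C sort -u)

echo "$unique_ips"
```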