Pattern Matching (Regex)

Mar 8, 2022

In this article, I will be using Bash commands like grep or sed to explain how RegEx works. This knowledge can be used to pattern match in almost all languages.

Note: Do not parse HTML with RegEx

Why? Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your web app. Parsing HTML with regex summons tainted souls into the realm of the living.
For more check out: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

How to use the grep command?

The grep command is used to print lines that match a particular pattern.

It processes its input line by line and, by default, uses the basic regular expression (BRE) syntax.

grep 'pattern' filename
command | grep 'pattern'

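To use the extended regular expression (ERE) syntax instead, use

egrep 'pattern' filename or grep -E 'pattern' filename

grep -E is the safer choice of the two, since recent GNU grep releases warn that egrep is obsolescent.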

For case insensitive matching add the -i flag.
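
As a minimal sketch of the invocations above, the snippet below builds a small sample file and searches it in three ways. The file contents and variable names are invented for illustration; only the grep usage comes from the text.

```shell
# Create a throwaway sample file (contents are made up for this sketch).
tmpdir=$(mktemp -d)
printf 'Sam\nSusan\nsteven\nBob\n' > "$tmpdir/names.txt"

# Search a file directly: prints the lines that contain "Su".
from_file=$(grep 'Su' "$tmpdir/names.txt")

# Search the output of another command through a pipe.
from_pipe=$(cat "$tmpdir/names.txt" | grep 'ob')

# -i ignores case, so "Sam", "Susan" and "steven" all match '^s'.
ignore_case=$(grep -i '^s' "$tmpdir/names.txt")

printf '%s\n' "$from_file" "$from_pipe" "$ignore_case"
rm -r "$tmpdir"
```

Single straight quotes around the pattern keep the shell from interpreting characters such as * or $ before grep sees them.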

Now, what is this regex that I am talking about? A regular expression is a specific sequence of characters that describes a pattern to look for in the given text. You can think of it as an advanced version of the find-and-replace feature you may have seen in text editors.

A regex gives ordinary keyboard characters such as “.”, “*” and “^” a special meaning.

Regex Special Symbols

“.” Represents a single character except for null or new line.

“*” Represents zero or more occurrences of the preceding character/expression.

“[]” Represents a list of enclosed characters; if hyphen (-) is in between it indicates character range.

“^” Represents the beginning of a line or negation of characters enclosed in “[]”

“$” Represents the anchor for the end of a line.

“\” Escapes a special character so that it is matched literally.
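
The symbols above can be tried out with a short sketch like the one below. The word list is invented for illustration, and the expected matches in the comments assume a grep that follows the usual BRE rules.

```shell
# Sample input (invented for this sketch), one word per line.
words=$(printf 'cat\ncot\ncoat\nct\ncart\na.b\n')

# "." matches exactly one character: c, any character, t.
dot=$(printf '%s\n' "$words" | grep '^c.t$')          # cat, cot

# "*" allows zero or more of the preceding character.
star=$(printf '%s\n' "$words" | grep '^co*t$')        # cot, ct

# "[]" lists allowed characters; "^" and "$" anchor the line.
bracket=$(printf '%s\n' "$words" | grep '^c[a-o]t$')  # cat, cot

# "^" as the first character inside "[]" negates the list.
negated=$(printf '%s\n' "$words" | grep '^c[^a]t$')   # cot

# "\." matches a literal dot instead of "any character".
escaped=$(printf '%s\n' "$words" | grep 'a\.b')       # a.b

printf '%s\n' "$dot" "$star" "$bracket" "$negated" "$escaped"
```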

We can specify how many times a pattern repeats in the form of:

\{n,m\} matches the preceding pattern at least n times and at most m times.

\(\) groups regular expressions.

In ERE the same things are written without the backslashes:

{n,m} for \{n,m\} in BRE

() for \(\) in BRE

“+” matches the preceding character or expression one or more times.

“?” matches the preceding character or expression zero or one time.

“|” is a logical OR between patterns.
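
A small sketch contrasting the BRE and ERE spellings may help here. The input lines are invented for illustration; the ERE commands use grep -E.

```shell
# Sample input (invented for this sketch).
lines=$(printf 'ac\nabc\nabbc\nabbbc\nxyz\n')

# BRE: the interval needs backslashes (2 to 3 "b" characters).
bre_interval=$(printf '%s\n' "$lines" | grep 'ab\{2,3\}c')     # abbc, abbbc

# ERE: the same interval without backslashes.
ere_interval=$(printf '%s\n' "$lines" | grep -E 'ab{2,3}c')    # abbc, abbbc

# "+" means one or more of the preceding character.
plus=$(printf '%s\n' "$lines" | grep -E '^ab+c$')              # abc, abbc, abbbc

# "?" means zero or one of the preceding character.
optional=$(printf '%s\n' "$lines" | grep -E '^ab?c$')          # ac, abc

# "|" chooses between alternatives; "()" groups them.
either=$(printf '%s\n' "$lines" | grep -E '^(ac|xyz)$')        # ac, xyz

printf '%s\n' "$bre_interval" "$plus" "$optional" "$either"
```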

Examples:

Words that have ‘S’, then one letter, then ‘n’

Here we use the backslash as an escape character to match a literal . in the file.

This selects one letter before ‘am’ at the end of a sentence
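
The original commands for these three examples are not shown here, so the following is only a guess at what they may have looked like. The sample lines are invented, and the third pattern assumes that "end of a sentence" means the end of a line.

```shell
# Sample input (invented for this sketch).
text=$(printf 'Sun rises\nSan Diego\nSoon.\nI am Sam\nHere I am\n')

# 'S', then exactly one letter, then 'n'.
s_letter_n=$(printf '%s\n' "$text" | grep 'S[A-Za-z]n')     # Sun rises, San Diego

# Escaped dot: only lines containing a literal "." match.
literal_dot=$(printf '%s\n' "$text" | grep '\.')            # Soon.

# One letter followed by "am" at the end of the line.
before_am=$(printf '%s\n' "$text" | grep '[A-Za-z]am$')     # I am Sam

printf '%s\n' "$s_letter_n" "$literal_dot" "$before_am"
```

"Here I am" is not selected by the last pattern because the character before "am" is a space, not a letter.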

Character Class

To make regex more human-readable, there are named character classes that can be used inside a bracket expression.

Each class and what it represents:

[[:print:]] -: Printable

[[:blank:]] -: Space/Tab

[[:alnum:]] -: Alphanumeric

[[:alpha:]] -: Alphabetic

[[:lower:]] -: Lower case

[[:upper:]] -: Upper Case

[[:digit:]] -: Decimal Digits

[[:space:]] -: Whitespace

[[:punct:]] -: Punctuation

[[:xdigit:]] -: Hexadecimal digits

[[:graph:]] -: Visible characters (printable, excluding space)

[[:cntrl:]] -: Control Characters
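
A few of these classes in action, as a minimal sketch. The sample lines are invented for illustration.

```shell
# Sample input (invented for this sketch).
sample=$(printf 'abc\nABC\n123\nab12\nhi!\n')

# Lines made up only of decimal digits.
digits_only=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]+$')   # 123

# Lines made up only of letters.
letters_only=$(printf '%s\n' "$sample" | grep -E '^[[:alpha:]]+$')  # abc, ABC

# Lines containing at least one upper-case letter.
has_upper=$(printf '%s\n' "$sample" | grep '[[:upper:]]')           # ABC

# Lines containing punctuation.
has_punct=$(printf '%s\n' "$sample" | grep '[[:punct:]]')           # hi!

printf '%s\n' "$digits_only" "$letters_only" "$has_upper" "$has_punct"
```

Note the double brackets: the outer pair is the bracket expression and the inner [:name:] is the class, so other characters can be listed alongside it, as in [[:digit:]_].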

Backreferences

Backreferences are used to match the same text previously matched by a capturing group. This both helps in reusing previous parts of your pattern and in ensuring two pieces of a string match.

For example

To recognise a pattern like 1-3-4 or 1 2 5 or 1/4/7

We could use a regex of the form [0-9][-/ ][0-9][-/ ][0-9]

This has a flaw: it also matches strings with mixed separators, like 1 2-4 or 1/2 3

So, we can use a backreference to make it stricter: [0-9]([-/ ])[0-9]\1[0-9]

Here \1 refers back to the text captured by the first group, i.e. the one inside the (), so the second separator must be the same as the first.

In this way, backreferences can be used for groups \1 to \9.

Here I put a bunch of random numbers inside my names.txt file and tried the above regexes using egrep
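
The original test file is not reproduced here, so the sketch below uses invented sample lines with a digit in all three positions. It assumes a grep whose -E mode supports backreferences, as GNU grep does.

```shell
# Sample input (invented for this sketch): three consistent and two mixed separators.
numbers=$(printf '1-3-4\n1 2 5\n1/4/7\n1 2-4\n1/2 3\n')

# Without a backreference, any mix of separators is accepted.
loose=$(printf '%s\n' "$numbers" | grep -E '[0-9][-/ ][0-9][-/ ][0-9]')

# With \1, the second separator must repeat whatever the first group matched.
strict=$(printf '%s\n' "$numbers" | grep -E '[0-9]([-/ ])[0-9]\1[0-9]')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern selects all five lines, while the strict one keeps only 1-3-4, 1 2 5 and 1/4/7.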

Word Boundary

The metacharacter \b matches at a position that is called a “word boundary”.

What is a word boundary?

It is before the first character in a string if the first character is a word character.
It is after the last character in the string if the last character is a word character.
It is between two characters in the string where one is a word character and the other is not.
Example:

Checking for “an” at the end of every word.
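
The original command is not shown, so this is a sketch of one way to do it. The sample lines are invented, and \b is assumed to be available as it is in GNU grep.

```shell
# Sample input (invented for this sketch).
text=$(printf 'Susan\nan apple\nbanana\nplanet\n')

# Plain "an" matches anywhere inside a word, so every line is selected.
anywhere=$(printf '%s\n' "$text" | grep 'an')

# "an\b" requires a word boundary right after "an",
# so only words that end in "an" are selected.
at_word_end=$(printf '%s\n' "$text" | grep 'an\b')   # Susan, an apple

printf 'anywhere:\n%s\nat word end:\n%s\n' "$anywhere" "$at_word_end"
```

In "banana" and "planet" every "an" is followed by another word character, so there is no boundary and the second pattern skips those lines.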

Sed

Sed is used for filtering and transforming text. The name is an abbreviation of stream editor, and it works on its input line by line, much like the grep command we used before.

Sed ships with virtually every Linux system and is fast.

Execution of Sed:

Sed takes a set of lines as input; each line is a sequence of characters.
It keeps two buffers: the active pattern space and a hold space.
For each line of input, an execution cycle is performed, which starts by loading that line into the pattern space.
During each cycle, the statements of the sed script are executed in sequence; each action is applied to the lines that match its address pattern, according to the options provided.

Usage in terminal: sed [flag option] {script for filtering} [input file]

Most Used Flags:

“-E” to use the extended regex engine (ERE)

“-e” to add a script to the commands to be executed; it can be given more than once

“-f” to read the script from a .sed script file
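
As a rough sketch of the three flags, assuming a throwaway input file and a script file named greeting.sed (both invented for this example):

```shell
# Create a throwaway input file (contents are made up for this sketch).
tmpdir=$(mktemp -d)
printf 'Hello world\n' > "$tmpdir/input.txt"

# -e can be repeated; the scripts run in order on each line.
chained=$(sed -e 's/Hello/Bye/' -e 's/world/moon/' "$tmpdir/input.txt")

# -E switches to ERE, so "+" and "()" need no backslashes.
extended=$(sed -E 's/(l+)o/\1O/' "$tmpdir/input.txt")

# -f reads the same two commands from a script file, one per line.
printf 's/Hello/Bye/\ns/world/moon/\n' > "$tmpdir/greeting.sed"
from_script=$(sed -f "$tmpdir/greeting.sed" "$tmpdir/input.txt")

printf '%s\n' "$chained" "$extended" "$from_script"
rm -r "$tmpdir"
```

The first and third commands both print Bye moon, and the second prints HellO world.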

Sed Script:

Substituting Values: Sed can be used to replace the text in output while leaving the input file unchanged.

Here I had a file containing Hello world, and the sed output gives Bye world

In the command, notice the “s” before /Hello/: it stands for substitute, and it turns Hello into Bye.
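
The command itself is not shown here, so the sketch below is a plausible reconstruction rather than the exact original. The file name input.txt is an assumption.

```shell
# Create a throwaway input file containing the text from the example.
tmpdir=$(mktemp -d)
printf 'Hello world\n' > "$tmpdir/input.txt"

# s/Hello/Bye/ substitutes the first "Hello" on each line with "Bye".
output=$(sed 's/Hello/Bye/' "$tmpdir/input.txt")

# The input file itself is left untouched.
still_original=$(cat "$tmpdir/input.txt")

printf 'sed output: %s\nfile on disk: %s\n' "$output" "$still_original"
rm -r "$tmpdir"
```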

If a pattern occurs more than once in a line, we can add a number such as /2 or /3 at the end to replace only that nth occurrence.

To replace all occurrences we use /g instead of /2

We can also use both together to replace every match from the nth occurrence onwards. This combination is defined by GNU sed; other sed implementations may treat it differently.

sed -e 's/world/Bye/5g' input.txt This command replaces every world with Bye, starting from the 5th world of each line.
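
The three flag variants can be compared side by side. This sketch uses an invented line with six occurrences of world and assumes GNU sed for the combined 5g form.

```shell
# One line with six occurrences of "world" (invented for this sketch).
line='world world world world world world'

# A number replaces only that occurrence.
second=$(printf '%s\n' "$line" | sed 's/world/Bye/2')

# g replaces every occurrence.
all=$(printf '%s\n' "$line" | sed 's/world/Bye/g')

# Number plus g (GNU sed): replace from the 5th occurrence onwards.
from_fifth=$(printf '%s\n' "$line" | sed 's/world/Bye/5g')

printf '%s\n' "$second" "$all" "$from_fifth"
```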

Between the “s” and the “g”, the pattern part can be any kind of regex, so there is a lot to play around with.

Format: sed 's/regex/replacement/g' [input file] Here every match of regex is replaced with the replacement text.

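Deleting Lines from a file: Sed can also be used to delete lines, and it does so without opening the file in an editor. As with substitution, the result is printed and the input file is left unchanged.

Syntax: sed 'nd' [input file] This deletes the nth line of the file.

Note: $ is used to represent the last line. To act on all the lines, give the command no address at all; % for all lines is vi/ex syntax and is not accepted by sed.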
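
A short sketch of the delete command with different addresses. The input lines are invented for illustration.

```shell
# Sample input (invented for this sketch).
lines=$(printf 'one\ntwo\nthree\nfour\n')

# A number addresses that line: delete line 2.
without_second=$(printf '%s\n' "$lines" | sed '2d')

# $ addresses the last line.
without_last=$(printf '%s\n' "$lines" | sed '$d')

# With no address, d applies to every line, so nothing is printed.
nothing_left=$(printf '%s\n' "$lines" | sed 'd')

printf '%s\n---\n%s\n' "$without_second" "$without_last"
```

The script is in single quotes so that the shell does not try to expand $d as a variable.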

Just like “d” and “s”, we can use “i” to insert text above the current line and “a” to append text below it. “=” prints the input line number, and so on.
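
These three commands might be used as in the sketch below. The input and the inserted text are invented, and the i\ and a\ commands are written in their two-line form, with the text on the line after the backslash.

```shell
# Sample input (invented for this sketch).
lines=$(printf 'one\ntwo\n')

# i\ inserts the following text above line 2.
inserted=$(printf '%s\n' "$lines" | sed '2i\
above two')

# a\ appends the following text below line 1.
appended=$(printf '%s\n' "$lines" | sed '1a\
below one')

# = prints each line number on its own line before the line itself.
numbered=$(printf '%s\n' "$lines" | sed '=')

printf '%s\n---\n%s\n---\n%s\n' "$inserted" "$appended" "$numbered"
```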

This was just an introduction to the sed command; a lot more practice and reading will be needed to become good at using it.

There is also a command called awk, which works similarly to sed and can be used in the Linux terminal.