Pattern Matching (Regex)
In this article, I will be using Bash commands like grep or sed to explain how RegEx works. This knowledge can be used to pattern match in almost all languages.
Note: Do not parse HTML with RegEx
Why? Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your web app. Parsing HTML with regex summons tainted souls into the realm of the living.
For more check out: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
How to use the grep command?
The grep command is used to print lines that match a particular pattern.
It processes its input line by line and by default uses the basic regular expression (BRE) engine:
grep 'pattern' filename
command | grep 'pattern'
To use the extended regex (ERE) engine instead, use
egrep 'pattern' filename
or
grep -E 'pattern' filename
(egrep is the older spelling; current GNU grep marks it obsolescent in favour of grep -E.)
For case-insensitive matching, add the -i flag.
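The invocations above can be tried out with a small sketch like the following. The temporary file and its contents are made up for illustration; any text file would do.

```shell
# Create a throwaway sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'Alice' 'bob' 'Charlie' 'alice cooper' > "$sample"

# Case-sensitive BRE match on a file: prints "alice cooper".
grep 'alice' "$sample"

# Reading from a pipe instead of a file: prints "two" and "three".
printf '%s\n' 'one' 'two' 'three' | grep 't'

# Extended regex with alternation: prints "bob" and "Charlie".
grep -E 'bob|Charlie' "$sample"

# Case-insensitive match: prints "Alice" and "alice cooper".
grep -i 'alice' "$sample"
```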
Now, what is this regex that I keep talking about? A regular expression is a sequence of characters that describes a pattern to search for in text. You can think of it as an advanced version of the find-and-replace feature you may have seen in text editors.
Regex takes ordinary keyboard characters like “.” or “*” or “^” and gives them a special meaning.
Regex Special Symbols
“.” matches any single character except a null or a newline.
“*” matches zero or more occurrences of the preceding character or expression.
“[]” matches any one of the enclosed characters; a hyphen (-) between two characters indicates a range.
“^” anchors the match to the beginning of a line, or negates the characters enclosed in “[]” when it comes first inside the brackets.
“$” anchors the match to the end of a line.
“\” escapes special characters so that they match literally.
In BRE, repetition counts and grouping are written as:
\{n,m\} matches the preceding pattern at least n times and at most m times.
\(\) groups regular expressions.
In ERE the backslashes are dropped and a few more operators are available:
{n,m} for \{n,m\} in BRE
() for \(\) in BRE
“+” matches the preceding character or expression one or more times.
“?” matches the preceding character or expression zero or one time.
“|” is a logical OR between patterns.
Examples:
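The following is a small sketch of the symbols above in action. The word list is invented for illustration, and the expected matches are noted in the comments.

```shell
# Sample words, one per line (made up for illustration).
words=$(printf '%s\n' 'cat' 'cot' 'ct' 'coat' 'cart' 'scat')

# "." is any single character: matches cat, cot and scat.
printf '%s\n' "$words" | grep 'c.t'

# "[]" lists the allowed characters: matches cat, cot and scat.
printf '%s\n' "$words" | grep 'c[ao]t'

# "^" and "$" anchor to line start and end, "*" is zero or more: matches cot and ct.
printf '%s\n' "$words" | grep '^co*t$'

# BRE interval with escaped braces: matches coat and cart.
printf '%s\n' "$words" | grep '^c[a-z]\{2\}t$'

# The same interval in ERE needs no backslashes: matches coat and cart.
printf '%s\n' "$words" | grep -E '^c[a-z]{2}t$'

# ERE grouping, "|" and "+" (one or more): matches cat, cot and coat.
printf '%s\n' "$words" | grep -E '^c(o|a)+t$'

# ERE "?" (zero or one): matches cat and scat.
printf '%s\n' "$words" | grep -E '^s?cat$'
```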
Character Class
To make regex more human-readable, there are predefined character classes that help out.
Each class and what it represents are as follows:
[[:print:]] -: Printable
[[:blank:]] -: Space/Tab
[[:alnum:]] -: Alphanumeric
[[:alpha:]] -: Alphabetic
[[:lower:]] -: Lower case
[[:upper:]] -: Upper Case
[[:digit:]] -: Decimal Digits
[[:space:]] -: Whitespace
[[:punct:]] -: Punctuation
[[:xdigit:]] -: Hexadecimal
[[:graph:]] -: Printable, excluding space
[[:cntrl:]] -: Control Characters
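As a rough sketch of how these classes are used, the commands below filter a few invented sample strings. The classes go inside a bracket expression, so the double brackets are part of the syntax.

```shell
# Sample strings, one per line (made up for illustration).
samples=$(printf '%s\n' 'abc' 'abc123' 'ABC' 'a b!')

# Lines containing at least one decimal digit: prints "abc123".
printf '%s\n' "$samples" | grep '[[:digit:]]'

# Lines made up only of upper-case letters: prints "ABC".
printf '%s\n' "$samples" | grep -E '^[[:upper:]]+$'

# Lines containing punctuation: prints "a b!".
printf '%s\n' "$samples" | grep '[[:punct:]]'

# Lines that are purely alphanumeric: prints "abc", "abc123" and "ABC".
printf '%s\n' "$samples" | grep -E '^[[:alnum:]]+$'
```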
Backreferences
Backreferences are used to match the same text previously matched by a capturing group. This both helps in reusing previous parts of your pattern and in ensuring two pieces of a string match.
For example, to recognise a pattern like 1-3-4 or 1 2 5 or 1/4/7, we could use a regex of the form [0-9][-/ ][0-9][-/ ][0-9]
This has a flaw: it also matches strings with mixed separators, like 1 2-4 or 1/2 3
So, we can use a backreference to make it stricter: [0-9]([-/ ])[0-9]\1[0-9]
Here \1 refers back to whatever the first group, i.e. the one inside the (), matched, so the second separator must be the same character as the first.
In this way, backreferences can be made to groups \1 through \9.
Here I put a bunch of random numbers inside my names.txt file and tried the above regular expressions using egrep.
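A sketch of that experiment is below. The temporary file is a stand-in for names.txt with invented contents, the patterns use a digit in the middle position so that they fit the sample strings, and it assumes a grep whose -E mode supports backreferences, as GNU grep does.

```shell
# Hypothetical stand-in for names.txt.
numbers=$(mktemp)
trap 'rm -f "$numbers"' EXIT
printf '%s\n' '1-3-4' '1 2 5' '1/4/7' '1 2-4' '1/2 3' > "$numbers"

# Without a backreference all five lines match, mixed separators included.
grep -E '[0-9][-/ ][0-9][-/ ][0-9]' "$numbers"

# With \1 the second separator must repeat the first one,
# so only 1-3-4, 1 2 5 and 1/4/7 are printed.
grep -E '[0-9]([-/ ])[0-9]\1[0-9]' "$numbers"
```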
Word Boundary
The metacharacter \b matches at a position that is called a “word boundary”.
What is a word boundary?
- It is before the first character in a string if the first character is a word character.
- It is after the last character in the string if the last character is a word character.
- Between two characters where one is a word character and the other is not.
Example: checking for “an” at the end of every word.
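One way to try this out is sketched below, assuming GNU grep, where \b is supported. The sample lines are invented.

```shell
# Sample lines (made up for illustration).
lines=$(printf '%s\n' 'a man ran' 'banana' 'an apple' 'anthem')

# A plain 'an' matches anywhere, so all four lines are printed.
printf '%s\n' "$lines" | grep 'an'

# 'an\b' requires a word boundary right after "an":
# prints "a man ran" and "an apple" only.
printf '%s\n' "$lines" | grep 'an\b'

# '\ban\b' requires a boundary on both sides, i.e. the whole word "an":
# prints "an apple" only.
printf '%s\n' "$lines" | grep '\ban\b'
```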
Sed
Sed is used for filtering and transforming text. The name is short for stream editor, and like the grep command we used before, it works through its input line by line.
Sed is available on virtually every Linux system and is very fast.
Execution of Sed:
- The input is a set of lines.
- Each line is a sequence of characters.
- Sed keeps two buffers: the active pattern space and a hold space.
- For each line of input, one execution cycle is performed, loading that line into the pattern space.
- During each cycle, the statements of the sed script are executed in sequence, and each action is applied when its address pattern matches the line.
Usage in terminal: sed [flag option] {script for filtering} [input file]
Most Used Flags:
“-E” for the extended regex engine
“-e” to add the script to the commands to be executed
“-f” to add the commands from a .sed script file
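As a small sketch of the -e and -f flags, the two commands below apply the same pair of substitutions, once inline and once from a script file. The script file is a temporary file created just for this example.

```shell
# Hypothetical script file holding one sed command per line.
script=$(mktemp)
trap 'rm -f "$script"' EXIT
printf '%s\n' 's/Hello/Bye/' 's/world/everyone/' > "$script"

# Several -e options are executed in order: prints "Bye everyone".
printf 'Hello world\n' | sed -e 's/Hello/Bye/' -e 's/world/everyone/'

# The same commands read from the script file: prints "Bye everyone".
printf 'Hello world\n' | sed -f "$script"
```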
Sed Script:
Substituting Values: Sed can be used to replace the text in output while leaving the input file unchanged.
Here I had a file that contained Hello world and the sed output gives Bye world
In the command, s/Hello/Bye/, the “s” before /Hello/ stands for substitute, and Hello is turned into Bye.
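The substitution described here can be reproduced with a sketch like this one. The temporary file stands in for the original input file, whose name is not given.

```shell
# Hypothetical input file containing the single line "Hello world".
greeting=$(mktemp)
trap 'rm -f "$greeting"' EXIT
printf 'Hello world\n' > "$greeting"

# Substitute the first "Hello" on each line: prints "Bye world".
sed 's/Hello/Bye/' "$greeting"

# The input file itself is untouched: prints "Hello world".
cat "$greeting"
```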
If a word occurs more than once on a line, we can add a number (/2, /3, …) at the end to replace only that nth occurrence.
To replace every occurrence, we use /g instead of /2.
In GNU sed we can also combine the two to replace every match from the nth one onwards:
sed -e 's/world/Bye/5g' input.txt
This command replaces every world with Bye, starting from the 5th world on each line.
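The three variants can be compared side by side on a short invented line. This sketch assumes GNU sed; combining a number with g is not specified by POSIX, so other sed implementations may reject it or behave differently.

```shell
# One line with four occurrences of "world" (made up for illustration).
line='world world world world'

# Replace only the 2nd occurrence: prints "world Bye world world".
printf '%s\n' "$line" | sed 's/world/Bye/2'

# Replace every occurrence: prints "Bye Bye Bye Bye".
printf '%s\n' "$line" | sed 's/world/Bye/g'

# GNU sed: replace the 3rd occurrence and all later ones:
# prints "world world Bye Bye".
printf '%s\n' "$line" | sed 's/world/Bye/3g'
```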
Between the “s” and the “g”, we can use all kinds of regex and play around with it.
Format: sed 's/regex/replacement/g' [input file]
Here every piece of text that matches regex is replaced with replacement.
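Two small illustrations of regex inside a substitution follow, using invented input and the -E flag for extended syntax. The second one reuses captured groups in the replacement via \1 and \2.

```shell
# Replace every run of digits with "N": prints "order N shipped in N days".
printf 'order 66 shipped in 3 days\n' | sed -E 's/[0-9]+/N/g'

# Capture two words and swap them in the replacement: prints "smith john".
printf 'john smith\n' | sed -E 's/([a-z]+) ([a-z]+)/\2 \1/'
```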
Deleting Lines from a file: Sed can also be used to delete lines, and it does so without the file having to be opened in an editor.
Syntax: sed 'nd' [input file]
This will delete the n’th line of the file.
Note: $ is used to represent the last line, and a command given without any address applies to all the lines.
Just like “d” and “s”, we can use “i” to insert text above the current line and “a” to append text below the current line. “=” prints the input line number, and so on.
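The commands mentioned so far can be sketched on a three-line sample as follows. The sample text is invented, and the one-line forms of “i” and “a” used here are a GNU sed convenience; other sed implementations expect a backslash and a newline after the command letter.

```shell
# Three sample lines (made up for illustration).
text=$(printf '%s\n' 'one' 'two' 'three')

# Delete the 2nd line: prints "one" and "three".
printf '%s\n' "$text" | sed '2d'

# Delete the last line: prints "one" and "two".
printf '%s\n' "$text" | sed '$d'

# Insert a line above line 2 (GNU one-line form):
# prints "one", "inserted", "two", "three".
printf '%s\n' "$text" | sed '2i inserted'

# Append a line below line 2 (GNU one-line form):
# prints "one", "two", "appended", "three".
printf '%s\n' "$text" | sed '2a appended'

# Print only the number of the last line, i.e. the line count: prints "3".
printf '%s\n' "$text" | sed -n '$='
```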
This was just an introduction to the sed command; a lot more practice and reading will be needed to get good at using it.
There is also a command called awk, which works similarly to sed and can be used in the Linux terminal.