Regular expressions (REs) describe
patterns against which text in
files can be matched.
They are used in the editor
ed
and the family of editors based on it;
they are also used in many other Unix tools.
In this chapter, we will learn about REs by using the Unix tool called
grep.
Here is a short file of text:
$ more cars
The typical American male devotes more than 1,600 hours a
year to his car. He sits in it while it goes and while it
stands idling. He parks it and searches for it. He earns the
money to put down on it and to meet the monthly instalments.
He works to pay for petrol, tolls, insurance, taxes and
tickets. He spends four of his sixteen waking hours on the
road or gathering resources for it. The model American puts
in 1,600 hours to get 7,500 miles: less than five miles per
hour. In countries deprived of a transportation industry,
people manage to do the same, walking wherever they want to
go, and they allocate only three to eight percent of their
society's time budget to traffic instead of 28 per cent.
Ivan Illich
$
The name of the file is
cars; it will be used throughout this chapter and the next.
If we wish to look for lines in
cars
containing
four
we can use this command:
$ grep four cars
tickets. He spends four of his sixteen waking hours on the
$
The first of
grep's parameters is a regular expression (RE) and the second is a file name.
grep's usual action is to display all the lines in the file that
match the RE.
So, what we see is the only line of
cars
containing
four.
If we try the following:
$ grep it cars
year to his car. He sits in it while it goes and while it
stands idling. He parks it and searches for it. He earns the
money to put down on it and to meet the monthly instalments.
road or gathering resources for it. The model American puts
$
grep
displays more lines than before because
it
occurs more often than
four
in
cars.
We are not restricted to whole words with
grep; it will find part-words too.
For instance:
$ grep ling cars
stands idling. He parks it and searches for it. He earns the
$
If we supply more than one file name,
grep
looks in all of the files and displays the file name
before any lines containing the RE.
The same thing happens if we use Unix's file name expansion facilities:
$ grep needle *
haystack:This is the line with the needle.
porcupine:Here is a needle
porcupine:Another needle
porcupine:Yet another needle
This is very handy when we know we have some text but don't know which file it is in.
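The file name prefix is easy to try out for ourselves. The following is only a sketch: it builds two throwaway files in a temporary directory (the contents are made up for illustration) and compares a one-file search with a two-file search.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'This is the line with the needle.\nNothing here.\n' > haystack
printf 'Here is a needle\nNo match on this line\n' > porcupine

# One file: grep prints just the matching line.
one=$(grep needle haystack)

# Several files: each matching line is prefixed with "filename:".
many=$(grep needle haystack porcupine)

printf '%s\n' "$one" "$many"
```

With a single file there is no prefix; it only appears when grep is given more than one file to search.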
Regular expressions usually consist of
ordinary characters and special characters;
the special characters are known as
metacharacters.
The following are all metacharacters:
$ ^ [ . *.
We will examine them one by one.
If the dollar sign
($) comes
at the end
of a regular expression, it means the text that the
regular expression matches has to occur at the
end
of a line.
(If it helps,
you can think of the dollar sign matching an imaginary, invisible character
at the end of every line.)
This example is similar to the earlier "grep it cars" one:
$ grep 'it$' cars
year to his car. He sits in it while it goes and while it
$
This time,
grep
only finds the portion of the RE before the dollar sign
(it) when it occurs at the
end
of a line.
Notice that the RE has to be in single quotation marks;
this is because the dollar sign has a special meaning to the shell
besides its meaning in REs.
The quotation marks hide the RE from the shell so it does not
interpret any of its special characters.
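To see what the shell does to unquoted and quoted text before a command ever runs, we can let echo show us its arguments. This is a small sketch, not part of the chapter's examples; the file name afile and the variable name are invented for the demonstration.

```shell
# Work in a scratch directory holding one file whose name starts with "a".
dir=$(mktemp -d)
cd "$dir" || exit 1
: > afile
name=world

# Unquoted: the shell expands a* to matching file names first.
unquoted=$(echo a*)
# Double quotes stop file name expansion but still substitute $name.
double=$(echo "a* $name")
# Single quotes pass every character through untouched.
single=$(echo 'a* $name')

printf '%s\n' "$unquoted" "$double" "$single"
```

Only the single-quoted form reaches the command exactly as typed, which is why it is the safe habit for REs.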
If the caret symbol
(^) comes at the
start
of a regular expression, it means that the
RE has to occur at the
beginning
of a line.
(You can think of the caret symbol matching an invisible character
at the start of every line.)
In this example:
$ grep '^The' cars
The typical American male devotes more than 1,600 hours a
$
grep
only finds
The
when it occurs at the start of a line.
The caret symbol and dollar sign are only treated as metacharacters when they occur at the beginning and the end of a RE respectively. In the following exchange:
$ grep 't^s$i' cars
$
grep
is looking for an exact match for the five characters,
including
^
and
$, inside the
quotation marks.
Notice that
grep
simply displays nothing if no lines match the RE.
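Although grep prints nothing when there is no match, it does report the result through its exit status: 0 if some line matched and 1 if none did. A minimal sketch, using a made-up two-line file, shows how a script might test for that:

```shell
# A small sample file in a scratch directory.
dir=$(mktemp -d)
printf 'one line\nanother line\n' > "$dir/sample"

# The if statement tests grep's exit status; output is thrown away.
if grep 'another' "$dir/sample" > /dev/null; then
    found=yes
else
    found=no
fi

if grep 'missing' "$dir/sample" > /dev/null; then
    missing=yes
else
    missing=no
fi

echo "another: $found, missing: $missing"
```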
The dot metacharacter matches any single character as we see here:
$ grep 'i.n' cars
hour. In countries deprived of a transportation industry,
$
In our text, the dot matched the character
o
letting the RE match
ion
.
We can use more than one dot to match more characters.
For example, to find
a
and
e
whenever they are separated by two
other characters:
$ grep 'a..e' cars
money to put down on it and to meet the monthly instalments.
He works to pay for petrol, tolls, insurance, taxes and
road or gathering resources for it. The model American puts
$
Even when
grep
has found the lines,
it's still quite hard for us humans to spot the text the REs match!
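Some versions of grep can help here. The -o option, found in GNU and BSD grep but not required by the POSIX standard, prints only the text that matched instead of the whole line. Assuming your grep has it, this sketch feeds in one line from cars:

```shell
# -o is an extension (GNU and BSD grep); it prints just the matched text.
line='He works to pay for petrol, tolls, insurance, taxes and'
shown=$(echo "$line" | grep -o 'a..e')
echo "$shown"
```

On that line the only place where a and e are separated by exactly two characters is inside insurance, so the matched text is ance.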
If we want to remove the special meaning of a metacharacter we
can do so by preceding it with a backslash
(\).
Here we look for lines ending with
t.:
$ grep 't\.$' cars
society's time budget to traffic instead of 28 per cent.
$
In the example, the backslash turns the dot into an ordinary character
so that
grep
finds the only line of
cars
that ends with a
t
and a full stop.
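The difference the backslash makes is easier to see on a file where the unescaped dot has something else to match. The three sample lines below are invented for this sketch:

```shell
# Sample lines: one ends in "t.", one in "ts", one in neither.
dir=$(mktemp -d)
printf 'per cent.\nper cents\nthe end\n' > "$dir/sample"

# Unescaped dot: "t" followed by ANY one character at the end of the line.
unescaped=$(grep 't.$' "$dir/sample")

# Escaped dot: "t" followed by a real full stop at the end of the line.
escaped=$(grep 't\.$' "$dir/sample")

printf 'unescaped:\n%s\nescaped:\n%s\n' "$unescaped" "$escaped"
```

The unescaped RE picks up both of the first two lines; the escaped one picks up only the line that really ends with a full stop.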
If we put brackets
([
and
]) around some group of characters we get a RE which matches any single
character from the group.
For instance:
$ grep ' [Ii]n ' cars
year to his car. He sits in it while it goes and while it
hour. In countries deprived of a transportation industry,
$
The part of the RE in brackets matches either
I
or
i;
combined with the spaces and the
n, the RE matches
In
or
in
in the middle of a sentence.
Of course, we can have completely unrelated characters inside the brackets:
$ grep '[qxj]' cars
He works to pay for petrol, tolls, insurance, taxes and
tickets. He spends four of his sixteen waking hours on the
$
There are two lines with
q,
x
or
j
in them.
We can use a variation on the same theme to match a character in a range.
$ grep '[J-S]' cars
$
The example shows that no upper case letters between
J
and
S
occur in the text.
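One caution about ranges: what falls "between" two letters depends on the locale, and outside the plain C (POSIX) locale a range such as J-S is not guaranteed to mean just those upper case letters. A hedged sketch, using an invented three-line file, forces the C locale for one command:

```shell
# Sample lines: only the first contains an upper case letter from J to S.
dir=$(mktemp -d)
printf 'Kilo\nkilo\nAlpha\n' > "$dir/sample"

# LC_ALL=C makes the range follow plain ASCII order for this command only.
upper=$(LC_ALL=C grep '[J-S]' "$dir/sample")
echo "$upper"
```

If a range ever seems to match letters it should not, trying the command again with LC_ALL=C in front is a reasonable first check.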
Another variation allows us to match characters that are not listed. The caret immediately after the opening bracket in the following example changes the sense of the search:
$ grep '[^A-Za-z0-9 .,:]' cars
society's time budget to traffic instead of 28 per cent.
$
Without
the caret, the RE would have matched any single character taken from:
letters, digits, space,
dot, comma and colon.
With
the caret, the RE only matches any single character outside that set.
(In the example, it is the apostrophe
(')
after
society
that matches the RE.)
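The effect of the caret can be checked by running the same bracketed set with and without it. The sample lines here are made up for the sketch; only the middle one contains a character outside the set.

```shell
# Three sample lines; the apostrophe in the second is outside the set.
dir=$(mktemp -d)
printf '%s\n' 'plain words only' "society's budget" '28 per cent.' > "$dir/sample"

# Without the caret: lines containing at least one character FROM the set.
inside=$(grep '[A-Za-z0-9 .,:]' "$dir/sample")

# With the caret: lines containing at least one character NOT in the set.
outside=$(grep '[^A-Za-z0-9 .,:]' "$dir/sample")

printf 'inside:\n%s\noutside:\n%s\n' "$inside" "$outside"
```

Note that the caret version does not mean "lines with none of these characters"; it means "lines with at least one other character".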
The asterisk
(*) character is only used after some other character; it means
zero or more occurrences of the character it follows.
For example,
a*
means zero or more consecutive
as and
z*
means zero or more consecutive
zs.
Clearly
aa*
matches one or more
as and
aaa*
matches two or more
as.
This example shows how we look for one or more consecutive zeroes:
$ grep '00*' cars
The typical American male devotes more than 1,600 hours a
in 1,600 hours to get 7,500 miles: less than five miles per
$
Although there are three occurrences, we only get two lines displayed.
A little warning: because the character-asterisk combination can match
zero occurrences, it is normally used after something else; on its own it
matches every line.
For instance, as we show here, there are no
z
characters in
cars:
$ grep 'z' cars
$
Even so, this command:
grep 'z*' cars
would display
every
line of the file because
grep
would find zero
zs at the start of every line, and zero occurrences still count as a match.
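We can confirm this behaviour, and the usual cure of writing the character twice, on a small file. The file contents are invented for this sketch and contain no z at all.

```shell
# A sample file with no "z" anywhere.
dir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$dir/sample"

# z* can match zero z's, so every line matches.
every=$(grep 'z*' "$dir/sample")

# zz* needs at least one z, so nothing matches.
# grep exits with status 1 on no match; "|| true" keeps strict scripts going.
none=$(grep 'zz*' "$dir/sample" || true)

printf 'z* gives:\n%s\nzz* gives:\n%s\n' "$every" "$none"
```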
Dot followed by asterisk (.*) matches any string of characters of any length; that is, any character followed by any character for as long as possible. You might expect it to match any character and then allow only that particular character to be repeated; it does not work like that.
Dot plus asterisk is most useful in this sort of situation:
$ grep 'stands.*the' cars
stands idling. He parks it and searches for it. He earns the
$
As you see,
.*
matches all the text between
stands
and
the.
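The "as long as possible" part matters when the closing text occurs more than once on a line. grep itself only shows whole lines, so this sketch borrows sed, whose substitute command can wrap the matched text in brackets (the & in the replacement stands for whatever the RE matched). The sample line is invented.

```shell
# The line contains "the" twice after "stands".
line='stands here, the car, the road'

# Wrap whatever the RE matched in square brackets.
marked=$(echo "$line" | sed 's/stands.*the/[&]/')
echo "$marked"
```

The brackets close after the second the, not the first: the match runs to the last place on the line where the rest of the RE can still succeed.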
Some REs are so useful that they are worth listing:
^$                  empty (blank) lines
^  *                lines starting with spaces
  *$                lines ending with spaces
^  *$               (blank) lines containing only spaces
^.*$                the whole line
[A-Z][a-z][a-z]*    a word with a capital letter
It might be difficult to tell from the presentation, but the ones with spaces contain exactly two space characters.
Here are the first and last of those REs used on
cars:
$ grep '^$' cars

$ grep '[A-Z][a-z][a-z]* [A-Z][a-z][a-z]*' cars
Ivan Illich
$
Notice the blank line between the two
grep
commands.
It is the output from the first.
We have only seen the most important features of REs.
For the whole story you will have to read the
man
pages for
regexp
in section five of the manual.
One of
grep's most useful options is
-v
which makes
grep
display lines which do
not
match the RE.
This example:
$ grep -v 'e' cars

Ivan Illich
$
gives us the only two lines of the text that do not contain an
e.
The
-c
option makes
grep
merely count the matching lines.
For instance, this:
$ grep -c '^$' cars
1
$
confirms there is only one blank line in the file.
As you would expect, you can use both options at once:
$ grep -cv 'e' cars
2
$
The example confirms that two lines did not contain an
e.
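The two options combine well with the anchored REs for blank lines. As a rough sketch, the file below (contents invented) has one empty line and one line holding just two spaces:

```shell
# Four lines: "first", an empty line, a line of two spaces, "last line".
dir=$(mktemp -d)
printf 'first\n\n  \nlast line\n' > "$dir/sample"

# Count truly empty lines.
blank=$(grep -c '^$' "$dir/sample")

# Count lines made up only of spaces (at least one space).
spaces=$(grep -c '^  *$' "$dir/sample")

# Count lines that do NOT contain an "e".
no_e=$(grep -cv 'e' "$dir/sample")

echo "blank=$blank spaces=$spaces no_e=$no_e"
```

Here only "last line" contains an e, so the last count covers the other three lines, including the two that look blank.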
We often use
grep
to select some particular lines from many lines of output.
Suppose we want to see if a friend is logged on.
We could just use the
who
command but it might give a hundred or so people.
This will save us time:
who | grep fred
Clearly, it is easier to let
grep
scan for
fred
in the list of logged in users.
Similarly, to count the number of directories in the current
directory we could do
ls -l
and count the lines starting with
d.
However, since the purpose of computers should be to free people from
boring, repetitive tasks we should look for a better solution.
Here is one:
ls -l | grep '^d' | wc -l
You will find many similar uses for
grep.
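Since grep can count for itself, the directory-counting pipeline can also be written without wc. This sketch sets up a scratch directory (the names a, b and file.txt are arbitrary) and runs both forms so the results can be compared:

```shell
# A scratch directory holding two subdirectories and one ordinary file.
dir=$(mktemp -d)
mkdir "$dir/a" "$dir/b"
: > "$dir/file.txt"

# The pipeline from the text: select the lines, then count them.
with_wc=$(ls -l "$dir" | grep '^d' | wc -l)

# The shorter form: let grep do the counting with -c.
with_c=$(ls -l "$dir" | grep -c '^d')

echo "with wc: $with_wc, with -c: $with_c"
```

Both give the same number; some versions of wc pad their output with leading spaces, which is one small reason to prefer the -c form in scripts.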
In the following you are asked to work with a file called
Unix.
Here is the
Unix
file for you to download and use.
Look for lines in the
Unix
file containing "Unix".
Answer
If you are just looking,
grep
is the right tool for the job. So:
grep 'Unix' Unix
Or, if you put screws in with a hammer:
sed -n '/Unix/p' Unix
How many lines in
Unix
contain "Unix"? Do not use
wc. (Hint:
man grep)
Answer
grep -c 'Unix' Unix
The hint meant: find the option by reading the
man
page for
grep. If a logical person was picking an option letter to use
for something to do with counting they would use "c". Therefore,
-c
is the one to look at first in the list of options.
Look for lines in
Unix:
beginning with "U";
Answer
grep '^U' Unix
ending with "x";
Answer
grep 'x$' Unix
containing two "o"s separated by a single character;
Answer
grep 'o.o' Unix
containing two or more consecutive equals signs;
Answer
grep '===*' Unix
Three
=
are needed for two or more equals signs;
==
(without an
*) would work but is less precise.
containing "Unix" or "unix";
Answer
grep '[Uu]nix' Unix
ending with a full stop;
Answer
grep '\.$' Unix
ending with a character that is not a full stop; (This does not include blank lines.)
Answer
grep '[^.]$' Unix
The dot doesn't need to be escaped as characters in the brackets are assumed not to be metacharacters.
not ending with a full stop;
(This includes blank lines. Consult the
man
page to find
an option that makes
grep
behave like
ed's "v" command.)
Answer
grep -v '\.$' Unix
Did you take the hint to check out the
-v
option or did you read through the whole
man
page?
containing upper case letters between "P" and "S".
Answer
grep '[P-S]' Unix
Use
sed
to display the
Unix
file with:
all occurrences of "Unix" changed to "UNIX".
Answer
sed '/Unix/s//UNIX/g' Unix
OR
sed 's/Unix/UNIX/g' Unix
(Note the
g
suffixes -- needed if a line might have more than one "Unix".)
its present line 17 deleted.
Answer
sed '17d' Unix
its present lines 17 to 19 deleted.
Answer
sed '17,19d' Unix
all lines containing "and" deleted.
Answer
sed '/and/d' Unix
all occurrences of "Unix" changed to "UNIX" and its present lines 17 to 19 deleted.
Answer
sed '/Unix/s//UNIX/g
17,19d' Unix
OR (less good):
sed -e 's/Unix/UNIX/g' -e '17,19d' Unix
only line 17 shown.
Answer
sed -n '17p' Unix
its present line 17 displayed twice.
Answer
sed '17p' Unix
Unix has a file called
/etc/passwd
which has one line per user.
The login code (user name) is the first thing on the line and is
immediately followed by a ":". Find your own entry in the file.
Answer
This:
grep '^cmsps:' /etc/passwd
finds my entry; replace
cmsps
with your code or use:
grep "^$USER:" /etc/passwd
http://homepages.shu.ac.uk/~cmsps/unix/regex.html
Last updated: Thursday 05 April 2012 at 17:45