Saturday, November 23, 2013

Regular Expressions using Python

In this short guide, we shall introduce ourselves to a powerful technique called 'Regular Expressions' (a.k.a. regexp) in the context of a fascinating language, Python. I chose to introduce regexp with Python because its syntax is concise and easy to follow, so we do not drift away from our main focus on regexp. Further, we can experiment with regexp right away in the Python interactive shell.

Many languages, including C++ (since C++11), Java, Perl and Ruby, support regexp in some form. There can be minor differences in their syntax, but what we describe here should mostly apply to all of them to some extent.

Well, what is regexp and what is it used for? Simply put, a regexp is a compact way of describing a text pattern, and it is used to find matching patterns in strings. This is useful in many contexts. Say you want to find all the files in a directory that have your name; regexp can help you do that. Going further, if you want only the files that have your name at the beginning, regexp lets you specify that pattern in a very short string. Or you may want to scan through a log file for occurrences of certain string patterns. These are just a few useful applications of regexp.
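
As a small illustration of the log-scanning case, here is a minimal sketch that keeps only the lines beginning with the word ERROR. The log lines and the names log_lines and error_lines are made up for this example; they do not come from any real system.

```python
import re

# Hypothetical log lines, invented for this illustration
log_lines = [
    "INFO  server started",
    "ERROR disk full",
    "WARN  retrying after ERROR",
    "ERROR connection lost",
]

# Keep only the lines that start with the word ERROR
error_lines = [line for line in log_lines if re.search(r"^ERROR", line)]
print(error_lines)
```

The third line mentions ERROR too, but not at the beginning, so the ^ in the pattern leaves it out.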

Following are some of the basic rules of regexp.
  • A string matches itself. e.g. abc would match in the string xyzabcqrs
  • . (period) is a general wildcard for a single character (any character except a newline, by default). e.g. AB.x would match in AByx, ABdx and even AB.x
  • What if you want to match AB.x and not others like AByx or ABfx? The savior is the escape character '\'. e.g. AB\.x would match only AB.x  Note: In Python and in many other languages, \ also introduces escape sequences in ordinary strings, such as \n for newline and \t for tab. To avoid clashes with that, we may write our regexp as a raw string, in which \ is not treated as the start of an escape sequence. Raw strings are marked with a preceding character 'r', as we'll see shortly in the examples that follow.
  • ^ is the character that says to look at the beginning of the text. But within square brackets, if it comes first, it negates the set of characters to match (see below)
  • $ is the character that says to look at the end of the string
  • Since . (period) represents a general wildcard, .* says any number of such wildcards, including none (simply, zero or more)
  • .+ says one or more characters
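
The rules above can be tried out in one go. The following is only a sketch: the checks table and the sample strings are chosen for illustration, and each row records whether we expect re.search to find the pattern in the text.

```python
import re

# Each entry: (pattern, text, whether we expect a match)
checks = [
    (r"abc",    "xyzabcqrs", True),   # a plain string matches itself
    (r"AB.x",   "AByx",      True),   # . matches any single character
    (r"AB\.x",  "AByx",      False),  # \. matches only a literal period
    (r"AB\.x",  "AB.x",      True),
    (r"^abc",   "xyzabc",    False),  # ^ anchors at the beginning
    (r"abc$",   "xyzabc",    True),   # $ anchors at the end
    (r"a.*c",   "ac",        True),   # .* allows zero characters in between
    (r"a.+c",   "ac",        False),  # .+ needs at least one character
    (r"a[^b]c", "abc",       False),  # [^b] is any character except b
]

# True where the pattern was found somewhere in the text
results = [bool(re.search(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), found in zip(checks, results):
    print("%-8s %-10s %s" % (pattern, text, found))

# A raw string keeps the backslash as a character of its own,
# while an ordinary string turns \t into a single tab character
raw_length = len(r"\t")
plain_length = len("\t")
```

Here raw_length is 2 (a backslash and a t) while plain_length is 1 (a tab), which is why the patterns are written as raw strings.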
We are not going to write any Python scripts in files in this tutorial. Rather, we will use the Python interactive shell on Linux. On Windows, you may use IDLE (a simple Python IDE) provided with the Python installer package.

The following session shows some of the above rules in action. Descriptions are given inline as comments.
[shazni@wso2-ThinkPad-T530 ~]$ python
Python 2.7.4 (default, Sep 26 2013, 03:20:26) 
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> # Import Python's regexp module, called re. Note that this line is a comment
>>> import re

>>> # Defining a tuple of strings
>>> str=('sss', '#$%ssss^&*', 'as.su', 'q_%sssss', 'a\nut', '12sssTA', '^Atssy$')

>>> # The following finds all the strings containing the character sequence 'sss'
>>> a=filter((lambda str: re.search(r"sss", str)), str)
>>> print(a)
('sss', '#$%ssss^&*', 'q_%sssss', '12sssTA')

>>> # Note that the following uses 'match' instead of 'search'. This will only find strings beginning with 'sss'
>>> b=filter((lambda str: re.match(r"sss", str)), str)
>>> print(b)
('sss',)

>>> # We may also use the following to achieve the same. Here the character ^ tells the pattern to match only at the beginning of the string
>>> z=filter((lambda str: re.search(r"^sss", str)), str)
>>> print(z)
('sss',)

>>> # The following matches 'sss' only at the end. The $ special character does the trick
>>> y=filter((lambda str: re.search(r"sss$", str)), str)
>>> print(y)
('sss', 'q_%sssss')

>>> # Period (.) is a general wildcard for a single character. e.g. AB.x would match in AByx, ABdx and even AB.x. Therefore, the following returns all the strings having an 's', then any one character, and then another 's'
>>> c=filter((lambda str: re.search(r"s.s", str)), str)
>>> print(c)
('sss', '#$%ssss^&*', 'as.su', 'q_%sssss', '12sssTA')

>>> # Matches any string containing the literal text s.s (the period is escaped, so it is no longer a wildcard)
>>> d=filter((lambda str: re.search(r"s\.s", str)), str)
>>> print(d)
('as.su',)

>>> # Matches any string having zero or more characters in between two 's' characters
>>> e=filter((lambda str: re.search(r"s.*s", str)), str)
>>> print(e)
('sss', '#$%ssss^&*', 'as.su', 'q_%sssss', '12sssTA', '^Atssy$')

>>> # Matches any string having one or more characters in between two 's' characters
>>> f=filter((lambda str: re.search(r"s.+s", str)), str)
>>> print(f)
('sss', '#$%ssss^&*', 'as.su', 'q_%sssss', '12sssTA')

>>> # The following matches anything having at least one 't'
>>> g=filter((lambda str: re.search(r"t+", str)), str)
>>> print(g)
('a\nut', '^Atssy$')

>>> # The following matches an 's' character followed by any one of the characters listed in the square brackets (here a literal '.' or a 'y')
>>> t=filter((lambda str: re.search(r"s[.y]", str)), str)
>>> print(t)
('as.su', '^Atssy$')

>>> # The following matches an 's' character, followed by any character which is not '.' or 'y', and then followed by 'T'. The thing to note here is the ^ character inside the square brackets. It says anything not listed in []
>>> u=filter((lambda str: re.search(r"s[^.y]T", str)), str)
>>> print(u)
('12sssTA',)

>>> # The following matches all the strings that don't contain an 's'. The ^ at the beginning anchors the start of the string and $ anchors the end. In between, [^s] matches any character that is not 's', and * allows any number of them. Together, this matches the strings that contain no 's' from beginning to end
>>> v=filter((lambda str: re.search(r"^[^s]*$", str)), str)
>>> print(v)
('a\nut',)

>>> # OK, now how do we match a literal ^ character at the beginning? The escape character comes to the rescue
>>> w=filter((lambda str: re.search(r"^\^", str)), str)
>>> print(w)
('^Atssy$',)

>>> # The following is a bit tricky. It matches all the strings having a ^, excluding the ones that have ^ at the beginning. Note that l here is a different tuple of strings, whose definition is not shown in this session
>>> j=filter((lambda str: re.search(r"^[^\^].*\^", str)), l)
>>> print(j)
('at^yO',)
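
The session above was recorded on Python 2.7, where filter applied to a tuple returns a tuple. On Python 3, filter returns a lazy iterator instead, so printing it directly shows a filter object rather than the matches. Below is a sketch of the first two searches in a form that should behave the same on both versions, using list comprehensions. The names strings, anywhere and at_start are my own choices for this example; strings replaces str so that the built-in type is not shadowed.

```python
import re

# Same test strings as in the session, under a name that does not
# shadow the built-in str type (the rename is an assumption of this sketch)
strings = ('sss', '#$%ssss^&*', 'as.su', 'q_%sssss', 'a\nut', '12sssTA', '^Atssy$')

# re.search looks for the pattern anywhere in the string
anywhere = [s for s in strings if re.search(r"sss", s)]

# re.match only looks at the beginning of the string
at_start = [s for s in strings if re.match(r"sss", s)]

print(anywhere)
print(at_start)
```

The results are lists rather than tuples, but they hold the same strings as the tuples a and b in the session.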
OK. That's it for now. This guide is by no means complete; it is just a quick start. Now you may explore the regexp capabilities of various languages, particularly Python.
