Community Learning Resource: Regular Expressions
Andrew Davis
Feb. 24th, 2016
SOC 561
A previous blog post by Nadina (http://soc596.blogspot.com/) explored some great examples of how to use these commands in Stata. This post takes things a bit further with new examples and exercises. I’ll begin with a review of the basics of “regular expressions” and then move on to the new examples and exercises.
What is a “regular expression” in general?
*Regular expressions are most commonly encountered as a kind of “find and replace” function. Many people have probably used a function like this in a Word document. But…
-Word is not alone. Many programs offer similar capabilities, including Stata.
Literal expressions v. regular expressions (Theory of Regular Expressions)
-When to use what?
*Regular expressions are generally more powerful than you need if you just want to seek out a single concept. They can open you up to the risk of error when a simple “find and replace” function would have been best.
*Regular expressions are generally good for searching out multiple values that share a similar concept (factors), and for finding patterns within data. This can be very helpful in organizing and manipulating your data.
Stata
-You use search techniques to find values of variables in a dataset that has been brought into Stata.
*You can only use this for “string” variables.
· String variables have text as values in Stata, as opposed to numbers.
-You can easily find out which of your variables are “string” by using the “describe” command in Stata.
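As a minimal sketch of that check, the lines below build a tiny toy dataset (the variable names city and year are made up for illustration) and then ask Stata which variables are strings. The ds line is an optional extra that lists only the string variables.

```stata
* Hypothetical toy data: one string variable and one numeric variable
clear
input str10 city int year
"Tucson"  2016
"Phoenix" 2015
end

* The "storage type" column shows str10 for city and int for year
describe

* Optional: list only the string variables (here, city)
ds, has(type string)
```

Any variable whose storage type starts with “str” can be searched with regular expressions.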
Above and beyond simple “find and replace” type functions, Stata allows the user to seek out patterns in the data using simple commands and symbols (see table below).
| Counting | |
| * | Asterisk means “match zero or more” of the preceding expression. |
| + | Plus sign means “match one or more” of the preceding expression. |
| ? | Question mark means “match either zero or one” of the preceding expression. |
| Characters | |
| a-z | The dash operator means “match a range of characters or numbers”. The “a” and “z” are merely an example. It could also be 0-9, 5-8, F-M, etc. |
| . | Period means “match any character”. |
| \ | A backslash is used as an escape character to match characters that would otherwise be interpreted as a regular-expression operator. |
| Anchors | |
| ^ | When placed at the beginning of a regular expression, the caret means “match expression at beginning of string”. This character can be thought of as an “anchor” character since it does not directly match a character, only the location of the match. |
| $ | When the dollar sign is placed at the end of a regular expression, it means “match expression at end of string”. This is the other anchor character. |
| Groups | |
| \| | The pipe character signifies a logical “or” that is often used in character sets (see square brackets below). |
| [ ] | Square brackets denote a set of allowable characters/expressions to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters. |
| ( ) | Parentheses must match and denote a subexpression group. |
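To get a feel for the symbols in the table, you can try them on literal strings with display before touching any data. This is only a sketch: the test strings are made up, and each line prints 1 when the pattern matches and 0 when it does not.

```stata
* ? makes the preceding "u" optional, so both spellings match (each displays 1)
display regexm("colour", "colou?r")
display regexm("color", "colou?r")

* ^ and $ anchor the match: letters followed by digits, nothing else (displays 1)
display regexm("abc123", "^[a-z]+[0-9]+$")

* No digit anywhere in the string (displays 0)
display regexm("abc", "[0-9]")

* \. matches a literal period rather than "any character"
* (the first line displays 1, the second displays 0)
display regexm("3.14", "^[0-9]+\.[0-9]+$")
display regexm("3x14", "^[0-9]+\.[0-9]+$")

* The pipe inside parentheses means "cat or dog" (displays 1)
display regexm("cat", "^(cat|dog)$")
```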
Using the Commands
regexm: you want Stata to find a match (m). (“Is there a phone number?”)
-This is pretty intuitive: regexm searches for whatever you are looking for within the variable, and returns 1 if it finds a match and 0 if it does not.
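Here is a rough sketch of the “is there a phone number?” question. The variable contact and its values are hypothetical, and the pattern assumes numbers are written in the form (520)867-5309.

```stata
* Hypothetical toy data: free-text contact notes
clear
input str30 contact
"Call (520)867-5309 today"
"No number listed"
end

* has_phone = 1 if the text contains (ddd)ddd-dddd, 0 otherwise.
* The parentheses are escaped with \ so they are matched literally.
gen byte has_phone = regexm(contact, "\([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
list
```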
regexr: you want Stata to replace (r) the expression. (“Let’s replace those phone numbers”)
-The general form is regexr(stringvar, "expression", "replace"), and the “portion of a string” to be replaced is specified inside the parentheses. stringvar names the variable in which you want to replace a portion, “expression” is the pattern you’d like to replace in the variable, and the final “replace” is what you would like to put in its place.
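A minimal sketch of “let’s replace those phone numbers”, again with a hypothetical contact variable and an assumed (520)867-5309 number format. Note that regexr replaces only the first match in each string and returns the string unchanged when nothing matches.

```stata
* Hypothetical toy data: free-text contact notes
clear
input str30 contact
"Call (520)867-5309 today"
"No number listed"
end

* Swap the first (ddd)ddd-dddd found in each string for the word REDACTED
gen str30 masked = regexr(contact, "\([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]", "REDACTED")
list
```

Writing the result to a new variable rather than overwriting contact makes it easy to compare before and after.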
regexs: you want Stata to isolate a subsection (s) of a larger string. (“Let’s see those phone numbers, pull them out, and put them into a new variable”)
-As with the above functions, you should only use this function in the service of seeking out a bona fide pattern in your data.
-To use this function, you must use syntax that combines regexm and regexs. In general, you want to create a new variable that holds the isolated portion of the string.
* First, the # in regexs(#) represents the portion of the string you would like to isolate. For instance, if your phone number was (520)867-5309 and your expression had one subexpression for each part, you would use regexs(0) to return the entire phone number, regexs(1) to return (520), regexs(2) to return 867, and regexs(3) to return 5309… (I got it!).
| Subexpression # | String returned |
| 0 | 1march2014 |
| 1 | 1 |
| 2 | march |
| 3 | 2014 |
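The table above can be reproduced with a short sketch. The variable name datestr, the second observation, and the names of the new variables are made up for illustration. The pattern assumes every value is digits, then lowercase letters, then digits.

```stata
* Hypothetical toy data holding dates as strings
clear
input str12 datestr
"1march2014"
"15april2015"
end

* Three parenthesized subexpressions: (day)(month)(year).
* regexm does the matching; regexs(#) then returns subexpression #.
gen str12 whole     = regexs(0) if regexm(datestr, "^([0-9]+)([a-z]+)([0-9]+)$")
gen str2  day_str   = regexs(1) if regexm(datestr, "^([0-9]+)([a-z]+)([0-9]+)$")
gen str9  month_str = regexs(2) if regexm(datestr, "^([0-9]+)([a-z]+)([0-9]+)$")
gen str4  year_str  = regexs(3) if regexm(datestr, "^([0-9]+)([a-z]+)([0-9]+)$")
list
```

Because the if condition runs regexm first, regexs always refers to the match that was just made for that observation.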
*Second, the expression at the end of the syntax “regexm(stringvar, "(first subexpression)(second subexpression)...(nth subexpression)")” should be handled carefully, depending on what you want to return. Please refer carefully to the list of symbols used in regular expressions in Stata before moving forward.
~See the forthcoming example for a good demonstration of using this command.
Examples:
1. Using regexm:
-We will use regexm to create a variable that flags all responses from Africa (excluding North Africa and the Middle East).
The syntax is as follows:
gen africa=regexm(vmar_region, "Sub")
-You can also use regexm to produce lists. Below is an example from the Minorities at Risk dataset:
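The sketch below mimics both uses on a few hypothetical rows rather than the actual Minorities at Risk file. Only the variable name vmar_region comes from the syntax above. The region labels are assumptions made for illustration.

```stata
* Hypothetical stand-in for the Minorities at Risk data (region labels are assumed)
clear
input str30 vmar_region
"Sub-Saharan Africa"
"North Africa and Middle East"
"Asia"
end

* africa = 1 when the region contains "Sub", 0 otherwise
gen africa = regexm(vmar_region, "Sub")

* List only the observations whose region matches
list vmar_region if regexm(vmar_region, "Sub")
```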
2. Using regexr:
…with numbers reaching until the modern day.
-We want to replace all values of lost autonomy that occurred in the 1500s or before with a value called “preenlightenment”. This is how you’d go about doing that:
Syntax:
gen preenlightenment=regexr(autonend, "[0-1][0-5][0-9][0-9]", "preenlightenment")
I will now tabulate the variable “preenlightenment”, which should reflect the replacement that was made to these values.
A snapshot of this variable is presented below, reflecting the change.
-As you can see from this snapshot, all values from the 1500s and before are collapsed into the “preenlightenment” value.
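To see what this pattern does without the original data, here is a small sketch. The four years are made up, and autonend is assumed to be a string variable holding four-digit years. The pattern matches any four digits from 0000 to 1599, so 1600 and later are left alone.

```stata
* Hypothetical toy data: years stored as strings
clear
input str4 autonend
"1450"
"1599"
"1600"
"1950"
end

* Same expression as above; values that do not match are returned unchanged
gen str20 preenlightenment = regexr(autonend, "[0-1][0-5][0-9][0-9]", "preenlightenment")
tabulate preenlightenment
```

The pattern has no ^ or $ anchors, so it would also match four such digits inside a longer string. A year written with fewer than four digits would not match at all.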
3. Using regexs:
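As one possible illustration, the sketch below pulls apart the phone number used earlier in this post. The variable names are hypothetical, and the pattern assumes numbers are written exactly as (520)867-5309. The escaped \( and \) match the literal parentheses, while the unescaped ones mark the three subexpressions.

```stata
* Hypothetical toy data: one phone number stored as a string
clear
input str13 phone
"(520)867-5309"
end

* regexs(0) is the whole match; regexs(1) to regexs(3) are the three groups
gen str13 whole   = regexs(0) if regexm(phone, "^(\([0-9]+\))([0-9]+)-([0-9]+)$")
gen str5  area    = regexs(1) if regexm(phone, "^(\([0-9]+\))([0-9]+)-([0-9]+)$")
gen str3  prefix  = regexs(2) if regexm(phone, "^(\([0-9]+\))([0-9]+)-([0-9]+)$")
gen str4  linenum = regexs(3) if regexm(phone, "^(\([0-9]+\))([0-9]+)-([0-9]+)$")
list
```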
Exercises:
Like much programming, using regular expressions takes quite a bit of practice to get the hang of. To get started, here are four exercises that will get you on your way to understanding how to use regular expressions in Stata.
Each of these exercises uses data from the “1978 autos” dataset. To retrieve these data, simply type “sysuse auto” into the Stata Command window.
1. The goal of this exercise is to get you familiar with the regexm command. Please open the “1978 autos” dataset and complete the following tasks.
a. List which variables you can use regular expressions on.
b. List vehicles whose make is “Toyota”
c. List vehicles whose make is “Pont.”
d. List vehicles with the letters “VW” in their make
2. Using the “1978 autos” dataset, use regexs to create a variable that includes only cars with numbers in their name. Provide evidence that your variable does what you want it to do.
3. Using the “1978 autos” dataset, use regexs to create a variable that includes only VWs. Provide evidence that your variable does what you want it to do.
4. Use regexr to replace VW with Volkswagen. Provide evidence that your variable does what you want it to do.