Tuesday, February 23, 2016

Regular Expressions in Stata


  
Community Learning Resource:  Regular Expressions
Andrew Davis
Feb. 24th, 2016
SOC 561

 

A previous blog post by Nadina (http://soc596.blogspot.com/) explored some great examples of how to use these commands in Stata.  This post takes things a bit further with new examples and exercises.  I’ll begin with a review of the basics of “regular expressions” and then move on to the new examples and exercises.

 

 
Using “Regular Expressions” in Stata

 

What is a “regular expression” in general?

 
-A regular expression is a sequence of characters that defines a search pattern.

*The most familiar case is the “find and replace” function in Word.  Many people have probably used a function like this in a Word document. But…

-Word is not alone.  Many programs allow for similar capacities, including Stata.

 

Literal expressions v. regular expressions (Theory of Regular Expressions)

-When to use what?

*Regular expressions are generally more powerful than you need if you just want to find a single, fixed piece of text.  They can expose you to errors where a simple “find and replace” function would have been best.

*Regular expressions are generally good for finding many values that share a similar form, and for finding patterns within data.  This can be very helpful in organizing and manipulating your data.
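
As a rough illustration of that risk in Stata, the sketch below compares a literal replacement (subinstr) with a regular-expression replacement (regexr) on two made-up addresses.  The variable names and values are assumptions chosen for illustration only.

```stata
* Toy data: the variable name and values are invented for illustration
clear
input str30 addr
"12 Main St."
"4 Stone Rd"
end

* Literal find and replace: only the exact text "St." is changed
gen literal = subinstr(addr, "St.", "Street", .)

* Regular expression: the unescaped period matches ANY character,
* so "Sto" in "Stone" also counts as a match
gen viaregex = regexr(addr, "St.", "Street")

list
```

In the second row the regular-expression version comes out as “4 Streetne Rd”, while the literal version leaves “4 Stone Rd” alone.  Escaping the period (writing "St\.") would make the two agree.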

 

Stata

-You use search techniques to find values of variables in a dataset that has been loaded into Stata.

            *You can only use this for “string” variables.

·         String variables store text as their values in Stata, as opposed to numbers


Are my variables “string”?

-You can easily find out which of your variables, if any, are “string” by using the “describe” command in Stata.

 
Below is an example of the use of a “describe” command, as well as output on 8 variables with different storage types in this dataset.  In this case, the variable “country” is stored as a “string” variable.
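
A minimal sketch of that check, using a tiny made-up dataset rather than the one described here (the variable names and values are assumptions):

```stata
* Toy data: variable names and values are invented for illustration
clear
input str30 country int year
"Czech Republic" 1993
"France" 1958
"Dominican Republic" 1966
end

* The "storage type" column shows str# for string variables
describe

* List only the string variables (here: country)
ds, has(type string)
```

In the describe output, country should show a storage type of str30 and year a storage type of int, so only country can be searched with regular expressions.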






 






Above and beyond simple “find and replace” type functions, Stata allows the user to seek out patterns in the data using simple commands and symbols (see table below).


 



Counting

*        Asterisk means “match zero or more” of the preceding expression.
+        Plus sign means “match one or more” of the preceding expression.
?        Question mark means “match either zero or one” of the preceding expression.

Characters

a-z      The dash operator means “match a range of characters or numbers” when used inside square brackets. The “a” and “z” are merely an example. It could also be 0-9, 5-8, F-M, etc.
.        Period means “match any character”.
\        A backslash is used as an escape character to match characters that would otherwise be interpreted as a regular-expression operator.

Anchors

^        When placed at the beginning of a regular expression, the caret means “match expression at beginning of string”. This character can be thought of as an “anchor” character since it does not directly match a character, only the location of the match.
$        When the dollar sign is placed at the end of a regular expression, it means “match expression at end of string”. This is the other anchor character.

Groups

|        The pipe character signifies a logical “or”. It is typically used inside parentheses to separate alternatives, such as (cat|dog).
[ ]      Square brackets denote a set of allowable characters to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters.
( )      Parentheses must be balanced and denote a subexpression group.
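
To make the symbols above concrete, here is a small sketch that applies several of them with regexm to made-up strings.  The variable names and values are assumptions for illustration; each new variable is 1 where the pattern matches and 0 where it does not.

```stata
* Toy data: values are invented for illustration
clear
input str10 s
"abc123"
"123abc"
"a.c"
"color"
"colour"
end

* ^ anchors the match at the start; + means one or more digits
gen starts_digit = regexm(s, "^[0-9]+")

* $ anchors the match at the end
gen ends_digit = regexm(s, "[0-9]$")

* \ escapes the period so it matches a literal "." only
gen has_period = regexm(s, "\.")

* ? makes the preceding "u" optional
gen is_color = regexm(s, "^colou?r$")

* | inside parentheses means "abc" or "123"
gen abc_or_123_first = regexm(s, "^(abc|123)")

list
```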




 

Using the Commands

 
Each function (regexm, regexr, and regexs) indicates to Stata that you would like to use a (re)gular (ex)pression.

 

Regexm: you want Stata to find a match (m). (Is there a phone number?)

 
First, and most basic, is the function “regexm.”  As reviewed under “Theory of Regular Expressions,” regexm should be used to find a pattern within some data.  It returns 1 when the pattern is found and 0 when it is not.

 
The syntax for regexm:

 
gen newvar = regexm(stringvar, "expression")

 
-The key components of this syntax are the function (regexm) and the expression (what you’re asking the function to search for).

-This is pretty intuitive: regexm searches for whatever you are looking for within the variable.
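
Sticking with the phone number idea, here is a minimal sketch of regexm on made-up data.  The variable names, the notes, and the assumed phone format (three digits in parentheses, then three digits, a dash, and four digits) are all assumptions for illustration.

```stata
* Toy data: values are invented for illustration
clear
input str40 note
"call (520)867-5309 after 5"
"no number here"
"office: (212)555-0100"
end

* \( and \) match literal parentheses; each [0-9] matches one digit
gen has_phone = regexm(note, "\([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

list
```

The new variable has_phone is 1 for the first and third notes and 0 for the second.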

 

Regexr: you want Stata to replace (r) the expression. (“Let’s replace those phone numbers”)

 
-You should use “regexr” when you want to replace a portion of a string variable.

 
The syntax for regexr:

 
gen/replace newvar = regexr(stringvar, "expression", "replace")

 
-The key component of this syntax is “regexr,” which tells Stata to replace a portion of a string.

-This “portion of a string” is defined inside the parentheses. Stringvar is the variable in which you want to replace a portion, “expression” is the pattern you’d like to replace in the variable, and the final “replace” is the text you would like to put in its place.  Only the first match in each value is replaced.
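
A minimal sketch of regexr on made-up data, masking phone numbers with a placeholder.  The variable names, the notes, the assumed phone format, and the placeholder text are all assumptions for illustration.

```stata
* Toy data: values are invented for illustration
clear
input str40 note
"call (520)867-5309 after 5"
"no number here"
"office: (212)555-0100"
end

* Replace the first phone-like pattern in each note with a placeholder
gen masked = regexr(note, "\([0-9][0-9][0-9]\)[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]", "<phone>")

list
```

Values with no match are copied over unchanged, so the second note stays as it was.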

 

Regexs: you want Stata to isolate a subsection (s) of a larger string. (“Let’s see those phone numbers: pull them out and put them into a new variable”)

-Like with the above functions, you should only use this expression in the service of seeking out a bona-fide pattern in your data. 

-To use this expression, you must use syntax that combines regexm and regexs.  In general you want to create a new variable that holds the isolated portion of the string.

 
The syntax for regexs:

 
gen newvar = regexs(#) if regexm(stringvar, "(first subexpression)(second subexpression)...(nth subexpression)")

 
-There are several important components of this expression.

* First, the # sign highlighted above identifies the portion of the string you would like to isolate.  For instance, if your phone number was (520)867-5309 and each part was wrapped in its own subexpression, you would use regexs(0) to return the entire phone number, regexs(1) to return (520), regexs(2) to return 867, and regexs(3) to return 5309….(I got it!).




Subexpression #     String returned
0                   1march2014
1                   1
2                   march
3                   2014




*Second, the end of the syntax, regexm(stringvar, "(first subexpression)(second subexpression)...(nth subexpression)"), should be handled carefully, depending on what you want to return.  Each subexpression sits in its own pair of parentheses, and the whole pattern goes inside a single pair of quotation marks.  Please refer carefully to the list of symbols used in regular expressions in Stata before moving forward.


~See the forthcoming example for a good demonstration on using this command.
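
In the meantime, here is a small sketch of the numbering on made-up data: the phone number from the text and the date from the table above.  The variable names and the exact patterns are assumptions for illustration.

```stata
* Toy data: one phone number and one date, as discussed above
clear
input str20 phone str12 date
"(520)867-5309" "1march2014"
end

* Three subexpressions: area code with parentheses, prefix, last four digits
gen whole  = regexs(0) if regexm(phone, "(\([0-9]+\))([0-9]+)-([0-9]+)")
gen area   = regexs(1) if regexm(phone, "(\([0-9]+\))([0-9]+)-([0-9]+)")
gen prefix = regexs(2) if regexm(phone, "(\([0-9]+\))([0-9]+)-([0-9]+)")
gen last4  = regexs(3) if regexm(phone, "(\([0-9]+\))([0-9]+)-([0-9]+)")

* Three subexpressions: day, month name, year
gen day   = regexs(1) if regexm(date, "^([0-9]+)([a-z]+)([0-9]+)$")
gen month = regexs(2) if regexm(date, "^([0-9]+)([a-z]+)([0-9]+)$")
gen yr    = regexs(3) if regexm(date, "^([0-9]+)([a-z]+)([0-9]+)$")

list
```

Note that the dash in the phone number sits outside the parentheses, so it is part of regexs(0) but not of any numbered subexpression.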

 

 

 

Examples:

 
The following will serve as a guide: I will apply each expression type to an example using Stata syntax.  I will be working with the publicly available “Minorities at Risk” dataset, linked here: http://www.cidcm.umd.edu/mar/mar_data.asp#quantitativemar

 
1.      Using regexm:

 
The variable we will be working with is vmar_region

 
Cross-tabulation appears as follows:


-We will use regexm to create an indicator variable that flags all responses from Africa (excluding North Africa and the Middle East).
 
The syntax is as follows:
 
gen africa=regexm(vmar_region, "Sub")
 



 

-You can also use regexm to produce lists. Below is an example from the Minorities at Risk dataset:


 
list country if regexm(country, "Republic") == 1
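
If you do not have the dataset at hand, the two regexm commands above can be tried on a toy stand-in.  The region and country values below are invented for illustration and may not match the actual categories in the Minorities at Risk data.

```stata
* Toy stand-in for the real data: values are invented for illustration
clear
input str30 vmar_region str30 country
"Sub-Saharan Africa" "Central African Republic"
"Sub-Saharan Africa" "Kenya"
"North Africa and Middle East" "Egypt"
"Asia" "Republic of Korea"
end

* 1 if the region contains "Sub", 0 otherwise
gen africa = regexm(vmar_region, "Sub")
tabulate vmar_region africa

* List only the countries whose name contains "Republic"
list country if regexm(country, "Republic") == 1
```

Matching is case sensitive, so searching for "sub" instead of "Sub" would flag nothing in this toy data.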


2.      Using regexr:

 
The variable we will be working with in this example (in the Minorities at Risk dataset) is “autonend” which is a measure of the year in which a minority group lost political autonomy.  A snapshot of cross tabulation of this variable looks like this.


 
…with numbers reaching until the modern day.
 
-We want to replace all values of lost autonomy that occurred in the 1500s or before with a value called: preenlightenment.  This is how you’d go about doing that:
 
Syntax:
 
gen preenlightenment=regexr(autonend, "[0-1][0-5][0-9][0-9]", "preenlightenment")
 
I will now tabulate the variable “preenlightenment,” which should reflect the replacement that was made to these values.  A snapshot of this variable is presented below, reflecting the change.
 
 
 

-As you can see from this snapshot, all values from the 1500s and before are collapsed into the “preenlightenment” value.
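
The same command can be tried on a toy version of “autonend.”  The four year values below are invented for illustration and are not taken from the dataset.

```stata
* Toy stand-in for autonend: values are invented for illustration
clear
input str12 autonend
"1450"
"1580"
"1919"
"1991"
end

* Years 0000 through 1599 are replaced; later years are copied unchanged
gen preenlightenment = regexr(autonend, "[0-1][0-5][0-9][0-9]", "preenlightenment")

tabulate preenlightenment
```

Here the first two values become “preenlightenment,” while 1919 and 1991 stay as they are because their second digit falls outside [0-5].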

 

3.      Using regexs:

 
We will continue to use the “autonend” variable to demonstrate how to effectively use regexs.

 
-As previously discussed, regexs isolates a portion of a string variable.  In this case we would like to isolate the four-digit year from cases that occurred in the 1500s and before, including values where the year is followed by a range or by text.

 
Syntax:

 
gen century = regexs(1) if regexm(autonend, "([0-9][0-5][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")

 
-A tabulation of this output is produced below.  As you can see, values have been isolated from the string variable “autonend” into a variable to which I’ve added the label “16th century measures.”
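
To see what the pattern captures, the command can be run on a toy version of “autonend.”  The values below are invented for illustration and are not taken from the dataset.

```stata
* Toy stand-in for autonend: values are invented for illustration
clear
input str20 autonend
"1450"
"1389-1402"
"1580 approx"
"1919"
end

* regexs(1) returns the part matched by the first set of parentheses
gen century = regexs(1) if regexm(autonend, "([0-9][0-5][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")

list
```

The first three rows return 1450, 1389, and 1580.  The last row has no match, so century is left empty for it.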


 
Exercises:

 
Like much of programming, using regular expressions takes quite a bit of practice to get the hang of.  To get started, here are four exercises that will get you on your way to understanding how to use regular expressions in Stata.

 
Each of these exercises uses data from the “1978 autos” dataset.  To retrieve these data, simply type “sysuse auto” into the Stata Command window.

 

1.       The goal of this exercise is to get you familiar with the regexm function.  Please open the “1978 autos” dataset and complete the following tasks.

a.      List which variables you can use regular expressions on.

b.      List vehicles whose make is “Toyota”

c.       List vehicles whose make is “Pont.”

d.      List vehicles with the letters “VW” in their make

2.      Using the “1978 autos” dataset, use regexs to create a variable that includes only cars with numbers in their name. Provide evidence that your variable does what you want it to do.

3.      Using the “1978 autos” dataset, use regexs to create a variable that includes only VW’s.  Provide evidence that your variable does what you want it to do.

4.      Use regexr to replace VW with Volkswagen.  Provide evidence that your variable does what you want it to do.

 
