Regular Expressions in Stata

Preserving, Reshaping and Transferring Data in Stata

Andrew P. Davis

3/2/2016

This blog-post will focus on preserving, reshaping and transferring data in Stata – skills that are quite common and make data manipulation more efficient and less prone to human error.

I will begin with reviewing the logic behind reshaping data and the Stata commands to achieve this. I will then move to provide information on transferring data using StatTransfer. Finally, as with my previous blogpost http://apdsociology.blogspot.com/2016/02/regular-expressions-in-stata.html I will conclude with several exercises to help you get the hang of preserving and reshaping data.

Preserve and Restore

In Stata, a “preserve” command saves a pre-manipulated version of the data that will remain unadulterated after you conclude your programming.

-For instance if you write a do-file program that includes data transformations, you could use a “preserve” command that would return you data to its original version post-manipulation.

-Intuitively, “restore” commands call back the previously “preserved” data

These commands are called as follows

Syntax:

preserve

(syntax that changes the data in some way)
restore

And are essential to reshaping datasets.

Reshaping Datasets

Data exist in “long” or “wide” formats. An example of a “wide” dataset would be the following matrix:

countryid gdp90 gdp91 gdp92

1. 3 95000 96000 77000

2. 1 60000 50500 51000

3. 2 75000 45400 65800

These data are considered to be “wide” as each variable-year is given its own column. It actually is “wide.”

When we transfer these data to “long” or “narrow” format they look like this:

countryid year gdp

1. 1 90 60000

2. 1 91 50500

3. 1 92 51000

4. 2 90 75000

5. 2 91 45400

6. 2 92 65800

7. 3 90 95000

8. 3 91 96000

9. 3 92 77000

So that there is variable-year, long format data that is produced in which the collated measures from the wide format are broken down by year.

-Year is simply a convenient example, you could still theoretically transform wide data into long data and vice-versa if the data are not listed by year

*Okay. But how do you do reshaping in Stata?

Examples using reshape wide and reshape long:

*These examples will use fabricated data

Syntax:

List

*As you can see, the data are already in “long” format, that is, data are in country-year format

Syntax:

Ex.: reshape long var1, i(id) j(year)

reshape wide GDP, i(Country) j (Year)

list

*As you can see, our reshape has worked, data are in “wide” format.
*But what if we wanted to move our data back into “long” format?

Syntax:

List

Ex.: reshape long var1, i(id) j(year)
reshape long GDP, i(Country) j(Year)

list

Transferring data:

Using StatTransfer products

*At times you will work with colleagues who work with data in a different format that you do (and different than you would like to work with)

*Notably, moving data between two of the more popular statistical packages used in the social sciences, SPSS and Stata is difficult if the data is in SPSS format or in Stata format

StatTransfer allows for the easy movement of data between formats commonly used in major statistical packages

StatTransfer version 13 will move data among the following programs

1-2-3
Microsoft Access (Versions 2.0 through Office XP version)
dBASE (all versions)
Delimited ASCII
Delimited ASCII with a Stat/Transfer SCHEMA file
Data Documentation Initiative (DDI) Schemas
Epi Info
EViewsNew
Excel worksheets (all versions, including Excel 2013)
Fixed format ASCII
FoxPro
GAUSS (Windows and Unix)
GenstatNew
gretl
JMP 10
LIMDEP
MATLAB
MATLAB Seven Datasets
Mineset
Minitab 14 (read only)
MPLUS (write only)
NLOGIT
ODBC data sources (Oracle, Sybase, Informix, etc.)
Open Document Spreadsheets
OSIRIS (read-only)

Paradox
Quattro Pro for DOS and Windows
R
RATS
S-PLUS (now supported through version 7)
SAS CPORT datasets and catalogs (read only)
SAS for Unix—HP, IBM, Sun
SAS for Unix—DEC Alpha
SAS for Windows and OS/2
SAS PC/DOS 6.04 (read only)
SAS Transport
SAS Value Labels
SAS version 7–9
SPSS through version 21
SPSS Datafiles (Windows and Unix)
SPSS Portable Files
Stata (all versions, including 14)
Statistica versions 7–8 (Windows only)
SYSTAT 13
Triple-S Survey Interchange Format

http://www.stata.com/products/stat-transfer/

Stat/Transfer has a very user-friendly interface, details on transferring data can be found here:

http://www.ats.ucla.edu/stat/st/default.htm

Exercises

Using the “City Temperature” dataset (sysuse citytemp) please complete the following tasks and provide evidence using commands that you have reshaped the data.

1.) “List” the data. Is it in long or wide format?

2.) Preserve the data

3.) Reshape the data from long to wide format

4.) Re-reshape the data from wide to long format

Open your own dataset, play around with this a bit but be sure to use the “preserve” command before reshaping data.

-Is your data in long or wide format? How do you know?

-Reshape the data in the direction you’d like

-What would be the benefit or reshaping the data?

Community Learning Resource: Regular Expressions
Andrew Davis
Feb. 24^th, 2016
SOC 561

A previous blogpost by Nadina: http://soc596.blogspot.com/ explored some great examples of how to use these commands in Stata. This post will take things a bit further with new examples and exercises. I’ll begin with a review of the basics of “regular expressions” and then move to the new examples and exercises.

Using “Regular Expressions” in Stata

What is a “regular expression” in general?

-A regular expression is a sequence of characters that define a search pattern.

*Commonly used in word in terms of “find and replace” function. Many people have probably used a function such as this in a word document. But…

-Word is not alone. Many programs allow for similar capacities, including Stata.

Literal expressions v. regular expressions (Theory of Regular Expressions)

-When to use what?

*Regular expressions are generally too powerful if you just want to seek out a single concept. They can open you to risk of error when a simple “find and replace” function might have been best.

*Regular expressions are generally good for searching out multiple variables with a similar concept (factors), and patterns within data. This can be very helpful in organizing and manipulating your data.

Stata

-You use search techniques to find values of variables in a dataset that is brought into Stata.

*You can only use this for “string” variables.

· String variables have words as values in Stata, as opposed to numbers

Are my variables “string?”

-You can easily find out if your variables, or which variables are “string” using the “describe” command in stata.

Below is an example of the use of a “describe” command, as well as output on 8 variables with different storage types in this dataset. In this case, the variable “country” is stored as a “string” variable.

Above and beyond simple “find and replace” type functions, Stata allows the user to seek out patterns in the data using simple commands and symbols (see table below).

Counting
*	Asterisk means “match zero or more” of the preceding expression.
+	Plus sign means “match one or more” of the preceding expression.
?	Question mark means “match either zero or one” of the preceding expression.
Characters
a–z	The dash operator means “match a range of characters or numbers”. The “a” and “z” are merely an example. It could also be 0–9, 5–8, F–M, etc.
.	Period means “match any character”.
\	A backslash is used as an escape character to match characters that would otherwise be interpreted as a regular-expression operator.
Anchors
^	When placed at the beginning of a regular expression, the caret means “match expression at beginning of string”. This character can be thought of as an “anchor” character since it does not directly match a character, only the location of the match.
$	When the dollar sign is placed at the end of a regular expression, it means “match expression at end of string”. This is the other anchor character.
Groups
\|	The pipe character signifies a logical “or” that is often used in character sets (see square brackets below).
[ ]	Square brackets denote a set of allowable characters/expressions to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters.
( )	Parentheses must match and denote a subexpression group.

Source: http://www.stata.com/support/faqs/data-management/regular-expressions/

Using the Commands

Each command (regexm, regexr and regexs) indicate to Stata that you would like to use a (re)gular (ex)pression.

Regexm: you want Stata to find a match (m). (Is there a phone number?)

First, and most basic is the command “regexm.” As reviewed under “theory of regular expressions” regexm should be used to find a pattern within some data.

The syntax for regexm:

gen newvar = regexm (stringvar, expression)

-The key components of this expression are the function (the regexm command) and the expression (what you’re asking the function to search for).

-This is pretty intuitive, regexm searches for whatever you are looking for within the variable.

Regexr: you want Stata to replace (r) the expression. (“Let’s replace those phone numbers”)

-You should use “regexr” when you want to replace a portion of a string variable.

The syntax for regexr:

gen/replace newvar = regexr (stringvar, “expression”, “replace”)

-The key components of this syntax are the “regexr” which commands Stata to replace a portion of a string.

-This “portion of a string” can be found in the parentheses. Stringvar locates the variable of which you want to replace a portion, “expression” refers to what you’d like to replace in the variable, and the final “replace” refers to what you would like to put in the place.

Regexs: you want Stata to isolate a subsection (s) of a larger string. (“Let’s see those phone numbers, pull ‘take’ them out and put them into a new variable”)

-Like with the above functions, you should only use this expression in the service of seeking out a bona-fide pattern in your data.

-To use this expression, you must use syntax that combines regexm and regexs. In general you want to create a new variable that is the isolate of the string.

The syntax for regexs:

gen newvar =regexs(#) if regexm(stringvar, ("first subexpression") ("second subexpression")...("nth subexpression"))

-There are several important components of this expression.

* First, the # sign as highlighted above represents the portion of the string you would like to isolate. For instance, if your phone number was (520)867-5309 you would use regexs (0) to return the entire phone number, regexs (1) to return (520), regexs (2) to return 867, and regexs (3) to return 5309….(I got it!).

Subexpression #	String returned
0	1march2014
1	1
2	march
3	2014

*Second, the end of the syntax “regexm(stringvar, ("first subexpression") ("second subexpression")...("nth subexpression"))” should be handled carefully, depending on what you are wanting to return. Please refer carefully to the list of symbols used in regular expressions in Stata before moving forward.

~See the forthcoming example for a good demonstration on using this command.

Examples:

The following will serve as a guide, I will apply each expression type to an example using Stata syntax. I will be working with the publicly available “Minorities at Risk” dataset, linked here: http://www.cidcm.umd.edu/mar/mar_data.asp#quantitativemar

1. Using regexm:

The variable we will be working with is vmar_region

Cross-tabulation appears as follows:

-We will use regexm to create a variable that combines all responses from Africa (excluding North Africa and the Middle East).

The syntax is as follows:

gen africa=regexm(vmar_region, "Sub")

-You can also use regexm to produce lists, below is an example from the Minorities at Risk dataset:

list country if regexm(country, "Republic") == 1

2. Using regexr:

The variable we will be working with in this example (in the Minorities at Risk dataset) is “autonend” which is a measure of the year in which a minority group lost political autonomy. A snapshot of cross tabulation of this variable looks like this.

…with numbers reaching until the modern day.

-We want to replace all values of lost autonomy that occurred in the 1500’s or before with a value called: preenlightenment. This is how you’d go about doing that:

Syntax:

gen preenlightenment=regexr(autonend, "[0-1][0-5][0-9][0-9]", "preenlightenment")

I will now tabulate the variable “preenlightenment” which should reflect the replace change that was made to these values. A snapshot of this variable is presented below, reflecting the change.

-As you can see from this snapshot, all values from the 1500’s and before are collapsed into the “preenlightenment” value.

3. Using regexs:

We will continue to use the “autonend” variable to demonstrate how to effectively use regexs.

-As previously discussed, regexs isolates a portion of a string variable. In this case we would like to isolate cases that occurred in the 1500s and before or have text associated with them, such as the value “15^th century”.

Syntax:

gen century = regexs(1) if regexm(autonend, "([0-9][0-5][0-9][0-9])[\-]*[0-9]*[ a-zA-Z]*$")

-A tabulation of this output is produced below. As you can see, values have been isolated from the string variable “autonend” into a variable to which I’ve added the label “16^th century measures.”

Exercises:

Like much programming, using regular expressions take quite a bit of practice to get the hang of. To get started here are four exercises that will get you on your way to understanding how to use regular expressions in Stata.

Each of these exercises use data from the “1978 autos” dataset. To retrieve these data, simply type “sysuse auto” into your Stata browser.

1. The goal of this exercise is to get you familiar with the regexm command. Please open the “1978 autos” and complete the following tasks.

a. List which variables you can use regular expressions on.

b. List vehicles whose make is “Toyota”

c. List vehicles whose make is “Pont.”

d. List vehicles with the letters “VW” in their make

2. Using the “1978 autos” dataset, use regexs to create a variable that includes only cars with numbers in their name. Provide evidence that your variable does what you want it to do.

3. Using the “1978 autos” dataset, use regexs to create a variable that includes only VW’s. Provide evidence that your variable does what you want it to do.

4. Use regexr to replace VW with Volkswagon. Provide evidence that your variable does what you want it to do.

Regular Expressions in Stata

Wednesday, March 2, 2016

Reshaping and preserving data in Stata

Tuesday, February 23, 2016

Regular Expressions in Stata