reshape
Panel data can, as described in another post, be structured in a "long" or a "wide" way, depending on the question asked. Below are examples of what the two structures look like.
In this guide we will cover how to transform data between the two formats with the command reshape.
Wide data:
country | population2000 | population2001 | population2002 |
---|---|---|---|
Sweden | 8872284 | 8888675 | 8911899 |
Norway | 4491572 | 4514907 | 4537240 |
Long data:
country | year | population |
---|---|---|
Sweden | 2000 | 8872284 |
Sweden | 2001 | 8888675 |
Sweden | 2002 | 8911899 |
Norway | 2000 | 4491572 |
Norway | 2001 | 4514907 |
Norway | 2002 | 4537240 |
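To make the two layouts concrete, here is a minimal sketch that types in the wide table above by hand and converts it to long format and back. It only assumes the variable names shown in the tables; nothing is loaded from an external file.

```stata
* Toy data copied from the wide table above
clear
input str10 country long population2000 long population2001 long population2002
"Sweden" 8872284 8888675 8911899
"Norway" 4491572 4514907 4537240
end

* Wide to long: the stem is "population", the new time variable is "year"
reshape long population, i(country) j(year)
assert _N == 6          // 2 countries x 3 years
list, sepby(country)

* Long back to wide: one row per country again
reshape wide population, i(country) j(year)
assert _N == 2
list
```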
First we load the data we will work with: the QoG Institute's time-series cross-section data. It holds data on the countries of the world, over time. To make it a little easier to grasp, we will throw away most of the variables. We will only keep the variables "cname", "year", "fh_status" (democracy status) and "unna_pop" (population), and only the years 2010 and 2011. We do this with the command keep, first with a variable list and then with an if condition.
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_ts_jan18.dta", clear
keep cname year fh_status unna_pop
keep if year==2010 | year==2011
If we look at the first six rows of the dataset (using list) we can see that each country is included twice, once for 2010 and once for 2011. Each variable is only included once. The data is in "long" format.
list in 1/6
reshape command from long to wide
Now we run the reshape command. We need to specify four things: the direction of the transformation (here wide), the variables to reshape, the panel unit variable and the time variable.
The first two pieces of information are included before the comma, and the other two in the options i() (the panel unit) and j() (time). It looks like this:
reshape wide fh_status-unna_pop, i(cname) j(year)
Stata shows us a little status report with the result: the number of observations has been cut in half, just as expected. Previously each country was included twice; now it is only included once. The number of variables has, however, increased. The variable "year" is gone (it is now reflected in the other variable names): "fh_status" has turned into "fh_status2010" and "fh_status2011", while "unna_pop" has turned into "unna_pop2010" and "unna_pop2011". We can again look at the first six rows to make sure that it worked:
list in 1/6
If the command doesn't work, one possible problem is that there are duplicate values in the panel and time variables. We then get the error message "values of variable year not unique within cname". Use the command duplicates list to find these duplicate values and figure out what went wrong:
duplicates list cname year
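As an illustration of that check, here is a small sketch on a hypothetical toy dataset (the population values are made up, and Sweden 2010 is entered twice on purpose). It uses duplicates report, duplicates list, duplicates drop and isid; whether dropping is the right fix depends on why the duplicates exist in your own data.

```stata
* Hypothetical toy data with one deliberate duplicate row (values are made up)
clear
input str10 cname int year long unna_pop
"Sweden" 2010 100
"Sweden" 2010 100
"Sweden" 2011 110
"Norway" 2010 50
"Norway" 2011 55
end

* Summarize how many cname-year combinations occur more than once
duplicates report cname year

* Show the offending rows
duplicates list cname year

* Here the extra row is an exact copy in every variable, so it can be dropped
duplicates drop

* isid stops with an error unless cname and year uniquely identify the rows
isid cname year

* With unique cname-year pairs, the reshape now goes through
reshape wide unna_pop, i(cname) j(year)
```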
We have now created a "wide" dataset. If we want to turn it back to long format, we can do so with reshape as well. This might actually be the more common case when collecting data from various official sources.
For this to work, the variables must be named in a specific way. All variables that show the same information need to have the same name, but with different numbers at the end. They are supposed to be named (for instance) "unna_pop2010", "unna_pop2011", "unna_pop2012". With the syntax used in this guide they cannot be called "unna2010pop" and "unna2011pop": the numbers have to be at the end for Stata to understand which variables belong together.
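If the variables arrive with the year in the middle of the name, one option is to rename them first. The sketch below assumes a hypothetical toy dataset with made-up values and the badly named variables "unna2010pop" and "unna2011pop"; a simple loop moves the year to the end so that the stem becomes "unna_pop".

```stata
* Hypothetical toy data with the year in the middle of the name (values are made up)
clear
input str10 cname long unna2010pop long unna2011pop
"Sweden" 100 110
"Norway" 50 55
end

* Move the year to the end of each variable name
foreach y of numlist 2010/2011 {
    rename unna`y'pop unna_pop`y'
}

* The variables now share the stem "unna_pop", so reshape long can find them
reshape long unna_pop, i(cname) j(year)
list, sepby(cname)
```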
The command is very similar to what we did previously, but not identical. First we write reshape long to indicate that we want to create a long dataset. Then we write the names of the variables we want to reshape, but only the "stem", not the time. Stata will then look for these variable name stems in the variable list. If we have variables that are called "fh_status2010" and "fh_status2011" we only write "fh_status".
In the option i() we state the name of the panel unit variable, and in the option j() we name the time variable. But this time there is no existing time variable! We set its name in this command. We will call the variable that shows the year "year". The full command is thus:
reshape long fh_status unna_pop, i(cname) j(year)
The number of observations doubles, from 211 to 422, because there were two years. Previously the years were spread out over columns; now they are spread out over more observations instead. There are now only four variables, down from five, and Stata has also created a new variable: "year". The information in "fh_status2010" and "fh_status2011" is now contained in the variable "fh_status", and the information in "unna_pop2010" and "unna_pop2011" is now in the variable "unna_pop". Let's look at the first six rows:
list in 1/6
reshape can be used to transform data to and from long and wide format.
Going from long to wide:
reshape wide variable_name, i(panel_variable) j(time_variable)
Going from wide to long:
reshape long variable_name, i(panel_variable) j(new_time_variable)
For most (but not all) statistical analyses, long format is better, since we can then set the panel structure with xtset and do analyses that take time into account.
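As a rough sketch of that last step, the example below sets the panel structure on a hypothetical long toy dataset (made-up values). It assumes the panel variable is a string, as country names typically are; xtset requires a numeric panel variable, so the sketch first creates one with encode. The names "country_id" and "pop_change" are chosen for illustration only.

```stata
* Hypothetical long toy data (values are made up)
clear
input str10 cname int year long unna_pop
"Sweden" 2010 100
"Sweden" 2011 110
"Norway" 2010 50
"Norway" 2011 55
end

* xtset needs a numeric panel variable, so convert the string cname first
encode cname, generate(country_id)
xtset country_id year

* With the panel set, time-series operators work within each country:
* D. gives the change since the previous year (missing for the first year)
generate pop_change = D.unna_pop
list cname year unna_pop pop_change, sepby(cname)
```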