Cross sectional data means that we have data from many units, at one point in time.
Time series data means that we have data from one unit, over many points in time.
Panel data (or time series cross section) means that we have data from many units, over many points in time.
We can perform more interesting analyses with panel data than with either cross-sectional or time series data alone. Panel data also gives us a better opportunity to rule out alternative explanations, which makes it easier to talk about cause and effect.
In Stata we can use time series commands (see the separate guide for them!) with panel data to create lagged and leading variables. We can also use special regression commands that are suited for panel data, such as `xtreg`.
But first we need to make sure that the data is set up for panel analysis. This guide is about that.
Panel data can be structured in two ways: "long" or "wide". To take an example, let's say we have data on countries, over time.
With wide data, each row in the dataset stands for one country, and each column holds one variable at one point in time, for instance the population size of a country in a certain year. Like this:
country | population2000 | population2001 | population2002 |
---|---|---|---|
Sweden | 8872284 | 8888675 | 8911899 |
Norway | 4491572 | 4514907 | 4537240 |
It might seem intuitive at first glance, and it makes it easy to compare certain years to each other. But it is harder to do more advanced analyses, and with many different variables (population, GDP, unemployment, and so on) we will need a lot of columns.
In general it is more convenient to have the data in long form. In long data each row represents one country in one year, and each column represents one variable. We therefore also need a variable that shows which year the row represents. The table above would look like this in long form:
country | year | population |
---|---|---|
Sweden | 2000 | 8872284 |
Sweden | 2001 | 8888675 |
Sweden | 2002 | 8911899 |
Norway | 2000 | 4491572 |
Norway | 2001 | 4514907 |
Norway | 2002 | 4537240 |
The same data, in another format. Here we instead have few columns but a lot of rows, and rows are easier to work with in Stata. To change format from wide to long, or from long to wide, use the command `reshape`. There will be another guide about that. The rest of the guide presumes that the data is in long form.
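As a minimal sketch of what that conversion can look like, the block below types in the wide table from above and reshapes it to long form. The variable names follow the example tables and are not taken from any real dataset.

```stata
* Toy wide-format data, copied from the example table in this guide
clear
input str10 country population2000 population2001 population2002
"Sweden" 8872284 8888675 8911899
"Norway" 4491572 4514907 4537240
end

* Wide to long: "population" is the stub shared by the wide columns,
* i() names the unit variable, and j() names the new variable that
* receives the numeric suffix (the year)
reshape long population, i(country) j(year)
list, sepby(country)

* Long back to wide, using the same stub, i() and j()
reshape wide population, i(country) j(year)
list
```

The stub name must match the common prefix of the wide columns exactly, otherwise `reshape` will not find the variables.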
xtset
We need to specify two variables for Stata: a panel (unit) variable and a time variable. The panel variable is country in this case: all observations for Sweden are connected, all observations for Norway are connected, and so on. The time variable is year.
The command to specify these variables is `xtset`. We simply type `xtset country year`, with the panel variable first and then the time variable. Let us try with the QoG Institute's time series cross-section dataset, which contains information about countries over time. The data is in long format.
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_ts_jan18.dta", clear
The variable "cname" shows the name of each country in the data, and the variable "year" shows the year which the row in the data refers to. But if we try to use these variables with `xtset` we get an error message:
xtset cname year
Stata objects that the panel variable "cname" is a string variable. Stata wants it to be a numeric code. In the QoG data we have such variables, for instance the variable "ccode". But in other cases, for instance when we collect the data ourselves, we might not be so lucky. In those cases we can easily construct a country code ourselves, with the command `egen` combined with `group()`:
egen countryid = group(cname)
Stata then creates a new variable called "countryid", that gives each unique value of the variable "cname" its own numeric code, from one and up. We can now use this variable as our panel variable.
xtset countryid year
This is the message we get when the command worked. We can now see that our new variable "countryid" is the panel variable, and that the time variable is "year".
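The same steps can be tried on a small toy dataset, which makes it easy to see what `egen` and `xtset` do. This is only a sketch: the variable "population" and the four rows are typed in from the example tables earlier in this guide and are not part of the QoG data.

```stata
* Toy long-format data with a string country name
clear
input str10 cname year population
"Sweden" 2000 8872284
"Sweden" 2001 8888675
"Norway" 2000 4491572
"Norway" 2001 4514907
end

* group() numbers the unique values in sort order,
* so here Norway becomes 1 and Sweden becomes 2
egen countryid = group(cname)

* Declare the panel structure: panel variable first, then time variable
xtset countryid year

* xtset without arguments redisplays the current settings
xtset

* xtdescribe summarizes which years are observed for each panel
xtdescribe
```

Typing `xtset` on its own is a convenient way to check, later in a session, which variables are currently declared as the panel and time variables.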
Another common error message is "repeated time values within panel". It means that we have duplicate observations for at least one country-year. The two variables we specify with `xtset` must give unique combinations for all observations. Stata does not know what to do with a country-year that appears more than once, for instance Sweden in the year 2000, and shows us the error message instead. It looks like this:
xtset countryid year
Unfortunately Stata does not tell us which observations caused the error. But we can use the command `duplicates` to find them. We write `duplicates list` followed by both of the variables in question (countryid year). If we only write one of them, for instance `duplicates list countryid`, we will get a very long list of observations, as each country is included many times (once for each year). But if we instead write `duplicates list countryid year` we only get the observations that have identical values on both variables:
duplicates list countryid year
We can here see that 8 observations are causing the problem. They have the value 483 on the variable "countryid", and the value 2000 on the variable "year".
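Besides `duplicates list`, a few related commands can help to check for duplicates and mark them. The sketch below uses a small made-up dataset with one duplicated country-year; the variable name "dup" is chosen here for illustration.

```stata
* Toy data where countryid 2 appears twice in the year 2000
clear
input countryid year population
1 2000 4491572
1 2001 4514907
2 2000 8872284
2 2000 8872284
2 2001 8888675
end

* isid exits with an error if the variables do not uniquely identify the rows.
* capture suppresses the error so that we can look at the return code instead;
* a nonzero _rc means that there are duplicates
capture isid countryid year
display _rc

* Count how many observations are unique and how many are surplus copies
duplicates report countryid year

* Mark duplicated rows: dup is 0 for unique rows and counts
* the number of other rows with the same countryid and year
duplicates tag countryid year, generate(dup)
list if dup > 0
```

Tagging the rows instead of only listing them makes it possible to browse or compare the duplicated observations before deciding what to do with them.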
Now that we know who the culprits are, we need to think about why they were duplicated in the first place. In this case it was because I created them, to demonstrate the error message, and we can safely delete them. But in general we don't know which of the duplicates are the problematic ones: there might be one good observation of Sweden in 2000, and a bad one (caused by some error in data entry, for instance). In those cases it is necessary to take a close look at the data, to determine what went wrong and which observation should be deleted.
If we have decided to remove them we can use the command `drop` in combination with an if-statement. Below we instruct Stata to remove all observations with the value 483 on "countryid" and 2000 on "year".
drop if countryid==483 & year==2000
After doing so, we should be able to use `xtset` as intended.
xtset countryid year
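Note that `drop if` with those two conditions removes every row for that country-year, so no observation is left for it. When the duplicates are known to be exact copies and one of them should be kept, `duplicates drop` is an alternative. The sketch below shows it on made-up data; it is only appropriate when it does not matter which of the copies survives.

```stata
* Toy data where countryid 2 appears twice in the year 2000
clear
input countryid year population
1 2000 4491572
1 2001 4514907
2 2000 8872284
2 2000 8872284
2 2001 8888675
end

* Keep the first occurrence of each countryid-year combination and drop the rest.
* The force option is required when a variable list is specified
duplicates drop countryid year, force

* The combinations are now unique, so xtset works
xtset countryid year
```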
Now we are ready to start using the data. We can for instance create a lagged variable, that shows the population the previous year. Here we use normal time series commands.
gen lag_pop = l.unna_pop
If we now look at a segment of the data we can see that it worked:
list cname year unna_pop lag_pop if cname=="Sweden" & year>2010
For Sweden in 2011, the variable "unna_pop" has the value 9462352 persons. In the row for the following year, the variable "lag_pop" also has the value 9462352. As planned! The good thing is that Stata has not simply shifted all observations down one row, but takes into account which observation belongs to which country. One country's last year is not assigned to the next country's first.
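The lag operator has siblings for leads and differences, and they respect the panel boundaries in the same way. A minimal sketch on toy data (the values come from the example tables earlier in this guide, and the new variable names are chosen here for illustration):

```stata
* Toy long-format panel: two countries, three years each
clear
input str10 country year population
"Sweden" 2000 8872284
"Sweden" 2001 8888675
"Sweden" 2002 8911899
"Norway" 2000 4491572
"Norway" 2001 4514907
"Norway" 2002 4537240
end

egen countryid = group(country)
xtset countryid year

* L. is the previous year, F. the following year,
* and D. the change since the previous year
gen lag_pop  = L.population
gen lead_pop = F.population
gen diff_pop = D.population

* The first year of each country has a missing lag and difference,
* and the last year has a missing lead
list country year population lag_pop lead_pop diff_pop, sepby(countryid)
```

The operators can also take a number, so `L2.population` would be the value two years earlier.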
It is sometimes a bit tricky to set up panel data the right way, so that Stata understands how to deal with it. It is important that the data is in long form, so that each row is a country-year, and that we have a separate variable that shows which year the row corresponds to. With the command `reshape` we can transform the data from long to wide, or from wide to long. There will be a separate guide for this command.