Analysis of panel data can include variables that vary between individuals, over time, or both. When we try to answer a causal question we are often mostly interested in the variation over time. Understanding how variation between individuals (or countries, or other units of analysis) arose is generally more difficult, since it is the result of processes that have been going on for a long time.
By comparing units with themselves, over time, we can disregard the differences between units, and come closer to a counter-factual ideal situation of comparing each unit to itself in an alternate reality. If we want to prove that X affects Y it is a stronger point of evidence to show that X is associated with Y within units, not only "between units."
For instance, let's say we want to know whether quality of life in a city - say New York - is good or not. If we do a "normal" cross-sectional analysis we would measure how happy people in the city are, and compare it to how happy people are in other cities. Even if we find that people in New York are happier than in other places, it would not be strong evidence that New York is making them happy. Happy people can be attracted to New York, or perhaps only rich people can afford to live in New York (and being rich is really what makes them happy). But if we can show that individuals who have lived in different places were the happiest when they were living in New York, that is a much stronger piece of evidence for the hypothesis that living in New York makes you happy.
Such analyses can easily be done with so-called fixed effects in regression analysis. In this guide we will cover both the intuition to understand them, and how to implement them in Stata. If you are only interested in the code for implementing fixed effects you can jump to the end of the guide, to the section "Fixed effects with xtreg".
The question we will investigate is whether a higher share of women in the national parliament is associated with greener energy production, measured as the share of the country's energy production that comes from renewable sources.
We first load the data: the QoG Institute's time-series cross-section dataset. It includes data on countries, over time. To make things a little bit easier to grasp we will discard all countries but two, Sweden and the United States. Thereafter we will remove all country-years where we don't have information about the share of women in parliament, wdi_wip, and the share of renewable energy in energy production, wdi_elerenew. We do this with the command keep and an if qualifier.
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_ts_jan18.dta", clear
keep if cname=="Sweden" | cname=="United States"
keep if wdi_elerenew!=. & wdi_wip!=.
(Quality of Government Basic dataset 2018 - Time-Series)
(15,048 observations deleted)
(107 observations deleted)
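Before plotting, it can be worth confirming that the remaining sample is what we expect. The following is a minimal sketch, using only built-in commands and the variables already loaded above:

```stata
* Number of country-years kept for each country
tabulate cname

* Mean and range of the two variables of interest
summarize wdi_elerenew wdi_wip

* First and last year with data, separately for each country
bysort cname: summarize year
```

The total in the first table should equal the number of observations reported in the regressions further down.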
We start by looking at a simple scatterplot between the two variables. Here we just include all country-years in the analysis together, but we will also use mlabel(ccodealp) to mark which observations belong to which country. We can see that there is a strong relationship: Country-years when there was a high proportion of women in parliament also saw high shares of renewables in energy production.
* Graph 1: The relationship between women in parliament and renewables
twoway (scatter wdi_elerenew wdi_wip, mlabel(ccodealp) mlabsize(small)) ///
(lfit wdi_elerenew wdi_wip)
If we ran a regression on this relationship we would see that it was positive, strong and statistically significant. But we can also easily see that all the observations in the bottom left belong to the United States, and all in the top right belong to Sweden. How sure can we then really be that the effect is causal, that is, that the production of renewable energy would increase if the proportion of women in parliament increased in the countries? Not particularly sure. Our analysis has mostly shown that both the proportion of women and renewables are higher in Sweden. But there are many other things that also vary between the two countries. It is a stretch to infer that one of these two variables caused the other.
Let us therefore see what the relationship between the two variables looks like within each country. We will rerun the scatterplot, but once for each country, with the option by(cname), and instead label the observations with years.
* Graph 2: The relationship between the share of women in parliament and renewables, by country
twoway (scatter wdi_elerenew wdi_wip, mlabel(year) mlabsize(small)) ///
(lfit wdi_elerenew wdi_wip) ///
, by(cname)
The red regression lines are now much flatter than in the first graph, since they are now drawn within each country. We can for instance see that Sweden in 2003 and 2004 had more women in parliament than in 1990, but the share of renewables was lower. The hypothesis that more women in politics would lead to greener energy now looks a lot weaker.
A fixed effects analysis does roughly this, automatically, with the difference that we will get one value for the relationship. In contrast, in the graph the slope of the regression line can be different for each country.
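The two within-country slopes in Graph 2 can also be read off as numbers. A minimal sketch, which simply runs one ordinary regression per country:

```stata
* One regression per country: the coefficients on wdi_wip are the
* slopes of the two fitted lines in Graph 2
bysort cname: regress wdi_elerenew wdi_wip
```

A fixed effects model replaces these two separate slopes with a single slope that is common to both countries.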
One way to include both countries in the same analysis without having it being affected by between-country differences is to look at the deviations around each country's mean value. Instead of investigating whether country-years with a high proportion of women also had a high proportion of renewables we will look at whether country-years when the proportion of women was higher than normal for the country also had a higher share of renewables than what is normal for the country. That is, we will do exactly what we did in the double graph above.
The code below does two things. First it calculates a new variable that shows each country's share of renewables, as a deviation from the country average (mean value). The first row of the code calculates the average for each country, and the second row calculates a variable that is the deviation from that average. So if Sweden on average has 50 percent renewables, and one particular year had 55 percent, the new variable will have a value of 5 for that year. The last two lines of code do the same thing, but for women in parliament.
egen mean_energy = mean(wdi_elerenew), by(cname)
gen demeaned_energy = wdi_elerenew-mean_energy
egen mean_women = mean(wdi_wip), by(cname)
gen demeaned_women = wdi_wip-mean_women
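Why subtracting the country mean removes the differences between countries can be sketched in symbols. The notation is introduced here for illustration only: y_it is the share of renewables and x_it the share of women in parliament for country i in year t, alpha_i is everything about country i that does not change over time, and bars denote country averages.

```latex
\begin{aligned}
y_{it} &= \alpha_i + \beta x_{it} + \varepsilon_{it} \\
\bar{y}_i &= \alpha_i + \beta \bar{x}_i + \bar{\varepsilon}_i \\
y_{it} - \bar{y}_i &= \beta\,(x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
\end{aligned}
```

Subtracting the second line from the first cancels alpha_i, so the third line contains only variation around each country's own mean. Its left-hand side corresponds to demeaned_energy and the term in the first parenthesis to demeaned_women.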
In the graph below we can see how the share of renewables has varied over time in Sweden (blue line) and the US (red line), not in absolute numbers but as a deviation from each country's mean. Sweden always has more renewables, but when we now compare each country with itself we can more easily spot common trends. We can for instance observe a rising trend since the beginning of the 2000s, in both countries.
* Graph 3: The share of renewable energy in Sweden and the US, as a deviation from each country mean.
twoway (line demeaned_energy year if cname=="Sweden", lcolor(blue)) ///
(line demeaned_energy year if cname=="United States", lcolor(red))
If we now return to the relationship between the share of women in parliament (as a deviation from the country mean) and the share of renewables (as a deviation from the country mean) the graph looks like the one below. Not a particularly clear relationship, at least not compared to the strong relationship in the original graph at the top of the guide. We have now removed all "between-country variation", and instead focus on the "within-country variation", over time.
This was also visible when we drew different regression lines for each country earlier. This is what we do when we do a regression analysis with fixed effects: we compare units with themselves. But we can do it faster, with regression analysis.
* Graph 4: Scatterplot of the relationship between the variables, as deviations from the country means.
twoway (scatter demeaned_energy demeaned_women) (lfit demeaned_energy demeaned_women)
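The slope of the fitted line in Graph 4 can be obtained directly by regressing one demeaned variable on the other. A minimal sketch, using the two variables created above:

```stata
* Slope of the fitted line in Graph 4
regress demeaned_energy demeaned_women
```

The intercept is zero apart from rounding, since both variables have mean zero within each country. The standard error is slightly too small, because this regression does not account for the two country means that were estimated first; the regressions with dummy variables below handle that automatically.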
With regression analysis we can easily add variables to remove the between-country variation in the same way as we did with the graphs. The solution is to add dummy variables, one for each unit. In this case, it is one for each country (except one, that is left out as a reference category). Since we now only have two countries, we will only need one dummy variable.
These variables will achieve exactly what we did above, that is, control away all between-country variation. The result is that the analysis only will be based on the variation over time.
We can use the prefix i. to let Stata automatically add dummy variables for each value of the country code variable ccode. In the code below we run two analyses, where the dependent variable is the share of renewables, and the independent variable is the share of women in parliament. In the second analysis we also add the dummy variable for country. Then the results of both analyses are presented in a table with the help of the command esttab, which is part of the user-written estout package.
* Regression analysis with and without dummy variables for country (fixed effects)
quietly reg wdi_elerenew wdi_wip
estimates store m1
quietly reg wdi_elerenew wdi_wip i.ccode
estimates store m2
esttab m1 m2
--------------------------------------------
                      (1)             (2)   
             wdi_elerenew    wdi_elerenew   
--------------------------------------------
wdi_wip             1.403***        0.194   
                  (26.77)          (0.92)   

752.ccode                               0   
                                      (.)   

840.ccode                          -36.39***
                                  (-5.83)   

_cons              -10.57***        43.44***
                  (-6.02)          (4.65)   
--------------------------------------------
N                      37              37   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
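If esttab is not recognized, it is most likely because the estout package is not installed. The sketch below assumes an internet connection for the installation, and also shows a built-in alternative that works on the stored estimates m1 and m2:

```stata
* esttab belongs to the user-written estout package; install it once from SSC
ssc install estout, replace

* Built-in alternative that needs no installation
estimates table m1 m2, b(%9.3f) se(%9.3f) stats(N)
```

Note that estimates table as written here shows standard errors below the coefficients, not t statistics as in the esttab output.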
The result of interest here is the coefficient for wdi_wip. In the first model it is positive (1.403), strong and significant. This corresponds to the slope of the line in the very first graph we did. In the second model the coefficient is much weaker (0.194) and not statistically significant. This corresponds to the slope of the line in the fourth graph.
That means that almost the entire relationship between renewable energy and women in parliament was due to between-country variation. Sweden has more women in parliament and more renewable energy than the US.
Fixed effects with xtreg
Stata also has a regression command that is specially tailored to do regression analysis on panel data, xtreg. It requires that we first specify the structure of the panel data with xtset, which you can read about in another post. Here we specify that country is the panel variable, and year is the time variable, by writing xtset ccode year.
Then we use xtreg in the same way that we use reg, but we can now add an option, fe, for fixed effects. Stata will then automatically account for a separate intercept for each country in the data, just as the dummy variables did. These country effects will however not be shown in the output.
xtset ccode year
xtreg wdi_elerenew wdi_wip, fe
       panel variable:  ccode (unbalanced)
        time variable:  year, 1990 to 2014, but with gaps
                delta:  1 unit

Fixed-effects (within) regression               Number of obs     =         37
Group variable: ccode                           Number of groups  =          2

R-sq:                                           Obs per group:
     within  = 0.0243                                         min =         18
     between = 1.0000                                         avg =       18.5
     overall = 0.9534                                         max =         19

                                                F(1,34)           =       0.85
corr(u_i, Xb)  = 0.9839                         Prob > F          =     0.3639

------------------------------------------------------------------------------
wdi_elerenew |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     wdi_wip |   .1939546   .2107536     0.92   0.364    -.2343482    .6222574
       _cons |    25.7381   6.352159     4.05   0.000     12.82896    38.64724
-------------+----------------------------------------------------------------
     sigma_u |  25.728484
     sigma_e |  3.3851346
         rho |  .98298352   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(1, 34) = 34.01                      Prob > F = 0.0000
The output looks a lot like the one we get from regular regression analysis, but we also get some special information that has to do with the panel structure. We can for instance in the top right see that there are 37 observations, in two groups (countries). Below we get descriptive information about the number of observations per group. To the left under the heading "R-sq" we can see how much of the variation in the data is explained by the model. "R-sq within" means how much of the variation within countries is explained by the independent variable, and it is not a lot. The share of women in parliament is not strongly associated with the share of renewables, over time. "R-sq between" is how much of the between-country variation is explained by the independent variables, and that is all of it in this case, but that is because we only have two countries. It is not interesting in this case.
The coefficient for women in parliament is exactly the same as in the previous analysis, where we added dummy variables ourselves: 0.194 and not significant. The two ways of doing the analysis have the same result.
The only thing that is different is the intercept (the constant). When we add dummy variables ourselves the number shows the intercept for the reference country. With the xtreg command it is instead an average intercept for all countries. This is often left out of the presentation of the results, as the focus tends to be on the coefficient of the main variable.
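A third route to the same coefficient is the built-in command areg, which absorbs the categories of one variable without displaying a dummy for each. A minimal sketch with the same variables as above:

```stata
* areg absorbs the country dummies instead of listing them in the output
areg wdi_elerenew wdi_wip, absorb(ccode)
```

The coefficient for wdi_wip should match the one from reg with i.ccode and from xtreg with the fe option. Which command to use is mostly a matter of what the output should look like.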
Fixed effects "eat" all the variation between countries, which means that we cannot include variables that do not vary over time. For instance, geographical position is different for Sweden and the US, but does not vary over time. The variable can therefore not add anything that the dummy variables have not already accounted for. If we try to introduce such a variable, Stata will throw it out and write "xyz omitted because of collinearity". Watch what happens when we try to add the variable ht_region to the analysis below.
xtreg wdi_elerenew wdi_wip ht_region, fe
note: ht_region omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =         37
Group variable: ccode                           Number of groups  =          2

R-sq:                                           Obs per group:
     within  = 0.0243                                         min =         18
     between = 1.0000                                         avg =       18.5
     overall = 0.9534                                         max =         19

                                                F(1,34)           =       0.85
corr(u_i, Xb)  = 0.9839                         Prob > F          =     0.3639

------------------------------------------------------------------------------
wdi_elerenew |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     wdi_wip |   .1939546   .2107536     0.92   0.364    -.2343482    .6222574
   ht_region |          0  (omitted)
       _cons |    25.7381   6.352159     4.05   0.000     12.82896    38.64724
-------------+----------------------------------------------------------------
     sigma_u |  25.728484
     sigma_e |  3.3851346
         rho |  .98298352   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(1, 34) = 34.01                      Prob > F = 0.0000
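Whether a variable has any variation over time can be checked before running the model. A minimal sketch using the built-in command xtsum, which requires that the data have been declared as a panel with xtset, as was done above:

```stata
* xtsum splits each variable's standard deviation into a
* between-country part and a within-country part
xtsum ht_region wdi_wip wdi_elerenew
```

A within standard deviation of zero means that the variable never changes inside a country, so a fixed effects model cannot estimate a coefficient for it.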
Fixed effects is a fast and easy way to control away a lot of potentially confounding variation, to only focus on the variation within units. The downside is that it is not particularly informative theoretically. Differences between Sweden and the US are caused by something, but fixed effects does not allow us to say anything about it. And we could in principle imagine that the reason why Sweden has such a high share of renewables is that we have had many women in parliament for a long time, even if differences from year to year are too small to detect. If that is the case (I doubt it) the fixed effects analysis will underestimate the true effect of women in parliament.
It is therefore - as always - necessary to know exactly which question you are interested in when doing the analysis. Fixed effects are well suited for removing variation between units, focusing on variation over time, within each unit. That is often desirable when we want to test causal claims, but not always.