In another post we talked about regression analysis with interaction effects. There, the variables only had two values each. But what do we do when one or both of the variables in the interaction are continuous, with many values? In general we do the same thing, but we have to present and interpret the results in a slightly different way. That is the subject of this post. Anyone who wants to dig deeper can read this (paywalled) article.
We will use data from the American General Social Survey, a survey of ordinary citizens with questions on a wide range of subjects. We will use the 2016 version, and ask what effect having kids has on income. If you want to follow along, download the data and put it in your project folder. I have put it in a sub-folder called "data", which I also state when loading the file.
In the code below, we first load the data, and then do a recoding to create a "woman" variable.
cd "/Users/xsunde/Dropbox/Jupyter/stathelp"
use "data/GSS2016.dta", clear
recode sex (1=0) (2=1), generate(woman)
In the previous post we could see that the effect of having kids on income was different for men and women. But the variable we used for kids was a dummy variable, with the values 0 (no kids) and 1 (one or more kids). Now we will instead use a continuous variable, "childs", that shows how many kids the respondent has. The variable is however capped at 8 - the value 8 signifies 8 or more kids.
First we run a normal regression, with income as the dependent variable and "woman" and "childs" as independent variables, also controlling for age, since it is strongly related to both income and the number of kids.
reg realrinc woman childs age
The coefficient for "childs" shows the expected effect of increasing the variable one step, which means having another child. It is weakly negative and insignificant. There is thus no apparent difference between respondents with few and many children. The effect of woman is negative - women earn less than men.
Now we will add the interaction between woman and childs. We can do that directly in the regression command by connecting the two variables with ##. But we must also add c. in front of the childs variable, to show that it is continuous.
reg realrinc woman##c.childs age
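A side note: the ## operator is shorthand that includes both main effects and the interaction term, so the command above is equivalent to writing everything out in full:

* equivalent specification with the interaction spelled out
reg realrinc i.woman c.childs i.woman#c.childs age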
It is now important to remember that we cannot interpret the coefficients in an interaction model in the regular way: each main effect now shows the effect of that variable when the other variable in the interaction is zero.
To calculate the effect of having another child for both values of woman (that is, men and women), we take the main effect of childs (855.1288) and then add the coefficient for the interaction term, times the value of the woman variable:
For woman = 0 (men): $855.1288 - 2060.054 \times 0 = 855.1288$
For woman = 1 (women): $855.1288 - 2060.054 \times 1 = -1204.9252$
Men who have another child increase their income by 855.1288, while women decrease their income by 1204.9252. As the interaction term is statistically significant, we know that the difference in effect between the groups is significant.
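If you want to let Stata reproduce this arithmetic, one way (assuming the estimates from the interaction model are still in memory) is to display the stored coefficients:

* effect of another child, computed from the stored estimates
display _b[childs]                           // for men (woman = 0)
display _b[childs] + _b[1.woman#c.childs]    // for women (woman = 1)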
However, this does not necessarily mean that the effect of childs is significant within each of the groups. For instance, among women, is the effect of having another child significantly negative? To calculate that we can use the margins command, and then immediately after it the marginsplot command to show the coefficients graphically.
margins, dydx(childs) at(woman=(0 1))
marginsplot, yline(0)
We now see graphically what we calculated manually: the effect of "childs" is positive for people with a 0 on the woman variable (the men) and negative for the women. But neither effect is in itself significantly different from zero. That means that we can be pretty sure that the effect of having more children is different for women than for men, but at the same time we cannot be sure that either of the two effects is different from zero! This might be a little difficult to grasp.
But what we need to keep in mind is that "not significantly different from zero" does not mean that the effect definitely is zero - it just means that we cannot rule out zero at the conventional five percent level. And if we look at the graph we can see that the confidence intervals only overlap slightly. For the two effects to be the same, the effect for men would have to be in the lowest part of its interval at the same time as the effect for women is in the highest part of its interval. The chance of both of those happening at the same time is less than five percent. A bit tricky, but it actually makes sense.
Now we will calculate the difference between men and women over different values of the childs variable. We do so by taking the main effect of "woman" (-6100.754) and adding the interaction term (-2060.054) times different numbers of children.
0 kids: -6100.754 - 2060.054 * 0 = -6100.754
1 kid: -6100.754 - 2060.054 * 1 = -8160.808
2 kids: -6100.754 - 2060.054 * 2 = -10220.862
3 kids: -6100.754 - 2060.054 * 3 = -12280.916
4 kids: -6100.754 - 2060.054 * 4 = -14340.97
5 kids: -6100.754 - 2060.054 * 5 = -16401.024
6 kids: -6100.754 - 2060.054 * 6 = -18461.078
7 kids: -6100.754 - 2060.054 * 7 = -20521.132
8 kids: -6100.754 - 2060.054 * 8 = -22581.186
We thus get nine coefficients, each 2060.054 smaller than the previous one. The difference between men and women grows larger for each additional kid. Women with 8 kids earn on average 22581.186 less than men with 8 kids!
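The same calculation can also be scripted. A small sketch, assuming the estimates from the interaction model are still in memory:

* the estimated gender gap at each number of kids
forvalues k = 0/8 {
    display "`k' kids: " _b[1.woman] + `k'*_b[1.woman#c.childs]
}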
We can't have 3.5 kids, but we can use the same method for interactions with variables that have decimal values.
Now we will calculate the same coefficients with the margins command, because we then also get the significance values and confidence intervals:
margins, dydx(woman) at(childs=(0/8))
marginsplot, yline(0)
The nine points with confidence intervals thus show the nine coefficients we just calculated. We can see that the coefficient for woman - that is, the difference between women and men - grows more and more negative as the number of kids increases. None of the confidence intervals cover zero, which means that the gender difference is always statistically significant.
The fact that the confidence intervals are wider in some places than others has to do with the distribution of values on the childs variable. The regression line is drawn through the center of gravity of the observations, and therefore varies more towards the ends of the data. The intervals are generally narrowest where there are the most observations - between 1 and 3 kids. The average of the childs variable in the data is 1.8 kids.
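If you want to check the distribution yourself, a quick look at the variable shows where the observations are concentrated:

tab childs    // frequencies at each number of kids
sum childs    // the mean is about 1.8 kids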
Finally we predict values of the income variable, using the margins command. The easiest way to do so is to show the expected income for women and men at different numbers of kids. But keep in mind that the order of the variables in the at() option matters: the variable that is entered first will be on the x axis, and the second will govern the colors. The table is also large, since we now have 18 (9 * 2) coefficients. The table is not usually reported, only the graph.
margins, at(childs=(0/8) woman=(0 1))
marginsplot
Now it is time to take things a step further. What if we have two continuous variables in the interaction? We basically do the same thing, but just have to interpret and present a little bit differently.
Now let's say that we want to look at the relationship between income, age and having another child. It is a bit strange, since the number of kids you have is tightly connected with age, but let us still try, for the sake of the example. We then run a regression where we interact the number of kids with age (and write c. in front of both variables, since they are both continuous).
reg realrinc woman c.childs##c.age
There is a significant interaction effect (-138.2845). Since the term is negative, the interpretation is that the effect of having more kids becomes more negative the older you are. Conversely, the effect of age is more negative the more kids you have.
To calculate, for instance, the effect of having another child at different ages, we do the following:
20 years old: 6347.314 - 138.2845 * 20 = 3581.624
30 years old: 6347.314 - 138.2845 * 30 = 2198.779
40 years old: 6347.314 - 138.2845 * 40 = 815.934
50 years old: 6347.314 - 138.2845 * 50 = -566.9108
60 years old: 6347.314 - 138.2845 * 60 = -1949.756
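If we want a standard error and confidence interval for one of these conditional effects, the lincom command can be used. For example, the effect of another child for a 40-year-old (a sketch, assuming the estimates from the model above are still in memory):

* effect of another child at age 40, with standard error
lincom childs + 40*c.childs#c.age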
We can as usual illustrate it with margins and marginsplot:
margins, dydx(childs) at(age=(20(10)60))
marginsplot, yline(0)
We can see in the graph that the effect of having kids is positive for people that are 20, 30 and 40, and negative for people that are 50 or 60.
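We can also solve for the age where the effect changes sign, by setting the conditional effect to zero: $6347.314 - 138.2845 \times \text{age} = 0$ gives $\text{age} = 6347.314/138.2845 \approx 45.9$. The estimated effect of having another child thus turns negative at around 46 years of age.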
And to get the effect of getting one year older at different numbers of kids, we do the following:
0 kids: 474.4715 - 138.2845 * 0 = 474.4715
2 kids: 474.4715 - 138.2845 * 2 = 197.9026
4 kids: 474.4715 - 138.2845 * 4 = -78.6664
6 kids: 474.4715 - 138.2845 * 6 = -355.2354
8 kids: 474.4715 - 138.2845 * 8 = -631.8044
margins, dydx(age) at(childs=(0(2)8))
marginsplot, yline(0)
The effect of age is positive for people with 0 or 2 kids, but negative for people with 4, 6 or 8 kids.
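The break-even point here is $474.4715/138.2845 \approx 3.4$, so the estimated effect of getting older switches from positive to negative at around 3.4 kids.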
So far we have looked at the effects, but it gets trickier if we want to show predicted values. Let's say that we want to have age on the x axis. Then we will have one line for each number of kids. But in order not to make the graph too cluttered we will only display a few values, for instance 0, 3 and 6 kids. The other lines would be drawn in between anyway.
margins, at(age=(20(10)70) childs=(0 3 6))
marginsplot
Here we can see in a different way that the effect of getting older is positive for people with 0 kids, close to zero for people with 3 kids, and negative for people with 6 kids.
But we also have to remember that we have now divided the data into a lot of different subgroups, and we only had about 1600 observations to work with from the beginning. The more interactions we include, the more sensitive the analysis is to outliers. If there for instance is a person who is very old or has a lot of kids, that person will have a great deal of leverage. It is therefore often better to combine values, so that we for instance compare people with and without kids, or people above and below 40.
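Such a recoding could look something like the following (a sketch - the variable names are just suggestions):

* hypothetical recodings: any kids at all, and above/below 40
recode childs (0=0) (1/8=1), generate(haskids)
generate over40 = age >= 40 if !missing(age)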
Interaction analyses are often theoretically interesting, and can show important differences in the data. But remember that the goal is seldom to make a 1:1 scale map of reality. Instead, we want to sift through large amounts of data to get at the big patterns. The fact that it is possible to find a significant interaction does not imply that it is interesting. There is always a risk of overfitting, that is, building a model that fits our sample perfectly, but not the wider population.