If you have several variables that measure the same thing, it might be appropriate to combine them into an index variable. This is a more comprehensive variable that, for instance, sums or averages the constituent variables. One advantage of index variables is that measurement errors in the constituent variables tend to cancel out, giving you a better measure of the underlying concept you are really interested in.
Let's say we want to measure a person's knowledge of basic scientific facts. There is no single question that could capture everything. If we, for instance, ask "What is smaller, the electron or the atom?" (it is the electron!), it is of course likely that more knowledgeable persons will know the answer more often. But even a person who is generally ignorant of science might have heard this particular fact, and someone who knows a lot might have missed it. By asking several questions and combining the answers we can get a more comprehensive picture of how much the person knows.
In this guide we will cover how to make two simple indices of basic scientific knowledge, and how to check the reliability of each index, that is, how well the constituent variables correlate.
We will use the American General Social Survey, which asks several questions about basic scientific facts. We start by loading the data:
use "data/GSS2018.dta", clear
First we need to recode
the variables so that they have a common scale. If we, for instance, added a variable that ranges from 1-4 to a variable with a 1-10 scale, we would have problems: the 1-10 variable would carry a lot more "weight" in the index.
We will use five questions. 1) What is smaller, the electron or the atom? 2) Does the earth move around the sun, or the sun around the earth? 3) Have the continents moved over time? 4) Do antibiotics kill viruses as well as bacteria? 5) Does a laser work by focusing sound waves?
The variables are generally coded so that 1 means "true" and 2 "false". We are however only interested in whether the answer is correct or not, and therefore construct a new set of variables that have the value 0 if the respondent answered incorrectly, and 1 if he or she answered correctly. We do this with the recode
command. The new variables are called "c_" and then a brief summary of what the question is about.
recode electron (1 = 1) (2=0), gen(c_electron)
recode earthsun (1 = 1) (2=0), gen(c_earthsun)
recode condrift (1 = 1) (2=0), gen(c_condrift)
recode viruses (1 = 0) (2=1), gen(c_viruses)
recode lasers (1 = 0) (2=1), gen(c_lasers)
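To double-check that the recoding worked as intended, we can cross-tabulate one of the original variables against its recoded counterpart with the tabulate command (abbreviated tab); the choice of "viruses" here is just an example:
tab viruses c_viruses, missing
Each original answer should line up with exactly one value of the new variable, and the missing option shows how many respondents lack an answer.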
We can first look at the averages of the new variables, to see how many answered correctly. We can do that with the sum
command. If we write sum c_*
the command is run on all variables that start with "c_".
sum c_*
Since the variables only have the values 0 or 1, a mean value of 0.5 means that 50 percent answered correctly. The hardest question was apparently whether antibiotics kill viruses (they do not); 54 percent had the right answer. The easiest question was about continental drift: 87 percent knew the answer here. Now that we have a good set of similar variables we can create the index.
The first index we will create is the simplest possible: a combined score of the variables. A person who answers all questions incorrectly gets 0 points. A person who answers all questions correctly gets 5 points. We construct the index with the generate
command (abbreviated gen
). The result is a new variable. We then produce a histogram
to see the distribution of scores.
gen addindex = c_electron + c_earthsun + c_condrift + c_viruses + c_lasers
histogram addindex, percent
We now have a scale that measures knowledge of basic scientific facts. Only a few people had zero points (if you guess on all questions you should on average get 2.5 points). About 30 percent answered correctly on all questions.
The scale ranges from 0 to 5. If we instead want it to range from 0 to 1, to show the proportion of questions the respondent knew, we can simply divide the variable by 5. The code would then look like below. However, it does not change anything substantial: the ranking of the respondents would look exactly the same.
gen addindex_01 = (c_electron + c_earthsun + c_condrift + c_viruses + c_lasers)/5
A limitation of the method is that only persons with valid answers on all the questions are included in the index. A person with a missing value on a single variable is excluded. Therefore only 565 persons have a value on the new index, even though the question with the fewest responses still has 777.
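To see where the missing values come from, we can count the non-missing observations ourselves. The misstable command summarizes missingness per variable, and count shows how many respondents actually received an index value:
misstable summarize c_electron c_earthsun c_condrift c_viruses c_lasers
count if !missing(addindex)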
An alternative approach is therefore to take the average of all the variables that the respondent has values on. The upside is that we get more observations on the index, the downside is that the index means slightly different things for different people. To only answer one question and be right is not the same thing as answering five questions correctly. It is therefore important to think this through. What is more important, wide coverage, or fair comparisons between respondents?
To create the index this way we use the command egen
- an extended version of gen
combined with rowmean()
. Here we simply list the variables we want to take the average of; we don't place plus signs between them. We then create a histogram to see the distribution again:
egen meanindex = rowmean(c_electron c_earthsun c_condrift c_viruses c_lasers)
histogram meanindex, percent
There are now many more distinct values, and the distribution looks different. More people have the value 0, since the index now includes people who only answered one or two questions (and got them wrong). But we also now have 1166 persons with a value on the index.
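A possible middle ground, sketched here as one option rather than a fixed rule, is to require a minimum number of answered questions before assigning an index value. The rownonmiss() function counts how many of the listed variables each respondent answered; the cutoff of three below is an arbitrary choice for illustration:
egen answered = rownonmiss(c_electron c_earthsun c_condrift c_viruses c_lasers)
gen meanindex_min3 = meanindex if answered >= 3
Respondents who answered fewer than three questions then get a missing value on the new index.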
An index can be created for different reasons. In this example the purpose was to see how knowledgeable people are about science. Here it does not matter if the questions "go together", that is, correlate. A person who answers all questions correctly knows more about science than a person who does not. The purpose is to evaluate.
But if we want to measure some underlying attitude, for instance an ideology, it might be important to see whether the questions actually correlate. If they are all indicators of a broader ideology, they should be correlated. Let's say that we had chosen five questions that we thought would capture left/right ideology. If we then find that one of the questions is uncorrelated with the others, it might be a good idea to exclude it from the index. It might be picking up something other than the ideology.
We can first look at the correlations between the variables, in a correlation matrix. We write pwcorr
followed by the list of variables:
pwcorr c_electron c_earthsun c_condrift c_viruses c_lasers
All the variables have positive correlations. A person that answers one question correctly is more likely to answer another question correctly. But the relationships are quite weak - the correlation between knowing that the electron is smaller than the atom and that the earth revolves around the sun is only 0.06.
There are established measures for the reliability of indices, meaning how well the variables in the index correlate. A common measure is Cronbach's alpha. It ranges from 0 to 1, and the higher the value, the more the variables in the index tend to correlate. A rule of thumb is that 0.7 indicates that the scale is "reliable", but there is of course nothing magical about that threshold.
We write alpha
followed by the list of variables:
alpha c_electron c_earthsun c_condrift c_viruses c_lasers
The relevant value is found at the bottom: 0.4287. Not particularly high. In this case it does not matter, since our aim was to evaluate, rather than to find an underlying dimension of scientific knowledge. But if we aimed to find a coherent dimension we would want to remove variables that have low correlations with the other variables. By adding the option , item
we can see, in the table's rightmost column, what the alpha would be if we removed a particular variable.
alpha c_electron c_earthsun c_condrift c_viruses c_lasers, item
We could see in the correlation matrix that the question about electrons was only weakly correlated with the other variables. If we remove it, the alpha value would increase slightly, to 0.4341. If we instead removed the question about the movement of the earth around the sun, which had high correlations with the other variables, the alpha value would drop to 0.33.
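We can verify this by running alpha on the reduced set of variables, leaving out the electron question:
alpha c_earthsun c_condrift c_viruses c_lasers
The reported alpha should match the 0.4341 shown in the item table.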
To show the benefit of the index we can look at the relationship between education and the indices we created, as well as with the variables on their own. "addindex" is the index that only included people who answered all questions, and "meanindex" is the mean of the answered questions. "educ" measures how many years of schooling the respondent has.
pwcorr educ addindex meanindex c_electron c_earthsun c_condrift c_viruses c_lasers
What is interesting is that "educ" has higher correlations with both versions of the index than with any single variable used to create them. By combining the different variables we get a better indicator of how much you know about science - something we can reasonably expect to improve with education.
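The index can also be used like any other variable in further analyses. As a minimal sketch, we could regress the additive index on years of schooling:
regress addindex educ
The coefficient on educ then shows how many more correct answers are associated with an additional year of schooling.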
We have here talked about how to create additive indices, where the variables are combined by adding them together (or calculating the mean). But indices can of course be created in other ways, using more advanced methods. The important thing is that the research question guides the coding of the variables.