One of the simplest yet most informative analyses you can run on statistical data is correlation analysis. It shows the direction and strength of the relationship between two variables. By direction, we mean whether the relationship is positive or negative.
If it is positive, more of one variable is associated with more of the other and, conversely, less of one is associated with less of the other. If the relationship is instead negative, more of one variable is associated with less of the other.
There are many different measures of correlation, tailored to variables of different types. The most common is Pearson's r, which measures the linear relationship between two interval-scale variables. It runs from -1 (perfect negative relationship) through 0 (no linear relationship) to +1 (perfect positive relationship).
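For reference, the standard textbook definition of Pearson's r for n paired observations can be written as follows. The symbols x_i, y_i and the bars for sample means are notation introduced here for illustration; they are not variable names in the dataset.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to be above (or below) their means together, and negative when one tends to be high while the other is low. The denominator rescales the result so that it always falls between -1 and +1.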
We will use the QoG basic dataset. If you want to follow along, download it and put it in your project folder. Then open it with the use command. Remember to specify the path correctly.
use "qog_bas_cs_jan19.dta", clear
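If the file is not found, one option is to point Stata at the project folder first. The sketch below assumes a hypothetical folder path, which you would replace with your own, and then checks that the two variables used in this guide are present.

```stata
* Hypothetical project folder: replace with the folder where you saved the file
cd "C:/my_project"

* Load the dataset, discarding any data currently in memory
use "qog_bas_cs_jan19.dta", clear

* Confirm that the variables exist and inspect their ranges
describe ti_cpi p_polity2
summarize ti_cpi p_polity2
```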
Let's say that we want to look at the correlation between the level of corruption in a country (as estimated by Transparency International) and the degree of democracy (as measured by the Polity project). We then simply write:
pwcorr ti_cpi p_polity2
We get a very simple correlation matrix that shows the relationship between the two variables (at the intersection of the p_polity2 row and the ti_cpi column). The correlation coefficient is 0.4178, which means that the relationship is positive: higher values on the ti_cpi variable (which mean that there is less corruption in the country) are associated with higher values on p_polity2 (which mean more democracy). Democratic countries are less corrupt, on average.
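Since Pearson's r only captures linear relationships, it can be worth looking at the data as well. A minimal sketch, assuming the dataset is loaded as above, is a scatter plot with a fitted straight line; the axis titles are our own wording.

```stata
* Scatter plot with a fitted straight line to eyeball how linear the relationship is
* (the /// line continuations require running this from a do-file)
twoway (scatter ti_cpi p_polity2) (lfit ti_cpi p_polity2), ///
    ytitle("ti_cpi (higher = less corruption)") ///
    xtitle("p_polity2 (higher = more democracy)")
```

If the points follow a clearly curved pattern, the correlation coefficient will understate how strongly the two variables are related.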
We can also add various options to get more information. Often we want to know whether the correlation is statistically significant. We do that by adding the option sig.
pwcorr ti_cpi p_polity2, sig
Below each correlation coefficient we now get the significance value, or p-value, in this case displayed as 0.0000 (that is, smaller than 0.00005). It means that it is very improbable that a relationship this strong would arise by chance in the sampling procedure if there were no relationship in the population. Of course, the countries of the world are not randomly selected from a larger population, so the interpretation is a little bit different. But without going into that, the result still counts as statistically significant (since the p-value is below 0.05).
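To see where the p-value comes from, it can be reproduced by hand from the correlation coefficient and the number of observations, using a t-test with n - 2 degrees of freedom. This is a sketch under the assumption that the dataset is loaded; the local macro names r, n, t and p are our own.

```stata
* Store the correlation and the number of observations used
quietly correlate ti_cpi p_polity2
local r = r(rho)
local n = r(N)

* t statistic for the null hypothesis of no correlation
local t = `r' * sqrt((`n' - 2) / (1 - `r'^2))

* Two-sided p-value from the t distribution with n - 2 degrees of freedom
local p = 2 * ttail(`n' - 2, abs(`t'))

display "r = " %6.4f `r' "   N = " `n' "   p = " %6.4f `p'
```

The displayed p-value should agree with the one shown by the sig option, up to rounding.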
We might also want to find out how many observations were included in the analysis. To do that, we add obs among the options. 164 countries were included in this analysis.
pwcorr ti_cpi p_polity2, sig obs
One convenient feature of correlation analysis is that it is easy to investigate the relationships between many variables at the same time. The analysis has no "direction": no variable is dependent or independent. It does not matter in which order you enter the variables; the order only affects how they are arranged in the table.
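The symmetry can be checked directly. The following sketch computes the correlation with the variables in both orders and verifies that the two results coincide; the scalar names r_ab and r_ba are made up for this illustration.

```stata
* Correlation with ti_cpi listed first
quietly correlate ti_cpi p_polity2
scalar r_ab = r(rho)

* Same correlation with the order reversed
quietly correlate p_polity2 ti_cpi
scalar r_ba = r(rho)

* Stops with an error if the two coefficients differ
assert reldif(r_ab, r_ba) < 1e-12
display "Same coefficient in both orders: " %6.4f r_ab
```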
Now we want to see how our two variables relate to the level of social trust in the population, as measured by the World Values Survey. The variable is called wvs_trust, and we simply add it to the list of variables. Here we drop the significance values to make the matrix a little more readable.
pwcorr ti_cpi p_polity2 wvs_trust, obs
The matrix now has an additional row, and we can see all the pairwise relationships between the variables. We see that trust is associated with higher values on ti_cpi (less corruption). From this analysis, however, we cannot say which is the chicken and which is the egg. Interestingly, we can also note that more trust is associated with less democracy, even though the relationship is not particularly strong (nor is it significant, although that is not shown here).
A possible problem with the analysis is that the number of observations differs between the correlations. When the relationship between ti_cpi and p_polity2 was investigated, 164 countries were included in the analysis. But the correlation between ti_cpi and wvs_trust is based on only 34 countries. The explanation is that the trust variable is only available for a small number of countries.
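The differing sample sizes can be inspected before running any correlations. A minimal sketch, assuming the dataset is loaded, counts the countries with non-missing values for each pair and for all three variables together.

```stata
* Countries with data on both variables in each pair
count if !missing(ti_cpi, p_polity2)
count if !missing(ti_cpi, wvs_trust)

* Countries with data on all three variables at once
count if !missing(ti_cpi, p_polity2, wvs_trust)

* Overview of how many values are missing per variable
misstable summarize ti_cpi p_polity2 wvs_trust
```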
In one way it is good that each analysis uses all available observations. But since regression analyses are limited to the observations that have complete data on all variables, this correlation matrix will not be directly comparable to regression analyses that include all the variables. This might pose a pedagogical problem.
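To illustrate how a regression restricts the sample, the sketch below fits a model with all three variables and counts the observations that were actually used. The model itself is only an illustration; the choice of ti_cpi as the dependent variable is an arbitrary assumption made here.

```stata
* Illustrative model only: the dependent variable is chosen arbitrarily
quietly regress ti_cpi p_polity2 wvs_trust

* e(sample) marks the observations the regression used,
* that is, those with complete data on all three variables
count if e(sample)
```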
We can therefore limit the matrix to observations that have complete information, so that the same set of observations is used for every calculation in the matrix. We do that with the command corr instead of pwcorr. The corr command, however, does not offer an option for displaying significance values.
corr ti_cpi p_polity2 wvs_trust
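One possible workaround, if significance values are wanted for the restricted sample, is to keep using pwcorr but limit it to countries with complete data. The first variant below uses an if condition and should work generally; the second relies on the listwise option of pwcorr, which we assume is available in your Stata version.

```stata
* Variant 1: restrict pwcorr to complete cases with an if condition
pwcorr ti_cpi p_polity2 wvs_trust if !missing(ti_cpi, p_polity2, wvs_trust), sig obs

* Variant 2: let pwcorr drop incomplete cases itself
* (assumes a Stata version in which pwcorr supports listwise)
pwcorr ti_cpi p_polity2 wvs_trust, listwise sig obs
```

Both variants should give the same coefficients as corr, with the p-value printed below each one.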
Stata now tells us that 34 countries were included in all the analyses. We also see that the relationship between democracy and corruption is weaker in this group of countries: the correlation coefficient r is 0.1145, compared to 0.4178 when it was calculated on the larger group. This is important to keep in mind when interpreting the results, since this group of countries appears to differ in some way from the larger group.
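The comparison between the two groups can also be made side by side. The sketch below, assuming the dataset is loaded, computes the same correlation first on all available countries and then only on those that have a value for wvs_trust.

```stata
* Correlation on all countries with data on both variables
quietly correlate ti_cpi p_polity2
display "All available countries:   r = " %6.4f r(rho) "   N = " r(N)

* Same correlation, restricted to countries that also have trust data
quietly correlate ti_cpi p_polity2 if !missing(wvs_trust)
display "Countries with trust data: r = " %6.4f r(rho) "   N = " r(N)
```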
Correlation analysis is a very useful tool to explore our data, and quickly get an overview of what the relationships look like. A correlation matrix is often a good feature of a thesis' method section, or in the beginning of the results section, as it makes it easier for the reader to understand what's going on in the more advanced analyses. Regression analysis also builds on the correlations between the included variables.
It is, however, important to keep in mind that all correlation analyses are bivariate. They do not take other variables into account, even if there are other variables in the same matrix. The matrix is just a condensed form of presentation: the calculations only take into account two variables at a time.