A very intuitive and simple way to show a relationship between a categorical and a continuous variable is to calculate mean values (averages) of the continous variable for each value on the categorical variable.
For instace: What is the average income among men and women? What is the mean life expectancy in different countries? What is the level of trust in strangers among immigrants and natives? And so on.
More advanced statistial analysis is often necessary to rule out alternative explanations and strengthen our causal claims, but the presentation is almost always improved by first showing the main features of the relationship in a simple way - for instance with mean values or crosstabs.
We will in the examples work with the QoG basic dataset, which has information about the world's countries. Below I load the data directly from the web page, but we can also download the data and load it from the computer instead (generally advisable).
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
sum
and bysort
¶In this example we will take a closer look at the relationship between a country's GDP per capita and its level of democracy. We will measure GDP with the variable "gle_rgdpc" and democracy with "fh_status", as categorized byt he organization Freedom House.
With the command summarize
, which can also be abbreviated to sum
we quickly get the mean value of GDP per capita in the entire dataset. We just type sum
followed by the variable:
sum gle_rgdpc
On average, the 192 countries in the dataset have 12596.3 dollars in GDP per capita. But this is the mean value for the entire dataset, with democracies and dictatorships mixed.
To get at that difference we an use the bysort
command. It is a sort of "pre-command", where we type bysort variable:
, and then the main command. We then get the main command repeated once for each group of the variable we specified after bysort
. Like this:
bysort fh_status: sum gle_rgdpc
We now get three analyses, where the sum
command on the GDP variable first was run only on the 89 countries that are categorized as "Free", then on the 54 that are counted as "Partly free" and finally on the 49 countries that are categorized as "Not free".
The tables might look a bit messy, but we at least get the most important numbers. We can see that GDP per capita is highest in the free countries (18777.71), second highest in the not free countries (8745.244) and lowest in the partly free countries (5902.896). There are quite a few research papers written about this finding, but one thing to keep in mind is that there are some oil-rich countries among the worst dictatorships (oil incomes might even underpin the regimes).
We also got some bonus information (like the min, max and standard deviation), but we generally want to use commands that give us the information we are looking for, and not much more. In this case we just wanted the mean, and then there is a better method.
table
¶The table
command (not to be confused with the tab
command) is a flexible command that give us information about variables in different groups.
The structures is that we first specifiy for which variable we want to compare groups, and in the options of the command we then specify which information we want about the grups. If we want the mean of GDP per capita, for the groups of "fh_status" we type:
table fh_status, contents(mean gle_rgdpc)
table fh_status
meant that we wanted the observations divided according to the different values of "fh_status". With the option contents()
we specify what that information should be - in this case the mean of "gle_rgdpc". But the mean is not the only info we can get. Type help table
for the full list, but some of the more useful are sd
, median
, max
, min
and count
. The last one shows how many observations we have in each group.
We can specify up to five different informations, like this:
table fh_status, contents(mean gle_rgdpc sd gle_rgdpc min gle_rgdpc max gle_rgdpc count gle_rgdpc)
This table is much more neat than what we got with the sum
command, and contains the same information. In our reports or theses we should however never simply paste tables from Stata - they can alway be made more readable. But it is of course more convenient to start with something that looks more like the final product.
table
¶The table
command also lets us divide according to two or three variables, and show information in the smaller groups. We just add the variables after each other, before the comma sign that denotes that options. Below we divided the observations in the dataset not only according to the level of democracy, but also according to whether the country uses proportional representation or not in their electoral system "dpi_pr".
table fh_status dpi_pr, contents(mean gle_rgdpc count gle_rgdpc)
This way we can get more fine-grained information. But when we divide into more groups the number of observations in each group goes down, which we need to be observant of. It might therefore be good to use the count
option. We can for instnace see that there are only 16 countries that are categorized as not free, while also having proportional representation. Our conclusions for that group will therefore be less certain than for the groups with 54 countries.
Our presentation of statistical results should always be as pedagogical and easy to understand as possible. To display the means in different groups is often a good idea, and gives easy numbers that the reader can use as a point of reference while thinking about more advanced analyses. The table
command is a good way to quickly access the relevant information.