Almost all commands in Stata can be combined with so-called if qualifiers. These are conditions that tell Stata which observations should be included in the command. We might, for instance, want to recode only a subset of observations, or run an analysis on a small part of the dataset.
The conditions use a set of "logical operators," building blocks from which we can construct both simple and advanced conditions. They are also used in many other software packages. The operators are:
Operator | Meaning |
---|---|
== | Equal to |
!= | Not equal to |
> | Larger than |
< | Less than |
>= | Larger than or equal to |
<= | Less than or equal to |
& | And |
\| | Or |
The last two, "and" and "or", can be used to link several conditions. If we work with data on persons, we can for instance create a condition that requires the person to be 25 years old AND unemployed. Or we could use a condition to select people who are under 22 years old OR have never voted in a parliamentary election.
If qualifiers are entered in the command after the list of variables and before the comma (,) that introduces any options.
With the aid of the QoG Basic dataset we will look at some examples of how these conditions can be used in a range of applications.
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
If qualifiers help us select specific groups of observations. Let's say we want to look at the level of corruption in the world. To see the mean we can write:
sum ti_cpi
The mean is 42.8 (on a 0-100 scale, where 100 means the least corruption). But now say we want to do this for a smaller group of countries, such as the ones that are categorized as free according to Freedom House. These have the value 1 on the variable fh_status. We then add an if qualifier to the command:
sum ti_cpi if fh_status==1
The number of observations is now lower, 77 instead of 181. The mean is also higher: 57.3 instead of 42.8. Corruption is in general not as widespread in democratic countries.
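To tie this back to the placement rule, here is a minimal sketch of the same command with an option added. The option detail is a standard option of summarize but is not used in the original text, so treat its use here as an illustration only.

```stata
* Assumes the QoG dataset loaded above is still in memory.
* Order: command, variable list, if qualifier, comma, options.
sum ti_cpi if fh_status==1, detail
```

The if qualifier sits between the variable list and the comma. Everything after the comma is read as an option.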
We can also use if qualifiers to look at observations that have a value over or under some threshold, such as countries whose population (unna_pop) exceeds 50 million:
sum ti_cpi if unna_pop > 50000000
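One caveat with "larger than" conditions: Stata treats missing values as larger than any number, so a country with no population figure would also satisfy the condition above. Whether this matters here depends on whether unna_pop has any missing values in this dataset, which the text does not state. A cautious sketch that rules them out explicitly:

```stata
* Assumes the QoG dataset loaded above is still in memory.
* Check how many observations lack a population figure.
count if missing(unna_pop)
* Keep only large countries with a known population.
sum ti_cpi if unna_pop > 50000000 & !missing(unna_pop)
```

If the count is zero, both versions of the command give the same result.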
If qualifiers work with virtually all commands, for instance the command list. Say we want a list of all countries in Eastern Europe and the former Soviet Union, and whether they are categorized as democracies or dictatorships. We start with the variable ht_region, where these countries have the value 1. We will list two variables, the country's name cname and its categorization fh_status. We also add the option clean, which removes the lines in the output. Options are added after the if qualifier.
list cname fh_status if ht_region==1, clean
If we instead want to condense this list, we can display the information as a table of frequencies with the command tab, which shows how many countries are placed in each category.
tab fh_status if ht_region==1
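The examples so far use a single condition. As a sketch of how the AND operator from the table links two conditions, the following lists only the countries in Eastern Europe and the former Soviet Union that are also categorized as free. It uses only variables and codes already introduced above.

```stata
* Assumes the QoG dataset loaded above is still in memory.
* Both conditions must hold for an observation to be listed.
list cname if ht_region==1 & fh_status==1, clean
```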
If qualifiers are very useful when we make graphs, especially with the command twoway, since this command makes it easy to add different layers of graphs on top of each other. Each layer can have its own conditions for which observations are included in that layer, but we can also create if qualifiers that apply to the graph as a whole. In the graph below we present the relationship between corruption (ti_cpi) and ethnic fragmentation (al_ethnic). We add an if qualifier for the entire graph, which limits the sample to either Western Europe and Northern America (ht_region==5) OR Sub-Saharan Africa (ht_region==4).
twoway (scatter ti_cpi al_ethnic) if ht_region==5 | ht_region==4
But we can also use if qualifiers within each layer. We will now create one layer where only Western Europe and Northern America are included, and one layer where only Sub-Saharan Africa is included. The benefit of doing so is that we can then modify the look of each layer. We set the color of the Western European and Northern American dots to blue, and those of Sub-Saharan Africa to red. Note that the if qualifiers here are within the parentheses, and thus only affect what is inside that set of parentheses. But even here they are located before the options, as always.
twoway (scatter ti_cpi al_ethnic if ht_region==5, mcolor(blue)) ///
(scatter ti_cpi al_ethnic if ht_region==4, mcolor(red))
A very useful feature of this type of scatterplot is the ability to add a new layer with only a few select observations, which are then marked in the graph. For instance, we might want to show the location of Sweden and the United States. We then create a new layer, with the condition that the country name cname should be "Sweden" OR "United States", and in this layer we also specify that the country names should be used as marker labels. We also add an option to the entire graph, legend(off), which removes the legend at the bottom of the graph. Otherwise we get one legend entry for each layer, which looks bad.
twoway (scatter ti_cpi al_ethnic if ht_region==5, mcolor(blue)) ///
(scatter ti_cpi al_ethnic if ht_region==4, mcolor(red)) ///
(scatter ti_cpi al_ethnic if cname=="Sweden" | cname=="United States", mlabel(cname)) ///
, legend(off)
If qualifiers can be used in regression analysis (or other types of analyses) to run the analysis in a specific subgroup, or to eliminate certain outliers. First we can try to run a regression analysis on the relationship between corruption and ethnic fragmentation in Sub-Saharan Africa. Please note that only 46 observations are included in the analysis.
reg ti_cpi al_ethnic if ht_region==4
We can also eliminate specific outliers. Singapore, for instance, is a very special case when it comes to corruption. We can try to leave it out of the analysis, to make sure that the results are not affected too much by that country. We use the "not equal to" operator, !=, to remove Singapore specifically.
reg ti_cpi al_ethnic if cname!="Singapore"
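If more than one country should be left out, the "not equal to" conditions have to be linked with AND, not OR. The sketch below drops a second country as well. Sweden is a hypothetical choice made only for illustration and is not suggested as an outlier in the original text.

```stata
* Assumes the QoG dataset loaded earlier is still in memory.
* Sweden is a purely illustrative second exclusion.
* Keep observations that are not Singapore AND not Sweden.
reg ti_cpi al_ethnic if cname!="Singapore" & cname!="Sweden"
```

With | in place of &, the condition would be true for every observation, since every country name differs from at least one of the two, and nothing would be excluded.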
If qualifiers can be used in a lot of ways. It is however important to remember that each part of the condition must work on its own even when you string several conditions together. If we for instance want to create a condition that chooses Sweden or the United States, we CANNOT write:
if cname == ("Sweden" | "United States")
Instead, we must write it as two conditions:
if cname == "Sweden" | cname == "United States"
The variable name must be included in both parts of the if qualifier. It is also important to bear in mind the difference between AND (&) and OR (|). You could for instance be forgiven for thinking that in the example above we could write if cname == "Sweden" & cname == "United States", because both country names are acceptable. But Stata would then look for countries where the country name is both "Sweden" and "United States". A name cannot be both at the same time, so no observations would be chosen.
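When the same variable is compared against several values, Stata's built-in function inlist() offers a more compact alternative to repeating the variable name. The original text does not use it, so the following is only a sketch of an equivalent way to write the conditions above.

```stata
* Assumes the QoG dataset loaded earlier is still in memory.
* An impossible condition selects nothing: count reports 0.
count if cname == "Sweden" & cname == "United States"
* inlist() is true when the first argument equals any of the others.
list cname ti_cpi if inlist(cname, "Sweden", "United States"), clean
* The same function works for numeric codes, here two regions.
twoway (scatter ti_cpi al_ethnic) if inlist(ht_region, 4, 5)
```

Running count with a condition before using it in an analysis is a quick way to check that the condition selects the number of observations you expect.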