merge
¶Many statistical analyses requires the that we comvbine several datasets. We might get information about the independent variable from one source, and want to analyze the effects of it on a dependent variable from another source. We thus have the same observations in both datasets, but want to combine the variables. We do this with the command merge
.
Another scenario is when we have different datasets with the same variables, but with different units of analysis. In those cases, we use the command append
, which really is the same as pasting more observations to the dataset. But since merge
is a bit trickier we will focus on that command in this guide.
With the merge
command we can do three (or really, two) different types of merges. Easiest is the 1:1
merge. It means that we have the same observations in both datasets. For instance the same individuals, or the same countries; we just want to add more information about them. We then need a matching variable, a key, that shows the country's or the person's identity, so we can match the information in the two datasets.
But we can also match m:1
. It means that we have many observations in the dataset that is active in Stata, and want to add on data from a smaller number of observations, at a higher level. For instance, we might have data on persons in Europe, and want to add country-level information. Imagine that we have conducted a survey, and now want to see whether the answers are affected by whether the country of residence is an EU member or not. All persons in Sweden will have the same value - Sweden is a part of the EU. All persons in Norway will also have the same value - Norway is not a member of the EU. In our individual level dataset we can have thousands of observations, and still combine it with about 30 country-level observations (one for each country).
The third type of merge is 1:m
. It is the exact same thing, but instead of starting with the individual level dataset, we start with the country level dataset, and add the individual level data.
merge 1:1
¶We start with the simplest type: when we have the same type of observations in both datasets, and want to add mode variables. Let's say we want to combine information about the level of democracy in a country with the level of corrption in the country. We can get information about democracy from Freedom House. The data is in Excel format, but we can just cut the relevant parts and paste into a Stata dataset. Then we get information about corruption from Transparency International, and treat it the same way.
We now have two datasets, fh2017.dta
and cpi2018.dta
. Below we load fh2017.dta
and look at the first five rows in the dataset, with the help of the command list
.
cd "/Users/xsunde/Dropbox/Jupyter/stathelp/data"
use "fh2017.dta", clear
list in 1/5
We have four variables: the country name country
, the level of political rights on a scale from 1 to 7 fh_pr
, the level of civil liberties fh_cl
and a summarizing classification fh_status
. Now let's look at the corruption data.
use "cpi2018.dta", clear
list in 1/5
Here we only have two variables: the country name country
and the level of corruption cpi2018
(where low values indicate more corruption). We will now merge the two datasets.
country
. We want to match the level of democracy in Afghanistan with the level of corruption in Afghanistan. If we don't have a ID variable of this type it is impossible to do the merge. Generally, it is preferable if the ID is a number, and not a string (text) variable. It reduces the risk that the merge is hindered by spelling errors and so on. We can also match on serveral variables. For instance, when working with this type of data we often want to match on both country and year. Sweden 2008 should be matched to Sweden 2008, and so on.We use the command merge
. The principle is that we write merge
, then what type of merge we want to do, then what the ID variable is, and then which dataset we want to match the active data with. In this case, we want a 1:1
merge, with the democracy data. It looks like this:
merge 1:1 country using "fh2017.dta"
The merge is now complete. But we have errors to correct. The summary shows that 168 observations were successfully matched. But we can also see that there were 50 observations that could not be matched: 12 from the "master" dataset, that is, the one we had active (the corruption data). We also have 38 observations in the "using" dataset, the democracy data, that could not find a match.
This does not necessarily have to be a problem, for instance if we match data on EU countries with data from all of the world. We will then of course not be able to match African countries with the EU data.
But here we might have some problems, since there are countries in both datasets that could not find a match. Often this is caused by errors in the ID variable.
Luckily, Stata allows us to easily identify the problematic observations. The merge
command creates a new variable called _merge
. It shows the result of the matching. Observations with the code 1 were only found in the original data. Observations with the code 2 were only found in the new data. Observations with the code 3 were successfully matched. We can list all observations that were not matches, that is, all observations that did not have the value 3 on the _merge
variable:
list country _merge if _merge!=3
The next step is to see if there are any countries that appear twice, in both the "master only" and "using only" categories. One such country is the United States, called "United States of America" in one dataset and "United States" in the other. We can also see that South Korea is called "Korea, South" in one dataset and "Sotuh Korea" in the other. Stata does not understand that these are the same countries. It would also be unable to match if there were differences in capitalization.
We must therefore manually correct the data so that the ID variables match. We need to go back to the data and change the variable in one of the datasets. We can for instance change "United States" to "United States of America" and "Korea, South" to "South Korea" in the democracy data. If we then rerun our do-file with the merge command, the number of unmatched observations should have decreased.
_merge
variable¶When we are happy with the results of our merging we can remove the data we don't need. For instance the unmatched data, or observations that were not included in our original data. If we want to remove observations that only were included in the democracy data (the "using" dataset) we can write:
drop if _merge==2
THen we can drop the _merge
variable itself. This is necessary if we want to add more data - Stata will not go through with the merge if there already exists a _merge
variable in the dataset. We therefore write:
drop _merge
merge m:1
¶Now let's say that we have a dataset of individuals, where the data shows each persons gender and his or her satisfaction with democracy. I made this data up. Let's load it and look at the 10 first persons (there are only 15 in the entire dataset):
use "satisfactiontest.dta", clear
list in 1/10
We have here persons from Afghanistan and Albania, who have different levels of satisfaction with democracy. Some of them are women, and some are not. Now let's merge this dataset with the democracy data, with the help of merge m:1
. The ID variable that we match on is still country
.
merge m:1 country using "fh2017.dta"
15 observations were matched, as planned. 203 observations were not matched, but they were all from the democracy data. We only had persons from three countries, so all other countries are unmatched.
Now let's look at the first 10 persons again:
list in 1/10
Note that all the variables from the democracy data - fh_pr
, fh_cl
and fh_status
have the same value for all persons from each country. All persons from Afghanistan have the same value on the variable fh_pr
, since the level of political rights is the same for everybody in Afghanistan. The variable only varies between countries, not within.
If we instead had started with the country level data loaded in Stata, we would have had to write merge 1:m using "satisfactiontest.dta"
instead, but the end result would have been the same.
_merge
variable when you are done matching, so that you can easily add more data later on.The merge
command is suitable when we want to add more variables. If we want to add observations, with information on the same variables, we need to use the append
command. One such case might be when we want to combine surveys with identical questions, but carried out in 2018 and 2019. append
then lets us combine the survey responses in one dataset.