Of course, you cannot make up data when doing statistical analysis. But transforming variables is something completely different. We manipulate the variables in systematic ways, so that they are better suited to answering a specific question, or fit better into a certain model.
In many cases we have data with a skewed distribution. One such case is when we have many small values, but only a few very large ones.
Such patterns often arise when we are dealing with self-reinforcing processes. Take the population of a city, for instance. Every year new children are born in the city. The larger the city, the more children are born. Therefore, larger cities will grow faster in absolute numbers than small cities. A city will often also become more attractive as a destination when it grows, which means that it attracts more movers. In this way, the larger cities pull further and further away from the smaller cities.
The same applies to the economy. "For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath," says the Bible. The meaning might be about faith, but it is a good descriptive principle for economics. It is easier to make money if you are already rich.
Distributions of income are therefore often skewed. Below is a histogram of the distribution of GDP per capita in the countries of the world, according to the QoG Basic dataset.
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
histogram gle_rgdpc, percent
More than half of the countries can be found in the left-most bar. Far to the right we have a single bar that represents a few extremely rich countries, Monaco among them. This is a classic example of skewed data.
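One way to put numbers on the skew, sketched here under the assumption that the dataset loaded above is still in memory, is to compare the mean with the median and to look at the skewness statistic that `summarize` reports with the `detail` option. A mean far above the median and a skewness well above zero both point to a long right tail.

```stata
* Detailed summary statistics for GDP per capita
summarize gle_rgdpc, detail

* Pull out the stored results that matter for skew
display "Mean:     " r(mean)
display "Median:   " r(p50)
display "Skewness: " r(skewness)
```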
This poses a problem for several statistical analyses. If you imagine regression analysis as a see-saw, Monaco sits very far out on the board and has a lot of leverage. Countries with low values instead sit close to the middle of the board, and have little influence on our analyses.
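The see-saw picture corresponds to what regression diagnostics call leverage. As a rough sketch, the code below fits a regression with GDP per capita as the independent variable and lists the observations with the highest leverage. The choice of life expectancy (`wdi_lifexp`) as the dependent variable is only for illustration, and `cname` is assumed to be the country name variable in this version of the dataset. Countries that lack data on either variable drop out of the regression, so the list may not match the richest countries in the histogram.

```stata
* Illustrative regression; the dependent variable is an arbitrary choice
quietly regress wdi_lifexp gle_rgdpc

* Leverage (diagonal of the hat matrix) for each observation in the model
predict lev if e(sample), leverage

* Sort from highest to lowest leverage and show the top five
* (cname is assumed to hold the country name)
gsort -lev
list cname gle_rgdpc lev in 1/5
```

Note that `gsort` changes the sort order of the data in memory, which does not affect the commands that follow in this guide.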
But with a simple transformation, taking the logarithm, we can change the scale of the variable, thus making the distribution more normal and better suited for statistical analysis. It will also in many cases make more sense theoretically.
Logarithms have different bases. The simplest one is the base 10 logarithm (sometimes called the common logarithm). For each number, we ask which power we need to raise 10 to, in order to get the original number. Take the number 100. To get 100, we need to raise 10 to the power of 2, because $10\cdot10 = 10^2 = 100$. The base 10 logarithm of 100 is thus 2. The base 10 logarithm of 1000 is 3, since $10^3=1000$.
Another, even more useful, base is the number $e$, which is approximately 2.72. The logarithm that has this number as its base is called the natural logarithm. The table below shows some numbers, and what the base 10 logarithm and the natural logarithm of these numbers are.
Number | Base 10 logarithm | Natural logarithm |
---|---|---|
1 | 0.00 | 0.00 |
10 | 1.00 | 2.30 |
20 | 1.30 | 3.00 |
30 | 1.48 | 3.40 |
40 | 1.60 | 3.69 |
50 | 1.70 | 3.91 |
100 | 2.00 | 4.61 |
500 | 2.70 | 6.21 |
1000 | 3.00 | 6.91 |
We can see in the table that the logarithms increase more and more slowly, the higher up we are on the scale. To increase the base 10 logarithm from 1 to 2 we need to increase the original number from 10 to 100, an increase of 90. But to increase it from 2 to 3 we need to increase the original number from 100 to 1000, an increase of 900. The same principle also applies to the natural logarithm.
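The values in the table can be reproduced directly in Stata. The loop below is a small sketch that prints each number together with its base 10 and natural logarithm, rounded to two decimals; it does not need any dataset in memory.

```stata
* Print number, base 10 logarithm and natural logarithm for each table row
foreach n of numlist 1 10 20 30 40 50 100 500 1000 {
    display %6.0f `n' "   " %5.2f log10(`n') "   " %5.2f ln(`n')
}
```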
Logarithmic scales can be interpreted as showing ratios, or relationships, rather than absolute differences. Every step up on the base 10 logarithm means that the original number increases tenfold. Every step up on the natural logarithm means that the original number is multiplied by a factor of about 2.72. The natural logarithm is more useful because we can more easily interpret changes in it as changes in percent.
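Both claims follow from basic rules for logarithms. A short derivation, using $x$ for the original number and $r$ for a relative change (both symbols are introduced here only for illustration):

$$
\log_{10}(10x) = \log_{10}(10) + \log_{10}(x) = 1 + \log_{10}(x)
$$

$$
\ln\big((1+r)\,x\big) - \ln(x) = \ln(1+r) \approx r \quad \text{for small } r
$$

For example, $\ln(1.05) \approx 0.049$, so an increase of 0.05 in the natural logarithm corresponds to roughly a 5 percent increase in the original number. The approximation gets worse for large changes: $\ln(2) \approx 0.69$, although a doubling is a 100 percent increase.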
It is very straightforward to do a logarithmic transformation in Stata. We use the `generate` command and write `ln()` if we want the natural logarithm, or `log10()` if we want the base 10 logarithm. In the code below, we do one of each for the GDP per capita variable:
gen ln_gdpc = ln(gle_rgdpc)
gen log10_gdpc = log10(gle_rgdpc)
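One thing to keep in mind is that the logarithm is only defined for positive numbers. Stata's `ln()` and `log10()` return missing for zero and negative values, so such observations silently drop out of later analyses. GDP per capita should always be positive, so the check sketched below is mostly a habit worth having for other variables.

```stata
* Observations with zero or negative values (missing values are not counted)
count if gle_rgdpc <= 0

* Observations that had a value before the transformation but are missing after it
count if missing(ln_gdpc) & !missing(gle_rgdpc)
```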
To see how the new variable relates to the old one, we can make a scatterplot with the logarithm on the y axis and the original variable on the x axis.
twoway (scatter ln_gdpc gle_rgdpc)
The underlying data shown on the x axis and the y axis is the same. But the scale on the y axis is more "compressed." The three richest countries (Monaco, Qatar and Liechtenstein) are further away from the rest on the x axis than on the y axis. This is because it gets harder and harder to increase the logarithm, the further to the right we move.
If we look at the distribution of the natural logarithm of GDP per capita, it looks completely different compared to the untransformed variable:
histogram ln_gdpc
The distribution is now much more compressed, and closer to a normal distribution. This is because the actual distances between the steps of the scale are much larger on the right side of the graph than on the left. A value of 6 on the x axis corresponds to a GDP per capita of about 403 dollars, and 7 to about 1097 dollars. But 10 corresponds to about 22026 dollars, and 11 to about 59874 dollars!
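The dollar amounts behind the steps on the logarithmic scale can be recovered by reversing the transformation with `exp()`, which raises $e$ to the given power. A minimal sketch:

```stata
* Convert values on the natural log scale back to dollars
foreach x of numlist 6 7 10 11 {
    display "ln = `x'  corresponds to  " %9.2f exp(`x') " dollars"
}
```

Going from 6 to 7 adds roughly 700 dollars, while going from 10 to 11 adds almost 38000 dollars, even though both are a single step on the logarithmic scale.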
This might seem like a strange thing to do with data. But in many cases, it makes theoretical sense. For instance, consider the relationship between quality of life and income. For a poor person, a small increase in income can make a big difference. In his book Factfulness, Hans Rosling describes the important difference between having sandals and not having sandals, or between having a bike and not having a bike. But for someone who is already rich, income would have to increase substantially in order to really affect that person's life.
Money has a diminishing rate of return. Every extra dollar has a little less impact than the previous one. It is therefore reasonable to look at increases in percent, which we can do with logarithms.
In the two graphs below we can see the relationship between life expectancy and GDP per capita in the countries of the world. The left graph shows GDP per capita in absolute numbers, and the right graph shows its natural logarithm. I have also drawn lines that show roughly the moving average (the parentheses that start with `lowess` in the code).
quietly twoway (scatter wdi_lifexp gle_rgdpc) (lowess wdi_lifexp gle_rgdpc), name(graph_normal, replace)
quietly twoway (scatter wdi_lifexp ln_gdpc) (lowess wdi_lifexp ln_gdpc), name(graph_log, replace)
graph combine graph_normal graph_log
The relationship between life expectancy and GDP per capita is non-linear and diminishing. The relationship between life expectancy and the natural logarithm of GDP per capita is, however, linear. This means that we get less and less effect from increasing a country's GDP per capita by a fixed sum, for instance 1000 dollars. But an increase in percent, for instance a doubling, has similar effects all over the scale (which we can see on the right). Of course, doubling the GDP of Monaco is much more "expensive" than doubling the GDP of a poorer country. The variable to the right will be better suited for regression analysis.
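As a rough sketch of what this means in a regression, we can fit the same model with each version of the variable and compare the share of variance explained. The coefficient on the logged variable can also be turned into the expected difference in life expectancy associated with a doubling of GDP per capita, by multiplying it by $\ln(2)$. This is only an illustration, assuming the variables created above are in memory; no results are shown here.

```stata
* Untransformed GDP per capita
quietly regress wdi_lifexp gle_rgdpc
display "R-squared, untransformed: " %5.3f e(r2)

* Natural logarithm of GDP per capita
quietly regress wdi_lifexp ln_gdpc
display "R-squared, logged:        " %5.3f e(r2)

* Expected difference in life expectancy for a doubling of GDP per capita
display "Doubling GDP per capita:  " %5.2f _b[ln_gdpc]*ln(2)
```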
Skewed data might be a problem for statistical analyses, since it gives too much weight to extreme values, and causes the regression line to fit worse to the big mass of observations.
When the data is skewed in a particular way, with many small values but only a few large ones, logarithmic transformation might be a good option. This is especially true when we have good reason to believe that the data-generating process is self-reinforcing. Examples are population growth and incomes.
Logarithmic variables can be used as both dependent and independent variables in regression analyses. But then they need to be interpreted in a certain way. See a separate guide for that.