I avoid posting technical notes here. This is an exception because I have an agenda.
Log transformation is widely used in modeling data for several reasons: making data "behave," estimating elasticities, and so on.
When an outcome variable naturally has zeros, however, log transformation is tricky because log(0) is undefined. Many data modelers (including seasoned researchers) instinctively add a positive constant to each value of the outcome variable. One popular idea is to add 1, so that raw zeros stay zeros after the transformation (log(0 + 1) = 0). Another idea is to add a very small constant, especially when the scale of the outcome variable is small.
Well, the bad news is that these are arbitrary choices, and the resulting estimates may be biased. To me, if an analysis is correlational (as most are), a small bias may not be a big concern. If it is causal and, for example, an estimated elasticity will be used to take action (with the intention of changing an outcome), that's trouble waiting to happen. This is a problem of data centricity.
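To see how much the constant can matter, here is a minimal Python sketch on simulated data. Everything in it is an assumption I made for illustration: the data-generating process (a true coefficient of 0.5 on x and roughly 30% exact zeros), the helper names simulate and slope_log_plus_c, and the three constants tried.

```python
import numpy as np

def simulate(n=20000, seed=0):
    """Hypothetical data: y = exp(1 + 0.5*x) * eta, where eta is 0 about 30% of the time."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    positive = rng.random(n) < 0.7                      # ~30% of outcomes are exact zeros
    noise = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # multiplicative error, independent of x
    y = np.exp(1.0 + 0.5 * x) * positive * noise
    return x, y

def slope_log_plus_c(x, y, c):
    """OLS slope from regressing log(y + c) on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, np.log(y + c), rcond=None)
    return beta[1]

x, y = simulate()
slopes = {c: slope_log_plus_c(x, y, c) for c in (1.0, 0.1, 0.001)}
for c, b in slopes.items():
    # The coefficient on x used to generate the data is 0.5.
    print(f"c = {c:<6} slope = {b:.3f}")
```

If the choice of constant were innocuous, the three slopes would roughly agree. In this setup they should not, which is the arbitrariness I am complaining about.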
What is the solution (other than deserting to Poisson, etc.)? A recent study by Christophe Bellégo and his coauthors offers one called iOLS (iterated OLS). To avoid bias, the iOLS algorithm adds an observation-specific value to the outcome variable. Voilà! I haven't tested it yet, but I like the idea. Read their nicely written paper here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3444996
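To make the idea of an observation-specific value concrete, here is a rough Python sketch of an iterated regression in that spirit. It is not the authors' algorithm or their Stata code. I am assuming the added value has the form delta * exp(x'beta), recomputed from the current estimate at each pass, and I center the transformed outcome at each step as my own simplification, so only the slope (not the intercept) is meaningful here. The function names, delta = 1, and the simulated data are all hypothetical; check the paper for the actual estimator and its correction term.

```python
import numpy as np

def simulate(n=20000, seed=0):
    """Hypothetical data: y = exp(1 + 0.5*x) * eta, where eta is 0 about 30% of the time."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    positive = rng.random(n) < 0.7
    noise = rng.lognormal(mean=0.0, sigma=0.5, size=n)
    return x, np.exp(1.0 + 0.5 * x) * positive * noise

def ols(X, target):
    """Least-squares coefficients of target on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta

def iterated_ols_sketch(x, y, delta=1.0, tol=1e-8, max_iter=500):
    """Repeat OLS on log(y + delta * exp(X @ beta)) until beta stops changing."""
    X = np.column_stack([np.ones_like(x), x])
    beta = ols(X, np.log(y + 1.0))                    # naive log(y + 1) starting values
    for _ in range(max_iter):
        fitted = X @ beta
        shift = delta * np.exp(fitted)                # observation-specific added value (assumed form)
        target = np.log(y + shift)
        target = target - np.mean(target - fitted)    # centering step: my simplification
        new_beta = ols(X, target)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta

x, y = simulate()
beta_hat = iterated_ols_sketch(x, y)
print(f"iterated slope: {beta_hat[1]:.3f} (value used to simulate: 0.5)")
```

On this simulated data, where the error is independent of x by construction, the iterated slope should land much closer to 0.5 than the log(y + 1) slope does. That says nothing about real data or about the properties proved in the paper.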
My (not so hidden) agenda concerns the implementation. The authors offer a Stata implementation (https://github.com/ldpape/iOLS). I would love to see it in R (or Python). Hence, this is a call to action.
#r #rstats #python #modeling #log #transformation #datascience #analytics #datacentricity