Continuous and Binary Dependent Variables

An exploration of the use of continuous and binary dependent variables in causal inference.

Ben Ewing


In his 2011 paper “Land tenure and investment incentives: Evidence from West Africa,” James Fenske notes that “studies that use binary investment measures…are also less likely to find a statistically significant effect.” He seems to attribute this to (1) small sample sizes and (2) the lack of nuance in binary variables, and he argues that continuous measures of investment intensity are better suited to causal analysis. As Fenske points out, however, continuous variables are usually noisier. For example, imagine asking a farmer how many kilograms of fertilizer they applied to each plot of land in the past year versus simply asking whether they applied any fertilizer to each plot.

While small sample sizes can certainly cause estimation issues (frequentist statistics rely heavily on asymptotic results, after all), I’d like to explore why binary outcomes might more often turn up statistically insignificant in this context.

I will assume the use of OLS rather than logistic regression; this linear probability model approach is fairly common in the economics literature for binary outcomes. I will write in terms of fertilizer amount (continuous) / fertilizer use (binary) and a treatment meant to increase fertilizer use, but these results generalize.

R Setup

# Data manipulation
library(dplyr)
library(purrr)
library(tidyr)
# Modeling
library(broom)
# Plotting and tables
library(ggplot2)
library(ggcute)
library(knitr)

# R settings
theme_set(theme_minimal() + theme(legend.position = "none"))
theme_update(panel.background = element_rect(fill = "transparent", 
                                             colour = NA),
             plot.background = element_rect(fill = "transparent", 
                                            colour = NA))

opts_chunk$set(echo = T, warning = F, message = F, tidy = F,
               fig.width = 8.5, fig.height = 6,
               dev.args = list(bg = "transparent"))

Information Loss

Binarizing a continuous variable, even if noisy, will result in some amount of information loss. As Fenske points out, we go from estimating the amount of fertilizer used to an indicator for any fertilizer use. If few farmers use fertilizer to begin with, then information loss will be low. That is, we can still tell the difference between farmers who fertilize, and those who don’t. However, if most farmers use some amount of fertilizer, then the binarization will make it very difficult to test for any treatment effect.
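To put a rough number on this: if the outcome follows a Poisson distribution (as in the simulations below), the share of nonzero observations is 1 - exp(-lambda), so the binarized indicator retains far more variation when lambda is small. A quick sketch of the two cases:

```r
# Probability a Poisson(lambda) draw is nonzero: P(Y > 0) = 1 - exp(-lambda)
p_nonzero <- function(lambda) 1 - exp(-lambda)

p_nonzero(0.5)  # ~0.39: the indicator still splits the sample usefully
p_nonzero(2)    # ~0.86: most observations collapse to 1, little variation left
```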

# Simulate information loss
bind_rows(
  tibble(
    y = rpois(1000, 0.5),
    y_bin = ifelse(y > 0, 1, 0),
    lab = "Low Information Loss"
  ),
  tibble(
    y = rpois(1000, 2),
    y_bin = ifelse(y > 0, 1, 0),
    lab = "High Information Loss"
  )
) %>% 
  gather(variable, value, -lab) %>% 
  ggplot(aes(value, fill = variable)) +
  geom_bar(position = "dodge") +
  facet_wrap(vars(lab)) +
  scale_fill_fairyfloss()

I will be using simulations to explore this issue. Data will be drawn from a Poisson distribution, which has mean equal to its only parameter, lambda. This is quite convenient for simulating treatment effects, as we can easily increase the treatment effect by shifting lambda.
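As a quick sanity check (my addition, not part of the original simulations): the mean of a Poisson draw tracks lambda, so adding a treatment effect tau to lambda adds exactly tau to the expected continuous outcome.

```r
set.seed(42)
# E[Y] = lambda for a Poisson, so shifting lambda by tau = 0.2
# shifts the expected (continuous) outcome by 0.2
mean(rpois(1e6, 1.5))        # close to 1.5
mean(rpois(1e6, 1.5 + 0.2))  # close to 1.7
```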

sim_settings <- expand_grid(
  # Number of repeats for each group of simulation settings
  m = 1:5,
  # Sample size
  n = seq(100, 3000, 100),
  # Lambda (mean of Poisson)
  lambda = c(0.5, 1, 1.5),
  # Treatment effect in absolute terms
  # (i.e. added to lambda)
  treat_effect = seq(0.01, 0.2, 0.05)
)

sims <- pmap_df(sim_settings, function(m, n, lambda, treat_effect) {
  # Generate data
  treat <- rep(0:1, each = n/2)
  y <- c(rpois(n/2, lambda), rpois(n/2, lambda + treat_effect))
  y_bin <- ifelse(y > 0, 1, 0)
  # Run both models and keep the treatment coefficient from each
  bind_rows(
    lm(y ~ treat) %>% tidy(),
    lm(y_bin ~ treat) %>% tidy()
  ) %>% 
    filter(term == "treat") %>% 
    mutate(m = m, n = n, lambda = lambda, treat_effect = treat_effect,
           outcome = c("continuous", "binary"))
})

Binarization attenuates the estimated treatment effect, with higher information loss (i.e. higher lambda) leading to more bias.
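The attenuation can be worked out directly (my derivation, not from Fenske). The binary regression estimates the effect on the probability of any use: P(Y > 0 | lambda + tau) - P(Y > 0 | lambda) = exp(-lambda) * (1 - exp(-tau)). This is always smaller than tau and shrinks as lambda grows, matching the pattern in the simulations:

```r
# True effect on the binary outcome implied by a shift of tau in lambda:
#   P(Y > 0 | lambda + tau) - P(Y > 0 | lambda) = exp(-lambda) * (1 - exp(-tau))
binary_effect <- function(lambda, tau) exp(-lambda) * (1 - exp(-tau))

binary_effect(0.5, 0.2)  # ~0.11, fairly close to the continuous effect of 0.2
binary_effect(1.5, 0.2)  # ~0.04, heavily attenuated
```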

sims %>% 
  group_by(n, lambda, treat_effect, outcome) %>% 
  summarise(bias = mean(estimate - treat_effect)) %>% 
  ungroup() %>% 
  mutate(lambda = paste0("Lambda = ", lambda),
         treat_effect = paste0("Treat = ", treat_effect)) %>% 
  ggplot(aes(n, bias, colour = outcome)) +
  geom_hline(yintercept = 0.0) +
  geom_point() +
  geom_line() +
  scale_colour_fairyfloss() +
  facet_grid(treat_effect ~ lambda) +
  theme(legend.position = "bottom") +
  labs(title = "Sample Size Vs. Bias",
       subtitle = "across Treatment Effect Size and Lambda")

While effect size estimates are biased, they still seem to work well enough for significance testing in situations where information loss isn’t extreme.
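One way to gauge this without rerunning the simulations (again, my addition): treat the binarized comparison as a two-sample test of proportions and compute approximate power with base R's power.prop.test, plugging in the nonzero shares implied by the Poisson rates.

```r
# Approximate power for the binarized comparison at lambda = 0.5, tau = 0.2,
# with 500 farmers per arm, using the implied nonzero shares
p_control <- 1 - exp(-0.5)
p_treated <- 1 - exp(-0.7)
power.prop.test(n = 500, p1 = p_control, p2 = p_treated)$power
```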

sims %>% 
  group_by(n, lambda, treat_effect, outcome) %>% 
  summarise(p.value = mean(p.value)) %>% 
  ungroup() %>% 
  mutate(lambda = paste0("Lambda = ", lambda),
         treat_effect = paste0("Treat = ", treat_effect)) %>% 
  ggplot(aes(n, p.value, colour = outcome)) +
  geom_hline(yintercept = 0.1) +
  geom_point() +
  geom_line() +
  scale_colour_fairyfloss() +
  facet_grid(treat_effect ~ lambda) +
  theme(legend.position = "bottom") +
  labs(title = "Sample Size Vs. P-Value",
       subtitle = "across Treatment Effect Size and Lambda")


I suspect Fenske is correct to question the use of binary investment indicators over continuous intensity measures. However, it is clear to me that binary indicators are fine in many situations, e.g. measuring the uptake of new agricultural technologies. For situations where the outcome is already common at a low level (and the sample size is high enough), a coarse-grained intensity measure may provide enough information to capture a reasonable estimate of the treatment effect.

Session Info

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] knitr_1.33        ggcute_0.0.0.9000 ggplot2_3.3.5    
[4] broom_0.7.8       tidyr_1.1.3       purrr_0.3.4      
[7] dplyr_1.0.7      

loaded via a namespace (and not attached):
 [1] highr_0.9         pillar_1.6.1      bslib_0.2.5.1    
 [4] compiler_4.1.0    jquerylib_0.1.4   tools_4.1.0      
 [7] digest_0.6.27     downlit_0.2.1     gtable_0.3.0     
[10] jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0  
[13] tibble_3.1.2      pkgconfig_2.0.3   rlang_0.4.11     
[16] DBI_1.1.1         distill_1.2       yaml_2.2.1       
[19] xfun_0.24         withr_2.4.2       stringr_1.4.0    
[22] generics_0.1.0    vctrs_0.3.8       sass_0.4.0       
[25] grid_4.1.0        tidyselect_1.1.1  glue_1.4.2       
[28] R6_2.5.0          fansi_0.5.0       rmarkdown_2.9    
[31] farver_2.1.0      magrittr_2.0.1    scales_1.1.1     
[34] backports_1.2.1   ellipsis_0.3.2    htmltools_0.5.1.1
[37] assertthat_0.2.1  colorspace_2.0-2  labeling_0.4.2   
[40] utf8_1.2.1        stringi_1.6.2     munsell_0.5.0    
[43] crayon_1.4.1     


For attribution, please cite this work as

Ewing (2020, Feb. 14). Improper Prior | Ben Ewing: Continuous and Binary Dependent Variables. Retrieved from

BibTeX citation

  author = {Ewing, Ben},
  title = {Improper Prior | Ben Ewing: Continuous and Binary Dependent Variables},
  url = {},
  year = {2020}