Here are two related questions, but they are not duplicates of mine: the first one has a solution specific to its data set, and the second one involves a failure of glm when start is supplied alongside an offset.
I have the following dataset:
    library(data.table)

    df <- data.frame(names = factor(1:10))
    set.seed(0)
    df$probs <- c(0, 0, runif(8, 0, 1))
    df$response <- lapply(df$probs, function(i) {
      rbinom(50, 1, i)
    })

    dt <- data.table(df)
    dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]
such that dt is:

    > dt
         names     probs response
      1:     1 0.0000000        0
      2:     1 0.0000000        0
      3:     1 0.0000000        0
      4:     1 0.0000000        0
      5:     1 0.0000000        0
     ---
    496:    10 0.9446753        0
    497:    10 0.9446753        1
    498:    10 0.9446753        1
    499:    10 0.9446753        1
    500:    10 0.9446753        1
I am trying to fit a binomial regression model with the identity link (instead of the usual logit link), using:

    lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'))
This gives an error:
Error: no valid set of coefficients has been found: please supply starting values
I tried fixing it by supplying a start argument, but then I get another error:

    > lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0, 1))
    Error: cannot find valid starting values: please specify some
At this point these errors make no sense to me and I have no idea what to do.
EDIT: @iraserd has thrown some more light on this problem. Using start = c(0.5, 0.5), I get:

    > lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0.5, 0.5))
    There were 25 warnings (use warnings() to see them)
    > warnings()
    Warning messages:
    1: step size truncated: out of bounds
    2: step size truncated: out of bounds
    3: step size truncated: out of bounds
    4: step size truncated: out of bounds
    5: step size truncated: out of bounds
    6: step size truncated: out of bounds
    7: step size truncated: out of bounds
    8: step size truncated: out of bounds
    9: step size truncated: out of bounds
    10: step size truncated: out of bounds
    11: step size truncated: out of bounds
    12: step size truncated: out of bounds
    13: step size truncated: out of bounds
    14: step size truncated: out of bounds
    15: step size truncated: out of bounds
    16: step size truncated: out of bounds
    17: step size truncated: out of bounds
    18: step size truncated: out of bounds
    19: step size truncated: out of bounds
    20: step size truncated: out of bounds
    21: step size truncated: out of bounds
    22: step size truncated: out of bounds
    23: step size truncated: out of bounds
    24: step size truncated: out of bounds
    25: glm.fit: algorithm stopped at boundary value
and
    > summary(lm2)

    Call:
    glm(formula = response ~ probs, family = binomial(link = "identity"),
        data = dt, start = c(0.5, 0.5))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.4023  -0.6710   0.3389   0.4641   1.7897

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) 1.486e-08  1.752e-06   0.008    0.993
    probs       9.995e-01  2.068e-03 483.372   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 69312  on 49999  degrees of freedom
    Residual deviance: 35984  on 49998  degrees of freedom
    AIC: 35988

    Number of Fisher Scoring iterations: 24
I highly suspect this has something to do with the fact that some of the responses are generated with true probability zero, which causes problems as the coefficient of probs approaches 1.
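To check that suspicion, here is a minimal sketch. It assumes that glm rejects starting values whenever the implied means fail the family's validmu() check, which I believe is what glm.fit does but have not confirmed for every R version. The helper mean_at_start() is my own hypothetical name, not part of stats. With the identity link the fitted mean is just the linear predictor, so start = c(0, 1) puts the mean at exactly 0 for the rows where probs is 0.

```r
# Recreate the predictor values from the question (one value per level of names).
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))

fam <- binomial(link = "identity")

# Hypothetical helper (not part of stats): with the identity link,
# the mean implied by a coefficient vector is the linear predictor itself.
mean_at_start <- function(start, x) start[1] + start[2] * x

mu_bad <- mean_at_start(c(0, 1), probs)      # equals probs, so it is 0 where probs == 0
mu_ok  <- mean_at_start(c(0.5, 0.5), probs)  # stays strictly between 0.5 and 1

fam$validmu(mu_bad)  # FALSE: the binomial family requires 0 < mu < 1
fam$validmu(mu_ok)   # TRUE
```

If this is right, it would explain why start = c(0, 1) is refused outright while start = c(0.5, 0.5) is accepted and the fit then drifts toward the boundary, where the intercept is pushed to essentially zero.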