This thesis takes a step towards developing a vine-based regression method tailored to mixed-type (continuous and discrete) data. We visualize the difference between the continuous conditional quantile functions of bivariate copulas and their discretized counterparts, obtained by binning the continuous variable into bins of equal probability. This shows that discretization into a small number of bins loses dependence, especially in the tails. We show that a different binning procedure can better preserve dependence after discretization, which motivates the development of methods that compute an ‘optimal’ binning procedure for given data, where optimal means balancing tail dependence against dependence around the median. With a view to computing such an ‘optimal’ binning procedure, we attempt to quantify analytically the difference between discretized and continuous conditional quantiles of bivariate copulas. This analytical derivation turns out not to be possible in most settings (with exceptions for bivariate Clayton and bivariate Frank copulas, and the smallest conditioning value of the discretized covariate).
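As an illustration of the equal-probability binning described above, the following Python sketch compares continuous and discretized conditional 0.05-quantiles for a bivariate Clayton copula. The parameter value theta = 2, the choice of four bins, the sample size and all function names are assumptions made for illustration and are not taken from the thesis. The discretized quantiles are estimated by simulation here, not derived analytically.

```python
import numpy as np

def clayton_cond_quantile(alpha, u, theta):
    """alpha-quantile of V given U = u under a bivariate Clayton copula (theta > 0)."""
    return (u ** (-theta) * (alpha ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)

def sample_clayton(n, theta, rng):
    """Draw (U, V) from a Clayton copula by conditional inversion."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    # Plugging a uniform w in as the quantile level yields V | U = u.
    v = clayton_cond_quantile(w, u, theta)
    return u, v

def equal_probability_bins(u, k):
    """Map uniform values in [0, 1) to bin indices 0..k-1 of equal probability 1/k."""
    return np.minimum((u * k).astype(int), k - 1)

def discretized_cond_quantiles(u, v, k, alpha):
    """Empirical alpha-quantile of V within each equal-probability bin of U."""
    bins = equal_probability_bins(u, k)
    return np.array([np.quantile(v[bins == j], alpha) for j in range(k)])

# Illustrative settings (assumptions, not values from the thesis).
rng = np.random.default_rng(0)
theta, k, alpha = 2.0, 4, 0.05

u, v = sample_clayton(200_000, theta, rng)
q_disc = discretized_cond_quantiles(u, v, k, alpha)

edges = np.linspace(0.0, 1.0, k + 1)
mid = (edges[:-1] + edges[1:]) / 2.0
q_cont_mid = clayton_cond_quantile(alpha, mid, theta)

for j in range(k):
    print(f"bin {j}: discretized q = {q_disc[j]:.4f}, "
          f"continuous q at bin midpoint = {q_cont_mid[j]:.4f}")
```

Within each bin the continuous conditional quantile still varies with the conditioning value, whereas the discretized version returns a single value per bin. That gap is the loss of information the visual comparison is meant to expose.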
From the conclusion that discretization loses dependence, we infer that a variable selection measure tailored to mixed-type data should be biased against discretized covariates, and that this bias should decrease monotonically as the number of bins of the discretized covariate increases. We investigate the bias of various variable selection measures for or against discretized covariates, in a scenario with a single covariate and in a scenario with two covariates. With a single covariate, Pearson’s/polyserial correlation and Kendall’s tau/tau-b are found unsuitable as variable selection measures for mixed-type data, because they lack a bias against discretized covariates. Conditional log-likelihood and check-loss at the quantile level 0.05 appear nearly identically biased in the bivariate setting, with the caveat that all simulation scenarios are homoskedastic. Finally, we show in a three-dimensional setting that correlation between covariates does not seem to affect predictive performance when both covariates are continuous, but has a significant negative effect on predictive performance when one of the covariates is discretized. This difference between the effect of discretization in two dimensions and in three dimensions should be kept in mind when developing variable selection procedures for mixed-type data.
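For readers unfamiliar with the check-loss mentioned above, the following Python sketch uses its standard definition, rho_alpha(r) = r * (alpha - 1{r < 0}) with r = y - q, averaged over observations. The function name and the simulated normal data are assumptions for illustration only; this is not the simulation design of the thesis.

```python
import numpy as np

def check_loss(y, q_pred, alpha):
    """Average check (pinball) loss of predictions q_pred for the alpha-quantile of y."""
    r = np.asarray(y, dtype=float) - np.asarray(q_pred, dtype=float)
    # Residuals above the prediction are weighted by alpha, those below by (1 - alpha).
    return float(np.mean(r * (alpha - (r < 0))))

# Hypothetical data: a constant prediction equal to the empirical 0.05-quantile
# should incur a lower check-loss at level 0.05 than the empirical median.
rng = np.random.default_rng(1)
y = rng.normal(size=10_000)
alpha = 0.05

loss_at_quantile = check_loss(y, np.quantile(y, alpha), alpha)
loss_at_median = check_loss(y, np.median(y), alpha)
print(f"check-loss at 0.05-quantile: {loss_at_quantile:.4f}")
print(f"check-loss at median:        {loss_at_median:.4f}")
```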