T.J. Molendijk

2 records found

Conditional quantiles and variable selection under discretized covariates

This thesis takes a step towards developing a vine-based regression method tailored to mixed-type (continuous and discrete) data. We visualize the difference between the continuous conditional quantile functions of bivariate copulas and their discretized counterparts, obtained by binning the continuous variable into bins of equal probability. This shows how discretization into a small number of bins loses dependence, especially in the tails. We show that a different binning procedure can better preserve dependence after discretization, which motivates developing methodologies that compute an ‘optimal’ binning procedure for given data, where optimal means balancing tail dependence against dependence around the median. With this application in mind, we attempt to quantify analytically the difference between discretized and continuous conditional quantiles of bivariate copulas, but the analytical derivation is not possible in most settings (with exceptions for bivariate Clayton and bivariate Frank copulas, and the smallest conditioning value of the discretized covariate).
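The comparison described above can be illustrated with a minimal sketch for the bivariate Clayton copula. It is not the thesis's code: the function names, the parameter value theta = 2, the choice of three bins, and the convention that the conditioning variable U is the one being binned are all assumptions made for illustration. The continuous quantile uses the standard closed-form inverse of the Clayton conditional distribution; the binned quantile is found by bisection on P(V <= v | a < U <= b) = (C(b, v) - C(a, v)) / (b - a).

```python
def clayton_cdf(u, v, theta):
    """Clayton copula C(u, v) for theta > 0, with C = 0 on the lower boundary."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def continuous_quantile(alpha, u, theta):
    """alpha-quantile of V given U = u (closed-form inverse for Clayton)."""
    t = (alpha ** (-theta / (1.0 + theta)) - 1.0) * u ** -theta + 1.0
    return t ** (-1.0 / theta)


def binned_cond_cdf(v, a, b, theta):
    """P(V <= v | a < U <= b) for uniform margins."""
    return (clayton_cdf(b, v, theta) - clayton_cdf(a, v, theta)) / (b - a)


def binned_quantile(alpha, a, b, theta, tol=1e-12):
    """alpha-quantile of V given that U falls in the bin (a, b], by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binned_cond_cdf(mid, a, b, theta) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative settings (assumptions, not values taken from the thesis).
theta, n_bins, alpha = 2.0, 3, 0.05
for j in range(n_bins):
    a, b = j / n_bins, (j + 1) / n_bins
    u_mid = 0.5 * (a + b)
    print(
        f"bin ({a:.2f}, {b:.2f}]: binned q = {binned_quantile(alpha, a, b, theta):.4f}, "
        f"continuous q at u = {u_mid:.2f}: {continuous_quantile(alpha, u_mid, theta):.4f}"
    )
```

Varying `n_bins` and `alpha` in this sketch gives a rough feel for how the binned quantile, which is constant within a bin, departs from the continuous quantile curve.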

From the aforementioned conclusion that discretization loses dependence, we infer that a variable selection measure tailored to mixed-type data should be biased against discretized covariates, and that this bias should decrease monotonically with the number of bins of the discretized covariate. The bias for or against discretized covariates of various variable selection measures is investigated in a scenario with a single covariate and in a scenario with two covariates. With a single covariate, Pearson’s/polyserial correlation and Kendall’s tau/tau-b were found unsuitable as variable selection measures for mixed-type data, due to a lack of bias against discretized covariates. Conditional log-likelihood and check-loss at the quantile level 0.05 seem nearly identically biased in the bivariate setting, with the caveat that all simulation scenarios are homoskedastic. Finally, we show in a three-dimensional setting that correlation between covariates does not seem to affect predictive performance when both covariates are continuous, but has significant negative effects on predictive performance when one of the covariates is discretized. This difference between the effect of discretization in two dimensions and three dimensions should be kept in mind when developing variable selection procedures for mixed-type data. ...
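For readers unfamiliar with the check-loss mentioned above, the following is a minimal sketch of the standard check (pinball) loss used in quantile regression, evaluated at the quantile level 0.05. The function name and the toy numbers are assumptions for illustration and do not come from the thesis.

```python
def check_loss(y_true, q_pred, tau):
    """Average check (pinball) loss of predicted tau-quantiles q_pred."""
    total = 0.0
    for y, q in zip(y_true, q_pred):
        r = y - q  # residual: positive when the prediction is too low
        total += r * (tau - (1.0 if r < 0 else 0.0))
    return total / len(y_true)


# Toy data (hypothetical): at tau = 0.05, predicting too high is penalized
# far more heavily than predicting too low by the same amount.
y = [0.2, 0.5, 0.9]
print(check_loss(y, [0.1, 0.4, 0.8], tau=0.05))  # predictions too low
print(check_loss(y, [0.3, 0.6, 1.0], tau=0.05))  # predictions too high
```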
Bachelor thesis (2023) - T.J. Molendijk, J. Komjáthy, A. Bishnoi, M.E.L. Jones
The k-truncated metric dimension of a graph is the minimum number of sensors (a subset of the vertex set) needed to uniquely identify every vertex in the graph based on its distance to the sensors, where the sensors have a measuring range of k. We present an algorithm that, given any tree and any measuring range k, aims to find the k-truncated metric dimension of that tree. The algorithm is a modification of the algorithm given by Gutkovich and Song Yeoh [6], and improves on it in both validity and time complexity. We show that, given any tree and any value of k, the algorithm returns a k-resolving set for that tree. Moreover, we conjecture that the difference between the size of the k-resolving set found by the algorithm and the k-truncated metric dimension of the tree is never greater than one. The time complexity of the algorithm is proven to be O(k^3 n), where k is the measuring range of the sensors and n is the number of vertices in the tree. This implies that the time complexity is linear in n for fixed k. ...
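To make the definition concrete, here is a minimal sketch that checks whether a sensor set is k-resolving and finds the k-truncated metric dimension of a tiny graph by exhaustive search. It is not the algorithm of the thesis or of Gutkovich and Song Yeoh: the brute force is exponential in the number of vertices and only serves as a reference for small examples. It assumes the common convention that a sensor reports min(d, k + 1) for a vertex at distance d, so every vertex beyond the measuring range looks the same; all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def truncated_distances(adj, source, k):
    """BFS from source; every vertex farther than k gets the value k + 1."""
    dist = {v: k + 1 for v in adj}  # k + 1 doubles as the "not reached" marker
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # neighbours of u would lie outside the measuring range
        for w in adj[u]:
            if dist[w] == k + 1:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_k_resolving(adj, sensors, k):
    """True if the truncated distance vectors to the sensors are all distinct."""
    tables = [truncated_distances(adj, s, k) for s in sensors]
    signatures = {tuple(t[v] for t in tables) for v in adj}
    return len(signatures) == len(adj)


def brute_force_dimension(adj, k):
    """Smallest k-resolving set by exhaustive search (tiny graphs only)."""
    vertices = list(adj)
    for size in range(len(vertices) + 1):
        for sensors in combinations(vertices, size):
            if is_k_resolving(adj, sensors, k):
                return size, sensors


# Example: the path 0 - 1 - 2 - 3 - 4 as an adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
for k in (0, 1, 4):
    print(k, brute_force_dimension(path, k))
```

On small trees, a reference value like this could be compared against the output of a faster algorithm, for instance to probe the conjectured gap of at most one.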