S-estimation in Linear Models with Structured Covariance Matrices

We provide a unified approach to S-estimation in balanced linear models with structured covariance matrices. Of main interest are S-estimators for linear mixed effects models, but our approach also includes S-estimators in several other standard multivariate models, such as multiple regression, multivariate regression, and multivariate location and scatter. We provide sufficient conditions for the existence of S-functionals and S-estimators, establish asymptotic properties such as consistency and asymptotic normality, and derive their robustness properties in terms of breakdown point and influence function. All the results are obtained for general identifiable covariance structures and are established under mild conditions on the distribution of the observations, which go far beyond models with elliptically contoured densities. Some of our results are new and others are more general than existing ones in the literature. In this way, this manuscript completes and improves upon results on S-estimation in a wide variety of multivariate models. We illustrate our results by means of a simulation study and an application to data from a trial on the treatment of lead-exposed children.


Introduction
Linear models are widely used and provide a versatile approach for analyzing correlated responses, such as longitudinal data, growth data or repeated measurements. In such models, each subject i, i = 1, . . . , n, is observed at k_i occasions, and the vector of responses y_i is assumed to arise from the model y_i = X_i β + u_i, where X_i is the design matrix for the i-th subject and u_i is a vector whose covariance matrix can be used to model the correlation between the responses. One possibility is the linear mixed effects model, in which the random effects together with the measurement error yield a specific covariance structure depending on a vector θ consisting of some unknown covariance parameters. Other covariance structures may arise, for example if the u_i are the outcome of a time series; see e.g., [14] or [10] for different possible covariance structures. Maximum likelihood estimation of β and θ has been studied, e.g., in [12,25,15], see also [10,7]. To be resistant against outliers, robust methods have been investigated for linear mixed effects models, e.g., in [22,5,4,13,1,3]. This mostly concerns S-estimators, originally introduced in the multiple regression context by Rousseeuw and Yohai [27] and extended to multivariate location and scatter in [6,17], to multivariate regression in [29], and to linear mixed effects models in [5,13,3].
S-estimators are well known smooth versions of the minimum volume ellipsoid estimator [26] that are highly resistant against outliers. As such, S-estimators have gained popularity as robust estimators, but they may also serve as initial estimators to further improve the efficiency. However, the theory about these estimators is far from complete, even in balanced models where the number of observed responses is the same for all subjects.
In view of this, we provide a unified approach to S-estimation in balanced linear models with structured covariance matrices, and postpone a unified approach for unbalanced models to a future paper. The balanced setup is already quite flexible and includes several specific multivariate statistical models. Of main interest are S-estimators for linear mixed effects models, but our approach also includes S-estimators in several other standard multivariate models, such as multiple regression, multivariate regression, and multivariate location and scatter. We provide sufficient conditions for the existence of S-functionals and S-estimators, establish their asymptotic properties, such as consistency and asymptotic normality, and derive their robustness properties in terms of breakdown point and influence function. All results are obtained for a large class of identifiable covariance structures, and are established under very mild conditions on the distribution of the observations, which go far beyond models with elliptically contoured densities. In this way, some of our results are new and others are more general than existing ones in the literature.
Existence of S-estimators and S-functionals is established under mild conditions. Although existence of the estimators seems a basic requirement, such results are missing for instance for multivariate regression in [30] and for linear mixed effects models in [5,3]. We obtain robustness properties for S-estimators, such as breakdown point and influence function, under mild conditions on collections of observations and under mild conditions on the distribution of the observations. A high breakdown point and a bounded influence function seem basic requirements for a robust method, but neither property is available for linear mixed effects models [5,3]. For multivariate regression [30], the influence function is only determined at distributions with an elliptically contoured density. Finally, we establish consistency and asymptotic normality for S-estimators under mild conditions on the distribution of the observations. A rigorous derivation is missing for multivariate regression [30], or is only available for observations from a normal distribution [27,3].
We apply our asymptotic results, such as influence function and asymptotic normality, to the special case for which the distribution of the observations corresponds to an elliptically contoured density. In this way we retrieve earlier results found in [27,17,30]. Somewhat surprisingly, the asymptotic variances of our S-estimators for linear mixed effects models in which the response has an elliptically contoured density differ from the ones found in [5]. We investigate this difference by means of a simulation study.
The paper is organized as follows. In Section 2, we explain the model in detail and provide some examples of standard multivariate models that are included in our setup. In Section 3 we define the S-estimator and S-functional and in Section 4 we give conditions under which they exist. In Section 5 we establish continuity of the S-functional, which is then used to obtain consistency of the S-estimator. Section 6 deals with the breakdown point. Section 7 provides the preparation for Sections 8 and 9 in which we obtain the influence function and establish asymptotic normality. Finally, in Section 10, we illustrate our results by means of a simulation and investigate the performance of our estimators by means of an application to data from a trial on the treatment of lead-exposed children. All proofs and some technical lemmas are put in an Appendix at the end of the paper. Other long and technical proofs are available as supplemental material [16].

Balanced models with structured covariances
We consider independent observations (y_1, X_1), . . . , (y_n, X_n), for which we assume the following model
y_i = X_i β + u_i,  i = 1, . . . , n,  (2.1)
where y_i ∈ R^k contains repeated measurements for the i-th subject, β ∈ R^q is an unknown parameter vector, X_i ∈ R^{k×q} is a known design matrix, and the u_i ∈ R^k are unobservable independent mean zero random vectors with covariance matrix V ∈ PDS(k), the class of positive definite symmetric k × k matrices. The model is balanced in the sense that all y_i have the same dimension. Furthermore, we consider a structured covariance matrix, that is, the matrix V = V(θ) is a known function of unknown covariance parameters combined in a vector θ ∈ R^l. We first discuss some examples that are covered by this setup.
Example 1. The linear mixed effects model arises from u_i = Zγ_i + ε_i, for i = 1, . . . , n, where Z ∈ R^{k×g} is known and γ_i ∈ R^g and ε_i ∈ R^k are independent mean zero random vectors, with unknown covariance matrices G and R, respectively. In this case V(θ) = ZGZ^T + R and θ = (vech(G)^T, vech(R)^T)^T, where, for a symmetric k × k matrix A, vech(A) is the unique k(k + 1)/2-vector that stacks the columns of the lower triangular part of A. In full generality, the model is usually overparametrized and one may run into identifiability problems. A more feasible example is obtained by taking R = σ_0^2 I_k, Z = [Z_1 · · · Z_r] and γ_i = (γ_i1, . . . , γ_ir)^T, where the Z_j's are known k × g_j design matrices and the γ_ij ∈ R^{g_j} are independent mean zero random vectors with covariance matrix σ_j^2 I_{g_j}, for j = 1, . . . , r. This leads to

y_i = X_i β + Σ_{j=1}^r Z_j γ_ij + ε_i,  i = 1, . . . , n,  (2.4)
with V(θ) = Σ_{j=1}^r σ_j^2 Z_j Z_j^T + σ_0^2 I_k and θ = (σ_0^2, σ_1^2, . . . , σ_r^2). Example 2. An example with an unstructured covariance is the multivariate linear regression model y_i = B^T x_i + u_i, i = 1, . . . , n, (2.5) where B ∈ R^{q×k} is a matrix of unknown parameters, x_i ∈ R^q is known, and the u_i, for i = 1, . . . , n, are independent mean zero random vectors with covariance matrix V(θ) = C ∈ PDS(k). In this case, the vector of unknown covariance parameters is given by θ = vech(C). (2.6) The model can be obtained as a special case of (2.1), by taking X_i = x_i^T ⊗ I_k and β = vec(B^T), where ⊗ denotes the Kronecker product and vec(·) stacks the columns of a matrix into a single vector. Clearly, the linear multiple regression model is a special case with k = 1.
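To make the reduction to (2.1) concrete, the following minimal numpy sketch (dimensions, the seed, and all names are our own illustrative choices) checks that, with X_i = x_i^T ⊗ I_k and β = vec(B^T), the product X_i β reproduces the mean response B^T x_i of the multivariate regression model.

```python
import numpy as np

rng = np.random.default_rng(0)
k, q = 3, 2                                   # response dimension and number of covariates
B = rng.normal(size=(q, k))                   # coefficient matrix in y_i = B^T x_i + u_i
x = rng.normal(size=q)                        # covariate vector x_i

# Rewrite the multivariate regression row as the general model (2.1):
# design matrix X_i = x_i^T (Kronecker) I_k and parameter beta = vec(B^T).
X_i = np.kron(x.reshape(1, -1), np.eye(k))    # shape (k, k*q)
beta = B.T.reshape(-1, order="F")             # vec(B^T): stack the columns of B^T

assert np.allclose(X_i @ beta, B.T @ x)       # both expressions give the same mean response
```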
Example 4. Also the multivariate location-scale model can be obtained as a special case of (2.1), by taking X i = I k , the k × k identity matrix. In this case, β ∈ R k is the unknown location parameter and V(θ) is the unstructured covariance matrix as in Example 2, with θ as in (2.6).
Throughout, we require that the covariance structure is identifiable, that is, the mapping θ → V(θ) is one-to-one: V(θ_1) = V(θ_2) implies θ_1 = θ_2. (2.9) This is true for all models in Examples 2, 3 and 4. This may not be true in general for the linear mixed effects model in Example 1 with unknown vech(G) and vech(R). For linear mixed effects models in (2.4), identifiability of θ = (σ_0^2, σ_1^2, . . . , σ_r^2) holds for particular choices of the design matrices Z_1, . . . , Z_r.
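As a concrete illustration of this identifiability requirement (a minimal sketch with an assumed random-intercept design; the numerical rank check is our own illustration, not the paper's formal condition (2.9)): for a variance-components structure V(θ) = σ_1^2 Z_1 Z_1^T + σ_0^2 I_k, identifiability of θ amounts to linear independence of the basis matrices, which can be verified numerically.

```python
import numpy as np

k = 4
Z1 = np.ones((k, 1))                                        # random-intercept design, as in Example 1
L_cols = [np.eye(k).reshape(-1), (Z1 @ Z1.T).reshape(-1)]   # vec(I_k) and vec(Z1 Z1^T)
L = np.column_stack(L_cols)                                 # k^2 x 2 matrix with vec(V(theta)) = L theta

# theta = (sigma0^2, sigma1^2) is identifiable from V(theta) iff L has full column rank,
# i.e. the basis matrices I_k and Z1 Z1^T are linearly independent.
print(np.linalg.matrix_rank(L))                             # 2 -> identifiable
```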

Definitions
We start by representing our observations as points in R k × R kq in the following way. For r = 1, . . . , k, let x T r denote the r-th row of the k × q matrix X, so that x r ∈ R q . We represent the pair s = (y, X) as an element in R k × R kq defined by s T = (y T , x T 1 , . . . , x T k ). In this way our observations can be represented as s 1 , . . . , s n , with s i = (y i , X i ) ∈ R k × R kq .
The S-estimator ξ_n = (β_n, θ_n) is defined as the solution to the following minimization problem
min_{β,θ} det(V(θ)) subject to (1/n) Σ_{i=1}^n ρ( ((y_i − X_i β)^T V(θ)^{-1} (y_i − X_i β))^{1/2} ) ≤ b_0,  (3.1)
where the minimum is taken over all β ∈ R^q and θ ∈ R^l, such that V(θ) ∈ PDS(k), with ρ satisfying (R1)-(R2). The S-estimator defined by (3.1) for the setup in (2.1) includes several specific cases that have been considered in the literature. The original regression S-estimator introduced by Rousseeuw and Yohai [27] is obtained as a special case by taking X_i = x_i^T, a 1 × q vector, and V(θ) = σ^2 > 0. S-estimators for multivariate location and scale, as considered in Davies [6] and Lopuhaä [17], can be obtained by taking X_i and V(θ) as in Example 4. For the multivariate regression model in Example 2, S-estimators have been considered by Van Aelst and Willems [30]. Copt and Victoria-Feser [5] and Chervoneva and Vishnyakov [3] consider S-estimators for the parameters in the linear mixed effects model (2.4).
The constant b_0, with 0 < b_0 < a_0, in (3.1) can be chosen in agreement with an assumed underlying distribution. For the multivariate regression model in [30], it is assumed that y_i | X_i has an elliptically contoured density of the form
f_{µ,Σ}(y) = det(Σ)^{-1/2} h( (y − µ)^T Σ^{-1} (y − µ) ),  (3.2)
with µ = X_i β and Σ = V(θ), where h : [0, ∞) → [0, ∞). For the linear mixed effects model in [5], it is assumed that y_i | X_i has a multivariate normal distribution, which is a special case of (3.2) with h(t) = (2π)^{−k/2} exp(−t/2). When the underlying distribution corresponds to a density of the form (3.2), then a natural choice is b_0 = E_{0,I} ρ(||z||), where z has density (3.2) with (µ, Σ) = (0, I_k). Finally, it should be emphasized that the ratio b_0/a_0 determines the breakdown point of the S-estimator (see Theorem 4), as well as its limiting variance (see Corollary 6). By choosing the constant c_0 in (R2) one then has to make a trade-off between robustness and efficiency. Note that at this point we do not assume smoothness of ρ or strict monotonicity on [0, c_0]. This means that (R1)-(R2) allow the function ρ(d) = 1 − 1_{[−c_0,c_0]}(d), which corresponds to the minimum volume ellipsoid estimator in location-scale models (see [26]) and to the least median of squares estimator in linear regression models (see [28]). Indeed, with ρ(d) = 1 − 1_{[−c_0,c_0]}(d), the S-estimator (β_n, θ_n) corresponds to the smallest cylinder of the form
C(β, θ, c) = { s = (y, X) : (y − Xβ)^T V(θ)^{-1} (y − Xβ) ≤ c^2 },  (3.3)
with c = c_0, that contains at least n − nb_0 points. Remark 3.1. Clearly, the definition of the S-estimator in (3.1) has great similarities with the S-estimator for multivariate location and covariance (see [6] and [17]), defined as the solution (t_n, C_n) to the minimization problem
min_{t,C} det(C) subject to (1/n) Σ_{i=1}^n ρ( ((y_i − t)^T C^{-1} (y_i − t))^{1/2} ) ≤ b_0,  (3.4)
where the minimum is taken over all t ∈ R^k and C ∈ PDS(k). Even more so, if all X_i are assumed to be equal to the same design matrix X of full rank, as was done in [5,4]. However, there is a subtle, but important difference between minimization problems (3.4) and (3.1). The important difference is that in (3.4) we minimize over all positive definite symmetric k × k matrices C, whereas in (3.1), we only minimize over positive definite symmetric k × k matrices V(θ), which can arise as the image of the mapping θ → V(θ). The latter collection is a subset of the former: {V(θ) : θ ∈ R^l} ∩ PDS(k) ⊆ PDS(k), and will typically be a strictly smaller subset. This means that the properties of V(θ_n) and C_n are related, but the properties of V(θ_n) cannot simply be derived from properties of C_n, not even in the case where all X_i are equal to the same X. In fact, this will lead to limiting covariances that differ from the ones found in [5], see Corollary 6.
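To make the choice of ρ and b_0 in (3.1) concrete, here is a minimal numpy sketch (the function names, the Monte Carlo sample size, and the use of a biweight-type ρ as in Section 10 are our own illustrative choices): it evaluates the constraint average in (3.1) for a candidate (β, θ) and approximates b_0 = E_{0,I} ρ(||z||) under the k-variate standard normal.

```python
import numpy as np

def rho_biweight(s, c):
    # bounded rho-function of Tukey biweight type, with sup rho = c**2 / 6
    s = np.minimum(np.abs(s), c)
    return s**2 / 2 - s**4 / (2 * c**2) + s**6 / (6 * c**4)

def s_constraint_value(ys, Xs, beta, V, c):
    # average of rho evaluated at the Mahalanobis-type distances appearing in (3.1)
    Vinv = np.linalg.inv(V)
    d = [np.sqrt((y - X @ beta) @ Vinv @ (y - X @ beta)) for y, X in zip(ys, Xs)]
    return np.mean(rho_biweight(np.array(d), c))

# Monte Carlo approximation of b0 = E_{0,I} rho(||z||) under the k-variate standard normal
rng = np.random.default_rng(1)
k, c = 4, 4.097                       # c = 4.097 gives b0/a0 close to 1/2 for k = 4 (cf. Section 10)
z = rng.standard_normal((100_000, k))
b0 = np.mean(rho_biweight(np.linalg.norm(z, axis=1), c))
print(b0, (c**2 / 6) / 2)             # b0 should be approximately a0/2, with a0 = sup rho
```

A candidate (β, θ) is then feasible for (3.1) whenever s_constraint_value(...) is at most b0.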

S-functional
The concept of S-functional is needed to investigate local robustness properties of the corresponding S-estimator, such as the influence function (see Section 8). Let s = (y, X) have a probability distribution P on R^k × R^{kq}. The S-functional ξ(P) = (β(P), θ(P)) is defined as the solution to the following minimization problem:
min_{β,θ} det(V(θ)) subject to ∫ ρ( ((y − Xβ)^T V(θ)^{-1} (y − Xβ))^{1/2} ) dP(y, X) ≤ b_0,  (3.5)
where the minimum is taken over all β ∈ R^q and θ ∈ R^l, such that V(θ) ∈ PDS(k), with ρ satisfying (R1)-(R2). As a special case, we obtain the S-estimator ξ_n = (β_n, θ_n) by taking P = P_n, the empirical measure corresponding to the observations (y_1, X_1), . . . , (y_n, X_n). In view of this connection, existence and consistency of solutions to (3.1) will follow from general results on the existence and the continuity of solutions to (3.5).
The definition of the S-functionals for the multivariate location-scale model given in Lopuhaä [17] and for the multivariate regression model given by Van Aelst and Willems [30] can be obtained as special cases of (3.5), by choosing X, β and V(θ) as in Examples 4 and 2, respectively. Copt and Victoria-Feser [5] do not pay attention to S-functionals or the influence function in the linear mixed effects model (2.4). However, S-functionals for linear mixed effects models can also be obtained as a special case of (3.5), by choosing X, β and V(θ) as in Example 1.

Existence
We will first establish existence of the S-functional ξ(P) defined by (3.5), under particular conditions on the probability measure P. As a consequence, this will also yield the existence of the S-estimator, defined by (3.1). Recall that (y_1, X_1), . . . , (y_n, X_n) are represented as points in R^k × R^{kq}. Note, however, that for linear models with intercept the first column of each X_i consists of 1's. This means that the points (y_i, X_i) are concentrated in a lower dimensional subset of R^k × R^{kq}. A similar situation occurs when all X_i are equal to the same design matrix, such as in [5]. In view of this, define X ⊂ R^{kq} as the subset of lowest dimension p = dim(X) ≤ kq satisfying P(X ∈ X) = 1.
Hence, P is then concentrated on the subset R^k × X of R^k × R^{kq}, which is of dimension k + p, possibly smaller than k + kq.
The first condition we require expresses the fact that P cannot have too much mass at infinity, in relation to the ratio r = b_0/a_0.
(C1_ε) There exists a compact set K ⊂ R^k × X, such that P(K) ≥ r + ε.
According to (4.1), in (C2_ε) one only needs to consider strips in R^k × X. Both conditions are satisfied for any 0 < ε ≤ 1 − r by any probability measure P that is absolutely continuous. Clearly, condition (C1_ε) holds for any 0 ≤ ε ≤ 1 − r for the empirical measure P_n corresponding to a collection of n points S_n = {s_1, . . . , s_n} ⊂ R^k × X. Condition (C2_ε) for ε = (k + p + 1)/n is also satisfied by the empirical measure P_n, when the collection S_n is in general position, i.e., no subset J ⊂ S_n of k + p + 1 points is contained in the same hyperplane in R^k × X. Conditions (C1_ε) and (C2_ε) together are similar to condition (C_ε) in [17]. The reason that (C1_ε) slightly deviates from [17] is to handle the presence of X in minimization problem (3.5).
where the infima are taken over all subsets J ⊂ R^k × X with P(J) ≥ ε, all vectors α ∈ R^{k+kq} with ||α|| = 1, and all levels in R. Details can be found in [16].
To establish existence of the S-functional, we follow the reasoning in [17]. The idea is to argue that one can restrict oneself to a compact set for finding solutions to (3.5). When the objective function in (3.5) is continuous, this immediately yields existence of a solution of (3.5). To this end, we assume the following condition.
(V1) The mapping θ → V(θ) is continuous.
The lemma below is fundamental for the existence of the S-functional. It requires that the identity is in V = {V(θ) ∈ PDS(k) : θ ∈ R^l} and that V is closed under multiplication with a positive scalar:
(V2) I_k ∈ V and cV ∈ V, for every V ∈ V and every c > 0.
Conditions (V1)-(V2) are not very restrictive. For example, all models in Examples 1 to 4 satisfy these conditions. For any k × k matrix A, let λ k (A) ≤ · · · ≤ λ 1 (A) denote the eigenvalues of A. We then have the following key lemma for the existence of S-functionals. The lemma is similar to Lemma 1 in [17] and its proof can be found in [16].
(iii) Let P satisfy (C2_ε) and suppose that P(C(β, θ, c)) ≥ a > 0. Suppose there exists a compact set K as in (C1_ε). Then ||β|| ≤ M, where M only depends on c, a_2, the set K, and a constant γ > 0 that can be deduced from condition (C2_ε).
Lemma 1 will ensure that there exists a compact set that contains all pairs (β, V(θ)) that correspond to possible solutions (β, θ) of (3.5). To establish that possible solutions (β, θ) of (3.5) are in a compact set, we need that the pre-image {θ ∈ R^l : V(θ) ∈ K} of a compact set K ⊂ R^{k×k} is again compact. Recall that subsets of R^l are compact if and only if they are closed and bounded, and note that the pre-image of a closed set under a continuous mapping is closed. Hence, in view of condition (V1), it suffices to require the following condition.
(V3) The mapping θ → V(θ) is such that the pre-image of a bounded set is bounded.
Condition (V3) is satisfied by all models in Examples 1 to 4, including the linear mixed effects model of Example 1, as long as the matrix Z is of full rank. We then have the following theorem. Theorem 1. Consider minimization problem (3.5) with ρ satisfying (R1)-(R2). Suppose that P satisfies (C1_ε) and (C2_ε), for some 0 < ε ≤ 1 − r, where r = b_0/a_0, and suppose that V satisfies (V1)-(V3). Then there exists at least one solution to (3.5).
The theorem has a direct corollary for the existence of the S-estimator, when dealing with a collection of points. Let S_n = {s_1, . . . , s_n}, with s_i = (y_i, X_i), be a collection of n points in R^k × X. Define κ(S_n) = maximal number of points of S_n lying on the same hyperplane in R^k × X. (4.4) For example, if the distribution P is absolutely continuous, then κ(S_n) ≤ k + p with probability one. We then have the following corollary. Copt and Victoria-Feser [5] consider S-estimators for the linear mixed effects model (2.4). Despite their Proposition 1 about the asymptotic behavior of solutions to their S-minimization problem [5, equation (7)], the actual existence of such a solution is not established. However, this now follows from our Corollary 1. In their case, V(θ) satisfies conditions (V1) and (V2). It can be seen that if all matrices Z_j, for j = 1, . . . , r, are of full rank, then V(θ) also satisfies (V3). The translated bi-weight ρ-function proposed in [5] satisfies (R1)-(R2). Finally, under their assumption that X_i = X is the same and y_i | X ∼ N_k(Xβ, V(θ)), it follows that κ(S_n) ≤ k. It then follows from Corollary 1 that with b_0 ≤ a_0(n − k − 1)/n, at least one solution to their S-minimization problem exists.
For the multivariate regression model from Example 2, Van Aelst and Willems [30] do not explicitly prove existence of the S-estimator. Since in their case, V(θ) = C ∈ PDS(k) satisfies (V1)-(V3) and the conditions imposed in [30] on the ρ-function satisfy (R1)-(R2), the existence of their S-estimator now also follows from Corollary 1, when b_0 is chosen suitably.
Existence of S-estimators is obtained from existence of S-functionals at the empirical measure P_n. The following corollary shows that existence can be established in general for probability measures that are close to P. It requires the following condition on P.
(C3) Let C be the class of all measurable convex subsets of R k ×R kq . Every C ∈ C is a P -continuity set, i.e., P (∂C) = 0, where ∂C denotes the boundary of C.
Condition (C3) is needed to apply (A.2). Clearly, this condition is satisfied if P is absolutely continuous.

Continuity and consistency
Consider a sequence P t , t ≥ 0, of probability measures on R k × R kq that converges weakly to P , as t → ∞. By continuity of the S-functional ξ(P ) we mean that ξ(P t ) → ξ(P ), as t → ∞. An example of such a sequence is the sequence of empirical measures P n , n = 1, 2, . . ., that converges weakly to P , almost surely. Continuity of the S-functional for this sequence would then mean that the S-estimator ξ n is consistent, i.e., ξ n = ξ(P n ) → ξ(P ), almost surely. We require an additional condition for the function ρ.
Theorem 2 is an extension of Theorem 3.1 in [17] on the continuity of S-functionals for multivariate location and scale. Continuity of S-functionals for multiple regression has been investigated in [9].
Continuity of the S-functional will be used to derive the influence function of the S-estimator in Section 8. Another nice consequence of the continuity of the S-functional is that one can directly obtain consistency of the S-estimator. Consider the S-estimator ξ_n defined by minimization problem (3.1). Recall that ξ_n = ξ(P_n), so that we can use Theorem 2 to establish consistency of the S-estimator. Corollary 3. Let ξ_n be a solution to minimization problem (3.1). Suppose ρ satisfies (R1)-(R3) and V satisfies (V1)-(V3). Suppose that P satisfies (C3) as well as (C1_{ε′}) and (C2_ε), for some 0 < ε < ε′ ≤ 1 − r, where r = b_0/a_0. If the solution ξ(P) of (3.5) is unique, then lim_{n→∞} ξ_n = ξ(P), with probability one.
Theorem 2 and Corollary 3 require that ξ(P ) is the unique solution to minimization problem (3.5). An example of a distribution P for which ξ(P ) is unique, is when P is such that y | X has an elliptically contoured density (3.2). This situation is very similar to that of multivariate location-scale S-estimators, for which Davies [6,Theorem 1] shows that the corresponding S-minimization problem (3.5) has a unique solution. The next theorem is a direct consequence of that result. Its proof can be found in [16].
Theorem 3. Suppose that ρ : R → [0, ∞) satisfies (R1)-(R2) and suppose that the probability distribution P of (y, X) is such that y | X has an elliptically contoured density f_{µ,Σ} from (3.2), with µ = Xβ_0 and Σ = V(θ_0). Suppose that h in (3.2) is non-increasing and such that the functions −ρ and h have at least one common point of decrease d_0 > 0, i.e., ρ(s) < ρ(t) and h(s) > h(t), for all s, t ≥ 0, such that s < d_0 < t. If X^T X is non-singular with probability one, then the minimization problem
min_{β,θ} det(V(θ)) subject to E[ ρ( ((y − Xβ)^T V(θ)^{-1} (y − Xβ))^{1/2} ) | X ] ≤ b_0,  (5.2)
where the minimum is taken over all β ∈ R^q and θ ∈ R^l, such that V(θ) ∈ PDS(k), has the unique solution (β, θ) = (β_0, θ_0) with probability one.
Minimization problem (5.2) seems to be slightly different from the one in (3.5). However, note that when P is such that X is equal to a single value with probability one, both minimization problems are identical. This situation was considered, e.g., in [5].
An elliptically contoured density for y i | X i in the context of S-estimators for specific cases of the model (2.1) has been assumed in [6] for the multivariate location-scale model of Example 4, in [30] for the multivariate regression model of Example 2, and in [5] for the linear mixed effects model (2.4). More precisely, in [5] it is assumed that X i = X and that y i | X has a multivariate normal distribution. In that case, the function h in (3.2) satisfies all the conditions of Theorem 3.

Global robustness: the breakdown point
Consider a collection of points S_n = {s_i = (y_i, X_i), i = 1, . . . , n} ⊂ R^k × X. To emphasize the dependence on the collection S_n, denote by ξ_n(S_n) = (β_n(S_n), θ_n(S_n)) the S-estimator, as defined in (3.1). To investigate the global robustness of S-estimators, we compute the finite-sample (replacement) breakdown point. For a given collection S_n, the finite-sample breakdown point (see Donoho and Huber [8]) of a regression S-estimator β_n is defined as the smallest proportion of points from S_n that one needs to replace in order to carry the estimator over all bounds. More precisely,
ε_n^*(β_n, S_n) = min{ m/n : sup_{S_m} ||β_n(S_m)|| = ∞ },
where the minimum runs over all possible collections S_m that can be obtained from S_n by replacing m points of S_n by arbitrary points in R^k × X. The estimator θ_n determines the covariance estimator V_n = V(θ_n). For this reason it seems natural to let the breakdown point of θ_n correspond to the breakdown of a covariance estimator. We define the finite sample (replacement) breakdown point of the S-estimator θ_n at a collection S_n as
ε_n^*(θ_n, S_n) = min{ m/n : sup_{S_m} max( λ_1(V(θ_n(S_m))), 1/λ_k(V(θ_n(S_m))) ) = ∞ },
where the minimum runs over all possible collections S_m that can be obtained from S_n by replacing m points of S_n by arbitrary points in R^k × X. So the breakdown point of θ_n is the smallest proportion of points from S_n that one needs to replace in order to make the largest eigenvalue of V(θ_n(S_m)) arbitrarily large (explosion), or to make the smallest eigenvalue of V(θ_n(S_m)) arbitrarily small (implosion). Good global robustness is illustrated by a high breakdown point. The breakdown point of the S-estimators is given in the theorem below. It extends the results for S-estimators of multivariate location and scale, see [6] and [19], and S-estimators for multivariate regression, see [30]. For S-estimators in the linear mixed effects model considered in [5], the breakdown point has not been established. This will now follow as a special case from the next theorem. Its proof can be found in [16].
Theorem 4. Consider the minimization problem (3.1) with ρ satisfying (R1)-(R2). Suppose that V satisfies (V1)-(V3). Let S_n ⊂ R^k × X be a collection of n points s_i = (y_i, X_i), i = 1, . . . , n. Let r = b_0/a_0 and suppose that 0 < r ≤ (n − κ(S_n))/(2n), where κ(S_n) is defined by (4.4). Then for any solution (β_n, θ_n) of minimization problem (3.1),
ε_n^*(β_n, S_n) ≥ ⌈nr⌉/n and ε_n^*(θ_n, S_n) = ⌈nr⌉/n.
The largest possible value of the breakdown point occurs when r = (n − κ(S_n))/(2n), in which case ⌈nr⌉/n = ⌈(n − κ(S_n))/2⌉/n = ⌊(n − κ(S_n) + 1)/2⌋/n. When the collection S_n is in general position, then κ(S_n) = k + p. In that case the breakdown point of both estimators is at least equal to ⌊(n − k − p + 1)/2⌋/n. When all X_i are equal to the same X, as in [5,4], one has p = 0 and κ(S_n) = k. In that case, the breakdown point of θ_n is equal to ⌊(n − k + 1)/2⌋/n. This coincides with the maximal breakdown point for affine equivariant estimators for k × k covariance matrices (see [6, Theorem 6]). Van Aelst and Willems [30] also take into account the case r > (n − κ(S_n))/(2n). For this case, by replacing n − ⌈nr⌉ − κ(S_n) points, a specific solution to the S-minimization problem is constructed that breaks down. However, since there may be multiple solutions to the S-minimization problem, this does not necessarily mean that all solutions break down. In the proof of our Theorem 4, for the case r ≤ (n − κ(S_n))/(2n), we show that all solutions to (3.1) do not break down when replacing at most ⌈nr⌉ − 1 points, and that the covariance part of all solutions does break down when replacing ⌈nr⌉ points. For the case r > (n − κ(S_n))/(2n), we can show that all solutions to (3.1) do not break down when replacing at most n − ⌈nr⌉ − κ(S_n) − 1 points.
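As a concrete illustration of the resulting bound (the numbers here are chosen by us to match the dimensions of the simulation in Section 10): with n = 100, k = 4 and all X_i equal to the same X, so that p = 0 and κ(S_n) = k, the maximal breakdown point of θ_n equals
⌊(n − k + 1)/2⌋/n = ⌊97/2⌋/100 = 48/100 = 0.48,
which is attained by choosing r = b_0/a_0 = (n − k)/(2n) = 0.48, since then ⌈nr⌉/n = 48/100.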

Score equations
Up to this point, properties of S-functionals and S-estimators have been derived from the minimization problems (3.1) and (3.5). To obtain the influence function and to establish the limiting distribution of S-estimators, we use the score equations that can be found by differentiation of the Lagrangian corresponding to the constrained minimization problems. To this end, we require the following additional condition on the function ρ,
(R4) ρ is continuously differentiable and u(s) = ρ′(s)/s is continuous,
and the following condition on the mapping θ → V(θ),
(V4) the mapping θ → V(θ) is continuously differentiable.
Obviously, condition (V4) implies the former condition (V1).

General covariance structures
Let ξ_P = (β_P, θ_P) be a solution to minimization problem (3.5). If we denote the corresponding Lagrange multiplier by λ_P, then the pair (ξ_P, λ_P) is a zero of all partial derivatives ∂L_P/∂β, ∂L_P/∂θ, and ∂L_P/∂λ, where L_P denotes the Lagrangian corresponding to the constrained minimization problem (3.5). If E||X|| < ∞, then under conditions (R4) and (V4), one may interchange the order of integration and differentiation in ∂L_P/∂β and ∂L_P/∂θ, on a neighborhood of ξ_P. It follows that, besides the constraint in (3.5), the pair (ξ_P, λ_P) satisfies the system of score equations (7.1), from which λ_P can be solved.
When we insert this back into the second equation in (7.1), we find the system of equations (7.2), in which the matrices H_j, for j = 1, . . . , l, are defined in (7.3). Because Σ_{j=1}^l θ_j H_j = 0, the system of equations (7.2) is linearly dependent. Similar to [17], we subtract the S-constraint from each equation: for each j = 1, . . . , l, we subtract a term involving ρ(d) − b_0 from the left hand side of equation (7.2). We then find that any solution ξ_P of (3.5) satisfies the score equation (7.4), with Ψ given in (7.5), where H_j and d(s, ξ) are defined in (7.3) and (5.1), respectively, and where we abbreviate V(θ) by V.
The regression score equation for Ψ β with the empirical measure P n for P in (7.4) coincides with the one for the regression S-estimator in the linear mixed effects model (2.4) considered in [5] (see their equation (10)). The empirical regression score equation also coincides with the one for the regression S-estimator in the multivariate regression model of Example 2 considered in [30] (see equation (2.2) in [29]). Similarly, the empirical score equation for Ψ β coincides with the one for the location S-estimator of Example 4 considered in [17].
For general covariance structures the empirical covariance score equation for Ψ θ does not compare directly to existing equations in the literature. However, as we will see in the next subsection, similar comparisons are available for models with a linear covariance structure.

Linear covariance structures
In the previous section, we solved λ from (7.1) and subtracted the S-constraint, leading to score equation (7.4) with Ψ given in (7.5). The fact that this was done in a specific way has the following reason. In cases where V(θ) is linear, say
V(θ) = Σ_{j=1}^l θ_j L_j,  (7.6)
with known symmetric k × k matrices L_1, . . . , L_l, the function Ψ_θ simplifies a lot and can also be related to the covariance psi-function in [17]. When V is of the form (7.6), then ∂V/∂θ_j = L_j and Σ_{j=1}^l θ_j (∂V/∂θ_j) = V. In this case, (7.3) simplifies to H_j = tr(V^{-1}L_j)V − kL_j, and Ψ_θ,j in (7.5) simplifies accordingly. Using that tr(A^T B) = vec(A)^T vec(B), Ψ_θ,j can be written in terms of the matrix ku(d)(y − Xβ)(y − Xβ)^T − v(d)V, which is the covariance psi-function that also appears in (2.8) in [17]. For our purposes we define
Ψ_V(s, ξ) = ku(d)(y − Xβ)(y − Xβ)^T − v(d)V.  (7.8)
The functions Ψ_θ,j, for j = 1, . . . , l, can be combined in one expression for the vector valued function Ψ_θ as follows. With
L = ∂vec(V)/∂θ^T = [vec(L_1) · · · vec(L_l)],  (7.9)
the column vector Ψ_θ = (Ψ_θ,1, . . . , Ψ_θ,l) can be written as
Ψ_θ(s, ξ) = L^T (V^{-1} ⊗ V^{-1}) vec(Ψ_V(s, ξ)),
where Ψ_V is defined in (7.8) and L in (7.9). Note that the dependence on s = (y, X) in Ψ_θ is only through the function Ψ_V. We conclude that in the case of a linear covariance structure, any solution ξ_P of (3.5) satisfies the score equation (7.4) with Ψ_θ of this form. For the multivariate regression model in Example 2, one has V(θ) = C, where θ = vech(C). The matrix L = ∂vec(V)/∂θ^T is then equal to the so-called duplication matrix D_k, which is the unique k^2 × k(k+1)/2 matrix with the properties D_k vech(C) = vec(C) and (D_k^T D_k)^{-1} D_k^T vec(C) = vech(C) (e.g., see [20, Ch. 3, Sec. 8]). Because V has full rank, it follows that equation (7.4) also holds for Ψ = (Ψ_β, Ψ_V). The resulting score equations for the empirical measure P_n corresponding to observations (y_i, X_i), for i = 1, . . . , n, are then equivalent with the ones found in [30].
For the linear mixed effects model (2.4), the covariance matrix V(θ) has a linear structure with the vector θ = (σ_0^2, . . . , σ_r^2) of unknown covariance parameters. The matrix L is then a k^2 × (r + 1) matrix and will typically be of rank r + 1 < k^2. As a consequence, in this case one cannot further simplify equation (7.4) by removing the factor L^T(V^{-1} ⊗ V^{-1}) from the function Ψ_θ. The score equation for Ψ_β resulting from the empirical measure P_n corresponding to observations (y_i, X_i), for i = 1, . . . , n, is the same as the one obtained in [5]. The corresponding score equation for Ψ_θ differs slightly from the one in [5], because the authors do not subtract a term with ρ(d) − b_0 to remove the linear dependency of the equations (7.2).
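The following minimal numpy sketch (shapes and names are our own, for illustration) builds the matrix L of (7.9) for a simple mixed effects covariance of the form (7.6) with one random effect, and checks that vec(V(θ)) = Lθ; its column rank is the quantity that also matters for identifiability of θ.

```python
import numpy as np

# Linear covariance structure (7.6): V(theta) = theta_0 * I_k + theta_1 * Z1 Z1^T
k = 4
Z1 = np.array([[1.0], [2.0], [3.0], [4.0]])                 # random-effect design (g_1 = 1)
basis = [np.eye(k), Z1 @ Z1.T]                              # L_0 = I_k, L_1 = Z1 Z1^T
theta = np.array([1.0, 0.5])                                # (sigma_0^2, sigma_1^2)

V = sum(t * Lj for t, Lj in zip(theta, basis))
L = np.column_stack([Lj.reshape(-1, order="F") for Lj in basis])   # k^2 x l matrix of (7.9)

assert np.allclose(L @ theta, V.reshape(-1, order="F"))     # vec(V(theta)) = L theta
print(np.linalg.matrix_rank(L))                             # full column rank r + 1 = 2 < k^2
```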

Local robustness: the influence function
For 0 < h < 1 and s = (y, X) ∈ R^k × R^{kq} fixed, define the perturbed probability measure P_{h,s} = (1 − h)P + hδ_s, where δ_s denotes the Dirac measure at s ∈ R^k × R^{kq}. The influence function of the functional ξ(·) at the probability measure P is defined as
IF(s; ξ, P) = lim_{h↓0} (ξ(P_{h,s}) − ξ(P))/h,  (8.1)
if this limit exists. In contrast to the global robustness measured by the breakdown point, the influence function measures the local robustness. It describes the effect of an infinitesimal contamination at a single point s on the functional (see Hampel [11]). Good local robustness is therefore illustrated by a bounded influence function.

The general case
The theorem below gives the influence function for the S-functional ξ. It extends the result for S-functionals of multivariate location and scale [17]. Under the assumption that the limit in (8.1) exists and P has an elliptically contoured density (3.2), Van Aelst and Willems [30] relate the influence function for S-functionals of multivariate regression to that of S-functionals of multivariate location and scale. For the linear mixed effects model considered in [5], the influence function has not been established. The influence function for these functionals now follows as a special case from the theorem below. We will show that the limit in (8.1) exists and derive its expression at general P. Since the value of θ determines the covariance matrix V(θ), we also include the influence function of the covariance functional. Consider the S-functional at P_{h,s_0}. From the Portmanteau theorem [2, Theorem 2.1] it can easily be seen that P_{h,s_0} → P, weakly, as h ↓ 0. Therefore, under the conditions of Corollary 2 and Theorem 2, it follows that there exist solutions ξ(P_{h,s_0}) and ξ(P) to minimization problems (3.5) at P_{h,s_0} and P, respectively, and that ξ(P_{h,s_0}) → ξ(P), as h ↓ 0.
Theorem 5. Let ξ(P_{h,s_0}) and ξ(P) be solutions to minimization problems (3.5) at P_{h,s_0} and P, respectively, and suppose that ξ(P_{h,s_0}) → ξ(P), as h ↓ 0. Suppose that ρ satisfies (R4) and V satisfies (V4). Let Ψ be defined in (7.5) and suppose that
Λ(ξ) = ∫ Ψ(s, ξ) dP(s)  (8.2)
is continuously differentiable with a non-singular derivative D(P) at ξ(P). Then for s_0 ∈ R^k × R^{kq},
IF(s_0; ξ, P) = −D(P)^{-1} Ψ(s_0, ξ(P)).
For the covariance functional C(P) = V(θ(P)), it holds that
IF(s_0; vec(C), P) = (∂vec(V(θ))/∂θ^T) IF(s_0; θ, P),
with the derivative evaluated at θ = θ(P). To investigate the local robustness of S-estimators, we derive the following bound on the influence function for ξ(P).
Its proof can be found in [16].

Elliptically contoured densities
When P is such that y | X has an elliptically contoured density (3.2) and V(θ) is linear, we can obtain a more detailed expression for the influence function. This requires the following condition on the function ρ,
(R5) ρ is twice continuously differentiable,
and the following condition on the mapping θ → V(θ),
(V5) the mapping θ → V(θ) is twice continuously differentiable.
Conditions (R5) and (V5) are needed to establish that Λ, as defined in (8.2), is continuously differentiable. Clearly, condition (V5) implies the former conditions (V4) and (V1). Suppose that P is such that y | X has an elliptically contoured density f_{µ,Σ} from (3.2), with µ ∈ R^k and Σ ∈ PDS(k). When the S-functional is affine equivariant, it suffices to determine the influence function for the case (µ, Σ) = (0, I_k). However, this does not hold in general for the S-functionals in our setting. The reason is that, for a k × k non-singular matrix A and θ ∈ R^l, the matrix AV(θ)A^T may not be of the form V(θ′), for some θ′ ∈ R^l. Examples are the (linear) covariance structure that corresponds to the linear mixed effects model (2.4) considered in [5] or the models discussed in Example 3.
Nevertheless, note that for the general case with µ ∈ R k and Σ ∈ PDS(k), we can still use the fact that, conditionally on X, the distribution of y is the same as that of Σ 1/2 z + µ, where z has a spherical density f 0,I k . As a consequence, we can still obtain the following result, which enables one to determine the influence functions of the functionals β(P ) and θ(P ) separately.
If P itself is also absolutely continuous, then it satisfies (C3), as well as (C1_{ε′}) and (C2_ε), for any 0 < ε < ε′ ≤ 1 − r. When ρ and V satisfy (R1)-(R3) and (V1)-(V3), it follows from Theorem 1 and Corollary 2 that ξ(P) and ξ(P_{h,s}) exist, for h sufficiently small. If h in (3.2) is non-increasing and not constant on [0, c_0^2], then ξ(P) is unique, according to Theorem 3, so that ξ(P_{h,s}) → ξ(P), as h ↓ 0. Hence, in order to apply Theorem 5, it remains to show that Λ in (8.2) is continuously differentiable with a non-singular derivative at ξ(P). As a first step we obtain that the derivative of Λ is a block matrix.

where L = ∂vec(V(θ(P)))/∂θ^T is the k^2 × l matrix given in (7.9). The proof is tedious, but straightforward, and can be found in [16].
Remark 8.1. The proof of Lemma 2 uses a fact that holds for all ξ in a neighborhood of ξ(P); this holds for general P and any covariance structure V(θ) that satisfies (V2)-(V3) and (V5), see Lemma B.3 in [16]. Furthermore, Lemma 2 is obtained for a linear covariance structure. However, with some additional technicalities, this result can also be shown to hold for Ψ defined in (7.5) corresponding to general covariance structures. For general covariance structures one still obtains (8.3), with a corresponding expression for j, s = 1, . . . , l, in which α_1 = γ_1/k.
The next corollary gives expressions for the influence functions of the functionals β(P) and θ(P) separately, at a distribution P that is such that y | X has an elliptically contoured density. The proof is tedious, but straightforward, and can be found in [16].
Corollary 5. Suppose that E||X||^2 < ∞ and suppose that ρ satisfies (R2)-(R5) and that V satisfies (V5), and has a linear structure (7.6). Let α, γ_1, and γ_2 be defined in (8.4) and (8.5), and suppose that E_{0,I_k}[ρ′(||z||)] > 0. If X has full rank with probability one, then IF(s_0; β, P) is given by an explicit expression involving u(s) = ρ′(s)/s. If γ_1 > 0 and the k^2 × l matrix L, as defined in (7.9), has full rank, then IF(s_0; θ, P) is given by a corresponding explicit expression. Note that since Lθ(P) = vec(V(θ(P))) = vec(Σ), we can immediately obtain the influence function for the covariance functional C(P) = V(θ(P)). From Theorem 5 it immediately follows that IF(s_0; vec(C), P) is given by L · IF(s_0; θ, P). Since the functions u(s)s = ρ′(s), u(s)s^2 = ρ′(s)s, and ρ(s) are bounded, it follows that IF(s; θ, P) and IF(s; vec(C), P) are bounded uniformly in both y and X, whereas IF(s; β, P) is bounded uniformly in y, but not in X. This illustrates the phenomenon in linear regression that leverage points can have a high effect on the regression S-estimator. For the S-estimators in the linear mixed effects model (2.4) with normal errors considered in [5], the influence function is not available in the literature. The expression can now be obtained from Corollary 5. The expression for IF(s; β, P) in Corollary 5 coincides with the one found for the multivariate regression S-functional in [30], where α > 0 is the same constant as the one in the expression of the influence function for the location S-functional in [17]. Furthermore, for the multivariate regression model, one has θ = vech(C) and the matrix L is equal to the duplication matrix D_k. From the properties of D_k, the expressions for the influence functions simplify; one then finds expressions for the influence function of vech(C) and of the covariance functional C(P) = V(θ(P)) itself that coincide with the expressions found for the covariance S-functionals in [30] and in [17].

Asymptotic normality
To establish asymptotic normality of the S-estimators, we use the score equations obtained from differentiation of the Lagrangian corresponding to the minimization problem (3.1). In the same way as before, we obtain score equation (7.4), with P equal to the empirical measure P_n corresponding to observations s_1, . . . , s_n, with s_i = (y_i, X_i). We see that any solution ξ_n = ξ(P_n) to the S-minimization problem (3.1) must satisfy
(1/n) Σ_{i=1}^n Ψ(s_i, ξ_n) = 0.  (9.1)

General case
Writing ξ_P = ξ(P), we decompose (9.1) as follows:
0 = ∫ Ψ(s, ξ_n) dP(s) + ∫ Ψ(s, ξ_P) d(P_n − P)(s) + ∫ (Ψ(s, ξ_n) − Ψ(s, ξ_P)) d(P_n − P)(s).  (9.2)
The essential step in establishing asymptotic normality of ξ_n is to show that the third term on the right hand side of (9.2) is of the order o_P(n^{-1/2}). To this end we will apply results from empirical process theory as developed in Pollard [23]. This leads to the following theorem.
Theorem 6. Suppose that ρ satisfies (R1)-(R2) and (R4), such that u(s) is of bounded variation, and suppose that V satisfies (V4). Let ξ_n and ξ(P) be solutions to minimization problems (3.1) and (3.5), and suppose that ξ_n → ξ(P) in probability. Suppose that Λ, as defined in (8.2) with Ψ defined in (7.5), is continuously differentiable with a non-singular derivative D(P) at ξ(P) and suppose that E||X||^2 < ∞. Then √n(ξ_n − ξ(P)) is asymptotically normal with mean zero and covariance matrix
D(P)^{-1} E[Ψ(s, ξ(P)) Ψ(s, ξ(P))^T] (D(P)^{-1})^T.
Theorem 6 is similar to Theorem 4.1 in [17]. Note that Theorem 6 confirms the well-known heuristic that relates the limiting covariance of √n(ξ_n − ξ(P)) to the influence function of the functional ξ(·) given in Theorem 5, that is,
lim_{n→∞} Cov(√n(ξ_n − ξ(P))) = E[IF(s; ξ, P) IF(s; ξ, P)^T].  (9.3)
Van Aelst and Willems [30] consider the limiting behavior of S-estimators in the multivariate regression model of Example 2, but only under P for which y | X has an elliptically contoured density. Copt and Victoria-Feser [5] consider asymptotic normality for S-estimators in the linear mixed effects model (2.4) with a constant design matrix X_i = X and only consider P for which y | X has a multivariate normal distribution.

Elliptically contoured densities
Consider the special case that P is such that y | X has an elliptically contoured density f µ,Σ from (3.2), with µ ∈ R k and Σ ∈ PDS(k). As before, in determining the limiting normal distribution of the individual S-estimators, we cannot use affine equivariance and restrict ourselves to the case (0, I k ). Instead, we use some of the results obtained in Section 8.2 to establish the limiting normal distributions of the S-estimators β n = β(P n ), θ n = θ(P n ), and C n = V(θ(P n )).
It follows that the limiting covariance of √n(vec(V(θ_n)) − vec(Σ)) is given by L times the limiting covariance of √n(θ_n − θ(P)) times L^T. Corollary 6 is a direct consequence of Theorem 6. Its proof, in particular the derivations of the expressions for the limiting covariances, can be found in [16]. Note that the constants E_{0,I_k}[ρ′(||z||)^2]/(kα^2), σ_1 and σ_2 are the same as the ones found in [17] for the location and covariance S-estimators, respectively. In fact, Corollary 6 is an extension of Corollary 5.1 in [17] for S-estimators in the multivariate location-scale model of Example 4.
Asymptotic normality of S-estimators in the multivariate regression model of Example 2 follows from Corollary 6. These estimators have been considered in [30], but asymptotic normality has not been established there; asymptotic relative efficiencies are computed in [30] under the assumption that the heuristic (9.3) holds. Indeed, now that Corollary 6 has been established, one may check that (9.3) holds.
Finally, note that the limiting covariances of √ n(β n − β(P )) and √ n(θ n − θ(P )) in Corollary 6 differ from the ones found in [5] for the linear mixed effects model (2.4) with X i = X, for i = 1, . . . , n. The results in [5] are obtained by re-parameterizing Xβ = µ and interpreting the model as a multivariate location-scale model. Then building on the results in [17] for S-estimators of multivariate location-scale, the limiting covariances in [5] are found by application of the delta method. However, in view of Remark 3.1 this does not seem to be a correct approach.
Remark 9.1. Although our expressions for the limiting covariances in Corollary 6 differ from the ones found in Proposition 1 in [5], somewhat surprisingly, they yield the same matrices for the example discussed in Section 5.1 in [5]. However, this is a consequence of the specific structure of the design matrices X and Z in this example. One can easily find other design matrices for which the limiting covariances in Corollary 6 yield matrices that differ from the ones found in [5]. Moreover, the corresponding confidence regions based on the expressions in Corollary 6 can be substantially smaller than the ones based on the expressions found in [5]. See the simulation in Section 10.

Simulation and data example
We compare the asymptotic results of the S-estimators with their finite sample behavior by means of a simulation. Moreover, we investigate the differences between the expressions found in Corollary 6 and the ones in Copt and Victoria-Feser [5]. To this end we will study the behavior of the estimators for samples generated from a model that is close to the one in [5]: a linear mixed effects model with y_i in dimension k = 4 and all subjects with the same design matrix X for the fixed effects β = (β_1, β_2)^T. Following the setup in [5], the matrix X is built as follows. The first column of X is taken to be a vector 1 consisting of ones of length four. The four x-values in the second column are generated from a standard normal, and then X is rescaled to a new matrix X = [1 x], such that X^T X = 4I_2. For our simulation we used one such matrix X, generated once and then kept fixed. The random effects γ_i are independent N(0, σ_γ^2) distributed random variables, which are independent from the measurement errors ε_i ∼ N(0, σ^2 R), so that the observations follow
y_i = Xβ + Zγ_i + ε_i,  i = 1, . . . , n.  (10.1)
This leads to a structured covariance Σ = σ_γ^2 ZZ^T + σ^2 R, with covariance parameter vector θ = (θ_1, θ_2)^T, where θ_1 = σ_γ^2 and θ_2 = σ^2. Following the setup in [5], we set β_1 = β_2 = 1 and θ_1 = θ_2 = 1.
In [5], the authors took Z = (1, 1, 1, 1)^T and R = I_4. With these choices the expression found in [5] for the limiting covariance matrix of √n(β_n − β) (see (14) in [5]) is equal to our expression found in Corollary 6, and similarly for the limiting covariance matrix of √n(θ_n − θ). However, this is just the consequence of the extremely simple choices for X, Z and R. Already, if we keep X as it is, and only take a slight variation of either Z or R, one finds severe differences between (10.2) and (10.3), and similarly for the expression of the limiting covariance matrix of √n(θ_n − θ). We considered the following two alternatives:
1. take Z = (1, 2, 3, 4)^T and leave X and R = I_4 as they are (a data-generating sketch for this alternative follows this list);
2. take R = diag(1, 4, 9, 16) and leave X and Z = (1, 1, 1, 1)^T as they are.
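The following minimal numpy sketch (the seed, the concrete X and the helper names are our own illustrative choices; the fixed X used in the paper is not reproduced here) generates data from model (10.1) under alternative 1.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 100
beta = np.array([1.0, 1.0])
theta = np.array([1.0, 1.0])                             # (sigma_gamma^2, sigma^2)

x = rng.standard_normal(k)
x = 2 * (x - x.mean()) / np.linalg.norm(x - x.mean())    # rescale so that X^T X = 4 I_2
X = np.column_stack([np.ones(k), x])
Z = np.array([1.0, 2.0, 3.0, 4.0])                       # alternative 1
R = np.eye(k)

gamma = np.sqrt(theta[0]) * rng.standard_normal(n)       # random effects N(0, sigma_gamma^2)
eps = rng.multivariate_normal(np.zeros(k), theta[1] * R, size=n)
Y = X @ beta + np.outer(gamma, Z) + eps                  # n x k matrix of responses from (10.1)

Sigma = theta[0] * np.outer(Z, Z) + theta[1] * R         # structured covariance of each y_i
print(Sigma)                                             # the target V(theta) of the S-estimator
```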
We generated 10 000 samples of size n = 100 according to model (10.1) and computed the value of S-estimators β_n and θ_n by means of Tukey's bi-weight
ρ(s) = s^2/2 − s^4/(2c^2) + s^6/(6c^4), for |s| ≤ c, and ρ(s) = c^2/6, for |s| > c.  (10.4)
Clearly, the histograms of the repeated estimates for β_1 and β_2 match the graphs of the (marginal) normal densities with the variances given by Var_LGRG(β_n), and the scatterplot matches with the 95% contour line corresponding to Var_LGRG(β_n). Note that the differences with Var_CVF(β_n) are quite severe. For example, this yields that the length of the confidence interval for β_1 based on Var_CVF(β_n) will be two times larger than the one based on Var_LGRG(β_n). The second row in Figure 1 displays the limiting distributions of √n(θ_n − θ), where we generated the samples with alternative 2. In [5], the limiting covariance matrix was given by (see (15) in [5]) (L^T L)^{-1} L^T V_Σ L (L^T L)^{-1}, where V_Σ = σ_1(I_{k^2} + K_{k,k})(Σ ⊗ Σ) + σ_2 vec(Σ)vec(Σ)^T, see Corollary 5.1 in [17]. This differs from the expression given by our Corollary 6. For the choices of X, Z and R in [5], both expressions are equal. However, for the alternative choice of R made in alternative 2, one finds that the two expressions differ, and again the differences are quite large. For example, as a consequence the length of the confidence interval for θ_1 based on Var_CVF(θ_n) will be 1.5 times larger than the one based on Var_LGRG(θ_n).
Finally, we illustrate the performance of S-estimators by an application to data from a trial on the treatment of lead-exposed children. This dataset is discussed in [10] and consists of four repeated measurements of blood lead levels obtained at baseline (or week 0), week 1, week 4, and week 6 on 100 children who were randomly assigned to chelation treatment with succimer (a chelation agent) or placebo. On the basis of a graphical display of the mean response over time, it is suggested in [10] that a quadratic trend over time seems suitable. We fitted the following model
y_ij = β_1 + β_2 t_j + β_3 t_j^2 + β_4 δ_i + β_5 δ_i t_j + β_6 δ_i t_j^2 + γ_1i + γ_2i t_j + γ_3i t_j^2 + ε_ij,
for i = 1, . . . , 100 and j = 1, . . . , 4, where (t_1, . . . , t_4) = (0, 1, 4, 6) refer to the different weeks, y_ij is the blood lead level (mcg/dL) of subject i obtained at time t_j, and δ_i = 0 if the i-th subject is in the placebo group and δ_i = 1, otherwise. The random effects γ_i = (γ_1i, γ_2i, γ_3i), i = 1, . . . , 100, are assumed to be independent mean zero normal random vectors with a diagonal covariance matrix consisting of variances σ_γ1^2, σ_γ2^2 and σ_γ3^2, respectively. The measurement errors ε_i = (ε_i1, . . . , ε_i4), i = 1, . . . , 100, are assumed to be independent mean zero random vectors with covariance matrix σ^2 I_4, also being independent of the random effects. In this way we are fitting a balanced linear mixed effects model with unknown parameters β = (β_1, . . . , β_6) and θ = (σ_γ1^2, σ_γ2^2, σ_γ3^2, σ^2), and a linear covariance structure.
We estimated (β, θ) by means of maximum likelihood and by means of the S-estimator corresponding to Tukey's bi-weight defined in (10.4). The tuning constant was chosen to be c = 4.097, which corresponds to asymptotic breakdown point 0.5. For each estimate (β̂, θ̂), we determined the estimate V(θ̂) for the structured covariance and the standardized residuals
d_i = ( (y_i − X_i β̂)^T V(θ̂)^{-1} (y_i − X_i β̂) )^{1/2}
for each subject. The residuals for both estimation procedures are visible in the left picture of Figure 2, with the residuals determined from the S-estimate on the horizontal axis and the ones determined from the ML estimate on the vertical axis. Both estimates identify subject 40 as an outlier, but only the robust S-estimate also clearly identifies observation 98 as an outlier. The extremely large observation in week 6 seems to be the reason that observation 40 is identified as an outlier by both methods. See the right picture in Figure 2. Observation 98 also seems to deviate from the overall quadratic trend, by having a suspiciously low observation in week 6. The corresponding S-residual clearly sticks out from the other S-residuals, whereas this is much less so for the corresponding ML residual.
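A minimal numpy sketch of this residual computation (the function name is ours; the actual ML or S fit that produces β̂ and V(θ̂) is not shown):

```python
import numpy as np

def standardized_residuals(ys, Xs, beta_hat, V_hat):
    # d_i = sqrt((y_i - X_i beta)^T V(theta)^{-1} (y_i - X_i beta)), one value per subject
    Vinv = np.linalg.inv(V_hat)
    d = []
    for y, X in zip(ys, Xs):
        r = y - X @ beta_hat
        d.append(np.sqrt(r @ Vinv @ r))
    return np.array(d)
```

Plotting the S-based values against the ML-based values reproduces the kind of comparison shown in the left picture of Figure 2.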

A Proofs and technical lemmas
Proof of Theorem 1 Proof. Let (β, θ) ∈ R^q × R^l satisfy the S-constraint in (3.5). Then from (R1)-(R2) it follows that P(C(β, θ, c_0)) ≥ 1 − r. (A.1) Since 1 − r ≥ ε, Lemma 1(i) then implies that λ_k(V(θ)) ≥ a_1 > 0. Because lim_{m→∞} ∫ ρ(||y||/m) dP(y, X) = 0, we can find m_0 > 0, such that ∫ ρ(||y||/m_0) dP(y, X) ≤ b_0. Lemma 1(ii) then yields that λ_1(V(θ)) ≤ a_2 < ∞. Application of Lemma 1(iii), with a = 1 − r together with (C1_ε), implies that ||β|| ≤ M < ∞. It follows that β is in a compact subset of R^q and V(θ) is in a compact set K ⊂ R^{k×k}. According to (2.9), the mapping θ → V(θ) is one-to-one, so that we can restrict θ to the pre-image V^{-1}(K). Then with conditions (V1) and (V3) it follows that also V^{-1}(K) is compact in R^l. We conclude that for solving minimization problem (3.5), we can restrict ourselves to a compact set K′ ⊂ R^q × R^l. As det(V(θ)) is a continuous function of (β, θ), due to condition (V1), it must attain a minimum on K′.

Proof of Corollary 1
Proof. Let P_n be the empirical measure corresponding to the collection S_n. Then P_n satisfies (C1_ε) for any 0 < ε ≤ 1 − r and satisfies (C2_ε) for ε = (κ(S_n) + 1)/n. Clearly 0 < ε ≤ 1 − r, where r = b_0/a_0, so according to Theorem 1 there exists at least one solution to (3.5) with P = P_n. This means that there exists at least one solution to (3.1).
It follows that, for t sufficiently large, P_t satisfies condition (C2_{ε+η}). Next, consider the compact set K from (C1_{ε′}). Without loss of generality we may assume that it belongs to C. Therefore, as P(K) ≥ r + ε′, for t sufficiently large P_t(K) ≥ r + ε + η. It follows that, for t sufficiently large, P_t satisfies condition (C1_{ε+η}). Since ε + η < 1 − r, according to Theorem 1 at least one solution ξ(P_t) exists, for t sufficiently large.

Proof of Corollary 3
Proof. We apply Theorem 2 to the sequence P n , n = 1, 2, . . ., of probability measures, where P n is the empirical measure corresponding to (y 1 , X 1 ), . . . , (y n , X n ). According to the Portmanteau Theorem (e.g., see Theorem 2.1 in [2]), P n converges weakly to P , with probability one. The corollary then follows from Theorem 2.
We conclude that the limit on the left hand side exists and equals the expression given in the theorem. Next, consider the covariance functional C(P) = V(θ(P)). By definition IF(s_0; vec(C), P) = lim_{h↓0} (vec(C(P_{h,s_0})) − vec(C(P)))/h.
Dividing by h and letting h ↓ 0 finishes the proof.

Proof of Theorem 6
Proof. Write ξ_P = ξ(P). Then from (9.2) and Lemma B.8 in [16], it follows that the third term on the right hand side of (9.2) is of the order o_P(n^{-1/2}). From Theorem 2, we know that ξ_n → ξ_P with probability one. For the first term on the right hand side we use that Λ(ξ_P) = 0, due to the fact that ξ_P is the solution of (3.5), see also (7.4). From Lemma B.2 in [16] we have that ||Ψ_β|| ≤ C_1||X|| and ||Ψ_θ|| ≤ C_2, for universal constants 0 < C_1, C_2 < ∞. Hence, from the conditions of the theorem, it follows that E||Ψ(s, ξ(P))||^2 < ∞. This means that the second term on the right hand side of (A.5) is of the order O_P(n^{-1/2}), according to the central limit theorem. It follows that ξ_n − ξ_P = O_P(n^{-1/2}). If we insert this in (A.5), we obtain an expansion of ξ_n − ξ_P in terms of (1/n) Σ_{i=1}^n (Ψ(s_i, ξ_P) − EΨ(s, ξ_P)) + o_P(n^{-1/2}), from which it follows that
√n(ξ_n − ξ_P) = −D(P)^{-1} n^{-1/2} Σ_{i=1}^n (Ψ(s_i, ξ_P) − EΨ(s, ξ_P)) + o_P(1).
After application of the central limit theorem, we conclude that √n(ξ_n − ξ_P) is asymptotically normal with mean zero and covariance matrix D(P)^{-1} E[Ψ(s, ξ_P) Ψ(s, ξ_P)^T] (D(P)^{-1})^T, where we use that EΨ(s, ξ_P) = Λ(ξ_P) = 0. This proves the theorem.

B Supplemental Material
For later use we first define two important matrix norms and mention some useful properties. For m × n real-valued matrices A = (a_ij), we define the Euclidean norm or Frobenius norm as ||A|| = (Σ_{i,j} a_ij^2)^{1/2}, and the spectral norm by ||A||_2 = sup_{||u||=1} ||Au||.
Recall that for real-valued A, the largest eigenvalue satisfies λ_1(A^T A) = sup_{||u||=1} u^T A^T A u. This means that ||A||_2^2 = λ_1(A^T A). Other useful properties are ||A||_2 ≤ ||A|| and ||uv^T|| = ||u|| ||v||, for u ∈ R^m and v ∈ R^n. When A is symmetric, then λ_1(A^T A) = λ_1(A^2) = λ_1(A)^2. In that case ||A||_2 = λ_1(A). Finally, note that both matrix norms are submultiplicative, that is, ||AB|| ≤ ||A|| ||B|| and ||AB||_2 ≤ ||A||_2 ||B||_2.
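A small numpy check of these norm identities (the example matrix and the seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))
A = (A + A.T) / 2                               # symmetric example matrix

fro = np.linalg.norm(A, "fro")                  # Euclidean / Frobenius norm
spec = np.linalg.norm(A, 2)                     # spectral norm

print(np.isclose(spec**2, np.linalg.eigvalsh(A.T @ A).max()))   # ||A||_2^2 = lambda_1(A^T A)
print(np.isclose(spec, np.abs(np.linalg.eigvalsh(A)).max()))    # symmetric case: largest |eigenvalue|
print(bool(spec <= fro))                                        # ||A||_2 <= ||A||
```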
Proof. Let g_t(s) = g(s, ξ_t) and g_L(s) = g(s, ξ_L). Then for every sequence {s_t}, such that s_t → s, which is bounded and uniformly continuous. Then as a consequence of P_t → P weakly, we have lim_{t→∞} ∫ g(s, ξ_t) dP_t(s) = lim_{t→∞} ∫ Γ(g_t(s)) dP_t(s) = ∫ Γ(g_L(s)) dP(s) = ∫ g(s, ξ_L) dP(s).

B.3 Proofs for Section 6
Proof of Theorem 4 Proof. Without loss of generality we may assume that c_0 = 1 and sup ρ = 1, so that r = b_0. The first step is to show that for both estimators ε_n^* ≥ ⌈nr⌉/n. To this end, consider a collection S_m obtained from the original collection S_n by replacing at most m = ⌈nr⌉ − 1 points of S_n by arbitrary points in R^k × X. We must show that at least one solution (β_n(S_m), θ_n(S_m)) to the S-minimization problem (3.1) exists for the corrupted collection S_m, and that all possible solutions do not break down.
Denote a possible solution to the S-minimization problem (3.1) corresponding to the corrupted collection S_m by ξ_m = (β_m, θ_m) = (β_n(S_m), θ_n(S_m)), and consider the corresponding cylinder C(β_m, θ_m, 1) defined by (3.3). We apply Lemma 1 to the empirical measure P_m corresponding to the corrupted collection S_m of n points. Because ξ_m = (β_m, θ_m) must satisfy the S-constraint in (3.1), one can argue as in (A.1), where d(s_i, ξ_m) is defined in (5.1). It follows that the cylinder C(β_m, θ_m, 1) must contain at least n − nb_0 = n − nr points from the corrupted collection S_m. Furthermore, since r ≤ (n − κ(S_n))/(2n), any such subset of S_m contains at least κ(S_n) + 1 points of the original collection S_n. It follows that the measure P_m satisfies condition (C2_ε), for ε = (κ(S_n) + 1)/n and with the value δ > 0 only depending on the original collection S_n. Moreover, we also have that P_m(C(β_m, θ_m, 1)) ≥ ε, for ε = (κ(S_n) + 1)/n > 0. According to Lemma 1(i), it then follows that λ_k(V(θ_m)) ≥ a_1 > 0, where a_1 only depends on the original collection S_n. Because nb_0 − m = nr − ⌈nr⌉ + 1 > 0 and lim_{R→∞} Σ_{(y_i,X_i)∈S_n} ρ(||y_i||/R) = 0, we can find an R_0 > 0, only depending on the original collection S_n, such that Σ_{(y_i,X_i)∈S_n} ρ(||y_i||/R_0) ≤ nb_0 − m. The collection S_m contains n − m points of the original collection S_n. Consider the smallest M > 0, such that Σ_{(y_i,X_i)∈S_m∩S_n} ρ(||y_i||/M) ≤ nb_0 − m. Because S_m ∩ S_n has fewer points than S_n, it holds that M ≤ R_0. It follows that the condition of Lemma 1(ii) is satisfied, so that λ_1(V(θ_m)) ≤ a_2 < ∞, where a_2 only depends on a_1 and the collection S_n. To show that the estimate β_m stays bounded, recall that the cylinder C(β_m, θ_m, 1) contains a subset J_0 of κ(S_n) + 1 points from the original collection S_n, according to (B.10). By definition, κ(S_n) + 1 original points cannot be on the same hyperplane, so that the corresponding infimum γ_n is strictly positive, where the first infimum runs over all subsets J ⊂ S_n of κ(S_n) + 1 points. By definition of γ_n, there exists an original point s_0 ∈ J_0 ⊂ S_n ∩ C(β_m, θ_m, 1), such that the corresponding inequality holds. Because s_0 ∈ C(β_m, θ_m, 1), similar to the proof of Lemma 1(iii), it follows that ||y_0 − X_0 β_m||^2 ≤ a_2, and because s_0 ∈ S_n, we have that ||y_0|| and ||X_0|| are bounded by a constant only depending on S_n. We conclude that there exists a compact set K_n, only depending on the original collection S_n, that contains the pair (β_m, V(θ_m)). Similar to the reasoning in the proof of Theorem 1, it follows that at least one solution (β_n(S_m), θ_n(S_m)) to the S-minimization problem (3.1) exists for the collection S_m, and that all possible solutions β_n(S_m) and θ_n(S_m) do not break down.
We continue by showing $\epsilon_n^*(\theta_n) \leq \lceil nr\rceil/n$. Replace $m = \lceil nr\rceil$ points of $S_n$ to obtain a corrupted collection $S_m$ of $n$ points. Suppose that a solution $\xi_m = (\beta_m, \theta_m) = (\beta_n(S_m), \theta_n(S_m))$ to the S-minimization problem (3.1) exists for the corrupted collection $S_m$. We must show that the estimate $\theta_m$ breaks down. Note that the estimates $\beta_m$ and $\theta_m$ satisfy the S-constraint in (3.1) for the corrupted collection. From this we conclude that at least one replaced point must lie in the cylinder, since otherwise we obtain a contradiction with (B.11). Therefore, the cylinder $C(\beta_m, \theta_m, 1)$ must contain a point $s_0 = (y_0, X_0)$ from the original collection $S_n$ as well as one replaced point $s_i = (y_i, X_i)$.
Note that for each point $s = (y, X) \in C(\beta_m, \theta_m, 1)$ the bound (B.12) holds, where $\lambda_j(V(\theta_m)) > 0$, for $j = 1, \ldots, k$, are the eigenvalues of $V(\theta_m)$ and $q_1, \ldots, q_k$ are the corresponding orthonormal eigenvectors. Now, replace all $m$ points by $s_i = (y_i, X_i) = (z, 0)$, where $0$ is a $k \times q$ matrix of zeros and $z = t\sum_{j=1}^k q_j$, so that $q_j^Tz = t$, for each $j = 1, \ldots, k$. By sending $t \to \infty$ and using the fact that at least one replaced point $(z, 0)$ satisfies (B.12), it follows that $\lambda_j(V(\theta_m)) \to \infty$, for each $j = 1, \ldots, k$. This means that $\theta_m$ breaks down.
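As a numerical aside (not part of the proof), the sketch below illustrates why this contamination forces breakdown. It assumes that membership of a replaced point $(z, 0)$ in the cylinder $C(\beta_m, \theta_m, 1)$ amounts to the bound $z^TV(\theta_m)^{-1}z \leq 1$, with $V$ an arbitrary positive definite matrix standing in for $V(\theta_m)$: for $z = t\sum_j q_j$ one gets $z^TV^{-1}z = t^2\sum_j 1/\lambda_j$, so the bound can only hold when every eigenvalue is at least $t^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3

# An arbitrary positive definite V with eigen-decomposition V = sum_j lam_j q_j q_j^T.
M = rng.standard_normal((k, k))
V = M @ M.T + np.eye(k)
lam, Q = np.linalg.eigh(V)

for t in (1.0, 10.0, 100.0):
    z = t * Q.sum(axis=1)              # z = t * (q_1 + ... + q_k), so q_j^T z = t for every j
    d2 = z @ np.linalg.solve(V, z)     # squared distance of the replaced point (z, 0)
    assert np.isclose(d2, t**2 * np.sum(1.0 / lam))
    # Keeping d2 <= 1 means sum_j 1/lam_j <= 1/t^2, hence lam_j >= t^2 for every j:
    # all eigenvalues of V must explode as t grows.
    print(t, d2, lam.min())
```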
The upper bound for $\epsilon_n^*(\beta_n, S_n)$ follows from the fact that $\beta_n$ is regression equivariant. Similar to Theorem 2 in [19], it can be shown that the maximal breakdown point of regression equivariant estimators is $\lfloor (n+1)/2\rfloor/n$. This proves the theorem.

Proof of Corollary 4
Proof. Take s = (y, X) fixed and consider IF(s; ξ, P ). Since D(P ) does not depend on s, from Theorem 5 and Lemma B.2, it follows immediately that IF(s; ξ, P ) remains bounded in y, but not necessarily in X.
Proof. Write $\partial\Lambda/\partial\xi$ as the block matrix in (B.19). We prove the lemma for each block separately. Consider $\partial\Lambda_\beta/\partial\beta$. We have the expression (B.20), where $d = d(s, \xi)$ is defined by (5.1) and where we abbreviate $V(\theta)$ by $V$. First note that, according to (R5), $\xi \mapsto u'(d(s, \xi))/d(s, \xi)$ is continuous at $\xi(P)$, for each fixed $s$ such that $d(s, \xi(P)) \neq 0$. Together with (V1), this means that for such fixed $s$, $\xi \mapsto \partial\Psi_\beta(s, \xi)/\partial\beta$ is continuous at $\xi(P)$. For the first term on the right hand side of (B.20), we apply (B.4) and (B.13); similarly, for the second term on the right hand side of (B.20), we apply (B.7) and (B.5). Since $\lambda_1(V^{-1})$ is bounded uniformly on the neighborhood $N$ of $\xi(P)$ and because $u(s)$ and $u'(s)s = \rho''(s) - u(s)$ are bounded, due to (R2), it follows that there exists a constant $0 < C_1 < \infty$, only depending on $P$, such that the norm of $\partial\Psi_\beta(s, \xi)/\partial\beta$ is bounded by $C_1\|X\|^2$, uniformly for $\xi \in N$. Since $\mathrm{E}\|X\|^2 < \infty$, it follows by dominated convergence that for $\xi$ in the neighborhood $N$ of $\xi(P)$,
$$\frac{\partial\Lambda_\beta(\xi)}{\partial\beta} = \int \frac{\partial\Psi_\beta(s, \xi)}{\partial\beta}\,dP(s),$$
and that $\partial\Lambda_\beta/\partial\beta$ is continuous at $\xi(P)$. Next consider $\partial\Psi_\beta/\partial\theta$. For each $j = 1, \ldots, l$ fixed, we have the expression (B.22).
First note that, similar to (B.20), $\xi \mapsto \partial\Psi_\beta(s, \xi)/\partial\theta_j$ is continuous at $\xi(P)$, for each fixed $s$ such that $d(s, \xi(P)) \neq 0$. Consider the first term on the right hand side of (B.22). From (B.13) we obtain a bound for this term, and, similar to the reasoning in (B.15), the remaining factor can be bounded as well. For the second term on the right hand side of (B.22), we argue similar to the reasoning in (B.13). According to (V4), the mapping $V(\theta)$ is continuously differentiable. This means that $\partial V/\partial\theta_j$ is bounded on the neighborhood $N$ of $\xi(P)$. Since $\lambda_1(V^{-1})$ is bounded uniformly on $N$ and because $u(s)s = \rho'(s)$ and $u'(s)s^2 = \rho''(s)s - \rho'(s)$ are bounded, it follows that there exists a constant $0 < C_2 < \infty$, only depending on $P$, such that, for $j = 1, \ldots, l$, the norm of $\partial\Psi_\beta(s, \xi)/\partial\theta_j$ is bounded by $C_2\|X\|$, uniformly for $\xi \in N$. As before, it follows by dominated convergence that for $\xi$ in the neighborhood $N$ of $\xi(P)$,
$$\frac{\partial\Lambda_\beta(\xi)}{\partial\theta} = \int \frac{\partial\Psi_\beta(s, \xi)}{\partial\theta}\,dP(s),$$
and that $\partial\Lambda_\beta/\partial\theta$ is continuous at $\xi(P)$. Next consider $\partial\Psi_{\theta,j}/\partial\beta$, for $j = 1, \ldots, l$. We have the expression (B.25), where $H_j$ is defined in (7.3). As before, $\xi \mapsto \partial\Psi_{\theta,j}(s, \xi)/\partial\beta$ is continuous at $\xi(P)$, for each fixed $s$ such that $d(s, \xi(P)) \neq 0$. Consider the first term on the right hand side of (B.25). From (B.18) and (B.13), and because $u'(s)s^2$ is bounded, the norm of the first term on the right hand side of (B.25) is bounded by a constant times $\|X\|\,\lambda_1(V^{-1})^{1/2}$. Similar to (B.13), we bound the second term on the right hand side of (B.25), and for the third term on the right hand side of (B.25) we can use (B.13) and (B.17). As before, since $u'(s)s = \rho''(s) - u(s)$ and $u(s)s^2 = \rho'(s)s$ are bounded, it follows that there exists a constant $0 < C_3 < \infty$, only depending on $P$, such that, for $j = 1, \ldots, l$, the norm of $\partial\Psi_{\theta,j}(s, \xi)/\partial\beta$ is bounded by $C_3\|X\|$, uniformly for $\xi \in N$. Since $\mathrm{E}\|X\| < \infty$, it follows by dominated convergence that for $\xi$ in the neighborhood $N$ of $\xi(P)$,
$$\frac{\partial\Lambda_\theta(\xi)}{\partial\beta} = \int \frac{\partial\Psi_\theta(s, \xi)}{\partial\beta}\,dP(s), \qquad (B.27)$$
and that $\partial\Lambda_\theta/\partial\beta$ is continuous at $\xi(P)$. Finally, consider $\partial\Psi_{\theta,j}/\partial\theta_t$, for $j, t = 1, \ldots, l$. We find the expression (B.28). As before, together with (V5), $\xi \mapsto \partial\Psi_{\theta,j}(s, \xi)/\partial\theta_t$ is continuous at $\xi(P)$, for each fixed $s$ such that $d(s, \xi(P)) \neq 0$. From (B.15) and (B.18), the first term on the right hand side of (B.28) can be bounded; because $u'(s)s^3 = \rho''(s)s^2 - \rho'(s)s$ is bounded, together with (B.16), we conclude that this term is bounded on the neighborhood $N$ of $\xi(P)$. Similar to (B.13), the second term on the right hand side of (B.28) can be bounded; since $u(s)s^2 = \rho'(s)s$ is bounded, together with (B.17), we again find that this term is bounded on the neighborhood $N$ of $\xi(P)$. The same holds for the fourth term on the right hand side of (B.28).
We continue with the third term on the right hand side of (B.28). Similar to (B.13), this term can be bounded in terms of $\partial H_j/\partial\theta_t$. With (B.5), (B.6), and (B.16), we find a factor that is uniformly bounded on the neighborhood $N$ of $\xi(P)$, and similar to (B.17) we obtain a bound for the remaining factor. Because, according to (V5), the mapping $\theta \mapsto V(\theta)$ is twice continuously differentiable, it follows that $\partial^2V/\partial\theta_j\partial\theta_t$ is uniformly bounded on the neighborhood $N$ of $\xi(P)$. Together with (B.16), it follows that the first term of $\partial H_j/\partial\theta_t$ is bounded on the neighborhood $N$ of $\xi(P)$. The traces in the other terms can be handled in the same way, which yields that $\partial H_j/\partial\theta_t$ is bounded on the neighborhood $N$ of $\xi(P)$. Because $u(s)s^2 = \rho'(s)s$ is bounded, it follows that the third term on the right hand side of (B.28) is bounded.
Next, consider the fifth term on the right hand side of (B.28). From (B.29), we obtain a bound which is uniformly bounded on the neighborhood $N$ of $\xi(P)$. Because $\rho(s)$ is bounded, it follows that the fifth term on the right hand side of (B.28) is bounded. For the sixth term on the right hand side of (B.28), from (B.30) we obtain a similar bound. Because $V(\theta)$ is twice continuously differentiable and $\rho(s)$ is bounded, we conclude that the sixth term on the right hand side of (B.28) is bounded. Finally, from (B.17) and (B.23), together with the fact that $u(s)s^2 = \rho'(s)s$ is bounded, it also follows that the last term on the right hand side of (B.28) is bounded. Putting everything together, it follows that there exists a constant $0 < C_4 < \infty$, only depending on $P$, such that, for $j, t = 1, \ldots, l$, the norm of $\partial\Psi_{\theta,j}(s, \xi)/\partial\theta_t$ is bounded by $C_4$, uniformly for $\xi \in N$. It follows by dominated convergence that for $\xi$ in the neighborhood $N$ of $\xi(P)$,
$$\frac{\partial\Lambda_\theta(\xi)}{\partial\theta} = \int \frac{\partial\Psi_\theta(s, \xi)}{\partial\theta}\,dP(s),$$
and that $\partial\Lambda_\theta/\partial\theta$ is continuous at $\xi(P)$. This finishes the proof.
For convenience we state the following result from [17] about spherically contoured densities, see Lemma 5.1 in [17]. This lemma uses the commutation matrix $K_{k,k}$, which is the $k^2 \times k^2$ block matrix whose $(i, j)$-block equals the $k \times k$ matrix $\Delta_{ji}$, consisting of zeros except for a 1 at entry $(j, i)$. A useful property (e.g., see [20, Section 3.7]) is that for any $k \times k$ matrix $A$, it holds that
$$K_{k,k}\,\mathrm{vec}(A) = \mathrm{vec}(A^T). \qquad (B.32)$$
Lemma B.4. Suppose that $z$ has a $k$-variate elliptically contoured density as defined in (3.2), with parameters $\mu = 0$ and $\Sigma = I_k$. Then $u = z/\|z\|$ is independent of $\|z\|$, has mean zero and covariance matrix $(1/k)I_k$. Furthermore, $\mathrm{E}_{0,I_k}\big[uu^Tu\big] = 0$ and
$$\mathrm{E}_{0,I_k}\,\mathrm{vec}(uu^T)\,\mathrm{vec}(uu^T)^T = \sigma_1\big(I_{k^2} + K_{k,k}\big) + \sigma_2\,\mathrm{vec}(I_k)\mathrm{vec}(I_k)^T,$$
where $\sigma_1 = \sigma_2 = (k(k+2))^{-1}$.
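Both the commutation-matrix identity (B.32) and the moment constants of Lemma B.4 can be checked by simulation. The sketch below builds $K_{k,k}$ directly from its block definition and uses $z$ drawn from a standard normal distribution, which is one example of a spherical density; the sample size is arbitrary and the last two checks hold only up to Monte Carlo error.

```python
import numpy as np

k = 3
rng = np.random.default_rng(2)

# Commutation matrix K_{k,k}: the (i,j) block is Delta_{ji}, a single 1 at entry (j,i).
K = np.zeros((k * k, k * k))
for i in range(k):
    for j in range(k):
        K[i * k + j, j * k + i] = 1.0

vec = lambda M: M.reshape(-1, order="F")          # column-stacking vec operator
A = rng.standard_normal((k, k))
assert np.allclose(K @ vec(A), vec(A.T))          # K vec(A) = vec(A^T)

# Monte Carlo check of Lemma B.4 with z ~ N(0, I_k), so u = z/||z|| is uniform on the sphere.
n = 200_000
z = rng.standard_normal((n, k))
u = z / np.linalg.norm(z, axis=1, keepdims=True)
print(np.abs(u.mean(axis=0)).max())                          # E u = 0 (approximately)
print(np.abs(u.T @ u / n - np.eye(k) / k).max())             # E uu^T = (1/k) I_k (approximately)

W = np.einsum("ni,nj->nij", u, u).reshape(n, k * k)          # rows are vec(uu^T)^T (uu^T symmetric, so ordering is immaterial)
lhs = W.T @ W / n                                            # estimate of E vec(uu^T) vec(uu^T)^T
s1 = s2 = 1.0 / (k * (k + 2))
rhs = s1 * (np.eye(k * k) + K) + s2 * np.outer(vec(np.eye(k)), vec(np.eye(k)))
print(np.abs(lhs - rhs).max())                               # close to 0 up to Monte Carlo error
```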

Proof of Lemma 2
Proof. Write $\partial\Lambda/\partial\xi$ as in (B.19). We determine each block separately and apply Lemma B.3. Let $\xi_P = (\beta_P, \theta_P) = (\beta(P), \theta(P))$. When we also write $V_P$ instead of $V(\theta(P))$, then according to Lemma B.3 we obtain $\partial\Lambda_\beta(\xi_P)/\partial\beta$ as an iterated expectation. The inner expectation on the right hand side is a conditional expectation given $X$, where $y \mid X$ has the same distribution as $\Sigma^{1/2}z + \mu$, with $z$ having a spherical density $f_{0,I_k}$. This allows the inner expectation on the right hand side to be rewritten accordingly. From Lemma B.4, with $u = z/\|z\|$, it follows that $\partial\Lambda_\beta(\xi_P)/\partial\beta$ coincides with the expression in (8.3).
Next, for $j = 1, \ldots, l$, consider the expression (B.33). According to Lemma B.4, the inner conditional expectation of the first term on the right hand side of (B.33) can be written out explicitly; since the second term on the right hand side of that expression is the expectation, with respect to a spherical density, of an odd function of $u$, this expectation is equal to zero due to Lemma B.4. Similarly, the second term on the right hand side of (B.33) has inner conditional expectation equal to zero, due to Lemma B.4. It follows that $\partial\Lambda_\beta(\xi_P)/\partial\theta = 0$.
Next, using Lemma B.3 and (B.25), for all $j = 1, \ldots, l$, consider the expression (B.34). According to Lemma B.4, the first term on the right hand side of (B.34) has an inner conditional expectation in which the second term is again the expectation, with respect to a spherical density, of an odd function of $u$, and is therefore equal to zero. Similarly, the inner expectations of the second and the third term on the right hand side of (B.34) follow from Lemma B.4. It follows that $\partial\Lambda_\theta(\xi_P)/\partial\beta = 0$.
Finally, to determine $\partial\Lambda_{\theta,j}(\xi_P)/\partial\theta_s$, note that when $V$ is linear, we can write $\Psi_{\theta,j}$ in terms of $\Psi_V$, where $\Psi_V$ is defined in (7.8) and has the property that $\int \Psi_V(s, \xi_P)\,dP(s) = 0$ when $y \mid X$ has an elliptically contoured density $f_{\mu,\Sigma}$ with parameters $\mu = X\beta_P$ and $\Sigma = V_P$. This means that for each $j, s = 1, \ldots, l$, we obtain the representation (B.35). The first term on the right hand side of (B.35) leads to a first term in $\partial\Lambda_{\theta,j}(\xi_P)/\partial\theta_s$, given in (B.36). The second term on the right hand side of (B.35), using Lemma B.4, leads to a second term in $\partial\Lambda_{\theta,j}(\xi_P)/\partial\theta_s$, given in (B.37). The third term on the right hand side of (B.35) leads to a third term in $\partial\Lambda_{\theta,j}(\xi_P)/\partial\theta_s$, given in (B.38). We conclude that $\partial\Lambda_{\theta,j}(\xi_P)/\partial\theta_s$ consists of the three terms given in (B.36), (B.37), and (B.38). This means that $\partial\Lambda_{\theta,j}(\xi_P)/\partial\theta_s$ has a term $\mathrm{tr}(\Sigma^{-1}L_j\Sigma^{-1}L_s)$ with coefficient $\gamma_1$ and a term $\mathrm{tr}(\Sigma^{-1}L_s)\mathrm{tr}(\Sigma^{-1}L_j)$ with coefficient $\gamma_2$, where $\gamma_1$ and $\gamma_2$ are defined in (8.5), and where we use that $u'(s)s^3 = \rho''(s)s^2 - \rho'(s)s$ and $v(s) = \rho'(s)s - \rho(s) + b_0$. Finally, from the definition of $L$ in (7.9), it follows that the $l \times l$ matrix with these entries takes the stated form. This proves the lemma.
Proof. Together with Lemma 2, first write $\partial\Lambda_\theta(\xi_P)/\partial\theta$ in terms of $E$, $\gamma_1$, and $\gamma_2$. Since $L$ has full rank, $E$ has full rank as well. This means that $(E^TE)^{-1}$ exists and satisfies the identities (B.39) and (B.40). When we multiply the first matrix from the left with a matrix of the same type, we find four terms: a term $I_k$ with coefficient $a\gamma_1$, two terms $E^T\mathrm{vec}(I_k)\mathrm{vec}(I_k)^TE(E^TE)^{-1}$ with combined coefficient $-a\gamma_2 + b\gamma_1$, and a term with coefficient $-b\gamma_2$. Consider the scalar-valued inner product in the middle of this last term; by application of (B.40) and then (B.39), it follows that the term with coefficient $-b\gamma_2$ reduces to $kE^T\mathrm{vec}(I_k)\mathrm{vec}(I_k)^TE(E^TE)^{-1}$.
Hence the matrix product in (B.41) is equal to (B.43). When we multiply the same matrix from the right, we find the same result (B.43). This matrix is equal to $I_k$ if and only if $a\gamma_1 = 1$ and $-a\gamma_2 + b\gamma_1 - kb\gamma_2 = 0$, or equivalently, $a$ and $b$ are given by (B.44), where we use that $\gamma_1 > 0$ and $\gamma_1 - k\gamma_2 = \mathrm{E}_{0,I_k}[\rho'(\|z\|)]/2 > 0$, due to (R3)-(R4). We conclude that the inverse of $\gamma_1I_k - \gamma_2E^T\mathrm{vec}(I_k)\mathrm{vec}(I_k)^TE(E^TE)^{-1}$ exists and is equal to $aI_k + bE^T\mathrm{vec}(I_k)\mathrm{vec}(I_k)^TE(E^TE)^{-1}$, with $a$ and $b$ given in (B.44). Hence, the inverse of the matrix $\partial\Lambda_\theta(\xi_P)/\partial\theta$ follows. After inserting $E = \Sigma^{-1/2}\otimes\Sigma^{-1/2}L$, this finishes the proof.
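The rank-one-update structure of this inverse is easy to verify numerically. The sketch below does so for one concrete linear covariance structure, a compound-symmetry structure $V(\theta) = \theta_1I_k + \theta_2J_k$ chosen purely for illustration, and for arbitrary constants $\gamma_1 > 0$ and $\gamma_2$ with $\gamma_1 - k\gamma_2 > 0$; the constants $a$ and $b$ are solved from $a\gamma_1 = 1$ and $-a\gamma_2 + b\gamma_1 - kb\gamma_2 = 0$ as above.

```python
import numpy as np

# Compound-symmetry structure V(theta) = theta_1 I_k + theta_2 J_k as one linear example.
k = 4
I, J = np.eye(k), np.ones((k, k))
L = np.column_stack([I.reshape(-1, order="F"), J.reshape(-1, order="F")])   # k^2 x l with l = 2
theta = np.array([2.0, 0.5])
Sigma = theta[0] * I + theta[1] * J

# E = (Sigma^{-1/2} kron Sigma^{-1/2}) L and M = E^T vec(I) vec(I)^T E (E^T E)^{-1}.
w, Q = np.linalg.eigh(Sigma)
Sig_inv_half = Q @ np.diag(w ** -0.5) @ Q.T
E = np.kron(Sig_inv_half, Sig_inv_half) @ L
vecI = I.reshape(-1, order="F")
M = np.outer(E.T @ vecI, vecI @ E) @ np.linalg.inv(E.T @ E)

# vec(I_k) = E theta lies in the column space of E, so M satisfies M^2 = k M.
assert np.allclose(E @ theta, vecI)
assert np.allclose(M @ M, k * M)

# Inverse of gamma_1 I - gamma_2 M, with a and b solved from the two relations above.
g1, g2 = 1.3, 0.2            # arbitrary constants with g1 > 0 and g1 - k*g2 > 0
a = 1.0 / g1
b = g2 / (g1 * (g1 - k * g2))
l = L.shape[1]
assert np.allclose((g1 * np.eye(l) - g2 * M) @ (a * np.eye(l) + b * M), np.eye(l))
```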

Proof of Corollary 5
Proof. Since $\rho$ is strictly increasing on $[0, c_0]$, the function $u(s) = \rho'(s)/s > 0$, for $0 < s \leq c_0$, and zero for $s > c_0$. Furthermore, since $X$ has full rank, the inverse of $\mathrm{E}\,X^T\Sigma^{-1}X$ exists. It follows that the matrix $\partial\Lambda_\beta(\xi(P))/\partial\beta$ in (8.3) is non-singular. According to Lemma B.5, also $\partial\Lambda_\theta(\xi(P))/\partial\theta$ is non-singular. Together with Lemma B.3 and Lemma 2, we conclude that $\Lambda$ is continuously differentiable with a non-singular derivative $\partial\Lambda/\partial\xi$ at $\xi(P)$, so that Theorem 5 applies. Together with Lemma 2, this yields the first expression of the corollary. From Theorem 5, together with Lemma 2, it also follows that (B.45) holds. Consider the first term on the right hand side of (B.45). From Lemma B.5, the inverse of $\partial\Lambda_\theta(\xi(P))/\partial\theta$ can be written as a sum of two terms. The first term on the right hand side is computed directly, and, with (B.39) and (B.40), the second term on the right hand side follows as well. It follows that the first term on the right hand side of (B.45) is obtained. Next consider the second term on the right hand side of (B.45). We have that the relevant expression equals
$$a\theta(P) + b\,\theta(P)\,\mathrm{vec}(I_k)^TE\,\theta(P) = a\theta(P) + b\,\theta(P)\,\mathrm{vec}(I_k)^T\mathrm{vec}(I_k) = (a + bk)\,\theta(P).$$
It follows that the second term on the right hand side of (B.45) is obtained as well. Putting things together and simplifying the constants, we conclude that $\mathrm{IF}(s_0; \theta, P)$ is given by the expression stated in the corollary. This proves the corollary.
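The behavior of $u(s) = \rho'(s)/s$ used at the start of this proof is easiest to see for a concrete bounded $\rho$-function. The sketch below uses Tukey's biweight, a standard choice satisfying conditions of this type (not necessarily the $\rho$ used elsewhere in the paper): $\rho$ increases up to $\sup\rho = c_0^2/6$ at $s = c_0$ and stays constant beyond, while $u(s) = (1 - (s/c_0)^2)^2$ is positive for $s < c_0$ and zero for $s \geq c_0$.

```python
import numpy as np

def rho_biweight(s, c0=1.0):
    """Tukey's biweight rho: strictly increasing on [0, c0], constant equal to sup rho = c0^2/6 beyond c0."""
    s = np.minimum(np.abs(s), c0)
    return s**2 / 2 - s**4 / (2 * c0**2) + s**6 / (6 * c0**4)

def u_biweight(s, c0=1.0):
    """u(s) = rho'(s)/s, which equals (1 - (s/c0)^2)^2 on [0, c0] and zero for s > c0."""
    s = np.abs(s)
    return np.where(s <= c0, (1 - (s / c0) ** 2) ** 2, 0.0)

s = np.linspace(0.0, 2.0, 9)
print(rho_biweight(s))   # increases up to c0^2/6 at s = c0, then stays constant
print(u_biweight(s))     # positive for s < c0, zero for s >= c0
```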

B.5 Proofs for Section 9
As preparation we first establish the following lemma, which is similar to Lemma 22 in [21].
Lemma B.6. Let $\rho(\cdot)$ be a real-valued function of bounded variation on $\mathbb{R}^+$. The class of all functions on $\mathbb{R}^p$ of the form $s = (y, X) \mapsto \rho(\|A(y - X\beta)\|)$, with $A$ ranging over all $k \times k$ matrices and $\beta$ ranging over $\mathbb{R}^q$, has polynomial discrimination.
Proof. Consider the class of functions $g_{A,\beta}(s, t) = \|A(y - X\beta)\|^2 - \rho^{-1}(t)^2$. According to Lemma 18 in [21], it suffices to show that the functions $g_{A,\beta}(\cdot, \cdot)$ span a finite-dimensional vector space. In order to do so, write
$$\|A(y - X\beta)\|^2 = \sum_{i=1}^k\sum_{j=1}^k (A^TA)_{ij}\Big(y_i - \sum_{r=1}^q x_{ir}\beta_r\Big)\Big(y_j - \sum_{w=1}^q x_{jw}\beta_w\Big).$$
This is a polynomial in $s = (y_1, \ldots, y_k, x_{11}, \ldots, x_{kq})$, with coefficients in $\mathbb{R}$. This means that the class of functions $g_{A,\beta}(s, t) = \|A(y - X\beta)\|^2 - \rho^{-1}(t)^2$ forms a finite-dimensional vector space.
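The finite-dimensional span argument rests on the fact that $\|A(y - X\beta)\|^2$ expands into a fixed set of monomials of degree at most two in the coordinates of $s = (y, X)$. The following sympy sketch makes this explicit for the small case $k = 2$, $q = 1$ (my own illustration; the symbol names are arbitrary).

```python
import sympy as sp

# Small case k = 2, q = 1: the coordinates of s are (y1, y2, x11, x21).
y1, y2, x11, x21 = sp.symbols("y1 y2 x11 x21")
a11, a12, a21, a22, beta = sp.symbols("a11 a12 a21 a22 beta")

r1 = y1 - x11 * beta
r2 = y2 - x21 * beta
norm_sq = (a11 * r1 + a12 * r2) ** 2 + (a21 * r1 + a22 * r2) ** 2

# Expand as a polynomial in the coordinates of s, with (A, beta)-dependent coefficients.
poly = sp.Poly(sp.expand(norm_sq), y1, y2, x11, x21)
print(poly.monoms())   # only monomials of total degree <= 2 in (y1, y2, x11, x21) appear
```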
A useful first step is the following lemma.

Lemma B.7. Let $u(\cdot)$ be a real-valued function of bounded variation on $\mathbb{R}^+$, and consider the classes of functions
$$\mathcal{F} = \big\{u(\|V^{-1/2}(y - X\beta)\|) : \beta \in \mathbb{R}^q,\ V \in \mathrm{PDS}(k)\big\},$$
$$\mathcal{F}_a = \big\{u(\|V^{-1/2}(y - X\beta)\|)s_a : \beta \in \mathbb{R}^q,\ V \in \mathrm{PDS}(k)\big\},$$
$$\mathcal{F}_{ab} = \big\{u(\|V^{-1/2}(y - X\beta)\|)s_as_b : \beta \in \mathbb{R}^q,\ V \in \mathrm{PDS}(k)\big\},$$
for $a, b = 1, \ldots, p$. Denote by $\mathcal{G}$, $\mathcal{G}_a$, and $\mathcal{G}_{ab}$ the corresponding classes of graphs of the functions in $\mathcal{F}$, $\mathcal{F}_a$, and $\mathcal{F}_{ab}$, respectively. Then $\mathcal{G}$, $\mathcal{G}_a$, and $\mathcal{G}_{ab}$ all have polynomial discrimination, for $a, b = 1, \ldots, p$.
Proof. Because the function $u(\cdot)$ is of bounded variation, it follows from Lemma B.6 that the class $\mathcal{G}$ has polynomial discrimination. To show the same for the class $\mathcal{G}_a$, suppose that $\mathcal{G}_a$ does not have polynomial discrimination. This means that for every integer $N$ there exists a set $V \subset \mathbb{R}^{p+1}$ of $N$ points, such that every subset of $V$ can be written as $D_a \cap V$ for some $D_a \in \mathcal{G}_a$. For every $(s, t) \in V$ with $s = (s_1, \ldots, s_a, \ldots, s_p)$, it must hold that $s_a \neq 0$, since otherwise this point cannot be separated from the other points by an element of the class $\mathcal{G}_a$. Define the set $V_a = \{(s, t/s_a) : (s, t) \in V\}$. Note that for $D_a \in \mathcal{G}_a$, the set $D_a \cap V$ corresponds to $D \cap V_a$ for some $D \in \mathcal{G}$. This implies that every subset of $V_a$ can be written as $D \cap V_a$, for some $D \in \mathcal{G}$. However, this is in contradiction with the fact that $\mathcal{G}$ has polynomial discrimination. We conclude that also $\mathcal{G}_a$ has polynomial discrimination. A similar argument yields that $\mathcal{G}_{ab}$ has polynomial discrimination.
Lemma B.7 is comparable to Lemma 3 in [18], which covers similar classes of functions built from $g(y, t, C) = u(\|C^{-1/2}(y - t)\|)$, where $t \in \mathbb{R}^k$ and $C \in \mathrm{PDS}(k)$. With Lemma B.7 we can establish suitable bounds on the third term of (9.2). This is provided by the following key lemma. Once Lemma B.8 has been established, asymptotic normality can be derived directly from (9.2).
To obtain (B.51), first write $M_n = V(\theta_n)^{-1}$ and $M_P = V(\theta_P)^{-1}$, so that $M_n \to M_P$, in probability, according to condition (V1). Decompose the difference as in (B.52), which contains the factor $\rho(d(s, \xi_n)) - \rho(d(s, \xi_P))$. After integration with respect to $P_n - P$, the first term on the right hand side of (B.52) becomes (B.53), where, according to (B.54), the accompanying factor converges, with (V4), in probability. Furthermore, note that all functions $\rho(d(\cdot, \xi))$, for $\xi \in \mathbb{R}^q \times \mathbb{R}^l$, are members of the class
$$\mathcal{F} = \big\{\rho(\|V^{-1/2}(y - X\beta)\|) : \beta \in \mathbb{R}^q,\ V \in \mathrm{PDS}(k)\big\}.$$
From (R1)-(R2) it follows that the function $\rho$ is of bounded variation (being the sum of two monotone functions). According to Lemma B.7, the class $\mathcal{G}$, consisting of subgraphs of functions in the class $\mathcal{F}$, has polynomial discrimination. Moreover, the class $\mathcal{F}$ has a constant envelope.
Then, as a result of the empirical process theory developed in Pollard [23] (e.g., see Theorem 1 in [18]), it follows that for every $\delta > 0$ the uniform bound (B.55) holds. Because $\xi_n \to \xi_P$ with probability one, the pair of functions $f_1(s) = \rho(d(s, \xi_n))$ and $f_2(s) = \rho(d(s, \xi_P))$ are in the set $[\delta]$, for sufficiently large $n$, with probability tending to one. It follows that
$$\sqrt{n}\int \big\{\rho(d(s, \xi_n)) - \rho(d(s, \xi_P))\big\}\,d(P_n - P)(s) \to 0,$$
in probability. Together with the fact that $\mathrm{tr}(M_P^{-1}\partial V(\theta_P)/\partial\theta_j)$ is bounded, this proves that (B.53) is of the order $o_P(1/\sqrt{n})$. For the second term on the right hand side of (B.52), we have, according to the central limit theorem,
$$\int \big(\rho(d(s, \xi_P)) - b_0\big)\,d(P_n - P)(s) = O_P(1/\sqrt{n}).$$
Together with (B.54) this implies that, after integration with respect to $P_n - P$, the second term on the right hand side of (B.52) is of the order $o_P(1/\sqrt{n})$. This proves (B.51). To obtain (B.49), decompose as follows:
$$\Psi_\beta(s, \xi_n) - \Psi_\beta(s, \xi_P) = \big\{u(d(s, \xi_n))X^TM_ny - u(d(s, \xi_P))X^TM_Py\big\} - \big\{u(d(s, \xi_n))X^TM_nX\beta_n - u(d(s, \xi_P))X^TM_PX\beta_P\big\}. \qquad (B.56)$$
We will treat both terms on the right hand side separately. For the first term on the right hand side of (B.56), we write
$$\big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}X^TM_ny + u(d(s, \xi_P))X^T(M_n - M_P)y. \qquad (B.57)$$
First consider the first term in (B.57). We consider each single element of the vector $X^TM_ny \in \mathbb{R}^q$ separately. For $i = 1, \ldots, q$ fixed, write
$$(X^TM_ny)_i = \sum_{j=1}^k x_{ji}(M_ny)_j = \sum_{j=1}^k x_{ji}\sum_{s=1}^k m_{n,js}y_s = \sum_{j=1}^k\sum_{s=1}^k x_{ji}y_sm_{n,js}.$$
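As an aside, the entry-wise bookkeeping used here and in the remaining decompositions can be sanity-checked numerically; the sketch below (with arbitrary dimensions and random inputs) verifies the displayed expansion of $(X^TM_ny)_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
k, q = 4, 2
X = rng.standard_normal((k, q))
y = rng.standard_normal(k)
M = rng.standard_normal((k, k))
i = 0

# (X^T M y)_i = sum_j sum_s x_{ji} y_s m_{js}
lhs = (X.T @ M @ y)[i]
rhs = sum(X[j, i] * y[s] * M[j, s] for j in range(k) for s in range(k))
assert np.isclose(lhs, rhs)
```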
Hence, using this expansion and integrating with respect to $P_n - P$, the $i$-th component of the first term in (B.57) can be written as a finite sum with summands
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}x_{ji}y_s\,d(P_n - P)(s)\;m_{n,js},$$
for $i = 1, \ldots, q$ and $j, s \in \{1, \ldots, k\}$ fixed, where $m_{n,js} \to m_{P,js}$, in probability. Because $u$ is of bounded variation and since $s = (y, X) = (y_1, \ldots, y_k, x_{11}, \ldots, x_{kq})$, all functions $u(d(s, \xi))x_{ji}y_s$, for $\xi \in \mathbb{R}^q \times \mathbb{R}^l$, are members of the class
$$\mathcal{F}_{ab} = \big\{u(\|V^{-1/2}(y - X\beta)\|)s_as_b : \beta \in \mathbb{R}^q,\ V \in \mathrm{PDS}(k)\big\}.$$
According to Lemma B.7, the corresponding class of subgraphs has polynomial discrimination, so similar to (B.55) it follows that for each $i = 1, \ldots, q$ and $j, s \in \{1, \ldots, k\}$ fixed,
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}x_{ji}y_s\,d(P_n - P)(s) = o_P(n^{-1/2}),$$
which means that
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}X^TM_ny\,d(P_n - P)(s) = o_P(n^{-1/2}). \qquad (B.58)$$
Next, consider the second term on the right hand side of (B.57). Similar to the first term, after integration with respect to $P_n - P$, the $i$-th component of the second term on the right hand side of (B.57) can be written as a finite sum of summands
$$\int u(d(s, \xi_P))x_{ji}y_s\,d(P_n - P)(s)\,(m_{n,js} - m_{P,js}),$$
for $i = 1, \ldots, q$ and $j, s \in \{1, \ldots, k\}$ fixed. According to the central limit theorem, the integral is of the order $O_P(n^{-1/2})$, and since $m_{n,js} \to m_{P,js}$, in probability, it follows that the product is of the order $o_P(n^{-1/2})$. We conclude that
$$\int u(d(s, \xi_P))X^T(M_n - M_P)y\,d(P_n - P)(s) = o_P(n^{-1/2}). \qquad (B.59)$$
Putting together (B.58) and (B.59), it follows for the first term on the right hand side of (B.56) that
$$\int \big\{u(d(s, \xi_n))X^TM_ny - u(d(s, \xi_P))X^TM_Py\big\}\,d(P_n - P)(s) = o_P(n^{-1/2}). \qquad (B.60)$$
For the second term on the right hand side of (B.56), we write
$$u(d(s, \xi_n))X^TM_nX\beta_n - u(d(s, \xi_P))X^TM_PX\beta_P = \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}X^TM_nX\beta_n + u(d(s, \xi_P))\big\{X^TM_nX\beta_n - X^TM_PX\beta_P\big\}. \qquad (B.61)$$
Consider the first term on the right hand side of (B.61). For $i = 1, \ldots, q$ fixed, write
$$(X^TM_nX\beta_n)_i = \sum_{j=1}^k\sum_{s=1}^k\sum_{t=1}^q x_{ji}x_{st}m_{n,js}\beta_{n,t}.$$
We see that, after integration with respect to $P_n - P$, the $i$-th component of the first term on the right hand side of (B.61) can be written as a finite summation of summands
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}x_{ji}x_{st}\,d(P_n - P)(s)\;m_{n,js}\beta_{n,t},$$
for $i = 1, \ldots, q$, $j, s \in \{1, \ldots, k\}$, and $t \in \{1, \ldots, q\}$, where $m_{n,js} \to m_{P,js}$ and $\beta_{n,t} \to \beta_{P,t}$, in probability. All functions $u(d(s, \xi))x_{ji}x_{st}$, for $\xi \in \mathbb{R}^q \times \mathbb{R}^l$, are members of the class $\mathcal{F}_{ab}$ defined above, and $u$ is of bounded variation. According to Lemma B.7, the corresponding class of subgraphs has polynomial discrimination, so similar to (B.55) it follows that
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}x_{ji}x_{st}\,d(P_n - P)(s) = o_P(n^{-1/2}),$$
which means that
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}X^TM_nX\beta_n\,d(P_n - P)(s) = o_P(n^{-1/2}). \qquad (B.62)$$
Next, consider the second term on the right hand side of (B.61). We then have to deal with summands of the form
$$\int u(d(s, \xi_P))x_{ji}x_{st}\,d(P_n - P)(s)\,(m_{n,js}\beta_{n,t} - m_{P,js}\beta_{P,t}),$$
for $i = 1, \ldots, q$, $j, s \in \{1, \ldots, k\}$, and $t \in \{1, \ldots, q\}$, where $m_{n,js} \to m_{P,js}$ and $\beta_{n,t} \to \beta_{P,t}$, in probability. According to the central limit theorem, the integral is of the order $O_P(n^{-1/2})$. Because $m_{n,js}\beta_{n,t} \to m_{P,js}\beta_{P,t}$, in probability, it follows that the product is of the order $o_P(n^{-1/2})$. We conclude that
$$\int u(d(s, \xi_P))\big\{X^TM_nX\beta_n - X^TM_PX\beta_P\big\}\,d(P_n - P)(s) = o_P(n^{-1/2}),$$
which, together with (B.60) and (B.62), proves (B.49). Finally, consider $\Psi_{2,j}$. Again write $M_n = V(\theta_n)^{-1}$ and $M_P = V(\theta_P)^{-1}$, so that $M_n \to M_P$, in probability, according to condition (V4). Decompose $\Psi_{2,j}(s, \xi_n) - \Psi_{2,j}(s, \xi_P)$ as follows:
$$\big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}(y - X\beta_n)^TM_n(y - X\beta_n) + u(d(s, \xi_P))\big\{(y - X\beta_n)^TM_n(y - X\beta_n) - (y - X\beta_P)^TM_P(y - X\beta_P)\big\}. \qquad (B.64)$$
The first term in (B.64) can be written as the trace of the matrix
$$\big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}(y - X\beta_n)(y - X\beta_n)^TM_n,$$
where $\beta_n \to \beta_P$ and $M_n \to M_P$, in probability. As before, we consider each single entry of this $k \times k$ matrix. The $(i, j)$-th element of $(y - X\beta_n)(y - X\beta_n)^T$ is equal to
$$(y - X\beta_n)_i(y - X\beta_n)_j = y_iy_j - \sum_{s=1}^q y_jx_{is}\beta_{n,s} - \sum_{t=1}^q y_ix_{jt}\beta_{n,t} + \sum_{s=1}^q\sum_{t=1}^q x_{is}x_{jt}\beta_{n,s}\beta_{n,t}. \qquad (B.65)$$
Hence, each entry of $\big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}(y - X\beta_n)(y - X\beta_n)^T$ is a combination of four summations. The last of these summations arising from (B.65), after integration with respect to $P_n - P$, has summands
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}x_{is}x_{jt}\,d(P_n - P)(s)\;\beta_{n,s}\beta_{n,t},$$
where $\beta_{n,s}\beta_{n,t} \to \beta_{P,s}\beta_{P,t}$, in probability. All functions $u(d(s, \xi))x_{is}x_{jt}$, for $\xi \in \mathbb{R}^q \times \mathbb{R}^l$, are members of the class $\mathcal{F}_{ab}$, where $u$ is of bounded variation. According to Lemma B.7, the corresponding class of subgraphs has polynomial discrimination, so similar to (B.55) it follows that
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}x_{is}x_{jt}\,d(P_n - P)(s) = o_P(n^{-1/2}).$$
After taking traces, we conclude that
$$\int \big\{u(d(s, \xi_n)) - u(d(s, \xi_P))\big\}(y - X\beta_n)^TM_n(y - X\beta_n)\,d(P_n - P)(s) = o_P(n^{-1/2}). \qquad (B.66)$$
Next, consider the second term on the right hand side of (B.64). First, note that
$$(y - X\beta)^TM(y - X\beta) = \sum_{i=1}^k\sum_{j=1}^k m_{ij}\Big(y_iy_j - \sum_{s=1}^q y_ix_{js}\beta_s - \sum_{t=1}^q y_jx_{it}\beta_t + \sum_{s=1}^q\sum_{t=1}^q x_{is}x_{jt}\beta_s\beta_t\Big).$$
This means that
$$(y - X\beta_n)^TM_n(y - X\beta_n) - (y - X\beta_P)^TM_P(y - X\beta_P) = \sum_{i=1}^k\sum_{j=1}^k \Big(y_iy_j(m_{n,ij} - m_{P,ij}) - \sum_{s=1}^q y_ix_{js}(\beta_{n,s}m_{n,ij} - \beta_{P,s}m_{P,ij}) - \sum_{t=1}^q y_jx_{it}(\beta_{n,t}m_{n,ij} - \beta_{P,t}m_{P,ij}) + \sum_{s=1}^q\sum_{t=1}^q x_{is}x_{jt}(\beta_{n,s}\beta_{n,t}m_{n,ij} - \beta_{P,s}\beta_{P,t}m_{P,ij})\Big). \qquad (B.67)$$
We see that $u(d(s, \xi_P))\big\{(y - X\beta_n)^TM_n(y - X\beta_n) - (y - X\beta_P)^TM_P(y - X\beta_P)\big\}$ can be written as the combination of four summations. The last of the summations arising from (B.67), after integration with respect to $P_n - P$, has summands
$$\int u(d(s, \xi_P))x_{is}x_{jt}\,d(P_n - P)(s)\,(\beta_{n,s}\beta_{n,t}m_{n,ij} - \beta_{P,s}\beta_{P,t}m_{P,ij}),$$
where $\beta_{n,s}\beta_{n,t}m_{n,ij} \to \beta_{P,s}\beta_{P,t}m_{P,ij}$, in probability. According to the central limit theorem, the integral is of the order $O_P(n^{-1/2})$, whereas the second factor tends to zero in probability. All functions $u(d(s, \xi))x_{is}x_{jt}$, for $\xi \in \mathbb{R}^q \times \mathbb{R}^l$, are members of the class $\mathcal{F}_{ab}$, where $u$ is of bounded variation. According to Lemma B.7, the corresponding class of subgraphs has polynomial discrimination, so similar to (B.55) it follows that
$$\int u(d(s, \xi_P))x_{is}x_{jt}\,d(P_n - P)(s)\,(\beta_{n,s}\beta_{n,t}m_{n,ij} - \beta_{P,s}\beta_{P,t}m_{P,ij}) = o_P(n^{-1/2}).$$
The other three summations that arise from the right hand side of (B.67) can be handled in the same way, so that the same bound holds for the second term on the right hand side of (B.64) after integration with respect to $P_n - P$.
When we insert the expressions for $\delta_1$, $\delta_2$, $a$, and $b$ given in (B.69) and (B.44), and substitute $E = \Sigma^{-1/2}\otimes\Sigma^{-1/2}L$, we find that the limiting covariance of $\sqrt{n}(\theta_n - \theta(P))$ is given by the stated expression. This finishes the proof.