Abstract
Multiple kernel clustering (MKC) algorithms optimally combine a group of pre-specified base kernels to improve clustering performance. However, existing MKC algorithms cannot efficiently address the situation where some rows and columns of the base kernels are absent. This paper proposes a simple yet effective algorithm to address this issue. Different from existing approaches, where incomplete kernels are first imputed and a standard MKC algorithm is then applied to the imputed kernels, our algorithm integrates imputation and clustering into a unified learning procedure. Specifically, we perform multiple kernel clustering directly in the presence of incomplete kernels, which are treated as auxiliary variables to be jointly optimized. Our algorithm does not require that there be at least one complete base kernel over all the samples. Also, it adaptively imputes incomplete kernels and combines them to best serve clustering. A three-step iterative algorithm with proved convergence is designed to solve the resultant optimization problem. Extensive experiments are conducted on four benchmark data sets to compare the proposed algorithm with existing imputation-based methods. Our algorithm consistently achieves superior performance, and the improvement becomes more significant as the missing ratio increases, verifying the effectiveness and advantages of the proposed joint imputation and clustering.
Introduction
Recent years have seen much effort devoted to designing effective and efficient multiple kernel clustering (MKC) algorithms (Zhao, Kwok, and Zhang 2009; Yu et al. 2012; Gönen and Margolin 2014; Du et al. 2015; Liu et al. 2016; Li et al. 2016; Cao et al. 2015a; Zhang et al. 2015; Cao et al. 2015b; Zhang et al. 2016). They aim to optimally combine a group of pre-specified base kernels to perform data clustering. For example, the work in (Zhao, Kwok, and Zhang 2009) proposes to find the maximum margin hyperplane, the best cluster labeling, and the optimal kernel simultaneously. A novel optimized kernel k-means algorithm is presented in (Yu et al. 2012) to combine multiple data sources for clustering analysis. In (Gönen and Margolin 2014), the kernel combination weights are allowed to adaptively change to capture the characteristics of individual samples. Replacing the squared error in k-means with an ℓ2,1-norm based one, the work in (Du et al. 2015) develops a robust multiple kernel k-means (MKKM) algorithm that simultaneously finds the best clustering labels and the optimal combination of kernels. Observing that existing MKKM algorithms do not sufficiently consider the correlation among base kernels, the work in (Liu et al. 2016) designs a matrix-induced regularization to reduce the redundancy and enhance the diversity of the selected kernels. These algorithms have been applied to various applications and demonstrated attractive clustering performance (Yu et al. 2012; Gönen and Margolin 2014).
One underlying assumption commonly adopted by the above-mentioned MKC algorithms is that all of the base kernels are complete, i.e., none of the rows or columns of any base kernel shall be absent. In some practical applications such as Alzheimer's disease prediction (Xiang et al. 2013) and cardiac disease discrimination (Kumar et al. 2013), however, it is not uncommon to see that some views of a sample are missing, which leaves the corresponding rows and columns of the related base kernels unfilled. The presence of incomplete base kernels makes it difficult to utilize the information of all views for clustering. A straightforward remedy is to first impute the incomplete kernels with a filling algorithm and then apply a standard MKC algorithm to the imputed kernels. Some widely used filling algorithms include zero-filling, mean value filling, k-nearest-neighbor filling and expectation-maximization (EM) filling (Ghahramani and Jordan 1993). Recently, more advanced imputation algorithms have been developed (Trivedi et al. 2010; Xu, Tao, and Xu 2015; Bhadra, Kaski, and Rousu 2016; Shao, He, and Yu 2015; Liu et al. 2014; 2015). The work in (Trivedi et al. 2010) constructs a full kernel matrix for an incomplete view with the help of another complete view (or equivalently, base kernel). By exploiting the connections among multiple views, the work in (Xu, Tao, and Xu 2015) proposes an algorithm to accomplish multi-view learning with incomplete views, where different views are assumed to be generated from a shared subspace. In (Shao, He, and Yu 2015), a multi-incomplete-view clustering algorithm is proposed. It learns latent feature matrices for all the views and generates a consensus matrix so that the difference between each view and the consensus is minimized. In addition, by modelling both within-view and between-view relationships among kernel values, an approach is proposed in (Bhadra, Kaski, and Rousu 2016) to predict missing rows and columns of a base kernel.
Though demonstrating promising clustering performance in various applications, the above "two-stage" algorithms share a drawback: they disconnect the imputation and clustering processes, which prevents the two from negotiating with each other to achieve optimal clustering. Can we instead design a clustering-oriented imputation algorithm, i.e., one that imputes kernels specifically to enhance clustering?
To address this issue, we propose an absent multiple kernel k-means algorithm that integrates imputation and clustering into a single optimization procedure. In our algorithm, the clustering result at the last iteration guides the imputation of absent kernel elements, and the imputed kernels are in turn used to conduct the subsequent clustering. These two procedures are alternately performed until convergence. In this way, the imputation and clustering processes are seamlessly connected, with the aim of achieving better clustering performance. The optimization objective of the proposed absent multiple kernel clustering algorithm is carefully designed, and an efficient algorithm with proved convergence is developed to solve the resultant optimization problem. An extensive experimental study is carried out on four multiple kernel learning (MKL) benchmark data sets to evaluate the clustering performance of the proposed algorithm. As the results indicate, our algorithm significantly outperforms existing two-stage imputation methods, and the improvement is particularly significant at high missing ratios, which is desirable. It is expected that the simplicity and effectiveness of this clustering algorithm will make it a good option to consider for practical applications where incomplete views or kernels are encountered.
Related Work
Kernel k-means clustering (KKM)
Let $\{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ be a collection of n samples, and $\phi(\cdot): x \in \mathcal{X} \mapsto \mathcal{H}$ be a feature mapping that maps x onto a reproducing kernel Hilbert space $\mathcal{H}$. The objective of kernel k-means clustering is to minimize the sum-of-squares loss over the cluster assignment matrix Z ∈ {0, 1}n×k, which can be formulated as the following optimization problem,
$$\min_{Z\in\{0,1\}^{n\times k}}\ \sum_{i=1}^{n}\sum_{c=1}^{k} Z_{ic}\,\big\|\phi(x_i)-\mu_c\big\|_2^2 \quad \text{s.t.}\ \ \sum_{c=1}^{k} Z_{ic}=1, \tag{1}$$
where $n_c=\sum_{i=1}^{n} Z_{ic}$ and $\mu_c=\frac{1}{n_c}\sum_{i=1}^{n} Z_{ic}\,\phi(x_i)$ are the size and centroid of the c-th cluster, respectively.
The optimization problem in Eq.(1) can be rewritten as the following matrix-vector form,
$$\min_{Z\in\{0,1\}^{n\times k}}\ \operatorname{Tr}(K) - \operatorname{Tr}\!\big(L^{1/2} Z^{\top} K Z L^{1/2}\big) \quad \text{s.t.}\ \ Z\mathbf{1}_k = \mathbf{1}_n, \tag{2}$$
where K is a kernel matrix with Kij = ϕ(xi)⊤ϕ(xj), $L=\operatorname{diag}\big([n_1^{-1},\cdots,n_k^{-1}]\big)$, and $\mathbf{1}_{\ell}\in\mathbb{R}^{\ell}$ is a column vector with all elements being one.
The variable Z in Eq.(2) is discrete, which makes the optimization problem difficult to solve. A common approach is to relax Z to take real values. Specifically, by defining $H = ZL^{1/2}$ and letting H take real values, a relaxed version of the above problem can be obtained as
$$\min_{H\in\mathbb{R}^{n\times k}}\ \operatorname{Tr}\!\big(K\,(I_n - HH^{\top})\big) \quad \text{s.t.}\ \ H^{\top}H = I_k, \tag{3}$$
where Ik is an identity matrix of size k × k. The optimal H for Eq.(3) is obtained by taking the k eigenvectors that correspond to the k largest eigenvalues of K (Jegelka et al. 2009).
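The eigenvector solution of the relaxed problem in Eq.(3) can be sketched in a few lines of NumPy (our own illustration, not the authors' code; the function name is ours):

```python
import numpy as np

def relaxed_kernel_kmeans(K, k):
    """Solve the relaxed kernel k-means problem in Eq.(3): the optimal H
    consists of the k eigenvectors of K with the largest eigenvalues."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    vals, vecs = np.linalg.eigh(K)
    H = vecs[:, -k:]  # eigenvectors for the k largest eigenvalues
    return H
```

Discrete cluster labels are typically recovered afterwards by running ordinary k-means on the rows of H, as in spectral clustering.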
Multiple kernel k-means clustering (MKKM)
In a multiple kernel setting, each sample has multiple feature representations defined by a group of feature mappings $\{\phi_p(\cdot)\}_{p=1}^{m}$. Specifically, each sample is represented as ϕβ(x) = [β1ϕ1(x)⊤, ⋯, βmϕm(x)⊤]⊤, where β = [β1, ⋯, βm]⊤ consists of the coefficients of the m base kernels, which are optimized during learning. Based on the definition of ϕβ(x), a kernel function can be expressed as
$$\kappa_{\beta}(x_i, x_j) = \phi_{\beta}(x_i)^{\top}\phi_{\beta}(x_j) = \sum_{p=1}^{m} \beta_p^{2}\,\kappa_p(x_i, x_j). \tag{4}$$
By replacing the kernel matrix K in Eq.(3) with Kβ computed via Eq.(4), the objective of MKKM can be written as
$$\min_{H,\,\beta}\ \operatorname{Tr}\!\big(K_{\beta}\,(I_n - HH^{\top})\big) \quad \text{s.t.}\ \ H\in\mathbb{R}^{n\times k},\ H^{\top}H = I_k,\ \beta^{\top}\mathbf{1}_m = 1,\ \beta_p \ge 0\ \ \forall p, \tag{5}$$

where $K_{\beta} = \sum_{p=1}^{m}\beta_p^{2} K_p$.
This problem can be solved by alternately updating H and β: i) Optimizing H given β. With the kernel coefficients β fixed, H can be obtained by solving a kernel k-means clustering optimization problem shown in Eq.(3); ii) Optimizing β given H. With H fixed, β can be optimized via solving the following quadratic programming with linear constraints,
$$\min_{\beta}\ \sum_{p=1}^{m} \beta_p^{2}\,\operatorname{Tr}\!\big(K_p\,(I_n - HH^{\top})\big) \quad \text{s.t.}\ \ \beta^{\top}\mathbf{1}_m = 1,\ \beta_p \ge 0\ \ \forall p. \tag{6}$$
As noted in (Yu et al. 2012; Gönen and Margolin 2014), using a convex combination of kernels to replace Kβ in Eq.(5) is not a viable option, because it would activate only a single kernel and assign zero weights to all the others. Other recent work using ℓ2-norm combination can be found in (Kloft et al. 2011; 2009; Cortes, Mohri, and Rostamizadeh 2009; Liu et al. 2013).
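Step ii) of the MKKM alternation admits a simple closed form: minimizing $\sum_p \beta_p^2 a_p$ with $a_p = \operatorname{Tr}(K_p(I_n - HH^\top)) \ge 0$ over the simplex gives, by a standard Lagrangian argument, $\beta_p \propto 1/a_p$ (the nonnegativity constraints are then inactive). A minimal sketch, with naming of our own:

```python
import numpy as np

def update_beta(K_list, H):
    """Closed-form minimizer of the QP in Eq.(6) over the simplex.
    With a_p = Tr(K_p (I - H H^T)), setting the Lagrangian derivative
    2*beta_p*a_p - lam = 0 yields beta_p proportional to 1/a_p."""
    n = H.shape[0]
    U = np.eye(n) - H @ H.T
    a = np.array([np.trace(Kp @ U) for Kp in K_list])
    inv = 1.0 / np.maximum(a, 1e-12)  # guard against a_p close to zero
    return inv / inv.sum()
```

Kernels with a small residual $a_p$ (i.e., those already well aligned with the current clustering) thus receive larger weights.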
The Proposed Algorithm
Formulation
Let $s_p$ (1 ≤ p ≤ m) denote the indices of the samples whose p-th view is present, and let $K_p^{(cc)} = K_p(s_p, s_p)$ denote the kernel sub-matrix computed with these samples. Note that this setting is consistent with the literature, and it is even more general because it does not require that there be at least one complete view across all the samples, as assumed in (Trivedi et al. 2010).
The absence of rows and columns from base kernels makes clustering challenging. Existing two-stage approaches first impute these base kernels and then apply a conventional clustering algorithm with them. We have the following two arguments. Firstly, although such imputation is sound from the perspective of “general-purpose”, it may not be an optimal option when it has been known that the imputed kernels are used for clustering. This is because for most, if not all, practical tasks a belief holds that these pre-selected base kernels or views (when in their complete form) shall, more or less, be able to serve clustering. However, such a belief was not exploited by these two-stage approaches as prior knowledge to guide the imputation process. Secondly, from the perspective that the ultimate goal is to appropriately cluster data, we shall try to directly pursue the clustering result, by treating the absent kernel elements as auxiliary unknowns during this course. In other words, imputed kernels could be merely viewed as the by-products of clustering.
These two arguments motivate us to seek a more natural and reasonable manner to deal with the absence in multiple kernel clustering. That is to perform imputation and clustering in a joint way: 1) impute the absent kernels under the guidance of clustering; and 2) update the clustering with the imputed kernels. In this way, the above two learning processes are seamlessly coupled and are allowed to negotiate with each other to achieve better clustering. Specifically, we propose the multiple kernel k-means algorithm with incomplete kernels as follows,
$$\min_{H,\,\beta,\,\{K_p\}_{p=1}^{m}}\ \operatorname{Tr}\!\big(K_{\beta}\,(I_n - HH^{\top})\big) \quad \text{s.t.}\ \ H^{\top}H = I_k,\ \beta^{\top}\mathbf{1}_m = 1,\ \beta_p \ge 0,\ K_p(s_p, s_p) = K_p^{(cc)},\ K_p \succeq 0,\ \forall p. \tag{7}$$
The only difference between the objective function in Eq.(7) and that of traditional MKKM in Eq.(5) lies in the incorporation of $\{K_p\}_{p=1}^{m}$ as optimization variables. Note that the constraint $K_p(s_p, s_p) = K_p^{(cc)}$ is imposed to ensure that each Kp maintains its known entries during the course of optimization. Though the model in Eq.(7) is simple, it admits the following advantages: 1) Our objective function is more direct and well targets the ultimate goal, i.e., clustering, by integrating kernel completion and clustering into one unified learning framework, where the kernel imputation is treated as a by-product; 2) Our algorithm works in an MKL scenario (Rakotomamonjy et al. 2008), which is able to naturally deal with a large number of base kernels and adaptively combine them for clustering; 3) Our algorithm does not require any base kernel to be completely observed, which is however necessary for some of the existing imputation algorithms such as (Trivedi et al. 2010). Besides, our algorithm is parameter-free once the number of clusters to form is specified.
Alternate optimization
Although Eq.(7) is not difficult to understand, the positive semi-definite (PSD) constraints on $\{K_p\}_{p=1}^{m}$ make it difficult to optimize. In the following, we design an efficient three-step algorithm that solves this problem in an alternate manner:
i). Optimizing H with fixed β and $\{K_p\}_{p=1}^{m}$.
Given β and $\{K_p\}_{p=1}^{m}$, the optimization in Eq.(7) for H reduces to a standard kernel k-means problem, which can be efficiently solved as in Eq.(3);
Algorithm 1.
Proposed Multiple Kernel k-means with Incomplete Kernels
ii). Optimizing $\{K_p\}_{p=1}^{m}$ with fixed β and H.
Given β and H, the optimization in Eq.(7) with respect to $\{K_p\}_{p=1}^{m}$ is equivalent to the following optimization problem,
$$\min_{\{K_p\}_{p=1}^{m}}\ \sum_{p=1}^{m} \beta_p^{2}\,\operatorname{Tr}\!\big(K_p\,(I_n - HH^{\top})\big) \quad \text{s.t.}\ \ K_p(s_p, s_p) = K_p^{(cc)},\ K_p \succeq 0,\ \forall p. \tag{8}$$
Directly solving the optimization problem in Eq.(8) appears to be computationally intractable because it involves multiple kernel matrices. Looking into this optimization problem, we can find that the constraints are separately defined on each Kp and that the objective function is a sum over each Kp. Therefore, we can equivalently rewrite the problem in Eq.(8) as m independent sub-problems, as stated in Eq.(9),
$$\min_{K_p \succeq 0}\ \operatorname{Tr}\!\big(K_p\, U\big) \quad \text{s.t.}\ \ K_p(s_p, s_p) = K_p^{(cc)}, \tag{9}$$
where U = In − HH⊤ and p = 1, ⋯, m.
Considering that Kp is PSD, we can decompose it as $K_p = \Phi_p^{\top}\Phi_p$. Inspired by the work in (Trivedi et al. 2010), we write $\Phi_p = [\Phi_p^{(c)},\ \Phi_p^{(m)}]$, where $\Phi_p^{(c)}$ and $\Phi_p^{(m)}$ correspond to the observed and absent samples of the p-th view, respectively. In this way, the optimization problem in Eq.(9) can be rewritten as
$$\min_{\Phi_p^{(m)}}\ \operatorname{Tr}\!\Big(\big[\Phi_p^{(c)},\ \Phi_p^{(m)}\big]^{\top}\big[\Phi_p^{(c)},\ \Phi_p^{(m)}\big]\, U\Big), \tag{10}$$
where the matrix U is expressed in a blocked form, following the partition of samples into observed and absent ones for the p-th view, as
$$U = \begin{bmatrix} U^{(cc)} & U^{(cm)} \\ U^{(mc)} & U^{(mm)} \end{bmatrix}.$$
By taking the derivative of Eq.(10) with respect to $\Phi_p^{(m)}$ and letting it vanish, we obtain an analytical solution for the optimal $\Phi_p^{(m)}$ as
$$\Phi_p^{(m)} = -\,\Phi_p^{(c)}\, U^{(cm)} \big(U^{(mm)}\big)^{-1}. \tag{11}$$
Correspondingly, we have a closed-form expression for the optimal Kp in Eq.(9):
$$K_p = \begin{bmatrix} K_p^{(cc)} & -\,K_p^{(cc)}\, U^{(cm)} \big(U^{(mm)}\big)^{-1} \\ -\,\big(U^{(mm)}\big)^{-1} U^{(mc)} K_p^{(cc)} & \big(U^{(mm)}\big)^{-1} U^{(mc)} K_p^{(cc)}\, U^{(cm)} \big(U^{(mm)}\big)^{-1} \end{bmatrix}. \tag{12}$$
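The closed-form imputation of Eq.(12) can be sketched directly in NumPy (a sketch with our own naming; `obs`/`mis` are the index arrays of observed and absent samples for the p-th view):

```python
import numpy as np

def impute_kernel(K_cc, H, obs, mis):
    """Closed-form imputation of one base kernel, following Eq.(12).
    K_cc is the observed sub-kernel K_p(s_p, s_p)."""
    n = H.shape[0]
    U = np.eye(n) - H @ H.T
    U_cm = U[np.ix_(obs, mis)]
    U_mm = U[np.ix_(mis, mis)]
    B = -U_cm @ np.linalg.inv(U_mm)  # maps the observed block to the missing one
    # assemble the full kernel; the result equals [I; B^T] K_cc [I, B]
    # under the observed/missing reordering, hence it stays PSD
    K = np.zeros((n, n))
    K[np.ix_(obs, obs)] = K_cc
    K[np.ix_(obs, mis)] = K_cc @ B
    K[np.ix_(mis, obs)] = (K_cc @ B).T
    K[np.ix_(mis, mis)] = B.T @ K_cc @ B
    return K
```

Note that the solution preserves the observed entries exactly and keeps the imputed kernel PSD by construction.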
iii). Optimizing β with fixed H and $\{K_p\}_{p=1}^{m}$.
Given H and $\{K_p\}_{p=1}^{m}$, the optimization in Eq.(7) for β is a quadratic program with linear constraints, which can be efficiently solved as in Eq.(6).
In sum, our algorithm for solving Eq.(7) is outlined in Algorithm 1, where the absent elements of each base kernel are initially imputed with zeros and obj(t) denotes the objective value at the t-th iteration. It is worth pointing out that the objective of Algorithm 1 is guaranteed to monotonically decrease when optimizing one variable with the others fixed at each iteration. At the same time, the objective is lower-bounded by zero. As a result, our algorithm is guaranteed to converge. Moreover, as shown in the experimental study, it usually converges in fewer than 30 iterations. Like MKKM, our algorithm solves an eigen-decomposition and a QP problem per iteration, which brings little extra computation since the imputation is done analytically via Eq.(12).
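The three steps above can be assembled into a compact end-to-end sketch (our reconstruction from the description; all names and the convergence check are ours, and the sample indices are assumed to jointly cover all n samples):

```python
import numpy as np

def mkkm_incomplete(K_obs, index_sets, k, max_iter=30, tol=1e-6):
    """Sketch of the proposed alternate optimization (Algorithm 1).
    K_obs[p]: observed sub-kernel of the p-th view;
    index_sets[p]: indices of samples whose p-th view is present."""
    m = len(K_obs)
    n = max(max(s) for s in index_sets) + 1
    # zero-fill absent entries for initialization
    K = [np.zeros((n, n)) for _ in range(m)]
    for p, s in enumerate(index_sets):
        K[p][np.ix_(s, s)] = K_obs[p]
    beta = np.full(m, 1.0 / m)
    prev = np.inf
    for t in range(max_iter):
        Kb = sum(b**2 * Kp for b, Kp in zip(beta, K))
        # step i: H = top-k eigenvectors of the combined kernel (Eq.(3))
        vals, vecs = np.linalg.eigh(Kb)
        H = vecs[:, -k:]
        U = np.eye(n) - H @ H.T
        # step ii: impute each incomplete kernel in closed form (Eq.(12))
        for p, s in enumerate(index_sets):
            mis = np.setdiff1d(np.arange(n), s)
            if mis.size:
                B = -U[np.ix_(s, mis)] @ np.linalg.inv(U[np.ix_(mis, mis)])
                K[p][np.ix_(s, mis)] = K_obs[p] @ B
                K[p][np.ix_(mis, s)] = K[p][np.ix_(s, mis)].T
                K[p][np.ix_(mis, mis)] = B.T @ K_obs[p] @ B
        # step iii: update beta on the simplex (the minimizer of Eq.(6))
        a = np.maximum([np.trace(Kp @ U) for Kp in K], 1e-12)
        beta = (1.0 / a) / (1.0 / a).sum()
        obj = float(np.trace(sum(b**2 * Kp for b, Kp in zip(beta, K)) @ U))
        if prev - obj < tol:
            break
        prev = obj
    return H, beta, K
```

Each iteration costs one eigen-decomposition plus m small matrix inversions, matching the complexity discussion above.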
Table 1:
Datasets used in our experiments.
| Dataset | #Samples | #Kernels | #Classes |
|---|---|---|---|
| Flower17 | 1360 | 7 | 17 |
| Flower102 | 8189 | 4 | 102 |
| Caltech102 | 3060 | 10 | 102 |
| CCV | 6773 | 6 | 20 |
Experimental Results
Experimental settings
The proposed algorithm is experimentally evaluated on four widely used MKL benchmark data sets shown in Table 1: Oxford Flower17, Oxford Flower102, Columbia Consumer Video (CCV) and Caltech102. For the Flower17, Flower102 and Caltech102 data sets, all kernel matrices are pre-computed and publicly available for download from the data sets' websites. For Caltech102, we use its first ten base kernels for evaluation. For the CCV data set, we generate six base kernels by applying both a linear kernel and a Gaussian kernel to its SIFT, STIP and MFCC features, where the width of each Gaussian kernel is set as the mean of all pairwise sample distances.
We compare the proposed algorithm with several commonly used imputation methods, including zero filling (ZF), mean filling (MF), k-nearest-neighbor filling (KNN) and the alignment-maximization filling (AF) proposed in (Trivedi et al. 2010). The algorithms in (Xu, Tao, and Xu 2015; Shao, He, and Yu 2015; Zhao, Liu, and Fu 2016) are not incorporated into our experimental comparison since they only consider the absence of input features, not the absence of rows/columns of base kernels. Compared with (Bhadra, Kaski, and Rousu 2016), the imputation algorithm in (Trivedi et al. 2010) is much simpler and more computationally efficient. Therefore, we choose (Trivedi et al. 2010) as a representative algorithm to demonstrate the advantages and effectiveness of joint optimization of imputation and clustering. The widely used MKKM (Gönen and Margolin 2014) is applied to these imputed base kernels. These two-stage methods are termed ZF+MKKM, MF+MKKM, KNN+MKKM and AF+MKKM, respectively. We do not include the EM-based imputation algorithm due to its high computational cost, even on small samples. The Matlab codes of kernel k-means and MKKM are publicly available at https://github.com/mehmetgonen/lmkkmeans.
Following the literature (Cortes, Mohri, and Rostamizadeh 2012), all base kernels are centered and scaled so that κp(xi, xi) = 1 for all i and p. For all data sets, the true number of clusters is assumed known and is set as the true number of classes. To generate incomplete kernels, we create the index vectors as follows. We first randomly select round(ε ∗ n) samples, where round(·) denotes a rounding function. For each selected sample, a random vector v = (v1, ⋯, vm) ∈ [0, 1]m and a scalar v0 (v0 ∈ [0, 1]) are generated. The p-th view will be present for this sample if vp ≥ v0 is satisfied. In case none of v1, ⋯, vm satisfies this condition, we generate a new v to ensure that at least one view is available for each sample. Note that this does not mean that we require a complete view across all the samples. After the above step, we obtain the index vector sp listing the samples whose p-th view is present. The parameter ε, termed the missing ratio in this experiment, controls the percentage of samples that have absent views, and it affects the performance of the algorithms in comparison. Intuitively, the larger the value of ε, the poorer the clustering performance an algorithm can achieve. In order to show this point in depth, we compare these algorithms with respect to ε. Specifically, ε on all four data sets is varied over [0.1 : 0.1 : 0.9], i.e., from 0.1 to 0.9 in steps of 0.1.
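The missing-pattern generation described above can be sketched as follows (our own helper, with hypothetical naming; it returns the index vectors s_p):

```python
import numpy as np

def generate_index_sets(n, m, eps, seed=0):
    """Randomly mark views absent for round(eps*n) samples, keeping at
    least one view present per sample, as described in the text."""
    rng = np.random.default_rng(seed)
    present = np.ones((n, m), dtype=bool)
    chosen = rng.choice(n, size=round(eps * n), replace=False)
    for i in chosen:
        while True:
            v = rng.random(m)   # per-view scores
            v0 = rng.random()   # threshold
            keep = v >= v0
            if keep.any():      # redraw until at least one view survives
                present[i] = keep
                break
    # s_p: indices of samples whose p-th view is present
    return [np.flatnonzero(present[:, p]) for p in range(m)]
```

Samples not selected keep all m views; selected samples keep a random nonempty subset of views, so no view is required to be complete across all samples.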
The widely used clustering accuracy (ACC), normalized mutual information (NMI) and purity are applied to evaluate the clustering performance. For all algorithms, we repeat each experiment 50 times with random initialization to reduce the effect of the randomness caused by k-means, and report the best result. Meanwhile, we randomly generate the "incomplete" patterns 30 times in the above-mentioned way and report the statistical results. The aggregated ACC, NMI and purity are used to evaluate the goodness of the algorithms in comparison. Taking the aggregated ACC as an example, it is obtained by averaging the averaged ACC achieved by an algorithm over different ε.
Experimental results
Figure 1 presents the ACC, NMI and purity comparison of the above algorithms with different missing ratios on the four data sets. To help understand the performance achieved by our algorithm, we also provide MKKM as a reference. Note that MKKM is run on complete base kernels, without any absence. As observed: 1) The proposed algorithm (in red) consistently demonstrates the overall best performance among the MKKM methods with absent kernels in all the sub-figures; 2) The improvement of our algorithm becomes more significant as the missing ratio increases. For example, it improves the second best algorithm (AF+MKKM) by nearly five percentage points on Flower102 in terms of clustering accuracy when the missing ratio is 0.9 (see Figure 1(c)); 3) The variation of our algorithm with respect to the missing ratio is relatively smaller compared with the other algorithms, demonstrating its stability in the case of intensive absence; and 4) The performance of our algorithm is the closest to, and in multiple cases even better than, that of MKKM (in green).
Figure 1:
Clustering accuracy, NMI and purity comparison with the variation of missing ratios on four data sets. Note that MKKM (in green) is provided as a reference. There is not any absence in its base kernels.
We attribute the superiority of our algorithm to its joint optimization of imputation and clustering. On one hand, the imputation is guided by the clustering results, which makes the imputation more directly targeted at the ultimate goal. On the other hand, this meaningful imputation helps refine the clustering results. These two learning processes negotiate with each other, leading to improved clustering performance. In contrast, the ZF+MKKM, MF+MKKM, KNN+MKKM and AF+MKKM algorithms do not fully exploit the connection between the imputation and clustering procedures. This could produce imputations that do not serve the subsequent clustering as well as originally expected, hurting clustering performance. The aggregated ACC, NMI and purity, together with the standard deviation, are reported in Table 2, where the highest performance is shown in bold. Again, we observe that the proposed algorithm significantly outperforms the ZF+MKKM, MF+MKKM, KNN+MKKM and AF+MKKM algorithms, which is consistent with our observations in Figure 1.
Table 2:
Aggregated ACC, NMI and purity comparison (mean±std) of different clustering algorithms on four data sets.
| Datasets | ZF+MKKM | MF+MKKM | KNN+MKKM | AF+MKKM (Trivedi et al. 2010) | Proposed |
|---|---|---|---|---|---|
| ACC | |||||
| Flower17 | 37.09 ± 0.42 | 36.93 ± 0.48 | 37.88 ± 0.62 | 42.46 ± 0.59 | 44.56 ± 0.61 |
| Flower102 | 17.95 ± 0.15 | 17.92 ± 0.16 | 18.26 ± 0.14 | 19.09 ± 0.17 | 21.40 ± 0.18 |
| Caltech102 | 23.10 ± 0.26 | 23.15 ± 0.24 | 23.87 ± 0.26 | 26.56 ± 0.22 | 28.22 ± 0.27 |
| CCV | 14.80 ± 0.16 | 15.03 ± 0.16 | 14.73 ± 0.19 | 16.51 ± 0.25 | 19.91 ± 0.32 |
| NMI | |||||
| Flower17 | 37.40 ± 0.35 | 37.38 ± 0.40 | 38.36 ± 0.46 | 41.85 ± 0.42 | 43.50 ± 0.42 |
| Flower102 | 37.39 ± 0.08 | 37.39 ± 0.08 | 37.83 ± 0.09 | 38.32 ± 0.11 | 39.55 ± 0.10 |
| Caltech102 | 44.90 ± 0.15 | 44.94 ± 0.14 | 45.67 ± 0.18 | 47.74 ± 0.14 | 49.10 ± 0.18 |
| CCV | 10.11 ± 0.13 | 10.23 ± 0.13 | 10.25 ± 0.16 | 11.76 ± 0.19 | 14.80 ± 0.20 |
| Purity | |||||
| Flower17 | 38.61 ± 0.40 | 38.49 ± 0.48 | 39.38 ± 0.56 | 43.96 ± 0.54 | 45.92 ± 0.53 |
| Flower102 | 22.44 ± 0.12 | 22.43 ± 0.11 | 22.82 ± 0.14 | 23.63 ± 0.15 | 25.95 ± 0.14 |
| Caltech102 | 24.62 ± 0.25 | 24.66 ± 0.26 | 25.44 ± 0.27 | 28.15 ± 0.22 | 29.87 ± 0.25 |
| CCV | 18.26 ± 0.15 | 18.48 ± 0.16 | 18.33 ± 0.20 | 19.83 ± 0.26 | 23.79 ± 0.28 |
Besides comparing the above-mentioned algorithms in terms of clustering performance, we would like to gain more insight into how close the imputed base kernels (a by-product of our algorithm) are to the ground truth, i.e., the original, complete base kernels. To do this, we calculate the alignment between the ground-truth kernels and the imputed ones. Kernel alignment, a widely used criterion to measure the similarity of two kernel matrices, serves this purpose (Cortes, Mohri, and Rostamizadeh 2012). We compare the alignment resulting from our algorithm with those from the existing imputation algorithms. The results under various missing ratios are shown in Figure 2. As observed, the kernels imputed by our algorithm align with the ground-truth kernels much better than those obtained by the existing imputation algorithms. In particular, our algorithm outperforms the second best (KNN+MKKM) by more than 22 percentage points on Caltech102 when the missing ratio is 0.9. The aggregated alignment and the standard deviation are reported in Table 3. We once again observe the significant superiority of our algorithm over the compared ones. These results indicate that our algorithm not only achieves better clustering performance, but also produces better imputation results by exploiting the prior knowledge that the kernels shall serve clustering.
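The alignment criterion used here can be computed in one line; a minimal sketch of the uncentered variant (Cortes, Mohri, and Rostamizadeh 2012 additionally center the kernels, which this sketch omits):

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Alignment between two kernel matrices:
    A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    num = np.sum(K1 * K2)  # Frobenius inner product
    return num / (np.linalg.norm(K1) * np.linalg.norm(K2))
```

For PSD kernels the inner product is nonnegative, so the alignment lies in [0, 1], with 1 attained when the two kernels coincide up to scaling.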
Figure 2:
Kernel alignment between the original kernels and the imputed kernels by different algorithms under different missing ratios.
Table 3:
Aggregated alignment between the original kernels and the imputed kernels (mean±std) on four data sets.
| Datasets | ZF+MKKM | MF+MKKM | KNN+MKKM | AF+MKKM (Trivedi et al. 2010) | Proposed |
|---|---|---|---|---|---|
| Flower17 | 80.07 ± 0.08 | 80.05 ± 0.08 | 81.45 ± 0.06 | 86.50 ± 0.08 | 88.70 ± 0.12 |
| Flower102 | 75.55 ± 0.05 | 75.55 ± 0.05 | 73.35 ± 0.04 | 76.71 ± 0.05 | 79.20 ± 0.06 |
| Caltech102 | 74.40 ± 0.05 | 74.43 ± 0.05 | 83.32 ± 0.05 | 80.00 ± 0.05 | 95.99 ± 0.03 |
| CCV | 75.03 ± 0.07 | 76.60 ± 0.07 | 79.15 ± 0.06 | 77.10 ± 0.07 | 85.11 ± 0.24 |
From the above experiments, we conclude that the proposed algorithm: 1) effectively addresses the issue of row/column absence in multiple kernel clustering; 2) consistently achieves performance superior to the compared algorithms, especially in the presence of intensive absence; and 3) better recovers the incomplete base kernels by taking the goal of clustering into account. In short, our algorithm well utilizes the connection between the imputation and clustering procedures, bringing forth significant improvements in clustering performance. In addition, our algorithm is theoretically guaranteed to converge to a local minimum according to (Bezdek and Hathaway 2003). In the above experiments, we observe that the objective value of our algorithm does monotonically decrease at each iteration and that it usually converges in fewer than 30 iterations. Two examples of the evolution of the objective value, on Flower17 and Flower102, are shown in Figure 3.
Figure 3:
Evolution of the objective value in our algorithm.
Conclusion
While MKC algorithms have recently demonstrated promising performance in various applications, they are unable to effectively handle the scenario where base kernels are incomplete. This paper proposes to jointly optimize the kernel imputation and clustering to address this issue. It makes these two learning procedures seamlessly integrated to achieve better clustering. The proposed algorithm effectively solves the resultant optimization problem and demonstrates clearly improved clustering performance in extensive experiments on benchmark data sets, especially when the missing ratio is high. In the future, we plan to further improve the clustering performance by considering the correlations among different base kernels (Bhadra, Kaski, and Rousu 2016).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (project No. U1435219, 61403405, 61672528 and 61232016). The authors would like to thank Dr. Luping Zhou for discussions on the application of our algorithm to Alzheimer's disease prediction.
Contributor Information
Xinwang Liu, School of Computer, National University of Defense Technology, Changsha, China, 410073.
Miaomiao Li, School of Computer, National University of Defense Technology, Changsha, China, 410073.
Lei Wang, School of Computer Science and Software Engineering, University of Wollongong, NSW, Australia, 2522.
Yong Dou, School of Computer, National University of Defense Technology, Changsha, China, 410073.
Jianping Yin, School of Computer, National University of Defense Technology, Changsha, China, 410073.
En Zhu, School of Computer, National University of Defense Technology, Changsha, China, 410073.
References
- Bezdek JC, and Hathaway RJ 2003. Convergence of alternating optimization. Neural, Parallel Sci. Comput 11(4):351–368.
- Bhadra S; Kaski S; and Rousu J 2016. Multi-view kernel completion. arXiv preprint arXiv:1602.02518.
- Cao X; Zhang C; Fu H; Liu S; and Zhang H 2015a. Diversity-induced multi-view subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015, 586–594.
- Cao X; Zhang C; Zhou C; Fu H; and Foroosh H 2015b. Constrained multi-view video face clustering. IEEE Trans. Image Processing 24(11):4381–4393.
- Cortes C; Mohri M; and Rostamizadeh A 2009. L2 regularization for learning kernels. In UAI, 109–116.
- Cortes C; Mohri M; and Rostamizadeh A 2012. Algorithms for learning kernels based on centered alignment. JMLR 13:795–828.
- Du L; Zhou P; Shi L; Wang H; Fan M; Wang W; and Shen Y-D 2015. Robust multiple kernel k-means clustering using ℓ21-norm. In IJCAI, 3476–3482.
- Ghahramani Z, and Jordan MI 1993. Supervised learning from incomplete data via an EM approach. In NIPS, 120–127.
- Gönen M, and Margolin AA 2014. Localized data fusion for kernel k-means clustering with application to cancer biology. In NIPS, 1305–1313.
- Jegelka S; Gretton A; Schölkopf B; Sriperumbudur BK; and von Luxburg U 2009. Generalized clustering via kernel embeddings. In KI 2009: Advances in Artificial Intelligence, 32nd Annual German Conference on AI, 144–152.
- Kloft M; Brefeld U; Sonnenburg S; Laskov P; Müller K; and Zien A 2009. Efficient and accurate lp-norm multiple kernel learning. In NIPS, 997–1005.
- Kloft M; Brefeld U; Sonnenburg S; and Zien A 2011. lp-norm multiple kernel learning. JMLR 12:953–997.
- Kumar R; Chen T; Hardt M; Beymer D; Brannon K; and Syeda-Mahmood TF 2013. Multiple kernel completion and its application to cardiac disease discrimination. In ISBI, 764–767.
- Li M; Liu X; Wang L; Dou Y; Yin J; and Zhu E 2016. Multiple kernel clustering with local kernel alignment maximization. In IJCAI, 1704–1710.
- Liu X; Wang L; Yin J; Zhu E; and Zhang J 2013. An efficient approach to integrating radius information into multiple kernel learning. IEEE Trans. Cybernetics 43(2):557–569.
- Liu X; Wang L; Zhang J; and Yin J 2014. Sample-adaptive multiple kernel learning. In AAAI, 1975–1981.
- Liu X; Wang L; Yin J; Dou Y; and Zhang J 2015. Absent multiple kernel learning. In AAAI, 2807–2813.
- Liu X; Dou Y; Yin J; Wang L; and Zhu E 2016. Multiple kernel k-means clustering with matrix-induced regularization. In AAAI, 1888–1894.
- Rakotomamonjy A; Bach FR; Canu S; and Grandvalet Y 2008. SimpleMKL. JMLR 9:2491–2521.
- Shao W; He L; and Yu PS 2015. Multiple incomplete views clustering via weighted nonnegative matrix factorization with ℓ2,1 regularization. In ECML PKDD, 318–334.
- Trivedi A; Rai P; Daumé H III; and DuVall SL 2010. Multiview clustering with incomplete views. In NIPS 2010: Machine Learning for Social Computing Workshop, Whistler, Canada.
- Xiang S; Yuan L; Fan W; Wang Y; Thompson PM; and Ye J 2013. Multi-source learning with block-wise missing data for Alzheimer's disease prediction. In ACM SIGKDD, 185–193.
- Xu C; Tao D; and Xu C 2015. Multi-view learning with incomplete views. IEEE Trans. Image Processing 24(12):5812–5825.
- Yu S; Tranchevent L-C; Liu X; Glänzel W; Suykens JAK; Moor BD; and Moreau Y 2012. Optimized data fusion for kernel k-means clustering. IEEE TPAMI 34(5):1031–1039.
- Zhang C; Fu H; Liu S; Liu G; and Cao X 2015. Low-rank tensor constrained multiview subspace clustering. In ICCV 2015, Santiago, Chile, December 7–13, 2015, 1582–1590.
- Zhang C; Fu H; Hu Q; Zhu P; and Cao X 2016. Flexible multi-view dimensionality co-reduction. IEEE Trans. Image Processing.
- Zhao B; Kwok JT; and Zhang C 2009. Multiple kernel clustering. In SDM, 638–649.
- Zhao H; Liu H; and Fu Y 2016. Incomplete multimodal visual data grouping. In IJCAI, 2392–2398.