MTECH PROJECTS
Online sketching of big categorical data with absent features
With the scale of data growing every day, reducing the dimensionality (a.k.a. sketching) of high-dimensional vectors has emerged as a task of increasing importance. Relevant issues to address in this context include the sheer volume of data vectors that may consist of categorical (meaning finite-alphabet) features, the typically streaming format of data acquisition, and the possibly absent features. To cope with these challenges, the present paper brings forth a novel rank-regularized maximum likelihood approach that models categorical data as quantized values of analog-amplitude features with low intrinsic dimensionality. This model, along with recent advances in online rank regularization, is leveraged to sketch high-dimensional categorical data "on the fly." Simulated tests with synthetic as well as real-world datasets corroborate the merits of the novel scheme relative to state-of-the-art alternatives.
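The data model in the abstract (categorical entries viewed as quantized versions of low-rank analog features, some of them absent) can be illustrated with a small synthetic generator. This is only a sketch of the assumed model, not the paper's estimator: the function name `simulate_categorical_stream`, the Gaussian latent factors, the noise level, the quantization thresholds, and the use of -1 as a marker for absent features are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_categorical_stream(n_features=50, n_vectors=200, rank=3,
                                thresholds=(-0.5, 0.5), missing_prob=0.2,
                                seed=0):
    """Generate categorical vectors as quantized low-rank analog features.

    All parameter names and default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Latent analog-amplitude features: every column lies in a subspace of
    # dimension `rank`, i.e. the data has low intrinsic dimensionality.
    subspace = rng.standard_normal((n_features, rank))
    coeffs = rng.standard_normal((rank, n_vectors))
    latent = subspace @ coeffs / np.sqrt(rank)

    # Small additive noise before quantization (assumed, for realism).
    noisy = latent + 0.1 * rng.standard_normal(latent.shape)

    # Quantize to a finite alphabet {0, ..., len(thresholds)}.
    categorical = np.digitize(noisy, np.asarray(thresholds))

    # Randomly drop entries to mimic absent features; -1 marks "missing".
    observed = rng.random(latent.shape) >= missing_prob
    data = np.where(observed, categorical, -1)
    return latent, data, observed


# Columns of `data` can be consumed one at a time to mimic streaming.
latent, data, observed = simulate_categorical_stream()
first_vector = data[:, 0]
```

A sketching scheme of the kind the abstract describes would receive only the columns of `data` (with their missing entries) one by one, and would try to recover a low-dimensional representation of the unseen `latent` features.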