Exploring Autoencoder-based Representations for Tabular Data Classification

Il’murat Tokhtakhunov1,2

Marat Nurtas1,3

Alexander Neftissov4,5

Sharofiddin Pirnaev6

Ilyas Kazambayev4,5

Lalita Kirichenko4,5

1Department of Mathematical and Computer Modelling, International Information Technology University, 34/1 Manas street, Almaty, 05000, Kazakhstan
2School of Digital Technologies, Narxoz University, 55 Zhandosov street, Almaty, 050035, Kazakhstan
3Faculty of Information Technology, Al-Farabi Kazakh National University, 71 Al-Farabi Avenue, Almaty, 050040, Kazakhstan
4Science Innovation Center Industry 4.0, Astana IT University, Mangilik El C1, Astana, 010000, Kazakhstan
5Academy of Physical Education and Mass Sports, Mangilik El B2.2, Astana, 010000, Kazakhstan
6Department of Engineering Technological Machines, Tashkent State Transport University, 1 Temiryolchilar street, Mirabad district, Tashkent, 100167, Uzbekistan

 

Abstract

Autoencoders are evaluated as a means of constructing compact, informative vector representations for classification tasks on high-dimensional tabular data. The methodology addresses the limitations of traditional models that rely on manual feature engineering and task-specific training. Emphasis is placed on building a generalized look-alike model for targeted advertising using embeddings derived from subscriber-related entities. The approach is assessed on a real-world telecommunications dataset comprising subscriber demographics, devices, tariffs, and network characteristics. Experimental results demonstrate that autoencoder embeddings outperform classical dimensionality reduction methods such as Principal Component Analysis (PCA) in both predictive quality and computational efficiency. The compressed representations capture nonlinear patterns and semantic similarities, improving classification performance across multiple metrics. The study further introduces an integrated vector architecture formed by concatenating embeddings from heterogeneous entities. Cosine similarity is used to identify similar users, enabling a scalable, automated recommendation service for Business-to-Business (B2B) applications. Performance is benchmarked using standard quality metrics (precision, recall, F1-score, i.e., the harmonic mean of precision and recall, and Area Under the Receiver Operating Characteristic Curve (ROC AUC)) as well as business-specific indicators such as conversion rate and lift. The findings support the applicability of autoencoders to modeling complex tabular structures with minimal information loss. Future work includes the development of domain-specific autoencoder ensembles and the exploration of alternative vector similarity metrics for broader industrial adoption. The proposed solution can also be applied to water resource monitoring systems to improve classification and subsequent prediction.
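
As an illustration of the look-alike step summarized above, the following minimal Python sketch concatenates per-entity embeddings into one vector per subscriber and ranks candidates by cosine similarity. It is not the implementation used in the study. The function names, the toy arrays, and the choice to compare each subscriber against the centroid of a seed audience are assumptions made only for this example; the embeddings themselves would come from trained autoencoders, which are not shown.

```python
import numpy as np


def concat_embeddings(*blocks):
    """Join per-entity embedding matrices (rows aligned by subscriber)
    into a single integrated vector per subscriber."""
    return np.concatenate(blocks, axis=1)


def cosine_to_seed(vectors, seed_idx):
    """Cosine similarity of every subscriber to the seed-audience centroid.
    Using a centroid is an assumption of this sketch."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    centroid = unit[seed_idx].mean(axis=0)
    centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
    return unit @ centroid


def top_lookalikes(vectors, seed_idx, k):
    """Indices of the k non-seed subscribers most similar to the seed audience."""
    scores = cosine_to_seed(vectors, seed_idx)
    scores[seed_idx] = -np.inf  # exclude the seed subscribers themselves
    return np.argsort(-scores)[:k]


# Hypothetical toy embeddings for four subscribers and two entities.
demographics = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
devices = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

vectors = concat_embeddings(demographics, devices)
nearest = top_lookalikes(vectors, seed_idx=[0], k=1)
```

In this toy case the subscriber at index 1 is returned, since its concatenated vector points in nearly the same direction as the seed subscriber's. Normalizing each entity block before concatenation may be preferable when the blocks differ in scale; that choice is left open here.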