Depression is a widespread mental health disorder that affects individuals globally and can be detected using both traditional clinical methods and advanced technological approaches. This study presents a methodology for depression detection using a multimodal dataset comprising textual, audio, and visual modalities. Features from the audio and visual modalities are clustered and then classified; the same clustering-and-classification pipeline is applied to six audio features and two visual features. The proposed system incorporates an ensemble model composed of logistic regression, a support vector classifier, random forest, and gradient boosting (LSRG). The final depression level is determined by averaging the predictions of all modalities through late fusion, allowing the system to predict depression levels from heterogeneous data sources. On the Extended Distress Analysis Interview Corpus (E-DAIC), the Mamdani fuzzy inference system achieves 93.10% accuracy on the textual modality, the LSRG ensemble achieves 98.21% accuracy on the labeled E-DAIC data, and late fusion achieves 99.54% accuracy against E-DAIC's PHQ-8 labels. The study thus integrates multiple models through better-performing techniques and offers practical guidance for patients on using a multimodal depression detection system.
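As a rough illustration of the pipeline described above, the sketch below builds one LSRG ensemble per modality and fuses their outputs by averaging class probabilities. The feature shapes, modality names, soft-voting choice, and the use of scikit-learn's VotingClassifier are assumptions for illustration, not the exact implementation reported in the study.

```python
# A minimal sketch of the LSRG ensemble with late fusion, assuming
# scikit-learn and pre-extracted per-modality feature matrices.
# Feature shapes, modality names, and soft voting are illustrative
# assumptions, not the paper's exact configuration.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def make_lsrg():
    """LSRG: Logistic regression, Svc, Random forest, Gradient boosting."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("svc", SVC(probability=True)),  # probability=True enables soft voting
            ("rf", RandomForestClassifier()),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="soft",
    )


# Hypothetical per-modality features and binary depression labels.
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 6))   # six audio features
X_visual = rng.normal(size=(200, 2))  # two visual features
y = rng.integers(0, 2, size=200)

# One LSRG ensemble per modality.
models = {}
for name, X in [("audio", X_audio), ("visual", X_visual)]:
    models[name] = make_lsrg().fit(X, y)

# Late fusion: average the per-modality probabilities, then threshold.
fused_prob = np.mean(
    [models["audio"].predict_proba(X_audio)[:, 1],
     models["visual"].predict_proba(X_visual)[:, 1]],
    axis=0,
)
fused_prediction = (fused_prob >= 0.5).astype(int)
```

In a full system, the averaged score from the textual Mamdani fuzzy model would enter the same fusion step alongside the audio and visual probabilities before the final depression level is assigned.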