Applying Author Profiling On Reddit Comments At The Document-Level

  • Idriss Oulahbib, L2IS Laboratory, Faculty of Science and Techniques, Cadi Ayyad University, Marrakesh, Morocco
  • Meriem Benhaddi, L2IS Laboratory, Faculty of Science and Techniques, Cadi Ayyad University, Marrakesh, Morocco
  • Salah El hadaj, L2IS Laboratory, Faculty of Science and Techniques, Cadi Ayyad University, Marrakesh, Morocco
Keywords: Author Profiling, Document-Level, Machine Learning, Deep Learning, Reddit comments

Abstract

Author Profiling (AP) is the task of inferring an author's biological, psychological, and socio-cultural attributes, including but not limited to gender, age, religion, profession, and personality, from their written content. This task is commonly approached as a form of text classification, where models are trained on features extracted from the author's text to predict labels such as gender and age category. This study investigates the effectiveness of Machine Learning (ML), Deep Learning (DL), and Transformer-based models for age and gender classification at the document level on a large dataset of Reddit comments annotated using regular expressions (regex). We employed several algorithms: Naive Bayes (NB), Random Forest (RF), Logistic Regression (LR), Multilayer Perceptrons (MLP), one-dimensional Convolutional Neural Networks (CNN1D), and Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). For feature extraction, we used Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), dictionary scores from Linguistic Inquiry and Word Count (LIWC), and averaged FastText embeddings (both pre-trained and trained on Reddit), and we concatenated Subreddit embeddings to enhance contextual representation. Our experimental results showed that traditional ML models with TF-IDF features, particularly LR, achieved performance competitive with deeper architectures. The best accuracy for gender classification was obtained by the DistilBERT + Subreddit embeddings model, with 0.65 at the document level and 0.80 at the author level using majority voting. For age classification, the highest accuracy was 0.37 with the same model configuration, outperforming all baseline approaches. These findings indicate that Transformer-based models enriched with contextual features offer a significant improvement over ML and traditional DL models in document-level AP.
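
The author-level figure above comes from aggregating document-level predictions by majority voting. The abstract does not give the exact procedure, so the following is only a minimal sketch of one plausible reading: each comment's predicted label counts as one vote for its author. The function name `majority_vote`, the author IDs, and the labels are hypothetical, and the tie-breaking rule (first-seen label wins) is an assumption.

```python
from collections import Counter, defaultdict


def majority_vote(doc_predictions):
    """Aggregate per-document labels into one label per author.

    doc_predictions: iterable of (author_id, predicted_label) pairs,
    one pair per comment. Names and labels here are illustrative only.
    """
    votes_by_author = defaultdict(list)
    for author_id, label in doc_predictions:
        votes_by_author[author_id].append(label)

    # Counter.most_common orders equal counts by first occurrence,
    # so ties resolve to the label seen first (an assumed tie rule).
    return {
        author_id: Counter(labels).most_common(1)[0][0]
        for author_id, labels in votes_by_author.items()
    }


# Hypothetical document-level predictions for two authors.
preds = [
    ("u1", "female"),
    ("u1", "male"),
    ("u1", "female"),
    ("u2", "male"),
]
author_labels = majority_vote(preds)
```

Because individual comments are short and noisy, pooling votes across an author's comments can correct isolated document-level errors, which is consistent with the gap between the reported document-level and author-level accuracies.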
Published
2025-11-24
How to Cite
Oulahbib, I., Benhaddi, M., & El hadaj, S. (2025). Applying Author Profiling On Reddit Comments At The Document-Level. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2762
Section
ICCSAI'24