Applying Author Profiling On Reddit Comments At The Document-Level

Idriss Oulahbib; Meriem Benhaddi; Salah El hadaj

doi:10.19139/soic-2310-5070-2762

Idriss Oulahbib, L2IS Laboratory, Faculty of Science and Techniques, Cadi Ayyad University, Marrakesh, Morocco
Meriem Benhaddi, L2IS Laboratory, Faculty of Science and Techniques, Cadi Ayyad University, Marrakesh, Morocco
Salah El hadaj, L2IS Laboratory, Faculty of Science and Techniques, Cadi Ayyad University, Marrakesh, Morocco

Keywords: Author Profiling, Document-Level, Machine Learning, Deep Learning, Reddit comments

Abstract

Author Profiling (AP) is the task of inferring an author’s biological, psychological, and sociocultural attributes, such as gender, age, religion, profession, and personality, from their written content. It is commonly framed as text classification: models are trained on features extracted from the author’s text to predict labels such as gender and age category. This study investigates the effectiveness of Machine Learning (ML), Deep Learning (DL), and Transformer-based models for age and gender classification at the document level on a large dataset of Reddit comments annotated using regular expressions (REGEX). We employed Naive Bayes (NB), Random Forest (RF), Logistic Regression (LR), Multilayer Perceptrons (MLP), one-dimensional Convolutional Neural Networks (CNN1D), and Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). For feature extraction, we used Bag of Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), dictionary scores from Linguistic Inquiry and Word Count (LIWC), and averaged FastText embeddings (both pre-trained and trained on Reddit); we also concatenated Subreddit embeddings to enrich the contextual representation. Our experiments showed that traditional ML models with TF-IDF features, particularly LR, performed competitively with deeper architectures. The best gender classification accuracy was obtained by the DistilBERT + Subreddit embeddings model: 0.65 at the document level and 0.80 at the author level using majority voting. For age classification, the same model configuration reached the highest accuracy, 0.37, outperforming all baseline approaches. These findings indicate that Transformer-based models enriched with contextual features offer a significant improvement over ML and traditional DL models in document-level AP.
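The abstract reports a higher gender accuracy at the author level than at the document level when per-comment predictions are aggregated by majority voting. The following is a minimal sketch of that aggregation step, not the authors' implementation: the function names, the toy author IDs and labels, and the tie-breaking rule (first label seen wins) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label.

    Assumption: ties are broken in favor of the label that appears first.
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def author_level_predictions(doc_predictions):
    """Collapse (author_id, predicted_label) pairs into one label per author."""
    by_author = defaultdict(list)
    for author_id, label in doc_predictions:
        by_author[author_id].append(label)
    return {author: majority_vote(labels) for author, labels in by_author.items()}


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data: two authors with three comments each.
gold_by_author = {"a1": "F", "a2": "M"}
doc_predictions = [
    ("a1", "F"), ("a1", "F"), ("a1", "M"),
    ("a2", "M"), ("a2", "F"), ("a2", "M"),
]

# Document-level accuracy: every comment is scored on its own.
doc_acc = accuracy(
    [label for _, label in doc_predictions],
    [gold_by_author[author] for author, _ in doc_predictions],
)

# Author-level accuracy: one majority-voted label per author.
author_pred = author_level_predictions(doc_predictions)
authors = sorted(gold_by_author)
author_acc = accuracy(
    [author_pred[a] for a in authors],
    [gold_by_author[a] for a in authors],
)

print(f"document-level accuracy: {doc_acc:.2f}")
print(f"author-level accuracy:   {author_acc:.2f}")
```

In this toy example, each author has one misclassified comment out of three, so the document-level accuracy is 4/6 while the vote recovers the correct label for both authors. The numbers are illustrative only and are unrelated to the results reported above.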

Published

2025-11-24

How to Cite

Oulahbib, I., Benhaddi, M., & El hadaj, S. (2025). Applying Author Profiling On Reddit Comments At The Document-Level. Statistics, Optimization & Information Computing, 15(2), 1343-1356. https://doi.org/10.19139/soic-2310-5070-2762

Issue

Vol 15 No 2 (2026): ICCSAI'24

Section

Research Articles

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).