Using a Transfer Adaptation Method for Dynamic Feature Expansion in a Multi-label Deep Neural Network for Recommender Systems

In this paper, we propose a convertible deep neural network (DNN) model with a transfer adaptation mechanism to deal with varying numbers of input and output neurons. The flexible DNN model serves as a multi-label classifier for the recommender system, as part of the retrieval system's push mechanism: it learns from combinations of tabular features and proposes a number of discrete offers (targets). Our retrieval system uses the transfer adaptation mechanism as follows: when the number of features changes, it replaces the input layer of the neural network, freezes the gradients of all following layers, trains only the replaced layer, and then unfreezes the entire model. The experiments show that the transfer adaptation technique yields a steadier loss decrease and faster learning during training. Furthermore, the proposed model demonstrates notable advantages in production scenarios: it is more efficient, with faster processing times and improved resource utilization, contributing to more sustainable and cost-effective training of machine learning solutions in real-world applications.

An automated process for discovering proper recommendations plays a crucial role in the e-commerce industry [1] [2]. Recommender systems are used for the push phase [3] in most retrieval engines [4] [5]. There are multiple strategies for building a proposal mechanism [6]; the two most widely accepted methods are content-based filtering [7] and collaborative filtering [8]. Generally, both methods operate on two disjoint sets of discrete nodes, called profiles and items. The profiles are active agents with historical data representing interactions with items. For example, profile P_i buys item I_j, where interactions between profiles and items can carry different meanings. The purpose of the recommender system is to forecast items for profiles based on their historical data.
In content-based filtering, recommendations are determined by the content of profiles that have already performed some actions; later, when a new profile is registered, the classifier measures distances [9] between the registered profile's features and those of the other profiles. Based on these distances [10], the system proposes the same or similar offers.
In collaborative filtering, new offers are proposed primarily by analyzing the interactions between profiles and items. This mapping with historical data yields a non-linear data representation where various AI methods, such as graph neural networks [11], can be used. Additionally, the structure can be enriched with temporal information, which adds a new dimension for evaluating recommender decisions. There are also hybrid approaches that take into account profile content [12] [13] [14] [15] as well as the profile-item interactions; most of them are based on DNNs [16] [17] that approximate matrix factorization via embedding layers.
In our system, we build the recommender as a multi-label deep neural network for handling tabular data [16], where each set of tabular features F is mapped to a set of discrete offers. The challenge such an architecture faces when the feature set changes is described in the "Problem statement" section. Next, in the "Method" section, we present a multiphase algorithm to solve dynamic input expansion in a multi-label classifier. Afterward, the results in the "Experiment" section confirm our assumptions regarding the transfer-adaptation strategy. In the "Conclusion" section, we describe possible extensions of the proposed method.

Problem statement.
Suppose there is a DNN-based model M in production that accepts a feature set F. At some point, the number of features changes from n to k, where k > n. The system must adapt model M to accept the extended number of features k at runtime.
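To make the constraint concrete, the following minimal PyTorch sketch (with hypothetical layer sizes) shows why a model with a fixed input layer cannot accept the extended feature set without modification:

```python
# Minimal illustration of the problem; the layer sizes are hypothetical.
import torch
import torch.nn as nn

n, k = 63, 103  # feature count before and after expansion

model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 14))

x_old = torch.randn(8, n)
model(x_old)  # works: the input layer expects n features

x_new = torch.randn(8, k)
try:
    model(x_new)  # fails: shape mismatch (k != n)
except RuntimeError as e:
    print(e)
```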
There are multiple ways to handle feature expansion [18] for neural network models in production. The most common approach is based on manually adjusting [19] the neural network's input sizes, followed by a training phase that starts from the initial state. But this organization has several drawbacks: 1. it effectively replaces one model with another, so the set of learnable parameters W is lost and must be rediscovered from the ground up; 2. it consumes more time and GPU energy for the retraining process.
One way to handle a growing number of attributes is to prepend a convolutional predecessor network to the fully connected layers [20] [21]. This approach uses multiple filters to produce an output of a fixed size. It works well for large datasets, but it still has an obstacle on the input side: the convolutional input layer must be initialized with a known number of channels.
Dealing with re-adjustable feature counts in a deep neural network requires rethinking the classical neural network model in favor of a dynamically changeable architecture on the input and output layers, which we describe in the following section.

Method.
Our approach provides an inexpensive way to adapt a DNN to feature expansion. These operations are performed by the part of our retrieval system responsible for the recommender push service [22]. The method consists of three phases for handling feature extension in production, described below.

Reproduction phase.
To provide a reliable online service for multiple clients during an architectural update, the model M is replicated as M′. Afterwards, all client requests are redirected to the new copy M′. This proxy model still handles the previous number of features n. In production, such an approach is called a blue-green strategy [23].
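As a rough illustration of this phase, the sketch below stands in for the retrieval system with a hypothetical in-process router; the class and method names are our own and only outline the blue-green hand-off, not the system's actual API:

```python
# A minimal sketch of the reproduction phase, assuming models are plain
# callables; PushRouter and its methods are hypothetical names.
import copy

class PushRouter:
    """Routes client requests to whichever model is currently live."""

    def __init__(self, model):
        self.live = model

    def replicate_and_redirect(self):
        # Blue-green step: serve a deep copy M' while M is being adapted.
        m_prime = copy.deepcopy(self.live)
        self.live = m_prime          # all new requests now hit M'
        return m_prime

    def predict(self, features):
        return self.live(features)
```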

Adaptation phase.
The adaptation phase contains the core concept of the transfer adaptation approach (similar to transfer learning [24]). The retrieval system updates the model's input/output layer, depending on the feature/offer expansion, and trains only the updated layer. The transfer-adaptation algorithm is shown in Figure 1, where L2_input is the second layer's input size. After replacing and initializing [25] the input layer, the rest of the DNN's layers are frozen (their autograd [26] [27] parameters are set to False). The process then trains (adapts) the model's input layer for e epochs (in our experiments, e ∈ [5..15]) to fit the k features. Then, all of the model's layers are unfrozen and autograd is reactivated. A similar operation is performed on the output layer of the multi-label classifier. The chosen number of epochs does not need to be as high as in full training: because only one layer must be retrained, the adaptation process is fast. After the adaptation is done, the whole model can continue to be trained for a relatively low number of epochs. In the "Experiment" section, we show the difference between training the whole model while skipping the adaptation step and training it with the adaptation included.
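The following PyTorch sketch outlines the adaptation step described above, assuming the model is an nn.Sequential whose first module is the input linear layer; the helper name and the data-loader interface are illustrative rather than the system's actual API:

```python
# A sketch of the adaptation phase, assuming an nn.Sequential model whose
# first module is the input nn.Linear layer; names are illustrative.
import torch.nn as nn
import torch.optim as optim

def adapt_input_layer(model, k, adapt_epochs, train_loader, loss_fn, lr=4e-4):
    # Replace the input layer so it accepts k features; its output size
    # must match the second layer's input size (L2_input in the paper).
    l2_input = model[0].out_features
    model[0] = nn.Linear(k, l2_input)   # fresh, randomly initialized layer

    # Freeze every parameter (autograd off), then unfreeze the new layer.
    for p in model.parameters():
        p.requires_grad = False
    for p in model[0].parameters():
        p.requires_grad = True

    opt = optim.Adam(model[0].parameters(), lr=lr)
    for _ in range(adapt_epochs):        # e in [5..15] in the experiments
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Unfreeze the whole model for regular fine-tuning afterwards.
    for p in model.parameters():
        p.requires_grad = True
    return model
```

The same pattern applies to the output layer when the number of offers grows: replace the final layer, freeze everything else, adapt, then unfreeze.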

Update phase.
After the adaptation, the model is put back into production. Figure 2 shows how the retrieval system handles feature extension to provide the push service at runtime. M(F_n) is the model that handles n features coming from the preprocessor object P(S), which receives structured/unstructured features from different sources S ∈ {s_1, s_2, ..., s_l}, including other models. The preprocessor object P provides structural data for the model that is in production (shown by the arrow); at the last stage, the retrieval system switches back from the copied model M′(F_n) to the adapted model M(F_k). Then, model M′ is destroyed and garbage collected. To show that our assumptions hold, we conducted experiments with various structural datasets that have multiple discrete target classes simultaneously. These multi-label classes represent offers, while the features are defined as aggregated data belonging to profiles.
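Continuing the hypothetical router sketch from the reproduction phase, the swap-back can be outlined as follows (again, the names are illustrative):

```python
# A sketch of the update phase, reusing the hypothetical PushRouter:
# traffic is pointed back at the adapted model M(F_k) and the proxy
# copy M'(F_n) loses its last reference, becoming garbage-collectible.
def swap_back(router, adapted_model):
    old_proxy = router.live
    router.live = adapted_model   # M(F_k) goes back to production
    del old_proxy                 # M' is destroyed / garbage collected
```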

Model.
We chose the simplest model architecture, consisting of a stack of fully connected layers [28] with ReLU [29] non-linear activation functions, followed by a list of output fully connected layers with a sigmoid activation function. We use a sigmoid activation on the last layer instead of softmax [30] because, in a multi-label classification problem, each offer's output is independent of the others. For the optimization strategy, we used the Adam [31] optimizer.
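A minimal sketch of this architecture in PyTorch, with hypothetical layer sizes, might look as follows; the per-offer sigmoid heads reflect the independence of the labels:

```python
# A sketch of the described architecture: a fully connected stack with
# ReLU, plus a list of per-offer output layers; sizes are hypothetical.
import torch
import torch.nn as nn

class MultiLabelNet(nn.Module):
    def __init__(self, n_features, n_offers, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One independent output head per offer, squashed by a sigmoid,
        # so the labels are not forced to compete as under softmax.
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_offers))

    def forward(self, x):
        h = self.body(x)
        return torch.cat([torch.sigmoid(head(h)) for head in self.heads], dim=1)

model = MultiLabelNet(n_features=103, n_offers=14)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
loss_fn = nn.BCELoss()  # per-label binary cross-entropy fits sigmoid outputs
```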

Experiment 1.
For the experiment, we took the yeast [32] [33] and foodtruck [34] structured datasets; yeast has 14 binary labels and 103 continuous features. To check how feature adaptation works, we created four datasets: a train/validation pair for before and a pair for after the feature expansion phase. We dropped the last 40 continuous features to simulate the state before the feature expansion and kept all the features for the second phase, giving us two datasets with 63 and 103 features (each with its own train/validation split).
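The preparation can be sketched as follows; the arrays below are random placeholders for the real yeast matrices, and the 80/20 split is an assumption:

```python
# A minimal sketch of the dataset preparation; X and Y are placeholders
# for the yeast feature matrix (103 continuous columns, 2417 rows) and
# its 14 binary labels. Loading code is omitted.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2417, 103))          # stands in for the real features
Y = rng.integers(0, 2, (2417, 14))   # stands in for the 14 labels

X_before = X[:, :63]   # drop the last 40 columns: pre-expansion state
X_after = X            # all 103 columns: post-expansion state

split = int(0.8 * len(X))            # assumed 80/20 train/validation split
X63_train, X63_val = X_before[:split], X_before[split:]
X103_train, X103_val = X_after[:split], X_after[split:]
Y_train, Y_val = Y[:split], Y[split:]
```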
For training, we created two model copies: to the first model we did not apply transfer adaptation, while to the second we did. The training process took three phases, shown in Figure 3: in the first phase, the model was trained with 63 features X_63; next, the dataset with 103 features X_103 was applied for 20 more epochs. In the first model, we updated the first neural network layer but did not apply the transfer adaptation technique. In the second model, all the steps were the same except that transfer adaptation was applied after updating the first layer. For the training and adaptation phases we used the same learning rate Lr = 0.0004. The adaptation phase took 15 epochs for only one layer, which was fast and inexpensive to compute.
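Putting the phases together, a driver for the comparison might look like the sketch below; it reuses the adapt_input_layer helper from the adaptation-phase sketch, assumes an nn.Sequential model indexable at position 0, and treats the phase-1 epoch count as illustrative:

```python
# A sketch of the three-phase comparison; train() is an illustrative
# helper, and the model layout is the assumption from earlier sketches.
import torch.nn as nn
import torch.optim as optim

def train(model, loader, loss_fn, lr=4e-4, epochs=20):
    opt = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def run(model, use_adaptation, loader63, loader103, loss_fn):
    train(model, loader63, loss_fn)                       # phase 1: X_63
    if use_adaptation:
        adapt_input_layer(model, k=103, adapt_epochs=15,
                          train_loader=loader103, loss_fn=loss_fn)
    else:
        # plain replacement of the input layer, no adaptation step
        model[0] = nn.Linear(103, model[0].out_features)
    train(model, loader103, loss_fn)                      # phase 2: X_103
```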

Conclusion
Our paper shows the advantage of a partial fine-tuning process after updating a neural network's architecture. Just by adapting one layer, the loss decreases much more smoothly, and for the same number of epochs the contrast in results is significant. This approach can also be used for other parts of retrieval systems, for example in the ranking model, where the number of features also increases at runtime.

Figure 2. Feature extension phases, performed by the retrieval system's process.

Table 1. The contrast of the learning process with and without applied transfer adaptation.