A Text-based Deep Reinforcement Learning Framework Using Self-supervised Graph Representation for Interactive Recommendation

CHAOYANG WANG, ZHIQIANG GUO, JIANJUN LI, GUOHUI LI, and PENG PAN, Huazhong University of Science and Technology

Due to its nature of learning from dynamic interactions and planning for long-run performance, Reinforcement Learning (RL) has attracted much attention in Interactive Recommender Systems (IRSs). However, most existing RL-based IRSs face a large discrete action space problem, which severely limits their efficiency. Moreover, data sparsity is another problem that most IRSs are confronted with. The utilization of recommendation-related textual knowledge can tackle this problem to some extent, but existing RL-based recommendation methods either neglect to combine textual information or are not suitable for incorporating it. To address these two problems, in this article, we propose a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation (TRGIR). Specifically, we leverage textual information to map items and users into the same feature space by a self-supervised embedding method based on the graph convolutional network, which greatly alleviates the data sparsity problem. Moreover, we design an effective method to construct an action candidate set, which directly reduces the scale of the action space. Two types of representative reinforcement learning algorithms have been applied to implement TRGIR. Since the action space of IRS is discrete, it is natural to implement TRGIR with Deep Q-learning Network (DQN). In the TRGIR implementation with Deep Deterministic Policy Gradient (DDPG), denoted as TRGIR-DDPG, we design a policy vector, which can represent the user's preferences, to generate discrete actions from the candidate set.
Through extensive experiments on three public datasets, we demonstrate that TRGIR-DDPG achieves state-of-the-art performance over several baselines in a time-efficient manner.

CCS Concepts: • Information systems → Recommender systems;

Additional Key Words and Phrases: Recommender system, representation learning, graph convolutional network, reinforcement learning, textual information

ACM Reference format: Chaoyang Wang, Zhiqiang Guo, Jianjun Li, Guohui Li, and Peng Pan. 2022. A Text-based Deep Reinforcement Learning Framework Using Self-supervised Graph Representation for Interactive Recommendation. ACM/IMS Trans. Data Sci. 2, 4, Article 44 (May 2022), 25 pages. https://doi.org/10.1145/3522596

A preliminary version of this article appeared in the 24th European Conference on Artificial Intelligence (ECAI). This work was partially supported by the National Natural Science Foundation of China under Grant No. 61672252 and the Fundamental Research Funds for the Central Universities under Grant No. 2019kfyXKJC021. Authors' address: C. Wang, Z. Guo, J. Li (corresponding author), G. Li, and P. Pan, Huazhong University of Science and Technology, Wuhan, China, 430074; emails: {sunwardtree, zhiqiangguo, jianjunli, guohuili, panpeng}@hust.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2022 Association for Computing Machinery. 2577-3224/2022/05-ART44 $15.00 https://doi.org/10.1145/3522596
1 INTRODUCTION

In the era of information explosion, recommender systems play a critical role in resisting information overload. Recently, the Interactive Recommender System (IRS) [49], which continuously recommends items to individual users and receives their feedback to refine its recommendation policy, has attracted much attention and plays an important role in personalized services, such as Amazon, Pandora, and YouTube.

In the past few years, there have been some attempts to address the interactive recommendation problem by modeling the recommendation process as a Multi-Armed Bandit (MAB) problem [22, 35, 49], but these methods are not explicitly designed for long-term planning, which makes their performance unsatisfactory [5]. It is well recognized that Reinforcement Learning (RL) performs excellently in finding policies for interactive long-running tasks, such as playing computer games [25] and solving simulated physics problems [23]. Therefore, it is natural to introduce RL to model the interactive recommendation process. In fact, there have recently been some works on applying RL to address the interactive recommendation problem [5, 6, 11, 16, 32, 44–47, 50].

However, most of the existing RL-based methods [6, 16, 32, 44–47, 50] suffer from the problem of making a decision in linear time complexity with respect to the size of the action space, i.e., the number of available items, which makes them inefficient (or unscalable) when the IRS action space is large. To improve efficiency, based on Deep Deterministic Policy Gradient (DDPG), Dulac-Arnold et al. [11] proposed DDPG-kNN, which first learns an action representation (vector) in a continuous hidden space, and then finds the valid item by a k-nearest-neighbor search.
However, because DDPG is not designed for a discrete IRS action space, and DDPG-kNN ignores the importance of each dimension in the action vector, the effectiveness of such a method is limited. Moreover, this method still needs to find the k nearest neighbors over the whole action space, which remains time-consuming. Recently, Chen et al. [5] proposed a tree-structured policy gradient recommendation framework, within which a balanced hierarchical clustering tree is built over the items. Picking an item is then formulated as seeking a path from the root to a certain leaf in the tree, which dramatically reduces the time complexity. But this method introduces the burden of building a clustering tree; in particular, when new items appear frequently, the tree needs to be reconstructed, which can be costly.

Moreover, most of the existing RL-based recommendation methods use past interaction data, such as ratings, purchase logs, or viewing history, to model user preferences and item features [5, 11, 48]. A major limitation of such methods is that they may suffer serious performance degradation when facing the data sparsity problem, which is very common in real-world recommendation systems. As is well known, textual information, such as comments by users and item descriptions provided by suppliers, contains more knowledge than interaction data. Nowadays, textual information is readily available on many e-commerce and review websites, such as Amazon and Yelp. Thanks to the invention of word embedding, applying textual information to recommendation is possible, and there have been some successful attempts in conventional recommender systems [3, 8, 51]. But for IRS, existing RL-based methods either neglect to leverage textual information or are not suitable for incorporating it due to their unique structures for processing rating sequences.
To address the aforementioned problems, in this article, we propose a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation (TRGIR). Specifically, to leverage textual information, we first embed descriptions and comments with pre-trained word vectors [27]. Then, we build a relation graph that consists of four types of nodes (user, item, description, and comment) and use the description and comment vectors to initialize the embeddings of the corresponding nodes. By a self-supervised embedding method based on the graph convolutional network (GCN) [19], we can learn embedding vectors of users and items with semantics, which alleviates the data sparsity problem to a great extent. Next, based on the user vectors, we classify users into several clusters by the K-means algorithm [1]. Inspired by the idea of collaborative filtering, we construct an action candidate set, which consists of positive, negative, and ordinary items selected based on the user's historical logs and the classification results, to directly reduce the scale of the action space.

Considering that the action in IRS is discrete and that the widely used Deep Q-learning Network (DQN) [25] is designed for discrete action space problems, we first present an implementation of TRGIR based on DQN, denoted as TRGIR-DQN. However, when facing a large action space, DQN's efficiency and its exploration ability to decide the proper action degrade significantly. To address this problem, we further propose to utilize DDPG [23] as the RL model and denote this implementation as TRGIR-DDPG. Specifically, we design a policy vector to generate discrete actions from the candidate set.
The policy vector, which can represent the user's preference in a feature space, is dynamically learned from the actor network of TRGIR-DDPG. By combining the candidate set with the policy vector, we can enhance the exploration ability and further improve efficiency. Finally, considering that it is too expensive to train and test our model in an online manner, we build an environment simulator to mimic online environments with principles derived from real-world data. Through extensive experiments on several real-world datasets with different settings, we demonstrate that TRGIR-DDPG achieves high efficiency and remarkable performance improvement over several state-of-the-art baselines, especially for large-scale, high-sparsity datasets. To sum up, the main contributions of this work are as follows:

• To reduce the negative influence of rating sparsity in IRSs, we build a relation graph, which is initialized with the description and comment embeddings calculated from textual information and pre-trained word vectors. By learning the embeddings of users and items with a GCN-based self-supervised embedding method on this graph, we can efficiently derive user and item vectors with semantics.
• Based on the idea of collaborative filtering, we classify users into several clusters and build the candidate set, which directly reduces the scale of the action space. Further, in TRGIR-DDPG, we represent the preferences of users by implicit policy vectors and propose a DDPG-based method to learn the policy vectors dynamically. The policy vector, combined with the candidate set, is used to generate discrete actions, which enhances the exploration ability and improves efficiency simultaneously.
• Extensive experiments are conducted on three benchmark datasets; the results verify the high efficiency and superior performance of TRGIR-DDPG over state-of-the-art methods for IRSs.
The remainder of this article is organized as follows: Section 2 discusses related work; Section 3 formally defines the research problem and details the proposed TRGIR framework, as well as the corresponding learning algorithms; Section 4 presents and analyzes the experimental results; Section 5 concludes the article with some remarks.

2 RELATED WORK

2.1 RL-based Recommendation Methods

RL-based recommendation methods usually formulate the recommendation procedure as a Markov Decision Process (MDP). They explicitly model the user's dynamic status and plan for long-run performance [5, 6, 11, 16, 32, 36, 44–47, 50]. As mentioned earlier, most existing RL-based methods [6, 16, 32, 36, 44–47, 50] suffer from the large-scale discrete action space problem. To address this problem in IRS, there have been some impressive attempts. Dulac-Arnold et al. [11] proposed DDPG-kNN, which first leverages prior information about the actions to embed them in a continuous space to generate a proto-action. Then, via a k-nearest-neighbor search, this method finds a set of discrete actions closest to the proto-action as candidates in logarithmic time, which improves efficiency dramatically. However, there are two flaws: (1) DDPG is not designed for a discrete IRS action space; (2) this method ignores the negative influence of the dimensions that users do not care about, which affects the performance of DDPG-kNN. Moreover, the k-nearest-neighbor search needs to be conducted on the whole action space, which still incurs a high runtime overhead. Later, Zhao et al. [48] used the actor network of an actor-critic network to obtain k weight vectors at once, each of which picks a maximum-score item from the remaining items. But the relation among these vectors is blurry, so the order of the k items cannot be explained.
Most recently, based on Deterministic Policy Gradient (DPG), Chen et al. [5] proposed a Tree-structured Policy Gradient Recommendation (TPGR) framework. In TPGR, a balanced hierarchical clustering tree is built over all the items. Making a decision can then be formulated as seeking a path from the root to a certain leaf in the clustering tree, which also reduces the time complexity significantly. But limited by the search method, which can only reach one leaf node at a time, this method only supports Top-1 recommendation. Moreover, when new items appear frequently, the clustering tree needs to be reconstructed, which incurs extra costs.

2.2 Text-related Recommendation Methods

Most recommendation models (including RL-based ones) that merely exploit the interaction matrix face the data sparsity problem, which can potentially be alleviated by exploiting the large amount of knowledge in textual information [51]. The development of deep learning in natural language processing (NLP) makes it possible to use textual information to enhance recommendation performance [3, 7, 8, 51]. In fact, there are already some works that incorporate into their models vectors obtained from textual information (such as descriptions and comments) by sentiment analysis [3], convolutional neural networks [9, 51], or word vectors pre-trained on large corpora [8] for better performance. Recently, some researchers have tried to combine Knowledge Graph (KG, a kind of relation graph) embedding models with the textual information of entities. Socher et al. [31] introduced an expressive neural tensor network suitable for reasoning over relations between two entities and found that performance improved when entities are represented as an average of their constituent word vectors. Xie et al. [41] proposed a representation learning method for KGs that takes advantage of entity descriptions, which are learned by a continuous bag-of-words model and a deep convolutional neural model.
Xiao et al. [40] proposed the semantic space projection model, which jointly learns from symbolic triples and textual descriptions, and showed its effectiveness.

IRS also suffers from the rating sparsity problem, but most of the existing RL-based methods for IRS either neglect to incorporate textual information [11, 16, 46, 47] or have difficulty in utilizing it, since they adopt time-related structures to input rating sequences [5, 50]. Recently, by combining images with textual information, Zhang et al. [44] proposed a novel constraint-augmented RL framework to efficiently incorporate user preferences over time. Specifically, they leveraged a discriminator to detect recommendations violating the user's historical preference, which is incorporated into the standard RL objective of maximizing expected cumulative future rewards. Different from our method, which introduces textual information to alleviate the rating sparsity problem, Reference [44] mainly focuses on utilizing constraint-augmented RL to address the problem that recommendations can easily violate preferences that users expressed in their past natural-language feedback. Moreover, like most existing RL-based methods [5, 6, 11, 16, 32, 45–47, 50], Reference [44] also suffers from the large-scale discrete action space problem, which is another limitation of existing methods that we intend to address in this work. It is also noteworthy that, in the domain of conversational recommender systems (CRSs), Basile et al. [2] proposed a framework that combines deep learning and reinforcement learning and uses text-based features to provide relevant recommendations and produce meaningful dialogues.
But different from CRS, in our RL-based method for IRS, the textual information is utilized to learn the implicit long-term preferences of users, not their proactive immediate needs.

2.3 Other Relevant Recommendation Methods

With the development of deep learning, there are some works that apply deep learning models to recommendation. Sedhain et al. [29] proposed AutoRec to learn embeddings that can reconstruct the ratings of a user from his records. Yao et al. [43] proposed Collaborative Denoising AutoEncoders (CDAE), which utilizes the idea of denoising auto-encoders and contains more flexible components with implicit feedback. Using a two-pathway neural network representation learning architecture, Xue et al. proposed deep matrix factorization (DMF) [42] to map users and items into a common low-dimensional latent space with non-linear projections, and then utilize cosine similarity as the matching function to calculate predictive scores. To learn the complex structure of user interaction data, He et al. [15] replaced the inner-product matching function with a non-linear MLP architecture. Moreover, by fusing the neural matching function learning structure MLP with the representation learning structure generalized matrix factorization, NeuMF was proposed to obtain better performance. Further, considering that DNN-based representation learning and matching function learning suffer from two fundamental flaws, i.e., the limited expressiveness of the inner product and the weakness in capturing low-rank relations, respectively, Deng et al. [10] proposed DeepCF, which combines the strengths of neural representation learning and neural matching function learning to overcome these flaws.

To gain more effective representations, some recent works try to exploit the structure of the interaction graph by propagating user and item embeddings on it.
GC-MC [4] was proposed to apply a GCN-based auto-encoder framework on the bipartite user-item graph, but it only employs GCN for link prediction between users and items. Inspired by GCN [19], NGCF [37] exploits the collaborative signal in the embedding function and explicitly encodes the signal in the form of high-order connectivity by performing embedding propagation. The embedding propagation rule of NGCF is the same as that of standard GCN, which was originally proposed for node classification on attributed graphs, where each node has rich attributes. By removing the feature transformation and non-linear activation that negatively increase the difficulty of training in NGCF, LightGCN [14] achieves significant accuracy improvements. Recently, Wu et al. [39] applied self-supervised learning on the user-item graph to improve the accuracy and robustness of GCNs for recommendation. Different from bipartite graphs that only contain user-item interactions, the heterograph we consider in this work contains other node types. Compared with GCN, the relational graph convolutional network (RGCN) [28] has been shown to be capable of dealing with the highly multi-relational data characteristic of heterographs. In view of this, we apply it to propagate information on our heterograph.

Fig. 1. Framework overview.

3 PROPOSED METHOD

3.1 Problem Formulation

We consider a recommender system with M users U = {u_1, ..., u_M} and N items V = {v_1, ..., v_N}, and use Y ∈ R^{M×N} to denote the rating matrix, where y_{i,j} is the rating of user u_i on item v_j. For the textual information, we denote D = {d_1, ..., d_P} and C = {c_1, ..., c_Q} as the sets of descriptions and comments, respectively. This kind of interactive Top-k recommendation process can be modeled as a special Markov Decision Process (MDP), where the key components are defined as follows:

• State.
Use S to denote the state space. A state s ∈ S is defined as the possible interaction between a user and the recommender system, which can be represented by n_s item vectors in a certain order.
• Action. Use A to denote the action space. An action a ∈ A contains k ordered items, each of which is represented by a vector. For interactive Top-k recommendation, the scale of A is large.
• Reward function. After receiving an action a at state s, our environment simulator returns a reward r, which reflects the user's feedback to the recommended items. We use R(s, a) to denote the reward function.
• Transition. In our model, since the state is a set of item vectors, once the action is determined and the user's feedback is given, the state transition is also determined.

Consider an agent that interacts with the environment E in discrete timesteps. At each timestep t, the agent receives a state s_t by observing the current environment, then takes an action a_t and gets a reward r_t. The agent's behavior is defined by a policy π, which maps states to a probability distribution over the actions, i.e., π : S → P(A). Based on the above notations, we can define the instantiated MDP for our recommendation problem, M = ⟨S, A, R, P, T, γ⟩, where T is the maximal decision step and γ is the discount factor. The objective of this work is to learn a policy π that maximizes the expected discounted cumulative reward.

3.2 Framework Overview

Figure 1 gives an overview of our framework, which contains two major steps: data preparation and training. In data preparation, we first build a relation graph that contains four types of nodes (user, item, description, and comment) and use vectors of descriptions and comments obtained from pre-trained word vectors [27] to initialize the embeddings of the description and comment nodes. Note that the embeddings of user and item nodes are initialized randomly.
On the relation graph, the user, item, description, and comment embeddings in the lth propagation layer are denoted as U^(l) = {u_1^(l), ..., u_M^(l)}, V^(l) = {v_1^(l), ..., v_N^(l)}, D^(l) = {d_1^(l), ..., d_P^(l)}, and C^(l) = {c_1^(l), ..., c_Q^(l)}, respectively. After propagating through L layers with a GCN-based self-supervised embedding method, we can learn the embeddings of users and items, U^(L) = {u_1^(L), ..., u_M^(L)} and V^(L) = {v_1^(L), ..., v_N^(L)}. Through the learning process on the relation graph, U^(L) and V^(L) can gain the semantics contained in D and C. Then, based on the user embeddings, we utilize the unsupervised K-means algorithm [1] to classify the users into several clusters, which will later be used to help construct the action candidate set.

In the training phase, with the objective of implementing a more personalized recommendation, we train a unique model for each cluster. Take cluster 2 as an example; we randomly select a user u_i from it. Based on the historical logs of u_i and the user classification results, we sample positive, negative, and ordinary items for u_i to construct a candidate set, which will later be used in the reinforcement model for action selection. The reinforcement model interacts with the simulator, which is based on historical logs, to learn the inner relations among all possible states and actions. For the specific implementation, we employ DQN [13, 25] (TRGIR-DQN) and DDPG [23] (TRGIR-DDPG) as our reinforcement model, respectively. In particular, by utilizing the policy vector in the DDPG implementation, we can improve efficiency dramatically. The training phase stops when the model loss becomes stable.
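The user-clustering step of the data-preparation phase can be sketched as follows. This is a minimal plain-numpy K-means over stand-in user embeddings (the embedding matrix, dimensionality, and cluster count here are illustrative assumptions, not the paper's actual settings), together with a farthest-cluster lookup of the kind later used when supplementing negative items:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain K-means over the rows of X (n_users x dim user embeddings)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each user to its nearest cluster center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers; keep the old center if a cluster is empty
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def farthest_cluster(centers, cl):
    """Index of the cluster whose center is farthest from cluster cl."""
    d = ((centers - centers[cl]) ** 2).sum(1)
    return int(d.argmax())

# illustrative data: 100 users with 16-dim embeddings, 4 clusters
users = np.random.default_rng(1).normal(size=(100, 16))
labels, centers = kmeans(users, k=4)
cl_f = farthest_cluster(centers, 0)  # the "cl_f" for cluster 0
```

In the framework, one reinforcement model is then trained per cluster; the `labels` array is what routes each user to its cluster-specific model.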
3.3 GCN-based Self-supervised Embedding

Descriptions and comments are the most important textual information in recommender systems. The descriptions, which contain items' advantages, and the comments, which contain users' attitudes, along with the ratings, can express the preferences of users well. To obtain expressive embeddings of users and items, we build a relation graph (as shown in the left part of Figure 1) including user nodes, item nodes, description nodes, and comment nodes. By initializing the node embeddings of descriptions and comments with the pre-trained word vectors GloVe [27], we can get semantics from the textual information. Note that the original descriptions and comments contain many meaningless words that can affect the quality of the constructed vector. We remove them in advance according to the Long Stopword List. Then, we pick the pre-trained word vectors GloVe.6B, which have been trained on large corpora (Wikipedia 2014 and Gigaword 5), to calculate d_p^(0) and c_q^(0). Specifically,

    d_p^{(0)} = \frac{1}{n_d} \sum_{i=1}^{n_d} w_i,    c_q^{(0)} = \frac{1}{n_c} \sum_{i=1}^{n_c} w_i,    (1)

where w_i denotes the vector of word w_i, and n_d (n_c) denotes the number of words that d_p (c_q) contains after removing the stop words. Note that word vectors with similar semantics have a closer Euclidean distance than word vectors with large semantic differences [27], which ensures that comments (or descriptions) with similar semantics are closer to each other. Different from d_p^(0) and c_q^(0), the initial embeddings of the user and item nodes, u_m^(0) and v_n^(0), are constructed randomly.

Then, we introduce a GCN-based self-supervised embedding method to learn the representations of users and items. Let e ∈ E denote the entities in the relation graph, where E = U ∪ V ∪ D ∪ C, and let e denote the representation of e.
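As a concrete illustration of Equation (1), the sketch below averages word vectors after stopword removal. The two-dimensional word-vector table and the stopword list are toy stand-ins for the real GloVe.6B vectors and the Long Stopword List:

```python
import numpy as np

# toy stand-ins for GloVe.6B and the stopword list (illustrative only)
word_vecs = {
    "guitar": np.array([1.0, 0.0]),
    "rock":   np.array([0.0, 1.0]),
    "album":  np.array([1.0, 1.0]),
}
stopwords = {"the", "a", "an", "of"}

def text_embedding(text):
    """Eq. (1): average the vectors of in-vocabulary, non-stopword tokens."""
    words = [w for w in text.lower().split()
             if w not in stopwords and w in word_vecs]
    if not words:
        return np.zeros(2)
    return np.mean([word_vecs[w] for w in words], axis=0)

d0 = text_embedding("a rock album of guitar")  # initial description embedding
```

Because averaging is linear, descriptions sharing many semantically close words end up near each other in the embedding space, which is exactly the property the initialization relies on.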
According to the feature propagation model of RGCN [28], an entity e_i is capable of receiving the messages propagated from its l-hop neighbors by stacking l embedding propagation layers. In the lth step, the embedding of e_i is recursively formulated as

    e_i^{(l)} = \sum_{r \in R} \sum_{e_j \in N_{e_i}^r} L_{e_i,e_j} W_r^{(l-1)} e_j^{(l-1)} + W_{self}^{(l-1)} e_i^{(l-1)},    (2)

where r ∈ R denotes one of the relations between different entities, e_i^{(l-1)} and e_j^{(l-1)} denote the embeddings of e_i and e_j generated by the previous message propagation step, N_{e_i}^r denotes the set of neighbor entities directly connected to e_i under relation r, W_r^{(l-1)} and W_{self}^{(l-1)} are trainable weights under relation r, and L_{e_i,e_j} = 1 / \sqrt{|N_{e_i}| |N_{e_j}|} is the symmetric normalization.

Fig. 2. (a) Example illustrating message propagation; (b) Example illustrating the calculation of U² and V².

As shown in Figure 2(a), we take the message propagation of user u as an example. It is clear that there are in total four types of relations (i.e., |R| = 4): the user-item relation, user-comment relation, item-comment relation, and item-description relation. After a two-step propagation, the messages from the related nodes under different relations can be aggregated to the target node u.

Note that the word vectors in GloVe have linear substructures [27]. By the averaging operation, the linear substructure feature is kept in the initial embeddings d_p^(0) and c_q^(0). To avoid destroying the linear substructures and to accelerate the training phase, we remove the non-linear activation functions of standard RGCN and find that this operation improves the performance.

https://www.ranks.nl/stopwords.
http://nlp.stanford.edu/data/glove.6B.zip.
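One propagation step of Equation (2) can be sketched in plain numpy as follows. The dense nested-loop form, the toy relation set, and the per-relation neighbor counts used in the normalization are illustrative assumptions (an efficient implementation would use sparse matrices); the absence of a non-linearity mirrors the removed activations noted above:

```python
import numpy as np

def rgcn_layer(E, neighbors, W_rel, W_self):
    """One linear RGCN-style step (Eq. 2): each entity i accumulates
    normalized, relation-specific messages from its neighbors plus a
    self transform.  E: (n, d) embeddings; neighbors[r] maps every
    entity with edges under relation r to its neighbor list (edges are
    assumed undirected, so both endpoints have an entry)."""
    out = E @ W_self.T  # self-connection term W_self @ e_i for every i
    for r, nbrs in neighbors.items():
        Wr = W_rel[r]
        for i, js in nbrs.items():
            for j in js:
                # symmetric normalization L = 1 / sqrt(|N_i| * |N_j|)
                norm = 1.0 / np.sqrt(len(nbrs[i]) * len(nbrs[j]))
                out[i] += norm * (Wr @ E[j])
    return out

# toy graph: 3 entities, one relation, chain 0 - 1 - 2
E = np.eye(3)
neighbors = {"user-item": {0: [1], 1: [0, 2], 2: [1]}}
W_rel = {"user-item": np.eye(3)}
H1 = rgcn_layer(E, neighbors, W_rel, np.eye(3))
```

Stacking L calls of `rgcn_layer` (with per-layer weights) yields the L-hop aggregation described in the text.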
The self-supervised loss function of our embedding method encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct. Inspired by the contrastive self-supervised method [26] and considering the linear substructures of these representations in Euclidean space, we choose the mean square error as the distance metric. Based on a randomly selected batch of users U^b and items V^b, the loss function over the graph network parameters θ_G is designed as

    L(θ_G) = \sum_{u_m \in U^b} \Big[ \sum_{u_i \in N_{u_m}} (u_m - u_i)^2 - \sum_{u_j \in F_{u_m}} (u_m - u_j)^2 \Big]
           + \sum_{v_n \in V^b} \Big[ \sum_{v_i \in N_{v_n}} (v_n - v_i)^2 - \sum_{v_j \in F_{v_n}} (v_n - v_j)^2 \Big] + \lambda \|\theta_G\|_2^2,    (3)

where N_{u_m} and N_{v_n} are the nearby-node sets of user u_m and item v_n, F_{u_m} and F_{v_n} are the disparate-node sets (nodes far from the current node) of user u_m and item v_n, and λ is a hyper-parameter that controls the strength of the regularizer to avoid model overfitting. Note that we define U² = H × Hᵀ and V² = Hᵀ × H, where H ∈ R^{M×N} denotes the user-item adjacency matrix. For any h_{m,n} ∈ H, if u_m has interacted with v_n, then h_{m,n} = 1; otherwise, h_{m,n} = 0. Moreover, to help understand the calculation of U² and V², Figure 2(b) presents an example based on the relational graph depicted in Figure 1. Note that both U² and V² are symmetric matrices, i.e., U²_{m,i} = U²_{i,m} and V²_{n,i} = V²_{i,n}. The diagonal elements of U² (or V²) represent the total number of interacted items (or users) of the corresponding user (or item). If there is at least one item that both users u_m and u_j have interacted with, then U²_{m,j} > 0 (e.g., U²_{1,2} = 1); otherwise, U²_{m,j} = 0 (e.g., U²_{1,3} = 0). The same rule goes for V². Hence, U²_{m,j} > 0 represents that there exists at least one item preferred by both u_m and u_j, while V²_{n,i} > 0 represents that there exists at least one user who prefers both v_n and v_i. For any user u_x, if U²_{m,x} > 0, then u_x belongs to N_{u_m}; otherwise, u_x belongs to F_{u_m}.
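The co-interaction matrices behind this nearby/disparate split can be computed directly from the adjacency matrix H. A toy sketch, with an assumed 3-user × 4-item interaction matrix (the data is illustrative, not from the paper):

```python
import numpy as np

# assumed toy user-item adjacency H (3 users x 4 items)
H = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])

U2 = H @ H.T   # U2[m, j] > 0 iff users m and j share an interacted item
V2 = H.T @ H   # V2[n, i] > 0 iff items n and i share an interacting user

def nearby_and_far(U2, m):
    """Split the other users into nearby (co-interacting) and disparate sets."""
    others = [x for x in range(U2.shape[0]) if x != m]
    near = [x for x in others if U2[m, x] > 0]
    far = [x for x in others if U2[m, x] == 0]
    return near, far

near0, far0 = nearby_and_far(U2, 0)  # users sharing an item with user 0
```

Here user 0 and user 1 both interacted with item 1, so user 1 is "nearby" for user 0, while user 2 (who only touched item 3) is "disparate"; the item-side sets are obtained the same way from V2.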
Similar definitions go for N_{v_n} and F_{v_n}.

Fig. 3. An example to illustrate how the text could benefit the recommendation process.

The training of our GCN-based self-supervised method is independent of the RL model and stops when the loss becomes stable. Moreover, for a specific batch of users and items, when their relations change, we can utilize Equation (3) to update the neural network and their embeddings for this batch locally. After propagating through L layers, we can obtain u_m^(L) and v_n^(L) as the important foundation for clustering and for the construction of states and actions.

To better illustrate the superiority of integrating textual information, we pick a real user (with ID: A2P49WD75WHAG5) in the Amazon Digital Music dataset (one of the datasets used in our experiments) to illustrate how the text could benefit our recommendation process. As shown in Figure 3, the left scatter plot shows the distribution of the interacted items' embeddings obtained by Matrix Factorization (MF) without textual information, and the right scatter plot shows the same items' embeddings obtained by the Self-supervised Graph (SG) representation learning method with textual information (the category information). We utilize the widely used principal component analysis [12] to reduce the dimension of the above-mentioned embeddings to two. We can observe that the relative distance between 2,365 ("Gia") and 2,607 ("Eighteen Visions") is much shorter than that between 2,365 and 2,464 ("Fijacion Oral") in the left scatter plot, while the opposite can be observed in the right scatter plot.
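The 2-D visualization described above amounts to projecting the item embeddings onto their top two principal components. A minimal numpy sketch (the embedding matrix here is a random stand-in for the learned item vectors):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)  # center the embeddings
    # the right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# stand-in for the interacted items' learned embeddings
items = np.random.default_rng(0).normal(size=(50, 32))
coords = pca_2d(items)  # (50, 2) points, ready for a scatter plot
```

Relative distances in `coords` approximate those in the full embedding space, which is what makes the side-by-side comparison of the MF and SG embeddings in Figure 3 meaningful.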
By carefully screening the category information of these three items, we can conclude that 2,365 is more similar to 2,464 than to 2,607, and hence should be closer to 2,464 in the visualized embedding graph. This demonstrates the effectiveness of integrating textual information for better embedding learning.

3.4 Construction of the Candidate Set

In our RL-based recommendation, the state is defined as a set of n_s items. For a model that recommends Top-k items at once, there are a total of A_{M-n_s}^k (note that A here denotes a permutation) actions that can be chosen. With the increase of the number of items (M), the scale of the action space increases rapidly. Based on the assumption that a user's preferences can be captured by a set of items that the user likes and dislikes, we pick the positive and negative items to build a candidate set c. Additionally, to maintain generalization, we randomly add some ordinary items into c. Given a user u_i, according to the historical logs, if the corresponding rating is greater than a given bound y_b (e.g., y_b = 2 in a rating system with the highest rating 5), then the interacted record is regarded as positive; otherwise, it is negative. We use V_{u_i}^p and V_{u_i}^n to denote the sets of items in u_i's positive and negative interacted records, respectively. For u_i, we sample

ALGORITHM 1: Candidate set construction for u_i
Input: n_c, α, V_{u_i}^p, V_{u_i}^n, V_{cl_l}^p, and V_{cl_{l_f}}^p.
Output: Candidate set c.
1:  Initialize c = ∅, n_pos = n_c × α
2:  if n_pos ≤ |V_{u_i}^p| then
3:      c ← randomly select n_pos items from V_{u_i}^p;
4:  else
5:      n_pos = |V_{u_i}^p|; c ← V_{u_i}^p;
6:  end
7:  n_neg = (n_c − n_pos)/2
8:  if n_neg ≤ |V_{u_i}^n| then
9:      c ← c ∪ randomly select n_neg items from V_{u_i}^n;
10: else
11:     n_neg = n_neg − |V_{u_i}^n|; c ← c ∪ V_{u_i}^n;
12:     V_neg ← V_{cl_{l_f}}^p − (V_{cl_{l_f}}^p ∩ (V_{u_i} ∪ V_{cl_l}^p));
13:     if n_neg ≤ |V_neg| then
14:         c ← c ∪ randomly select n_neg items from V_neg;
15:     else
16:         c ← c ∪ V_neg;
17:     end
18: end
19: n_ord = n_c − |c|;
20: c ← c ∪ randomly select n_ord items not in c;
21: return c;

positive items from V_{u_i}^p, negative items from V_{u_i}^n, and ordinary items at random. Since users usually skip the items they do not like, the negative items in V_{u_i}^n are rare [24]. Based on the reverse of the collaborative-filtering intuition, i.e., the more two users differ, the more likely one's likes are the other's dislikes, we classify users into several clusters by K-means [1] to supplement negative items. Specifically, as shown in Figure 4(a), we denote the set of items that appear in the positive interacted records of users in cluster cl_l as V_{cl_l}^p (user u_i belongs to cluster cl_l) and use cl_{l_f} to denote the cluster that has the farthest distance from the current cluster cl_l. If the negative items in V_{u_i}^n are not enough, then the remaining negative items are selected from V_neg ← V_{cl_{l_f}}^p − (V_{cl_{l_f}}^p ∩ (V_{u_i} ∪ V_{cl_l}^p)). In this way, we reduce the scale of the action space from M − n_s to n_c, where n_c is the number of items in the candidate set c.

Algorithm 1 shows the details of the candidate set construction, in which the positive items account for at most a fraction α of the candidate set (α is a hyper-parameter), and the negative and ordinary items each take 50% of the remaining part of n_c (line 7).
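Algorithm 1 can be sketched in simplified form as follows. This is a hypothetical helper, not the authors' code; the set-difference step for the farthest cluster is condensed into one list comprehension:

```python
import random

def build_candidate_set(n_c, alpha, pos_items, neg_items, far_cluster_pos,
                        all_items, seed=0):
    """Simplified sketch of Algorithm 1: fill at most alpha*n_c positives,
    top up negatives first from the user's own logs and then from the
    positive items of the farthest cluster, and pad with ordinary items."""
    rng = random.Random(seed)
    c = []
    n_pos = min(int(n_c * alpha), len(pos_items))
    c += rng.sample(pos_items, n_pos)
    n_neg = (n_c - n_pos) // 2
    own_neg = rng.sample(neg_items, min(n_neg, len(neg_items)))
    c += own_neg
    # Supplement negatives from the farthest cluster, excluding the
    # user's own positives and items already chosen.
    pool = [v for v in far_cluster_pos if v not in pos_items and v not in c]
    c += rng.sample(pool, min(n_neg - len(own_neg), len(pool)))
    # Pad the rest with ordinary items so that |c| == n_c.
    rest = [v for v in all_items if v not in c]
    c += rng.sample(rest, n_c - len(c))
    return c

items = list(range(100))
c = build_candidate_set(n_c=10, alpha=0.2, pos_items=[1, 2, 3],
                        neg_items=[4], far_cluster_pos=[50, 51, 52, 53],
                        all_items=items)
print(len(c))  # always the fixed candidate size, here 10
```

Because the candidate size is fixed and only sampling and merging are involved, this construction runs in (amortized) constant time per call, as discussed below.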
In the training phase, since constructing a candidate set involves only simple operations, such as random selection and merging, and the size of the candidate set is fixed, the time complexity of Algorithm 1 is constant.

3.5 Specific Implementations of TRGIR

The goal of a typical reinforcement learning model is to learn a policy π that maximizes the discounted future reward, i.e., the Q-value, which is usually estimated by the Q-value function Q(·).

Fig. 4. (a) An example illustrating the components of the candidate set; (b) The structure of TRGIR-DQN.

Combined with deep neural networks, many algorithms try to approximate the Q-value function; they can be roughly categorized into three types: value-based (e.g., DQN [25], Double DQN [13]), policy-based (e.g., DPG [30]), and hybrid algorithms (e.g., DDPG [23]). Considering that value-based DQN [13, 25] is widely utilized in scenarios where the action space is discrete, we first implement TRGIR with DQN (TRGIR-DQN) to show its effectiveness.

3.5.1 Implementation with DQN. The structure of TRGIR-DQN is shown in Figure 4(b); we utilize the improved DQN (Double DQN) [13] as the RL model. To avoid overestimation and thus improve performance, Double DQN decouples the action selection from the target Q-value calculation. To make the action selection more reasonable, we introduce the features of items as input by concatenating the item vector of c with s_t to derive φ_t,

\phi_t = \phi(s_t, c_t^k) = c_t^k \oplus s_t,    (4)

where ⊕ denotes the vector concatenation operation and c_t^k denotes the vector of the t-th item in the k-th candidate set c_k. The concatenation of the state s_t and c_t^k determines the action selection. In each timestep t, the Q-value network takes φ_t as input.
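The feature construction in Equation (4) can be sketched as follows. This is a toy illustration with assumed two-dimensional embeddings (not the authors' implementation); the state is taken to be the concatenated embeddings of the n_s most recent items:

```python
# Sketch of Equation (4): the Q-network input phi_t is the concatenation
# of the t-th candidate item's vector and the current state vector.

def build_phi(item_vec, state_vecs):
    """phi_t = c_t^k (+) s_t, with (+) denoting vector concatenation."""
    state = [x for v in state_vecs for x in v]  # flatten the n_s item vectors
    return item_vec + state

item_vec = [0.1, 0.9]                  # candidate item embedding
state_vecs = [[0.2, 0.3], [0.5, 0.4]]  # n_s = 2 recently interacted items
phi = build_phi(item_vec, state_vecs)
print(len(phi))  # (1 + n_s) * embedding_dim = 6
```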
Through a multilayer perceptron (MLP) network, we learn two Q-values, which we term the recommendation action and the skip action, respectively. If the Q-value of the recommendation action is greater than that of the skip action, then we recommend item c_t^k; otherwise, we skip it. As shown in Figure 4(b), after receiving a_t, s_t, and c_t^k, the simulator returns the next state s_{t+1}. Specifically, according to the Q-value, if we recommend item c_t^k, then we put c_t^k at the head of s_t and select the top n_s items as s_{t+1}; otherwise, we let s_{t+1} = s_t.

Algorithm 2 shows the training phase of TRGIR-DQN. By maximizing the cumulative discounted rewards, the model parameters (θ_l and θ_l') can be learned. Based on the assumption that similar users have similar preferences, our method classifies users into clusters and trains a model for each cluster. In the beginning, we randomly initialize θ_l and the replay buffer B. After constructing a candidate set c_k for u_i in cluster cl_l, the agent selects and executes an action according to an ε-greedy policy. The TRGIR-DQN algorithm focuses on minimizing the gap between the current Q-value Q(s, a|θ_l) and the expected Q-value z_j, which is measured by the following loss:

L(\theta_l) = \mathbb{E}_{s_j, a_j \sim \rho(\cdot)} \big[ (z_j - Q(\phi_j, a_j|\theta_l))^2 \big],    (5)

ALGORITHM 2: Learning TRGIR-DQN
Input: The maximum step T; the number of user clusters n_cl; target network update rate τ; the ε-greedy policy rate ε; the action size n_a; the candidate size n_c.
Output: Model parameters θ and θ'.
1:  for l = 1 to n_cl do
2:      Randomly initialize the current action-value network parameters θ_l;
3:      Initialize the target network: θ_l' = θ_l;
4:      Initialize replay buffer B;
5:      repeat
6:          Randomly select u_i in cluster cl_l;
7:          Initialize observation state s_1;
8:          for k = 1 to T//n_c do
9:              Construct a candidate set c_k for u_i by Algorithm 1;
10:             for t = 1 to n_c − 1 do
11:                 With probability ε randomly select an action a_t;
12:                 Otherwise get φ_t according to Equation (4), a_t = max_a Q(φ_t, a|θ_l);
13:                 Interact with the simulator by a_t and observe r_t and s_{t+1}, φ_{t+1} = φ(s_{t+1}, c_{t+1}^k);
14:                 Store transition (φ_t; a_t; r_t; φ_{t+1}) in B;
15:                 Sample a random minibatch of N transitions (φ_j; a_j; r_j; φ_{j+1}) from B;
16:                 Update the current network parameters θ_l according to Equation (5);
17:                 Update the target network parameters: θ_l' = τθ_l + (1 − τ)θ_l';
18:             end
19:         end
20:     until converge;
21: end
22: return θ and θ'

where ρ(·) is the action distribution, and the expected Q-value z_j can be defined as

z_j = \begin{cases} r_j + \gamma Q(\phi_{j+1}, \max_{a_{j+1}} Q(\phi_{j+1}, a_{j+1}, \theta_l), \theta_l'), & \text{if non-terminal;} \\ r_j, & \text{if terminal.} \end{cases}    (6)

Differentiating the loss function with respect to the weights, we arrive at the gradient

\nabla_{\theta_l} J = \mathbb{E}_{s_j, a_j \sim \rho(\cdot); s_{j+1} \sim \mathcal{E}} \big[ (r_j + \gamma Q(\phi_{j+1}, \max_{a_{j+1}} Q(\phi_{j+1}, a_{j+1}, \theta_l), \theta_l') - Q(\phi_j, a_j|\theta_l)) \nabla_{\theta_l} Q(\phi_j, a_j; \theta_l) \big].    (7)

Then, we update the target network parameters by soft updates [23] with rate τ. Finally, when the loss becomes stable, training stops. To obtain k items at once, we order the recommended items by the Q-values of the recommendation action.

3.5.2 Implementation with DDPG. DDPG combines the advantages of DQN and DPG [30] and can concurrently learn a policy and Q(s_t, a_t) in high-dimensional, continuous action spaces by using neural function approximation [23]. For further performance and efficiency improvement, we also implement TRGIR with DDPG (TRGIR-DDPG). It is noteworthy, however, that the action space of IRSs is discrete and thus not directly suitable for DDPG.
To address this problem, inspired by DDPG-kNN [11], we propose the policy vector p, which represents the user's preference and can be dynamically learned by TRGIR-DDPG. By utilizing p to generate discrete actions from the candidate set, we bridge the gap between the discrete actions in IRSs and the continuous actions in DDPG.

Fig. 5. (a) The structure of TRGIR-DDPG; (b) An example illustrating the policy vector for TRGIR-DDPG.

For Top-k recommendation, TRGIR-DQN has to calculate Q-values for every item to decide whether to recommend it or not, which is still inefficient when the scale of the candidate set is large. Moreover, a large-scale action space also limits the exploration ability of TRGIR-DQN, which in turn affects the performance. To solve these problems, we utilize DDPG as the RL model, which can explore a large continuous space efficiently, to enhance the exploration ability and further improve efficiency.

The structure of TRGIR-DDPG is shown in Figure 5(a). During the embedding process, we have mapped the discrete actions into the continuous feature space, where each item is represented by a feature vector. Then, by computing the dot product between p and the item vectors in c_t, we can select actions from a discrete space. Figure 5(b) gives an example to help understand the policy vector. Suppose a user selects a movie according to preferences that can be represented as explicit policies such as Prefer Detective Comics, Insensitive to genres, and Like Superman. With our method, a policy vector in the feature space, e.g., (0.7, 0.5, 0.1, 0.9), can be learned, where the value of each dimension represents how much emphasis this user places on that dimension.
By computing the dot product between the policy vector and the item vectors, we can finally choose the movie Superman Returns, with the highest score of 2.79, for recommendation (assuming Top-1 recommendation here).

In each timestep t, the actor network takes a state s_t as input. Through an MLP network, we learn a continuous vector, which we term the policy vector, denoted by p_t. The critic network takes the state s_t and the policy vector p_t as input. Through another MLP network, it learns the current Q-value to evaluate p_t. As illustrated in Figure 5(b), p_t represents a user's preferences in the feature vector space; it is a continuous weight vector that measures the importance of each dimension. Combining it with the candidate set c_t, we can get the n_a items with the highest scores, each of which is scored by

Score(v_i) = p_t \cdot v_i.    (8)

TRGIR-DDPG generates s_{t+1} in a sliding-window manner. Specifically, among the ordered items in a_t, we keep the order and select the items that are not in s_t as a_t'. Then, we put a_t' at the head of s_t and select the top n_s items as s_{t+1}. Moreover, to cover the action space to a large extent, the candidate set is randomly regenerated at each timestep.

The training phase (shown in Algorithm 3) learns the model parameters θ^Q, θ^μ, θ^{Q'}, and θ^{μ'} by maximizing the cumulative discounted rewards of all decisions. As mentioned before, TRGIR-DDPG also trains a model for each cluster. At the beginning of the training phase, we

ALGORITHM 3: Learning TRGIR-DDPG
Input: The maximum step T; the number of user clusters n_cl; target network update rate τ; the action size n_a.
Output: Model parameters θ^Q, θ^μ, θ^{Q'}, and θ^{μ'}.
1:  for l = 1 to n_cl do
2:      Randomly initialize the critic network Q(s, p|θ_l^Q) and the actor network μ(s|θ_l^μ);
3:      Initialize the target networks: θ_l^{Q'} = θ_l^Q, θ_l^{μ'} = θ_l^μ;
4:      Initialize replay buffer B;
5:      repeat
6:          Randomly select u_i in cluster cl_l;
7:          Initialize observation state s_1;
8:          Initialize a random process N for exploration;
9:          for t = 1 to T do
10:             Construct a candidate set c_t for u_i by Algorithm 1;
11:             p_t = μ(s_t|θ_l^μ) + N;
12:             Use p_t to select n_a items from c_t as a_t;
13:             Interact with the simulator by a_t and observe r_t and s_{t+1};
14:             Store transition (s_t; p_t; r_t; s_{t+1}) in B;
15:             Sample a random minibatch of N_b transitions (s_j; p_j; r_j; s_{j+1}) from B;
16:             Update the critic by minimizing Equation (9);
17:             Update the actor by Equation (11);
18:             Update the target networks: θ_l^{Q'} = τθ_l^Q + (1 − τ)θ_l^{Q'}, θ_l^{μ'} = τθ_l^μ + (1 − τ)θ_l^{μ'};
19:         end
20:     until converge;
21: end
22: return θ^Q, θ^μ, θ^{Q'}, and θ^{μ'}

randomly initialize the network parameters and the replay buffer B. For better action exploration, we initialize a random process N, which adds some uncertainty when generating p. The critic network focuses on minimizing the gap between the current Q-value Q(s_j, p_j|θ_l^Q) and the expected Q-value z_j, which is measured by the following loss:

L(\theta_l^Q) = \frac{1}{N_b} \sum_j \big( z_j - Q(s_j, p_j|\theta_l^Q) \big)^2,    (9)

where z_j can be expressed recursively using the Bellman equation,

z_j = r_j + \gamma Q\big(s_{j+1}, \mu(s_{j+1}|\theta_l^{\mu'}) \big| \theta_l^{Q'}\big).    (10)

The objective of the actor network is to optimize the policy vector p by maximizing the Q-value. The actor network is trained with the sampled policy gradient:

\nabla_{\theta_l^\mu} J \approx \nabla_p Q(s, p|\theta_l^Q)\big|_{s=s_j, p=\mu(s_j)} \, \nabla_{\theta_l^\mu} \mu(s|\theta_l^\mu)\big|_{s_j}.    (11)

For TRGIR-DDPG, we also update the parameters of the target actor and target critic networks by soft updates [23] with rate τ. Finally, when the loss becomes stable, the training phase stops.
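Two ingredients of Algorithm 3 can be sketched in isolation: selecting discrete actions with the policy vector via Equation (8), and the soft target-network update of line 18. This is a toy illustration with assumed 4-dimensional vectors, not the authors' code:

```python
# Sketch of Equation (8) (dot-product scoring) and the soft update
# theta' = tau * theta + (1 - tau) * theta'.

def select_actions(p, candidate_vecs, n_a):
    """Score each candidate item by p . v_i and return the indices of
    the n_a highest-scoring items, in descending score order."""
    scores = [sum(pi * vi for pi, vi in zip(p, v)) for v in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:n_a]

def soft_update(target, current, tau):
    """Blend each current parameter into the target with rate tau."""
    return [tau * c + (1 - tau) * t for t, c in zip(target, current)]

p = [0.7, 0.5, 0.1, 0.9]        # learned policy vector
candidates = [[1, 0, 0, 0],     # item 0: score 0.7
              [0, 1, 0, 1],     # item 1: score 1.4
              [0, 0, 1, 1]]     # item 2: score 1.0
print(select_actions(p, candidates, n_a=2))  # [1, 2]

target = soft_update(target=[0.0, 0.0], current=[1.0, 2.0], tau=0.1)
print(target)  # [0.1, 0.2]
```

With one-hot item vectors as above, the dot product simply reads off the policy weights; with learned GCN embeddings it measures how well an item matches the user's per-dimension emphasis.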
Note that in our implementations with DQN and DDPG, to avoid insufficient training, we set a minimum training-step threshold based on the size of the buffer B: only when the number of steps exceeds this minimum threshold and the loss has become stable does training stop. Moreover, we also set a maximum training-step threshold based on the size of B: when the number of steps exceeds this maximum threshold, training stops.

3.6 Environment Simulator

It is expensive and time-consuming to use a real interactive environment for training RL models. Like several previous works [5, 38], we build the environment simulator based on historical interactions. For a user u_i at timestep t, the simulator receives the present state s_t and action a_t, then returns the reward r_t and the next state s_{t+1}. The reward function can thus be written as R(s_t, a_t), and the definition of R(·) differs across RL models.

At each timestep t, TRGIR-DQN recommends at most one item to user u_i. The reward r_t of TRGIR-DQN is defined as

r_t = R(s_t, a_t) = \begin{cases} y_{i,j}^*, & \text{if item } v_j \text{ is recommended;} \\ 0, & \text{otherwise,} \end{cases}    (12)

where y_{i,j}^* is the adjusted rating of u_i on v_j. To give proper rewards to different types of items, y_{i,j}^* is designed as follows:

y_{i,j}^* = \begin{cases} y_{i,j} - y_b, & \text{if } v_j \in V_{u_i}^p; \\ y_{i,j} - y_b - 1, & \text{if } v_j \in V_{u_i}^n; \\ -0.5, & \text{if } v_j \in V_{neg}; \\ 0, & \text{otherwise.} \end{cases}    (13)

Recall that y_{i,j} is the initial rating of u_i on v_j and y_b is the rating bound. By this formula, positive items get positive feedback and negative items get negative feedback. Moreover, the supplemented negative items get half of the minimum negative feedback, i.e., −0.5, while all other items get a feedback of 0. For TRGIR-DDPG, it recommends n_a items to user u_i at each timestep.
The reward function of TRGIR-DDPG not only guides the model to capture users' preferences, but also evaluates the ranking quality of the recommended items. Specifically, the reward r_t is determined by two values, w_k and y_{i,j}^*:

r_t = R(s_t, a_t) = \sum_{k=1}^{n_a} w_k \times y_{i,j}^*,    (14)

where w_k is the ranking weight of the k-th item in a_t. Inspired by DCG [17, 38], the ranking weight is calculated by

w_k = 1/\log_2(k+1).    (15)

Note that the methods for generating s_{t+1} differ between TRGIR-DQN and TRGIR-DDPG, as detailed in the respective implementation sections above.

4 EXPERIMENTS AND RESULTS

4.1 Experimental Settings

In this section, to demonstrate the effectiveness of the proposed method, we first introduce the experimental settings and then present and discuss the experimental results, from the perspectives of both performance and efficiency, to answer the following research questions:

• RQ1: How do the methods that implement our TRGIR framework with DQN and DDPG perform compared with other state-of-the-art methods?
• RQ2: How is the recommendation sparsity problem alleviated by utilizing the textual information in different ways?
• RQ3: How does the efficiency benefit from the candidate action set and the policy vector?
• RQ4: How do the key hyper-parameters (e.g., the dimension of the feature space, the number of clusters, the size of the candidate set) affect the performance? Note that, when analyzing one factor, we keep the others fixed.

Table 1. Statistics of Datasets

DataSet   #Users   #Items   #Ratings of Pos.  #Ratings of Neg.  Sparsity  Size of Des.  Size of Com.
Music      5,541    3,568        58,905             5,801        0.9967    2,338 KB      65,758 KB
Beauty    22,363   12,101       176,520            21,982        0.9993    5,735 KB      83,251 KB
Clothing  39,387   23,033       252,022            26,655        0.9997    3,960 KB      80,208 KB
The default settings are: the input dimension of GCN n_in is 100, the output dimension of GCN n_out is 64, the depth of the GCN propagation layer n_gcn is 2, the number of clusters n_cl is 5, the size of the candidate set n_c is 50, the rate of positive items α is 0.1, the size of the state n_s is 20, and the size of the action n_a is 10 (but for TRGIR-DQN, n_a is fixed to 2). We have implemented our framework with DQN and DDPG; the code is available on GitHub.

4.1.1 Datasets. Jure Leskovec et al. [21] collected and categorized a variety of Amazon products and built several datasets including ratings, descriptions, and comments. We evaluate our models on three publicly available Amazon datasets: Digital Music (Music for short), Beauty, and Clothing, Shoes and Jewelry (Clothing for short), all of which have at least five comments for each product. Table 1 shows the statistics of the datasets we used.

4.1.2 Baseline Methods. We compare TRGIR-DQN and TRGIR-DDPG with eight methods, where ItemPop is a conventional recommendation method, DMF is an MF-based method with neural networks, ANR is a neural recommendation method that leverages textual information, Caser and SASRec are time-related deep learning-based methods, LinearUCB is a MAB-based (Multi-Armed Bandit) method, and D-kNN and TPGR are RL-based methods.

• ItemPop recommends the most popular among the currently available items to the user. This method is non-personalized and is often used as a benchmark for recommendations.
• DMF [42] is a matrix factorization model using deep neural networks. Specifically, it utilizes two distinct MLPs to map the users and items into a common low-dimensional space.
• ANR [8] uses an attention mechanism to focus on the relevant parts of comments and estimates aspect-level user and item importance in a joint manner.
• Caser [33] embeds a sequence of recent items into an "image" and learns sequential patterns as local features of the image by using convolutional filters.
• SASRec [18] is a self-attention-based sequential model for next-item recommendation. It models the entire user sequence and adaptively considers consumed items for prediction.
• LinearUCB [22] is a contextual-bandit recommendation approach that adopts a linear model to estimate the upper confidence bound for each arm.
• D-kNN [11] addresses the large discrete action space problem by combining DDPG with an approximate kNN method.
• TPGR [5] builds a balanced hierarchical clustering tree and formulates picking an item as seeking a path from the root to a certain leaf of the tree.

https://github.com/SunwardTree/TRGIR.
http://snap.stanford.edu/data/amazon/productGraph/categoryFiles.

Table 2. Overall Recommendation Performance

Dataset  Metric    ItemPop  DMF     ANR     Caser   SASRec  LinearUCB  D-kNN(k=0.1M)  D-kNN(k=M)  TRGIR-DQN  TRGIR-DDPG
Music    HR@10     0.2447   0.3318  0.4980  0.8097  0.8897  0.3201     0.3274         0.3436      0.8037     0.9886
         F1@10     0.0454   0.0621  0.1128  0.1676  0.1910  0.0631     0.0648         0.0692      0.1844     0.2304
         NDCG@10   0.1101   0.1569  0.2756  0.5351  0.6212  0.1462     0.1527         0.1617      0.7719     0.9436
         HR@20     0.4889   0.5885  0.7097  0.9090  0.9635  0.5747     0.5838         0.6001      0.8048     0.9935
         F1@20     0.0525   0.0626  0.1084  0.1048  0.1151  0.0638     0.0647         0.0676      0.1003     0.1251
         NDCG@20   0.1716   0.2210  0.3252  0.5542  0.6325  0.2095     0.2171         0.2258      0.7722     0.9446
Beauty   HR@10     0.2551   0.2734  0.4550  0.6125  0.6823  0.3219     0.2585         0.2772      0.7278     0.8845
         F1@10     0.0482   0.0502  0.0990  0.1218  0.1386  0.0614     0.0489         0.0519      0.1463     0.1798
         NDCG@10   0.1134   0.1249  0.2252  0.3939  0.4569  0.1447     0.1170         0.1258      0.6213     0.6949
         HR@20     0.5278   0.5273  0.6993  0.7817  0.8330  0.5911     0.5142         0.5377      0.7349     0.9501
         F1@20     0.0543   0.0529  0.1006  0.0826  0.0907  0.0613     0.0529         0.0547      0.0782     0.1024
         NDCG@20   0.1817   0.1885  0.2850  0.4344  0.4942  0.2122     0.1809         0.1910      0.6230     0.7104
Clothing HR@10     0.2265   0.2393  0.3421  0.5060  0.5817  0.2500     0.2541         0.2768      0.6593     0.7544
         F1@10     0.0413   0.0437  0.0663  0.0934  0.1084  0.0458     0.0467         0.0510      0.1222     0.1405
         NDCG@10   0.1033   0.1041  0.1622  0.2900  0.3525  0.1130     0.1131         0.1242      0.4577     0.4865
         HR@20     0.4964   0.5044  0.6008  0.7196  0.7655  0.5041     0.5043         0.5293      0.7288     0.8973
         F1@20     0.0482   0.0488  0.0659  0.0702  0.0758  0.0489     0.0490         0.0517      0.0711     0.0881
         NDCG@20   0.1706   0.1704  0.2264  0.3427  0.3968  0.1756     0.1757         0.1874      0.4754     0.5225

Best performance is in boldface and second best is underlined (in the original typeset table).

Note that for D-kNN, a larger k (i.e., the number of nearest neighbors) results in better performance but poorer efficiency. For a fair comparison, we set k to 0.1M and M (M is the number of items), respectively.

4.1.3 Evaluation Metrics and Methodology. Methods that achieve their goals through Top-k recommendation are typically evaluated with metrics such as Hit Ratio (HR) [42], Precision [33, 50], Recall [33], F1 [5], and normalized Discounted Cumulative Gain (nDCG) [18, 38, 48, 50]. To cover as many aspects of Top-k recommendation as possible, we chose HR@k, F1@k, and nDCG@k as the evaluation metrics.

The test data was constructed during data preparation, and all evaluated methods were tested on this data. We now describe the test method in detail: for each user, we first classify the user's history logs into positive and negative ones and sort the items in the positive history logs by timestamp. Then, we choose the last 10% of the ordered items in the positive logs as positive items. Finally, the negative items are randomly selected from the cluster that is farthest from the one the current user belongs to. Based on this strategy, the recommendation methods (except TPGR, which only recommends one item in each episode) can generate a ranked Top-k list to evaluate the metrics mentioned above.
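The evaluation metrics used above can be sketched as follows. This is an illustrative implementation of HR@k and nDCG@k on a toy ranked list (F1@k follows analogously from precision and recall); it is a sketch of the standard definitions, not the authors' evaluation code:

```python
import math

def hr_at_k(ranked, positives, k):
    """1 if at least one positive item appears in the Top-k list."""
    return int(any(v in positives for v in ranked[:k]))

def ndcg_at_k(ranked, positives, k):
    """DCG with the 1/log2(rank+1) discount, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, v in enumerate(ranked[:k]) if v in positives)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [7, 3, 9, 1]   # hypothetical ranked Top-4 recommendation
positives = {3, 1}      # held-out positive items
print(hr_at_k(ranked, positives, 4))          # 1
print(round(ndcg_at_k(ranked, positives, 4), 4))
```

Note that the 1/log2(rank+1) discount here is the same ranking weight used in the TRGIR-DDPG reward of Equation (15).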
4.2 Comparison and Analysis (RQ1)

Table 2 summarizes the results of our experiments on the three datasets in terms of six metrics: HR@10, F1@10, nDCG@10, HR@20, F1@20, and nDCG@20. Note that, since TPGR is not suitable for Top-k recommendation, we did not include it as a competitor when evaluating the recommendation performance. From the results, we have the following key observations:

• Compared with ItemPop, the methods that utilize deep neural networks are more effective. Moreover, the text-based method ANR consistently outperforms DMF, which only uses interaction information for embedding; this demonstrates the importance of utilizing textual information to alleviate the negative effects of data sparsity for better performance.
• For IRSs, the preferences of users are always time-related. Caser, which learns sequential patterns using convolutional filters, and SASRec, which utilizes self-attention to model the entire sequence, can improve the performance of IRSs dramatically.
• The interactive methods, LinearUCB and D-kNN, perform similarly and only outperform ItemPop. LinearUCB is a traditional MAB-based method, which cannot plan for the long term explicitly. As for D-kNN, considering only the distance while ignoring the importance of each dimension of the latent space for users results in missing proper items.
• Our method TRGIR-DDPG achieves the best performance and obtains remarkable improvements over the state-of-the-art methods, which demonstrates the effectiveness of combining the candidate action set and the policy vector to address the large discrete action space problem. TRGIR-DQN also performs well but is inferior to TRGIR-DDPG. The performance gap between them may be due to the stronger exploration ability of the policy vector.
4.3 Utilizing Textual Information (RQ2)

To find out whether the excellent performance is retained when textual information is not leveraged, or is utilized differently, we compare our TRGIR framework with the deep Reinforcement learning framework using Matrix-factorization representation for Interactive Recommendation (RMIR) [20] and the Text-based deep Reinforcement learning framework using Sum-average word representation for Interactive Recommendation (TRSIR) [34]. As with TRGIR, we implement RMIR and TRSIR with both DQN and DDPG. The results on the three datasets (arranged in increasing order of scale and sparsity) in terms of HR@10, F1@10, nDCG@10, HR@20, F1@20, and nDCG@20 are shown in Table 3.

From Table 3, it is clear that, whether with DQN or DDPG, the data sparsity problem can be alleviated by utilizing textual information. Meanwhile, the performance improvement increases along with the data scale and data sparsity. This justifies the effectiveness of TRSIR and TRGIR, which leverage textual information in RL-based recommendation, especially for large-scale, high-sparsity datasets. Further, regarding the representations of users and items, the comparison between TRSIR and TRGIR demonstrates that the self-supervised GCN embedding method is much more powerful than the simple sum-average word vector operation.

Moreover, to study the influence of the settings in our GCN-based self-supervised embedding method, we conduct an ablation study on the three datasets for TRGIR-DQN and TRGIR-DDPG on all metrics. Since the performances under different datasets, metrics, and implementations exhibit similar trends, we only present the performance of TRGIR-DDPG with the default setting (red bar) on Music in terms of HR@10 and F1@10, as shown in Figure 6. Note that for the two sub-graphs in Figure 6, the left three blue bars show the performance of TRGIR-DDPG under different relation settings.
Specifically, without textual information (W/O Text) only the user-item relations are kept, while without descriptions (W/O Des.) the user-item, user-comment, and item-comment relations are kept. Because the comments relate to both users and items, we cannot remove either of these relations separately; thus, without comments (W/O Com.) only the user-item and item-description relations are kept. We find that the performance degrades greatly when the method only contains user-item relations, and that the model with all the relations (Default) obtains the best performance. Moreover, we can see that the item-description relations are more important than the comment-related relations for our model. The reason might be that our document-level natural language processing method introduces more noise in comments than in descriptions. The green bar in Figure 6 shows that without self-connection (W/O Self-con.) our model performs worse, demonstrating its effectiveness. To keep the linear substructures and simplify our model, as mentioned in Section 3.3, we removed the activation function in our self-supervised embedding method. The orange bar in Figure 6 shows that our model performs better than the variant with an activation function (W/ Active-Fun.), which verifies the rationality of our design principle.

Table 3. The Comparison of Embedding Methods (RMIR, TRSIR, and TRGIR) on Specific Implementations

Dataset  Metric    RMIR-DQN  TRSIR-DQN  TRGIR-DQN  RMIR-DDPG  TRSIR-DDPG  TRGIR-DDPG
Music    HR@10     0.7164    0.7944     0.8037     0.8717     0.9455      0.9886
         F1@10     0.1603    0.1805     0.1844     0.1920     0.2124      0.2304
         NDCG@10   0.6843    0.7204     0.7719     0.6478     0.7152      0.9436
         HR@20     0.7183    0.7965     0.8048     0.9487     0.9760      0.9935
         F1@20     0.0865    0.0987     0.1003     0.1145     0.1203      0.1251
         NDCG@20   0.6845    0.7207     0.7722     0.6643     0.7219      0.9446
Beauty   HR@10     0.6553    0.6685     0.7278     0.6258     0.7449      0.8845
         F1@10     0.1305    0.1306     0.1463     0.1267     0.1463      0.1798
         NDCG@10   0.5165    0.5121     0.6213     0.4025     0.4896      0.6949
         HR@20     0.6613    0.7050     0.7349     0.8156     0.8909      0.9501
         F1@20     0.0698    0.0731     0.0782     0.0874     0.0936      0.1024
         NDCG@20   0.5178    0.5209     0.6230     0.4486     0.5244      0.7104
Clothing HR@10     0.3953    0.6251     0.6593     0.3290     0.6622      0.7544
         F1@10     0.0726    0.1157     0.1222     0.0602     0.1226      0.1405
         NDCG@10   0.2655    0.4545     0.4577     0.1647     0.3917      0.4865
         HR@20     0.4680    0.6886     0.7288     0.5805     0.8545      0.8973
         F1@20     0.0453    0.0671     0.0711     0.0562     0.0835      0.0881
         NDCG@20   0.2838    0.4704     0.4754     0.2273     0.4398      0.5225

Fig. 6. Ablation experiments: Performance of TRGIR-DDPG on Music w.r.t. (a) HR@10; (b) F1@10.

It is noteworthy that our way of introducing text may also introduce noise or irrelevant information. To alleviate this problem, we first utilize the Long Stopword List to filter out meaningless words (Section 3.3). Moreover, with the pre-trained GloVe vectors [27], since the distance between words with similar meanings is much smaller than that between words with different meanings, the influence of noise can be reduced to some extent.
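The stopword filtering and word-vector lookup described above can be sketched as follows. The stopword list and 2-dimensional vectors here are tiny illustrative stand-ins, not the real Long Stopword List or GloVe table:

```python
# Sketch of the text preprocessing: drop meaningless words, then look up
# pre-trained vectors for the words that remain.

STOPWORDS = {"the", "a", "an", "is", "of", "and"}   # tiny stand-in list
WORD_VECS = {"classic": [0.8, 0.1], "rock": [0.7, 0.2],
             "album": [0.3, 0.9]}                   # stand-in for GloVe

def doc_vectors(text):
    """Lowercase, drop stopwords and out-of-vocabulary tokens, and return
    the pre-trained vectors of the remaining words."""
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return [WORD_VECS[w] for w in tokens if w in WORD_VECS]

vecs = doc_vectors("A classic rock album")
print(len(vecs))  # 3 content words survive the filtering
```

The resulting per-word vectors are what the GCN (in TRGIR) or the sum-average operation (in TRSIR) consumes downstream.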
4.4 Time Comparison (RQ3)

In this section, we compare the efficiency of the RL-based models from two aspects: the time consumed for training (updating the model per step) and for decision-making, measured in milliseconds. Table 4 presents the comparison of time costs on the Beauty dataset; the time-cost orderings on the other datasets exhibit a similar trend and are thus omitted here.

Table 4. Time Comparison for Training and Decision-making

Time Cost (ms)     TRGIR-DQN (n_c=500)  TRGIR-DQN  DDPG-kNN (k=0.1M)  DDPG-kNN (k=M)  TPGR  TRGIR-DDPG
Per training step  6.01                 6.01       5.26               5.51            4.98  3.13
Per decision       621.82               80.65      7.86               56.42           5.06  0.90

The values in Table 4 are averages obtained by statistics. To make a fair comparison, both n_s and n_a are set to 1 (but for TRGIR-DQN, we set n_a to 2), and the other settings keep their defaults. The experiments are conducted on the same machine with a 6-core CPU (i7-6850k, 3.6 GHz) and 64 GB RAM.

As shown in Table 4, for each training step, the time-cost gap among the RL-based methods is not large. However, for the time cost of decision-making, the large discrete action space makes most RL-based recommendation methods inefficient. The value-based method TRGIR-DQN (n_c = 500), which has to calculate Q-values for every possible item, runs much slower than the other models, not to mention how the efficiency would drop at a real scale far larger than 500. By narrowing the scale of the action candidate set to 50 (n_c = 50 is the default setting, which performs better than n_c = 500, see Figure 8(a)), the decision-making efficiency of TRGIR-DQN improves greatly. D-kNN also runs slowly (especially when k = M), because discovering the nearest neighbors as actions in a large discrete action space has high time complexity.
TPGR reduces the decision-making time significantly by constructing a clustering tree, but as mentioned before, it only supports Top-1 recommendation. Compared to the other methods, by using the action candidate set and the policy vector, TRGIR-DDPG achieves a significant improvement in execution efficiency.

4.5 Hyper-parameter Sensitivity (RQ4)

We select several important hyper-parameters and analyze their effects on the performance of TRGIR-DQN and TRGIR-DDPG. Note that we have conducted these experiments on all the datasets with the six metrics mentioned above, and the results show that our approach exhibits similar performance trends on all the evaluated datasets. For simplicity, we only present the results on the Beauty dataset in terms of HR@10 and NDCG@10. When testing one parameter, we keep the other hyper-parameters fixed at their default settings. From Figures 7 to 9, we can see that the two performance metrics for TRGIR-DQN and TRGIR-DDPG exhibit similar trends, so the following analyses apply to both of them:

The Input Dimension of GCN (n_in). Note that the input dimension of GCN, n_in, is equal to the dimension of the pre-trained word vectors. Since the embedding initialization relies on the pre-trained word vectors, the value of n_in reflects the richness of the textual information. As shown in Figure 7(a), as n_in increases, TRGIR-DQN and TRGIR-DDPG perform better, as expected.

The Output Dimension of GCN (n_out). The output dimension of GCN, n_out, is equal to the final vector dimension of users and items. Figure 7(b) shows that, as n_out increases, the performance of our methods remains stable. This is mainly because the useful knowledge can already be captured within 16 dimensions.

The Depth of the GCN Propagation Layer (n_gcn). Figure 7(c) shows that TRGIR-DDPG achieves the best performance when the depth n_gcn is 3, and TRGIR-DQN achieves the best performance when n_gcn is 2.
The reason might be that increasing the number of propagation layers aggregates more knowledge from more nodes, but too high-order propagation may cause the over-smoothing problem, which in turn degrades the performance.

Fig. 7. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the input dimension of GCN; (b) the output dimension of GCN; (c) the depth of the GCN propagation layer.

Fig. 8. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the number of clusters; (b) the size of candidate; (c) the rate of positive items.

The Number of Clusters (n_cl). As shown in Figure 8(a), as n_cl increases, the performance first rises and then falls. More clusters mean larger differences between the current cluster and the one that provides negative samples, which improves their quality. However, too many clusters may also cause a shortage of effective samples.

The Size of Candidate (n_c). Figure 8(b) shows that the performance decreases as n_c increases. This is mainly because, among the training samples, the items from V_u are much fewer than the items from V_cl. Increasing n_c causes imbalanced sampling, which leads to worse performance.

The Rate of Positive Items (α). As shown in Figure 8(c), as α increases, the performance first grows and then remains stable. This is because increasing α introduces more positive items, which helps perceive the user's interests better. But, since n_pos ≤ |V_u| (see Algorithm 1), when α is large enough, its growth may no longer affect n_pos.

The Size of State (n_s). Figure 9(a) shows that as the state size n_s increases, the performance stays almost unchanged, which means the size of the state has little impact on the implementations of our framework TRGIR.

The Size of Action (n_a).
For TRGIR-DDPG, Figure 9(b) shows that when n_a ranges from 1 to 10, the performance increases. However, the performance starts to decrease when n_a reaches 20.

Fig. 9. Performance of TRGIR-DQN and TRGIR-DDPG on Beauty in HR@10 and NDCG@10 w.r.t. (a) the size of state; (b) the size of action.

The larger n_a is, the more frequently the state changes, indicating that keeping a proper updating speed is important. Note that, since n_a is fixed to 2 for TRGIR-DQN, we do not include it in this set of experiments.

5 CONCLUSION

In this article, we propose TRGIR, a Text-based deep Reinforcement learning framework using self-supervised Graph representation for Interactive Recommendation. By learning the embeddings of users and items with a GCN-based self-supervised embedding method on a relation graph that contains textual information, we obtain user and item vectors with rich semantics, which greatly alleviates the data sparsity problem. Moreover, following the idea of collaborative filtering, we classify users into several clusters and construct an action candidate set, which directly reduces the scale of the action space. Further, combined with the policy vector dynamically learned by DDPG, which represents the user's preferences in the feature space, we select items from the candidate set to generate actions for recommendation, which greatly improves the efficiency of decision-making and enhances the exploration ability. Experimental results over a carefully designed simulator on three public datasets demonstrate that, compared with state-of-the-art methods, TRGIR-DDPG achieves remarkable performance improvements in a time-efficient manner.
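The policy-vector mechanism summarized in the conclusion, mapping DDPG's continuous actor output to n_a discrete items from the candidate set, can be sketched as follows. The random embeddings and the inner-product scoring are illustrative assumptions; the paper's actor network and exact scoring function may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_cand, n_a = 16, 50, 2

# Candidate item embeddings (in TRGIR these come from the GCN-based
# self-supervised encoder; random stand-ins here).
cand_vecs = rng.normal(size=(n_cand, d))

# Continuous policy vector emitted by the DDPG actor for the current
# state, representing the user's preferences in the feature space.
policy_vec = rng.normal(size=d)

# Turn the continuous policy vector into n_a discrete actions by
# ranking candidates on similarity to it (inner product assumed).
scores = cand_vecs @ policy_vec
actions = np.argsort(scores)[::-1][:n_a]
```

Because only the candidate set is scored, each decision costs O(n_cand · d) regardless of catalog size, which is consistent with the 0.9 ms per-decision figure reported for TRGIR-DDPG in Table 4.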
For future work, we intend to model the textual information at the word level to capture finer-grained semantic factors for better recommendation performance; we would also like to explore whether our proposed model can be combined with transfer learning.

REFERENCES

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Laura M. Haas and Ashutosh Tiwary (Eds.). 94–105.
[2] Pierpaolo Basile, Claudio Greco, Alessandro Suglia, and Giovanni Semeraro. 2018. Deep learning and hierarchical reinforcement learning for modeling a conversational recommender system. Intelligenza Artificiale 12, 2 (2018), 125–
[3] Konstantin Bauman, Bing Liu, and Alexander Tuzhilin. 2017. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, 717–725.
[4] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 (2017).
[5] Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019. Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI Press, 3312–3320.
[6] Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. 2019. Generative adversarial user model for reinforcement learning based recommendation system.
In Proceedings of the 36th International Conference on Machine Learning (ICML), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, 1052–1061.
[7] Germán Cheuque, José Guzmán, and Denis Parra. 2019. Recommender systems for online video game platforms: The case of STEAM. In Proceedings of the International Conference on World Wide Web (WWW), Sihem Amer-Yahia, Mohammad Mahdian, Ashish Goel, Geert-Jan Houben, Kristina Lerman, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 763–771.
[8] Jin Yao Chin, Kaiqi Zhao, Shafiq R. Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), Alfredo Cuzzocrea, James Allan, Norman W. Paton, Divesh Srivastava, Rakesh Agrawal, Andrei Z. Broder, Mohammed J. Zaki, K. Selçuk Candan, Alexandros Labrinidis, Assaf Schuster, and Haixun Wang (Eds.). ACM, 147–156.
[9] Dong Deng, Liping Jing, Jian Yu, Shaolong Sun, and Haofei Zhou. 2018. Neural Gaussian mixture model for review-based rating prediction. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O'Donovan (Eds.). ACM, 113–121.
[10] Zhi-Hong Deng, Ling Huang, Chang-Dong Wang, Jian-Huang Lai, and S. Yu Philip. 2019. DeepCF: A unified framework of representation learning and matching function learning in recommender system. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 61–68.
[11] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
[12] George H. Dunteman. 1989. Principal Components Analysis. Number 69. Sage.
[13] Hado van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning.
In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2094–2100.
[14] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
[15] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
[16] Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). 368–377.
[17] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. DOI:https://doi.org/10.1145/582415.582418
[18] Wang-Cheng Kang and Julian J. McAuley. 2018. Self-attentive sequential recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE Computer Society, 197–206.
[19] Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR).
[20] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
[21] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. Retrieved from http://snap.stanford.edu/data.
[22] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation.
In Proceedings of the International Conference on World Wide Web (WWW). 661–670.
[23] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR Poster).
[24] Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys). 5–12.
[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
[26] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1532–1543.
[28] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference. Springer, 593–607.
[29] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web Companion. ACM, 111–112.
[30] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML). 387–395.
[31] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 926–934.
[32] Haihui Tan, Ziyu Lu, and Wenjie Li. 2017. Neural network based reinforcement learning for real-time pushing on text stream. In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 913–916.
[33] Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), Yi Chang, Chengxiang Zhai, Yan Liu, and Yoelle Maarek (Eds.). ACM, 565–573.
[34] Chaoyang Wang, Zhiqiang Guo, Jianjun Li, Peng Pan, and Guohui Li. 2020. A text-based deep reinforcement learning framework for interactive recommendation. In Proceedings of the 24th European Conference on Artificial Intelligence. IOS Press, 537–544. DOI:https://doi.org/10.3233/FAIA200136
[35] Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization bandits for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence. 2695–2702.
[36] Kai Wang, Zhene Zou, Qilin Deng, Jianrong Tao, Runze Wu, Changjie Fan, Liang Chen, and Peng Cui. 2021. Reinforcement learning with a disentangled universal value function for item recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4427–4435.
[37] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
[38] Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2017. Reinforcement learning to rank with Markov decision process.
In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR). 945–948.
[39] Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-supervised graph learning for recommendation. Association for Computing Machinery, New York, NY, 726–735. DOI:https://doi.org/10.1145/3404835.3462862
[40] Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan Zhu. 2017. SSP: Semantic space projection for knowledge graph embedding with text descriptions. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 3104–3110.
[41] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
[42] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 3203–
[43] Wu Yao, Christopher Dubois, Alice X. Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-N recommender systems. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM). 153–162.
[44] Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, and Changyou Chen. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. Adv. Neural Inf. Process. Syst. 32 (2019), 15214–15224.
[45] Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2019. Deep reinforcement learning for search, recommendation, and online advertising: A survey. SIGWEB Newsl. Spring (July 2019). DOI:https://doi.org/10.1145/3320496.3320500
[46] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys). 95–103.
[47] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD). 1040–1048.
[48] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2018. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2018).
[49] Xiaoxue Zhao, Weinan Zhang, and Jun Wang. 2013. Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1411–1420.
[50] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the International Conference on World Wide Web (WWW). 167–176.
[51] Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 425–434.

Received March 2021; revised January 2022; accepted February 2022
ACM/IMS Transactions on Data Science – Association for Computing Machinery
Published: May 17, 2022