recsys
推荐系统中的向量检索00-概述
YeeKal
•
•
"#recsys"
匹配和检索
- semantic matching: 语义匹配/文本匹配
- embedding-based retrieval: 语义检索/语义召回/向量召回/向量检索/
匹配和检索并没有统一的叫法,或者两者是有很大交叉的领域。匹配更多的应用在文本上面,而检索则是指搜索推荐中的应用。搜索中一般称为query和doc的匹配/检索,而推荐中称为user和item的匹配/检索,或者广告领域中的query-ad/user-ad。其实通过一个query召回doc,也可以说成是给定一个query匹配最相近的doc。本笔记专注于针对搜索推荐场景的向量检索,或者以下我们统称匹配/检索为向量检索。
- [语义/向量][检索/索引/匹配]一般指两大类应用:
- 应用于文本:这类方法更加丰富多样,并且应用于多种nlp场景,比如问答/机器翻译/段落匹配等。参考MatchZoo-py
- 应用于各种实体的:工业应用场景较多,比如电商(淘宝,京东,amazon),社交(facebook)等。可以说在有推荐/搜索的地方都可以应用。
- 检索对象/术语:
- doc/query, user/item
向量检索的三大方法
- 双塔模型(two tower model): 又称为representation learning。doc和query分别训练,由于模型分离,便于在doc端建索引,使用ann方法召回。
- 单塔模型:interaction learning.doc和query有特征交互,复杂度增大。 如阿里的tdm
- 图模型(graph embedding):图神经网络学习,最后用ann召回
双塔模型
dssm开山之作。之后为了捕捉时序信息,有通过CNN进行的改进和通过rnn进行的改进。
- Based on DNN
- DSSM: Learning Deep Structured Semantic Models for Web Search using Click-through Data (Huang et al., CIKM '13)
- Based on CNN
- CDSSM: A latent semantic model with convolutional-pooling structure for information retrieval (Shen et al. CIKM '14)
- ARC I: Convolutional Neural Network Architectures for Matching Natural Language Sentences (Hu et al., NIPS '14)
- CNTN: Convolutional Neural Tensor Network Architecture for Community-Based Question Answering (Qiu and Huang, IJCAI'15)
- Based on RNN
- LSTM-RNN: Deep Sentence Embedding Using the Long Short Term Memory Network: Analysis and Application to Information Retrieval (Palangi et al., TASLP '16)
相似度函数/匹配函数计算:
- cosine similarity:
- dot product
- multi-layer perception (ARC-I)
- neural tensor networkd(CNTN)
单塔模型
- Matching with word-level similarity matrix
- ARC II (Hu et al., NIPS '14)
- MatchPyramid (Pang et al., AAAl '16)
- Match-SRNN (Wan et al., IICAI '16)
- Matching with attention model
- Decomposable Attention Model for Matching (Parikh et al., EMNLP '16)
- Combining matching function learning and representation learning
- Representation Learning + Matching Function Learning Duet (Mitra et al., WWW'17)
图模型
- random walk
- deepwalk(2014)
- node2vec(2016)
- eges(2018)
- line(Lareg-scale information network embedding MSRA 2015)
- SDNE (Structural deep network embedding)
- graphsage(2017)
关于匹配
文本的匹配
Typical Query-Document Relevance Matching Methods:
- Based on global distribution of matching strengths
- DRMM (Guo et al., CIKM '16)
- aNMM (Yang et al., CIKM '16)
- K-NRM (Xiong et al., SIGIR '17)
- Conv-KNRM (Dai et al., WSDM '18)
- Based on local context of matched terms
- DeepRank (Pang et al., CIKM '17)
- PACRR (Hui et al., EMNLP'17)
推荐系统中的匹配:
- Collaborative Filtering: Models are built based on user-item interaction matrix only.
- DeepMF: Deep Matrix Factorization (Xue et al, IJCAI'17)
- AutoRec: Autoencoders Meeting CF (Sedhain et al, WWW'15)
- CDAE: Collaborative Denoising Autoencoder (Wu et al, WSDM'16)
- Collaborative Filtering + Side Info: Models are built based on user-item interaction + side info.
- DCF: Deep Collaborative Filtering via Marginalized DAE (Li et al, CIKM'15)
- DUIF: Deep User-Image Feature (Geng et al, ICCV'15)
- ACF: Attentive Collaborative Filtering (Chen et al, SIGIR'17)
- CKB: Collaborative Knowledge Base Embeddings (Zhang et al, KDD'16)
特征处理
- labelEncoding: 离散特征编程数字
- OneHotEncoding
- HashEncoding
- hash trick
- hash 冲突
- embedding
- 多值离散特征: 相加,平均
- 多个特征: 拼接