
Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features (Paper Notes)

Posted: 2020-12-02



Cross (combinatorial) features are very effective, but manually discovering meaningful feature combinations is hard.
Deep learning removes the need for manual feature engineering, and can even mine high-order features that domain experts have not found.
The distinctive points of this paper are the use of Residual Units and the feature representation.

1 摘要

  • Deep Crossing "automatically combines features to produce superior models".

  • It can "achieve superior results with only a sub-set of the features used in the production models".

2 Sponsored Search

  • "Sponsored search is responsible for showing ads alongside organic search results."
Key concepts and their meanings:
Query: the text string the user types into the search box
Keyword: a text string related to the product, specified by the advertiser to match user queries
Title: the title of the sponsored ad, specified by the advertiser to catch the user's attention
Landing page: the product website the user reaches after clicking the corresponding ad
Match type: an option given to the advertiser for how closely the keyword should match the user query, usually one of four kinds: exact, phrase, broad, and contextual
Campaign: a set of ads with the same settings such as budget and location targeting, often used to organize products into categories
Impression: an instance of an ad being displayed to a user, usually logged at runtime along with other available information
Click: an indicator of whether an impression was clicked, usually logged at runtime along with other available information
Click through rate: total clicks divided by total impressions
Click Prediction: a critical model of the platform, predicting the likelihood that a user clicks a given ad for a given query

3 特征表示

  • Simply converting campaign ids into a one-hot vector would significantly increase the size of the model.

    • One solution is to use a pair of companion features, where CampaignID is a one-hot representation consisting only of the top 10,000 campaigns with the highest number of clicks.

    • Other campaigns are covered by CampaignIDCount, a numerical feature that stores per-campaign statistics such as click-through rate. Such features are referred to as counting features in the following discussion.

  • Deep Crossing avoids using combinatorial features. It works with both sparse and dense individual features.
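The companion-feature idea above can be sketched in a few lines. This is a minimal NumPy illustration, not the production pipeline; the function and variable names (`campaign_features`, `top_ids`, `ctr_stats`) are hypothetical, and the toy top-id list stands in for the paper's top 10,000 campaigns.

```python
import numpy as np

def campaign_features(campaign_id, top_ids, ctr_stats, default_ctr=0.0):
    """Companion-feature sketch (names are illustrative, not from the paper).

    top_ids   -- highest-click campaign ids mapped to one-hot slots
    ctr_stats -- per-campaign click-through-rate statistics
    Returns (one_hot, counting): campaigns outside the top list get an
    all-zero one-hot and are covered only by the counting feature.
    """
    one_hot = np.zeros(len(top_ids))
    if campaign_id in top_ids:
        one_hot[top_ids[campaign_id]] = 1.0
    counting = ctr_stats.get(campaign_id, default_ctr)
    return one_hot, counting

top_ids = {"c7": 0, "c42": 1}                 # toy stand-in for the top-10k list
ctr = {"c7": 0.12, "c42": 0.05, "c99": 0.01}
oh, cnt = campaign_features("c99", top_ids, ctr)
print(oh.sum(), cnt)                          # 0.0 0.01  (tail campaign: counting feature only)
```

Campaigns in the long tail thus still contribute signal through their statistics, while the one-hot part stays bounded in size.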

4 模型结构

[Figure: Deep Crossing model architecture]

  • The objective function is log loss, but can easily be customized to softmax or other functions:
    $$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N}\left(y_{i} \log\left(p_{i}\right)+\left(1-y_{i}\right) \log\left(1-p_{i}\right)\right) \tag{1}$$
    where $p_i$ is the output of a node in the Scoring layer and $y_i$ is the click label.
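Eq. (1) can be computed directly. A minimal NumPy sketch (the `eps` clipping is an implementation detail to avoid log(0), not part of the paper):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss (Eq. 1): mean negative log-likelihood of the labels.

    y_true -- array of 0/1 click labels
    p_pred -- predicted click probabilities from the scoring layer
    eps    -- clipping constant to avoid log(0) (assumption, not from the paper)
    """
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(log_loss(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.8])))
```

For a prediction of 0.5 on every example the loss is ln 2 ≈ 0.693, the usual sanity-check value for an uninformative binary classifier.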

4.1 Embedding and Stacking Layers

  • The embedding layer consists of a single layer of a neural network, with the general form
    $$X_{j}^{O}=\max \left(\mathbf{0}, \mathbf{W}_{j} X_{j}^{I}+\mathbf{b}_{j}\right) \tag{2}$$
    where $X^I_j$ is the $n_j$-dimensional input feature, $W_j$ is an $m_j \times n_j$ matrix, and $b_j$ is $m_j$-dimensional.
    When $m_j < n_j$, the embedding reduces the dimensionality of the input feature.
    The element-wise $\max(\mathbf{0}, \cdot)$ is the ReLU activation.
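Eq. (2) as a minimal NumPy sketch. The dimensions (10,000-way one-hot projected to 256) are illustrative assumptions, echoing the top-10,000 CampaignID feature from Section 3:

```python
import numpy as np

def embedding_layer(x, W, b):
    """Per-feature embedding (Eq. 2): X^O = max(0, W x + b).

    x -- one-hot (or dense) input feature of dimension n_j
    W -- m_j x n_j weight matrix; b -- bias of dimension m_j
    With m_j < n_j this projects the sparse input to a lower dimension.
    """
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
n, m = 10_000, 256            # e.g. top-10k CampaignID one-hot -> 256-dim embedding
W = rng.normal(size=(m, n)) * 0.01
b = np.zeros(m)
x = np.zeros(n); x[42] = 1.0  # one-hot input feature
out = embedding_layer(x, W, b)
print(out.shape)              # (256,)
```

Note that for a one-hot $x$, the product $W x$ simply selects one column of $W$, so this dense formulation is equivalent to an embedding-table lookup followed by ReLU.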

  • Note that both $\{W_j\}$ and $\{b_j\}$ are parameters of the network, and are optimized together with the other parameters in the network; this differs from pre-trained embeddings such as word2vec.

4.2 Residual Layers

Deep Crossing uses a modified version of the Residual Unit from the Residual Net.

  • The unique property of the Residual Unit is to add back the original input feature after passing it through two layers of ReLU transformations:
    $$X^{O}=\mathcal{F}\left(X^{I},\left\{\mathbf{W}_{0}, \mathbf{W}_{1}\right\},\left\{\mathbf{b}_{0}, \mathbf{b}_{1}\right\}\right)+X^{I} \tag{3}$$
    where $\mathcal{F}(\cdot)$ fits the residual $X^O - X^I$.

  • The authors believe that fitting residuals has a numerical advantage. While the actual reason why the Residual Net [1] can go as deep as 152 layers with high performance is subject to more investigation, Deep Crossing does exhibit a few properties that might benefit from the Residual Units.
    [Figure: Residual Unit]

  • Deep Crossing was applied to a wide variety of tasks, and also to training data with large differences in sample sizes. The Residual Units are likely performing some kind of implicit regularization that leads to this stability.
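The forward pass of Eq. (3) can be sketched as follows. This is a minimal NumPy reading of the quoted sentence (ReLU applied inside $\mathcal{F}$ only, then the skip addition); descriptions of where the final ReLU sits vary between ResNet variants, so treat the exact ordering as an assumption. Keeping both weight matrices square preserves the input dimension so the skip connection is a plain addition:

```python
import numpy as np

def residual_unit(x, W0, b0, W1, b1):
    """Residual Unit sketch (Eq. 3): two ReLU layers, then add the input back.

    F(x) fits the residual X^O - X^I; square W0, W1 keep the dimension
    unchanged so the skip connection needs no projection.
    """
    h = np.maximum(0.0, W0 @ x + b0)  # first ReLU layer
    f = np.maximum(0.0, W1 @ h + b1)  # second ReLU layer -> F(x)
    return f + x                      # add back the original input X^I

rng = np.random.default_rng(1)
d = 8
W0 = rng.normal(size=(d, d)) * 0.1
W1 = rng.normal(size=(d, d)) * 0.1
b0, b1 = np.zeros(d), np.zeros(d)
x = rng.normal(size=d)
print(residual_unit(x, W0, b0, W1, b1).shape)  # (8,)
```

A useful property falls out directly: if the weights are zero, $\mathcal{F}$ vanishes and the unit is the identity, which is one intuition for why stacking many such units remains stable.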

5 总结

  • Deep Crossing demonstrated that, with the recent advances in deep learning algorithms, modeling languages, and GPU-based infrastructure, "a nearly dummy solution exists for complex modeling tasks at large scale", i.e. one that requires no manually crafted combinatorial features.

  1. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Source: https://blog.csdn.net/qq_40860934/article/details/110451599
