Paper-summary-5: Glove-Global Vectors for Word Representations

相信几乎每个用深度学习的人都被所谓的Pre-trained Glove统治过自己的模型,在懒得训练词向量的时候,一个pre-trained Glove成了最优备选方案。曾经某段时间,我还纠结过是该自己训练Word2vec还是偷懒用Glove,这次读了paper后发现,Glove在被发明出来的时候就已经把word2vec所表征的信息给包含进去了。这篇paper不仅从数学(?)角度详细推导了一次GloVe的式子,同时还单独开了一个section来论证了Skip-gram与GloVe的关系,这点个人感觉是非常地照顾读者了。 Introduction: 现在较为流行的两类词向量训练方法集为: Global Matrix Factorization: LSA Local Context Window Method: Skip-Gram 目前来讲,两类方法都存在各自的缺陷,LSA充分利用了基于词库的统计学方面的信息,但是在涉及到词相似度方面的任务时候表现的很差。而Skip-gram虽然在相似度任务上表现良好,但是却是于各个独立的局部上下文而不是基于整个语料库的co-occurrence counts。为了充分利用好二者的优势,作者提出了一个基于全局词与词共存对数的加权最小二乘法训练的模型,因此可以充分利用语料库的统计量。事实证明该方法在大部分同义词集中取得了SOTA的成果,并在词相似度任务和NER任务中都取得了领先的结果。 Pre-defined Notations \(X\) : 词与词的共存矩阵。 \(X_{ij}\): 单词\(j\)出现在单词\(i\)上下文的次数。 \(X_i = \sum_kX_{ik}\) \(P_{ij}=P(j|i) = X_{ij}/{X_i}\): 单词\(j\)出现在单词\(i\)上下文的概率。 \(w \in \mathbb{R}^d\): word vectors \(\widetilde{w} \in \mathbb{R}^d\): context word vectors Methodology 首先,作为词向量方面的永恒难题,如何从语料库中的词频等统计量提取词表征,词义是如何从这些统计量中产生的。未解决这个问题,作者首先基于语料库创建了个共存矩阵\(X\),进而获得概率矩阵\(P\)。symbol的定义依照上一节。然后设定了一个新的metric。假设有单词\(i\), \(j\), \(k\)。其\(k\)与\(i\)上下文关联性强于\(k\)与\(j\),则我们最后的词向量应当使\(P_{ik} / P_{jk}\)尽可能大,反之则尽可能小。现在尝试创建以该三个单词为变量的函数。 \[F(w_i, w_j, \widetilde{w}_k)=\frac{P_{ik}}{P_{jk}}\] 适合这个式子本身的\(F\)将会特别多,现在开始对\(F\)加限制条件。首先需要让\(F\)函数将等式右边的ratio在数学意义上表示出来,又因为(在之前的工作中)词向量空间本质上是一个线性空间,该ratio也是反映的\(i\)与\(j\)的差异性,所以尝试把\(w_i\)和\(w_j\)替换为\((w_i-w_j)\)。函数变为 \[F(w_i-w_j, \widetilde{w}_k) = \frac{P_{ik}}{P_{jk}}\] 显然,现在等式右侧为一个标量,左侧是一个关于向量的函数。为简化,先把左侧的函数arguments替换为向量内积让两边dimension相同。 \[F((w_i-w_j)^T\widetilde{w}_k) = \frac{P_{ik}}{P_{jk}}\] 接下来的这一步有些tricky,首先在构建\(X\)的时候,如果\(j\)在\(i\)的context中,那么我们同时可以推测\(i\)也在\(j\)的context中,考虑到每个word最终只有一个向量与之对应,因此推测出\(F\)应为一个关于\(w_i\)和\(\widetilde{w}_j\)的对称函数。为达成这一目标,首先要求\(F\)为一个从\((\mathbb{R}, +)\)映射到\((\mathbb{R}, \times)\)的一个group homomorphism,通式为 \(F(a + b) = F(a) \times F(b)\)。因此我们的\(F\)变为 [Read More]

Paper-summary-1: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

ref: Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping 首先整篇文章主要讲的是一系列针对 DTW 算法的时间优化策略,还包括与一部分主流的同类算法的效果对比,而且根据paper中所列的结果来看,这些优化策略在时间上确实表现出来了明显的提升,非常值得借鉴。 在这里DTW的全称是“Dynamic Time Warping”,本身是一个衡量两个序列相似程度的算法,而且在众多相关文章的验证下,可以说是针对时间序列问题的最好算法之一,但同时它也存在着同类算法的共同瓶颈问题,时间复杂度较高而无法适用于“大”数据。而这也是本篇paper旨在解决的。 How does DTW (Dynamic Time Warping) work Explaination:假设已知两个序列\(X\)和\(Y\),长度分别为\(M\)和\(N\)。当\(M=N\)的时候,此时最简单且直接的办法就是算“点对点”的欧式距离,但局限就是要求X与Y的序列长度相同,而显而易见这个并不能保证任何场景都满足。当两者都不等长时候,就不能像理想状况那样找到唯一且符合常识的对应关系,在这种情况下,我们需要综合考虑两个序列所能构成的所有“点与点”之间的对应关系(或者说是X上的任意点与Y上的任意点的距离),这样我们总共可以获得 \(M×N\) 对点,也就是\(M×N\)个“距离”。如果我们将这些距离用一个M×N的矩阵表示出来 \(Cost \in R^{M\times N}\),在这里元素\(Cost_{i,j}\) 指的就是X上的index 为i的值与Y上index为j的值的距离。为了获得一个可以表示similarity的数字(这里定义为\(cost\)),DTW提供了一个基于该矩阵获得measure的思路:找一条起始于\((0,0)\)终于\((M-1,N-1)\)而且monotonic的路径,要求该路径所覆盖的“距离”总和最小。问题也因此被简化为了一个动态规划中的“最短路径”问题。 Definitions and Notations Definition-1: 定义有序序列 \(T = t_1, t_2, ..., t_m\)。由于\(T\)很长,所以我们会用它的子序列来作比较。 Definition-2: 定义序列 \(T_{i,k} = t_i, t_{i+1}, ..., t_{i+k-1}\)为长度为\(k\),起始于\(t_i\)的\(T\)的子序列。 Definition-3: 假设有序列\(Q\)和\(C\),当\(|Q|=|C|\)时,定义\(Q\)和\(C\)之间的Euclidean distance (ED) 为: \[ ED(Q,C) = \sqrt{\sum_{i=1}^n(q_i - c_i)^2} \] 为避免产生歧义,用\(C\)来表示\(T_{i,k}\)作为用来比较的子序列,用\(Q\)来表示被比较的Query。之前也提到DTW计算过程被简化为一个“最短路径问题”。 [Read More]

Paper-summary-2: XGBoost- A Scalable Tree Boosting System

Introduction XGBoost对大部分人而言应该是不陌生的,在kaggle等数据科学竞赛的平台上凭借其准确快速的特性吸引了一大批拥趸。这次花了几天时间自己读了一下该论文,作者的基于GBDT的算法改进以及计算机系统设计层面的优化非常有创新性。不仅如此,作者在paper中将复杂度分析以及伪代码的部分也写的非常的详细。总的来说这是一篇很有价值的paper,适合精读一次。 Goal of the paper Describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientist to achieve SOTA results. 这是paper中的原话,而关键也在于这个system。相较于别的提升树模型,XGboost的一个主要优势也是对不同问题的高度兼容性以及稳定性。稳定主要体现在算法层面及系统层面。算法层面优化针对稀疏数据的处理方法以及对连续特征的分组方法。而系统层面提出了一种缓存管理的方法用来降低资源损耗。二者结合后使得XGBoost用相对当时现有的系统更少的计算资源在亿级数据层面获得不错效果。 Major contribution from this paper 提出一个highly scalable的端对端提升树系统。 提出加权版的quantile sketch用来处理连续数据分段。 提出适合稀疏数据的并行学习算法。 提出一种高效的缓存结构用来辅助out-of-core learning. Pre-defined Notations 在正文之前,先来统一一下数学符号以及一些关键性的概念。 \(\mathcal{D} = \{(x_i, y_i)\} (|\mathcal{D}| = n,x_i \in\mathbb{R}^m, y_i\in \mathbb{R})\) : 样本数为\(n\)的数据集,其中feature的个数为\(m\)。 \(\mathcal{D}_k = \{(x_{1k}, h_{1}),(x_{2k}, h_{2}),...,(x_{nk}, h_{n})\}\) : 第k个feature里的值及对应的二阶导数。 [Read More]
Math 

Setup Rstudio Server and Shiny Server on AWS EC2

In people’s traditional views, R always comes along with analysis and statistics. As R users, we all knew that R is not only a local software limited to analysis toolkit but has some powerful features to do fancy things. Thanks to Rstudio team who helped develop fundaments for R community, like Rstudio server for group project and shiny server for hosting shiny apps. Here are some experiences or traps I have encountered in setting these two servers on EC2. [Read More]

Recursion and Dynamic Programming

This blog mainly relied on the book, Cracking the coding interviews. Approach to Recursive Problems Recursive solutions are built on solutions to subproblems. This mostly referred to computing \(f(n)\) by modifying \(f(n-1)\). There are many ways to divide the problems into subproblems, but the three most common approaches to develop an algorithm are: bottom-up. top-down. half-and-half. (1) Bottom-Up Approach Starting with solving the problem for a simple case, like a list with one element. [Read More]

Some Array Manipulation Problems

What is Array? An array is an aggregate data structure that is designed to store a group of objects of the same or different types. In python array is simply represented as one list with the complexity table. case append get item set item get slice Avg complexity \(O(1)\) \(O(1)\) \(O(1)\) \(O(k)\) Problems This blog is to keep log of some basic array manipulation problems collected these days to help understand. [Read More]

Linked List and Classic problems

In preparing for the coding interview, I will start writing blogs to record the hints and tips about the basic data structures. The series of the blog will start from Linked List in this one. Definition: According to the book, Cracking the Coding Interview, a linked list is a data structure that represents a sequence of nodes where each node stores a value and points to next node. diagram [Read More]

Find the Longest Successive Path in a Matrix

Problem This is a coding problem which I previously failed in a online test. You are given a matrix filled with positive numbers and the task is build a function that could return the longest successive sequence in this matrix. The sequence must be in ascending order. \[ M = \left[ \begin{array}{rrrrr} 4 & 1 & 3 \\ 4 & 7 & 6 \\ 8 & 9 & 10 \\ \end{array} \right] \] [Read More]
Code  DFS 

Verify Path In Matrix

Problem You are given a matrix filled with characters and a string. Task is building a function to find out whether the string exist in the matrix. For example, suppose we have a matrix like following: \[ M = \left[ \begin{array}{rrrrr} A & B & C \\ G & D & X \\ E & F & Y \\ \end{array} \right] \] and a string “AGDXY”, then the function should return True because the string could be represented by a path in the matrix. [Read More]