Word2vec Python



TL;DR - word2vec is awesome, it's also really simple. Learn how it works, and implement your own version.

Introduction

Word2vec is a technique for natural language processing. The word2vec algorithm uses a shallow neural network to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector, which can then be fed into further NLP or machine learning processes.

Since joining a tech startup back in 2016, my life has revolved around machine learning and natural language processing (NLP). Trying to extract faint signals from terabytes of streaming social media is the name of the game. Because of this, I'm constantly experimenting with and implementing different NLP schemes; word2vec is among the simplest and, coincidentally, one of the best at yielding predictive value. The underpinnings of word2vec are exceptionally simple and the math is borderline elegant, yet the system as a whole produces exceptional results. This tutorial aims to teach the basics of word2vec while building a barebones implementation in Python using NumPy. Note that the final Python implementation is not optimized for speed or memory usage, but for ease of understanding.

The goal of word2vec, and of most NLP embedding schemes, is to translate text into vectors so that it can be processed using operations from linear algebra. Vectorizing text data lets us build predictive models that take these vectors as input and do something useful with them. If you understand both forward and back propagation for plain vanilla neural networks, you already understand 90% of word2vec.

It's Actually Really Simple (I promise!)

Word2vec is actually a collection of two different methods: continuous bag-of-words (CBOW) and skip-gram [1]. Given a word in a sentence, let's call it w(t) (also called the center word or target word), CBOW uses the context, or surrounding words, as input. For instance, if the context window C is set to C=5, then the input would be the words at positions w(t-2), w(t-1), w(t+1), and w(t+2): the two words before and after the center word w(t). Given this information, CBOW then tries to predict the target word.

Figure 1: word2vec CBOW and skip-gram network architectures

The second method, skip-gram, is the exact opposite. Instead of inputting the context words and predicting the center word, we feed in the center word and predict the context words. This means that w(t) becomes the input while w(t-2), w(t-1), w(t+1), and w(t+2) are the ideal output. For this post, we're only going to consider the skip-gram model, since it has been shown to produce better word embeddings than CBOW.

The concept of a center word surrounded by context words can be likened to a sliding window that travels across the text corpus. As an example, let's encode the following sentence: 'the quick brown fox jumps over the lazy dog' using a window size of C=5 (two words before and two after the center word). As the context window slides across the sentence from left to right, it gets populated with the corresponding words. When the context window reaches the edges of the sentence, it simply drops the furthest window positions. Below is what this process looks like. Note that instead of w(t), w(t+1), etc., the center word has become x_k and the context words have become y_c.

Figure 2: a sliding window example

Because we can't feed text directly through a network, we need to employ one-hot encoding. This means we have a vector of length v, where v is the total number of unique words in the text corpus (or fewer, if we truncate the vocabulary). Each word corresponds to a single position in this vector, so when encoding the word v_n, every position in the vector is set to zero except position n, which becomes a one. Figure 3 below shows a one-hot encoding of examples 1, 5, and 9 from Figure 2 above.

Figure 3: one-hot encoding

→ For future reference, the column vectors \( y_{c=1}, \dots, y_{c=C} \) are referred to as panels.
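
To make the sliding window and one-hot encoding concrete, here is a minimal NumPy sketch of how the training pairs shown in Figures 2 and 3 could be generated. The variable and function names are illustrative rather than taken from the linked repo.

```python
import numpy as np

corpus = 'the quick brown fox jumps over the lazy dog'.split()
vocab = sorted(set(corpus))                      # unique words in the corpus
word2idx = {w: i for i, w in enumerate(vocab)}   # word -> vocabulary index
v = len(vocab)                                   # vocabulary size (8 here)

def one_hot(idx, size):
    """Return a one-hot vector with a 1 at position idx."""
    vec = np.zeros(size)
    vec[idx] = 1.0
    return vec

# Slide a window of two words on either side of the center word x_k,
# dropping positions that fall off the edge of the sentence.
window = 2
training_data = []
for k, center in enumerate(corpus):
    x = one_hot(word2idx[center], v)
    context = [one_hot(word2idx[corpus[j]], v)
               for j in range(k - window, k + window + 1)
               if j != k and 0 <= j < len(corpus)]
    training_data.append((x, context))           # (x_k, [y_c=1, ..., y_c=C])
```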

So far, so good right? Now we need to feed this data into the network and train it. Most of the literature describing skip-gram uses the same graphic depicting the final layer of the model somehow having three or more matrices. I found this rather confusing while I was reading the white papers so I ended up digging through the original source code to get to the bottom of it. I figure there are other people like me, so I created another version of the architecture that I find a little easier to digest.

Figure 4: skip-gram network architecture

Forward Pass

Now that the text data has been encoded, let's proceed through a single forward pass of the network. Step one is getting to the hidden layer h.

\[ h = x^T W \]

Since \( x \) is a one-hot encoded vector, \( h \) is simply the \( k^{th} \) row of matrix \( W \).

\[ h = W_{(k,:)}^T := v_{w_I}^T \]

Proceeding forward, we need to calculate the values as they exit the network.

\[ u_c = W^{\prime T} h = W^{\prime T} W^T x \]

Although \( u_c \) gets us through the network, we still need to apply a softmax function. The softmax will compress each element of \( u_c \) to a range of [0,1] while also forcing the sum of \( u_c \) to equal one. This will help in computing network error and backpropagation later on.

\[\begin{aligned} y_c & = \text{Softmax}(u) \\[1.5em] p\left( w_{c,j} = w_{O,c} \mid w_I \right) & = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j^\prime=1}^V \exp(u_{j^\prime})} \end{aligned}\]

For each one-hot encoded center word \( x \), matrix \( W^{\prime} \) will output a certain set of values. This means that all the panels associated with center word \( x \) share the same input values, thus:

\[ u_{c,j} = u_j = {v_{w_j}^{\prime}}^T \cdot h \qquad \text{for} \; c = 1,2,\dots,C \]

where \( v_{w_j}^{\prime} \) is the output vector of the j-th word in the vocabulary, \( w_j \). Again, because we're dealing with one-hot encodings, \( v_{w_j}^{\prime} \) is a column vector taken from the hidden → output matrix \( W^{\prime} \). Refer to Figure 4 for clarity.

In code, this will take the form of the following:
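
Here is a minimal NumPy sketch consistent with the equations above, assuming a one-hot input vector x, an input → hidden matrix W of shape (v, n), and a hidden → output matrix W′ of shape (n, v); the names are illustrative rather than taken from the linked repo.

```python
def forward_pass(x, W, W_prime):
    """Forward pass for a single center word.
    x        : one-hot vector, shape (v,)
    W        : input -> hidden weights, shape (v, n)
    W_prime  : hidden -> output weights, shape (n, v)
    """
    h = W.T @ x           # hidden layer: picks out the k-th row of W (h = x^T W)
    u = W_prime.T @ h     # raw score u_j for every word in the vocabulary
    y = softmax(u)        # predicted context-word distribution
    return y, h, u
```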

With the softmax calculation taking the form of:
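
A numerically stable version might look like this:

```python
def softmax(u):
    """Map the raw scores u to probabilities that sum to one."""
    e = np.exp(u - np.max(u))   # subtract the max for numerical stability
    return e / np.sum(e)
```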

The forward pass is fairly simple and differs little from that of a standard, fully connected neural network.

Backpropagation & Training

Now, to improve the weights within \( W \) and \( W^{\prime} \), we're going to use stochastic gradient descent (SGD) to backpropagate the errors, which means we need to calculate the loss on the output layer.

\[\begin{aligned} E & = -\log \mathbb{P}\left( w_{O,1}, w_{O,2},\dots,w_{O,C} \mid w_I \right) \\[1.5em] & = -\log \prod_{c=1}^C \frac{\exp(u_{c,j_c^*})}{\sum_{j^{\prime}=1}^V \exp(u_{j^{\prime}})} \\[1.5em] & = -\sum_{c=1}^C u_{j_c^*} + C \cdot \log \sum_{j^{\prime}=1}^V \exp(u_{j^{\prime}}) \end{aligned}\]

where \( j_c^* \) represents the index of the actual c-th output context word in the vocabulary.

Now that the loss has been calculated, we're going to employ the chain rule to distribute the error amongst the weight matrices \( W \) and \( W^{\prime} \). The first step is taking the derivative of \( E \) with respect to every element on every panel of the output layer, \( u_{c,j} \).

\[ \frac{\partial E}{\partial u_{c,j}} = y_{c,j} - t_{c,j} := e_{c,j} \]

Where \( t_{c,j} \) is the ground truth for that particular panel. To simplify the notation going forward, we'll define the following:

\[ EI_j = \sum_{c=1}^C e_{c,j} = \sum_{c=1}^C \left( y_{c,j} - t_{c,j} \right) = \frac{\partial E}{\partial u_j} \]

The column vector \( EI \), whose elements are \( EI_j \), is the row-wise sum of the prediction errors across each context-word panel for the current center word. Proceeding backwards, we need to take the derivative of \( E \) with respect to \( W^{\prime} \), the hidden → output matrix.

\[\begin{aligned} \frac{\partial E}{\partial w_{i,j}^{\prime}} & = \sum_{c=1}^C \frac{\partial E}{\partial u_{c,j}} \cdot \frac{\partial u_{c,j}}{\partial w_{i,j}^{\prime}} \\[1.5em] & = \sum_{c=1}^C \left( y_{c,j} - t_{c,j} \right) \cdot h_i \\[1.5em] & = EI_j \cdot h_i \end{aligned}\]

Therefore, the gradient descent update equation for \( W^{\prime} \) becomes:

\[ w_{i,j}^{\prime (new)} = w_{i,j}^{\prime (old)} - \eta \cdot EI_j \cdot h_i \]

Note that \( \eta \) is the learning rate. Next, let's formulate the error update equation for the input → hidden layer weights \( W \) by taking the derivative of the error with respect to the hidden layer.

\[\begin{aligned} \frac{\partial E}{\partial h_i} & = \sum_{j=1}^V \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial h_i} \\[1em] & = \sum_{j=1}^V EI_j \cdot w_{i,j}^{\prime} \end{aligned}\]

This allows us to then calculate the loss with respect to \( W \).

\[\begin{aligned} \frac{\partial E}{\partial w_{k,i}} & = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial w_{k,i}} \\[1em] & = \sum_{j=1}^V EI_j \cdot w_{i,j}^{\prime} \cdot x_k \end{aligned}\]

and finally, we can formulate the gradient descent weight update equation for \( W \).

\[ w_{k,i}^{(new)} = w_{k,i}^{(old)} - \eta \cdot \sum_{j=1}^V EI_j \cdot w_{i,j}^{\prime} \cdot x_k \]

At this point, everything needed to train the network has been established and we just need to code it.
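
A compact training loop in the same notation could look like the following sketch; the hidden-layer size, learning rate, and iteration count are placeholder values, training_data, forward_pass, and softmax come from the earlier snippets, and backprop is defined just below.

```python
n = 10        # hidden layer (embedding) size -- placeholder value
eta = 0.01    # learning rate
epochs = 5000

rng = np.random.default_rng(0)
W = rng.uniform(-0.8, 0.8, (v, n))         # input -> hidden weights
W_prime = rng.uniform(-0.8, 0.8, (n, v))   # hidden -> output weights

for epoch in range(epochs):
    loss = 0.0
    for x, context in training_data:
        y, h, u = forward_pass(x, W, W_prime)
        # EI_j: prediction error summed across the C context-word panels
        EI = np.sum([y - t for t in context], axis=0)
        W, W_prime = backprop(x, h, EI, W, W_prime, eta)
        # E = -sum_c u_{j_c*} + C * log(sum_j' exp(u_j'))
        loss += -np.sum([u[np.argmax(t)] for t in context]) \
                + len(context) * np.log(np.sum(np.exp(u)))
```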

Where the backpropagation function is defined as:
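
In the same spirit, the weight updates derived above reduce to two outer products. Again, this is a sketch rather than the original code:

```python
def backprop(x, h, EI, W, W_prime, eta):
    """Single gradient descent step for both weight matrices."""
    dW_prime = np.outer(h, EI)        # dE/dW'_{i,j} = EI_j * h_i
    dW = np.outer(x, W_prime @ EI)    # dE/dW_{k,i}  = x_k * sum_j EI_j * W'_{i,j}
    return W - eta * dW, W_prime - eta * dW_prime
```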

That's it! Only slightly more complicated than a simple neural network. To avoid posting redundant sections of code, you can find the completed word2vec model along with some additional features at this GitHub repo (here).

Results

As a simple sanity check, let's look at the network output given a few input words. This is the output after 5,000 iterations.

| Input word | brown    | dog      | fox      | jumps    | lazy     | over     | quick    | the      |
|------------|----------|----------|----------|----------|----------|----------|----------|----------|
| fox        | 2.45e-01 | 4.34e-04 | 4.45e-04 | 2.53e-01 | 2.34e-05 | 2.53e-01 | 2.45e-01 | 7.62e-07 |
| lazy       | 5.81e-05 | 3.32e-01 | 2.42e-04 | 1.11e-05 | 1.91e-04 | 3.33e-01 | 4.51e-04 | 3.33e-01 |
| dog        | 1.85e-07 | 3.17e-04 | 1.31e-03 | 1.29e-04 | 4.98e-01 | 1.42e-05 | 4.86e-06 | 4.99e-01 |

These sample results show that the output probabilities for each center word are split fairly evenly between the correct context words. If we zero in on the word lazy, we can see that the probabilities for the words dog, over, and the are split fairly evenly at roughly 33.33% each. This is exactly what we want.

Improvements

Shortly after the initial release of word2vec, a second paper detailing several improvements was published [2]. Amongst these proposed improvements are:

Phrase Generation — This is the process in which commonly co-occurring words such as 'san' and 'francisco' become 'san_francisco'. The result of phrase generation is a cleaner, more useful, and user-friendly vocabulary list. Phrase generation is based on the following equation, which utilizes the unigram and bigram vocabulary counts for the given corpus.

\[ \text{score}(w_a, w_b) = \frac{\text{count}(w_a w_b) - \delta}{\text{count}(w_a) \times \text{count}(w_b)} \]

The numerator consists of the total number of times the bigram formed by words \( w_a \) and \( w_b \) appears in the corpus. This is then divided by the count of \( w_a \) multiplied by the count of \( w_b \). The variable \( \delta \) is referred to as the discounting coefficient and prevents the formation of word phrases consisting of very infrequent words [3].

For longer word combinations such as 'new york city', an iterative phrase generation approach can be used. As an example, on the initial pass 'new' and 'york' could be combined into 'new_york', with a second pass combining 'new_york' with 'city' yielding the desired phrase 'new_york_city'. According to Mikolov et al. (2013), typically 2-4 passes with decreasing threshold values yielded the best results.
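
As a rough sketch of a single scoring pass (the counting helpers and the δ and threshold values below are assumptions for illustration, not values from the paper):

```python
from collections import Counter

def phrase_candidates(tokens, delta=5, threshold=1e-4):
    """Return bigrams whose phrase score clears the threshold (one pass)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = []
    for (w_a, w_b), n_ab in bigrams.items():
        score = (n_ab - delta) / (unigrams[w_a] * unigrams[w_b])
        if score > threshold:
            phrases.append(f'{w_a}_{w_b}')
    return phrases
```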

Subsampling — The purpose of subsampling is to counter the imbalance between frequent and rare words; no single word should represent a sizable portion of the corpus. To correct for this, words are randomly discarded with the following probability:

\[ P\left( w_i \right) = 1 - \sqrt{\frac{t}{f\left( w_i \right)}} \]

Where \( f(w_i) \) is the frequency of word \( w_i \) and \( t \) is a user-specified threshold. In Mikolov et al. (2013), values for \( t \) were typically around \( 10^{-5} \). As pointed out here, this probability calculation differs from the official word2vec C implementation. Below is the modified equation found in the C implementation:

\[ P(w_i) = \left( \sqrt{\frac{z(w_i)}{0.001}} + 1 \right) \times \frac{0.001}{z(w_i)} \]

Where \( z(w_i) \) is the fraction of the corpus represented by the word \( w_i \). The higher \( P(w_i) \) is, the greater the chance of keeping \( w_i \).
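
A sketch of the C-style keep probability; the helper names are illustrative, and counts is assumed to be a word → count mapping over the corpus:

```python
def keep_probability(word, counts, total, t=1e-3):
    """P(keeping w_i) as in the word2vec C code; z is the corpus fraction of w_i."""
    z = counts[word] / total
    return (np.sqrt(z / t) + 1) * (t / z)

def subsample(tokens, counts, rng=np.random.default_rng(0)):
    """Randomly discard frequent words according to their keep probability."""
    total = sum(counts.values())
    return [w for w in tokens
            if rng.random() < keep_probability(w, counts, total)]
```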

Negative Sampling — Often referred to as just NEG, this is a modification to the backpropagation procedure in which only a small percentage of the errors are actually considered. For instance, using example #9 from Figure 3 above, dog is the target word, while the and lazy are the context words. This means that in an ideal situation, the network will return a '0' for everything except the and lazy, which will be '1'. In short, context words become 'positive' while everything else becomes 'negative'. With negative sampling, we only concern ourselves with the positive words and a small percentage of the negative words. This means that instead of backpropagating the errors from all 8 words in the vocabulary, we backpropagate the errors from the positive words the and lazy plus \( k \) negative words. Because the vocabulary size in this example is only 8, the advantage of negative sampling may not be immediately apparent. Now imagine for a moment that you have a corpus that yielded billions of training samples, you've had to clip your vocabulary to 10,000 words, and your embedding vector length is 300. With negative sampling we're updating a few positive words and a few negative words (let's say \( k = 10 \)), which translates to only around 3,000 individual weights in \( W^{\prime} \). That's roughly 0.1% of the 3 million weight values we would otherwise be updating without negative sampling!

With negative sampling, the full softmax term \( \log p(w_O \mid w_I) \) in the training objective is replaced by the following:

\[ \log \sigma \left( {v_{w_O}^{\prime}}^T v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma \left( -{v_{w_i}^{\prime}}^T v_{w_I} \right) \right] \]

As we can see, the primary difference between this and the standard stochastic gradient descent version is that instead of summing over the entire vocabulary, we now only include the positive word plus \( k \) negative words drawn from a noise distribution \( P_n(w) \). As for how large \( k \) should be, Mikolov et al. had this to say:

Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5.

- Mikolov et al. (2013)

The negative words themselves are drawn from the noise distribution \( P_n(w) \). Something to keep in mind is that the official C implementation uses the unigram distribution raised to the 3/4 power, as seen below [4]:

\[ P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{n} \left( f(w_j)^{3/4} \right)} \]
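
Drawing the \( k \) negative words can then be sketched as follows; the helper is illustrative and not part of the original implementation:

```python
def negative_samples(counts, vocab, k=5, rng=np.random.default_rng(0)):
    """Draw k negative words from the unigram distribution raised to the 3/4 power."""
    freqs = np.array([counts[w] for w in vocab], dtype=float)
    p = freqs ** 0.75
    p = p / p.sum()
    return list(rng.choice(vocab, size=k, replace=True, p=p))
```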

Note that even without negative sampling, only the weights for the target word in matrix \( W \) are updated; because the input \( x \) is one-hot encoded, the gradient only ever touches that single row of \( W \).

Conclusion

Thanks for reading, I hope you found this useful! If anything still seems unclear, please let me know so I can improve this tutorial. For further reading, I highly suggest working through each of the references below.

References

  1. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space (submitted 16 Jan 2013).
  2. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality (submitted 16 Oct 2013).
  3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality (submitted 16 Oct 2013).
  4. Chris McCormick. Word2Vec Tutorial Part 2 - Negative Sampling (January 11, 2017).
