Word Embeddings

I recently read an excellent blog post about word embedding models, something I've been fascinated by for some time now. 

To simplify somewhat, they're vector models of word relationships in a corpus: each lexical item is represented as a point in a high-dimensional space, which you can then, in principle, project down to a manageable number of dimensions while keeping the meaningful relationships. Spatial relationships among items can then potentially tell you something both about which words mean (or do) similar things and about the corpus itself. Moreover, because it's all linear algebra, you can perform mathematical operations on items in the space. The most famous example of this, and one that's been going around a lot on social media lately, is:

king - man + woman = queen
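To make the arithmetic concrete, here is a minimal Python sketch. The two-dimensional vectors are invented for illustration (they don't come from any trained model), and the `analogy` helper is a hypothetical name: it adds and subtracts the vectors, then returns the remaining word whose vector has the highest cosine similarity to the result.

```python
import numpy as np

# Toy 2-d vectors, invented for illustration (not from a trained model).
# Dimension 0 loosely encodes "royalty", dimension 1 "gender".
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, and return
    the closest remaining word (the query words themselves are excluded)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy(["king", "woman"], ["man"], vectors)
print(result)
```

In a real model the equation is only approximate: the result of the arithmetic is a point in the space, and "queen" is the nearest word to that point rather than an exact match.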

For a variety of reasons (better coding skills, new familiarity with matrix algebra, interest in external computational validation of semantic intuitions), now seemed like an excellent time to level up. So, I decided to get to work in R (after some cleaning in Python, with the NLTK package).

The tutorial for the wordVectors package does a lot of fun things with a large corpus of cookbooks (closest words to fish: salmon, haddock, cod...), but I figured why not play around with some other things? Why not, say, tweets?

I have a corpus of ~17,000 tweets, all in (basilectal) AAE, that I collected for my research on geographic patterns in AAE on social media. While this is definitely on the small end, it seemed suitable as a trial run, and I'm quite pleased with the results.

For instance, the closest words to eem include negation terms (don't, didn't, ain't, can't) and negative polarity items (even, much, yet, nomo, anymore, anything). Among the closest to nuttin are nuffin and sayin.
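"Closest" here typically means highest cosine similarity between word vectors. A minimal Python sketch of that lookup follows; the tiny vocabulary borrows the cookbook example, the numbers are invented for illustration, and `closest_words` is a hypothetical helper rather than any package's function.

```python
import numpy as np

# Invented 3-d vectors for a toy vocabulary (not output from a real model).
vocab = ["fish", "salmon", "cod", "sugar", "flour"]
embeddings = np.array([
    [0.90, 0.10, 0.00],   # fish
    [0.80, 0.20, 0.10],   # salmon
    [0.85, 0.05, 0.20],   # cod
    [0.10, 0.90, 0.10],   # sugar
    [0.00, 0.80, 0.30],   # flour
])

def closest_words(word, vocab, embeddings, n=3):
    """Return the n words most cosine-similar to `word`, with their scores."""
    # Scale every row to unit length so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # indices from most to least similar
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:n]

neighbors = closest_words("fish", vocab, embeddings, n=2)
print(neighbors)
```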

What I'm finding really interesting are the results of projecting the whole thing down to a two-dimensional space, even before having really cleaned the data:

Things that belong together are very clearly together: happy is right under birthday (top left). Nuffin and nuttin are both in the same place, as are somebody and sumn. Talm and talmbout are right on top of one another (bottom left), and quite far away from talk and talking (middle right), with said in between (exactly what I would predict based on the material I presented at NWAV). Eem is right next to even in a cloud of negative words: ain't, don't, ion (i.e. "I don't..."), all at the bottom right. Question words are all clustered together at the top (slightly left of center). Verbs (sleep, eat, talk, take, give, go, hit) are all in a cluster in the middle right. Dat and doe are right by one another (top left). Hell is in the immediate vicinity of both nah and yeah.
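One common way to get a two-dimensional picture like this is principal component analysis (PCA). The Python sketch below is an assumption for illustration and not necessarily the method behind the plot described above (t-SNE is another popular choice); the embeddings are random stand-ins and `project_2d` is a hypothetical helper.

```python
import numpy as np

def project_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    # Center each dimension so the components describe variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in data: 6 "words" in a 50-dimensional space.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(6, 50))

coords = project_2d(fake_embeddings)
print(coords.shape)
```

Each row of `coords` is an (x, y) position that can be plotted and labeled with its word. Keep in mind that any projection to two dimensions throws information away, so words that sit close together in the plot are not guaranteed to be close in the full space.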

Of course, the fact that I haven't really cleaned the data means that don, can, ain form a different cluster from dont, cant, aint, but an actual analysis would fix that (and exclude http, https, and all the floating alphanumeric bits).
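As a rough idea of what that cleanup could look like, here is a minimal Python sketch using only the standard library. It is an assumption about one reasonable approach, not the pipeline actually used here: it strips URLs, deletes apostrophes so that don't and dont become the same token, and drops tokens containing digits. The sample tweet is made up, and `clean_tweet` is a hypothetical name.

```python
import re

URL_RE = re.compile(r"https?://\S+")      # whole URLs, so http/https never become tokens
APOSTROPHE_RE = re.compile("['\u2019]")   # straight and curly apostrophes
TOKEN_RE = re.compile(r"[a-z0-9]+")       # runs of lowercase letters and digits

def clean_tweet(text):
    """Lowercase a tweet, strip URLs, merge contractions, keep alphabetic tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # Delete apostrophes rather than splitting on them: "don't" -> "dont",
    # instead of the stray "don" and "t" a naive tokenizer produces.
    text = APOSTROPHE_RE.sub("", text)
    # Drop tokens containing digits (floating alphanumeric fragments).
    return [tok for tok in TOKEN_RE.findall(text) if tok.isalpha()]

# Made-up example tweet, not from the corpus.
tokens = clean_tweet("I can't even, don't @ me https://t.co/abc123 x7f")
print(tokens)
```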

Even with messy data, there are some intriguing relationships: jawn is right between miss and somebody/sumn (and forms a triangle with somebody/sumn and baby/girl). In fact, the nearest vectors to jawn include jont and philly.

Moreover, performing vector operations like jawn - philly yields jont, the Washington, DC equivalent of jawn, in the top 3 results (pragmatics: guess which rank). Nuttin - nyc yields nuffin. This is fascinating, in part because geographical variation is showing up in a very abstract high-dimensional space, almost like a regional AAE translator:

jawn - Philly = jont

nuttin - NYC = nuffin  

The next step is to do some transformations of the vector space to dig into these relationships: what happens when you frame things in terms of an opposition between love and hate? Where would jawn fall relative to girlfriend?  
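One simple way to frame things as an opposition is to project each word onto the axis running from one pole to the other. The Python sketch below assumes that approach; the three-dimensional vectors are invented for illustration, and `axis_score` is a hypothetical helper.

```python
import numpy as np

def axis_score(word_vec, pos_vec, neg_vec):
    """Cosine between a word and the axis from neg_vec to pos_vec.

    Positive scores lean toward the positive pole, negative scores toward
    the negative pole, and scores near zero are roughly neutral.
    """
    axis = pos_vec - neg_vec
    axis = axis / np.linalg.norm(axis)
    unit = word_vec / np.linalg.norm(word_vec)
    return float(unit @ axis)

# Invented vectors standing in for real embeddings.
love = np.array([1.0, 0.2, 0.0])
hate = np.array([-1.0, 0.2, 0.0])
near_love = np.array([0.8, 0.1, 0.3])    # hypothetical word on the "love" side
near_hate = np.array([-0.7, 0.3, 0.1])   # hypothetical word on the "hate" side

print(axis_score(near_love, love, hate))
print(axis_score(near_hate, love, hate))
```

Scoring every word in the vocabulary this way gives a one-dimensional ranking along the love/hate opposition, and two such axes together give a custom two-dimensional plot.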

I have a lot of work to do to develop this, but already I can see some excellent potential for future research. 

-----

©Taylor Jones 2015
