The Problem With Twitter Maps

Twitter is trending

I'm a huge fan of dialect geography, and a huge fan of Twitter (@languagejones), especially as a means of gathering data about how people are using language. In fact, social media data has informed a significant part of my research, from the fact that "obvs" is legit, to syntactic variation in the use of the n-words. In less than a month, I will be presenting a paper at the annual meeting of the American Dialect Society discussing what "Black Twitter" can tell us about regional variation in African American Vernacular English (AAVE). So yeah, I like me some Twitter. (Of course, I do do other things: I'm currently looking at phonetic and phonological variation in Mandarin and Farsi spoken corpora).

Image of North America, entirely in Tweets, courtesy of Twitter Visual Insights: https://blog.twitter.com/2013/the-geography-of-tweets



Moreover, I'm not alone in my love of Twitter. Recently, computer scientists have claimed to have found regional "super-dialects" on Twitter, and other researchers have made a splash with their maps of vocatives in the US:


More and more, people are using social media to investigate linguistic questions. However, there are a number of serious dangers inherent in spatial statistics, and they are exacerbated by the use of social media data.

Spatial statistics is developing rapidly as a field, and there are a number of excellent resources on the subject I've been referring to as I dig deeper and deeper into the relationship between language and geography. Any of these books (I'm partial to Geographic Information Analysis) will tell you that people can, and do, fall prey to the ecological fallacy (assuming that some statistical relationship that obtains at one level, say, county level, holds at another level -- say, the individual). Or they ignore the Modifiable Areal Unit Problem -- which arises out of the fact that changing where you draw your boundaries can strongly affect how the data are distributed within those boundaries, even when the change is just in the size of the unit of measurement.

However, the statistical consideration that most fascinates me, and the one most likely to be overlooked in dealing with exciting social media data, is the problem of sampling.

Spatial Statistics aren't the same as Regular Statistics.

In regular statistics, more often than not, you study a sample. You can almost never study an entire population of interest, but it's not generally a problem. Because of the Law of Large Numbers, the bigger the sample, the more likely you are to be able to confidently infer something about the population the sample came from (I'm using the day-to-day meanings of words like "confidence" and "infer"). However, in the crazy, upside down world of spatial statistics, sampling can bias your results.

In order to draw valid conclusions about some kinds of spatial processes, it is necessary to have access to the entire population in question. This is a huge problem: if you want to use Twitter, there are a number of ways of gathering data that do not meet this requirement, and therefore lead to invalid conclusions (for certain questions). For instance, most people use the Twitter API to query Twitter and save tweets. There are a few ways you can do this. In my work on AAVE, I used code in Python to interact with the Twitter API, and asked for tweets containing specific words -- the API returned tweets, in order, from the last week. I therefore downloaded and saved them consecutively. This means, barring questionable behavior from the Twitter API (which is not out of the question -- they are notoriously opaque about just how representative what you get actually is), I can claim to have a corpus that can be interpreted as a population, not a sample. In my case, it's very specific -- for instance: all geo-tagged tweets that use the word "sholl" during the last week of April, 2014. We should be extremely careful about what and how much we generalize from this.
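To make that concrete, here is a minimal sketch of the kind of search-and-save loop I mean, using the Twython library (the credentials and file name are placeholders, and the 2014-era REST search endpoint is assumed):

```python
import json
from twython import Twython

# Placeholder credentials -- you would register an app with Twitter to get these
APP_KEY, APP_SECRET = "app-key", "app-secret"
OAUTH_TOKEN, OAUTH_TOKEN_SECRET = "token", "token-secret"

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# The REST search endpoint returns recent tweets (roughly the past week)
# matching a query -- here, tweets containing the word "sholl".
results = twitter.search(q="sholl", count=100)

with open("sholl_tweets.json", "a") as f:
    for tweet in results["statuses"]:
        # Keep only geo-tagged tweets, since the goal is mapping
        if tweet.get("coordinates"):
            f.write(json.dumps(tweet) + "\n")
```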

Many other researchers use either the Twitter firehose or gardenhose. The former is a real-time stream of all tweets. Because such a thing is massive and unmanageable, and requires special access and a supercomputer, others use the gardenhose. However, the gardenhose is a(n ostensibly random) sample of 10% of the firehose. Depending on what precisely you want to study, this can be fine, or it can be a big problem.

Why is sampling such a problem?

Put simply, random noise starts to look like important clusters when you sample spatial data. To illustrate this, I created some random data in R.

I first created 1,000 random x and 1,000 random y values, which I combined to make points with random longitudes (x values) and latitudes (y values). For fun, I made them all with values that would fit inside a box around the US (that is, x values from -65 to -118, and y values from 25 to... Canada!). I then made a matrix combining the two values, so I had 1,000 points randomly assigned within a box slightly larger than the US. That noise looked like this:

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

Before we continue, it's important to note two things. First, the above is random noise; we know this because I totally made it up. Second, even before doing anything else, it's possible to find patterns in it:

A density contour plot of random noise. Sure looks like something interesting might be happening in the upper left.


Even with completely random noise, some patterns threaten to emerge. What we can do if we want to determine whether a pattern like the above is actually random is to compare it to something we know is random. To get technical, it turns out that for a completely random spatial process, the counts of points falling in equal-sized cells follow a Poisson distribution, so when we take Twitter data, we can determine how far it deviates from random noise by comparing it to a Poisson distribution using a chi-squared test. For more details on this, I highly recommend the book I mentioned above. I've yet to see anyone do this explicitly (but it may be buried in mathematical appendices or footnotes I overlooked).
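For the curious, here is a minimal sketch of that comparison as a quadrat-count chi-squared test, run on the same kind of simulated noise (Python rather than R, and the 5 x 5 grid is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2014)
lon = rng.uniform(-118, -65, 1000)
lat = rng.uniform(25, 49, 1000)

# Quadrat counts: carve the box into a 5 x 5 grid and count points per cell
counts, _, _ = np.histogram2d(lon, lat, bins=5)
observed = counts.ravel()

# Under complete spatial randomness, every cell has the same expected count
# and the counts are approximately Poisson; chisquare() tests the departure.
chi2, p = stats.chisquare(observed)
print(f"chi-squared = {chi2:.1f}, p = {p:.3f}")
# A large p-value means no evidence that the pattern differs from random noise.
```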

This is what happens when we sample 100 points, randomly. That's 10%, the same proportion as the Twitter gardenhose:

A 100-point sample.


And this is what happens when we take a different 100 point random sample:

Another random 100-point sample from the same population.


The patterns are different. These two samples tell different stories about the same underlying data. Moreover, the patterns that emerge look far more pronounced than anything in the full data set.
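You can watch this happen in a few lines (again a Python sketch, with a crude grid-cell count standing in for the density contours):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-118, -65, 1000), rng.uniform(25, 49, 1000)])

def hottest_cell(sample, bins=5):
    """Return the center of the grid cell containing the most points."""
    counts, xe, ye = np.histogram2d(sample[:, 0], sample[:, 1], bins=bins)
    i, j = np.unravel_index(counts.argmax(), counts.shape)
    return round((xe[i] + xe[i + 1]) / 2, 1), round((ye[j] + ye[j + 1]) / 2, 1)

# Two independent 10% samples -- like two different pulls from the gardenhose
sample_a = pts[rng.choice(len(pts), 100, replace=False)]
sample_b = pts[rng.choice(len(pts), 100, replace=False)]

print(hottest_cell(sample_a))  # the apparent "hot spots" of the two samples
print(hottest_cell(sample_b))  # will usually land in different places
```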

To give a clearer example, here is a random pattern of points I made, actually overlaying the United States, after much wailing, gnashing of teeth, and googling of error codes in R. I didn't bother to choose a coordinate projection (relevant XKCD):

And here are four intensity heat maps made from four different random samples drawn from the population of random point data pictured above:

This is bad news. Each of the maps looks like it could tell a convincing story. But contrary to what map 3 suggests, Fargo, North Dakota is not the random point capital of the world; the apparent hot spot is just an artifact of sampling noise. Worse, this is all the result of a completely random sample, before we add any other factors that could potentially bias the data (applied to Twitter: first-order effects like uneven population distribution, uneven adoption of Twitter, biases in the way the Twitter API returns data, etc.; second-order effects like the possibility that people are persuaded to join Twitter by their friends, in person, etc.).

What to do?

The first thing we, as researchers, should all do is think long and hard about what questions we want to answer, and whether we can collect data that can answer those questions. For instance, questions about frequency of use on Twitter, without mention of geography, are totally answerable, and often yield interesting results. Questions about geographic extent, without discussing intensity, are also answerable -- although not necessarily exactly. Then, we need to be honest about how we collect and clean our data. We should also be honest about the limitations of our data. For instance, I would love to compare the use of nuffin and nuttin (for "nothing") by intensity, assigning a value to each county on the East Coast, and create a map like the "dude" map above -- however, since the two are technically separate data sets based on how I collected the data, such a map would be completely statistically invalid, no matter how cool it looked. Moreover, if I used the gardenhose to collect data, and just mapped all tokens of each word, it would not be statistically valid, because of the sampling problem. The only way a map like the "dude" map that is going around can be valid is if it is based on data from the firehose (which it looks like they did use, given that their data set is billions of tweets). Even then, we have to think long and hard about what the data generalizes to: Twitter users are the only people we can actually say anything about with any real degree of certainty from Twitter data alone. This is why my research on AAVE focuses primarily on the geographic extent of use, and why I avoid saying anything definitive about comparisons between terms or popularity of one over another.

Ultimately,  as social media research becomes more and more common, we as researchers must be very careful about what we try to answer with our data, and what claims we can and cannot make. Moreover, the general public should be very wary of making any sweeping generalizations or drawing any solid conclusions from such maps. Depending on the research methodology, we may be looking at nothing more than pretty patterns in random noise.

 

-----

©Taylor Jones 2014


 

 

LSA talk preview: Semantic Bleaching of Taboo Words, and New Pronouns in AAVE

Note: this post was coauthored with Christopher Hall.

TRIGGER WARNING: this post will discuss profanity, obscenity, taboo language, slurs, and racially charged terms.

I recently received word that an abstract Chris and I submitted to the Linguistic Society of America was accepted for a 30 minute talk at the LSA annual meeting in January of 2015. While exciting, this is also somewhat terrifying, because our research involves not just syntax, but taboo words, dialect divergence, and America's ugly racial history (and present). Outside of academia, there's an enormous amount of potential for misunderstanding, offense, hostility, and other ill feelings. Even among academics there's the potential for hurt feelings.

In brief, our research takes both recent work in syntax and recent work in sociolinguistics, and couples them with good, old-fashioned fieldwork and new computational methods (read: tens of thousands of tweets). However, the subject matter involves the emergence of a new class of pronouns in one (sub-)dialect of English from words that are considered offensive or taboo in other varieties of English. As such, it's potentially quite charged.

Before describing the research, it is absolutely crucial to note that:

  1. We work as descriptive linguists: this means we observe a real-world phenomenon and describe it.
  2. We neither condone nor disapprove of the data. Our job is simply to describe and analyze natural language as it is used in the world.
  3. Both authors are native speakers of the variety of English in question.

So what's the big deal? Well, we argue that there is an emerging class of words that function as pronouns (remember elementary school English class? A pronoun is a word that stands in for another noun or noun phrase) in some varieties of African American Vernacular English (AAVE), built out of the grammatical reanalysis of phrases including the n-word. Well, sort of the n-word, because there's excellent evidence that there are actually at least two n-words, and that some speakers of AAVE differentiate between them and use them in different contexts.

WARNING: from here on out, we will be discussing the use of words some deem extremely offensive. Seriously, just stop here if such discussion will offend you despite the above points. We will be using the actual words, not variants like b-h and n-. You've been warned!

Some preliminaries:

Pronunciation

One of the most potent slurs in American English is the racial epithet nigger (we warned you!). However, many white people oblivious to history and privilege don't hesitate to muse, "why can they [read: "black" people] use it, then?" Their observation -- that some black Americans use what sounds like the same word -- is valid, although the conclusion that this makes the use of slurs OK is not.

AAVE is (generally) what can be called r-less and l-less. That is, in some contexts, especially at the end of words or syllables and when not followed by a vowel, words that have an r or l in other dialects are pronounced as though they do not. The stereotypical Boston accent is r-less: "pahk the car in Hahvahd yahd." (Note: "car" comes before a vowel, and therefore the r is pronounced!).

So when some speakers of AAVE use the word nigga, it is understandably interpreted as an r-less variant of a word that underlyingly has an r. However, the supposed r never shows up, not even intervocalically (jargon for "between vowels").

So when people maintain that they're two different words, there seems to be good evidence for that claim. Note to white people: This does not give you license to use either. If you do not speak AAVE, and chances are you don't, you don't get to use either word. You WILL offend people, and no one will like you.

Semantic Bleaching

This is a term that has existed in linguistics for a long time, which we did not invent, so there is actually no pun intended. It means that a word, over time, loses shades of meaning. For our purposes, there is excellent research on "obscenity" in AAVE, the main argument being that many things that are considered obscene in other dialects have been semantically bleached. Spears (1998), for instance, argues that nigga, shit, bitch, and ass have been semantically bleached. In fact, Collins and Postal have shown that there is a particular grammatical construction that relies on the semantic bleaching of ass: the Ass Camouflage Construction (ACC), as in:

  • how ya no-phone-havin'-ass gonna call me?

Not content to just rely on the previous literature, we collected data from our stomping grounds: Harlem and the South Bronx, as well as West Philadelphia (mostly, this required little more than going outside and paying attention, although we did take notes on time, place, and type of use). We also used the Twython library for Python to extract and store 10,000 tweets using the word nigga. While this is a huge sample by regular sociolinguistic norms (where 500 data points is impressive), it's worth keeping in mind that it's about 1/60th of what is tweeted in an average afternoon.

tweets containing nigga from August 19 - September 18, 2014. 16 MILLION tokens.


In none of the 10,000 tweets we read was the word used as an epithet or slur (although there were some cheeky white people trying to test boundaries).

In fact, we argue that in this dialect, its referent is now human and male by default, but not always (an example of the not always: "I adopted a cat and I love that nigga like a person"). It is also not inherently specified for race, like nigger and other epithets are. In fact, race is often added to it, so the authors may be referred to in our neighborhoods as "that white nigga" and "the black nigga who was with him." Others include "asian nigga," and even "African nigga."

Among those who use the term, it is now a generic term like guy.

This shift in meaning seems to have happened sometime after 1972-ish, possibly in conjunction with the rise of the Black Power movement, as an attempt to reclaim the word, similar to some feminists reclaiming bitch and cunt. It was a necessary prerequisite for the super cool grammatical change our paper is actually about.

Grammatical Change: Pronouns or ...Imposters?!

The real point of our paper is about grammatical change. There exists a class of phrases first described by Collins and Postal, called Imposters. These are phrases that grammatically behave as though they are third person (reminder: he, she, it), but actually have first person (I, we) meaning. Great examples are:

  • Daddy is going to buy you an ice cream!
  • This reporter has found himself behind enemy lines.
  • The authors have already used 3 imposters in this very article.

Where the meanings are:

  • I am going to buy you an ice cream!
  • I have found myself behind enemy lines.
  • We have already used 3 imposters in this very article.

The key here is that the noun phrases behave in the syntax of the sentence as though they are 3rd person, but the actual meaning is first person -- we just decode it.

We argue that there are new pronouns in AAVE, but first we have to show that they're not just imposters. This is not trivial! For instance, Zilles (2005) argues that Brazilian Portuguese is developing a new first person pronoun, a gente ("ah zhen-tshy"), but Taylor (2009) argues that no such thing is happening, and it's just a popular imposter.

The Paper

We argue that a nigga is becoming a pronoun, meaning "I". The corresponding plural is niggas or niggaz. We also argue that there are two second person vocatives (that is, "terms of address") which are used depending on the social deference one wants to show: nigga and my nigga.

Yes. You read that correctly: we are claiming that saying my nigga signals politeness (...among speakers of this and only this dialect!!! Don't go saying Jones & Hall gave you the green light to say "my nigga" to your black friends!!!).

What's the evidence for pronoun status?

  1. a nigga and my nigga are phonologically reduced. That is, there is a clear difference in pronunciation between the pronoun forms and the terms meaning "a person" and "my friend." To this end, we tend to use anigga and manigga, pronounced /ənɪgə/ and /mənɪgə/ (we leave the original spacing when quoting tweets, though).
  2. No other words can intervene while still retaining the first person meaning. "A friendly nigga said hello" does not  mean "I said hello," whereas "anigga said hello" can. The first means that some friendly guy said hello, but it wasn't the speaker.
  3. anigga binds anaphors. No, that's not some kind of Greek fetish; anaphors are words like "myself," "himself," "herself," etc. Binding in this case refers to which anaphors show up with the word. anigga patterns with the first person words, whereas imposters do not. For almost everyone, "daddy is going to buy myself an ice cream" is either ungrammatical or sounds like daddy got lost in the middle of his sentence. anigga, on the other hand, is often used with myself, as in "anigga proud of myself."
  4. Other pronouns refer back to anigga. That is, "you read all a nigga's tweets but you still don't know me."
  5. Verbs are conjugated first person, not third person, with anigga. This is totally ungrammatical with imposters, and totally normal for actual pronouns. Example:
    "Finna make myself dinner. a nigga haven't eaten all day." Compare that to "Daddy haven't eaten all day; he's going to make myself dinner." Really, really, abysmally bad.

  6. anigga can be used in certain conditions that imposters - like "a brotha" - cannot. For instance, you can say "anigga arrived," with first person meaning, but the only interpretation available for "a brotha arrived" is third person. It's for this reason that we cannot simply substitute the much-less-likely-to-offend "a brotha" in our discussion of these terms.

That's basically it. In every conceivable grammatical test, anigga patterns with actual pronouns and not with imposters.

We then attempt to pinpoint its origin, and find that it must have happened sometime between 1970 (The Last Poets) and 1992 (Wu-Tang). In 1993, it's already being used in puns in rap music, as in Wu-Tang Clan's "Shame on a nigga (that tries to run game on a nigga)", where the meaning is "shame on a guy who tries to run game on me." The first unambiguously pronominal appearance we can find in print is from a 1995 interview with ODB (Ol' Dirty Bastard) of the Wu-Tang Clan, followed shortly by use in a magazine interview with Slick Rick. This is over 100 years after the first records we can find of the use of anigga as an imposter -- all of which are from exceedingly racist old books from the 1880s.

With regard to the terms of address nigga and manigga, the difference seems to be social deference. When in a position of greater authority, nigga is the term of address used toward another person (as in the first minute of this video of possibly the best cooking show for chefs on a budget, and an excellent example for Spears, 1998). When showing deference, manigga is used. This is why there's a clear difference in meaning between "nigga, please," and "manigga, please." The first is dismissive, the second is pleading.

Non-linguists, feel free to skip this technical paragraph. Currently, we're in the process of tallying use in Urban Fiction as a way of getting at the frequency of use. It's exceptionally difficult to get a large enough sample of material to be able to tally use of these new pronouns compared to other pronouns. If you try to compare to the frequency of "I" on Twitter, for instance, you're then comparing against all varieties of English, not just AAVE. If you use some other word as a proxy for AAVE use (hypothetically, tweets that contain the word nigga), you then have a number of other confounds, like potential bias in your data set, or, in the case of using nigga, possible lexical priming effects. If you try to do sociolinguistic interviews, you get observer effects that bias the data. Fiction is a good way to get at what the author of a given novel perceives as natural, which we can then compare against other authors and other data sets (e.g., Twitter). The goal right now is simply to get a baseline for comparison so we can begin to home in on a plausible range we can later refine.
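For illustration only, a baseline tally of the sort described might start as simply as the sketch below (the file name and word lists are placeholders, and it ignores every confound just mentioned):

```python
import re
from collections import Counter

# Candidate new pronoun forms vs. the standard first person singular pronoun
CANDIDATES = ["a nigga", "anigga"]
STANDARD = ["i"]

def tally(text):
    """Count candidate forms and standard pronouns in a chunk of text."""
    text = text.lower()
    counts = Counter()
    for form in CANDIDATES:
        counts[form] = len(re.findall(r"\b" + re.escape(form) + r"\b", text))
    tokens = re.findall(r"\b\w+\b", text)
    for form in STANDARD:
        counts[form] = tokens.count(form)
    return counts

# e.g., tally(open("urban_fiction_novel.txt").read())
```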

Concluding thoughts

It's unlikely that this pronoun will ever replace or even truly rival the usual English pronouns; however, speakers of this variety of AAVE now have a new way of expressing themselves at their disposal. For the moment, the authors have the dubious distinction of potentially being the world's leading experts on the n-words. So we've got that going for us, which is nice.

 

GOOD NEWS!

I have been accepted to present at not one, but two conferences this January. I will be presenting at the annual meeting of the American Dialect Society and presenting a co-authored paper at the annual meeting of the Linguistic Society of America (which is evidently spectacularly huge). Conveniently, these are being held at the same time, in the same place, in Portland, Oregon.

Although my current research covers many other subjects, both papers are on topics in African American Vernacular English. Posts about them to come soon!

 

What is AAVE?

[UPDATE: This post now has a video companion, see below!]

AAVE is an acronym for African American Vernacular English. Other terms for it in academia are African American Varieties of English, African American English (AAE), Black English (BE) and Black English Vernacular (BEV). [EDIT: since I wrote this post in 2014, a new term has gained a lot of traction with academics: African American Language (AAL), as in the Oxford Handbook of African American Language edited by Sonja Lanehart (2015), or the Corpus of Regional African American Language (CORAAL). I now use either AAE or AAL exclusively, unless I’m specifically talking about an informal, vernacular variety, however “AAVE” has gained traction in social media just as AAL replaced it among academics]

In popular culture, it is largely misunderstood, and thought of as "bad English," "ebonics" (originally coined in 1973 by someone with good intentions, from "ebony" and "phonics," but now starting to become a slur), "ghetto talk" (definitely a slur), and the "blaccent" (a portmanteau word of "black" and "accent") that NPR seems to like using.

Why do I say it's misunderstood? Because it is emphatically not bad English. It is a full-fledged dialect of English, just like, say, British English. It is entirely rule-bound -- meaning it has a very clear grammar which can be (and has been) described in great detail. It is not simply 'ungrammatical'. If you do not conform to the grammar of AAVE, the result is ungrammatical sentences in AAVE.

That said, its grammar is different from that of many other dialects of English. In fact, it can do some really cool things that other varieties of English cannot. Without further ado, here's a quick run-down of what it is, what it do, and where it be:

Where does it come from?

AAVE was born in the American South, and shares many features with Southern American English. However, it was born out of the horrifically ugly history of slavery in the United States. Black Americans, by and large, did not voluntarily move to North America with like-minded people of a shared language and cultural background, as happened with waves of British, Irish, Italian, German, Swedish, Dutch, &c. &c. immigrants. Rather, people from different cultures and language communities were torn from their homelands and sold into chattel slavery. Slaves in the US were systematically segregated from speakers of their own languages, lest they band together with other speakers of, say, Wolof (a West African language), and violently seize freedom.

There are two competing hypotheses about the linguistic origins of AAVE, neither of which linguists are ever likely to fully prove, because the history of the US has completely obscured the origins of the dialect. Because of historical racism, we're left with hypotheses instead of documentation.

The two hypotheses are the Creole Origin Hypothesis, and the Dialect Divergence Hypothesis. Both are politically charged (linguists are people too...). The first is that contact between English speakers and among speakers of other languages led to the formation of a Creole language with an English superstrate but strong pan-African grammatical influences -- meaning lots and lots of English words, but still a distinct language from English. Another example of such a language is Bislama. The second hypothesis is that it is basically a sister dialect of Southern American English which started to diverge in the 1700s and 1800s.

How is it different?

AAVE has a number of super cool grammatical features that non-speakers tend to mistake for 'bad grammar' or 'lazy grammar'. Here is a - by no means exhaustive - list of the key differences between it and the useful hypothetical construct "General American" (GA -- basically, how newscasters speak):

  1. Deletion of verbal copula (not as dirty as it sounds). This means that in some contexts, the word "is/are" can be left out. If you think this is "lazy grammar," speakers of Russian, Arabic, and Mandarin would like to have a word with you. example: "he workin'."

  2. A habitual aspect marker (known as habitual be, or invariant be). Aspect refers to whether an action is completed or on-going. Habitual aspect means that a person regularly/often/usually does a thing, but does not give any indication of whether they are currently in the process of doing that thing. example: "he be workin'" (meaning: he is usually working.)

  3. A remote present perfect marker (stressed been). This communicates that not only is something the case, and not only is it completed (i.e., perfective aspect), but it has been for a long time. example: "he been got a job." meaning: he got a job a long time ago.

  4. Negative concord. This means that negation has to all "match." If you've ever studied French, Spanish, Italian, Portuguese, Russian, or any of a whole slew of other languages, you've seen this. It is often stigmatized in English ("don't use double negatives!"), but is totally normal in many, many languages and in many varieties of English. example: He ain't never without a job! Can't nobody say he don't work.

  5. It for the dummy expletive there. What's a dummy expletive? It's that word that's necessary to say things when there isn't really an agent doing the thing in question -- like in "it's raining." Some languages can just say "raining," and be done with it. English is not one of them. In contexts where speakers of other dialects might say there, some AAVE speakers say it. example: "it's a man at the door here to see you." More famous example "Oh, Lord Jesus, It's a fire."

  6. Preterite had. This refers to grammatical constructions that in other dialects do not use had, but use the simple past. It's usually used in narrative. example: "he had went to work and then he had called his client." meaning: he went to work and then he called his client.

  7. Some varieties have 'semantic bleaching' of words that are considered obscenities in other dialects - this is where a word loses shades of meaning over time. Here's a famous example.

There are quite a few other cool grammatical features and quirks, but these are among the major innovations (yes, innovations). There's also tons of lexical variation (read: different words).

Sounds cool, what's the big deal?

Basically, racism and linguistic prejudice. We have a long cultural history of assuming that whatever black people in America do is defective. Couple this with what seems to be a natural predilection toward thinking that however other people talk is wrong, and you've got a recipe for social and linguistic stigma. For instance, in 1996 the Oakland school board took the sensible step of trying to use AAVE as a bridge to teach AAVE-speaking children how to speak and write Standard American English. They also took the less sensible step of declaring AAVE a completely different language. This was wildly misrepresented in the media, leading to a storm of racist, self-congratulatory "ain't ain't a word" pedantry from both white people and older middle-class black people who do not speak the dialect. (author's note: ain't been a word...for over 300 years.)

The use of ebonics as a derisive slur comes out of this national media shitstorm. Literally nobody even wanted to teach AAVE; they simply wanted to use the native dialect of pre-literate children as a bridge to teach the standard dialect and to teach reading and writing. Like this program, Academic English Mastery, in Watts. How awesome was that?! Instead, it was portrayed as Marxist nutjobs trying to force anarchist anti-grammar on helpless (white) American children instead of teaching them standard English.

There is absolutely nothing wrong with AAVE, but it is stigmatized for social and historical reasons, related to race, socioeconomic class, and prestige.

Who speaks it?

In general, black Americans; however, there are exceptions to every part of this. Not all black Americans speak it (e.g., Bill Cosby, who displays his ignorance of dialect variation often, and with gusto). Some black non-Americans speak it (e.g., Drake, who speaks it professionally, and is Canadian). Not all people who speak it are black (e.g., the author, Eminem, that white guy in the movie Barbershop). I even know a white linguist from Holland who speaks fluent AAE as a second language (it’s a language like any other, after all, although that kind of speaker is super rare).

In general, it can be assumed that non-black Americans probably don't speak or understand it. You can't necessarily assume, however, that a given black American does speak it. I recently tried to do the math to get a rough idea of how many people speak it, and came up with something like 30 million people, plus or minus about 10 million. I did this by looking at census data, linguistics papers that make estimates about how many black folk do speak it (e.g., Rickford 1999), and guesses about how many non-black AAVE speakers might exist. So I basically pulled it out of ... a hat. (note to self: this would make a good research topic. Note to other academics: I called dibs!). Many people who do speak it are extremely adept at code-switching: in the popular imagination, that's deftly switching between dialects or registers as the social situation calls for.

As an aside, one common trope used by those against its recognition as a dialect is "no academic could ever teach a class or publish in it." The argument being that linguists are hypocritical for claiming it is a legitimate dialect, since they could never actually publish in it. This would be simply misguided if it weren't for the fact that linguists like Geneva Smitherman have published articles in AAVE.

Is it spoken the same everywhere?

Yes and no. Certain grammatical features seem to be universally used in AAVE, however there is regional variation in pronunciation. More on this in another post.

One key finding in sociolinguistics that was hard for me to wrap my head around is that a given dialect — Appalachian English, Philadelphia English, Yiddish English, African American English — may have 20 different distinctive features, but individual speakers might not use all 20. So someone who never uses habitual be can still be a native speaker of AAE.

My dissertation research demonstrated that there are at least ten distinctive accents in AAE. Other research shows that there may be regional variation in what syntactic structures are used. For instance, “be done” constructions (as in, “I be done went home when they be gettin’ wild”) used to be common in Philadelphia. We know this because we have recordings! But now some young people report having never heard such sentences, or that they’re the kind of thing their grandparents might say, but not them.

Linguists don’t all agree on what the core features are, although things like habitual be, stressed been, and consonant cluster simplification in syllable codas are good candidates. Features that only exist in AAE but aren’t universal in AAE are relevant too — like replacing the /d/ in words like bleeding with a glottal stop (that is: [bliʔɪn]). There are also different registers in AAE, so Arthur Spears argues that African American Standard English (AASE) has different features from AAVE. Both are under the umbrella of AAE. (An example might be the pronunciation of /t/ in words like indemnity, where most white speakers of American English would pronounce that /t/ as an alveolar tap (that is: [ɪndɛmnɪɾi]), but many AAE speakers who are speaking formally might produce an aspirated t instead (that is, [ɪndɪmnɪtʰi]).) There are tons of other factors that affect whether someone speaks AAVE or AASE in a given circumstance.


Closing thoughts

AAVE is a dialect of English like any other, but suffers extreme stigma due to the history of race in America. It has a systematic, coherent, rule-bound grammar. It has some super cool grammatical features that allow it to communicate complex ideas in fewer words than other dialects of English. While the rise of hip-hop and some reintegration of our cities have exposed more of the mainstream to some varieties of AAVE, it is still, unfortunately, highly stigmatized. Regarding those who still think it is somehow not valid, Oscar Gamble said it best: They don't think it be like it is, but it do.

For more on AAVE, check out this video, where I interview four Black scholars who speak and research African American language use.

 

-----

©Taylor Jones 2014


 

 

Bad Vibrations: the Bizarre Explanation Why the French 'Can't' Learn Languages

The French have a completely absurd myth about language learning that blows my mind.

Having family in France, I'm lucky that I can sometimes visit. In some ways I'm unlucky, in that my family is largely insane, but insane family is a relatively common affliction. So, when a family member asked about my studies and then used that as a segue into pontificating about a totally ridiculous theory of why the French are physiologically incapable of learning English, I just assumed this was another instance of crazy family being, well, crazy.

Then I heard the same theory from a friend of my mother. When we recounted it to her French tutor, expressing our surprise at how two people who were apparently unacquainted shared such a preposterous view, this woman -- an educator, no less -- also supported it. As more and more French people we meet volunteer that they know and believe it, we're realizing it is a well-known, culturally ingrained myth. So what is it?

Apparently, it's simply common knowledge in France that the French cannot learn English (or other foreign languages) because of...different...frequencies...and...stuff.

The general gist of the idea is that different languages occur at different frequencies, and that native speakers of one language are ill-equipped to hear and interpret, and even worse equipped to produce, those frequencies.

It's unclear whether Mercury going into retrograde also affects things.

Now, as someone who likes to play Devil's Advocate, I kept trying to find ways to understand this nonsense. I thought, perhaps they recognize that the building blocks of a spoken language are its phonemes and that those can be thought of as being defined stochastically, so each speaker has a mental target, but every utterance will miss it by some margin of error. Maybe they also know that one can represent a speaker's vowel space by using a graph of the first and second formants (that is, the resonant frequencies of the vocal tract that shape speech sounds) plotted as the x and y axes. This seems unlikely, but whatever, maybe it's common knowledge in France. If they recognize that individual productions of a sound will, in the aggregate, cluster around this target, maybe they also know that the target could potentially vary from speaker to speaker.

 

the vowel space derived from acoustic measurement of the first and second formant midpoints of short medial vowels in 50 Northern Mao words. From Aspects of Northern Mao Phonology.


 

Perhaps, then, what they're trying to say is that the target is slightly different from language to language, so an English /i/ and a French /i/ are, on average, slightly different. Then, you can make a bit of a leap and say that the fact of slightly different phonemic targets, coupled with different phonemic inventories, makes it hard for an adult to learn a foreign language, because we're basically trained to separate sounds into different mental categories than in our target language, and certain combinations of F1 and F2 frequencies are ambiguous and confusing.
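To make that charitable reading concrete, this is roughly how a two-dimensional F1/F2 vowel plot is built; the formant values below are ballpark, textbook-style illustrations rather than measurements, and the French figures especially are just placeholders:

```python
import matplotlib.pyplot as plt

# Rough illustrative formant targets (Hz) -- not measured data
vowels = {
    "English /i/": (270, 2290),
    "English /u/": (300, 870),
    "French /i/":  (250, 2250),
    "French /u/":  (260, 750),
}

fig, ax = plt.subplots()
for label, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)
    ax.annotate(label, (f2, f1))

ax.invert_xaxis()  # front vowels (high F2) on the left, as on a vowel chart
ax.invert_yaxis()  # high vowels (low F1) at the top
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
plt.show()
```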

There's only one problem:

That's not what they mean.

No, they actually mean that the entire language as spoken by everyone who speaks any version of it, just...vibrates at a different frequency. That they can't hear. Or reproduce. In fact, there exists website after website after website proposing to train aspiring polyglots (for a fee, of course) to open their ears and minds to these different frequencies. They often have scientific-looking graphs, like this:

A stupid graph.


Nevermind the fact that there is a range that all human voices fall into, and that the vowel space is pretty well defined.

Nevermind that it's defined in two dimensions.

Nevermind that there are studies on vowel spaces across languages, and on differences among dialects of the same language (which should then be totally mutually incomprehensible).

Nevermind that women and men have different base frequencies, so according to this theory, women and men speaking the same language should be totally incomprehensible to one another [insert your own joke here].

Nevermind that "North American" isn't a language (seriously?!).

This bunk science is widely accepted as obviously true by the vast majority of French people I've interacted with. There's even a corollary: even if the French could hear and interpret those crazy foreign sounds, they can't make them because their mouths and vocal cords have become adapted to French in such a way that they are now malformed from the point of view of non-French languages (i.e., langues non-civilisées, 'uncivilized languages').

When I ask how it's possible that I can speak and understand French, the consensus is that the frequency problem is one-way. That is, anyone can learn French (obviously the best, most expressive, most beautiful language -- ideally suited to the historical mission civilisatrice), but the French are uniquely ill-suited to learn any other language because of those pesky fréquences.

One of the most interesting aspects of this myth is that it is so (pseudo)scientific. Whereas in the US people will just say they have no need, or will say they're "too old," they don't then lecture about their half-remembered misconceptions about the critical period hypothesis. In France, however, it seems totally unacceptable to say "I tried and failed," or "I never felt much need to learn anything else," or even "I can't because of external factors like age, opportunity, etc." Rather, it is a fundamental flaw of other languages which has been scientifically demonstrated: they simply vibrate at unfortunate frequencies.

I'm not quite sure what I expected from a place where doctors prescribe homeopathy and public intellectual is totally a legitimate job, but this kind of absurdity is wholly, delightfully foreign to me. Now, to have a croissant and a grand crème while I ponder whether simply digitally adjusting acoustic frequencies could create a universal translator.

-----

©Taylor Jones 2014


 

Facebook's "Emotional Contagion" Study Design: We're Mad For All the Wrong Reasons

A new study in the Proceedings of the National Academy of Sciences has been receiving an enormous amount of negative press, as the study of 'emotional contagion' has been called 'secret mood manipulation,' 'unethical,' and a 'trampl[ing] of human ethics.' Researchers took 689,003 participants, and used the Linguistic Inquiry and Word Count (LIWC) software to manipulate the proportion and valence of 'positive' and 'negative' emotional terms that appeared on users' news feeds. They then argued that emotional contagion propagates across social networks. This study has a number of flaws, and the fact that it passed Institutional Review Board (IRB) review is the least of them.

Since there's so much wrong with it, let's start first with why it's not as bad as everyone thinks: there is far more content generated by Facebook users' friends than is viewable, and so the news feed only presents users with a small sample of what their friends posted. All of their friends' posts were visible (that is, nothing was suppressed!) on their walls and timelines, as well as on news feed viewings before and after the one-week experiment. Facebook is very clear about the fact that they only present a subset of posts on any given user's news feed, and this experiment was simply tinkering with the algorithm for a week. A careful read of the study methodology reveals why it passed IRB review -- it's not massive, secret emotional manipulation, like some kind of Google-era attempt at a privately funded MK Ultra. Rather, it was slight tinkering with how Facebook filters posts that it already filters, filtering that Facebook is clear about in its terms of use. This is not, however, an attempt at Facebook apologetics. In fact, I think the article was absolutely terrible, but for different reasons. The thing people seem to be missing is that:

Facebook claims they demonstrated emotional contagion, but cannot show that they actually successfully manipulated emotions AT ALL.

That's right, the reason I'm upset is that they didn't manipulate emotions; not because I wanted them to -- as that would potentially be an enormous violation of ethics -- but because they claimed they did and published it in a peer-reviewed journal, without actually proving anything of the sort.

There are so many flaws with the methodology that I'm going to limit myself to bullet points covering the most glaring problems:

  • "Posts were determined to be positive or negative if they contained at least one positive or negative word, as defined by Linguistic Inquiry and Word Count software (LIWC2007) (9) word counting system, which correlates with self-reported and physiological measures of well-being, and has been used in prior research on emotional expression (7, 8, 10)."  -- I'm friends with a ton of jazz musicians. When they call something bad, this is not a negative term, but would be interpreted as such by the LIWC.
  • More generally, depending on the social circle, terms like bad, dope, stupid, ill, sick, wicked, killing, ridiculous, retarded,  and terrible should be grouped differently. There is absolutely no indication that the researchers took slang or dialect variation in English into account.
  • This study does not -- and cannot -- demonstrate actual emotional contagion. They have a much better chance of demonstrating lexical priming than emotional contagion. Except, they can't demonstrate that either, because all of the terms are aggregated, so they only know that words with 'negative valence' are predictors of the use of other words with negative valence.
  • "people ’s emotional expressions on Facebook predict friends’ emotional expressions, even days later (7) (although some shared experiences may in fact last several days)" -- That is, there's no control for friends in social networks sharing a real-world experience and posting about it on Facebook using similar emotional terms.
  • "there is no experimental evidence that emotions or moods are contagious in the absence of direct interaction between experiencer and target."

In other words, the Facebook study does not control for shared experiences being described in similar terms, does not control for different semantic and pragmatic contexts (e.g., "those guys were BAD, son. [Piano player] was STUPID NASTY on the gig last night!" is extremely positive, but would be interpreted by LIWC as extremely negative), and conflates emotional contagion with lexical priming (simply, the increased likelihood of using a given term if it is 'primed' by previous use or by previous use of a related term).
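To see how easily one-word-hit coding goes wrong, here is a toy version of it; the word lists are tiny stand-ins of my own invention, not the actual LIWC lexicon:

```python
import re

# Tiny stand-in word lists -- not the real LIWC categories
POSITIVE = {"happy", "love", "great", "good"}
NEGATIVE = {"sad", "bad", "terrible", "nasty", "stupid"}

def classify(post):
    """Label a post by whether it contains any 'negative' or 'positive' word."""
    words = set(re.findall(r"[a-z]+", post.lower()))
    if words & NEGATIVE:
        return "negative"
    if words & POSITIVE:
        return "positive"
    return "neutral"

print(classify("Those guys were BAD, son. He was STUPID NASTY on the gig last night!"))
# -> "negative", even though for the jazz musicians quoted above it's high praise
```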

In order for this study to say anything even remotely interesting, the researchers would first have to demonstrate that actual emotional states are expressed in social media posts at all. Then, they would have to demonstrate that they could reliably determine actual emotional state from social media posts (what is the probability that a Facebook user is experiencing sadness given that they have used descriptive terms about sadness in their posts?). Next, they'd have to separate out confounds (e.g. "nasty" for "good"). Then they'd have to demonstrate that there is in fact a 'contagion' effect. Finally, they'd have to demonstrate that the apparent contagion effect was not just lexical priming (that is, me repeating "sad" because I was primed by another person's use of the word "sad," while not actually feeling sadness). If this post is any indication, they'd also have to figure out a way to control for discussion of emotion -- this post is chock full of negative terms, while being emotionally neutral, since I'm discussing emotional terms.

The real travesty is not that the Facebook study passed IRB; it's that it passed peer review.

This is indicative of a larger problem in the sciences: there is a bias toward dramatic findings, even if they're not terribly well supported. As a linguist, I feel like linguistics suffers more from this than other fields, since there have been a slew of recent dramatic articles published about linguistic topics by non-linguist dabblers who employ terrible methodology (for instance, making claims about linguistic typology predicting economic behavior, but getting all the typologies wrong!). Whether linguistics as a field suffers from this more than other fields remains to be proven by a well-designed study. That said, when people decide to do research that relies heavily upon understanding linguistic behavior, it behooves the researchers to, I don't know, maybe...consult a linguist.

Ultimately, the Facebook study was (just barely) within the realm of ethical study on human subjects, although their definition of informed consent was more than a little blurry. What's truly terrible about it is that the researchers make very strong claims about emotional contagion on social networks that their research does not justify, and that those claims passed peer review.

 

-----

©Taylor Jones 2014


Obvs is Phonological, and it's Totes Legit

Recently, NPR ran a story called Researchers are Totes Studying how Ppl Shorten Words on Twitter. It was primarily focused on what they called 'clipping,' for which the author of the article provides the example "awks," for "awkward." As far as I know, aside from the researchers interviewed by NPR, no one has done any scholarly work on this phenomenon, and as far as I can find on JSTOR and Google Scholar, no one has published anything on it.

The general consensus among regular folk is that the phenomenon is:

  1. annoying
  2. associated with young white women
  3. the result of character limits on Twitter, or choices about spelling economy in text messages.

The first two are likely in some ways true: I don't have the data to prove it (yet!), but it does seem to be most deployed by young women (who are often the leaders of linguistic change), and -- as is often the case -- because of its association with young women, it is negatively socially evaluated by the general public. My issue is with the third point. Most people take it as so obvious as to be axiomatic that 'clippings' like "obvs" and "totes legit" are the result of spelling choices. Even the Dartmouth researchers interviewed by NPR are influenced by this assumption, and were perplexed to find that people still shorten their words on Twitter even when they have plenty of characters left to write.

Not only is the assumption that it's orthographically motivated wrong, but it's a perfect example of where linguistics can provide clearer insight than can be afforded by Big Data style data mining and statistical analysis without a grounding in the past 100 years or so of the scientific study of language. Perhaps it's confirmation bias that leads people to assume that this phenomenon originated in written communication. The fact is:

Truncations like "totes" for "totally" arise out of the spoken, not written, language.

They can be described entirely in phonological terms, without recourse to writing. Moreover, they are clearly sensitive to phonological environment: specifically, primary stress. It's not entirely clear why a written truncation should be sensitive to stress. If that weren't enough, sometimes what NPR calls 'clippings' are significantly longer than the word they're supposedly an abbreviation of. Case in point:

bee tee dubs is more than 3x longer than "BTW."

So, what's really going on?

Let's break it down. There are a few key features:

  1. Words are truncated after their primary stress. A word like totally has three syllables, but its primary stress is on the first: tótally. The style of truncation under discussion is extremely productive, and can be used on new words. All of the truncations are sensitive to primary stress. When I asked women who use these forms, the consensus was that indécent becomes indeec, expósure becomes expozh, and antidisestablishmentárianism becomes antidisestablishmentairz. Note how spelling changes serve to preserve what remains of the pronunciation of the original word.
  2. As much material as possible from the syllable following the stressed syllable is incorporated into the end of the new word (that is, the onset of the following syllable is resyllabified as part of the coda of the stressed syllable).
  3. A final fricative is added if not already present (marv for marvellous). For most people who employ this kind of language play, there is actually a more restrictive rule: a final sibilant is added. This means that truncations can end with sh, zh, ch, s or z, and if there is no sibilant present, an s or z is added.

Voilà! An explanation that accounts for most of the data, explains forms that are not predicted by spelling rules, and makes correct predictions about novel forms.
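Here is a minimal sketch of those three rules in code, using rough, made-up syllable spellings; real syllabification and stress assignment are doing a lot of work offstage, and the sibilant-only version of rule 3 is assumed:

```python
SIBILANTS = ("s", "z", "sh", "zh", "ch")

def truncate(syllables, stress_index, next_onset=""):
    """Keep everything through the stressed syllable, fold the following
    onset into the coda, then add a final sibilant if one isn't there."""
    form = "".join(syllables[: stress_index + 1]) + next_onset
    if not form.endswith(SIBILANTS):
        form += "s"
    return form

# Rough spellings; actual spellings get adjusted to preserve the vowel sound
print(truncate(["ob", "vi", "ous", "ly"], 0, "v"))  # obvs
print(truncate(["ex", "po", "sure"], 1, "zh"))      # expozh
print(truncate(["to", "tal", "ly"], 0, "t"))        # tots, written "totes"
```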

The astute, Twitter-savvy reader might not be totally satisfied with the above, however. Such a reader might ask, "but what about forms like legit? Soz (sorry)? Tommaz (tomorrow)? Bruh (brother)?"

First, it's necessary to point out that truncation is not a new phenomenon in English. Part of what motivated me to look into this phenomenon was outrage that anyone would suggest legit arose from Twitter or texting. Three words can disprove the 'twitter hypothesis': MC Hammer. 

Of course, a quick Google Ngram search will show that legit was in common use in the 1800s. Bumf, slang for tedious paperwork, is actually a truncation of 16th century 'bumfodder' (i.e., 'toilet paper'). What's new here is the addition of the sibilant. Interestingly, it's now possible to find reanalyzed truncations on Twitter, so alongside legit, one may also see legits.

With regard to soz, appaz, and tommaz, there is actually a very simple explanation: these forms are much more popular in the UK, and the speakers are non-rhotic. That is, they speak dialects that "drop the rs" (in point of fact, there is compensatory vowel lengthening in the contexts where r is not pronounced, so the r is not entirely absent). The above description actually perfectly describes how you get soz in a non-rhotic dialect. Underlyingly, it's still sorrs.

Finally, bruh, cuh, luh, and others. These are truncations, but in a different dialect of English: African American Vernacular English (although bruh has been borrowed into other dialects, like twerk, turnt, and shade have been recently). In these cases, the word is truncated after the primary stress, but subsequent material is not added to make a maximally large syllable coda.

This is where things get interesting. Truncation in both AAVE and other dialects of English leads to 'words' that are otherwise ill-formed. This may be part of why some people believe that such truncations are "annoying," or that their users are "ruining English." The /-bvz/ in obvs is not otherwise a permissible cluster in English (and most native speakers actually find it quite hard to say. Some 'fix' it by changing it to 'obv' or 'obvi,' the latter being the standard English diminutive or hypocoristic truncation). There are, as far as I know, only four words in English that end with /ʒ/: rouge, garage, homage, and luge -- all of which are borrowed words, and some speakers 'correct' them to /dʒ/ (as in "George"). That sound does occur, however, in the middle of words like pleasure, treasure, measure, leisure, and so on...and ends up word-final in truncations like plezh, trezh, mezh, leezh, and so on.

So what's the takeaway from all of this? Well, I hope it goes without saying, but young women aren't ruining English, even if they maybe speak a little differently than, say, your high school English teacher. Moreover, truncated forms like 'obvs' have nothing to do with writing. If they were simply shorthand for texting and Twitter, it would be a lot easier to wrt smthng lk ths. Instead, truncated forms are the result of language games that follow specific rules and are based on native mastery of phonology. They're closer to Pig Latin (or French Verlan, or Arrunde Rabbit Talk) than the babbling of a "speech-impaired halfwit."

So next time someone says it was totes a plezh to make your acquaints, or responds to your "how're things?" with "my sitch is pretty deec," recognize that they are playing a language game that requires total, intuitive mastery of English...and maybs play along, rather than making things totes awks for everyone.

 

-----

©Taylor Jones 2014
