Testifying While Black

[content warning: language] [co-authored with Jessica Kalbfeld]

For the last four years I've been working on a large-scale project, distinct from my dissertation, that my family and friends know as my "shadow dissertation." It's a co-authored paper with Jessica Kalbfeld (Sociology, NYU), Ryan Hancock (Philadelphia Lawyers for Social Equity, WWDLaw), and my advisor, Robin Clark (Linguistics, University of Pennsylvania), and we just received word that it has been accepted for publication in Language. Many of my other projects, including my work on the verb of quotation talkin' 'bout, on first person use of a nigga, and on the spoken reduction of even to "eem", among others, were all in service of this project.

Simply put: court reporters do not accurately transcribe the speech of people who speak African American English at the level of their industry standards. They are certified as 95% accurate, but when you evaluate sentence by sentence, only 59.5% of the transcribed sentences are accurate, and when you evaluate word by word, they are 82.9% accurate. The transcriptions changed the who, what, when, or where 31% of the time. And 67% of the time, they couldn't accurately paraphrase what they had heard.

Let me be clear: I am not saying that all court reporters mistranscribe AAE. However, the situation is dire. For this project, we had access to 27 court transcriptionists who currently work in the Philadelphia courts -- fully a third of the official court reporter pool. All are required to be certified at 95% accuracy; however, the certification is based primarily on the speech of lawyers and judges, and they are tested for speed, accuracy, and technical jargon.

We recruited the help of 9 native speakers of African American English (if you're new to my blog, African American English is a rule-governed dialect as systematic and valid as any other), from West Philadelphia, North Philadelphia, Harlem, and Jersey City (4 women and 5 men). Each of these speakers was recorded reading 83 different sentences, all of which were taken from actual speech (that is, we didn't just make up example sentences). Each sentence contained specific features of AAE (13 features in total), or combinations of features. Examples of sentences included:

  • When you tryna go to the store?

  • what he did?

  • where my baby pacifier at?

  • she be talkin’ ‘bout “why your door always locked?”

  • Did you go to the hospital?

  • He been don’t eat meat.

  • It be that way sometimes.

  • Don’t nobody never say nothing to them.

Features we tested for included: 

  • null copula (deletion of conjugated is/are, as in he workin’ for “he is working”).

  • negative concord (also known as multiple negation or “double negatives”).

  • negative inversion (don’t nobody never say nothing to them meaning “nobody ever says anything to them”).

  • deletion of possessive s (as in his baby mama for his baby’s mama).

  • habitual be (an invariant grammatical marker that indicates habitual action, as in he be workin’ for “he is usually working”).

  • stressed been (this marks completion in the subjectively distant past, as in I been did my homework meaning “I completed my homework a long time ago”).

  • preterite had (this is the use of had where it does not indicate prior action in the past tense, but rather often indicates emotional focus in the narrative, as in what had happened was… for “what happened was…”).

  • question inversion in subordinate clauses (this is when questions in subordinate clauses invert the same way as in matrix clauses in standard English, as in I was wondering did you have trouble for “I was wondering whether you had trouble”).

  • first person use of a nigga (this is where a nigga does not mean any person, but rather indicates the speaker, as in a nigga hungry for “I am hungry”).

  • spoken reduction of negation (this is the reduction of ain’t even to something that sounds like “eem”, or the reduction of don’t to something that sounds like “ohn”).

  • quotative talkin’ ‘bout (this is the use of talkin’ ‘bout, often reduced to sounding like “TOM-out” to introduce direct or indirect quotation, as in he talkin’ ‘bout “who dat?” meaning “he asked ‘who’s that?’”).

  • modal tryna (this is the use of tryna to indicate intent or futurity, as in when you tryna go for “when do you intend to go?”).

  • perfect done (this is a perfect marker, indicating completion or thoroughness, as in he done left meaning “he left”).

  • be done (this is a construction that can mark a combination of habitual and completed actions, or can mark resultatives, as in I be done gone home when they be acting wild for “I’ve usually already gone home when they act wild”).

  • expletive it (this is replacing standard English “there” with it, as in it’s a lot of people outside for “there are a lot of people outside”).

  • combinations of the above, as in she be talkin’ ‘bout “why your door always locked?” meaning “she often asks ‘why is your door always locked?’”

These are by no means all the patterns of syntax unique to AAE, but we thought they were a decent starting point. However, not only does AAE have different grammar from other varieties of English, but more often than not, African Americans have different accents from their white counterparts within the same city. Think about it: Kevin Hart's Philadelphia accent is not the same as Tina Fey's (it's also why Kenan Thompson's Philly accent is so weird in that sketch).

All of the court reporters we tested were given a 220 Hz warning tone to tell them a sentence was coming, followed by the same sentence played twice, followed by 10 seconds of silence. We asked them to 1) transcribe what they heard (their official job) and 2) paraphrase what they heard in "classroom English" as best as they could (not their job!). The audio was at 70-80 decibels at 10 feet (that is, very loud). The sentences and voices were randomized so they heard a mix of male and female voices, and they didn't hear the same syntactic structures all at the same time. All of the court reporters expressed that what they heard was:

  • better quality audio than they're used to in court

  • consistent with the types of voices they hear in court (more specifically, they often volunteered "in criminal court").

  • spaced with more than enough time for them to perform the task (they often spent the last 5 seconds just waiting -- they write blisteringly fast).

What was the result? None of them performed at 95% accuracy, no matter how you choose to define accuracy, when confronted with everyday African American English spoken by local speakers from the same speech communities as the people they are likely to encounter on the job. If you choose to measure accuracy in terms of full sentences -- either the sentence is correct or it is not -- the average accuracy was 59.5%. If you choose to measure accuracy in terms of words -- how many words were correct -- they were 82.9% accurate on average. Race, gender, education, and years on the job did not have a significant effect on performance, meaning that black court reporters did not significantly outperform white court reporters (we think this is likely because of the combination of neighborhood, language ideologies, and stance toward the speakers -- black court reporters distanced themselves from the speakers and often made a point of explaining they "don't speak like that"). Interestingly, the kinds of errors did seem to vary by race: there's weak evidence that black court reporters did better understanding the accents, but still struggled with accurately transcribing the grammar associated with more vernacular AAE speakers.
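To make the two measures concrete, here's a toy sketch in Python. This is not our actual scoring pipeline; the word-level measure below is the standard "1 minus word error rate" operationalization, which may differ in detail from the scoring in the paper, and the example pairs are illustrative.

```python
# Toy versions of the two accuracy measures discussed above (illustrative only).
def sentence_accuracy(pairs):
    """Fraction of (reference, transcription) pairs that match word-for-word."""
    return sum(ref.split() == hyp.split() for ref, hyp in pairs) / len(pairs)

def word_accuracy(ref, hyp):
    """Word-level accuracy as 1 - word error rate (edit distance over words)."""
    r, h = ref.split(), hyp.split()
    # standard dynamic-programming (Levenshtein) edit distance over word sequences
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return max(0.0, 1 - d[len(r)][len(h)] / len(r))

pairs = [
    ("he don't be in that neighborhood", "we going to be in this neighborhood"),
    ("did you go to the hospital", "did you go to the hospital"),
]
print(sentence_accuracy(pairs))            # 0.5: one of the two is exact
print(round(word_accuracy(*pairs[0]), 2))  # 0.33 for the mistranscribed one
```

The gap between the two numbers is the point: a transcript can get most of the words right while still getting most of the sentences wrong.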

For all the court reporters, their performance was significantly worse when we asked them to paraphrase (although individual court reporters did better or worse with individual features; for example, one white court reporter nailed stressed been every time -- something we did not expect). Court reporters correctly paraphrased on average 33% of the sentences they heard. There was also not a strong link between their transcription and paraphrase accuracy -- in some cases they even transcribed all the words correctly, but paraphrased totally wrong. In a few instances, they paraphrased correctly, but their official transcription was wrong! The point here is that while the court reporters did poorly transcribing AAE, they did even worse understanding it -- which makes it no surprise they had difficulty transcribing.

In the linguistics paper, we go into excruciating detail cataloguing the precise ways accent and grammar led to error. However, the takeaway for the general public is that speakers of African American English are not guaranteed to be understood by the people transcribing them (and they're probably even less likely to be understood by some lawyers, judges, and juries), nor are they guaranteed that their words will be transcribed accurately. Some examples of sentences together with their transcription and paraphrase include (sentence in italics, transcription in angle brackets <>, and paraphrase in quotes):

  • he don’t be in that neighborhood — <We going to be in this neighborhood> — “We are going to be in this neighborhood”

  • Mark sister friend been got married — <Wallets is the friend big> — (no paraphrase)

  • it’s a jam session you should go to — <this [HRA] jean [SHA] [TPHAO- EPB] to> — (no paraphrase)

  • He don’t eat meat — <He’s bindling me> — “He’s bindling me”

  • He a delivery man — <he’s Larry, man> — “He’s a leery man”

Why does this matter?

First and foremost, African Americans are constitutionally entitled to a fair trial, just like anyone else, and the expectation of comprehension is fundamental to that right. We picked the "best ears in the room" and found that they don't always understand or accurately transcribe African American English. And crucially, what the transcriptionist writes down becomes the official FACT of what was said. For 31% of the sentences they heard, the transcription errors changed the who, what, when, or where. Some were squeamish about writing the "n-word" and chose to replace it with other words; however, those who did often failed to understand who it referred to (for instance, changing a nigga been got home "I got home a long time ago" to <He got home>, or in one instance <Nigger Ben got home>, evidently on the assumption it was a nickname).

And it's not just important for when black folks are on the stand. Transcriptions of depositions, for instance, can be used in cross-examination. In fact, it was seeing Rachel Jeantel defending herself against claims she said something she hadn't that sparked the idea for this project. (And she really hadn't said it -- I've listened to the deposition tape independently, and two other linguists -- John Rickford and Sharese King -- came to the conclusion the transcription was wrong, and have published to that effect). Transcriptions are also used in appeals. In fact, one appeal was decided based on a judge's determination of whether "finna" is a word (it is) and whether "he finna shoot me" is admissible in court as an excited utterance. The judge claimed, wrongly, that it is impossible to determine the "tense" of that sentence because it does not have a conjugated form of "to be", claiming that it could have meant "he was finna shoot me." If you know AAE, you know that you can drop "to be" in the present but not in the past. That is, you can drop "is" but not "was". The sentence unambiguously means "he is about to shoot me," that is, in the immediate future.

And this is to say nothing of outright misunderstandings like the recent "lawyer dog" incident, in which a defendant said "I want a lawyer, dawg" and was denied legal counsel because there are no dogs who are lawyers.

All of this points to a generally overlooked way in which African Americans do not receive fair treatment from the judicial system. Most of us learn unscientific and erroneous language ideologies in school. We are explicitly taught that there is a correct way to speak and write, and that everything else is incorrect. Linguists, however, know this is not the case, and have been trying to tell the public for years (including William Labov’s “The Logic of Nonstandard English,” Geoffrey Pullum’s “African American English is Not Standard English with Mistakes,” the Linguistic Society of America statement on the “ebonics” controversy, and much of the research programs of professors like John Rickford, Sonja Lanehart, Lisa Green, Arthur Spears, John Baugh, and many, many others). The combination of these pervasive language attitudes and anti-black racism leads to linguistic discrimination against people who speak African American English — a valid, coherent, rule-governed dialect that has more complicated grammar than standard classroom English in some respects. Many of the court reporters assumed criminality on the part of the speakers, just from hearing how the speakers sounded — an assumption they shared in post-experiment conversations with us. Some thought we had obtained our recordings from criminal court. Many also expressed the sentiment that they wished the speakers spoke "better" English. That is, rather than recognizing that they did not comprehend a valid way of speaking, they assumed they were doing nothing wrong, and that the gibberish in their transcriptions (see above examples) was because the speakers were somehow deficient.

Here, I think it is very important to point out two things: first, many people hold these negative beliefs about African American English. Second, the court reporters do not have specific training for part of the task they are required to do, and they all expressed a strong desire to improve, and frustration with the mismatch between their training and their task. That is, they were not unrepentant racist ideologues out to change the record to hurt black people — they were professionals, both white and black, whose training didn't fully line up with their task and who held common beliefs many of us are actively taught in school.

What can we do about it?

There is the narrow problem we describe of court transcription inaccuracy, and there is the broader problem of public language attitudes and misunderstanding of African American English. For the first, I believe that training can help at least mitigate the problem. That's why I have worked with CulturePoint to put together a training suite for transcription professionals that addresses the basics of "nonstandard" dialects, and gives people the tools to decode accents and unexpected grammatical constructions. Anyone who has ever looked up lyrics on genius.com or put the subtitles on for a Netflix comedy special with a black comic knows that the transcription problem is widespread. For the second problem, bigger solutions are needed. Many colleges and universities have undergraduate classes that introduce African American English (in fact, I've been an invited speaker at AAE classes at Stanford, Georgetown, University of Texas San Antonio, and UMass Amherst), but many, even those with linguistics departments, do not (including my current institution!). Offering such classes, and making sure they count for undergraduate distribution requirements, is an easy first step. Offering linguistics, especially sociolinguistics, in high schools as part of AP or IB course offerings could also go a long way toward alleviating linguistic prejudice and helping with cross-dialect comprehension. Within the judicial system more specifically, court reporters should be encouraged to ask clarifying questions (currently, it's officially encouraged but de facto strongly discouraged). Lawyers representing AAE-speaking clients should make sure that they can understand AAE and ask clarifying questions to prevent unchecked misunderstanding on the part of judges, juries, and yes, court reporters. Linguists and sociologists can, and should, continue public outreach so that the general public has an informed idea about what science tells us about language and discrimination.

This is a disturbing finding with strong implications for racial equality and justice. And there's no reason to think the problem of cross-dialect miscomprehension is limited to this domain (in fact, we already have future studies planned in medical domains). This study represents a first step toward quantifying the problem and identifying its key triggers. Unfortunately, the solutions are not all clear or easy to enact, but we can chip away at the problem through careful scientific investigation. On the heels of the 19th national observance of Martin Luther King Jr. Day (it has only been observed in all 50 states since 2000!), it seems appropriate to reaffirm that “No, no, we are not satisfied, and we will not be satisfied until justice rolls down like waters and righteousness like a mighty stream.”


©Taylor Jones 2019

Have a question or comment? Share your thoughts below!

Linguists have been discussing "Shit Gibbon." I argue it's not entirely about gibbons.


Earlier this week a Pennsylvania state senator called Donald Trump a "fascist, loofa-faced shit-gibbon."

There was an excellent post on Strong Language, a blog about swearing, discussing what makes "shit gibbon" so arresting, so fantastic, so novel, and yet... so right (for English swearing. Whether you believe "shit gibbon" is "right" as a characterization of Donald Trump is a personal assessment each person must make for themselves).

The post, The Rise of the ShitGibbon, can be found here. I highly recommend reading it.

Most of the post was dedicated to tracing the origins and rise of "shitgibbon." The end of the post, however, catalogues insults in the same vein:

wankpuffin, cockwomble, fucktrumpet, dickbiscuit, twatwaffle, turdweasel, bunglecunt, shitehawk

And some variants: cuntpuffin, spunkpuffin, shitpuffin; fuckwomble, twatwomble; jizztrumpet, spunktrumpet; shitbiscuit, arsebiscuits, douchebiscuit; douchewaffle, cockwaffle, fartwaffle, cuntwaffle, shitwaffle (lots of –waffles); crapweasel, fuckweasel, pissweasel, doucheweasel.

I've actually been thinking about insults like this a surprising amount. Ben Zimmer points out about "Shitgibbon" that "...Metrically speaking, these words are compounds consisting of one element with a single stressed syllable and a second disyllabic element with a trochaic pattern, i.e., stressed-unstressed. As a metrical foot in poetry, the whole stressed-stressed-unstressed pattern is known as antibacchius."

I argue that this is correct, but that (1) there's a little bit more to say about it, and (2) there are exceptions.


First: I argue that the rule for making a novel insult of this type is a single syllable expletive (e.g., dick, cock, douche, cunt, slut, fart, spunk, splooge, piss, jizz, vag, fuck, etc.) plus a trochee. A trochee, as a reminder, is a word that's two syllables with stress on the first. Examples are puffin, womble, trumpet, biscuit, waffle, weasel, and of course, gibbon. Tons of words in English are trochees (have a relevant XKCD! In fact, have two! Wait, no, three! No one expects the Spanish Inquisition!). Because so many words are trochees, you'll have to pick wisely --- something like ninja might not be as humorously insulting as waffle.

That said, in principle, monosyllabic expletive + trochee seems to give really good results. Behold:

fart basket, shit whistle, turd helmet, cock bucket, douche blanket, vag weasel, (I'm gonna be so much fun when I get old and have dementia. Good luck grandkids!), shit mandrill, piss gopher, jizz weevil, etc. etc. I can do this all day.
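Since the formula is so mechanical, you can sketch it in a few lines of Python (the word lists below are my own illustrative picks, not an exhaustive inventory):

```python
import itertools
import random

# Monosyllabic expletives and trochaic second elements (illustrative lists only)
EXPLETIVES = ["shit", "fart", "douche", "jizz", "turd", "spunk"]
TROCHEES = ["gibbon", "weasel", "waffle", "biscuit", "puffin", "trumpet"]

def insults():
    """Every expletive + trochee compound the pattern licenses."""
    return [e + t for e, t in itertools.product(EXPLETIVES, TROCHEES)]

print(random.choice(insults()))  # e.g. "fartpuffin" (yours may vary)
```

Of course, as the examples below show, the formula overgenerates; filtering for funny is left as an exercise for the reader.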

So, it's not the fact of being a gibbon per se. Various other monkeys would work: vervet, mandrill, etc. However, crucially, baboons, macaques, black howlers, and pygmy marmosets are out.

Moreover, it's not completely unlimited. Some words fit but don't make much sense as an insult: cock bookshelf, fart saucepan (which I quite like, actually), dick pension, belch welder.

Others sound like the kind of thing a child would say: fart person! poop human! turd foreman!

Yet others are too Shakespearean: fart monger! piss weasel!

Clearly some words (waffle, weasel, gibbon, pimple, bucket) are better than others (bookshelf, doctor, ninja, icebox), and some just depend on delivery (e.g., ironic twat hero, turd ruler, spunk monarch, dick duchess).


For a while, I've been discussing vowels in insults with fellow linguist Lauren Spradlin. Note that when we talk about vowels, we mean sounds, not letters. Don't worry about the spelling; try saying the below aloud. Spradlin has brought to my attention the importance of vowel repetition in increasing the viability of a new insult of this form: crap rabbit, jizz biscuit, shit piston, spunk puffin, cock waffle, etc.

I would argue that having the right vowels actually gives you some leeway, so you can get away with following the first word with --- gasp! --- a non-trochee! Be it an iamb (remember iambic pentameter?) as in douche-canoe, splooge caboose, or the delightfully British bunglecunt (h/t Jeff Lidz), or even more syllables: Kobey Schwayder's charming mofo-bonobo.

As you can see, this is a hot topic in the hallowed halls of the ivory tower. If the above simple formulae have motivated even one person to go out and exercise their own creativity to make a novel contribution to the English language, then I've done my job here as a linguist. Different people get into linguistics for different reasons, but this, this is what I live for. Get out there and make a difference!




©Taylor Jones 2017



Arguments About Arguments

Lately, I've been thinking a lot about what linguists call valency. This is in part because I was recently discussing the weird privileging of some grammatical structures over others by self-appointed "grammar nerds", and in part because it's been very relevant in studying Zulu.

Valence or valency refers to the number of arguments a verb has (or "takes"). What's an argument? In this jargon, it's basically a noun or noun phrase. The idea is that different verbs require --or allow -- different numbers of nouns. The ones that are required are sometimes referred to as core arguments. For instance:

  • I strolled.

The above is referred to as intransitive, and it allows only one argument. It makes no sense to say, for instance:

  • *I strolled you.

(the * means the sentence is ungrammatical, meaning that it doesn't make structural sense.)

Similarly, there are verbs that take two arguments (transitive verbs) and verbs that take three (ditransitive verbs). Examples are:

  • he hit me.
  • I gave him a book.

Admittedly, this is not all that interesting. What is interesting, however, is valence-changing operations.

Different languages have different tools for taking a verb of one kind, and changing the number or structure of core arguments. What does this mean? It means making a peripheral argument into a core argument, like this (kind of cheat-y example):

  • I gave a book to him
  • I gave him a book

Or totally changing which arguments have which syntactic positions:

  • I ate a whole thing of ice cream.
  • A whole thing of ice cream was eaten.

The above example, many of you will recognize as the passive voice. The passive voice has gotten a bad rap. The passive voice is freaking cool! It lets you keep the same meaning, but shuffle around what structural role all of the noun phrases are playing. This, in turn, allows you to highlight a different part of the sentence, and shift focus away from the agent (or even refuse to name the agent). What's more, it's just one of many valence-changing operations that languages make use of.

Image from the blog Heading for English.

Other languages have more. And they're awesome. David Peterson has a great discussion of this in his book The Art of Language Invention, but his examples in that book are often created languages. Natural languages, though, were the inspiration. Zulu, for instance, has passives like English, but also has causatives and benefactives. And you can combine them. For instance:

  • fona = to telephone someone
  • ngi-ni-fona = I call you (lit. I you call)

You can make it causative by adding -isa, which then makes the verb mean to cause/make/let/allow/help someone telephone someone. Notice anything? That's right, you've added an argument. 

  • fona -> fonisa
  • ngi-ni-fona = I call you
  • ngi-ni-fon-isa umama = I help you call mama

Benefactives are similar, but they make it so you do the verb for/on behalf of/instead of someone else. In Zulu, this is done by adding -ela to the verb stem:

  • fona -> fonela
  • ngi-ni-fona = I call you
  • ngi-ni-fon-ela umama = I call you for mama.

Other languages have other kinds of things. For instance, some languages have malefactives. That is, things that are done not for or on behalf of someone, but despite someone or intending them harm, ill-will, or general bad...ness. Salish, a Native American language family spoken in the Pacific Northwest, makes use of malefactives. English has a construction which does the same thing, but doesn't encode it on the actual verb:

  • she hung up on me.
  • he slammed the door on me.
  • she walked out on me.
  • My car broke down on me.

Imagine something like "she hung-on-up me." Notice, also, that benefactives do a weird thing to the arguments. In English, you can say:

  • I cook rice.

And that's transitive. If you add a bit about who you're doing it for, you get:

  • I cook rice for you.

Or you can leave the rice out entirely:

  • I cook for you.

In Zulu, though, there's very specific sentence structure. I'll put the words in English, but add the Zulu morphemes, to make it as clear as possible:

  • I cook rice
  • I cook-ela you rice
  • I cook-ela you. 
  • * I cook-ela rice you.

Remember the * means "ungrammatical." This is usually discussed in terms of promotions (yay!) and obligations (ugh). That is, the benefactive in Zulu promotes the argument that is benefiting from the action, and makes it obligatory. It also must immediately follow the verb. The thing that was the object of the sentence (rice, in this case) is then an optional argument. You don't even have to say it. Or think it. Just forget about the rice.

Therefore, as valency changing operations add arguments, so too they taketh away. This is what the passive voice is doing. Whereas benefactives promote an indirect object to direct object, and then make the original direct object optional, passives promote the direct object to subject:

  • I cook rice
  • rice is cooked (by me!) 

...or, if you prefer:

  • The whole thing of ice cream was eaten. (I refuse to say by whom.)

This is why I don't understand "grammar snobbery." Your language has a syntactic tool that does a totally cool thing, and you're just gonna decide that it's somehow bad? It's a feature, not a bug! If you think calling a natural function of your grammar that's linguistically universal bad is a way of indicating how much you know about grammar, you've got weird priorities. Appreciating grammar is not a competition to see how little of your language you can use or appreciate.

Not only are valence changing operations not bad, and totally super cool, but get this: you can combine valence changing operations, so you can have a passivized benefactive in Zulu, or a passivized causative. You can have things like:

  • A cake is being baked for mama

...but they're encoded entirely on the verb:

  • (ikhekhe) li-zo-bhak-el-wa umama
  • cake      it.FUT.bake.BEN.PASS mama

Even better, you can have a causative, a benefactive, and a passive marker on the verb, so you get something like:

  • bhala = "write" or "enroll"
  • ba-bhal-is-el-wa-ni
  • they.enroll.CAUS.BEN.PASS.why = "why are they being made to enroll?"
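The templatic flavor of this stacking can be sketched in code. Below is a drastic simplification (real Zulu morphophonology adjusts sounds in ways this toy ignores, and I'm leaving out the subject/object prefixes and the final -ni), but it shows how the valence suffixes pile up on the stem in a fixed order:

```python
# Fixed suffix order on the Zulu verb stem, per the examples above:
# causative -is, benefactive -el, passive -w, then the final vowel -a.
SUFFIXES = {"CAUS": "is", "BEN": "el", "PASS": "w"}
ORDER = ["CAUS", "BEN", "PASS"]

def derive(stem, features):
    """Stack valence suffixes on a verb stem (toy model, ignores sound changes)."""
    root = stem[:-1] if stem.endswith("a") else stem  # strip the final vowel -a
    for feat in ORDER:
        if feat in features:
            root += SUFFIXES[feat]
    return root + "a"

print(derive("bhaka", {"BEN", "PASS"}))          # bhakelwa, as in li-zo-bhak-el-wa
print(derive("bhala", {"CAUS", "BEN", "PASS"}))  # bhaliselwa, as in ba-bhal-is-el-wa-ni
```

Each suffix corresponds to an argument-structure change: -is adds a causer, -el adds (and promotes) a beneficiary, and -w removes the agent from the core arguments.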

Somewhere, there's a language that can express my desire that the passive voice be made to be used by self-appointed grammar snobs, malevolently, by me, and that language can encode most of that on the verb. If not Salish, then maybe the Niger-Congo language Koalib. And that's a beautiful thing.


©Taylor Jones 2015





SoCal is Getting Fleeked Out

For anyone who's been living under a rock for the past few months, there is a term, "on fleek," that has been around since at least 2003, but which spread like wildfire on social media after June 21, 2014, when Vine user Peaches Monroe made a video declaring her eyebrows "on fleek."

Since then, the apparently non-compositional phrase on fleek has been wildly popular, and has generated the usual discussion: both declarations that it is literally the worst and "should die," and heated debates about what exactly on fleek even means. People seem to be divided on the question of whether it's synonymous with "on point." There is also a great deal of disagreement as to what can and cannot be on fleek, with "eyebrows" now the prototype against which things are measured.

After a conversation with Mia Matthias, a linguistics student at NYU, I decided to look at other syntactic constructions, thinking it possible -- in principle -- to generalize from on fleek to other constructions. Lo and behold, there is a minority of negative-minded people who snarkily describe others as "off fleek" (haters). More interestingly, Southern California is getting fleeked out.


Geocoded tweets using variations of fleek. Toronto, you're not fooling anyone.

This is interesting because it suggests that "on fleek" is being re-interpreted, and that it is not necessarily rigidly fixed for all speakers as an idiom. Moreover, it looks like LA is leading the first move away from strictly adhering to the idiom "on fleek," by extending the use of "fleek" to the stereotypically Californian construction of [x]-ed out.

Geocoded tweets using "fleek" in California. Las Vegas, you're not fooling anyone.

I'm looking forward to watching this develop, just as we can watch bae developing (one can now be baeless, for instance). I'm also looking forward to the day one can get a fleek over, or get one's fleek on.


©Taylor Jones 2015



The Problem With Twitter Maps

Twitter is trending

I'm a huge fan of dialect geography, and a huge fan of Twitter (@languagejones), especially as a means of gathering data about how people are using language. In fact, social media data has informed a significant part of my research, from the fact that "obvs" is legit, to syntactic variation in use of the n-words. In less than a month, I will be presenting a paper at the annual meeting of the American Dialect Society discussing what "Black Twitter" can tell us about regional variation in African American English (AAVE). So yeah, I like me some Twitter. (Of course, I do do other things: I'm currently looking at phonetic and phonological variation in Mandarin and Farsi spoken corpora).

Image of North America, entirely in Tweets, courtesy of Twitter Visual Insights: https://blog.twitter.com/2013/the-geography-of-tweets

Moreover, I'm not alone in my love of Twitter. Recently, computer scientists claimed to have found regional "super-dialects" on Twitter, and other researchers have made a splash with their maps of vocatives in the US.

More and more, people are using social media to investigate linguistics. However, there are a number of serious dangers inherent to spatial statistics, which are exacerbated by the use of social media data.

Spatial statistics is developing rapidly as a field, and there are a number of excellent resources on the subject I've been referring to as I dig deeper and deeper into the relationship between language and geography. Any of these books (I'm partial to Geographic Information Analysis) will tell you that people can, and do, fall prey to the ecological fallacy (assuming that some statistical relationship that obtains at one level, say, county level, holds at another level -- say, the individual). Or they ignore the Modifiable Areal Unit Problem -- which arises out of the fact that changing where you draw your boundaries can strongly affect how the data are distributed within those boundaries, even when the change is just in the size of the unit of measurement.
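The Modifiable Areal Unit Problem is easy to demonstrate. Below is a minimal sketch in Python (the simulations later in this post were done in R, so treat this as an illustrative translation, with arbitrary point counts and grid sizes): the same fixed set of points, aggregated at two different cell sizes, produces very different-looking distributions of counts.

```python
import random

# Illustration of the Modifiable Areal Unit Problem (MAUP): identical
# point data, aggregated at two different grid resolutions.
random.seed(42)
points = [(random.uniform(0, 12), random.uniform(0, 12)) for _ in range(300)]

def grid_counts(points, cell_size, extent=12):
    """Count points per square cell of the given size."""
    n = int(extent / cell_size)
    counts = {}
    for x, y in points:
        cell = (min(int(x / cell_size), n - 1), min(int(y / cell_size), n - 1))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

coarse = grid_counts(points, cell_size=6)  # 2 x 2 grid
fine = grid_counts(points, cell_size=2)    # 6 x 6 grid

# The totals are identical; the apparent "hot spots" are not.
print(sorted(coarse.values(), reverse=True))
print(sorted(fine.values(), reverse=True)[:5])
```

The point is not the particular numbers, but that re-drawing the boundaries redistributes the same 300 points into a different-looking surface.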

The statistical consideration that most fascinates me, however, and the one most likely to be overlooked in the excitement over social media data, is the problem of sampling.

Spatial Statistics aren't the same as Regular Statistics.

In regular statistics, more often than not, you study a sample. You can almost never study an entire population of interest, but it's not generally a problem. Because of the Law of Large Numbers, the bigger the sample, the more likely you are to be able to confidently infer something about the population the sample came from (I'm using the day-to-day meanings of words like "confidence" and "infer"). However, in the crazy, upside down world of spatial statistics, sampling can bias your results.
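The Law of Large Numbers is easy to see in a few lines of Python (an illustrative sketch, not code from any analysis in this post): as the sample grows, the sample mean of uniform random draws settles toward the true mean of 0.5.

```python
import random
import statistics

# As n grows, the sample mean of Uniform(0, 1) draws converges
# toward the true mean, 0.5.
random.seed(0)
for n in (10, 1000, 100000):
    sample = [random.random() for _ in range(n)]
    print(n, round(statistics.mean(sample), 4))
```

In ordinary (non-spatial) statistics, this is why bigger samples are generally better. The next section shows why spatial data doesn't let you off so easily.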

In order to draw valid conclusions about some kinds of spatial processes, it is necessary to have access to the entire population in question. This is a huge problem: if you want to use Twitter, there are a number of ways of gathering data that do not meet this requirement, and therefore lead to invalid conclusions (for certain questions). For instance, most people use the Twitter API to query Twitter and save tweets. There are a few ways you can do this. In my work on AAVE, I used Python code to interact with the Twitter API and asked for tweets containing specific words -- the API returned tweets, in order, from the last week, and I downloaded and saved them consecutively. This means, barring questionable behavior from the Twitter API (which is not out of the question -- they are notoriously opaque about just how representative what you get actually is), I can claim to have a corpus that can be interpreted as a population, not a sample. In my case, it's very specific -- for instance: all geo-tagged tweets that use the word "sholl" during the last week of April, 2014. We should be extremely careful about what and how much we generalize from this.

Many other researchers use either the Twitter firehose or gardenhose. The former is a real-time stream of all tweets. Because the firehose is massive and unmanageable, and requires special access and serious computing power, others use the gardenhose. However, the gardenhose is a(n ostensibly random) sample of 10% of the firehose. Depending on what precisely you want to study, this can be fine, or it can be a big problem.

Why is sampling such a problem?

Put simply, random noise starts to look like important clusters when you sample spatial data. To illustrate this, I created some random data in R.

I first created 1,000 random x and 1,000 random y values, which I combined to make points with random longitudes (x values) and latitudes (y values). For fun, I made them all with values that would fit inside a box around the US (that is, x values from -65 to -118, and y values from 25 to... Canada!). I then made a matrix combining the two values, so I had 1,000 points randomly assigned within a box slightly larger than the US. That noise looked like this:
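For readers who want to play along at home, here is roughly the same simulation in Python rather than R (a sketch: the northern boundary of 49 degrees is my stand-in for "Canada"):

```python
import random

# 1,000 points with random longitudes (x) and latitudes (y) inside a
# box roughly around the continental US: x from -118 to -65, y from
# 25 up to the 49th parallel.
random.seed(1)
noise = [(random.uniform(-118, -65), random.uniform(25, 49))
         for _ in range(1000)]

print(len(noise), noise[0])
```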

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

Before we even continue, it's important to note two things. First, the above is random noise. We know this because I totally made it up. Second, before even doing anything else, it's possible to find patterns in it:

A density contour plot of random noise. Sure looks like something interesting might be happening in the upper left.


Even with completely random noise, patterns threaten to emerge. If we want to determine whether a pattern like the one above is actually random, we can compare it to something we know is random. To get technical, random spatial processes behave a lot like Poisson distributions, so we can determine how far Twitter data deviates from random noise by comparing it to a Poisson distribution using a chi-squared test. For more details, I highly recommend the book I mentioned above. I've yet to see anyone do this explicitly (though it may be buried in mathematical appendices or footnotes I overlooked).
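One common way to make that comparison concrete is a quadrat-based test: under complete spatial randomness, quadrat counts follow a Poisson distribution, so their variance-to-mean ratio (the index of dispersion) should be close to 1, and values far above 1 indicate clustering. Here is a hedged Python sketch (grid size, point counts, and the "clustered" pattern are arbitrary illustrative choices, not data from this post):

```python
import random
import statistics

def dispersion_index(points, cell_size, extent):
    """Variance-to-mean ratio of quadrat counts; ~1 under randomness."""
    n = int(extent / cell_size)
    counts = [[0] * n for _ in range(n)]
    for x, y in points:
        i = min(int(x / cell_size), n - 1)
        j = min(int(y / cell_size), n - 1)
        counts[i][j] += 1
    flat = [c for row in counts for c in row]
    return statistics.variance(flat) / statistics.mean(flat)

random.seed(7)
# Uniform random points across the whole box.
uniform_points = [(random.uniform(0, 10), random.uniform(0, 10))
                  for _ in range(1000)]
# Clustered points: everything crammed into one corner.
clustered = [(random.uniform(0, 2), random.uniform(0, 2))
             for _ in range(1000)]

print(round(dispersion_index(uniform_points, 1, 10), 2))  # near 1
print(round(dispersion_index(clustered, 1, 10), 2))       # far above 1
```

In practice one would then compare the observed statistic against the chi-squared distribution; the book mentioned above covers the details.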

This is what happens when we take a random sample of 100 points. That's 10%, the same proportion as the Twitter gardenhose:

A 100-point sample.


And this is what happens when we take a different 100 point random sample:

Another random 100-point sample from the same population.


The patterns are different. These two samples tell different stories about the same underlying data. Moreover, the patterns that emerge look significantly more pronounced than in the full data set.
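The effect is easy to reproduce. Here is a Python sketch (again an illustration, not the original R code; the grid size is an arbitrary choice): two different 10% samples of one random point population, binned into a coarse grid, can each place their apparent "hot spot" in a different cell.

```python
import random

def densest_cell(points, cell_size=2.5, extent=10.0):
    """Return the grid cell containing the most points."""
    n = int(extent / cell_size)
    counts = {}
    for x, y in points:
        cell = (min(int(x / cell_size), n - 1), min(int(y / cell_size), n - 1))
        counts[cell] = counts.get(cell, 0) + 1
    return max(counts, key=counts.get)

random.seed(3)
population = [(random.uniform(0, 10), random.uniform(0, 10))
              for _ in range(1000)]

# Two independent 10% samples of the same population.
sample_a = random.sample(population, 100)
sample_b = random.sample(population, 100)
print(densest_cell(sample_a), densest_cell(sample_b))
```

Run this with different seeds and the "hottest" cell wanders around the grid, even though the underlying population never changes.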

To give a clearer example, here is a random pattern of points actually overlaying the United States, which I made after much wailing, gnashing of teeth, and googling of error codes in R. I didn't bother to choose a coordinate projection (relevant XKCD):

And here are four intensity heat maps made from four different random samples drawn from the population of random point data pictured above:

This is bad news. Each of the maps looks like it could tell a convincing story. But contrary to map 3, Fargo, North Dakota is not the random point capital of the world; the apparent cluster there is just an artifact of sampling noise. Worse, this is all the result of a completely random sample, before we add any other factors that could potentially bias the data (applied to Twitter: first-order effects like uneven population distribution, uneven adoption of Twitter, and biases in the way the Twitter API returns data; second-order effects like the possibility that people are persuaded to join Twitter by their friends, in person).

What to do?

The first thing we, as researchers, should all do is think long and hard about what questions we want to answer, and whether we can collect data that can answer those questions. For instance, questions about frequency of use on Twitter, without mention of geography, are totally answerable, and often yield interesting results. Questions about geographic extent, without discussing intensity, are also answerable -- although not necessarily exactly.

Then, we need to be honest about how we collect and clean our data, and about its limitations. For instance, I would love to compare the use of nuffin and nuttin (for "nothing") by intensity, assigning a value to each county on the East Coast, and create a map like the "dude" map above -- however, since the two are technically separate data sets based on how I collected the data, such a map would be completely statistically invalid, no matter how cool it looked. Moreover, if I used the gardenhose to collect data and just mapped all tokens of each word, it would not be statistically valid, because of the sampling problem. The only way a map like the "dude" map that is going around is valid is if it is based on data from the firehose (which it looks like they did use, given that their data set is billions of tweets).

Even then, we have to think long and hard about what the data generalizes to: Twitter users are the only people we can actually say anything about with any real degree of certainty from Twitter data alone. This is why my research on AAVE focuses primarily on the geographic extent of use, and why I avoid saying anything definitive about comparisons between terms or the popularity of one over another.

Ultimately, as social media research becomes more and more common, we as researchers must be very careful about what we try to answer with our data, and what claims we can and cannot make. Moreover, the general public should be very wary of making any sweeping generalizations or drawing any solid conclusions from such maps. Depending on the research methodology, we may be looking at nothing more than pretty patterns in random noise.



©Taylor Jones 2014
