Local Identity, Appropriation, and Mock Yiddish: A Kvetch

November 12, 2020 by Taylor Jones

There is an advertisement I consistently see on TV, especially on New York 1, that never fails to annoy me. It’s a car service ad that tries to tell the viewer what “real New Yorkers” do, and does so by repeating a “New York” catchphrase over and over: “what are you, mashugana? Real New Yorkers take Carmel!” This is their spelling, by the way.

What they’re trying to do, as far as I can tell, is get people to use their car service based on insulting them in bad Yiddish.

Yiddish, and by extension Yiddish English, are highly stigmatized, much like African American English (AAE). Both are stigmatized in large part because of (non-linguistic) prejudice against their speakers. Both also have what linguists call “covert prestige” meaning you can use them to positive effect sometimes. With AAE this often means that borrowing features of AAE can be used to construct a “tough” identity, a “dangerous” identity, or a “cool”, “in-the-know” identity. For women, sometimes it’s a “sassy” identity (think about white women adopting “girl” and “girlfriend”, for instance). With Yiddish English, it’s often a curmudgeonly, beleaguered, but comedic persona (à la Mel Brooks or Jerry Seinfeld).

All of this is related to “mock” language — best explained by Jane Hill’s discussion of “mock Spanish” in The Everyday Language of White Racism.

So what’s the problem?

There are a few issues with this ad, from a sociolinguistic and linguistic anthropological stance. First and foremost, it’s borrowing language use from an ethnolect (that is, a language variety associated with a (minority) ethnicity, often strongly related to segregation), and it’s putting those words in the mouths of people who are seemingly not part of that community, and it’s doing so for comedic effect. This is not to say there aren’t black Jews (there are), or WASPy Jewish converts (there are), or people who live in New York, have a few Jewish friends, and picked up some Yiddishisms that they use appropriately (there are a ton). It’s just that this doesn’t seem to be what’s happening in the ad, in part because of my second point:

They do it wrong.

What they spell (and pronounce) as mashugana is the Yiddish noun (not adjective!) משוגענער meshugener: ‘lunatic, madman, crazy person.’ This is a noun derived from the adjective משוגע meshuge, originally from Hebrew מְשֻׁגָּע m’shugá ‘crazy, insane.’

The first guy, I’ll give the benefit of the doubt, as it sounds like he’s saying “what are you, a meshugener?” spoken by someone with a non-rhotic (‘r-less’) variety — something consistent with working class Jewish New York.

However, the rest of the speakers in the ad say “what are you, meshugana,” both mispronouncing the word and treating a noun as an adjective. The fact that their official YouTube channel spells it that way suggests it’s how they intended to use it. What’s more, their voice-over treats the noun as an adjective. Let’s break them down. First, you have a woman who seems to be stumbling over the whole line, not just the last word:

Then, a slowed down recording of a woman saying “what are you, mashuigana?”

Followed by a man again treating it as an adjective. The way he says “of course” is…shall we say, not like the “real” New Yorkers I know.

Then a man who treats it as an adjective, pronounces the /r/, but weirdly changes the final vowel (“meshuganar”):

Finally the voice over says “don’t be meshugener”.

Here’s the thing: Yiddish is not just “anything goes.” It’s not “bad German.” It’s a language. The same goes for Yiddish English, a dialect of English heavily influenced by Yiddish. And just like AAE, there’s often a perception that it’s just “bad grammar” because it has different grammatical structures than standard English (e.g., “You want I should ask him?”).

This means that it’s entirely possible to speak it wrong. And when you dismiss it as something not worth getting correct because it’s not a “real” or “valid” language, it sends a message to the people who speak it that they aren’t of value. While most people know that approximately 6 million Yiddish speakers were killed in a genocide, fewer are aware that the strong push in the US toward cultural assimilation resulted in language repression that is a textbook example of cultural genocide (in this case, the suppression of cultural activities that do not conform to the destroyer's notion of what is appropriate). And the flip side of that is that the majority of people who do speak Yiddish natively now live in Brooklyn, and would certainly recognize this ad as not being “correct” (although they may not feel as strongly about it as I do). And heritage learners who are trying to learn the language of their grandparents or great-grandparents now have little choice but to learn a “standardized” version that isn’t what their ancestors actually spoke, or the language of a particular sect from a particular place that they’re not directly related to. It’s depressing.

This business with the ad is further complicated by the fact that a generation ago there were people who did speak Yiddish well, but only used it publicly for comedic effect. Remember that mention of Mel Brooks, above? Well one of the best gags in Blazing Saddles, but also one that’s problematic from a language genocide point of view (but also that’s why it works?), has Mel Brooks playing a (generic) Native American and speaking Yiddish. In fact, he correctly declines meshugga in it. (Warning for those not already familiar with Blazing Saddles, the whole movie is puerile jokes about race and racism, and in the clip below he uses a Yiddish term that has been borrowed into English and is offensive in English but not (necessarily) in Yiddish).

None of this rises to the level that I’d have hard feelings if anyone told me they got here by taking Carmel, even though it does genuinely give me tsuris. I think the real takeaway for me, beyond the catharsis of kvetching (note to self: great title for a memoir), is something that was put best by Mel Brooks:

zayt nisht meshugga. Cop a walk.

(…instead of taking a cab)

-----

©Taylor Jones 2020

Have a question or comment? Share your thoughts below!

Truncations and offensive language

November 12, 2020 by Taylor Jones

CONTENT WARNING: lots of uncensored slurs and offensive language, in a variety of languages.

In part starting with my research with Christopher Hall on uses of the “n-word” (available here, and in (free) final draft form here), and in part because of the consulting work I do, I am an expert on slurs, epithets, and offensive language: the main language-y thing that companies, government organizations, journalists, lawmakers, lawyers, and judges are interested in is offensive language. Everyone wants to understand what they can and can’t, or should and shouldn’t say, and where the line is drawn, and for many people it has stark, real-world consequences. One of the things lots of people ask about is some variation of how do you know or how can I prove that something is offensive?

Ever since working on totes constructions with Lauren Spradlin, years ago, I’ve been thinking about hypocoristics (fancy linguist speak for ‘baby talk’ or ‘pet names’) and truncations, and how they relate to offensive language. A huge number of slurs are truncations of other words, and this isn’t really a coincidence.

Working with Lauren Spradlin on totes truncations we were focused on phonological and morphological rules of truncation: how does everyone know how to make new truncations, and what intuitive rules do people follow? Does anyone ever break those rules? What do these truncation patterns tell us about language more generally? There’s a lot to say there (and we’ve only, honestly, written about some of it; there’s another paper on the way, but here’s the conference talk), but one of the things that stood out to me immediately, when we were talking about how you don’t yuzh (usually) eat blewbs (blueberries) with guac beeteedubs (BTW, by the way), was that certain words were shortenable but just sounded…offensive.

Truncating blueberry? Great. Truncating adjectives like ridiculous, or obnoxious? Totally fine. Truncating adjectives that relate to ethnic groups or places of origin? Really offensive. This is interesting: a morphological transformation that’s completely unremarkable in most contexts is deeply offensive in a small set of specific contexts.

The crazy thing is, this holds for novel truncations, meaning I can refer to someone with a truncation you’ve never heard before, and you’ll have an intuitive sense of whether it’s offensive. For instance, if someone said the chefs at the French restaurant I like are all mexies it reads, at least to me and everyone I’ve asked, as very offensive… even though I don’t know anyone who has ever heard that word before. (I wouldn’t be surprised if it exists, though). It’s clear I mean Mexican, which is in-and-of-itself fine, but it’s also clear that this particular phrasing is NOT OK.

And when you look at a list of offensive words (like, say, on Wikipedia), it really jumps out how many offensive terms follow this. A by no means exhaustive sample in no particular order:

Heeb and Heeby from Hebrew
Jap from Japan(ese)
Jerry from German
Hunky (and honkey(!)) from Hungarian
Paki from Pakistan(i)

Another way people intentionally offend is to take names stereotypically associated with a people, and call someone by that name, knowing it’s not their real name (for instance, Ike is an old-fashioned epithet for Jewish men, Shaniqua is a name used to insult Black American women, Ahmed is used for Arab and Muslim men, and the fact that this phenomenon exists goes a long way toward explaining the sentiment that being called “Karen” is a slur, even when the target is a person whose actual name is Karen but that’s unknown to the speaker). This truncation is also applied to names, making their use, in some cases, more offensive:

Ikey-mo from Ike and Moses (or Moishe)
Ack (or Akh) from Ahmed
Hymie from Hyman (itself an anglicization of Chayyim)
Abi from Abraham
Mo from Mohammed
Shaneeq(s) from Shaniqua

Interestingly, it’s not just totes style truncations, and shortening applied to offensive terms doesn’t make them less offensive, but rather more:

coon from barracoon
nig from…you know what it’s from
spic from either “hispanic” or “no spik ingles” (!)

I mentioned hypocoristics above, and I really think that’s the common factor. Baby-talk and “childish” language games can be fun and solidarity-building when they’re in-group behavior, but when baby talk is directed to someone who you don’t have the appropriate level of social closeness with, it’s insulting. It tells the listener you respect them as much as you do a child. And to use baby talk for the name of the listener’s ethnic group, religion, or geographic origin, indicates belittling their background.

This seems to hold cross-linguistically as well, so in French, verlan (a game where you move syllables around similar to Pig Latin) is used to make offensive terms, like rebeu from beur, itself a melioration of beur from arabe (Arab), or feuj from juif (Jew(ish)). In Hebrew you get aravush, which has a “cutesy” diminutive marker -ush on the word arab. And the baby-talk element can be used to generate new offensive terms, so the more hateful parts of the internet use the term nig-nog to double down on offensiveness. While that particular term is attested as early as 1959, it and words like it were a starting point for the (thankfully now-defunct) subreddit /r/coontown.

Perhaps the wildest part about this, to me anyway, is that these truncations are used in fiction, sometimes even for groups of people that don’t exist. There’s an episode of Star Trek Voyager where a character claims that The Doctor was being totally Vulky and my first reaction was “the censors let that slide?!” Cardi is a slur for the fictional “race” of Cardassians in Star Trek (and it has its own entry in Memory Alpha). And in the gritty post-apocalyptic Canadian graphic novel “We Stand on Guard” the American aggressors routinely call the Canadians Nucks (from Canuck, a sometimes offensive, sometimes not slang term for Canadians). I remember reading it at a friend’s house and being genuinely shocked at a character referring to one of the protagonists as a “nuck bitch”. (He’s a Mountie, itself a word probably originally intended disparagingly, from mounted police, and he left it as bedside reading as a sly provocation).

I've been watching so much DS9 lately that "Cardi B" sounds like a Cardassian rapper reclaiming the slur
— Cohen is a ghost (@skullmandible) February 4, 2019

There’s a LOT left to be said about slurs, epithets, and offensive language — after all, we haven’t even touched on using religious headgear or ethnic foods as terms of address, let alone sweary words combined with prosody (as in “shit gibbon”) — but it seems there’s something profoundly offensive about truncation and the diminutives it commonly is accompanied by. So a good rule of thumb is to avoid any truncations unless you either (1) are really sure your interlocutors won’t take it the wrong way, or (2) you’re actively trying to offend.

-----

©Taylor Jones 2020


Testifying While Black

January 22, 2019 by Taylor Jones

[content warning: language] [co-authored with Jessica Kalbfeld]

For the last four years I've been working on a large-scale project distinct from writing my dissertation that my family and friends know I refer to as my "shadow dissertation." It's a co-authored paper, with Jessica Kalbfeld (Sociology, NYU), Ryan Hancock (Philadelphia Lawyers for Social Equity, WWDLaw), and my advisor, Robin Clark (Linguistics, University of Pennsylvania), and we just received word that it has been accepted for publication in Language. Many of my other projects, including my work on the verb of quotation talkin' 'bout, on first person use of a nigga, and on the spoken reduction of even to "eem", among others, were all in service of this project.

Simply put: court reporters do not accurately transcribe the speech of people who speak African American English at the level of their industry standards. They are certified as 95% accurate, but when you evaluate sentence-by-sentence only 59.5% of the transcribed sentences are accurate, and when you evaluate word-by-word, they are 82.9% accurate. The transcriptions changed the who, what, when, or where 31% of the time. And 77% of the time, they couldn't accurately paraphrase what they had heard.

Let me be clear: I am not saying that all court reporters mistranscribe AAE. However, the situation is dire. For this project, we had access to 27 court transcriptionists who currently work in the Philadelphia courts -- fully a third of the official court reporter pool. All are required to be certified at 95% accuracy; however, the certification is based primarily on the speech of lawyers and judges, and they are tested for speed, accuracy, and technical jargon.

We recruited the help of 9 native speakers of African American English (if you're new to my blog, African American English is a rule-governed dialect as systematic and valid as any other), from West Philadelphia, North Philadelphia, Harlem, and Jersey City (4 women and 5 men). Each of these speakers was recorded reading 83 different sentences, all of which were taken from actual speech (that is, we didn't just make up example sentences). These sentences each had specific features of AAE, 13 in total, as well as combinations of features. Examples of sentences included:

When you tryna go to the store?
what he did?
where my baby pacifier at?
she be talkin’ ‘bout “why your door always locked?”.
Did you go to the hospital?
He been don’t eat meat.
It be that way sometimes.
Don’t nobody never say nothing to them.

Features we tested for included:

null copula (deletion of conjugated is/are, as in he workin’ for “he is working”).
negative concord (also known as multiple negation or “double negatives”).
negative inversion (don’t nobody never say nothing to them meaning “nobody ever says anything to them”).
deletion of possessive s (as in his baby mama for his baby’s mama).
habitual be (an invariant grammatical marker that indicates habitual action, as in he be workin’ for “he is usually working”).
stressed been (this marks completion in the subjectively distant past, as in I been did my homework meaning “I completed my homework a long time ago”).
preterite had (this is the use of had where it does not indicate prior action in the past tense, but rather often indicates emotional focus in the narrative, as in what had happened was… for “what happened was…”).
question inversion in subordinate clauses (this is when questions in subordinate clauses invert the same way as in matrix clauses in standard English, as in I was wondering did you have trouble for “I was wondering whether you had trouble”).
first person use of a nigga (This is where a nigga does not mean any person, but rather indicates the speaker, as in a nigga hungry for “I am hungry”).
spoken reduction of negation (this is the reduction of ain’t even to something that sounds like “eem”, or the reduction of don’t to something that sounds like “ohn”).
quotative talkin’ ‘bout (this is the use of talkin’ ‘bout, often reduced to sounding like “TOM-out” to introduce direct or indirect quotation, as in he talkin’ ‘bout “who dat?” meaning “he asked ‘who’s that?’”).
modal tryna (this is the use of tryna to indicate intent or futurity, as in when you tryna go for “when do you intend to go?”).
perfect done (this is a perfect marker, indicating completion or thoroughness, as in he done left meaning “he left”).
be done (this is a construction that can mark a combination of habitual and completed actions, or can mark resultatives, as in I be done gone home when they be acting wild for “I’ve usually already gone home when they act wild”).
expletive it (this is replacing standard English “there” with it, as in it’s a lot of people outside for “there are a lot of people outside”).
combinations of the above, as in she be talkin’ ‘bout “why your door always locked?” meaning “she often asks ‘why is your door always locked?’”

These are by no means all the patterns of syntax unique to AAE, but we thought they were a decent starting point. However, not only does AAE have different grammar than other varieties of English, but more often than not, African Americans have different accents from their white counterparts within the same city. Think about it: Kevin Hart's Philadelphia accent is not the same as Tina Fey's (it's also why Kenan Thompson's Philly accent is so weird in that sketch).

All of the court reporters we tested were given a 220Hz warning tone to tell them a sentence was coming, followed by the same sentence played twice, followed by 10 seconds of silence. We asked them to 1) transcribe what they heard (their official job) and 2) paraphrase what they heard in "classroom English" as best as they could (not their job!). The audio was at 70-80 decibels at 10 feet (that is, very loud). The sentences and voices were randomized so they heard a mix of male and female voices, and they didn't hear the same syntactic structures all at the same time. All of the court reporters expressed that what they heard was:

better quality audio than they're used to in court
consistent with the types of voices they hear in court (more specifically, they often volunteered "in criminal court").
spaced with more than enough time for them to perform the task (they often spent the last 5 seconds just waiting -- they write blisteringly fast).

What was the result? None of them performed at 95% accuracy, no matter how you choose to define accuracy, when confronted with everyday African American English spoken by local speakers from the same speech communities as the people they are likely to encounter on the job. If you choose to measure accuracy in terms of full sentences -- either the sentence is correct or it is not -- the average accuracy was 59.5%. If you choose to measure accuracy in terms of words -- how many words were correct -- they were 82.9% accurate on average. Race, gender, education, and years on the job did not have a significant effect on performance, meaning that black court reporters did not significantly outperform white court reporters (we think this is likely because of the combination of neighborhood, language ideologies, and stance toward the speakers -- black court reporters distanced themselves from the speakers and often made a point of explaining they "don't speak like that."). Interestingly, the kinds of errors did seem to vary by race: there's weak evidence that black court reporters did better understanding the accents, but still struggled with accurately transcribing the grammar associated with more vernacular AAE speakers.
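To see why the two ways of measuring accuracy can diverge so sharply, here is a minimal Python sketch. It is not the scoring procedure from the paper, which this post does not describe; it assumes that a sentence counts as accurate only if every word matches, and that word-level accuracy is one minus the word-level edit distance divided by the number of words spoken. The function names (tokenize, word_edit_distance, score) are made up for illustration. The first two pairs are taken from the examples further down in this post; the third is an invented perfect transcription, added only for contrast.

```python
import re


def tokenize(text):
    """Lowercase, normalize curly apostrophes, and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


def word_edit_distance(ref, hyp):
    """Levenshtein distance over word tokens (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a spoken word
                            curr[j - 1] + 1,      # insert an extra word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def score(pairs):
    """Return (sentence_accuracy, word_accuracy) for (spoken, transcribed) pairs.

    ASSUMPTION: these two definitions are illustrative, not the paper's.
    """
    exact_sentences = 0
    total_words = 0
    total_errors = 0
    for spoken, transcribed in pairs:
        ref, hyp = tokenize(spoken), tokenize(transcribed)
        exact_sentences += ref == hyp
        total_words += len(ref)
        # Cap errors so one sentence cannot count for more than its length.
        total_errors += min(word_edit_distance(ref, hyp), len(ref))
    return exact_sentences / len(pairs), 1 - total_errors / total_words


PAIRS = [
    ("He don't be in that neighborhood", "We going to be in this neighborhood"),
    ("He a delivery man", "he's Larry, man"),
    # Hypothetical: an invented error-free transcription, for contrast only.
    ("It be that way sometimes", "It be that way sometimes"),
]

sentence_accuracy, word_accuracy = score(PAIRS)
print(f"sentence-level accuracy: {sentence_accuracy:.1%}")
print(f"word-level accuracy:     {word_accuracy:.1%}")
```

With these three pairs, only one sentence in three is fully correct, while a bit more than half of the words are scored as correct. That is the same direction of gap as the 59.5% versus 82.9% figures above: a single wrong word sinks a whole sentence, but costs only one word at the word level.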

For all the court reporters, their performance was significantly worse when we asked them to paraphrase (although individual court reporters did better or worse with individual features. For example, one white court reporter nailed stressed been every time -- something we did not expect). Court reporters correctly paraphrased on average 33% of the sentences they heard. There was also not a strong link between their transcription and paraphrase accuracy -- in some cases they even transcribed all the words correctly, but paraphrased totally wrong. In a few instances, they paraphrased correctly, but their official transcription was wrong! The point here is that while the court reporters did poorly transcribing AAE, they did even worse understanding it -- which makes it no surprise they had difficulty transcribing.

In the linguistics paper, we go into excruciating detail cataloguing the precise ways accent and grammar led to error. However, the takeaway for the general public is that speakers of African American English are not guaranteed to be understood by the people transcribing them (and they're probably even less likely to be understood by some lawyers, judges, and juries), and not guaranteed that their words will be transcribed accurately. Some examples of sentences together with their transcription and paraphrase include (sentence in italics, transcription in angle brackets <>, and paraphrase in quotes):

he don’t be in that neighborhood — <We going to be in this neighborhood> — “We are going to be in this neighborhood”
Mark sister friend been got married — <Wallets is the friend big> — (no paraphrase)
it’s a jam session you should go to — <this [HRA] jean [SHA] [TPHAO- EPB] to> — (no paraphrase)
He don’t eat meat — <He’s bindling me> — “He’s bindling me”
He a delivery man — <he’s Larry, man> — “He’s a leery man”

Why does this matter?

First and foremost, African Americans are constitutionally entitled to a fair trial, just like anyone else, and the expectation of comprehension is fundamental to that right. We picked the "best ears in the room" and found that they don't always understand or accurately transcribe African American English. And crucially, what the transcriptionist writes down becomes the official FACT of what was said. For 31% of the sentences they heard, the transcription errors changed the who, what, when, or where. Some were squeamish about writing the "n-word" and chose to replace it with other words; however, those who did often failed to understand who it referred to (for instance, changing a nigga been got home "I got home a long time ago" to <He got home>, or in one instance <Nigger Ben got home>, evidently on the assumption it was a nickname).

And it's not just important for when black folks are on the stand. Transcriptions of depositions, for instance, can be used in cross-examination. In fact, it was seeing Rachel Jeantel defending herself against claims she said something she hadn't that sparked the idea for this project. (And she really hadn't said it -- I've listened to the deposition tape independently, and two other linguists -- John Rickford and Sharese King -- came to the conclusion the transcription was wrong, and have published to that effect). Transcriptions are also used in appeals. In fact, one appeal was decided based on a judge's determination of whether "finna" is a word (it is) and whether "he finna shoot me" is admissible in court as an excited utterance. The judge claimed, wrongly, that it is impossible to determine the "tense" of that sentence because it does not have a conjugated form of "to be", claiming that it could have meant "he was finna shoot me." If you know AAE, you know that you can drop "to be" in the present but not in the past. That is, you can drop "is" but not "was". The sentence unambiguously means "he is about to shoot me," that is, in the immediate future.

And this is not even counting misunderstandings like the recent "lawyer dog" incident, in which a defendant said "I want a lawyer, dawg" and was denied legal counsel because there are no dogs who are lawyers.

All of this suggests a way that African Americans do not receive fair treatment from the judicial system; one that is generally overlooked. Most of us learn unscientific and erroneous language ideologies in school. We are explicitly taught that there is a correct way to speak and write, and that everything else is incorrect. Linguists, however, know this is not the case, and have been trying to tell the public for years (including William Labov’s “The Logic of Nonstandard English,” Geoffrey Pullum’s “African American English is Not Standard English with Mistakes,” the Linguistic Society of America statement on the “ebonics” controversy, and much of the research programs of professors like John Rickford, Sonja Lanehart, Lisa Green, Arthur Spears, John Baugh, and many, many others). The combination of these pervasive language attitudes and anti-black racism leads to linguistic discrimination against people who speak African American English — a valid, coherent, rule-governed dialect that has more complicated grammar than standard classroom English in some respects. Many of the court reporters assumed criminality on the part of the speakers, just from hearing how the speakers sounded — an assumption they shared in post-experiment conversations with us. Some thought we had obtained our recordings from criminal court. Many also expressed the sentiment that they wish the speakers spoke "better" English. That is, rather than recognizing that they did not comprehend a valid way of speaking, they assumed they were doing nothing wrong, and the gibberish in their transcriptions (see above examples) was because the speakers were somehow deficient.

Here, I think it is very important to point out two things: first, many people hold these negative beliefs about African American English. Second, the court reporters do not have specific training on part of the task they are required to do, and they all expressed a strong desire to improve, and frustration with the mismatch between their training and their task. That is, they were not unrepentant racist ideologues out to change the record to hurt black people — they were professionals, both white and black, who had training that didn't fully line up with their task and who held common beliefs many of us are actively taught in school.

What can we do about it?

There is the narrow problem we describe of court transcription inaccuracy, and there is the broader problem of public language attitudes and misunderstanding of African American English. For the first, I believe that training can help at least mitigate the problem. That's why I have worked with CulturePoint to put together a training suite for transcription professionals that addresses the basics of "nonstandard" dialects, and gives people the tools to decode accents and unexpected grammatical constructions. Anyone who has ever looked up lyrics on genius.com or put the subtitles on for a Netflix comedy special with a black comic knows that the transcription problem is widespread. For the second problem, bigger solutions are needed. Many colleges and universities have undergraduate classes that introduce African American English (in fact, I've been an invited speaker at AAE classes at Stanford, Georgetown, University of Texas San Antonio, and UMass Amherst), but many, even those with linguistics departments, do not (including my current institution!). Offering such classes, and making sure they count for undergraduate distribution requirements is an easy first step. Offering linguistics, especially sociolinguistics in high schools, as part of AP or IB course offerings could also go a long way toward alleviating linguistic prejudice, and to helping with cross dialect comprehension. Within the judicial system more specifically, court reporters should be encouraged to ask clarifying questions (currently, it's officially encouraged but de facto strongly discouraged). Lawyers representing AAE speaking clients should make sure that they can understand AAE and ask clarifying questions to prevent unchecked misunderstanding on the part of judges, juries, and yes, court reporters. Linguists and sociologists can, and should, continue public outreach so that the general public has an informed idea about what science tells us about language and discrimination.

This is a disturbing finding that has strong implications for racial equality and justice. And there's no evidence that the problem of cross-dialect miscomprehension is only limited to this domain (in fact, we have future studies planned already, in medical domains). This study represents a first step toward quantifying the problem and what the key triggers are. Unfortunately, the solutions are not all clear or easy to enact, but we can chip away at the problem through careful scientific investigation. On the heels of the 19th national observance of Martin Luther King Jr. Day (It has only been observed in all 50 states since 2000(!)), it seems appropriate to reaffirm that “No, no, we are not satisfied, and we will not be satisfied until justice rolls down like waters and righteousness like a mighty stream.”

-----

©Taylor Jones 2019


African American English and Cross Dialect Comprehension

October 02, 2018 by Taylor Jones

A while back, I wrote a handful of tweets in response to someone describing a linguist giving students a test on their comprehension of African American English. I explained that I am a linguist and part of what I study is cross-dialect comprehension between AAE and mainstream, “classroom” (white) English. Or really, the lack of comprehension on the part of the mainstream speakers. The tweet was seen by over 50,000 people (!) and a lot of people asked for DMs with more information about AAE. I figured it was easier to put some information all in one place here.

I'm a linguist who researches AAE and cross dialect comprehension. You're right about Habitual 'be' and stressed 'been'. Some other things white people usually don't understand:
-Preterite had: using "had" in sentences that don't place the events before other past events. 1/
— Language Jones (@languagejones) June 25, 2018

I’ve written elsewhere about what AAE is, and about borrowing and appropriation, especially those based on not quite understanding what is being borrowed, but here I want to dig a little more into whether and to what extent people who don’t speak AAE actually understand it.

I have a co-authored paper under review right now that I won’t discuss further here, that investigates to what extent court reporters understand and accurately transcribe AAE, which I will blog about once it’s published (spoilers: it’s bad out there). Below is a primer on AAE, a handful of things that are not understood by non-AAE-speakers, and some recommended readings.

A quick primer on AAE:

AAE is a dialect spoken primarily but not exclusively by black Americans, and is the language associated primarily with the descendants of slaves in the American South. It is a systematic, rule-governed, logical, fully-formed language variety, and it differs significantly from other varieties of English, across all levels of the language (that is, the phonology, or sound system, is different, it has different grammatical rules, etc.). It is important to note that AAE has different grammatical rules than standard English, and not that it has no grammatical rules. Therefore, it is absolutely possible to speak it wrong — something white people who are ignorant of the rules do often when imitating black people who speak AAE.

The accent of AAE is different from white accents, and because of segregation, people in the same city often have very different accents depending on race. Take Chicago for instance. The stereotypical white Chicago accent exhibits what’s called the Northern Cities Vowel Shift, which SNL made fun of with their sketch about “da bears.” But that’s not the only Chicago accent. Think about it: does Kanye West sound like that?

It’s actually not fair to say the accent of AAE, since there’s regional differences (Michael B. Jordan (Philly) sounds nothing like Ryan Coogler (Bay area)). In fact, my dissertation research is on regional variation in AAE accents (if you identify as black and grew up in the US, please think about participating in my anonymous survey — it takes 3-4 minutes and can be found here: www.languagejones.com/aaes).

The grammar:

When I talk about cross-dialect comprehension, different accents definitely play a part, but so does very different grammar. There’s not much research on how well non-AAE speakers understand or don’t understand AAE, but what there is does not look good.

Labov 1972 found that white teachers in Harlem did not understand habitual be or stressed been. When given the scenario “you ask a child if he did his homework, and he replies ‘I been did my homework’”, most incorrectly interpreted that to mean the child had not completed their homework (see stressed been in the list below). Similarly, Rickford 1975 mentions an informal survey in which white participants took “they been got married” to mean a number of different, all wrong things.

Arthur Spears coined the term “camouflage construction” for constructions in AAE that look like they mean something in standard English, but really mean something else. He did this initially when describing “indignant come”, which is a marker of indignation, not a verb of motion. John Rickford and a few of his students did work on the use of had in preterite, not perfective, constructions. Christopher Hall and I have written on first person use of a nigga, and have a paper under review right now dealing with more than 10 different uses of “the n-word” in AAE that are distinct from those available to speakers of other dialects. I’ve written about “talkin’ ‘bout” as a verb of quotation.

But beyond a handful of papers on individual morphosyntactic features of AAE, there’s not really any research on how well other people actually understand it. We know they don’t always understand habitual be, but not at what rate they do or don’t. Same for a ton of other features. The court reporter paper I mentioned above is, to my knowledge, the first quantitative test of cross-dialect comprehension for almost all of the features mentioned in it.

What is unique to AAE? What is not understood by others?

Keeping in mind that there’s not much quantitative research on this, I can at least point to a handful of differences between AAE and other language varieties that lead to confusion or miscomprehension. Here’s a partial list:

Habitual be: he be workin’ does not mean “he is at work” or “he is working.” It means he works, usually or often. In fact, a sentence like this can imply he’s not currently at work. I wrote a short post about it here, comparing hiring ads for fast food restaurants. This is one of the earliest features that sociolinguists focused on. Bill Labov, Walt Wolfram, and John Rickford, as well as many, many others have written about this.
stressed been: This refers to actions completed in the distant past. So I been did my homework means I finished it a long time ago. I been told you that means I told you a long time ago. They been got married means they got married a long time ago, and still are. It does not mean the same thing as standard English “have been” as in I have been doing my homework — which implies I didn’t finish yet. John Rickford has written extensively about this.
Preterite had: This is use of “had” for past events, but not to situate them before others. I had went to the store means the same thing as “I went to the store”, although it may have a different function in terms of emotion in a narrative. John Rickford has written extensively about this.
Quotative “talkin’ ‘bout”: This is “talkin’ ‘bout” used the same way white people use “like” as in “he was all like ‘oh my god’”. It’s often used with indignant come, and often used in a mocking context. I wrote a paper about it available here. It’s also touched on in Arthur Spears’ work on indignant come, and in Patricia Cukor-Avila’s work on verbs of quotation.
First person a nigga: this is where a nigga means the same thing as “me” or “I”. I have blogged about it here, I have a paper in conference proceedings about it here, and Christopher S. Hall and I have a paper about it (and other n-words) under review right now.
Negative Auxiliary Inversion: This is don’t nobody never instead of “nobody (n)ever does”. Interestingly, there’s some evidence that without context, people who don’t speak AAE interpret these as commands. Lisa Green has written about the grammar of this construction.
Question Inversion in subordinate clauses: instead of “I was wondering whether you did it,” you may hear I was wondering did you do it. Lisa Green has written about this. There’s some evidence that it’s below the level of consciousness even for middle class speakers of what Arthur Spears calls AASE (African American Standard English).
The associative plural nem (“an’ them”): to my knowledge, there’s only one sentence on this in the sociolinguistics literature, in a book chapter written by Salikoko Mufwene (in African American English: Structure, History, and Use). This functions the same as associative plurals in other languages (like Zulu). Saying Malik nem (or “Malik an’ ‘em”) means “Malik and the people associated with him” and from context it’s clear who that means. Could be family, could be friends, could be the people he’s sitting with right now. I have an aunt (in the African American family-by-choice-not-blood kind of way) named M., and stay asking about M nem.
Stay for regular or repeated action: He stay acting stupid does not mean “he’s still acting stupid” or “he remains acting stupid” but rather, he consistently, repeatedly acts stupid.
It instead of there: it’s a lot of people means “There are a lot of people”…
Deletion of the subject relative pronoun: Standard English can delete “who” when referring to a person in a subordinate clause only if the person is the direct object (“That’s the man who I saw yesterday” or “That’s the man I saw yesterday”). AAE can delete the subject version (That’s the man saw me yesterday). I recently heard this combined with it instead of there, on the radio: It’s a lot of people don’t go there (meaning, there are a lot of people who don’t go there).
finna and tryna as immediate future markers: There’s one conference paper written by an undergrad (who I think didn’t continue to grad school in linguistics) about tryna as marking intent or immediate future action. There’s an entire court case where the appeal decision hinged on whether finna was a word and what it means. Both can be used to mean you’re about to do something.
be done: White folks often know done as in “he done hit him!” but don’t know be done as in “I be done gone to bed when he be getting off work” meaning “I’ve usually already gone to bed when he is getting off work”. There’s also the be done familiar from the crows in Dumbo: I’ll be done seen most everything when I seen an elephant fly, which is a slightly different construction.
Set expressions, idioms, clichés: Things like it be that way sometimes, or what had happened was are not always understood, or even recognized as set expressions.

There are plenty of others, but these are the main ones (in my opinion). And of course, these can all combine with each other in longer sentences (“it be a lot of people talkin’ ‘bout ‘why she always be hanging out with Malik nem?’”). Combine that with a completely different accent, even (especially?) in the same city, and you have a recipe for total miscomprehension.

The interesting thing for me, though, is that from both personal anecdotal experience and some limited research, it appears that people who don’t speak AAE, especially white folks, generally assume (1) black folks are speaking “broken” English, and (2) that they understand it even when they don’t. So people will hear I been told you that and assume it means “I have been telling you that” and that the speaker just…said that wrong. Both sentence structures exist in AAE, and they mean different things. But only one exists in “classroom” English.

Some good readings:

There’s not a lot of material aimed at regular people instead of linguists, but I highly recommend a few books:

Spoken Soul (Rickford and Rickford)
African American English: A Linguistic Introduction (Lisa Green)
Language and the Inner City (William Labov — this one is from 1972, at the beginning of AAE being taken seriously as an object of study).
African American English: Structure, History, and Use (ed. Salikoko Mufwene)
The Oxford Handbook of African American Language (ed. Sonja Lanehart. This one is massive and new, but a lot of it is very technical).

-----

©Taylor Jones 2018

New Working Paper on Zulu published

March 08, 2018 by Taylor Jones

I recently gave a talk on Zulu morphosyntax in which I (hopefully politely and respectfully) challenged some of the mainstream approaches to Zulu syntax. The working paper is now out, in the Proceedings of the Linguistic Society of America, available here (pdf download under "full text").

It's not a fun read for a layperson, but the general gist is that (1) a lot of previous syntax work doesn't pay enough attention to the phonology, (2) the justifications for arguing that the noun augment is really a determiner are a little shaky, and (3) if we just treat the 'linking vowel' as a determiner, everything is simpler. This has the unexpected outcome of also suggesting that Zulu has construct state, something known (and controversial) in Semitic languages, but not known to exist in Bantu languages. To paraphrase a colleague at Penn, I've reduced a seemingly unique thorny problem to an already known thorny problem, which is about as good as you can hope for in syntax.

-----

©Taylor Jones 2018

What would Wakanda sound like?

February 15, 2018 by Taylor Jones

Today, Marvel's Black Panther is released. The Black Panther, aka T'Challa (played by Chadwick Boseman), is the king of Wakanda, a fictional country in Africa (neighbored by other fictional countries like Azania and Narobia, but not Nambia). While I'm extremely excited for the movie (NO SPOILERS PLEASE), I don't have high hopes for a surprise fictional language in the movie, given the pre-film hype about the inspiration for design elements, costume, and even T'Challa's accent. In previous films, T'Challa's father was played by a Xhosa-speaking actor, and it now seems that Xhosa being spoken in Wakanda is Marvel Cinematic Universe head-canon.

Geographic improbability aside, I don't have a problem with this, as Chadwick Boseman does a great Xhosa accent --- far better than, say, Morgan Freeman in Invictus. But, given that Wakanda is supposedly 5,000km away from South Africa (where the non-Wakandan Xhosa people are), what would the languages of Wakanda sound like? This is just a short blog post to (shallowly) explore that question with some links for the interested.

Location, Location, Location

Wakanda is situated somewhere in East Africa, by either Lake Victoria, or Lake Turkana. That means it's somewhere around Uganda, Kenya, Rwanda, Ethiopia, and South Sudan. What's great about this is that it's an area where a lot of languages from different language families are spoken. So the five major ethnic groups in Wakanda could all potentially have their own very different languages.

What about the comics?

The character and country were created by Stan Lee and Jack Kirby in 1966. Both white guys, neither linguists. So there are a lot of elements of the Black Panther mythos that have names that sound, well, like what a white guy would make up to sound exotic and African (or look it on the page). That said, certain things are just part of the canon. So the kings have an (evidently) ejective /t'/ as the first part of their names. The all-female fighting force, the Dora Milaje, are called what they're called. Anyone contracted to construct languages for the MCU will have to work with the existing material, much like how Marc Okrand developed Klingon by building around what was already uttered on-screen in Star Trek. And that will have an effect on the backstory and character development. To my knowledge, Ta-Nehisi Coates and other recent writers have not done a deep dive into the linguistic side of Wakanda, but we can't really expect Ta-Nehisi to solve everything for us.

What's spoken in that area?

As I mentioned above, that particular (vague) part of East Africa has representation from a few of the major families: Afro-Asiatic, Niger-Congo A, Niger-Congo B ("Bantu" languages), and Nilo-Saharan languages.

In Kenya, there is, of course, Swahili --- a Bantu language spoken by 50 to 100 million people and a lingua franca for the region. Swahili's huge number of speakers means you can hear it on internet radio if you want. It also means that it has lost lexical tone (when the pitch of a word or syllable changes the meaning), and because it's used for trade by so many people who speak so many other languages as their native language, it is relatively regular, meaning there's not a lot of unpredictable grammatical stuff.

But there's also a lot else spoken there. Kenya alone is home to 68 languages, the most prominent of which are Kikuyu, with 8 million speakers, and Dholuo, or Luo, with 4 to 5 million speakers.

The latter, Dholuo, is not a Bantu language, but a Nilo-Saharan language. What's the difference? The main difference is that all the Bantu languages group nouns into types (think gender in European languages, except there's 10-17 of them). Every noun has a prefix for its noun class, and the prefixes generally come in pairs (singular vs plural). So in Zulu (and Swahili!) the base form for the noun 'person' is ntu. But this doesn't just show up on its own. Rather, it has one of these noun class prefixes, as in:

umuntu 'person'
abantu 'people' (hence the name for the languages...they all call people some form of "bantu")
ubuntu 'humanity, humanness' (whence also the operating system).

So you can get phrases like umuntu ngumuntu ngabantu: "a person is a person through other people".
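To make the prefix-plus-root idea concrete, here is a minimal sketch that rebuilds the three forms above from the root ntu. The dictionary labels and the function name are made up for illustration, and the sketch ignores everything else about real Zulu noun classes (there are many more of them, and sounds can change where prefix and root meet).

```python
# Illustrative only: three noun class prefixes attached to the root "ntu",
# using just the forms quoted in the post. The labels are hypothetical names,
# not standard noun class numbers.
NOUN_CLASS_PREFIXES = {
    "singular person": "umu",
    "plural people": "aba",
    "abstract quality": "ubu",
}


def build_noun(prefix_label: str, root: str) -> str:
    """Attach the noun class prefix named by prefix_label to a bare root."""
    return NOUN_CLASS_PREFIXES[prefix_label] + root


for label in NOUN_CLASS_PREFIXES:
    print(label, "->", build_noun(label, "ntu"))
```

The point is only that the root never surfaces bare: the prefix you pick decides whether you get 'person', 'people', or 'humanity'.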

Bantu languages also generally have a LOT of sounds, but simple syllable types, almost always CV --- Consonant Vowel. You'll never see a word like English strengths. This is obscured by the writing a bit, so for instance, <ng> in Zulu is one sound, not two (the sound of <ng> in sing). Swahili also has syllabic nasals, so for instance, the <m> in mzungu 'white person' is its own syllable: m-zu-ngu.

Back to Luo: Luo has vowel harmony, meaning all the vowels in a word have to share the same feature. What's the separating factor? How advanced your tongue root is. So words with the vowels in (an American pronunciation of) bean, bait, bot, boat, and boot, are one class, and words with the vowels in bin, bet, bat, bought, and foot are in another. A single word will not have vowels from both groups, only one.
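As a rough sketch of what that constraint amounts to, the check below treats a word as a list of vowels and asks whether they all come from one class. The vowels are labelled with the English keywords above rather than phonetic symbols, and the set names and function name are assumptions made for illustration, not a description of Luo phonology.

```python
# Vowel classes labelled by the English keywords used in the post, not by IPA.
# The names CLASS_ONE and CLASS_TWO are placeholders.
CLASS_ONE = {"bean", "bait", "bot", "boat", "boot"}
CLASS_TWO = {"bin", "bet", "bat", "bought", "foot"}


def obeys_harmony(vowels: list[str]) -> bool:
    """Return True if every vowel in the word comes from a single class."""
    used = set(vowels)
    unknown = used - CLASS_ONE - CLASS_TWO
    if unknown:
        raise ValueError(f"unlabelled vowels: {sorted(unknown)}")
    return used <= CLASS_ONE or used <= CLASS_TWO


print(obeys_harmony(["bean", "boot"]))  # both vowels from the first class
print(obeys_harmony(["bean", "bin"]))   # one vowel from each class
```

A hypothetical word with the bean and boot vowels passes; one mixing the bean and bin vowels does not.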

Even cooler, Luo grammatically distinguishes between alienable and inalienable possession, so for instance, the word for a dog's bone has different forms depending on whether you mean the bone is part of the dog's skeleton, or a cow bone it's chewing on. If it can be taken away, it's got a suffix marking that fact.

Wakanda is also close to Ethiopia and South Sudan, where Afro-Asiatic languages are spoken. The most well-known subset of these are the Semitic languages, which include Arabic and Hebrew, but also languages Americans are often less familiar with, like Amharic, spoken in Ethiopia.

Amharic, like other Semitic languages, has what's called non-concatenative morphology, meaning that words aren't always built by adding prefixes, suffixes, or infixes, but are instead built with a system of (unpronounceable) roots that combine with vowels in between. The standard example linguists use is from Arabic (also spoken in that region), where k-t-b is always in things related to books and writing, but the vowels make it mean different things: kitaab 'book', kataba 'he wrote', kutib 'was written', etc. Amharic, like Swahili, has a massive number of speakers: roughly 22 million. It also has an objectively cool writing system.
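One way to picture root-and-pattern morphology is as slot filling: a pattern supplies the vowels and leaves gaps for the root consonants. The sketch below does this for the three Arabic forms just quoted. The pattern notation (a capital C for each consonant slot) and the function name are conventions chosen for this illustration, and the sketch covers only these three forms.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot in pattern with the next consonant of root."""
    consonants = iter(root.split("-"))
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# Patterns written so that they reproduce the forms given in the post.
PATTERNS = {"CiCaaC": "book", "CaCaCa": "he wrote", "CuCiC": "was written"}

for pattern, gloss in PATTERNS.items():
    print(apply_pattern("k-t-b", pattern), "=", gloss)
```

Nothing is being prefixed or suffixed here: the consonants stay in order and the vowels threaded between them carry the grammatical difference.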

Semitic languages like Amharic and Ge'ez are not the only Afro-Asiatic languages, though. To the south of Lake Victoria (so, somewhere sort of near Wakanda?) Iraqw, a Cushitic language, is spoken by approximately 460,000 people (because it's spoken by a much smaller number of people, the best video I could find was about porcine cysticercosis --- tapeworm in pigs).

And of course, we've established that Xhosa is MCU head canon (I really want to know the back story of how they first arrived in Wakanda, reversing the Bantu Migration, and how they rose to power!), which means that one could expect to hear clicks in Wakanda, too.

Wakanda Forever!

Given pre-release ticket sales alone, it seems like Hollywood has been sleeping on Black Panther's type of pan-African magic just the way the rest of the world has been sleeping on Wakanda's advanced technological civilization. If we're lucky, BP is going to be a smash hit with future films, TV series, spinoffs...and maybe we'll get to hear the sounds of Wakanda just as we hear the sounds of Essos and Valyria, Middle Earth, and Qo'noS.

Bill Maher, the N-word, and that pesky R

June 03, 2017 by Taylor Jones

[Trigger warning: n-words]

Bill Maher is in the news right now for dropping the n-bomb on his show in a context that many, many people found offensive. Predictably, people are coming to his defense with two arguments: (1) he was referring to himself, and (2) he "didn't say the /r/."

As a linguist, and as one of the handful of us who has given serious thought to the n-word(s), (shout out to Christopher Hall, to Arthur Spears, and to Geneva Smitherman) I want to weigh in with a (socio)linguistic perspective. My argument is:

It was not ok for him to say either, and,
White folks (in general) should not say either if they don't want to offend, because
It is an artificial distinction for most white people, if they are borrowing from a dialect they do not speak...and the vast majority of white people do not speak (or understand) African American English, natively or otherwise. And also,
In most white people's native dialect, the only n-word is a slur.

Elsewhere, Christopher Hall and I have written about the grammatical and social functions of the n-words in some varieties of AAE. We argued that there are multiple words that all include the "n-word" that fulfill various grammatical and pragmatic functions: from first person pronouns to social distance markers, to politeness (yes, politeness) forms. If you are not a native speaker of AAE, it is easy to misunderstand these uses because they are what Arthur Spears coined the term "camouflage constructions" to describe. That is, they look like they might mean something else, and so people assume they understand when they don't. Recent pilot work on cross-dialect comprehension that I worked on with a team at U Penn and NYU confirms that in general, white folks don't understand the range of uses of the n-words.

More importantly, these are uses that occur in African American English, which is a dialect that has its own accent (really, range of accents, but we'll set that aside for now). Crucially, most forms of AAE are what linguists call non-rhotic, meaning /r/s after vowels are often not pronounced. Many white dialect varieties are not non-rhotic, including Bill Maher's normal speech. So Maher will make the argument that nigga and nigger are different words, and that he said the "acceptable" one.

HOWEVER, Maher, I would argue, only has nigga in his vocabulary as a taboo deformation of the word nigger. It's the same as claiming he didn't call someone bitch, he called them betch, or bish. The point is to say "I technically didn't say the word" while still saying the word.

Here's the crux of my argument: If you don't speak AAE, whether you borrow AAE sounds or not to say nigger doesn't change what you're saying. For people to be comfortable (or less uncomfortable) with Maher's use of nigga, he'd have to (1) use it in the appropriate social context, which this was not, and (2) back it up with literally any other features of AAE... and this would still probably not make it ok. As is, he was just "being edgy" by saying a taboo word he knew would offend.

That is, we white folks don't get to say "I was using that word like you people do!" without actually being able to use any other words like AAE speakers. If the accent is right, if the word choice is right, if the grammar is right (yes, you can butcher AAE grammar --- it is as systematic and rule governed as any other language variety), and if the cultural context is right you can maybe get away with speaking AAE as a white person. Notice I didn't say "saying the n-word". That's still pretty much off the table. Even if you understand the grammar, social function, and pragmatics of use.

Here are some tips and general rules of thumb around the n-words if you don't want to offend, and you're white in America:

When you can say "nigger" without offending:

maybe in citation, either directly quoting old racist stuff, or discussing the word itself, best if at a linguistics conference or conference on race, and even then you might encounter pushback.
never in casual conversation.

So basically, you can't.

When you can say "nigga" without offending:

To a POC who has specifically said to you "yo, we cool, you can call me nigga. You get a pass." To that person ONLY. Probably not within earshot of anyone else. I've never heard of this situation occurring, but who knows. Also, even if you find yourself in that situation, if you actually do it, I'm not saying it's gonna go great, or that I endorse that path.
discussing the word nigga in citation form at a linguistics conference. And even then, not everyone will agree.
Never in casual speech.

So theoretically it's possible, but maybe just don't.

The distinction between r-full and r-less forms has a long history, and linguists are not remotely settled as to the history of the word (for instance, Hiram Smith argues the semantically neutral r-less form goes back 200 years or more). While it's interesting, it's completely orthogonal to the question of whether it's appropriate for white people to say it. Because it has been a slur in white English from its beginnings to literally right now, in both r-full and r-less varieties of white English, people like Bill Maher don't get to decide that it no longer has all that historical baggage.

And even if you deeply understand its use in AAE speaking communities, and participate in those communities, if you actually care about the people in those communities, you still won't say it. Even when it's linguistically appropriate. Because our language use is culturally and socially situated.

-----

©Taylor Jones 2017

Linguists have been discussing "Shit Gibbon." I argue it's not entirely about gibbons.

February 09, 2017 by Taylor Jones

BACKGROUND: LINGUISTS CARE ABOUT SHITGIBBONS TOO

Earlier this week a Pennsylvania state senator called Donald Trump a "fascist, loofa-faced shit-gibbon."

There was an excellent post on Strong Language, a blog about swearing, discussing what makes "shit gibbon" so arresting, so fantastic, so novel, and yet... so right (for English swearing. Whether you believe "shit gibbon" is "right" as a characterization of Donald Trump is a personal assessment each person must make for themselves).

The post, The Rise of the ShitGibbon, can be found here. I highly recommend reading it.

Most of the post was dedicated to tracing the origins and rise of "shitgibbon." The end of the post, however, catalogues insults in the same vein:

wankpuffin, cockwomble, fucktrumpet, dickbiscuit, twatwaffle, turdweasel, bunglecunt, shitehawk

And some variants: cuntpuffin, spunkpuffin, shitpuffin; fuckwomble, twatwomble; jizztrumpet, spunktrumpet; shitbiscuit, arsebiscuits, douchebiscuit; douchewaffle, cockwaffle, fartwaffle, cuntwaffle, shitwaffle (lots of –waffles); crapweasel, fuckweasel, pissweasel, doucheweasel.

I've actually been thinking about insults like this a surprising amount. Ben Zimmer points out about "Shitgibbon" that "...Metrically speaking, these words are compounds consisting of one element with a single stressed syllable and a second disyllabic element with a trochaic pattern, i.e., stressed-unstressed. As a metrical foot in poetry, the whole stressed-stressed-unstressed pattern is known as antibacchius."

I argue that this is correct, but that (1) there's a little bit more to say about it, and (2) there are exceptions.

HOW TO MAKE A SHITGIBBON IN TWO EASY STEPS

First: I argue that the rule for making a novel insult of this type is a single syllable expletive (e.g., dick, cock, douche, cunt, slut, fart, spunk, splooge, piss, jizz, vag, fuck, etc.) plus a trochee. A trochee, as a reminder, is a word that's two syllables with stress on the first. Examples are puffin, womble, trumpet, biscuit, waffle, weasel, and of course, gibbon. Tons of words in English are trochees (have a relevant XKCD! In fact, have two! Wait, no, three! No one expects the Spanish Inquisition!). Because so many words are trochees, you'll have to pick wisely --- something like ninja might not be as humorously insulting as waffle.

That said, in principle, monosyllable expletive + trochee seems to give really good results. Behold:

fart basket, shit whistle, turd helmet, cock bucket, douche blanket, vag weasel, (I'm gonna be so much fun when I get old and have dementia. Good luck grandkids!), shit mandrill, piss gopher, jizz weevil, etc. etc. I can do this all day.
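For the computationally inclined, the metrical half of the formula is easy to sketch. The check below uses a tiny hand-written table of stress patterns; the table, its "1"/"0" notation, and the function name are assumptions made for illustration, not data from a pronunciation dictionary. It tests only the stress shape and says nothing about whether the first word is actually an expletive, or whether the result is funny.

```python
# Hand-written stress patterns: "1" = stressed syllable, "0" = unstressed.
# These entries are illustrative assumptions, not dictionary data.
STRESS = {
    "fart": "1", "turd": "1", "douche": "1",
    "basket": "10", "helmet": "10", "gibbon": "10", "weasel": "10",
    "baboon": "01", "canoe": "01",
}


def fits_formula(first: str, second: str) -> bool:
    """True if first is a stressed monosyllable and second is a trochee."""
    return STRESS[first] == "1" and STRESS[second] == "10"


print(fits_formula("fart", "basket"))  # monosyllable + trochee
print(fits_formula("turd", "baboon"))  # baboon is stressed on the second syllable
```

With a real pronunciation dictionary in place of the toy table, the same check could filter any word list for candidates.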

So, it's not the fact of being a gibbon per se. Various other monkeys would work: vervet, mandrill, etc. However, crucially, baboons, macaques, black howlers, and pygmy marmosets are out.

Moreover, it's not completely unlimited. Some words fit but don't make much sense as an insult: cock bookshelf, fart saucepan (which I quite like, actually), dick pension, belch welder.

Others sound like the kind of thing a child would say: fart person! poop human! turd foreman!

Yet others are too Shakespearean: fart monger! piss weasel!

Clearly some words (waffle, weasel, gibbon, pimple, bucket) are better than others (bookshelf, doctor, ninja, icebox), and some just depend on delivery (e.g., ironic twat hero, turd ruler, spunk monarch, dick duchess).

VOWELS MATTER

For a while, I've been discussing vowels in insults with fellow linguist Lauren Spradlin. Note that when we talk about vowels, we mean sounds, not letters. Don't worry about the spelling, try saying the below aloud. Spradlin has brought my attention to the importance of repeating vowels increasing the viability of a new insult of this form: crap rabbit, jizz biscuit, shit piston, spunk puffin, cock waffle, etc.

I would argue that having the right vowels actually gives you some leeway, so you can get away with following the first word with --- gasp! --- a non-trochee! Be it an iamb (remember iambic pentameter?) as in douche-canoe, splooge caboose, or the delightfully British bunglecunt (h/t Jeff Lidz), or even more syllables: Kobey Schwayder's charming mofo-bonobo.

As you can see, this is a hot topic in the hallowed halls of the ivory tower. If the above simple formulae have motivated even one person to go out and exercise their own creativity to make a novel contribution to the English language, then I've done my job here as a linguist. Different people get into linguistics for different reasons, but this, this is what I live for. Get out there and make a difference!

-----

©Taylor Jones 2017

Gender, Gender, Gender

December 01, 2016 by Taylor Jones

A good Question:

I'm still getting a surprising number of comments and emails about the short post I wrote on Jordan Peterson's slip up with grammatical gender. While most are incoherent and silly (and have a seasonally and statistically unlikely preponderance of the use of the word "snowflake"), there is one in particular that seemed earnest, and that I think warrants a full response. Brele asked:

Taylor, can u help me understand how there are more than two genders? I ask just having watched J. Peterson on a YouTube show and hearing his thoughts on the matter.

I think it's important here to distinguish three related phenomena, and where they do and don't overlap: biological sex, gender (and gender expression), and grammatical gender. The conflation of biological sex with gender, and the subsequent conflation of grammatical gender with both, is where most of the confusion and anger comes from, I think.

Biological Sex:

Biological sex is what it sounds like: the biological properties we associate with sexual reproduction in a species. We assume that there are two sexes in humans: male, and female. This is not strictly true, as biological sex is determined by a constellation of factors, among them:

chromosomes: while we're familiar with XX as female and XY as male, there are people with XXY, or other unusual (but extant!) combinations. Roughly 1:1666 births have atypical chromosomal combinations. That's roughly 210,000 Americans.
gonads: most people have reproductive organs that fall broadly into one of the two expected categories, but again, not all people do. Roughly 1:1500 births have atypical gonads (which means about 230,000 Americans).
hormones: some people have atypical hormonal patterns. For instance, the Sikh woman who has polycystic ovarian syndrome, and therefore has a full beard.

The vast majority of people will have phenotypes that 'line up', but a sizeable minority don't. So what we think of as physically binary -- male/female -- is, in reality, a bit more complicated than that, but generally true. Not always true.

Gender and Sexuality:

Gender, in the social sciences, is distinct from biological sex. It is also a complicated constellation of factors, including:

who you are attracted to.
how you physically present yourself, and how you behave, according to (or going against) culturally defined patterns of behavior. For instance "boys wear blue, girls wear pink" is a completely arbitrary, culturally defined dichotomy with no basis in biology, and which is absolutely not universal.
How masculine or feminine (or neither, or both or whatever) you personally feel. That is, maybe I feel really girly (whatever that means, just go with it), but I don't present myself in accordance with that because it's easier to just follow my culture's rules about What Men Do than it is to deal with people's reactions if I start wearing dresses.
a bunch of other stuff I'm probably leaving out.

The key here is that gender is about how you feel, how you behave, and who you are attracted to, and is not about your chromosomes, gonads, and hormones.

For a more science-y take: there are multiple parameters, which may be either binary or have multiple levels, along which people can vary continuously. This is a high-dimensional space that we generally try to collapse to a single-dimensional sub-space to then classify with a binary score. Increasingly, people who are hard to classify on that one dimension (studs, bears, beardos, genderqueer, agender, genderfluid [do we need a time dimension?], flannel-heads, balloon-poppers -- yes, I made some of these up, but not the balloon one) are saying you can't collapse things to a single binary parameter, but that you need a higher-dimensional space to accurately categorize people without losing important information.

Grammatical Gender:

This is how languages group nouns. The name is an unfortunate misnomer, given the conflation of the above two things -- it's etymologically related to genre and that's probably a much better way to think of it. Some languages have two genders, which they call masculine and feminine because noun classes in those languages sort of line up with how actual masculine people and feminine people are classified grammatically. That said, in one such language, French, there's no clear reason a table should be semantically feminine. The genre of the noun just happens to be the same as for women, but in this case it's largely a phonological thing, not a semantic (i.e., meaning) thing. Moreover, some words come in gendered pairs: le tour 'the tour' (as in, the tour of france), versus la tour 'the tower' (as in, the Eiffel Tower).

In other languages, there are two genders, but they don't line up with sex: Dutch has two genders, but they're common and neuter. Both man 'man' and vrouw 'woman' are common, and meisje 'girl' is neuter (along with all other diminutives, so mannetje 'little man' is also neuter).

In other languages, there are more than two genders. German and Russian have masculine, feminine, and neuter.

In yet other languages, there are many more genders: Zulu has 14, and none of them have anything to do with sex. Some are for humans, some are for long, stick-y things (although there's arguments about this), and one is for abstract concepts: umu-ntu is a person, aba-ntu is 'people' (whence "Bantu"), and ubu-ntu is the quality of being human (personhood, or humanity).

Finally, many languages mark all nouns and noun-y things with gender, but many don't. English, for instance, only explicitly marks gender on some pronouns (he and she, but not you), and a handful of nouns for kinds of people ("actress").

The Takeaway:

"Gender" is often interchangeably used to mean any of three things: biological sex, sexual(ity) gender, and grammatical gender. Moreover, each of these things is complex, and non-binary (although biological sex comes close to being binary in everyday life for most people).

English obligatorily marks gender on third person singular pronouns (and that's about it). This gender marking generally overlaps with biological sex and 'mainstream' gender expressions related to cultural assumptions about biological sex. People who do not feel like they are necessarily well described by he or she have been asking to be referred to with a different term -- many ask that we use they, which has the benefits of (1) already existing in English, and (2) being gender neutral already. Others ask for ze or something else.

The point is, marking gender on third person singular pronouns (only) is a weird quirk of the grammatical structure of English, and not representative of objective biological reality, and certainly not reflective of culture. My comments on David Peterson's remarks were solely to laugh at the irony of someone claiming they refused to use gender-neutral pronouns while using gender neutral they to express that contrarianism.

Hopefully, the above answered the question of how there could be 'more than two genders.'

-----

©Taylor Jones 2016

Have a question or comment? Share your thoughts below!

More on Pronouns: Are Gender Creative People Really All That Creative?

November 02, 2016 by Taylor Jones

Controversy is still swirling around trans, non-binary, and other "gender creative" people's occasional insistence on being referred to with pronouns of their choice. I have been thinking about this lately, and while some people are very upset that others are asking for specific pronouns the speaker may disagree with ("But you're a he, not a she!") I've come to the conclusion that the gender creatives are really not being all that creative at all.

As far as I can tell, the vast majority of people who are requesting "special" pronouns are doing one of two things:

Asking to be referred to with the pronouns appropriate to the gender they identify as (whether it's immediately apparent to others or not). That is, hypothetically, someone born female asking to be referred to in the third person with he, him, his, himself. No other changes to the pronominal system.
Asking to be referred to with a gender neutral third person pronoun, usually either they (which has a long history of use for gender neutral, but nonspecific, third person), or some variation on Xe, Ze, or something else pronounced with a voiced coronal sibilant (/z/). No other changes to the pronominal system.

The thing is, the languages of the world do a lot of really interesting things with pronouns, and these so-called gender creatives are clearly not being creative enough. It's almost as though they're not playing with language at all, but are actually trying to conform to the rules of English while insisting others respect their gender identity.

Here are some things they could be doing, and places where I think they're really dropping the ball:

Gendering pronouns other than the third person. Arabic has gendered second person singular and plural pronouns. Instead of just "you" referring to anyone you're talking with, Modern Standard Arabic has anta, anti, antum, antunna, for "you (male)", "you (female)", "you men," and "you women" respectively.
Proximal and Distal third person pronouns. Algonkian languages tend to differentiate between, say, 'he (who is nearby)' and 'he (who is far from us),' which can then send social signals -- if I talk about you in front of you, but use the him (distal) form, I'm pretty rudely implying that this is an A/B conversation (and you can C your way out).
More case marking. English really only has nominative/oblique/possessive pronouns. Other languages do a lot more. I'd love to be able to say that I identify as male, but my pronouns are he/him/his/hig/hif/hird for nominative, oblique, possessive, ablative (motion away from me), instrumental (using me to do something or doing something accompanied by me), and locative (doing something where I am). Russian and Latin have us beat by like 3 cases, and Hungarian is blowing us out of the water.
Marking tense on the pronoun. Wolof, for instance, marks differences in tense not on the verb, but on the pronoun. This allows meeting the gender...uncreative?... halfway: "You can use the pronoun for males when referring to me only if it's got past tense morphology on it."

So yes, at your request I will call you ze/zim/zis, but know that I'm silently judging you for your cliché, unimaginative pronouns, and wishing you'd give me a real challenge. It's almost like this isn't about language at all, but just about asking for me to respect your life choices and identity.

-----

©Taylor Jones 2016


Arguments About Arguments

February 04, 2016 by Taylor Jones

Lately, I've been thinking a lot about what linguists call valency. This is in part because I was recently discussing the weird privileging of some grammatical structures over others by self-appointed "grammar nerds", and in part because it's been very relevant in studying Zulu.

Valence or valency refers to the number of arguments a verb has (or "takes"). What's an argument? In this jargon, it's basically a noun or noun phrase. The idea is that different verbs require -- or allow -- different numbers of nouns. The ones that are required are sometimes referred to as core arguments. For instance:

I strolled.

The above is referred to as intransitive, and it allows only one argument. It makes no sense to say, for instance:

*I strolled you.

(the * means the sentence is ungrammatical, meaning that it doesn't make structural sense.)

Similarly, there are verbs that take two arguments (transitive verbs) and verbs that take three (ditransitive verbs). Examples are:

he hit me.
I gave him a book.

Admittedly, this is not all that interesting. What is interesting, however, is valence changing operations.

Different languages have different tools for taking a verb of one kind, and changing the number or structure of core arguments. What does this mean? It means making a peripheral argument into a core argument, like this (kind of cheat-y example):

I gave a book to him
I gave him a book

Or totally changing which arguments have which syntactic positions:

I ate a whole thing of ice cream.
A whole thing of ice cream was eaten.

The above example, many of you will recognize as the passive voice. The passive voice has gotten a bad rap. The passive voice is freaking cool! It lets you keep the same meaning, but shuffle around what structural role all of the noun phrases are playing. This, in turn, allows you to highlight a different part of the sentence, and shift focus away from the agent (or even refuse to name the agent). What's more, it's just one of many valence changing operations that languages make use of.

Image from the blog Heading for English.

Other languages have more. And they're awesome. David Peterson has a great discussion of this in his book The Art of Language Invention, but his examples in that book are often created languages. Natural languages, though, were the inspiration. Zulu, for instance, has passives like English, but also has causatives and benefactives. And you can combine them. For instance:

fona = to telephone someone
ngi-ni-fona = I call you (lit. I you call)

You can make it causative by adding -isa, which then makes the verb mean to cause/make/let/allow/help someone telephone someone. Notice anything? That's right, you've added an argument.

fona -> fonisa
ngi-ni-fona = I call you
ngi-ni-fon-isa umama = I help you call mama

Benefactives are similar, but they make it so you do the verb for/on behalf of/instead of someone else. In Zulu, this is done by adding -ela to the verb stem:

fona -> fonela
ngi-ni-fona = I call you
ngi-ni-fon-ela umama = I call you for mama.

Other languages have other kinds of things. For instance, some languages have malefactives. That is, things that are done not for or on behalf of someone, but despite someone or intending them harm, ill-will, or general bad...ness. Salish, a Native American language spoken in the Pacific Northwest, makes use of malefactives. English has a construction which does the same thing, but doesn't encode it on the actual verb:

she hung up on me.
he slammed the door on me.
she walked out on me.
My car broke down on me.

Imagine something like "she hung-on-up me." Notice, also, that benefactives do a weird thing to the arguments. In English, you can say:

I cook rice.

And that's transitive. If you add a bit about who you're doing it for, you get:

I cook rice for you.

or:

I cook for you.

In Zulu, though, there's very specific sentence structure. I'll put the words in English, but add the Zulu morphemes, to make it as clear as possible:

I cook rice
I cook-ela you rice
I cook-ela you.
* I cook-ela rice you.

Remember the * means "ungrammatical." This is usually discussed in terms of promotions (yay!) and obligations (ugh). That is, the benefactive in Zulu promotes the argument that is benefiting from the action, and makes it obligatory. It also must immediately follow the verb. The thing that was the object of the sentence (rice, in this case) is then an optional argument. You don't even have to say it. Or think it. Just forget about the rice.

Therefore, as valency changing operations add arguments, so too they taketh away. This is what the passive voice is doing. Whereas benefactives promote an indirect object to direct object, and then make the original direct object optional, passives promote the direct object to subject:

I cook rice
rice is cooked (by me!)

...or, if you prefer:

The whole thing of ice cream was eaten. (I refuse to say by whom.)

This is why I don't understand "grammar snobbery." Your language has a syntactic tool that does a totally cool thing, and you're just gonna decide that it's somehow bad? It's a feature, not a bug! If you think calling a natural function of your grammar that's linguistically universal bad is a way of indicating how much you know about grammar, you've got weird priorities. Appreciating grammar is not a competition to see how little of your language you can use or appreciate.

Not only are valence changing operations not bad, and totally super cool, but get this: you can combine valence changing operations, so you can have a passivized benefactive in Zulu, or a passivized causative. You can have things like:

A cake is being baked for mama

...but they're encoded entirely on the verb:

(ikhekhe) li-zo-bhak-el-wa umama
cake it.FUT.bake.BEN.PASS mama

Even better, you can have a causative, a benefactive, and a passive marker on the verb, so you get something like:

bhala = "write" or "enroll"
ba-bhal-is-el-wa-ni
they.enroll.CAUS.BEN.PASS.why = "why are they being made to enroll?"

Somewhere, there's a language that can express my desire that the passive voice be made to be used by self-appointed grammar snobs, malevolently, by me, and that language can encode most of that on the verb. If not Salish, then maybe the Niger-Congo language Koalib. And that's a beautiful thing.

-----


SoCal is Getting Fleeked Out

February 23, 2015 by Taylor Jones

For anyone who's been living under a rock for the past few months, there is a term, "on fleek," that has been around since at least 2003, but which caught like wildfire on social media after June 21, 2014, when Vine user Peaches Monroe made a video declaring her eyebrows "on fleek."

Since then, the apparently non-compositional phrase on fleek has been wildly popular, and has generated the usual discussion: both declarations that it is literally the worst and "should die," and heated debates about what exactly on fleek even means. People seem to be divided on the question of whether it's synonymous with "on point." There is also a great deal of disagreement as to what can and cannot be on fleek, with "eyebrows" now the prototype against which things are measured.

After a conversation with Mia Matthias, a linguistics student at NYU, I decided to look at other syntactic constructions, thinking it possible -- in principle -- to generalize from on fleek to other constructions. Lo and behold, there is a minority of negative-minded people who describe others, snarkily, as "off fleek," (haters). More interestingly, Southern California is getting fleeked out.

Geocoded tweets using variations of fleek. Toronto, you're not fooling anyone.

This is interesting because it suggests that "on fleek" is being re-interpreted, and that it is not necessarily rigidly fixed for all speakers as an idiom. Moreover, it looks like LA is leading the first move away from strictly adhering to the idiom "on fleek," by extending the use of "fleek" to the stereotypically Californian construction of [x]-ed out.

Geocoded tweets using "fleek" in California. Las Vegas, you're not fooling anyone.

I'm looking forward to watching this develop, just as we can watch bae developing (one can now be baeless, for instance). I'm also looking forward to the day one can get a fleek over, or get one's fleek on.

-----


The Problem With Twitter Maps

December 25, 2014 by Taylor Jones

Twitter is trending

I'm a huge fan of dialect geography, and a huge fan of Twitter (@languagejones), especially as a means of gathering data about how people are using language. In fact, social media data has informed a significant part of my research, from the fact that "obvs" is legit, to syntactic variation in use of the n-words. In less than a month, I will be presenting a paper at the annual meeting of the American Dialect Society discussing what "Black Twitter" can tell us about regional variation in African American Vernacular English (AAVE). So yeah, I like me some Twitter. (Of course, I do do other things: I'm currently looking at phonetic and phonological variation in Mandarin and Farsi spoken corpora).

Image of North America, entirely in Tweets, courtesy of Twitter Visual Insights: https://blog.twitter.com/2013/the-geography-of-tweets

Moreover, I'm not alone in my love of Twitter. Recently, computer scientists claim to have found regional "super-dialects" on Twitter, and other researchers have made a splash with their maps of vocatives in the US:

More and more, people are using social media to investigate linguistics. However, there are a number of serious dangers inherent to spatial statistics, which are exacerbated by the use of social media data.

Spatial statistics is developing rapidly as a field, and there are a number of excellent resources on the subject I've been referring to as I dig deeper and deeper into the relationship between language and geography. Any of these books (I'm partial to Geographic Information Analysis) will tell you that people can, and do, fall prey to the ecological fallacy (assuming that some statistical relationship that obtains at one level, say, county level, holds at another level -- say, the individual). Or they ignore the Modifiable Areal Unit Problem -- which arises out of the fact that changing where you draw your boundaries can strongly affect how the data are distributed within those boundaries, even when the change is just in the size of the unit of measurement.

The statistical consideration that most fascinates me, and seems to be the most likely to be overlooked in dealing with exciting social media data, however, is the problem of sampling.

Spatial Statistics aren't the same as Regular Statistics.

In regular statistics, more often than not, you study a sample. You can almost never study an entire population of interest, but it's not generally a problem. Because of the Law of Large Numbers, the bigger the sample, the more likely you are to be able to confidently infer something about the population the sample came from (I'm using the day-to-day meanings of words like "confidence" and "infer"). However, in the crazy, upside down world of spatial statistics, sampling can bias your results.

In order to draw valid conclusions about some kinds of spatial processes, it is necessary to have access to the entire population in question. This is a huge problem: If you want to use Twitter, there are a number of ways of gathering data that do not meet this requirement, and therefore lead to invalid conclusions (to certain questions). For instance, most people use the Twitter API to query Twitter and save tweets. There are a few ways you can do this. In my work on AAVE, I used code in Python to interact with the Twitter API, and asked for tweets containing specific words -- the API returned tweets, in order, from the last week. I therefore downloaded and saved them consecutively. This means, barring questionable behavior from the Twitter API (which is not out of the question -- they are notoriously opaque about just how representative what you get actually is), I can claim to have a corpus that can be interpreted as a population, not a sample. In my case, it's very specific -- for instance: All geo-tagged tweets that use the word "sholl" during the last week of April, 2014. We should be extremely careful about what and how much we generalize from this.

Many other researchers use either the Twitter firehose or gardenhose. The former is a real-time stream of all tweets. Because such a thing is massive, and unmanageable, and requires special access and a super-computer, others use the gardenhose. However, the gardenhose is a(n ostensibly random) sample of 10% of the firehose. Depending on what precisely you want to study, this can be fine, or it can be a big problem.

Why is sampling such a problem?

Put simply, random noise starts to look like important clusters when you sample spatial data. To illustrate this, I have created some random data in R.

I first created 1,000 random x and 1,000 random y values, which I combined to make points with random longitudes (x values) and latitudes (y values). For fun, I made them all with values that would fit inside a box around the US (that is, x values from -65 to -118, and y values from 25 to... Canada!). I then made a matrix combining the two values, so I had 1,000 points randomly assigned within a box slightly larger than the US. That noise looked like this:
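The post does not show the R code, so here is a minimal sketch of the same setup in Python instead. The function name `random_points`, the seed, and the upper latitude bound of 50 are assumptions made purely for illustration; the text above only says the y values run from 25 up to Canada.

```python
import random

def random_points(n, lon_range=(-118.0, -65.0), lat_range=(25.0, 50.0), seed=None):
    """Return n (longitude, latitude) pairs drawn uniformly from a box.

    The upper latitude of 50.0 is an assumed stand-in for "Canada".
    """
    rng = random.Random(seed)
    return [(rng.uniform(*lon_range), rng.uniform(*lat_range)) for _ in range(n)]

# 1,000 points of pure noise; any "clusters" in them are accidental.
points = random_points(1000, seed=42)
print(len(points), points[:3])
```

Fixing the seed only makes the noise reproducible; it is still noise.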

"Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1." "Never tell me the odds!"

Before we even continue, it's important to note two things. First, the above is random noise. We know this because I totally made it up. Second, before even doing anything else, it's possible to find patterns in it:

A density contour plot of random noise. Sure looks like something interesting might be happening in the upper left.

Even with completely random noise, some patterns threaten to emerge. What we can do if we want to determine if a pattern like the above is actually random is to compare it to something we know is random. To get technical, it turns out random spatial processes behave a lot like Poisson distributions, so when we take Twitter data, we can determine how far it deviates from random noise by comparing it to a Poisson distribution using a Chi-squared test. For more details on this, I highly recommend the book I mentioned above. I've yet to see anyone do this explicitly (but it may be buried in mathematical appendices or footnotes I overlooked).
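One common way to set up that comparison is a quadrat count test: cut the region into equal-sized cells, count the points in each, and ask how far the counts stray from the equal counts a uniform random (Poisson) process would predict. The sketch below is in Python, is not necessarily the exact procedure in the book, and uses hypothetical function names; it also assumes every point lies inside the box you pass in.

```python
import random

def quadrat_counts(points, x_range, y_range, nx, ny):
    """Count points in an nx-by-ny grid of equal-area cells (quadrats)."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [0] * (nx * ny)
    for x, y in points:
        # min() keeps points that sit exactly on the upper edge in the last cell.
        col = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        row = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[row * nx + col] += 1
    return counts

def dispersion_chi2(counts):
    """Chi-squared statistic against equal expected counts, plus degrees of freedom."""
    mean = sum(counts) / len(counts)
    stat = sum((c - mean) ** 2 for c in counts) / mean
    return stat, len(counts) - 1

rng = random.Random(7)
noise = [(rng.random(), rng.random()) for _ in range(1000)]
stat, df = dispersion_chi2(quadrat_counts(noise, (0, 1), (0, 1), 5, 5))
print(f"chi-squared = {stat:.1f} on {df} degrees of freedom")
```

For random noise the statistic should land in the neighborhood of the degrees of freedom; a much larger value suggests real clustering. To get a p-value you would compare the statistic to a chi-squared distribution, for example with `scipy.stats.chi2.sf(stat, df)` if SciPy is available.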

This is what happens when we sample 100 points, randomly. That's 10%; the same as the Twitter gardenhose:

And this is what happens when we take a different 100 point random sample:

Another random 100 point sample from the same population.

The patterns are different. These two tell different stories about the same underlying data. Moreover, the patterns that emerge look significantly more pronounced.
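You can see the same effect without drawing a single map. The sketch below (Python, with hypothetical names, an arbitrary seed, and an assumed upper latitude of 50) draws two 10% samples from one random population and prints the share of points in each quadrant of the box. The quadrant split is a crude stand-in for a density plot, not the method used to make the figures here.

```python
import random
from collections import Counter

def quadrant_counts(points, x_mid, y_mid):
    """Count points in the four quadrants around (x_mid, y_mid)."""
    counts = Counter({"NW": 0, "NE": 0, "SW": 0, "SE": 0})
    for x, y in points:
        ns = "N" if y >= y_mid else "S"
        ew = "E" if x >= x_mid else "W"
        counts[ns + ew] += 1
    return counts

rng = random.Random(0)
population = [(rng.uniform(-118, -65), rng.uniform(25, 50)) for _ in range(1000)]

# Two independent 10% samples from the same population.
sample_a = rng.sample(population, 100)
sample_b = rng.sample(population, 100)

for name, pts in [("population", population), ("sample A", sample_a), ("sample B", sample_b)]:
    counts = quadrant_counts(pts, x_mid=-91.5, y_mid=37.5)
    shares = {k: round(v / len(pts), 2) for k, v in sorted(counts.items())}
    print(name, shares)
```

With only 100 points per sample, the quadrant shares will typically wobble by a few percentage points from sample to sample, which is exactly the kind of difference a heat map turns into a "hot spot."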

To give a clearer example, here is a random pattern of points actually overlaying the United States I made, after much wailing, gnashing of teeth, and googling of error codes in R. I didn't bother to choose a coordinate projection (relevant XKCD):

And here are four intensity heat maps made from four different random samples drawn from the population of random point data pictured above:

This is bad news. Each of the maps looks like it could tell a convincing story. But contrary to map 3, Fargo, North Dakota is not the random point capital of the world, it's just an artifact of sampling noise. Worse, this is all the result of a completely random sample, before we add any other factors that could potentially bias the data (applied to Twitter: first-order effects like uneven population distribution, uneven adoption of Twitter, biases in the way the Twitter API returns data, etc.; second-order effects like the possibility that people are persuaded to join Twitter by their friends, in person, etc.).

What to do?

The first thing we, as researchers, should all do is think long and hard about what questions we want to answer, and whether we can collect data that can answer those questions. For instance, questions about frequency of use on Twitter, without mention of geography, are totally answerable, and often yield interesting results. Questions about geographic extent, without discussing intensity, are also answerable -- although not necessarily exactly. Then, we need to be honest about how we collect and clean our data. We should also be honest about the limitations of our data. For instance, I would love to compare the use of nuffin and nuttin (for "nothing") by intensity, assigning a value to each county on the East Coast, and create a map like the "dude" map above -- however, since the two are technically separate data sets based on how I collected the data, such a map would be completely statistically invalid, no matter how cool it looked. Moreover, if I used the gardenhose to collect data, and just mapped all tokens of each word, it would not be statistically valid, because of the sampling problem. The only way that a map like the "dude" map that is going around is valid is if it is based on data from the firehose (which it looks like they did use, given that their data set is billions of tweets). Even then, we have to think long and hard about what the data generalizes to: Twitter users are the only people we can actually say anything about with any real degree of certainty from Twitter data alone. This is why my research on AAVE focuses primarily on the geographic extent of use, and why I avoid saying anything definitive about comparisons between terms or popularity of one over another.

Ultimately, as social media research becomes more and more common, we as researchers must be very careful about what we try to answer with our data, and what claims we can and cannot make. Moreover, the general public should be very wary of making any sweeping generalizations or drawing any solid conclusions from such maps. Depending on the research methodology, we may be looking at nothing more than pretty patterns in random noise.

-----


Big Data and Black Twitter

September 28, 2014 by Taylor Jones

This post is a story of how combining century-old linguistic methods with new sources of data can reveal unexpected insights. It's a small preview of my upcoming talk at the annual meeting of the American Dialect Society, where I will discuss my recent research using social media to map previously undescribed dialect regions in African American Vernacular English (AAVE). It's the intersection of historical linguistics, dialect geography, spatial statistics, and #swag.

Prelude: Maps are Cool

I recently took a class with Bill Labov on Dialect Geography: an under-appreciated subfield of linguistics that had a bit of a heyday in the late 1800s, and which is now starting to make a comeback, thanks in no small part to popular dialect surveys like this one from the New York Times.

In the class, we learned methods of mapping and interpreting spatial data to glean information about regional variation in language use, and to begin to understand language variation and change. We learned how maps like this were made:

l'Atlas linguistique de la France, published from 1902 to 1920 by J. Gilliéron and E. Edmont – added colors from the version in Lectures de l'Atlas linguistique de la France de Gilliéron et Edmont. Du temps dans l'espace, published in G. Brun-Trigaud, Y. Le Berre & J. Le Dû (2005)

...but we also learned how to map data using newer, more sophisticated computational methods. For instance, reading geographic data from a comma separated file and mapping the data in the R programming language. More importantly, we learned what the interaction between geographic features, historical migrations, and a 'snapshot' of linguistic data can tell us about our language and ourselves.

Now, in the late 1800s, there were basically two ways that you could collect data for linguistic atlases, informally known as the German Method and the French Method. The German Method was the method Georg Wenker used in 1876, when he sent out 50,000 surveys to German schoolmasters who dutifully sent back 45,000 completed surveys. The flaw in this method is that there is no guarantee of standardization as far as how the data is collected and interpreted. The French Method is what Jules Gilliéron used a decade later: send one trained linguist gallivanting around the countryside on a bicycle for four years, eating baguettes, drinking wine, and conducting sociolinguistic interviews with everyone he can as he moves from town to town. My kind of job! Both methods resulted in gorgeous, detailed, and informative atlases...decades after the data were collected. More recently, enterprising linguists (among them, Dr. Labov) conducted telephone surveys, resulting in the "gold standard" Atlas of North American English. The ANAE gives an enormous amount of granularity to the study of regional dialects in North America -- seriously, click the link and play around, it's awesome.

Big Data

What the ANAE achieves, it does with a mere 792 speakers, intelligently sampled by region. It is a feat of ingenuity and economy.

However, we now have some intriguing new tools at our disposal, thanks to the internet and social media platforms like Facebook and Twitter. To give you an idea: a search for the word "the," -- a pretty good proxy for English use -- returns 607 million tokens in the last month alone. All of it is literally published work. It is, in effect, an enormous corpus of written language. Given the right tools and know-how, anyone can search that published material.

The Speech Problem: Graffiti and The Writing on the (Facebook) Wall

The only hitch is this: writing is not speech. In fact, if you try to figure out how English speakers anywhere pronounce English based on the spelling conventions of academic written English, you're gonna have a bad time. A few sound shifts here, a few hundred years of weird convention there, and you've got a system that doesn't tell you much of anything useful.

Notice, though, I said the spelling conventions of academic English. Many people have a pet peeve they're more than willing to share (especially on reddit, it seems): they hate when others write should of in lieu of should have. This kind of mistake is any historical linguist's favorite thing ever. Why? Because it tells us something about pronunciation. People who write should of have reduced should have to should've and it is coming out in their writing -- should of and should've are totally indistinguishable in casual speech.

It's precisely this kind of error, along with the writings of hand-wringing pedants lamenting the decline of language (among other things), that allow us to reconstruct the pronunciation of Latin as it changed through time. (aside: ever wonder why it's "inconvenient" but not "inpolite"? A historical linguist can tell you why, and when it happened). In fact, we get an enormous amount of phonologically relevant information from things like graffiti dick jokes in places like Pompeii. Who says historical linguistics isn't fun?

Error isn't the whole picture though. It's one thing to say that people who struggle with spelling will fall back on sounding things out. It's quite another when the non-standard spelling is intentional. For instance, one task for computational linguists interested in Natural Language Processing (NLP) is to group various spellings into sets that computers can recognize are all the same word. To simplify: a computer needs to know that color and colour are the same thing if it's going to process language quickly and effectively. Recent research in NLP has demonstrated that people on social media platforms intentionally write how they speak. That is, they go out of their way to spell things in a non-standard way in order to better communicate how they talk informally. The best part is that this research holds across languages. While an American might be sittin (instead of sitting), a Dutch user of Twitter may well sitte (instead of zitten). This is especially true the further a dialect diverges from the written standard, as in modern dialects of Arabic. It's also true in AAVE, where the orthography you learn in school can't capture the phonological and grammatical nuances of the dialect -- something that writers like Zora Neale Hurston, Toni Morrison, and Ralph Ellison grappled with.
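To make the "group various spellings into sets" task concrete, here is a toy sketch in Python. The lookup table is hand-built and entirely hypothetical (real systems have to learn or infer these mappings from data), and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical hand-built lookup; a real system would learn these mappings.
VARIANTS = {
    "colour": "color",
    "sittin": "sitting",
    "nuttin": "nothing",
    "nuffin": "nothing",
}

def canonical(token):
    """Map a spelling to its assumed standard form (or leave it alone)."""
    word = token.lower()
    return VARIANTS.get(word, word)

def group_variants(tokens):
    """Group tokens into sets of spellings that stand for the same word."""
    groups = defaultdict(set)
    for token in tokens:
        groups[canonical(token)].add(token.lower())
    return {word: sorted(forms) for word, forms in groups.items()}

tokens = ["color", "colour", "Sittin", "sitting", "nuttin", "nuffin", "nothing"]
print(group_variants(tokens))
```

The catch for dialect research is that collapsing nuttin and nuffin into "nothing" throws away precisely the phonological information that makes the spellings interesting, so a linguist would want to keep the original forms alongside the grouping.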

Black Twitter: Stigmatized Speech, Innovative Writing

Around the time I was taking the class on dialect geography, I stumbled upon a YouTube video purporting to explain #Blackfolkslang. It's a fun example of what linguists call enregisterment: when a dialect feature gets (consciously) noticed and becomes an overt marker of linguistic belonging. A classic example is the stereotypical Brooklynese fugeddaboudit.

Being a native speaker of AAVE (due to childhood speech community), the forms made intuitive sense to me and were a lot of fun. When I showed them to non-speakers of the dialect out of context, however, they were baffled. "What is ioneem? Is that Arabic?"

I thought it would be fun to dig into their use, and see where these forms were used, and how often. I got help writing a script in Python, using the Twitter API and the Twython package to extract tweets, and started using the mapping tools I was learning in R to check them out.
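The script itself isn't reproduced here, but the filtering step can be sketched roughly as follows. This is not the original code: the term list is a small sample, and the layout of each tweet (a dict with a "text" string and an optional "coordinates" pair) is an assumption made for the example, not the Twitter API's actual response format.

```python
import re

# A small sample of target forms; the choice of terms here is illustrative.
TERMS = ["yeen", "talmbout", "ioneem", "sholl"]

# Match a term only as a whole word, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in TERMS) + r")\b",
    re.IGNORECASE,
)


def extract_tokens(tweets):
    """Yield (term, lat, lon) for each geotagged tweet containing a target form.

    Each tweet is assumed to be a dict with a "text" string and an optional
    "coordinates" pair of (lat, lon). That layout is hypothetical.
    """
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if not coords:
            continue  # no location, so nothing to put on a map
        for match in PATTERN.finditer(tweet["text"]):
            yield match.group(1).lower(), coords[0], coords[1]
```

The resulting (term, lat, lon) rows are the kind of thing that can be written out to a CSV file and handed to a mapping tool.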

It became an obsession.

A few months and a few hundred thousand tweets later, I came to a few realizations. First, there's no consensus. Some people tweet nun (for "nothing"), while others tweet nuttin, and others still tweet nuffin. Second, the forms used vary regionally. Third, the phonological clues these tweets provide can be corroborated by both other media and linguistic informants (informant: a fancy term for people who both speak whatever a linguist is interested in and are willing to talk to one). Lastly, there's not just one "Black Twitter." The Black Twitter that blogs, contributes to NPR, and live-tweets sociology conferences was not the Black Twitter I was reading. I was reading tweets from young adults not represented in the Pew Research Center Internet Project, from young gang members who signal affiliation with spelling (fun fact: crips superstitiously avoid the combo "ck" because it could stand for "crip killa," and will instead favor spellings like "fucc"), and from people who use Twitter as a free analog to both texting plans and dating sites.
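One simple way to see the first two points side by side is to tally, for each region, what share of tokens uses each competing spelling. The sketch below assumes the data has already been reduced to (region, variant) pairs; the function name and the input layout are hypothetical, and this is not necessarily how the analysis was actually run.

```python
from collections import Counter, defaultdict


def variant_shares(observations):
    """Map each region to the share of tokens per spelling variant.

    `observations` is an iterable of (region, variant) pairs. Both the
    region labels and the pairing are assumptions for this sketch.
    """
    counts = defaultdict(Counter)
    for region, variant in observations:
        counts[region][variant] += 1

    shares = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        shares[region] = {v: n / total for v, n in counter.items()}
    return shares
```

A region where one variant takes nearly all of the share points to a local norm, while a region with an even split points to competing forms.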

Some of the writing was not immediately recognizable. For instance, I was perplexed by yeen for "you ain't" (in part because it's not used in NYC or Philadelphia, I would later find). That is, I was perplexed right until I searched for it on YouTube, and came across dozens of different songs, often self-produced, which use yeen in the lyrics. Similarly, nun could conceivably be pronounced in a number of different ways. French Montana to the rescue! People often tweet lyrics to their favorite songs, and quite a number of them tweeted "nigga i ain't worried bout nun". Whether there is a glottal stop or it's elided for some of these tweeters is not clear, but what is clear is that it is two syllables, not one -- the only way to fit the rhythm.

Ultimately, I gathered data on ~30 terms (among them: yeen, talmbout, eem, ion, sumn -- you ain't, talking about, even, I don't, and something, respectively), and found that all of the variation could be explained by recourse to a handful of variations in pronunciation -- variations which can be corroborated by other means.

The Discovery: The Maps Don't Line Up

A handful of computationally minded linguists and linguistically minded computer scientists have been doing work on dialect geography using Twitter data, and I've found their work invaluable in developing this research. One of them, Gabriel Doyle (at UCSD), has demonstrated that dialect forms on Twitter correspond exceptionally well to the established gold standard of the ANAE. Like, uncannily, eerily well. He concluded, after some sophisticated statistical verification, that it's possible to glean geographic information about dialects from Twitter data.

His maps of double modals ("might could") and of the "needs washed" construction ("your car needs washed") line up perfectly with the maps produced by the ANAE and by the Harvard Dialect Survey (HDS).

My maps, however, did not line up.

Now, it has been known for a long time that including data from speakers of AAVE muddies things. In some ways, AAVE speakers do what other people in their general vicinity are doing, but in other ways they seem to do things differently. There's a large body of literature on this, but no national level description of regional variation in AAVE.

The standard maps of dialect regions in North America look like this:

Image from The ANAE, via the Texas English Project website: www.texasenglish.org

Notice the main feature is horizontal bands across the country, spreading from the East Coast. In some maps, the North, Midland, and South extend across the West, which is not given its own region. These regions follow patterns of westward expansion and settlement. In fact, maps of differences in building materials used in making cabins line up nicely with maps of dialect regions.

The thing is, AAVE does not share the same history as other North American dialects. Obviously, it is meaningless to discuss patterns of "settlement" when referring to black Americans, and while there is no consensus on the mechanics of how AAVE developed, it is understood to be largely an ethnolect, the product of a culture that developed in the last few hundred years shaped by (and despite) slavery, systemic racism, and extreme segregation.

In theory, then, the geographic distribution of AAVE should look different, and it should look roughly like the geographic distribution of Black Americans:

Image courtesy the Rural Assistance Center, from US Census 2010 data.

In some instances, this is what we see. For instance, when mapping AAVE-specific grammatical features like stressed been (which I discuss further here), the pattern lines up nicely with the population data:

initial exploratory plot of stressed been on Twitter

Note that tweets are concentrated in the South and the Northeast, and the areas with the highest black populations have the most tokens. Atlanta stands out particularly, but so do Oakland and LA, Chicago and Detroit. This pattern appears with other terms we'd expect to be non-regional, including nigga, tryna, and finna.

Similarly, enregistered lexical items (that is, local words famous for being local words) show up where we would expect them:

Philly's famously local word, "jawn," mapped on Twitter. Some of the unexpected points, on closer investigation, are people originally from Philadelphia. The two in Florida are someone referring to a friend named Jawn.

DC's newly famous "Jont." for more information, see Frances Sellers' fantastic article in the Washington Post, Is There a D.C. Dialect? It's a Topic Locals are Pretty Cised to Discuss, which discusses research by Georgetown's Minnie Annan.

We see other, unexpected things, however. Things like this distribution of sholl for "sure":

Once we map everything, we get a broad pattern, with some words (tryna, finna) completely non-regional, some (talmbout, sholl) in a band up the middle of the country, some (ioneem, nuffin) solely along the East Coast I-95 corridor, and some (yeen) only in the South:

When I compared the maps of broad patterns of use in AAVE on Twitter to a map of the Second Great Migration from the Schomburg Center for Research in Black Culture, the pattern, and its most likely historical cause, revealed themselves.

What we see on Twitter is exactly what a historical linguist would expect from migration and dispersal, followed in some regions by innovation. The South Central to Midwest corridor and the South share terms like the quotative talmbout and phonological features like the replacement of /r/ with /l/ in the word sure (e.g., "sholl is."). The South, and the South alone, has so-called /ey/-raising, consistent with Southern American English, making you ain't into yeen (whereas other parts of the country simplify it to yain). New York says "nuttin" and "suttin," but D.C. has nuffin, and Philly is split right down the middle by these competing forces:

"My bruva neva syced bout nuffin" - You a bama.

The above is just a small taste; I will be presenting quite a few more maps, and discussing the phonological data in much greater detail in my talk at the American Dialect Society annual meeting, this January. I'm also preparing a paper for publication. The key finding is that the pattern looks like what any historical linguist would expect after migration and innovation.

Why is this a big deal?

I'm extremely excited about this line of research, for a number of reasons:

Not many linguists have been riding the big data wave. Instead, computer scientists with no training in linguistics are compiling huge data sets of, well, language, and they're doing their best to analyze it, and similarly linguists with no training in computer science are often ignoring the new tools at our disposal. In some instances, computer scientists are beating us to interesting discoveries. In others, they're getting flawed research past peer review because they're so unfamiliar with established concepts in linguistics that they think they've discovered "super dialects" when in reality they've stumbled upon register. We should all be collaborating, instead of reinventing the wheel.
This is the first attempt at defining dialect regions in AAVE on a national level, providing a baseline of research - a starting point for other researchers (and me, of course) to refine. For instance, there is a significant body of research that suggests "th-fronting" (that is, pronouncing words with a th like they have f/v, as in nuffin) is universal. While it may be possible to find everywhere (especially now that a Philly rapper has a hit song with the word "mouf" in the title!), it does not appear that way in these data. Moreover, in NYC, it's often interpreted as a marker that the speaker is not from here. Conversely, an informant I interviewed who had recently moved to Waldorf, MD, told me how he had to insist that his children do not say "nuffin," because he didn't want them "sounding like their peers in school," going on to say "everyone around here talks like that." In this way, participation or non-participation in these phonological patterns may be performatively indexing (non)local identity.
This research relies on a new method of gathering data that can be complementary to traditional methods, and can help point toward new hypotheses, and new areas of research. For instance, some of the data suggest a syntactic change in progress (distinct from the one I'm presenting on at LSA 2015, in fact).

Ultimately, I'm excited because social media are new sources of data for linguists to take advantage of, and they're sources that are extremely rich and extremely large. Whereas Georg Wenker needed decades to send out 50,000 surveys and process the results, given the right question, we're on the cusp of being able to gather more data than that in just the better part of an afternoon.

I'm also excited because this research puts black folk back on the map, literally. It's time for a large-scale, systematic description of regional patterns in AAVE like what we already have for other North American dialects, and this is a step toward it!