A "Wobbly" Start for The Undefeated, Or: We Can't Talk About Race Without Talking About Language

May 19, 2016 by Taylor Jones

FiveThirtyEight just ran a piece from The Undefeated, an ESPN website that "explores the intersections of race, sports and culture," called ‘We Gonna Be Championship!’: A New Approach To ‘Fixing’ Sports Quotes: The cultural terrain we travel when quoting athletes verbatim.

In it, they discuss the tradition of journalists "cleaning up" quotes, and argue "This is a tradition that needs to go." They (correctly) argue:

"For one, it’s patronizing, with the implication that anything that deviates from the norm is inherently inferior and must be corrected. Black English, for example, isn’t a referendum on intelligence — it’s a reflection of centuries of segregation, just as American English is a linguistic representation of our country’s split from Britain. Passing judgment based on speech can often say more about the listener than the speaker."

However, it seems like the author has not yet really figured out how exactly they feel about African American English, Chicano English, and other non-standard or non-prestige varieties of English. Moreover, while correctly arguing that these varieties are not inferior, the author immediately launches into any linguist's least favorite argument: "Technically, it should be..." While clearly their heart is in the right place, they're actually wrong about the technicalities, likely in part because there's still almost no linguistic education at a compulsory level in the US, and we can't expect everyone to have taken an intro to sociolinguistics class.

In the interest of accuracy, I'm going to add a linguist's perspective to their discussion of language. That is, not just my own, but one I'm confident any linguist would share. These are the kind of things people interested in the intersection of [insert literally anything here], race, and culture should be thinking about.

They refer to Leandro Barbosa's statement "We gonna be championship!" as "wobbly" and use it as an example of where there's "no fix to be found." They're correct there's no fix, but that's because there's literally nothing wrong with it, especially if he learned to speak English from teammates who speak AAE. I can think of two possible things people might think are wrong with it, and I'm not entirely sure which one is the issue, so I'll address both:

African American English allows deletion of the verbal copula -- meaning the words is, are, etc. -- in the present tense, when it's not first person singular. That is, you must say "I'm," but you can say "we are gonna," "we're gonna," or "we gonna." The Toronto Raptors get this: We The North is their motto now.
Championship is clearly functioning as an adjective here.
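
The copula pattern above can be sketched as a toy rule. This is only an illustration of the description in this post, not a model of AAE grammar: the function name, the tiny pronoun table, and the assumption that "I" is the only subject blocking the zero form are all simplifications made for the example.

```python
# Toy model of the present-tense copula pattern described above.
# All names here are hypothetical and the pronoun table is deliberately tiny.
FULL = {"I": "am", "we": "are", "you": "are", "they": "are", "he": "is", "she": "is"}
CONTRACTED = {"am": "'m", "are": "'re", "is": "'s"}

def present_copula_options(subject, predicate):
    """Return the present-tense variants the post describes for a subject."""
    copula = FULL[subject]
    options = [
        f"{subject} {copula} {predicate}",             # full form
        f"{subject}{CONTRACTED[copula]} {predicate}",  # contracted form
    ]
    # Assumption taken from the post: the zero form is unavailable with "I".
    if subject != "I":
        options.append(f"{subject} {predicate}")       # zero copula
    return options

print(present_copula_options("we", "gonna be championship"))
print(present_copula_options("I", "gonna be championship"))
```

The point of the sketch is just that the zero form is rule-governed: it appears for "we" and never for "I".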

They write: "Do we consider Yoda any less wise because of his mixed-up syntax?" This is a decent argument when we're talking about non-native English speakers, like Carlos Gomez. When we're talking about native speakers of non-prestige dialects, as is often the case, it kind of falls apart -- they don't have "mixed-up" syntax, they just speak different, equally complex and systematic dialects. It's just that the rules are slightly different, not that there aren't any.
Finally, they discuss someone saying "he's a idiot." The author writes: Technically, it should have been “He’s AN idiot.” This is incorrect. In the prestige dialect -- something we'll call, say, classroom English -- there should be an -n there; however, I wonder to what extent the author could explain why. The word an is an allomorph, meaning it's a different shape for the same word, and it is only used before words that start with a vowel (think about it: a pie, an ice cream). Note that "vowel" has a specific meaning in linguistics, and it's not about letters: vowels are sounds that are made in such a way that there is not an obstruction of airflow in your mouth. African American English and a number of other language varieties have a different allomorph: they have a glottal stop, represented by /ʔ/, instead of an /n/. Glottal stop is the sound in the middle of "uh-oh." Standard English doesn't have a good way of writing glottal stops, despite using them everywhere (think of Jason Statham saying "British" to Idris Elba). So both varieties use a different form of 'a' before vowels, but only one marks it in writing. Linguists avoid this by using the International Phonetic Alphabet, which has one letter per sound, and where you can see the difference between "an idiot" and "a idiot": [ən.ɪdiət] [əʔ.ɪdiət]. Notice that BOTH avoid putting "a" directly before a word-initial vowel.
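
To make the allomorph rule concrete, here is a minimal sketch that picks the shape of the indefinite article from the first sound of the following word. It assumes the input is already in IPA; the vowel inventory, the variety labels, and the function name are all hypothetical and far from complete.

```python
# Hypothetical, deliberately incomplete inventory of IPA vowel symbols.
VOWEL_SOUNDS = set("aeiouæɑɒɔəɛɪʊʌ")

# Pre-vocalic shape of the indefinite article in each variety, per the post.
PREVOCALIC = {"classroom": "ən", "aae": "əʔ"}

def indefinite_article(next_word_ipa, variety):
    """Choose the article's shape from the first SOUND of the next word."""
    if next_word_ipa[0] in VOWEL_SOUNDS:
        return PREVOCALIC[variety]
    return "ə"  # plain form before a consonant sound

for variety in ("classroom", "aae"):
    print(variety, indefinite_article("ɪdiət", variety) + "." + "ɪdiət")
```

Both varieties run the same rule and differ only in which consonant fills the gap, which is the sense in which neither one is "missing" anything.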

I'm glad to see that ESPN is starting to think more about race and culture. For many people, professional athletes are their only regular exposure to some of these dialects, and it's only natural that when we hear things that aren't familiar to us, we think they might be wrong, rather than just different -- especially if we don't have enough exposure to see how thoroughly systematic they are. I hope The Undefeated will take language seriously, and not rely on "common-knowledge," folklore, and common fallacies to do it. Perhaps they need a linguist on their staff?

EDIT: A friend mentioned that Barbosa's native language is Brazilian Portuguese, so I've made some slight tweaks to reflect that. The broader point -- that athletes speaking AAE are often "corrected" -- still stands, and there are countless examples, from the recent discussion of Marshawn Lynch or Richard Sherman, to older examples, like the fact that Oscar Gamble's "They don't think it be like it is, but it do," has become an internet meme often used when people think something is gibberish.

-----

©Taylor Jones 2016

Have a question or comment? Share your thoughts below!

Jawn in the news (again)!

March 24, 2016 by Taylor Jones

I recently had a long conversation with Dan Nosowitz at Atlas Obscura about the Philly word jawn. The full article is here.

I want to take a second to expand on a couple of points, and make a few things more precise:

Linguists will notice I simplified the (extremely) complicated situation with regard to tense /æ/ in Philadelphia. In particular, depending on the speaker, the direct comparison with bag may or may not be valid. The general point I was getting at was that white and black tense /æ/ systems are diverging in Philadelphia. Relevant work by Bill Labov here.
The by-line is sensational, but I definitely did insinuate as much. I absolutely welcome any examples from linguists of generic nouns that can be count/noncount, human/nonhuman, concrete/abstract, etc. Jawn seems to be radically unspecified.
I can't remember my exact words, but I'm pretty sure I said I suspect/hope/think Bill is enjoying his retirement. He's still doing a lot of work (we discussed a couple of forthcoming papers and a forthcoming book earlier today). I don't want to give the impression that he's dropped everything -- the context was specifically discussing talking to journalists, and whether he was currently available.
With regard to diphthongs, specifically the sound in joint, here's wikipedia on English diphthongs. Words like joint, boy, toy, etc. are generally taken to have [ɔɪ̯]. More generally, here's jawn: /d͡ʒɔ:n/; and here's joint /d͡ʒɔɪ̯nt/ -- which is realized in the song discussed in the article as [d͡ʒɔ::ɪ̯nʔ].
Finally, the PNC is not outdated, but the speakers I currently have access to who would be relevant to AAE use of jawn are not recent.

I'm starting to think it's time to sit down with everyone who's worked on jawn and write a definitive paper, especially given the radical semantic unspecification that seems to be at play here.

EDIT:

A few more points:

Plural is jawns, however, plural -s is often deleted in AAE, leaving you with just jawn. White Philadelphians have been very, very upset by plural jawn in the article. Black Philadelphians have been very upset with some of the white uses of jawn in the article. I'm bringing Philadelphians together!
"A lot of jawn to do" is not universally liked, and people are very vocal about that as well. That said, I obviously did not make it up (some people think I did -- that's not how linguistics works!), and it's extremely easy to find a few tokens of it on social media, so you don't have to just go by my notes.

-----

©Taylor Jones 2016


Kids Now Don't Discriminate How They Used To

March 01, 2016 by Taylor Jones

In the last two years, I've graded a LOT of undergraduate assignments, and I noticed something I think may point toward a change in progress. Kids keep misusing discriminate.

...or so I thought.

The thing is, they keep making the same mistake. And it's a mistake I had seen in grading for a number of classes over the last few years. And I'm starting to think it might not be a mistake.

Instead of saying "he discriminated against me," for instance, they'll say "he discriminated me." They seem not to have the older reading of discriminate at all, so I'm not sure to what extent something like "he could not be discriminated from the background," where it means distinguished from, would be used or understood by teenagers now.

A quick search reveals it's quite common on Twitter and other social media, with examples like:

If you flagrantly discriminate me, don't think that will deter any part of who I am. All it does is make me more determined
I always wonder y do ppl discriminate me about being myself
If u discriminate me over my sex & deprive me of my dignity & rights, fuck yes I'll sue your ass
he was mad she discriminated me
she didn't diss me.but indirectly discriminated me

And of course it's not just first person:

Trump discriminated her
obama hates him since he racially discriminated him
you yourself discriminated them

And it can be passivized, and occur in all tenses:

I was discriminated on the basis of my gender once
its sad because everyone with colour will be discriminated with time if trump wins
If Portia was being discriminated because of her gender then all the insults thrown would be focused on that
I never knew Asian gays had been discriminated before
wish I could pull the race card when my career is flopping but I can't bc im white yet my family had been discriminated for being Russian

The most interesting thing to me is that people now will use "against" as well, but to mean different things, and it doesn't seem settled what they want it to mean:

My height never discriminated me against anything
basically your staff discriminated me against an able bodied person.
you couldn't handle my opinion so you discriminated me against my age and gender

It sounds like whereas my generation and older would say "he discriminated against her because she is female (and he's sexist)" people are now likely to say "he discriminated her against being a woman."

Personally, I have a hard time parsing this structure -- my first inclination is to read "he discriminated her against being a woman" as trying to mean something like "he could tell that she wasn't a woman by comparing her to women" but this is clearly not what is meant. Rather, I'd translate it as "he discriminated against her because she is female."

I'm generally a champion of language change and innovation. It's my job, after all. But with this particular structure, I think I understand -- on an emotional level -- the curmudgeonly and pedantic response everything from totes to fleek gets. But, of course, I will accept the fact of language change, and not discriminate kids these days against their language use.

-----

©Taylor Jones 2016


More Jawn Jawn

February 17, 2016 by Taylor Jones

Right after the Grammy Awards, CBS 3 in Philly aired a segment hosted by Nicole Brewer, called Good Question, in which I was featured (links below). The question was "What is jawn?"

I left out something unique about Philly's use of jawn that is distinct from joint, jont, or jaint elsewhere: jawn in Philly can replace non-count nouns. That is, you can use it where you might use "stuff," whereas joint can only be used for one or more individual things.

Example (remember * means something is ungrammatical -- that is, structurally unacceptable to native speakers):

That joint is dope.
That jawn is dope.

Those joints is dope.
Those jawn(s) is dope.

I got a lot of *joint to do
I got a lot of jawn to do.

This is a really cool feature that further distinguishes jawn from other regional variations on the same theme.

For those interested in the segment, the video is here.

Nicole Brewer's twitter feed is here.

Jezebel coverage is here.

Philly Mag coverage is here.

And since it didn't make it into the video, I have a shout out to Abdul Kareem, manager of the Gap on Walnut & 35th in Philly, who sold me the shirt I wore in the interview. He greeted me when I walked in with "bomber and shelltops? That's my jawn!" When I left, he told me "I hope that jawn goes good for ya." More importantly, he straight up knew the history of the term ("it's our way of saying joint") and why it's special ("...but we can use it for more.")

-----

©Taylor Jones 2016


On strolling

February 16, 2016 by Taylor Jones

Recently I wrote a blog post in which I mentioned that "it makes no sense to say: I strolled you."

Mark Liberman helpfully pointed out that in certain contexts, this is not entirely true. He wrote:

But there's causative examples with a direct object and a goal -- some web examples:

We strolled you around the park and had a picnic.

He was the one who strolled you around the neighborhood and talked to you like you understood everything and encouraged you to have an opinion.

She immediately released Michael's hand while he just stood there in a trance, but realized he needed to be placed somewhere, so Ophelia strolled him over to a nearby chair and sat him down when she said, “You wait for me here..

I strolled him around and admired the beautiful trees.

Her reactions of surprise and delight had been repeated many times over as he strolled her through each garden 'room'

Annie strolled her empty cart straight to the produce department

And there's a transitive sense that the OED glosses as "To walk or pace along (a path) or about (a place)", with citations back to 1623:

1693 R. Gould Corrupt. Times 28 For thee the dirty Drab does strowl the Streets.
1720 Swift Progr. Beauty 87 So rotting Celia stroles the Street, When sober Folks are all a-bed.
a1772 Ess. from Batchelor (1773) I. 249 After strolling the Green, arm in arm with L——d M——lt——on.
1810 Splendid Follies III. 119 [He] had been strolling the solitary path of the elm-walk.
1956 H. Gold Man who was not with It vi. 50 Her laughter rang out as we strolled a business street of the suburb.
1974 New Yorker 3 June 76/3 (advt.) Hike forest trails, stroll lovely gardens.
1977 Gay News 24 Mar. 23/1 They taxi to the Toilet and stroll the dock strip at 3 am.

In my defense, I was grasping for a good example of a verb that doesn't transitivize well, and without a goal or a causative reading, stroll fits the bill. There are actually a ton of words like this, too (that is, words that aren't transitive by default, but that can receive such readings under appropriate circumstances). My first inclination was to use walk, but I immediately realized it had this problem. In retrospect, I should have chosen an unaccusative verb -- a topic I'll cover in another post.

Arguments About Arguments

February 04, 2016 by Taylor Jones

Lately, I've been thinking a lot about what linguists call valency. This is in part because I was recently discussing the weird privileging of some grammatical structures over others by self-appointed "grammar nerds", and in part because it's been very relevant in studying Zulu.

Valence or valency refers to the number of arguments a verb has (or "takes"). What's an argument? In this jargon, it's basically a noun or noun phrase. The idea is that different verbs require -- or allow -- different numbers of nouns. The ones that are required are sometimes referred to as core arguments. For instance:

I strolled.

The above is referred to as intransitive, and it allows only one argument. It makes no sense to say, for instance:

*I strolled you.

(the * means the sentence is ungrammatical, meaning that it doesn't make structural sense.)

Similarly, there are verbs that take two arguments (transitive verbs) and verbs that take three (ditransitive verbs). Examples are:

he hit me.
I gave him a book.

Admittedly, this is not all that interesting. What is interesting, however, is valence changing operations.

Different languages have different tools for taking a verb of one kind, and changing the number or structure of core arguments. What does this mean? It means making a peripheral argument into a core argument, like this (kind of cheat-y example):

I gave a book to him
I gave him a book

Or totally changing which arguments have which syntactic positions:

I ate a whole thing of ice cream.
A whole thing of ice cream was eaten.

The above example, many of you will recognize as the passive voice. The passive voice has gotten a bad rap. The passive voice is freaking cool! It lets you keep the same meaning, but shuffle around what structural role all of the noun phrases are playing. This, in turn, allows you to highlight a different part of the sentence, and shift focus away from the agent (or even refuse to name the agent). What's more, it's just one of many valence changing operations that languages make use of.

Image from the blog Heading for English.

Other languages have more. And they're awesome. David Peterson has a great discussion of this in his book The Art of Language Invention, but his examples in that book are often created languages. Natural languages, though, were the inspiration. Zulu, for instance, has passives like English, but also has causatives and benefactives. And you can combine them. For instance:

fona = to telephone someone
ngi-ni-fona = I call you (lit. I you call)

You can make it causative by adding -isa, which then makes the verb mean to cause/make/let/allow/help someone telephone someone. Notice anything? That's right, you've added an argument.

fona -> fonisa
ngi-ni-fona = I call you
ngi-ni-fon-isa umama = I help you call mama

Benefactives are similar, but they make it so you do the verb for/on behalf of/instead of someone else. In Zulu, this is done by adding -ela to the verb stem:

fona -> fonela
ngi-ni-fona = I call you
ngi-ni-fon-ela umama = I call you for mama.

Other languages have other kinds of things. For instance, some languages have malefactives. That is, things that are done not for or on behalf of someone, but despite someone or intending them harm, ill-will, or general bad...ness. Salish, a Native American language spoken in the Pacific Northwest, makes use of malefactives. English has a construction which does the same thing, but doesn't encode it on the actual verb:

she hung up on me.
he slammed the door on me.
she walked out on me.
My car broke down on me.

Imagine something like "she hung-on-up me." Notice, also, that benefactives do a weird thing to the arguments. In English, you can say:

I cook rice.

And that's transitive. If you add a bit about who you're doing it for, you get:

I cook rice for you.

or:

I cook for you.

In Zulu, though, there's very specific sentence structure. I'll put the words in English, but add the Zulu morphemes, to make it as clear as possible:

I cook rice
I cook-ela you rice
I cook-ela you.
* I cook-ela rice you.

Remember the * means "ungrammatical." This is usually discussed in terms of promotions (yay!) and obligations (ugh). That is, the benefactive in Zulu promotes the argument that is benefiting from the action, and makes it obligatory. It also must immediately follow the verb. The thing that was the object of the sentence (rice, in this case) is then an optional argument. You don't even have to say it. Or think it. Just forget about the rice.

Therefore, as valency changing operations add arguments, so too they taketh away. This is what the passive voice is doing. Whereas benefactives promote an indirect object to direct object, and then make the original direct object optional, passives promote the direct object to subject:

I cook rice
rice is cooked (by me!)

...or, if you prefer:

The whole thing of ice cream was eaten. (I refuse to say by whom.)

This is why I don't understand "grammar snobbery." Your language has a syntactic tool that does a totally cool thing, and you're just gonna decide that it's somehow bad? It's a feature, not a bug! If you think calling a natural function of your grammar that's linguistically universal bad is a way of indicating how much you know about grammar, you've got weird priorities. Appreciating grammar is not a competition to see how little of your language you can use or appreciate.

Not only are valence changing operations not bad, and totally super cool, but get this: you can combine valence changing operations, so you can have a passivized benefactive in Zulu, or a passivized causative. You can have things like:

A cake is being baked for mama

...but they're encoded entirely on the verb:

(ikhekhe) li-zo-bhak-el-wa umama
cake it.FUT.bake.BEN.PASS mama

Even better, you can have a causative, a benefactive, and a passive marker on the verb, so you get something like:

bhala = "write" or "enroll"
ba-bhal-is-el-wa-ni
they.enroll.CAUS.BEN.PASS.why = "why are they being made to enroll?"
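
The stacking of extensions in these examples can be sketched as plain string concatenation. This is a toy under stated assumptions: the suffix shapes are read straight off the examples in this post, "bhaka" as a citation form is assumed from li-zo-bhak-el-wa, the function name is made up, and any sound changes that real Zulu suffixation may trigger are ignored.

```python
# Suffix shapes as they appear in the examples above; labels follow the glosses.
EXTENSIONS = {"CAUS": "is", "BEN": "el", "PASS": "w"}

def extend(verb, *labels):
    """Build a hyphen-segmented verb from a citation form ending in -a."""
    root = verb[:-1]                                   # fona -> fon
    parts = [root] + [EXTENSIONS[label] for label in labels]
    parts[-1] += "a"                                   # final vowel goes on the last piece
    return "-".join(parts)

print(extend("fona", "CAUS"))                  # fon-isa
print(extend("fona", "BEN"))                   # fon-ela
print(extend("bhaka", "BEN", "PASS"))          # bhak-el-wa
print(extend("bhala", "CAUS", "BEN", "PASS"))  # bhal-is-el-wa
```

Subject markers, tense, and the final -ni ("why") are left out, and the order of extensions is simply whatever the caller passes in.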

Somewhere, there's a language that can express my desire that the passive voice be made to be used by self appointed grammar snobs, malevolently, by me, and that language can encode most of that on the verb. If not Salish, then maybe the Niger-Congo language Koalib. And that's a beautiful thing.

-----

©Taylor Jones 2015


Turns out Squarespace doesn't like javascript that much

February 02, 2016 by Taylor Jones

I destroyed my website by trying to include Leipzig Glosses for Zulu the other day. I have new posts coming soon, but the formatting may be less than ideal, since using Leipzig.js kills my website.

We'll see. ¯\_(ツ)_/¯

A very big weekend

January 12, 2016 by Taylor Jones

This last weekend was the annual meeting of the Linguistic Society of America, as well as concurrent meetings of seven sister societies. It was huge, and a particularly big weekend for me.

Chris Hall (@linguistopher) and I presented a poster on Rachel Doležal's linguistic performance of her concept of 'black'-ness.

Lauren Spradlin (@lsprad) and I presented a talk on the morphophonology of totes truncation (which I previously wrote about here), which was featured in the media advisory for the LSA, and which will be written up in a number of pop publications (links as they happen).

I gave a well-received talk on negation in African American English (which I previously wrote about here), which won an award for being one of the ten best student abstracts (number 10, to be precise, but I take what I can get!).

My current research on AAVE comprehension was mentioned during John Rickford's presidential address (!) and I was acknowledged at the end of his talk in the same line as Bill freakin' Labov (!!).

Last, but not least, the edition of American Speech that came out this week has my article "Toward a Description of African American Vernacular English Dialect Regions Using 'Black Twitter'" as the first article!

It's been an exhausting, but great weekend, and a great way to start the new semester.

Astronomically Strange Intensifiers:

December 23, 2015 by Taylor Jones

Recently, I noticed some strange uses of adverbs. Some examples:

"in an active shooter situation, things can get astronomically bad."
"I miss you unconditionally."

On the surface, these don't make a lot of sense. Astronomically generally refers to scale: astronomically large distances, for instance, are distances that are so large as to be on the same scale as the distances between celestial bodies. In general, astronomical refers to things that are large enough to be of that scale (say, 93 million miles). However, the speaker who uttered the above had made the jump from very large to just very.

Similarly, if you love someone without any conditions or expectations, it's reasonable to say "I love you unconditionally." If you don't think too hard about what this means, it could be reasonable to interpret it as meaning "I love you a lot," and from there it's a short step to "I miss you unconditionally" meaning "I miss you a lot." Of course, taken literally, that sentence has some peculiar entailments, like "I miss you even when you are present, and there is no situation in which I won't miss you."

What's happening here is the same process of semantic bleaching that gave rise to literally as an intensifier (no, it does not mean figuratively. "I'm literally starving" means the same as "I'm really starving" and does not mean the same as "I'm figuratively starving."). For that matter, it's how we got words like truly, as in "I'm truly hungry."

All I'll say about this is that I'm literally astronomically impressed with people's ability to innovate by analogy.

-----

©Taylor Jones 2015


Recent Interview Jawn (now with extra jont, jeent, and jaint!).

December 14, 2015 by Taylor Jones

I was recently interviewed (along with expert on all things Philadelphia, Joe Fruehwald) about use of jawn in Philadelphia. The article appeared last week in Metro Philly, and then Metro New York. They even use a quick graphic I made of jawn on Twitter.

Check it out here: http://www.metro.us/philadelphia/jawn-it-s-the-new-yo/zsJola---iR0kUASCcK0nI/

Note, I think the author might have mixed up a quote from me and from Joe, unless Joe also has an interest in jawn-like words in other varieties of AAE (this is a possibility).

For more on DC's jont, there's the (relatively) recent article in the Washington Post on the DC local dialect (hint, Chocolate City's local dialect is local AAE). And of course, there's some discussion of KY, TN, and (Eastern) PA jeent, online, as well as Virginia's (and basically all points south of DC) jaint. (Note to self: Virginia's Jaint sounds awful.)

NYC is still holdin' it down with the classic: joint. That said, I do increasingly hear teenagers in NYC occasionally saying jawn, though it is rare.

I'm very curious to see if the success of Creed really does cause jawn to spread. In AAE, it would be doubtful, since (1) everyone already has their own jawn-like word and (2) not everyone uses the vowel in jawn (the same as in a stereotypical New York pronunciation of coffee), or can even reliably hear or pronounce it. For non-AAE speakers, I'm not sure. Inability to pronounce a word the same, or even having the same word already in your lexicon, has not stopped white speakers of various mainstream varieties of English from borrowing slightly different pronunciations of words and giving them different meanings than they have in AAE: turnt (turned), bae (babe), cray (crazy) or ratchet (wretched) in the last few years alone.

And, of course, the original Rocky introduced wider America to yo, so anything could happen.

-----

©Taylor Jones 2015


Word Embeddings

November 02, 2015 by Taylor Jones

I recently read an excellent blog post about word embedding models, something I've been fascinated by for some time now.

To simplify some, they're vector models of word relationships in a corpus, so you can imagine relationships between or among lexical items as being situated in (some higher-dimensional) space, that you can then theoretically reduce to meaningful relationships and project in manageable dimensions. Spatial relationships among items can then potentially tell you something about both what words mean similar things (or do similar things), and about the corpus itself. Moreover, because it's all linear algebra, you can perform mathematical operations on items in the space. The most famous example of this, and one that's been going around a lot on social media lately, is:

king - man + woman = queen
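
As a minimal sketch of what that arithmetic means, here is the same operation on made-up vectors. Nothing below comes from a trained model: the three dimensions, the numbers, and the function names are all invented for illustration, and real embeddings have hundreds of dimensions with no such tidy labels.

```python
import math

# Made-up 3-dimensional vectors: (royalty, maleness, fruitiness). Not real embeddings.
VECTORS = {
    "king":  (0.9, 0.9, 0.0),
    "queen": (0.9, -0.9, 0.0),
    "man":   (0.1, 0.9, 0.0),
    "woman": (0.1, -0.9, 0.0),
    "apple": (0.0, 0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))  # queen
```

The "=" in the famous example really means "nearest neighbor of the resulting point," which is why the input words are excluded from the candidates.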

For a variety of reasons (better coding skills, new familiarity with matrix algebra, interest in external computational validation of semantic intuitions), now seemed like an excellent time to level up. So, I decided to get to work in R (after some cleaning in python, with the NLTK package).

The tutorial for the wordVectors package does a lot of fun things with a large corpus of cookbooks (closest words to fish: salmon, haddock, cod...), but I figured why not play around with some other things? Why not, say, tweets?

I have a corpus of ~17,000 tweets all in (basilectal) AAE that I collected for my research on geographic patterns in AAE on social media. While this definitely is on the small end, it seemed suitable as a trial run, and I'm quite pleased with the results.

For instance, among the closest words to eem are terms that are negation (don't, didn't, ain't, can't) and negative polarity items (even, much, yet, nomo, anymore, anything). Among the closest to nuttin are nuffin, and sayin.

What I'm finding really interesting is the results of projecting the whole thing down to a two dimensional space, even before having really cleaned the data:

Things that belong together are very clearly together: happy is right under birthday (top left). Nuffin and nuttin are both in the same place, as are somebody and sumn. Talm and talmbout are right on top of one another (bottom left), and quite far away from talk and talking (middle right), with said in between (exactly what I would predict based on the material I presented at NWAV). Eem is right next to even in a cloud of negative words: ain't, don't, ion (i.e. "I don't..."), all at the bottom right. Question words are all clustered together at the top (slightly left of center). Verbs (sleep, eat, talk, take, give, go, hit) are all in a cluster in the middle right. Dat and doe are right by one another (top left). Hell is in the immediate vicinity of both nah and yeah.

Of course, the fact I haven't much cleaned the data means that don, can, ain are a different cluster than dont, cant, aint, but an actual analysis would fix that (and exclude http, https, and all the floating alphanumeric bits).

Even with messy data, there are some intriguing relationships: jawn is right between miss and somebody/sumn (and forms a triangle with somebody/sumn and baby/girl). In fact, the nearest vectors to jawn include jont and philly.

Moreover, performing vector operations like jawn - philly yields jont, the Washington DC equivalent of jawn, in the top 3 results (pragmatics: guess which rank). Nuttin - nyc yields nuffin. This is fascinating, in part because geographical variation is showing up in a very abstract high dimensional space, almost like a regional AAE translator:

jawn - Philly = jont

nuttin - NYC = nuffin

The next step is to do some transformations of the vector space to dig into these relationships: what happens when you frame things in terms of an opposition between love and hate? Where would jawn fall relative to girlfriend?

I have a lot of work to do to develop this, but already I can see some excellent potential for future research.

-----

©Taylor Jones 2015


NWAV! Also, more (interesting) posts coming soon.

October 22, 2015 by Taylor Jones

This weekend I'm attending NWAV 44 (New Ways of Analyzing Variation), where I will be giving a talk on the African American English verb of quotation talkin' 'bout. I'll have a post on that very soon.

I'm excited to see that it looks like someone else has noticed and operationalized the study of something I noticed a few years ago, but couldn't figure out how to get at: the interaction between affect and creaky voice. The paper is "Creak as disengagement: gender, affect, and the iconization of voice quality."

I'll have updates after the conference, as well as a ton of new posts that are in the pipeline, including a run-down of talmbout/talkin' 'bout to introduce quotation, a discussion of recent work I co-authored on reduction in Mandarin, and a tin-foil hat crazy speculation about French and Zulu.

I'm also going to experiment with blogging more, but shorter. I have a problem with wanting to put up long(ish) essays on subject matter I'm very confident about, and this means that shorter observations -- but ones that are more than 140 characters -- get ignored. Insofar as I'm also in the middle of finishing up 3 publications, have two qualifying papers to write, and still have homework for my courses, this also means I just don't blog. I'm going to try and change this.

I've already got the javascript working for interlinear glosses, and a lot of material to write up. Here's to the future!

Bruh, breh, brah, bro.

September 11, 2015 by Taylor Jones

I was recently sent a tweet, where the author wrote <brah> as a term of address, and I was asked, more or less, "what's going on here?"

Since this actually comes up surprisingly often, I decided to take a closer look.

My friend Bri told me, on multiple occasions, that she has been chastised by white people for saying or writing "bruh," as it's ostensibly "a lazy version of bro," where bro is a truncation of brother. I don't think I'm blowin' up anybody's spot when I say that it turns out people really, truly feel this way:

@william_warren bruh is even worse than bro. It's like a lazy version.
— Leslie Caitlin (@lesliecaitlin) December 5, 2014

Bruh is just a lazy way of saying bro which is a lazy way of saying brother
— MAX (@Hack_Maxman) June 25, 2014

The thing about this is that, as I was explaining to the undergrads in Mark Liberman's Intro to Linguistics class, often the way we evaluate language is about social factors and not anything inherent to the language itself. In this instance, it would be difficult to claim, scientifically, that one or the other form is lazy: all four have the same number of segments, and the first two segments are the same in each. The only difference is the vowel: br[ʌ], br[ɛ], br[a], br[o], with the vowels in brother, bed, cot, and flow, respectively.

So, from a speech production perspective, none is more or less difficult than any of the others.

What's happening then?

For starters, there's social perception of who says what. Each of these ways of addressing someone has its own indexical field, following Penny Eckert's use. To simplify, this means that you don't just hear and parse the word, but it evokes a whole range of associated concepts related to social identity.

My intuition is that bro is taken as the de facto 'standard' way of saying it, and that the other three index identity. bruh is stereotypically black, and conforms to a common way of truncating words in African American English (which I discuss briefly here; cf. luh 'love', belee 'believe', cuh 'cousin', etc.)

Breh and brah are suggestive of the California Vowel Shift, but this doesn't mean that people who use them are from California. It may be that people are trying to build an identity evocative of something (say, a laid-back surfer) without being that thing.

In order to further investigate, I pulled a bunch of tweets. It turns out, all the variants are used everywhere. Unsurprisingly, bro is the most common:

Tweets containing bro.

Bruh is the next most common, but occurs at one fifth the rate of bro (sort of like how black people in the US occur at 1/10 to 1/5 the rate of white people. Hmmmm.):

Tweets containing bruh.

Bringing up the rear are breh and brah:

I decided to poke around a little bit more, so I joined all of the above tweets to a Census 2010 spatial data frame, and ran a few models. What I found was basically that the best predictor for all varieties was population (that is, people tweet where people are) but that things like total black population or percent black population did not seem to have a terribly strong effect on which variant was used. Granted, it was a very cursory attempt, and if I were really digging into these data I would be doing a lot more to verify this, but preliminarily, it seems like everyone's using everything.

What is particularly interesting to me is this person's assertion that she uses different forms to address different kinds of people:

Different forms of bro: Bruh- addressing anyone Brah- for females Breh- lazy way to say it
— oblivion (@phantasmagoricc) January 18, 2015

I'm particularly interested in hearing people's thoughts on this. If you don't stigmatize any of the forms, are they in free variation, or do you use different forms for different people? Leave a comment below, or tweet at me!

-----

©Taylor Jones 2015

What Nobody's Discussing about Rachel Doležal, Dishonesty, Dialect, and Strategy

June 16, 2015 by Taylor Jones

EDIT: This looks worse than I originally thought. The linguistic observations in this post are based on the video linked in the post (from early 2014); however, Rachel Doležal speaks differently in her recent interview with Melissa Harris-Perry. Looks like my new summer project is a more rigorous comparison.

By now, everyone I know has heard about Rachel Doležal, the former president of the Spokane, WA chapter of the NAACP: specifically, that she is a white woman who has been successfully posing as black for over a decade, and was recently outed by her white parents. Many of my friends have written thoughtful and interesting comments on the situation (e.g., Brianna Cox's take on it) and a few professors I know are discussing how to go about using it as a teaching moment (and what to take away from it).

While people have discussed power structures and privilege, passing and assimilation, whether "trans-racial" is really "a thing," and what Doležal's motivations could possibly have been, there are a few interrelated aspects of the increasingly ridiculous spectacle that have been overlooked, and which I find fascinating. Really, they're all facets of the same one observation:

Rachel Doležal didn't bother to attempt ANY use of African American English. AT ALL.

Don't believe me? Here are videos of a 2014 interview with her.

This is incredible to me -- not in the sense of 'unbelievable' but in the sense that it's just astounding that she didn't use any AAE features and that she was right that she didn't need to in order to pass. For a decade.

First, I will explain what I mean when I say she didn't use any AAE features. Then I will discuss two interrelated possibilities: (1) she couldn't hear it well enough to even know she wasn't doing it, and (2) she made the decision that she didn't need to bother with speech after changing her appearance.

As a linguist, and as a white person who speaks AAE by virtue of my childhood speech communities, I am as aware as anyone could be that race and dialect are not intrinsically related. There are white people who speak AAE, there are black people who don't, there are people of both races who think they do but really don't (like this guy!), and ultimately, though race is even more difficult to pin down than dialect, both resist simple description. In the US, though, because of our history, there is an ethnolect spoken by many people who are raced (i.e., "perceived by most people") as black, and while it varies from speech community to speech community, it has overarching features that we can describe -- just like how we can talk about "French" despite the fact that what's spoken in Quebec is very different from what's spoken in Côte d'Ivoire.

What Doležal pulled off was the equivalent of successfully posing as a Parisian without having so much as a French accent, let alone speaking French. A black turtleneck, a beret, the occasional Gauloises cigarette, and voilà: you're French. Except in this example Doležal wouldn't have needed to say voilà.

While no black American will necessarily speak with all the distinctive features of AAE, and some have none of the features of AAE, I'm flabbergasted by the fact that Doležal didn't bother with any. An excellent introduction to all the things she didn't do is Dr. John Rickford's article Phonological and Grammatical Features of African American Vernacular English. Listening to her speak, even accounting for the fact that it's a semi-formal interview and she's speaking in the capacity of Professor, it's still surprising how few features she exhibits. There are zero grammatical features: no habitual be, no stressed been, no preterite had, no AAE patterns of wh-operator and do-support inversion in main and relative clauses (relative to white North American Englishes), no done, no finna, no tryna, no typically AAE use of ain't, and so on and so forth. There's not even negative concord!

There are also very few to no phonological markers: she does not generally reduce consonant clusters (as in I was juss confuse for 'I was just confused'). She tends to fully release stops, even word finally, and she has no secondary glottalization on unvoiced stops. There's no word final devoicing of anything let alone deletion. Words ending with -ing don't get reduced to -in'.

Finally, she doesn't use any AAE specific words or phrases. This is bizarre, especially given that she went to Howard. Even white non-speakers of AAE pick up words and phrases when they live in AAE speaking communities (it's called "accommodation"). I'm not expecting her, as a Howard grad, to use bama, lunchin', cised, press, or jont after leaving DC, but...like...give us something. Say bison after mentioning Howard (even though you sued them for 'reverse racism' when you identified as white, and lost). There's no stress shift, even in completely enregistered words, like police. There's literally nothing. It. Ain't. No. NOTHIN. What's perplexing about this is that such features can often serve as in-group signals that reinforce shared community, and so it would seem reasonable that she would employ some AAE features (1) to demonstrate she is actually 'down', and (2) to connect with her interviewer.

So the question here is: why no AAE?! I have two theories. The first is that she couldn't hear it. More precisely, she could hear something, but didn't know exactly what it was that made people speak differently, and if she tried to imitate it, found out she did so poorly. I did my undergrad in Canada, but while I knew there was something going on with Torontonians' accents, I couldn't imitate it successfully until I learned about Canadian Raising. I knew words: sore-E for 'sorry,' and when to use "eh", but saying "ooot" for "out" would have just oooted me as faking the accent. It may be that, having grown up in the Midwest and only encountering speakers of AAE after she went to Howard, at 18 years old, she just couldn't successfully acquire the phonology, and knew that if she tried it would sound like a caricature. A friend of mine from Georgia can hear the New Yorkers around her saying /ɔ/ in coffee and boss, but can't reliably produce it, and doesn't know which words to put it in. Others in a similar position try, but say "kwafie" and think they're succeeding. Maybe Doležal can hear AAE phonology, but isn't sure when to use it, and knows it sounds 'off' when she does.

However, this can't be the full story. She could have taken classes on African American English at Howard, and in an immersion environment for four years, she could have gotten good at it.

The other aspect to this situation is that she did the equivalent of a cost-benefit analysis and decided she didn't need to fake AAE to pass. One of my academic interests is Game Theory, and specifically Bayesian Signaling Games, which are surprisingly applicable here. In the simplest of this class of games, you have two possible types, and at least two possible signals. Agents try to infer the type of other agents by the signals those agents send. In pooling equilibria, there's no way of telling: agents of both types send the same signal and you can't glean anything from it. In separating equilibria, it's the opposite: you can completely categorize agents by the type of signal they send. Of course, the interesting class of games are those with semi-separating equilibria: this opens up the possibility of dishonesty.
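
A short sketch of the receiver's side of such a game may help. It applies Bayes' rule to a two-type, one-signal setup and compares the three cases just described. The function name, the abstract type labels A and B, the 50/50 prior, and the 20% mimicry rate are all hypothetical numbers chosen for illustration, and the sketch only computes beliefs; it does not solve for an equilibrium.

```python
def posterior(prior_a, p_signal_given_a, p_signal_given_b):
    """P(sender is type A | signal observed), by Bayes' rule."""
    joint_a = prior_a * p_signal_given_a
    joint_b = (1 - prior_a) * p_signal_given_b
    return joint_a / (joint_a + joint_b)

prior = 0.5  # hypothetical prior probability that a sender is type A

# Pooling: both types always send the signal, so observing it
# leaves the receiver's belief exactly where it started.
pooling = posterior(prior, 1.0, 1.0)

# Separating: only type A ever sends the signal, so observing it
# identifies the sender's type with certainty.
separating = posterior(prior, 1.0, 0.0)

# Semi-separating: type B mimics type A some of the time
# (a hypothetical 20% here), so the signal is informative but fallible.
semi = posterior(prior, 1.0, 0.2)

print(pooling, separating, round(semi, 3))
```

The semi-separating case is the one that leaves room for dishonesty: the receiver ends up fairly confident, and a mimic benefits from that confidence.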

In the Game Theory literature, there's also a distinction between cheap talk and costly signaling. The basic idea is that a signal that costs the sender something is potentially more trustworthy. If it costs me nothing to tell you something, it costs me nothing to lie, and if our interests are not aligned, you should be wary. Conversely, if our interests aren't aligned and I send a costly signal, that signal might tell you something useful; otherwise, why would I incur the cost of sending it?

If you were very white and chose to lie and pose as black, you would have to make decisions about what gives you the most bang for your buck, so to speak. It may be that Doležal made the evaluation that trying to use AAE features in her speech was already past the point of diminishing returns. In order to be immediately raced as black, you have to do something about appearance, although appearance alone isn't enough (the literature on passing and assimilation is relevant here). If you're going to pull off the deception, and if you're being rational about it, you want to do the minimum necessary to lie effectively, assuming going out of your way to lie incurs cost. She's trying to get the most value out of tricking people into believing she's black (being paid to teach classes, paid to be interviewed about black women's experiences, presiding over a chapter of the NAACP: she's definitely getting value, even if we limit ourselves to financial terms only). She does so by doing the least: change her hair, think carefully about wardrobe, spray-tan but not too much. Monitoring your speech all day every day to imitate a dialect you did not acquire in the critical period? That's WAY harder.

In this respect, Doležal is reminiscent of various animals that successfully invade ecological niches, like bird species that replace other birds' eggs with their own. She found the one niche where black women might have a slight advantage, and she did the minimum necessary to successfully signal that she was a black woman. And it worked for a decade. What's weird is that there are white professors of African American Studies, and white presidents of NAACP chapters, so it's not clear how much more advantage she got from posing as black, and she obviously incurred a cost (the enormous cost of being raced as black in America), although it may be that her values were such that she got some perverse benefit out of experiencing that cost (in a discussion of Bayesian Games in his excellent textbook on Game Theory, Steven Tadelis refers to a type that derives perverse pleasure from what should be a losing strategy as the "crazy" type).

It seems that the obvious cost of being black in America was strong enough that, in combination with the minimum necessary changes to plausibly look some kind of black, everyone just went with it, since it is a priori bat-shit crazy to pose as a disenfranchised type rather than posing as the type with the significantly higher expected value (in everything from educational outcomes, to interview call-backs and job prospects, to heart disease and life expectancy, to interactions with the police. EDIT: Here's a clue to the expected value). So, given that it seems crazy to pose as black, everyone just went with it -- for a decade, people made the Bayesian calculation "what is the likelihood she's not black and just faking it given (1) her appearance and overall bearing, and (2) the relative costs and benefits of being black versus being white in America?"...and came to the rational, but wrong, conclusion that she was being truthful.

If the above is right, it has an upsetting implication for the trans-racial camp. She claims she feels black, and that she's really black, whether her ancestry is or is not. If she has such an affinity for blackness, why then do the bare minimum to pass? I'm not black, but I'm a white ally with positive feelings toward a number of black cultures, and I use AAE not just because I am natively able to speak it, but because I like it and I respect it. What is most disturbing about what's coming to light about Doležal is that she seems to have a love-hate relationship with the idea of blackness that tends surprisingly toward hate, and tends toward caricature where it's love. She sued Howard University for being pro-black at her (white) expense (and lost), and then did the minimum to take on the mantle of blackness to benefit in precisely the ways she claimed actual black people were benefiting at her expense, and she did so in places where there are the fewest actual black people around to compare against or to call her bluff. And, now, Black Twitter has called her bluff precisely in this arena, with #AskRachel, giving multiple choice questions 'any' black person should know the answer to, where the answers are [SPOILERS] things like "they smell like outside," or the word for "remote control" is C) Moken Troll.

I'm not sure what more to say about this other than: I can'eem deal right now.

-----

©Taylor Jones 2015

SoCal is Getting Fleeked Out

February 23, 2015 by Taylor Jones

For anyone who's been living under a rock for the past few months, there is a term, "on fleek," that has been around since at least 2003, but which caught like wildfire on social media after June 21, 2014, when Vine user Peaches Monroe made a video declaring her eyebrows "on fleek."

Since then, the apparently non-compositional phrase on fleek has been wildly popular, and has generated the usual discussion: both declarations that it is literally the worst and "should die," and heated debates about what exactly on fleek even means. People seem to be divided on the question of whether it's synonymous with "on point." There is also a great deal of disagreement as to what can and cannot be on fleek, with "eyebrows" now the prototype against which things are measured.

After a conversation with Mia Matthias, a linguistics student at NYU, I decided to look at other syntactic constructions, thinking it possible -- in principle -- to generalize from on fleek to other constructions. Lo and behold, there is a minority of negative-minded people who describe others, snarkily, as "off fleek," (haters). More interestingly, Southern California is getting fleeked out.

Geocoded tweets using variations of fleek. Toronto, you're not fooling anyone.

This is interesting because it suggests that "on fleek" is being re-interpreted, and that it is not necessarily rigidly fixed for all speakers as an idiom. Moreover, it looks like LA is leading the first move away from strictly adhering to the idiom "on fleek," by extending the use of "fleek" to the stereotypically Californian construction of [x]-ed out.

Geocoded tweets using "fleek" in California. Las Vegas, you're not fooling anyone.

I'm looking forward to watching this develop, just as we can watch bae developing (one can now be baeless, for instance). I'm also looking forward to the day one can get a fleek over, or get one's fleek on.

-----

©Taylor Jones 2015

"Eem" Negation in AAVE

February 19, 2015 by Taylor Jones

I recently found out that another paper of mine was accepted for a talk, at the Penn Linguistics Conference (PLC39).

This time, I'll be discussing a phenomenon in some registers of African American Vernacular English that I recently noticed, and have dubbed "eem negation." As with much of my research interest, this is a phenomenon that is not used by everyone, but which suggests a possible syntactic change which may or may not catch on. The basic idea is that Jespersen's Cycle is progressing for some speakers of AAVE, such that for some people, "eem" is available as a negative marker. If the last sentence sounded like gibberish to you, don't worry: the rest of this post will unpack it.

Jespersen's Cycle: two negatives makes one positive

Jespersen's Cycle is a phenomenon named for this handsome fellow:

First things first: Jespersen's Cycle is about Negation. It's very important to start by noting that so-called double negation (e.g., "I don't owe nobody nothin'!") is not inherently wrong or bad, as some style guides, high school English teachers, or annoying relatives who correct your speech obnoxiously at family functions might have you believe. In English, multiple negation is stigmatized, but there is nothing inherent to the grammar of English that makes it bad, as evidenced in part by how many varieties of English use it. That is, what makes it 'bad' is that it is socially evaluated, not anything about the structure of the language. When people say "two negatives make a positive" they are both demonstrably wrong and also weirdly trying to force other people's speech to conform to half-baked mathematical assumptions. Two negatives make a positive when multiplying numbers, but there's no reason language shouldn't be, say, additive instead of multiplicative (e.g., -1 + -1 = -2). Moreover, 'operations on real numbers' is a really bizarre way to think of language.

Instead, many dialects of English, and many other languages have what's called negative concord. That is, elements of a sentence should agree in negation -- meaning if one thing is negative, everything is. English has this ("It don't never mean nothing to no one nohow."), but so do French, Spanish, Italian, Russian, Chinese, and many, many other languages. That is, two negatives make one positive -- if by that you mean two negatives make a person certain you really meant it.

The Cycle that Jespersen observed was about how some languages change how they encode negatives over time. There are a few stages: in the first, you have one negative word. People, for whatever reason, choose to intensify the force of their sentence with another word (I didn't walk a step, I didn't drink a drop, etc.). Later, the intensifier is learned by children learning the language, and interpreted as obligatory. So then you have two words doing the same thing -- encoding negation. Then, the original word may become reanalyzed as optional - the word that was the intensifier becomes the word that 'means' negation. Finally, the original negation may disappear (and the new word takes its place as the sole marker of negation).

English has gone through this process, so that you get something like:

I ne say >> I ne say not >> I say not >> I do not say

For the moment, we'll ignore the can of worms that is the introduction of "do". Similarly, French has undergone this process. Negation used to be indicated with "ne" and is now indicated with "pas" (which used to mean "step" as in I don't walk a step). Many textbooks and fuddy-duddy teachers will claim that you should say something like "je ne dis pas" to mean "I don't say," however, in modern spoken French, the ne is just...not there.

Note: the "jeo" is not a typo; that's just an older form.

Many, many languages have undergone or are undergoing such a process, including Greek (6 times!), a number of varieties of Arabic, French, and English. I'm arguing that in one dialect of English, it may be happening again.

U.O.E.N.O it

Here, I'm assuming a basic level of knowledge about AAVE -- that it exists, that it is a valid dialect, etc. (A quick primer can be found here).

So. There exists a word we'll call eem. If you search for it on Twitter, you'll notice a few things: it's used between 500 and 1,000 times a day. It shows up in the context of negation 98% of the time. It looks a lot like the word even. It can sometimes be spelled een.

My argument is that eem is not the same word as even. This is not a trivial thing to posit, since variations on even are common, and people often tweet how they speak. Moreover, that other 2% is made up almost entirely of people saying "eem = even."

Now, it's important to note that while I used Twitter to compile a lot of data quickly, it is by no means my only source of data -- it's a quick way to get a lot of data when you have the right kind of question. Other sources of data are sociolinguistic interviews, TV, movies, and music (there are 50-100 tokens in the extended cut of the song UOENO it (i.e., you don't eem know it), and Childish Gambino uses it in his song sweatpants, among others.).

Why claim that eem is not even? Well, for one, it only shows up as eem when there's negation, or some sort of counterfactual. That is, you can say:

"I don't eem know" or "he stopped before I eem noticed,"

...but you almost never see:

"eem Jamal was at the party"

...and you don't ever see:

*"2, 4, and 6 are all eem numbers." (the asterisk means 'ungrammatical').

In fact, in all instances I've seen of the second example ("eem Jamal was at the party") I haven't been convinced it was a native speaker of AAVE, and not someone who came across an "eem = even" tweet. That said, it's roughly 1% of tweets that have that (almost exclusively young white women, for what that's worth), and I have never come across it in speech.

So, eem is not just a phonological reduction of "even," (like sebm for "seven", etc.), although that's likely where it came from, nor is it just a new orthographic convention on Twitter.
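
As a rough illustration of the kind of first-pass sorting this distributional argument relies on, here is a small Python sketch that flags whether a tweet containing eem also contains some other negator. The negator list, the tokenizer, and the function name classify are assumptions made up for this example, not the procedure actually used; every hit would still need to be read by hand, since "no other negator" covers both eem as sole negator and the rare even-like uses.

```python
import re

# Hypothetical, incomplete list of negators; a real study would need a
# much more careful inventory and manual checking of each hit.
NEGATORS = {
    "ain't", "aint", "don't", "dont", "didn't", "didnt",
    "can't", "cant", "not", "never", "ion",
}

# Lowercase word tokens, keeping apostrophes inside words.
TOKEN = re.compile(r"[a-z']+")

def classify(tweet):
    """Sort a tweet by whether eem co-occurs with another negator."""
    text = tweet.lower().replace("\u2019", "'")  # normalize curly apostrophes
    tokens = TOKEN.findall(text)
    if "eem" not in tokens:
        return "no eem"
    if any(tok in NEGATORS for tok in tokens):
        return "eem with other negation"
    return "eem without other negation"

# Examples taken from the post itself.
for example in ["I don't eem know",
                "he stopped before I eem noticed",
                "I'm da shit, I eem care."]:
    print(example, "->", classify(example))
```

Matching whole tokens matters here: a plain substring search for "eem" would also catch seem, esteem, and so on.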

Now for the cool part:

Not only do you get a lot of negation with eem, but you get a number of cool other things. Note, examples below have the original first, a rough gloss below that, and a more colloquial 'translation' below that.

(1) There are people who use eem and also then intensify their sentence with even, as in:

"I ain't eem even feelin' it."
I am NEG NEG even feeling it.
'I don't like it.'

(2) There are people who only negate with eem. I cannot overstate how cool this is. It is trivially easy to find example after example of tweets where the only negation is eem, as in:

"Ya'll some troublemakers, but I eem mad tho."
You PL some troublemakers, but I NEG mad, though.
You all are some troublemakers, but I'm not mad, though.

"I'm da shit, I eem care."
I'm the shit, I NEG care
'I'm great, I don't care (about anything/anyone/etc.)'

"Irony is: in most states, strippers can eem get naked! Dey literally dancin in bathing suits rackin da fck up..."
Irony is: in most states, strippers can NEG get naked...
Irony is: in most states, strippers can not get naked

(3) There are people who use eem as the only marker of negation, and then intensify it with even:

"I'ma act like I eem even read that."
I FUT act like I NEG even read that
I am going to act like I didn't even read that.

"You...eem even know it."
you NEG even know it
you don't even know it.

This all suggests that eem is not the same as 'even' (although it's very likely descended from 'even'), and that eem is a marker of negation -- in some cases, the only one.

Phonology

There are a number of interesting phonological processes around eem, but the discussion is pretty arcane and thorny, so I'm saving it for my conference talk. The basic gist is that eem can be pronounced in a number of ways, including een, and just a long, nasalized vowel. Because of the patterns in the audio examples I have of it (linguists: /m/ before labials, /n/ or /m/ before coronals, nasalized vowel before vowels, but also /m/, not engma, before velars), I argue that it is underlyingly eem.

The Big Question: Where's this going?

The thing about language change is that we often only know about it in hindsight. Given the tidal wave of data the Internet era has ushered in, we're now able to see trends like this in real time -- but I don't know of a way to use this to predict the future. 25,000-30,000 tokens of eem per month on Twitter is simultaneously massive -- in fact, so massive it's too much to deal with, since we still need to read each hit to determine syntactic function, presence of other negation, etc. -- and weirdly way too little, given that it's probably much less than 1% of total use of negation in AAVE. Think about how many possible negative sentences could be uttered on any given day by ~40 million people, and 1,000 Tweets a day of eem is piddling. For the sake of comparison, there were more than 183,000 tokens of "not" in the last hour alone. Moreover, the last time this kind of thing happened in English, it took centuries.
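
For a sense of scale, the back-of-the-envelope arithmetic behind that comparison can be written out. This sketch uses only the two figures quoted above and assumes, purely for illustration, that the quoted hour of "not" was a typical one. Note that the "not" count is for Twitter as a whole, not for AAVE speakers only, so the ratio is a loose comparison and not an estimate of eem's share of AAVE negation.

```python
eem_per_day = 1_000      # upper daily figure for eem quoted in the post
not_per_hour = 183_000   # tokens of "not" in a single hour, quoted in the post

# Monthly eem count at the upper daily rate (30-day month).
eem_per_month = eem_per_day * 30

# Daily "not" count, ASSUMING the quoted hour was typical.
not_per_day = not_per_hour * 24

# How eem's daily count compares with "not" across all of Twitter.
ratio = eem_per_day / not_per_day

print(eem_per_month)
print(not_per_day)
print(f"{ratio:.4%}")
```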

What this means, though, is that we are possibly able to track this kind of change in real time, for the first time in history. Either way, if it fizzles out and disappears or if it spreads to completion such that the standard way of negating a sentence becomes eem after a few hundred years, we stand to learn something about language change. I can eem hardly contain my excitement.

-----

©Taylor Jones 2014

The Problem With Twitter Maps

December 25, 2014 by Taylor Jones

Twitter is trending

I'm a huge fan of dialect geography, and a huge fan of Twitter (@languagejones), especially as a means of gathering data about how people are using language. In fact, social media data has informed a significant part of my research, from the fact that "obvs" is legit, to syntactic variation in use of the n-words. In less than a month, I will be presenting a paper at the annual meeting of the American Dialect Society discussing what "Black Twitter" can tell us about regional variation in African American English (AAVE). So yeah, I like me some Twitter. (Of course, I do do other things: I'm currently looking at phonetic and phonological variation in Mandarin and Farsi spoken corpora).

Image of North America, entirely in Tweets, courtesy of Twitter Visual Insights: https://blog.twitter.com/2013/the-geography-of-tweets

Moreover, I'm not alone in my love of Twitter. Recently, computer scientists claim to have found regional "super-dialects" on Twitter, and other researchers have made a splash with their maps of vocatives in the US:

More and more, people are using social media to investigate linguistics. However, there are a number of serious dangers inherent to spatial statistics, which are exacerbated by the use of social media data.

Spatial statistics is developing rapidly as a field, and there are a number of excellent resources on the subject I've been referring to as I dig deeper and deeper into the relationship between language and geography. Any of these books (I'm partial to Geographic Information Analysis) will tell you that people can, and do, fall prey to the ecological fallacy (assuming that some statistical relationship that obtains at one level, say, county level, holds at another level -- say, the individual). Or they ignore the Modifiable Areal Unit Problem -- which arises out of the fact that changing where you draw your boundaries can strongly affect how the data are distributed within those boundaries, even when the change is just in the size of the unit of measurement.
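
The Modifiable Areal Unit Problem is easy to see even in one dimension. The following Python sketch bins the same six points twice, with the same bin width but different boundary positions. The points, the function name bin_counts, and the bin settings are all invented for illustration.

```python
import math
from collections import Counter

# Six made-up positions along a line: two loose groups of three.
points = [0.9, 1.1, 1.2, 2.8, 2.9, 3.1]

def bin_counts(points, width, origin):
    """Count points per bin, for bins of a given width starting at origin.

    Returns counts for the non-empty bins, ordered left to right.
    """
    counts = Counter(math.floor((p - origin) / width) for p in points)
    return [counts[i] for i in sorted(counts)]

# Same data, same bin width, boundaries shifted by one unit.
print(bin_counts(points, width=2, origin=0))
print(bin_counts(points, width=2, origin=1))
```

With boundaries at 0, 2, 4 the points split evenly between two units; shift the boundaries to -1, 1, 3, 5 and four of the six points fall into a single unit, which looks like a cluster that the data themselves do not require.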

The statistical consideration that most fascinates me, and seems to be the most likely to be overlooked in dealing with exciting social media data, however, is the problem of sampling.

Spatial Statistics aren't the same as Regular Statistics.

In regular statistics, more often than not, you study a sample. You can almost never study an entire population of interest, but it's not generally a problem. Because of the Law of Large Numbers, the bigger the sample, the more likely you are to be able to confidently infer something about the population the sample came from (I'm using the day-to-day meanings of words like "confidence" and "infer"). However, in the crazy, upside down world of spatial statistics, sampling can bias your results.

In order to draw valid conclusions about some kinds of spatial processes, it is necessary to have access to the entire population in question. This is a huge problem: If you want to use Twitter, there are a number of ways of gathering data that do not meet this requirement, and therefore lead to invalid conclusions (to certain questions). For instance, most people use the Twitter API to query Twitter and save tweets. There are a few ways you can do this. In my work on AAVE, I used code in Python to interact with the Twitter API, and asked for tweets containing specific words -- the API returned tweets, in order, from the last week. I therefore downloaded and saved them consecutively. This means, barring questionable behavior from the Twitter API (which is not out of the question -- they are notoriously opaque about just how representative what you get actually is), I can claim to have a corpus that can be interpreted as a population, not a sample. In my case, it's very specific -- for instance: All geo-tagged tweets that use the word "sholl" during the last week of April, 2014. We should be extremely careful about what and how much we generalize from this.

Many other researchers use either the Twitter firehose or gardenhose. The former is a real-time stream of all tweets. Because such a thing is massive and unmanageable, and requires special access and a supercomputer, others use the gardenhose. However, the gardenhose is a(n ostensibly random) sample of 10% of the firehose. Depending on what precisely you want to study, this can be fine, or it can be a big problem.

Why is sampling such a problem?

Put simply, random noise starts to look like important clusters when you sample spatial data. To illustrate this, I have created some random data in R.

I first created 1,000 random x and 1,000 random y values, which I combined to make points with random longitudes (x values) and latitudes (y values). For fun, I made them all with values that would fit inside a box around the US (that is, x values from -65 to -118, and y values from 25 to... Canada!). I then made a matrix combining the two values, so I had 1,000 points randomly assigned within a box slightly larger than the US. That noise looked like this:
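
For readers who want to play along, here is a minimal sketch of that setup in Python. The original was made in R, and this is not that code. The upper latitude of 50, the function name, and the seed are assumptions for illustration; the description above only says the box runs up to Canada.

```python
import random

def random_points(n, lon_range=(-118.0, -65.0), lat_range=(25.0, 50.0), seed=None):
    """Return n (longitude, latitude) pairs drawn uniformly from a bounding box.

    The default box loosely surrounds the contiguous US; the 50-degree
    northern edge is an assumed value, not one taken from the post.
    """
    rng = random.Random(seed)
    return [(rng.uniform(*lon_range), rng.uniform(*lat_range)) for _ in range(n)]

# 1,000 points of pure noise; the seed only makes the run repeatable.
points = random_points(1000, seed=42)
print(points[:3])
```

Because every coordinate is drawn independently and uniformly, any cluster that shows up in a plot of these points is there by chance alone.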

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

Before we even continue, it's important to note two things. First, the above is random noise. We know this because I totally made it up. Second, before even doing anything else, it's possible to find patterns in it:

A density contour plot of random noise. Sure looks like something interesting might be happening in the upper left.

Even with completely random noise, some patterns threaten to emerge. What we can do if we want to determine if a pattern like the above is actually random is to compare it to something we know is random. To get technical, it turns out random spatial processes behave a lot like Poisson distributions, so when we take Twitter data, we can determine how far it deviates from random noise by comparing it to a Poisson distribution using a Chi-squared test. For more details on this, I highly recommend the book I mentioned above. I've yet to see anyone do this explicitly (but it may be buried in mathematical appendices or footnotes I overlooked).
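
One common way to set up that comparison is a quadrat count: lay a grid over the map, count the points in each cell, and compute a chi-squared statistic against the equal counts a random (Poisson) process would produce on average. The sketch below is one hedged reading of that idea in Python, not a procedure taken from the book; the grid size, function names, and bounding box are all assumptions.

```python
import random

def quadrat_counts(points, lon_range, lat_range, nx, ny):
    """Count points in each cell of an nx-by-ny grid over the bounding box."""
    lon_min, lon_max = lon_range
    lat_min, lat_max = lat_range
    counts = [0] * (nx * ny)
    for lon, lat in points:
        # min(...) keeps points that sit exactly on the far edge in the last cell.
        col = min(int((lon - lon_min) / (lon_max - lon_min) * nx), nx - 1)
        row = min(int((lat - lat_min) / (lat_max - lat_min) * ny), ny - 1)
        counts[row * nx + col] += 1
    return counts

def chi_squared_csr(counts):
    """Chi-squared statistic against equal expected counts, plus degrees of freedom."""
    expected = sum(counts) / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, len(counts) - 1

# Pure noise in an assumed box around the US, counted on a 10 x 5 grid.
rng = random.Random(0)
box_lon, box_lat = (-118.0, -65.0), (25.0, 50.0)
points = [(rng.uniform(*box_lon), rng.uniform(*box_lat)) for _ in range(1000)]

counts = quadrat_counts(points, box_lon, box_lat, nx=10, ny=5)
stat, df = chi_squared_csr(counts)
print(f"chi-squared = {stat:.1f} on {df} degrees of freedom")
```

The statistic is then compared against a chi-squared distribution with that many degrees of freedom, using a table or a statistics library. A value far out in the upper tail suggests clustering beyond what noise would produce; for pure noise it should usually land near the number of degrees of freedom. Note that the result depends on the grid you choose, which is the Modifiable Areal Unit Problem again.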

This is what happens when we sample 100 points, randomly. That's 10%; the same as the Twitter gardenhose:

And this is what happens when we take a different 100 point random sample:

Another random 100 point sample from the same population.

The patterns are different. These two tell different stories about the same underlying data. Moreover, the patterns that emerge look significantly more pronounced.
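
The same experiment is easy to repeat without any plotting. The following Python sketch (again, not the original R code; the box, the 5-degree cells, and the seed are assumptions) draws two 10% samples from one population of random points and reports the most crowded grid cell in each.

```python
import random
from collections import Counter

rng = random.Random(0)
# Population: 1,000 uniformly random points in an assumed box around the US.
population = [(rng.uniform(-118.0, -65.0), rng.uniform(25.0, 50.0)) for _ in range(1000)]

def busiest_cell(points, cell_size=5.0):
    """Return ((lon_bin, lat_bin), count) for the most crowded square grid cell."""
    cells = Counter((int(lon // cell_size), int(lat // cell_size)) for lon, lat in points)
    return cells.most_common(1)[0]

# Two independent 10% samples, the same fraction as the gardenhose.
sample_a = rng.sample(population, 100)
sample_b = rng.sample(population, 100)

for name, pts in [("population", population), ("sample A", sample_a), ("sample B", sample_b)]:
    cell, count = busiest_cell(pts)
    print(f"{name}: busiest cell {cell} holds {count} of {len(pts)} points")
```

Try a few different seeds. The "hot spot" tends to move from sample to sample, and the busiest cell usually holds a larger share of a 100-point sample than the busiest cell of the full population does, which is why the sampled maps look so much more dramatic.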

To give a clearer example, here is a random pattern of points actually overlaying the United States that I made, after much wailing, gnashing of teeth, and googling of error codes in R. I didn't bother to choose a coordinate projection (relevant XKCD):

And here are four intensity heat maps made from four different random samples drawn from the population of random point data pictured above:

This is bad news. Each of the maps looks like it could tell a convincing story. But contrary to map 3, Fargo, North Dakota is not the random point capital of the world, it's just an artifact of sampling noise. Worse, this is all the result of a completely random sample, before we add any other factors that could potentially bias the data (applied to Twitter: first-order effects like uneven population distribution, uneven adoption of Twitter, biases in the way the Twitter API returns data, etc.; second-order effects like the possibility that people are persuaded to join Twitter by their friends, in person, etc.).

What to do?

The first thing we, as researchers, should all do is think long and hard about what questions we want to answer, and whether we can collect data that can answer those questions. For instance, questions about frequency of use on Twitter, without mention of geography, are totally answerable, and often yield interesting results. Questions about geographic extent, without discussing intensity, are also answerable -- although not necessarily exactly. Then, we need to be honest about how we collect and clean our data. We should also be honest about the limitations of our data. For instance, I would love to compare the use of nuffin and nuttin (for "nothing") by intensity, assigning a value to each county on the East Coast, and create a map like the "dude" map above -- however, since the two are technically separate data sets based on how I collected the data, such a map would be completely statistically invalid, no matter how cool it looked. Moreover, if I used the gardenhose to collect data, and just mapped all tokens of each word, it would not be statistically valid, because of the sampling problem. The only way that a map like the "dude" map that is going around is valid is if it is based on data from the firehose (which it looks like they did use, given that their data set is billions of tweets). Even then, we have to think long and hard about what the data generalizes to: Twitter users are the only people we can actually say anything about with any real degree of certainty from Twitter data alone. This is why my research on AAVE focuses primarily on the geographic extent of use, and why I avoid saying anything definitive about comparisons between terms or popularity of one over another.

Ultimately, as social media research becomes more and more common, we as researchers must be very careful about what we try to answer with our data, and what claims we can and cannot make. Moreover, the general public should be very wary of making any sweeping generalizations or drawing any solid conclusions from such maps. Depending on the research methodology, we may be looking at nothing more than pretty patterns in random noise.


Maya glyphs in stucco at the Museo de Sitio in Palenque.

Language Jones and the Temple of...Burritos?

November 10, 2014 by Taylor Jones

Like most people, I enjoy burritos. Unlike most people, I also enjoy learning about ancient hieroglyphic writing systems, because I’m Indiana—er—Language Jones. A while back, I bought Stone & Zender’s Reading Maya Art: A Hieroglyphic Guide to Ancient Maya Painting and Sculpture, and borrowed or checked out a number of similar books from the library. I skimmed and enjoyed them, and then returned them. Stone & Zender took a place on my bookshelf and I moved on to other things, not anticipating I’d be able to see the Mayan ruins in Mexico any time soon.

Then it happened.

I went to a Chipotle in Philadelphia, looked at the wall, and realized their design was more than just decoration. There, looking back at me, was K’awiil, also known as God K, the “most ubiquitous god in Classic Maya art.” Next to K’awiil was a glyph representing a lord, possibly Juun Ajaw, one of the Hero Twins. All over the wall I was seeing bits and pieces of legible, decipherable Classic era Mayan art. Here, the glyph for mountain. There, a shark.

Maya (?) glyphs at the Chipotle on 15th and Walnut, in Philadelphia.

Then I noticed some weird things. Like what I thought might be K’awiil had a bird on his head (because, why not put a bird on it?). A lord seemed to be writing, but with a burrito instead of a reed stylus. My studies had not prepared me for this.

I couldn’t just tell myself “that’s interesting,” and move on. So I did some research, and found that the wood and metal sculptures at many (all?) Chipotle locations were provided by a company named Mayatek Inc. Some of their artwork is surprisingly informed, and the names suggest that the creators are familiar with Maya art. Other works are named strange things that suggest little familiarity with what the art represents. There are some clearly modern embellishments on ancient symbols (like putting a bird on it, but said bird is not a quetzal).

For instance, the above is referred to on the Mayatek website as "dancer." There is a similar sculpture called "warrior dancer." However, anyone with even a passing knowledge of Maya art will recognize this is not a dancer, but a specific god. Lucky for you, reader, I have just such a passing knowledge: This is Chahk, the Maya rain deity. In his right hand is the axe he uses to strike clouds to make it rain. In fact, the above sculpture looks suspiciously like Chahk as represented in the Dresden Codex:

Chahk detail from a Codex style vessel in the Metropolitan Museum of Art. This particular variant is known as Chak Xib Chahk.

Similarly, the below is referred to as "baldy," on the Mayatek website. However, the distinctive mark on the cheek is an indicator that the person depicted is a lady (as in "Lord and ____"). While, as it has been pointed out, ladies can be bald, I would argue the unmarked case is not, and "baldy" is not an appropriate appellation for a lady, bald or not -- suggesting the iconography was copied without full understanding.

In the photograph I took above, from a Chipotle restaurant in Philadelphia, the second glyph in the first column looks like a shark head, the glyph for stone, and a fish. The bottom glyph in the first row looked to me like a lord, although it seemed likely he's Juun Ajaw.

Hilariously (to me) the top right glyph looks like Acan, the god of alcohol, who is associated with swarms of bees, since bees are attracted to xtabentún, an alcohol made from fermented honey and tree bark. And by "associated with bees" I mean "he literally vomits swarms of bees." I'm pretty sure that bit is before he ritually decapitates himself ("Chipotle: so good, you'll vomit bees and decapitate yourself!").

There's also a sculpture that is clearly Ixchel, the jaguar god, from the Dresden Codex.

Perhaps my favorite find was this:

...which looks a lot like God A, one of the Maya Death Gods (which, by the way, is an excellent name for a band). This would not make Chipotle the first major American chain restaurant to decorate with death iconography from another culture (that distinction may go to P.F. Chang's, with their terracotta soldiers), but I'm of the opinion "death by burrito" should be about portion size, and not about inadvertently invoking the wrath of an ancient deity.

In order to get more information, I wrote an email to Dr. Marc Zender, one of the leading scholars on Maya glyphs and author of The Book on the subject, asking if he could tell me whether the bas relief decoration at this Chipotle was imitating some known work or complete gibberish (email title: "a frivolous question"). To my surprise, he responded, and the answer is that it's a little of both. He told me that the artist for Chipotle intended to copy a well-known collection of stucco glyphs from Palenque's Temple 18.

He explained: "The text was commissioned by the early 8th-century king K'inich Ahkal Mo' Nahb, and had fallen from the rear wall of a temple in antiquity. The stuccos were then recovered piecemeal by several different archaeological projects between the 1920s and 1950s. Primarily because their original order couldn't be determined, but also because most of them couldn't be read at that time, the curators at Palenque's archaeological site museum unfortunately ended up mounting them in (unreversible) cement, placing similar signs next to one another and creating a nonsensical text."

He went on to explain that "the Chipotle artist has also picked glyphs at random from this collection and has made his best attempt to copy them. It's not a bad effort in some places, but note the 'bird with wings' the artist has created in the bottom rightmost glyph, as well as some missing or invented details in a few other places." So my intuitions that (1) it was partially invented and (2) the artist followed the Portlandia mantra "put a bird on it" both check out! I was paying attention.

Then, Dr. Zender made my day. "Just for the fun of it," he translated the glyph blocks from Chipotle (left to right, top to bottom):

u-K'AM-ma-K'AJAN?-ch'o-ko
uk'amk'ajan ch'ok
"the youth's rope-taking" (a ceremony)
u-TZ'AK-AJ
utz'akaj
"its count" (calendric information)
WAX-YAX-SIHOOM-ma
"6 Yax" (part of a date)
chu-lu-ku-?
Chuluk ... (pre-accession name of the king)
i-K'A'-yi
i k'a'ayi
"his ... stopped" (a death verb, here referring to the king's father)
TIWOL?-4-ma-ta
Tiwohl Chan Mat (the father of the king)
mu-ka-ja
muhkaj "he was buried" (again referring to the father)
u?-na-ta-la
u naahtal "the first"? (ordinal title?)
MO'-na-bi
... Mo' Nahb (part of the name of the king)

Dr. Zender also explained that the "shrunken head" glyph I thought might be God A is actually a complex Early Classic spelling of the name of the serpent deity Chak Bay Kaan (CHAK-ba-ya-ka-KAAN). He went on: "We're still not sure what bay means, but the other portions of the name are 'Red ... Serpent'."

So there you have it folks. Death verbs. "He was buried." Enigmatic dates. Mysterious serpents. Next time you're at Chipotle, forget the secret menu and instead focus on what one of my colleagues at U Penn enthusiastically referred to as a "disjoined, incoherent stream of historical tidbits." (Said colleague continued, "in that sense, it's not that different from the history of the non-European world that most people get anyway.")

Now if I could only figure out why a restaurant with a Nahuatl (=Aztec) name has Maya glyphs everywhere...

LSA talk preview: Semantic Bleaching of Taboo Words, and New Pronouns in AAVE

October 14, 2014 by Taylor Jones

Note: this post was coauthored with Christopher Hall.

TRIGGER WARNING: this post will discuss profanity, obscenity, taboo language, slurs, and racially charged terms.

I recently received word that an abstract Chris and I submitted to the Linguistic Society of America was accepted for a 30 minute talk at the LSA annual meeting in January of 2015. While exciting, this is also somewhat terrifying, because our research involves not just syntax, but taboo words, dialect divergence, and America's ugly racial history (and present). Outside of academia, there's an enormous amount of potential for misunderstanding, offense, hostility, and other ill feelings. Even among academics there's the potential for hurt feelings.

In brief, our research takes both recent work in syntax and recent work in sociolinguistics, and couples it with good, old-fashioned fieldwork and new computational methods (read: tens of thousands of tweets). However, the subject matter involves the emergence of a new class of pronouns in one (sub-)dialect of English from words that are considered offensive or taboo in other varieties of English. As such, it's potentially quite charged.

Before describing the research, it is absolutely crucial to note that:

We work as descriptive linguists: this means we observe a real-world phenomenon and describe it.
We neither condone nor disapprove of the data. Our job is simply to describe and analyze natural language as it is used in the world.
Both authors are native speakers of the variety of English in question.

So what's the big deal? Well, we argue that there is an emerging class of words that function as pronouns (remember elementary school English class? A pronoun is a word that stands in for another noun or noun phrase) in some varieties of African American Vernacular English (AAVE), that are built out of the grammatical reanalysis of phrases including the n-word. Well, sort of the n-word, because there's excellent evidence that there are actually at least two n-words, and that some speakers of AAVE differentiate between them and use them in different contexts.

WARNING: from here out, we will be discussing the use of words some deem extremely offensive. Seriously, just stop here if such discussion will offend you despite the above points. We will be using the actual words, not variants like b-h and n-. You've been warned!

Some preliminaries:

Pronunciation

One of the most potent slurs in American English is the racial epithet nigger (we warned you!). However, many white people oblivious to history and privilege don't hesitate to muse, "why can they [read: "black" people] use it, then?" Their observation - that some black Americans use what sounds like the same word - is valid, although insisting that makes the use of slurs OK is not valid.

AAVE is (generally) what can be called r-less and l-less. That is, in some contexts, especially at the end of words or syllables and when not followed by a vowel, words that may have an r or l are pronounced as though they do not. The stereotypical Boston accent is r-less: "pahk the car in Hahvahd yahd." (Note: "car" comes before a vowel, and therefore the r is pronounced!).

So when some speakers of AAVE use the word nigga, it is understandably interpreted as an r-less variant of a word that underlyingly has an r. However, the supposed r never shows up, not even intervocalically (jargon for "between vowels").

When people maintain that they're two different words, there seems to be good evidence for that. Note to white people: This does not give you license to use either. If you do not speak AAVE, and chances are you don't, you don't get to use either word. You WILL offend people, and no one will like you.

Semantic Bleaching

This is a term that has existed in linguistics for a long time, which we did not invent, so there is actually no pun intended. It means that a word, over time, loses shades of meaning. For our purposes, there is excellent research on "obscenity" in AAVE, the main argument being that many things that are considered obscene in other dialects have been semantically bleached. Spears (1998), for instance, argues that nigga, shit, bitch, and ass have been semantically bleached. In fact, Collins and Postal have shown that there is a particular grammatical construction that relies on the semantic bleaching of ass: the Ass Camouflage Construction (ACC), as in:

how ya no-phone-havin'-ass gonna call me?

Not content to just rely on the previous literature, we collected data from our stomping grounds: Harlem and the South Bronx, as well as West Philadelphia (mostly, this required little more than going outside and paying attention, although we did take notes on time, place, and type of use). We also used the Twython library for Python to extract and store 10,000 tweets using the word nigga. While this is a huge sample by regular sociolinguistic norms (where 500 data points is impressive), it's worth keeping in mind that it's about 1/60th of what is tweeted in an average afternoon.

Tweets containing nigga from August 19 - September 18, 2014. 16 MILLION tokens.

In none of the 10,000 we read was the word used as an epithet or slur (although there were some cheeky white people trying to test boundaries).

In fact, we argue that in this dialect, it is now human and male by default, but not always (an example of the not always: "I adopted a cat and I love that nigga like a person"). It is also not inherently specified for race, like nigger and other epithets are. In fact, race is often added to it, so the authors may be referred to in our neighborhoods as "that white nigga" and "the black nigga who was with him." Others include "asian nigga," and even "African nigga."

Among those who use the term, it is now a generic term like guy.

This shift in meaning seems to have happened some time after 1972-ish, possibly in conjunction with the rise of the Black Power movement, as an attempt to reclaim the word, similar to some feminists reclaiming bitch, and cunt. It was a necessary prerequisite for the super cool grammatical change our paper is actually about.

Grammatical Change: Pronouns or ...Imposters?!

The real point of our paper is about grammatical change. There exists a class of phrases first described by Collins and Postal, called Imposters. These are phrases that grammatically behave as though they are third person (reminder: he, she, it), but actually have first person (I, we) meaning. Great examples are:

Daddy is going to buy you an ice cream!
This reporter has found himself behind enemy lines.
The authors have already used 3 imposters in this very article.

Where the meanings are:

I am going to buy you an ice cream!
I have found myself behind enemy lines.
We have already used 3 imposters in this very article.

The key here is that the noun phrases behave in the syntax of the sentence as though they are 3rd person, but the actual meaning is first person -- we just decode it.

What we do is argue that there are new pronouns in AAVE, but first we have to argue that they're not just imposters. This is not trivial! For instance, Zilles (2005) argues that Brazilian Portuguese is developing a new first person pronoun, a gente ("ah zhen-tshy"), but Taylor (2009) argues that no such thing is happening, and it's just a popular imposter.

The Paper

We argue that a nigga is becoming a pronoun, meaning "I". The corresponding plural is niggas or niggaz. We also argue that there are two second person vocatives (that is, "terms of address") which are used depending on social deference one wants to show: nigga, and my nigga.

Yes. You read that correctly: we are claiming that saying my nigga signals politeness (...among speakers of this and only this dialect!!! Don't go saying Jones & Hall gave you the green light to say "my nigga" to your black friends!!!).

What's the evidence for pronoun status?

a nigga and my nigga are phonologically reduced. That is, there is a clear difference in pronunciation between the pronoun forms and the terms meaning "a person" and "my friend." To this end, we tend to use anigga and manigga, pronounced /ənɪgə/ and /mənɪgə/ (we leave the original spacing when quoting tweets, though).
No other words can intervene while still retaining the first person meaning. "A friendly nigga said hello" does not mean "I said hello," whereas "anigga said hello" can. The first means that some friendly guy said hello, but it wasn't the speaker.
anigga binds anaphors. No, that's not some kind of Greek fetish; anaphors are words like "myself," "himself," "herself," etc. Binding in this case refers to which anaphors show up with the word. anigga patterns with the first person words, whereas imposters do not. For almost everyone "daddy is going to buy myself an ice cream" is either ungrammatical or sounds like daddy got lost in the middle of his sentence. anigga, on the other hand, is often used with myself, as in "anigga proud of myself."
Other pronouns refer back to anigga. That is, "you read all a nigga's tweets but you still don't know me."
Verbs are conjugated first person, not third person, with anigga. This is totally ungrammatical with imposters, and totally normal for actual pronouns. Example:
"Finna make myself dinner. a nigga haven't eaten all day." Compare that to "Daddy haven't eaten all day; he's going to make myself dinner." Really, really, abysmally bad.
anigga can be used in certain conditions that imposters - like "a brotha" - cannot. For instance, you can say "anigga arrived," with first person meaning, but the only interpretation available for "a brotha arrived" is third person. It's for this reason that we cannot simply substitute the much-less-likely-to-offend "a brotha" in our discussion of these terms.

That's basically it. In every conceivable grammatical test, anigga patterns with actual pronouns and not with imposters.

We then attempt to pinpoint the origin of it, and find that it must have happened some time between 1970 (The Last Poets) and 1992 (Wu-Tang). In 1993, it's already being used in puns in rap music, as in Wu-Tang Clan's "Shame on a nigga (that tries to run game on a nigga)", where the meaning is "shame on a guy who tries to run game on me." The first unambiguous pronoun use we can find in print is from a 1995 interview with ODB ("Ol' Dirty Bastard") of the Wu-Tang Clan, followed shortly by use in a magazine interview with Slick Rick. This is over 100 years after the first records we can find of the use of anigga as an imposter -- all of which are from exceedingly racist old books from the 1880s.

With regards to the terms of address nigga and manigga, the difference seems to be social deference. When in a position of greater authority, nigga is the term of address used toward another person (As in the first minute of this video of possibly the best cooking show for chefs on a budget, and an excellent example for Spears, 1998). When showing deference, manigga is used. This is why there's a clear difference in meaning between "nigga, please," and "manigga, please." The first is dismissive, the second is pleading.

Non-linguists, feel free to skip this technical paragraph. Currently, we're in the process of tallying use in Urban Fiction as a way of getting at the frequency of use. It's exceptionally difficult to get a large enough sample of material to be able to tally use of these new pronouns compared to other pronouns. If you try to compare to the frequency of "I" on Twitter, for instance, you're then comparing against all varieties of English, not just AAVE. If you use some other word as a proxy for AAVE use (hypothetically, tweets that contain the word nigga), you then have a number of other confounds, like potential bias in your data set, or in the case of using nigga, possible lexical priming effects. If you try to do sociolinguistic interviews, you get observer effects that bias the data. Fiction is a good way to get at what the author of a given novel perceives as natural, which we can then compare against other authors and other datasets (e.g., Twitter). The goal right now is simply to get a baseline for comparison so we can begin to home in on a plausible range we can later refine.

Concluding thoughts

It's unlikely that this pronoun will ever replace or even truly rival the usual English pronouns; however, speakers of this variety of AAVE now have a new way of expressing themselves at their disposal. For the moment, the authors have the dubious distinction of potentially being the world's leading experts on the n-words. So we've got that going for us, which is nice.

Big Data and Black Twitter

September 28, 2014 by Taylor Jones

This post is a story of how combining century-old linguistic methods with new sources of data can reveal unexpected insights. It's a small preview of my upcoming talk at the annual meeting of the American Dialect Society, where I will discuss my recent research using social media to map previously undescribed dialect regions in African American Vernacular English (AAVE). It's the intersection of historical linguistics, dialect geography, spatial statistics, and #swag.

Prelude: Maps are Cool

I recently took a class with Bill Labov on Dialect Geography: an under-appreciated subfield of linguistics that had a bit of a heyday in the late 1800s, and which is now starting to make a comeback, thanks in no small part to popular dialect surveys like this one from the New York Times.

In the class, we learned methods of mapping and interpreting spatial data to glean information about regional variation in language use, and to begin to understand language variation and change. We learned how maps like this were made:

l'Atlas linguistique de la France published from 1902 to 1920 by J. Gilliéron and E. Edmont – added colors from the version in Lectures de l'Atlas linguistique de la France de Gilliéron et Edmont. Du temps dans l'espace, published in G. Brun-Trigaud, Y. Le Berre & J. Le Dû (2005)

...but we also learned how to map data using newer, more sophisticated computational methods. For instance, reading geographic data from a comma separated file and mapping the data in the R programming language. More importantly, we learned what the interaction between geographic features, historical migrations, and a 'snapshot' of linguistic data can tell us about our language and ourselves.

Now, in the late 1800s, there were basically two ways that you could collect data for linguistic atlases, informally known as the German Method and the French Method. The German Method was the method Georg Wenker used in 1876, when he sent out 50,000 surveys to German schoolmasters who dutifully sent back 45,000 completed surveys. The flaw in this method is that there is no guarantee of standardization as far as how the data is collected and interpreted. The French Method is what Jules Gilliéron used a decade later: send one trained linguist gallivanting around the countryside on a bicycle for four years, eating baguettes, drinking wine, and conducting sociolinguistic interviews with everyone he can as he moves from town to town. My kind of job! Both methods resulted in gorgeous, detailed, and informative atlases...decades after the data were collected. More recently, enterprising linguists (among them, Dr. Labov) conducted telephone surveys, resulting in the "gold standard" Atlas of North American English. The ANAE gives an enormous amount of granularity to the study of regional dialects in North America -- seriously, click the link and play around, it's awesome.

Big Data

What the ANAE achieves, it does with a mere 792 speakers, intelligently sampled by region. It is a feat of ingenuity and economy.

However, we now have some intriguing new tools at our disposal, thanks to the internet and social media platforms like Facebook and Twitter. To give you an idea: a search for the word "the," -- a pretty good proxy for English use -- returns 607 million tokens in the last month alone. All of it is literally published work. It is, in effect, an enormous corpus of written language. Given the right tools and know-how, anyone can search that published material.

The Speech Problem: Graffiti and The Writing on the (Facebook) Wall

The only hitch is this: writing is not speech. In fact, if you try to figure out how English speakers anywhere pronounce English based on the spelling conventions of academic written English, you're gonna have a bad time. A few sound shifts here, a few hundred years of weird convention there, and you've got a system that doesn't tell you much of anything useful.

Notice, though, I said the spelling conventions of academic English. Many people have a pet peeve they're more than willing to share (especially on reddit, it seems): they hate when others write should of in lieu of should have. This kind of mistake is any historical linguist's favorite thing ever. Why? Because it tells us something about pronunciation. People who write should of have reduced should have to should've and it is coming out in their writing -- should of and should've are totally indistinguishable in casual speech.

It's precisely this kind of error, along with the writings of hand-wringing pedants lamenting the decline of language (among other things), that allows us to reconstruct the pronunciation of Latin as it changed through time. (Aside: ever wonder why it's "inconvenient" but not "inpolite"? A historical linguist can tell you why, and when it happened.) In fact, we get an enormous amount of phonologically relevant information from things like graffiti dick jokes in places like Pompeii. Who says historical linguistics isn't fun?

Error isn't the whole picture though. It's one thing to say that people who struggle with spelling will fall back on sounding things out. It's quite another when the non-standard spelling is intentional. For instance, one task for computational linguists interested in Natural Language Processing (NLP) is to group various spellings into sets that computers can recognize as the same word. To simplify: a computer needs to know that color and colour are the same thing if it's going to process language quickly and effectively. Recent research in NLP has demonstrated that people on social media platforms intentionally write how they speak. That is, they go out of their way to spell things in a non-standard way in order to better communicate how they talk informally. The best part is that this research holds across languages. While an American might be sittin (instead of sitting), a Dutch user of Twitter may well sitte (instead of zitten). This is especially true the further a dialect diverges from the written standard, as in modern dialects of Arabic. It's also true in AAVE, where the orthography you learn in school can't capture the phonological and grammatical nuances of the dialect -- something that writers like Zora Neale Hurston, Toni Morrison, and Ralph Ellison grappled with.
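To make the "same word, different spellings" task concrete, here is a minimal sketch in Python. It assumes a hand-built lookup table; the table name, the function names, and the tiny set of entries are hypothetical and only cover the examples mentioned above. Real NLP systems use far larger, often learned, mappings.

```python
# Hypothetical variant table: each alternate spelling maps to one canonical
# form. The entries are just the examples from the text.
VARIANTS = {
    "colour": "color",
    "sittin": "sitting",
}


def normalize(token: str) -> str:
    """Return the canonical form of a token, or the lowercased token if unknown."""
    lowered = token.lower()
    return VARIANTS.get(lowered, lowered)


def same_word(a: str, b: str) -> bool:
    """True when two spellings normalize to the same canonical form."""
    return normalize(a) == normalize(b)
```

For a dialectologist, of course, the interesting move is the opposite one: keeping the variant spelling around rather than throwing it away, since the spelling is the data.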

Black Twitter: Stigmatized Speech, Innovative Writing

Around the time I was taking the class on dialect geography, I stumbled upon a YouTube video purporting to explain #Blackfolkslang. It's a fun example of what linguists call enregisterment: when a dialect feature gets (consciously) noticed and becomes an overt marker of linguistic belonging. A classic example is the stereotypical Brooklynese fugeddaboudit.

Being a native speaker of AAVE (due to childhood speech community), the forms made intuitive sense to me and were a lot of fun. When I showed them to non-speakers of the dialect out of context, however, they were baffled. "What is ioneem? Is that Arabic?"

I thought it would be fun to dig into their use, and see where these forms were used, and how often. I got help writing a script in Python, using the Twitter API and the Twython package to extract tweets, and started using the mapping tools I was learning in R to check them out.
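The filtering step might look something like the sketch below. This is not the actual script: it skips the API call entirely and assumes the tweets have already been downloaded as dictionaries in the Twitter v1.1 JSON layout, where "coordinates" is either None or a GeoJSON point with longitude first. The function name and the sample records are invented for illustration.

```python
import re


def geotagged_hits(tweets, term):
    """Yield (longitude, latitude, text) for tweets that contain `term`
    as a whole word and carry a point location."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for tweet in tweets:
        text = tweet.get("text", "")
        coords = tweet.get("coordinates")  # None when no location was shared
        if coords and pattern.search(text):
            lon, lat = coords["coordinates"]  # GeoJSON order: longitude, latitude
            yield lon, lat, text


# Invented records for illustration only, not real tweets.
sample = [
    {"text": "yeen even ready",
     "coordinates": {"type": "Point", "coordinates": [-84.39, 33.75]}},
    {"text": "yeen know", "coordinates": None},
    {"text": "nothing to see",
     "coordinates": {"type": "Point", "coordinates": [-75.16, 39.95]}},
]
hits = list(geotagged_hits(sample, "yeen"))
```

Rows like these can then be written out to a CSV file and plotted with whatever mapping tools you prefer.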

It became an obsession.

A few months and a few hundred thousand tweets later, I came to a few realizations. First, there's no consensus. Some people tweet nun (for "nothing"), while others tweet nuttin, and others still tweet nuffin. Second, the forms used vary regionally. Third, the phonological clues these tweets provide can be corroborated by both other media and linguistic informants (informant: a fancy term for people who both speak whatever a linguist is interested in and are willing to talk to one). Lastly, there's not just one "Black Twitter." The Black Twitter that blogs, contributes to NPR, and live-tweets sociology conferences was not the Black Twitter I was reading. I was reading tweets from young adults not represented in the Pew Research Center Internet Project, from young gang members who signal affiliation with spelling (fun fact: crips superstitiously avoid the combo "ck" because it could stand for "crip killa," and will instead favor spellings like "fucc"), and from people who use Twitter as a free analog to both texting plans and dating sites.

Some of the writing was not immediately recognizable. For instance, I was perplexed by yeen for "you ain't" (in part because it's not used in NYC or Philadelphia, I would later find). That is, I was perplexed right until I searched for it on YouTube, and came across dozens of different songs, often self-produced, which use yeen in the lyrics. Similarly, nun could conceivably be pronounced in a number of different ways. French Montana to the rescue! People often tweet lyrics to their favorite songs, and quite a number of them tweeted "nigga i ain't worried bout nun". Whether there is a glottal stop or it's elided for some of these tweeters is not clear, but what is clear is that it is two syllables, not one -- the only way to fit the rhythm.

Ultimately, I gathered data on ~30 terms (among them: yeen, talmbout, eem, ion, sumn -- you ain't, talking about, even, I don't, and something, respectively), and found that all of the variation could be explained by recourse to a handful of variations in pronunciation -- variations which can be corroborated by other means.
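As a rough illustration of the bookkeeping involved, the sketch below counts tokens of the five terms glossed above in a list of strings. The glosses come straight from the text; the function name, the tokenizing pattern, and the sample strings are assumptions made up for this example, not the tooling actually used.

```python
from collections import Counter
import re

# Glosses exactly as given in the text.
GLOSSES = {
    "yeen": "you ain't",
    "talmbout": "talking about",
    "eem": "even",
    "ion": "I don't",
    "sumn": "something",
}


def count_terms(texts, terms=GLOSSES):
    """Count whole-word occurrences of each tracked term across `texts`."""
    counts = Counter()
    for text in texts:
        # Lowercase, then split into runs of letters and apostrophes.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in terms:
                counts[token] += 1
    return counts


# Invented strings for illustration only.
counts = count_terms(["Ion eem know what you talmbout", "yeen say sumn", "ion know"])
```

One caveat: a form like ion is also an ordinary English word, so raw counts like these would still need a pass to weed out tweets about chemistry.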

The Discovery: The Maps Don't Line Up

A handful of computationally minded linguists and linguistically minded computer scientists have been doing work on dialect geography using Twitter data, and I've found their work invaluable in developing this research. One of them, Gabriel Doyle (at UCSD), has demonstrated that dialect forms on Twitter correspond exceptionally well to the established gold standard of the ANAE. Like, uncannily, eerily well. He concluded, after some sophisticated statistical verification, that it's possible to glean geographic information about dialects from Twitter data.

His maps of double modals ("might could") and of the "needs washed" construction ("your car needs washed") line up perfectly with the maps produced by the ANAE and by the Harvard Dialect Survey (HDS).

My maps, however, did not line up.

Now, it has been known for a long time that including data from speakers of AAVE muddies things. In some ways, AAVE speakers do what other people in their general vicinity are doing, but in other ways they seem to do things differently. There's a large body of literature on this, but no national level description of regional variation in AAVE.

The standard maps of dialect regions in North America look like this:

Image from The ANAE, via the Texas English Project website: www.texasenglish.org

Notice the main feature is horizontal bands across the country, spreading from the East Coast. In some maps, the North, Midland, and South extend across the West, which is not given its own region. These regions follow patterns of westward expansion and settlement. In fact, maps of differences in building materials used in making cabins line up nicely with maps of dialect regions.

The thing is, AAVE does not share the same history as other North American dialects. Obviously, it is meaningless to discuss patterns of "settlement" when referring to black Americans, and while there is no consensus on the mechanics of how AAVE developed, it is understood to be largely an ethnolect, the product of a culture that developed in the last few hundred years shaped by (and despite) slavery, systemic racism, and extreme segregation.

In theory, then, the geographic distribution of AAVE should look different, and it should look roughly like the geographic distribution of Black Americans:

Image courtesy the Rural Assistance Center, from US Census 2010 data.

In some instances, this is what we see. For instance, when mapping AAVE-specific grammatical features like stressed been (which I discuss further here), the pattern lines up nicely with the population data:

initial exploratory plot of stressed been on Twitter

Note that tweets are concentrated in the South and the Northeast, and the areas with the highest black populations have the most tokens. Atlanta stands out particularly, but so do Oakland and LA, Chicago and Detroit. This pattern appears with other terms we'd expect to be non-regional, including nigga, tryna, and finna.

Similarly, enregistered lexical items (that is, local words famous for being local words) show up where we would expect them:

Philly's famously local word, "jawn," mapped on Twitter. Some of the unexpected points, on closer investigation, are people originally from Philadelphia. The two in Florida are someone referring to a friend named Jawn.

DC's newly famous "Jont." For more information, see Frances Sellers' fantastic article in the Washington Post, Is There a D.C. Dialect? It's a Topic Locals are Pretty Cised to Discuss, which discusses research by Georgetown's Minnie Annan.

We see other, unexpected things, however. Things like this distribution of sholl for "sure":

Once we map everything, we get a broad pattern, with some words (tryna, finna) completely non-regional, some (talmbout, sholl) in a band up the middle of the country, some (ioneem, nuffin) solely along the East Coast I-95 corridor, and some (yeen) only in the South:

When I compared the maps of broad patterns of use in AAVE on Twitter to a map of the Second Great Migration from the Schomburg Center for Research in Black Culture, the pattern, and its most likely historical cause, revealed themselves.

What we see on Twitter is exactly what a historical linguist would expect from migration and dispersal, followed in some regions by innovation. The South Central to Midwest corridor and the South share terms like the quotative talmbout and phonological features like the replacement of /r/ with /l/ in the word sure (e.g., "sholl is."). The South, and the South alone, has so-called /ey/-raising, consistent with Southern American English, making you ain't into yeen (whereas other parts of the country simplify it to yain). New York says "nuttin" and "suttin," but D.C. has nuffin, and Philly is split right down the middle by these competing forces:

"My bruva neva syced bout nuffin" - You a bama.

The above is just a small taste; I will be presenting quite a few more maps, and discussing the phonological data in much greater detail in my talk at the American Dialect Society annual meeting, this January. I'm also preparing a paper for publication. The key finding is that the pattern looks like what any historical linguist would expect after migration and innovation.

Why is this a big deal?

I'm extremely excited about this line of research, for a number of reasons:

Not many linguists have been riding the big data wave. Instead, computer scientists with no training in linguistics are compiling huge data sets of, well, language, and they're doing their best to analyze it, and similarly linguists with no training in computer science are often ignoring the new tools at our disposal. In some instances, computer scientists are beating us to interesting discoveries. In others, they're getting flawed research past peer review because they're so unfamiliar with established concepts in linguistics that they think they've discovered "super dialects" when in reality they've stumbled upon register. We should all be collaborating, instead of reinventing the wheel.
This is the first attempt at defining dialect regions in AAVE on a national level, providing a baseline of research - a starting point for other researchers (and me, of course) to refine. For instance, there is a significant body of research that suggests "th-fronting" (that is, pronouncing words with a th like they have f/v, as in nuffin) is universal. While it may be possible to find everywhere (especially now that a Philly rapper has a hit song with the word "mouf" in the title!), it does not appear that way in these data. Moreover, in NYC, it's often interpreted as a marker that the speaker is not from here. Conversely, an informant I interviewed who had recently moved to Waldorf, MD, told me how he had to insist that his children do not say "nuffin," because he didn't want them "sounding like their peers in school," going on to say "everyone around here talks like that." In this way, participation or non-participation in these phonological patterns may be performatively indexing (non)local identity.
This research relies on a new method of gathering data that can be complementary to traditional methods, and can help point toward new hypotheses, and new areas of research. For instance, some of the data suggest a syntactic change in progress (distinct from the one I'm presenting on at LSA 2015, in fact).

Ultimately, I'm excited because social media are new sources of data for linguists to take advantage of, and they're sources that are extremely rich and extremely large. Whereas Georg Wenker needed decades to send out 50,000 surveys and process the results, given the right question, we're on the cusp of being able to gather more data than that in just the better part of an afternoon.

I'm also excited because this research puts black folk back on the map, literally. It's time for a large scale, systematic description of regional patterns in AAVE like what we already have for other North American dialects, and this is a step toward it!