Closed (minded) captions

Today I was tagged into a conversation on Twitter by New York Times best-selling author Morgan Jerkins, who had been watching an episode of The Cleveland Show and happened to have the closed captions on.

I was expecting closed captions that were, at least, an attempt at accurately captioning what was said. Perhaps the transcriptionist misheard or misunderstood, but they would have been transcribing in good faith.

Instead, there was this:


The caption reads “dam-fa-foo-dun-may-hebeyad-shoot.”

That was followed by:


The caption reads “Naw-a-gah-may-mah-beyad, dayum.”

This raises an important question: what is the function of closed captions? Ostensibly, it’s so the viewers know what was said.

In this case, what was said, in IPA (this will be relevant later), was:

[dæ̃ fæ fuw dʌ̃ me͡ɪ hɪ beːʲɪʔ ʃːuʔ næ͡w ɑ͡ɪ gɑː me͡ɪk̚ mɑː beːʲɪd̥ deʲɪ̃ʰ]

The transcript should read “damn fat fool done made his bed? Shoot. Now I gotta make my bed. Damn.”

There are a couple of things happening here.

First, the character is a black character being voiced by a white voice actor, who does not, evidently, have the early life contact with AAE speech communities necessary to speak it natively. He's very good, but he's not perfect, and it's clear that while he's nailed some of the harder parts of some black accents, he's also missed some important nuance, overgeneralized some parts of the accent, and applied the wrong accent to the wrong place. He's noticed that word final consonants are often unreleased, realized as glottal stops, or deleted altogether, but he has overgeneralized, leaving out word final consonants in places where they should appear. In fact, it was his second word, [fæ] instead of [fæʔ], that made me look up who was voicing the character: I've never heard an AAE speaker who would say fah for fat. He noticed that word final nasals (n, m, ng) are often pronounced as nasalization on the vowel, like in French, and not as a following segment. He also noticed that the vowel in bed is often split, so it sounds like the vowels in "play hid." However, the show takes place in Stoolbend, Virginia, and this accent feature is not as common in Virginia AAE. It's common in parts of the Carolinas, and from the Gulf to the Great Lakes along the Mississippi, but not in most of the mid-Atlantic or Northeast. He also overdoes the consonant deletion; this level of syllable coda deletion is only really plausible in Georgia. And he overdoes it with "gotta." That kind of reduction does happen, but not exactly in that context: the word is too slow and too carefully pronounced, so it comes across as caricature.

Caricature brings me to the second point: this is a white actor voicing a black character on a comedy show, where part of the humor is evidently making fun of how he speaks. It should not be controversial for me to plainly state that it looks a lot like minstrelsy. I'm not entirely clear on how Rallo Tubbs is significantly different from Amos 'n' Andy, or from Thomas D. Rice. Evidently, after the killing of George Floyd, even the voice actor realized it was probably a bad look, and he publicly announced he would not be voicing black characters anymore. Why George Floyd, but not Mike Brown or Emmett Till, changed his mind remains a mystery. He made it clear he doesn't want to take work from Black voice actors, but I'm not sure the broader context is clear to him, given that statement and, you know, the decade or so of him doing this work. As I mentioned in my replies on Twitter, it's uncomfortably evocative, to me anyway, of Jim Crow in Dumbo. The crows are clearly a vaudeville/minstrel act, and clearly intended to be speaking AAE ("I-uh be done seen most ev'rything / when I seen an elephant fly!"). They're also voiced by white actors in the 1940s, and the line between imitation as flattery and caricature as mockery is razor thin there (and they're on the wrong side of that line anyway). We can say that they clearly had contact with AAE speakers, and that there's clearly a certain level of respect, but at the end of the day they were taking a job that a Black man simply could not have at that time, to play at the culture, music, and language for laughs. It's no longer the case that a Black voice actor could never get the job (just look at the cast of The Cleveland Show), but there's still a direct line from Al Jolson, through Amos 'n' Andy, through the crows in Dumbo, right up to The Cleveland Show.

Third, and most importantly for this discussion, there are the captions on top of all of that. If you're reading the captions to know what was said, you still don't know what was said! What you get is that the character said something unintelligible. The way I look at it, there are two plausible possibilities, neither of which is good: first, the transcriptionist couldn't make sense of the utterance and did the best they could, assuming it was some kind of gibberish. Or Jive (note to self: write post about Airplane). Second, the transcriptionist thought it was more important to show that the character wasn't speaking "right" than to actually, you know, transcribe what was said. That would explain why "done" was written as "dun." They're pronounced the same, but any time a writer chooses to write something like "eye dun tole yew" instead of "I done told you," they're not telling us much about how a character sounds, but they are telling us a great deal about how we're supposed to perceive that character. Is it really possible that the transcriptionist who had flawlessly transcribed up to that point could no longer tell from context that the character was talking about making his bed? That he said "now," a recognizable word of English, and not "naw"?

This is a really interesting case to me, because it is in some ways very subtle. What has to be behind this choice, any way you slice it, is a certain, often unstated, linguistic ideology. Most of us were taught explicitly in school that writing takes precedence over speech, and that for both writing and speech there is one correct way to do things, which coincidentally overlaps with how well-educated, wealthy White people (but not White Ethnics!) speak. This manifests itself in all aspects of our society, from arguments about pronunciation to whether something "is (really) a word." Built into that ideology is the idea that there is some reason why one way of doing things is better (clarity, logic, authority), and it's never the truth: that the prestige variety has its status because of social norms, not linguistic facts. Lastly, this ideology positions ways of speaking that are not "classroom" English as inferior (and lacking in clarity, logic, and authority). This captioning only makes sense if we recognize that the transcriptionist, the service running these captions unquestioningly (in this case, Hulu), and likely most of the people involved in the show's production view AAE as unintelligible, as something that can function as the butt of a joke, or both. It's a subtle form of anti-blackness that's not necessarily predicated on overt or deep hostility. It's casual.

That’s not to say that all nonstandard spellings are inherently racist, or offensive, or what have you. I’ve even written chapters on how people intentionally represent how they speak with novel spellings (as in “dis tew much”). But in this particular case, there’s no valid reason I can think of why turning on the captions on Hulu should result in “dam-fa-foo-dun”. And this is, weirdly, something you only really see with AAE, and some socially stigmatized varieties of English spoken by (generally poor) white people, like Appalachian English.

As a thought experiment, can you imagine what would happen if Downton Abbey were captioned this way?

“noaw mayde? noaw nah-nee? noaw valette eevun?”

“ihts nayntiyn twuntee sevun, wi-uh mahdun foake”

(If that wasn’t transparent to you, It was the first lines in the Downton Abbey movie trailer).

This was a very interesting counterpoint for me this week, as I’ve been reviewing transcripts of a deposition and I was blown away by the accuracy and professionalism of the court reporter. While mistranscription of AAE (and mock AAE!) is a systemic problem, it’s not a universal one.

I don’t know where people stand on this issue, but I know where I do. While I see mistranscriptions of AAE everywhere, from Netflix to Turner Classic Movies, this is different, in that it’s apparently intentional. We can do better than this.

-----

©Taylor Jones 2020


The Problem With Twitter Maps

Twitter is trending

I'm a huge fan of dialect geography, and a huge fan of Twitter (@languagejones), especially as a means of gathering data about how people are using language. In fact, social media data has informed a significant part of my research, from the fact that "obvs" is legit, to syntactic variation in use of the n-words. In less than a month, I will be presenting a paper at the annual meeting of the American Dialect Society discussing what "Black Twitter" can tell us about regional variation in African American Vernacular English (AAVE). So yeah, I like me some Twitter. (Of course, I do do other things: I'm currently looking at phonetic and phonological variation in Mandarin and Farsi spoken corpora.)

Image of North America, entirely in Tweets, courtesy of Twitter Visual Insights: https://blog.twitter.com/2013/the-geography-of-tweets


Moreover, I'm not alone in my love of Twitter. Recently, computer scientists claimed to have found regional "super-dialects" on Twitter, and other researchers have made a splash with their maps of vocatives in the US (the "dude" map referenced below).


More and more, people are using social media to investigate linguistics. However, there are a number of serious dangers inherent to spatial statistics, which are exacerbated by the use of social media data.

Spatial statistics is developing rapidly as a field, and there are a number of excellent resources on the subject I've been referring to as I dig deeper and deeper into the relationship between language and geography. Any of these books (I'm partial to Geographic Information Analysis) will tell you that people can, and do, fall prey to the ecological fallacy (assuming that some statistical relationship that obtains at one level, say, county level, holds at another level -- say, the individual). Or they ignore the Modifiable Areal Unit Problem -- which arises out of the fact that changing where you draw your boundaries can strongly affect how the data are distributed within those boundaries, even when the change is just in the size of the unit of measurement.
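To make that second problem concrete, here is a toy R sketch of my own (not an example taken from any of those books): the points never move, but simply redrawing the grid over them changes how the "densest" cell looks relative to the rest.

```r
# Toy illustration of the Modifiable Areal Unit Problem (made-up data, made-up grids)
set.seed(42)
lon <- runif(1000, -118, -65)   # fake longitudes in a box around the US
lat <- runif(1000, 25, 49)      # fake latitudes

# Count the same points inside two different sets of boundaries
count_in_grid <- function(nx, ny) {
  gx <- cut(lon, breaks = seq(-118, -65, length.out = nx + 1))
  gy <- cut(lat, breaks = seq(25, 49, length.out = ny + 1))
  table(gx, gy)                 # points per cell
}

coarse <- count_in_grid(4, 3)   # one way of drawing the boundaries
fine   <- count_in_grid(8, 6)   # another way, over the exact same points

# How much the busiest cell stands out relative to the average cell
# changes with the zoning, even though the points never moved.
max(coarse) / mean(coarse)
max(fine)   / mean(fine)
```

Same data, different units, different-looking summary; that is the whole trap.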

However, the statistical consideration that most fascinates me, and that seems most likely to be overlooked in the excitement over social media data, is the problem of sampling.

Spatial Statistics aren't the same as Regular Statistics.

In regular statistics, more often than not, you study a sample. You can almost never study an entire population of interest, but it's not generally a problem. Because of the Law of Large Numbers, the bigger the sample, the more likely you are to be able to confidently infer something about the population the sample came from (I'm using the day-to-day meanings of words like "confidence" and "infer"). However, in the crazy, upside down world of spatial statistics, sampling can bias your results.

In order to draw valid conclusions about some kinds of spatial processes, it is necessary to have access to the entire population in question. This is a huge problem: if you want to use Twitter, there are a number of ways of gathering data that do not meet this requirement, and therefore lead to invalid conclusions (for certain questions). For instance, most people use the Twitter API to query Twitter and save tweets. There are a few ways you can do this. In my work on AAVE, I used code in Python to interact with the Twitter API, and asked for tweets containing specific words -- the API returned tweets, in order, from the last week. I therefore downloaded and saved them consecutively. This means, barring questionable behavior from the Twitter API (which is not out of the question -- they are notoriously opaque about just how representative what you get actually is), I can claim to have a corpus that can be interpreted as a population, not a sample. In my case, it's very specific -- for instance, all geo-tagged tweets that use the word "sholl" during the last week of April, 2014. We should be extremely careful about what and how much we generalize from this.

Many other researchers use either the Twitter firehose or gardenhose. The former is a real-time stream of all tweets. Because such a thing is massive, unmanageable, and requires special access and a supercomputer, others use the gardenhose. However, the gardenhose is a(n ostensibly random) sample of 10% of the firehose. Depending on what precisely you want to study, this can be fine, or it can be a big problem.

Why is sampling such a problem?

Put simply, random noise starts to look like important clusters when you sample spatial data. To illustrate this, I have created some random data in R.

I first created 1,000 random x and 1,000 random y values, which I combined to make points with random longitudes (x values) and latitudes (y values). For fun, I made them all with values that would fit inside a box around the US (that is, x values from -65 to -118, and y values from 25 to... Canada!). I then made a matrix combining the two values, so I had 1,000 points randomly assigned within a box slightly larger than the US. That noise looked like this:

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

" Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1. " "Never tell me the odds!"

Before we even continue, it's important to note two things. First, the above is random noise. We know this because I totally made it up. Second, before even doing anything else, it's possible to find patterns in it:

A density contour plot of random noise. Sure looks like something interesting might be happening in the upper left.
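A contour plot like that takes only a couple of lines of R on top of the points above; this sketch uses kernel density estimation from the MASS package (one way to do it, not necessarily how the original figure was made):

```r
library(MASS)                 # for kde2d (two-dimensional kernel density estimation)
set.seed(1)
x <- runif(1000, -118, -65)   # the same kind of random points as above
y <- runif(1000, 25, 49)

dens <- kde2d(x, y, n = 100)  # estimate density on a 100 x 100 grid
contour(dens, xlab = "longitude", ylab = "latitude",
        main = "Density contours of pure noise")
```

Some regions will always look a little denser than others purely by chance, and the contour plot dutifully draws rings around them.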

Even with completely random noise, some patterns threaten to emerge. If we want to determine whether a pattern like the one above is actually random, we can compare it to something we know is random. To get technical, it turns out that random spatial processes behave a lot like Poisson distributions, so when we take Twitter data, we can determine how far it deviates from random noise by comparing it to a Poisson distribution using a chi-squared test. For more details on this, I highly recommend the book I mentioned above. I've yet to see anyone do this explicitly (but it may be buried in mathematical appendices or footnotes I overlooked).
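For the curious, here is roughly what such a check could look like in R, using simple quadrat counts and a chi-squared goodness-of-fit test (my own sketch of the idea, not a recipe lifted from the book):

```r
# Carve the box into equal cells, count points per cell, and ask whether the
# counts are consistent with complete spatial randomness.
set.seed(1)
x <- runif(1000, -118, -65)
y <- runif(1000, 25, 49)

gx <- cut(x, breaks = seq(-118, -65, length.out = 11))  # 10 columns
gy <- cut(y, breaks = seq(25, 49, length.out = 6))      # 5 rows
counts <- as.vector(table(gx, gy))                      # points in each of the 50 cells

# Under complete spatial randomness, every cell has the same expected count and
# the per-cell counts are approximately Poisson; a chi-squared goodness-of-fit
# test against equal expected counts flags departures from that baseline.
chisq.test(counts)
```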

This is what happens when we sample 100 points, randomly. That's 10%, the same as the Twitter gardenhose:

A 100-point sample.

And this is what happens when we take a different 100 point random sample:

Another random 100-point sample from the same population.

The patterns are different. These two samples tell different stories about the same underlying data. Moreover, the patterns that emerge look significantly more pronounced than anything in the full set of points.
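This is easy to reproduce. The sketch below draws two different 10% samples from the same made-up points and smooths each into density contours; run it a few times and the apparent clusters wander around the map (again, the data and seed are my own):

```r
library(MASS)
set.seed(1)
x <- runif(1000, -118, -65)   # the same made-up population of random points
y <- runif(1000, 25, 49)

par(mfrow = c(1, 2))          # two plots side by side
for (i in 1:2) {
  keep <- sample(1000, 100)   # a 10%, "gardenhose-style" sample
  contour(kde2d(x[keep], y[keep], n = 50),
          xlab = "longitude", ylab = "latitude",
          main = paste("Random 10% sample", i))
}
par(mfrow = c(1, 1))
```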

To give a clearer example, here is a random pattern of points I made, actually overlaid on the United States, after much wailing, gnashing of teeth, and googling of error codes in R. I didn't bother to choose a coordinate projection (relevant XKCD):

And here are four intensity heat maps made from four different random samples drawn from the population of random point data pictured above:

This is bad news. Each of the maps looks like it could tell a convincing story. But contrary to map 3, Fargo, North Dakota is not the random point capital of the world; that hot spot is just an artifact of sampling noise. Worse, this is all the result of a completely random sample, before we add any other factors that could potentially bias the data (applied to Twitter: first-order effects like uneven population distribution, uneven adoption of Twitter, biases in the way the Twitter API returns data, etc.; second-order effects like the possibility that people are persuaded to join Twitter by their friends, in person, etc.).

What to do?

The first thing we, as researchers, should all do is think long and hard about what questions we want to answer, and whether we can collect data that can answer those questions. For instance, questions about frequency of use on Twitter, without mention of geography, are totally answerable, and often yield interesting results. Questions about geographic extent, without discussing intensity, are also answerable -- although not necessarily exactly. Then, we need to be honest about how we collect and clean our data. We should also be honest about the limitations of our data. For instance, I would love to compare the use of nuffin and nuttin (for "nothing") by intensity, assigning a value to each county on the East Coast, and create a map like the "dude" map above -- however, since the two are technically separate data sets based on how I collected the data, such a map would be completely statistically invalid, no matter how cool it looked. Moreover, if I used the gardenhose to collect data and just mapped all tokens of each word, it would not be statistically valid, because of the sampling problem. The only way a map like the "dude" map that is going around could be valid is if it is based on data from the firehose (which it looks like they did use, given that their data set is billions of tweets). Even then, we have to think long and hard about what the data generalize to: Twitter users are the only people we can actually say anything about with any real degree of certainty from Twitter data alone. This is why my research on AAVE focuses primarily on the geographic extent of use, and why I avoid saying anything definitive about comparisons between terms or the popularity of one over another.

Ultimately, as social media research becomes more and more common, we as researchers must be very careful about what we try to answer with our data, and what claims we can and cannot make. Moreover, the general public should be very wary of making any sweeping generalizations or drawing any solid conclusions from such maps. Depending on the research methodology, we may be looking at nothing more than pretty patterns in random noise.

 

-----

©Taylor Jones 2014
