For my dissertation I am investigating regional variation in African American English. The key baseline for comparison is the Atlas of North American English (from here on: ANAE) by Labov, Ash, and Boberg.

The original analysis in the ANAE was done by taking individual (point) observations and dividing them into classes. For example, looking at fronting of the tongue in /uw/ as in goose, they might divide the (normalized) observed frequencies in Hertz into five classes, plot the points, and color them one of five colors based on those divisions. Then, they’d draw a line around apparent clusters (the procedure elaborated in the Atlas is more complicated, but that’s the gist of it). This is not directly comparable with my data, since I have a different number of points in different locations, from a different ethnic group whose population centers are distributed differently.

So the first step in making my data comparable with the ANAE, the “gold standard,” is to apply some statistical transformations that (1) reduce the role of researcher decisions, and (2) interpolate values to create something like a heatmap.

After a few months of thinking and coding, I’ve finally got a procedure that does this. Here, I’ll compare the ANAE maps and my maps of the same data (the TelSur, or “telephone survey” data). I’m looking at the second formant (that is, “F2”) of /uw/ before non-coronals — meaning how far front or back the tongue is in the mouth in words like goose, school, fool, pool, etc., but not words like to, dew, toot, sue, etc. The best I can exaggerate it in writing is to say that it’s the difference between “cool” and “kyewl” but if you’re from the southeast, that may not be a meaningful distinction to you.

First, here’s the ANAE data (mapped using the viridis color palette, so colorblind people can read it too). Warmer colors are higher values in Hertz, corresponding (roughly) to the high point of the tongue being further forward in the mouth:

And this is the ANAE analysis of fronting of /uw/, based on their division of point values into categories and visual inspection and demarcation of clusters. Note that the point locations in the ANAE are deliberately inaccurate; this was a specific choice, made for readability.

What I did next was to interpolate, assigning missing values based on their 10 nearest neighbors. This won’t change the outcome of subsequent steps, but makes it easier to run the same algorithm to process all the different measurements. So here’s the same data, but with missing values filled in based on their nearest neighbors:
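For readers who want to try this kind of nearest-neighbor fill, here is a minimal Python sketch. It assumes planar coordinates, plain Euclidean distance, and an unweighted mean of the neighbors' values; the function name `knn_impute` and the toy numbers are illustrative only and are not taken from the TelSur data or from the procedure behind the maps here.

```python
import math

def knn_impute(points, k=10):
    """Fill missing values (None) with the mean of the k nearest observed values.

    points: list of (x, y, value) tuples; value is None where it is missing.
    Assumes at least one observed value exists.
    """
    observed = [(x, y, v) for x, y, v in points if v is not None]
    filled = []
    for x, y, v in points:
        if v is None:
            # Rank the observed points by straight-line distance to this point.
            nearest = sorted(
                observed, key=lambda p: math.hypot(p[0] - x, p[1] - y)
            )[:k]
            v = sum(p[2] for p in nearest) / len(nearest)
        filled.append((x, y, v))
    return filled

# Toy data: four speakers with a measured F2 (Hz) and one with no measurement.
speakers = [
    (0, 0, 1200.0),
    (1, 0, 1300.0),
    (0, 1, 1400.0),
    (10, 10, 1900.0),
    (0.5, 0.5, None),
]
filled = knn_impute(speakers, k=3)
```

With `k=3` the missing speaker takes the mean of the three nearby speakers and ignores the distant one. Observed values are passed through unchanged, which is why a fill like this mostly matters for bookkeeping rather than for the speakers who already have measurements.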

Then I used Getis-Ord Gi* to smooth the data and highlight hot (and cold) spots. This gives a much clearer (but more abstract) picture of regional patterns. I used the 25 nearest neighbors in this map (I tested 5, 10, 25, and 50, and 25 gives the clearest picture without oversmoothing).
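To make the statistic concrete, here is a rough Python sketch of Gi* with binary k-nearest-neighbor weights. It is an illustration under assumptions: planar Euclidean distance, a weight of 1 for each of the k nearest neighbors and for the point itself (including the point itself is what the star in Gi* refers to), and toy values. The function name `getis_ord_gi_star` is hypothetical.

```python
import math

def getis_ord_gi_star(points, k=25):
    """Return a Gi* z-score for each (x, y, value) point.

    Uses binary weights over the k nearest neighbors plus the point itself.
    Requires k + 1 < len(points) and values that are not all identical.
    """
    n = len(points)
    values = [v for _, _, v in points]
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i, (xi, yi, _) in enumerate(points):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi),
        )[:k]
        hood = [i] + others          # neighborhood includes the focal point
        w = len(hood)                # sum of binary weights (and of their squares)
        local_sum = sum(values[j] for j in hood)
        denom = s * math.sqrt((n * w - w * w) / (n - 1))
        scores.append((local_sum - mean * w) / denom)
    return scores

# Toy data: a cluster of high F2 values and a distant cluster of low ones.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
hot = [(x, y, 2000.0) for x, y in offsets]
cold = [(x + 10, y + 10, 1000.0) for x, y in offsets]
scores = getis_ord_gi_star(hot + cold, k=2)
```

Points in the high cluster get positive scores and points in the low cluster get negative ones. Raising `k` widens each neighborhood, which is the smoothing trade-off described above.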

Finally, I smoothed the map using kriging (a method originally used in mining, and commonly used in weather maps; under its assumptions, it gives the best linear unbiased prediction of intermediate values) and then assigned kriged values to counties.
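As a rough illustration of the idea, here is a small ordinary kriging sketch in plain Python. Several things in it are assumptions made for the example: the exponential variogram and its parameters are hypothetical rather than fitted to any data, coordinates are treated as planar, and each county is represented by a single centroid. The names `variogram`, `solve`, `ordinary_krige`, and the county labels are made up for illustration.

```python
import math

def variogram(h, sill=1.0, rng=5.0):
    """Exponential variogram with no nugget (hypothetical, unfitted parameters)."""
    return sill * (1.0 - math.exp(-h / rng))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ordinary_krige(samples, target):
    """Predict the value at target=(x, y) from samples=[(x, y, value), ...]."""
    n = len(samples)
    # Variogram between every pair of samples, bordered by the constraint
    # that the kriging weights sum to one.
    a = [[variogram(dist(samples[i], samples[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(dist(s, target)) for s in samples] + [1.0]
    weights = solve(a, b)[:n]   # drop the Lagrange multiplier
    return sum(w * s[2] for w, s in zip(weights, samples))

# Toy data: four sample points and two hypothetical county centroids.
samples = [(0, 0, 1200.0), (4, 0, 1500.0), (0, 4, 1600.0), (4, 4, 1900.0)]
county_centroids = {"County A": (2, 2), "County B": (0, 0)}
county_values = {name: ordinary_krige(samples, xy)
                 for name, xy in county_centroids.items()}
```

In this toy setup the centroid that is equidistant from all four samples gets their average, and the centroid that coincides with a sample gets that sample's value back, since kriging without a nugget reproduces the observed points. A real analysis would fit the variogram to the data first, and a dedicated geostatistics package would be a better choice than this hand-rolled solver.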

Because kriging assigns a value to every county, including counties with no observations of their own, caution should be exercised, and the reader should not overinterpret the map. However, it gives a very readable, high-level picture of regional variation.

What’s great about this is that it looks very similar to the findings in the ANAE, but it takes researcher decision making a little further out of it, and it’s now theoretically comparable to a map from a dataset that has a different number of point observations from different locations. So as I finish cleaning and analyzing my African American English accent survey, I now have a way of comparing regional variation in AAE with the dialect regions in white varieties of North American English.
