Data

Let’s first load the main data frame, which contains information about the headwords, the speaker, the vowel environment, and the acoustic features of nasality obtained from the Nasality Automeasure Praat script created by Will Styler:

data <- readRDS("Arabana_data.Rda")

head(data)

##                               filename item speaker headword    gloss V1_env
## 1 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 2 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 3 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 4 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 5 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
## 6 anthunha_pidla_my_name_Maltya_awarda    1  laurie anthunha my, mine    #_N
##   timepoint point_vwlpct point_time freq_f1 freq_f2 freq_f3 amp_f1 amp_f2
## 1         1            0   0.239728     861    1584    2619  39.98  38.71
## 2         2           10   0.251218     861    1584    2619  39.98  38.71
## 3         3           20   0.262708     560    1563    1982  35.70  40.18
## 4         4           30   0.274198     861    1561    2113  36.99  39.08
## 5         5           40   0.285688     581    1472    2057  36.89  38.04
## 6         6           50   0.297178     861    1617    1620  38.41  43.59
##   amp_f3 width_f1 width_f2 width_f3 amp_p0   a1p0 a1p0_compensated p0prominence
## 1   9.96      223       58      465  40.45  -0.47          1.57176         2.72
## 2   9.96      223       58      465  40.45  -0.47          1.57176         2.72
## 3  21.13      468      141     1403  43.47  -7.77         -5.72529         4.08
## 4  27.06      462      208     1791  44.66  -7.67         -5.62947         4.54
## 5  24.00      367      254      769  46.90 -10.01         -7.95671         6.32
## 6  43.59      640       84     2726  47.77  -9.36         -7.32675         7.88
##   amp_p1  a1p1 a1p1_compensated      a3p0 h1h2
## 1  39.98  0.00          4.93580 -30.48983 2.71
## 2  39.98  0.00          4.93580 -30.48983 2.71
## 3  34.73  0.97          3.43353 -22.34065 4.09
## 4  37.55 -0.56          2.15151 -17.60132 4.55
## 5  39.53 -2.64         -0.00981 -22.89949 6.32
## 6  38.41  0.00          2.53667  -4.18175 7.89

For each item, there are 11 separate measurements (i.e. 11 equidistant time points of measure throughout the vowel interval). To obtain a count of the total number of utterances produced by each speaker, we will need to divide the total number of observations for each speaker by 11. Let’s first look at the headwords produced by Laurie, along with their respective English glosses and total counts:

library(dplyr)

data %>% 
  filter(speaker=="laurie") %>% 
  group_by(headword, gloss) %>% 
  summarize(n=n()/11) %>%
  print(n=Inf)

## # A tibble: 63 × 3
## # Groups:   headword [63]
##    headword     gloss              count
##    <chr>        <chr>              <dbl>
##  1 akuru        over there             2
##  2 ampurdu      central, middle        1
##  3 ananthara    1pl.Nom                2
##  4 anpa         2sg.Abs                3
##  5 antha        1sg.Abs               18
##  6 anthunha     my, mine              22
##  7 anti         soon                   1
##  8 antu         2sg.Erg                2
##  9 apira        white gum              7
## 10 apityi       father                 1
## 11 apurlu       apple                  2
## 12 arabana      language name          3
## 13 athu         1sg.Erg               17
## 14 awarda       that                   9
## 15 kadlhu       liver                  1
## 16 kadlhumpa    lungs                  1
## 17 kadliti      a personal name        1
## 18 kadnha       stone                  6
## 19 kalka        evening                1
## 20 kalta        lizard sp              2
## 21 kangi        too much               2
## 22 kanta-lhuku  pour-Purp              1
## 23 kantha       broom                  1
## 24 kantimalyu   gum, resin             1
## 25 kantunya     wallaby sp             1
## 26 kantyara     grass                  9
## 27 karla        creek                  2
## 28 madla        dog                    9
## 29 madlanti     bad                    1
## 30 madli        cold                   4
## 31 maka         fire                   3
## 32 malaru       however                1
## 33 maltya       not                    1
## 34 manga-rnda   be.ashamed-Present     3
## 35 mangu        forehead               1
## 36 mani         to get                 4
## 37 mankarra     girl                   3
## 38 manta-li     tie-Habitual           1
## 39 manthara     wattle sp              2
## 40 manungka-rda remember-Pres          2
## 41 marka        to crawl               1
## 42 marna        mouth                  3
## 43 pangki       ribs                   1
## 44 panki-rda    happy-Pres             1
## 45 panpa-rda    taste-Present          1
## 46 panti-li     fight-Habitual         2
## 47 pantya       knee                   2
## 48 parnda       big                   15
## 49 thadla       frightened             1
## 50 thanthi      grandchild             4
## 51 wabma        carpet snake           5
## 52 wadna        yamstick               3
## 53 wadni        corroboree             1
## 54 wakarla      crow                   1
## 55 wangka       word                   1
## 56 wanka        to rise                1
## 57 wanpa        to distrust            1
## 58 wanta        to fly                 1
## 59 wanti        corkwood               2
## 60 wathili      proper, real           2
## 61 yalka        plant sp               5
## 62 yangkathara  thristy                2
## 63 yanhi        to speak               1

And now we’ll do the same for Sydney, who has much more data available:

data %>% 
  filter(speaker=="syd") %>% 
  group_by(headword, gloss) %>% 
  summarize(count=n()/11) %>%
  print(n=Inf)

## `summarise()` has grouped output by 'headword'. You can override using the
## `.groups` argument.

## # A tibble: 67 × 3
## # Groups:   headword [67]
##    headword          gloss                     count
##    <chr>             <chr>                     <dbl>
##  1 alantha           1dl.Nom.SM                   66
##  2 alhakiya          1dl.Nom.DM                   65
##  3 amanyi            grandmother                  31
##  4 anari             this way                     66
##  5 angka             alive                         6
##  6 anhaku            Dont know                    77
##  7 anthunha          mine                         40
##  8 arla-thi-rnda     appear-Pres                   5
##  9 kadlara           cloud                         7
## 10 kadnanundi        scorpion                      7
## 11 kadnha            stone                        65
## 12 kadnha-pardi      grasshopper sp                7
## 13 kadnhaardi        small stones                  6
## 14 kala-rrantja      lizard sp                     5
## 15 kalka             choose                        6
## 16 kalta             lizard sp                     5
## 17 kangi             persistent                   38
## 18 kantyara          grass                        10
## 19 karla             creek                         6
## 20 karlatharra       bush turkey                   7
## 21 kata              louse                        65
## 22 katharungka       white cockatoo               74
## 23 madla             bad, dog                     68
## 24 madla-yari        Sturts desert pea             6
## 25 madlanti          bad                           6
## 26 maka              fire                          7
## 27 malka-malka       stripey                       6
## 28 malya             gypsum                        7
## 29 mama-rnda         grab-Pres                    74
## 30 manarni           perhaps, maybe, meanwhile    33
## 31 mankarra          girl                          6
## 32 manuputu          ignorant                     43
## 33 mapaa-rnda        sit.together-Present          6
## 34 marna             mouth                         7
## 35 marna-kathu-kathu whispery                      6
## 36 mathapurda        old man                       7
## 37 ngadlha           cheek                        60
## 38 paka-rnda         dig-Pres                      8
## 39 palkali           snatch                       12
## 40 palkalya          leaves                        5
## 41 palyama-rnda      carry.on.shoulder-Pres        7
## 42 pangkara          reed                          4
## 43 pinya             vengeance party               2
## 44 thadlarra-ma-rnda fear-Caus-Pres                7
## 45 thalka            bilby                         6
## 46 wabma             carpet snake                  8
## 47 wabmarri          plain                        72
## 48 wadlampu          hungry                        7
## 49 wadna             yamstick                     72
## 50 wadna-rnda        take.off-Present              6
## 51 wadnakani         carpet snake                  6
## 52 wakarla           crow                          4
## 53 wakarra           back.of.neck                  5
## 54 walpathi-rda      hunt-Present                  7
## 55 wamparla          possum                        7
## 56 wangka parda-rnda answer-Pres                   5
## 57 wangka-matha-nta  speak-together-Reflexive      5
## 58 wangka-tyimpa-rda converse-Pres                 6
## 59 wanpa-rda         carry-Pres                    6
## 60 wanta-rda         fly-Present                   7
## 61 wapa-rnda         seek-Pres                    83
## 62 yakarra           teeth                         6
## 63 yalka             plant sp                      4
## 64 yampa             stranger                      5
## 65 yangka            hollow                        6
## 66 yangkatharra      thirsty                      12
## 67 yantakarra        west                          6

Additional acoustic measures

In addition to the 18 acoustic features of nasality from the Nasality Automeasure script, we add the first four spectral moments, a measure of nasal murmur based on spectral amplitude ratio, and 14 MFCCs. We are not able to make the original audio files publicly available due to privacy concerns, but we have provided here the code used to create the measurements, and we provide the derived data frame in the supplementary materials data repository:

# names of MFCCs 1-14
mfccs <- paste0("mfcc",1:14)

# prepare data frame
alldata <- c()

# loop through all files
for (file in unique(data$filename)) {
  
  # get the file data
  fdat <- data[data$filename==file,]
  
  # read the corresponding audio file
  audio <- tuneR::readWave(paste0('audio/',file,'.wav'))
  
  # total duration (in seconds)
  totaldur <- length(audio@left)/audio@samp.rate
  
  # extract audio and de-mean to remove DC offset
  snd <- audio@left - mean(audio@left)
  
  # audio pre-emphasis
  for (n in 2:length(snd)) {
    snd[n] <- snd[n] - 0.97*snd[n-1]
  }
  
  # replace the wave object audio samples
  audio@left <- snd
  
  # calculate MFCCs
  melcs <- tuneR::melfcc(audio, 
                         sr = audio@samp.rate, 
                         numcep = length(mfccs), 
                         wintime = 0.01, 
                         hoptime = 0.005)
  
  # get the actual time step (may be slightly different than "hoptime")
  timestep <- totaldur/nrow(melcs)
  
  # get the MFCCs samples nearest to the time points
  mfsamps <- round(fdat$point_time/timestep)
  
  # add the MFCCs to the file data
  fdat[,mfccs] <- melcs[mfsamps,]
  
  # create spectrogram
  spec <- signal::specgram(x = audio@left,
                           n = 1024,
                           Fs = audio@samp.rate,
                           window = 256,
                           overlap = 128
  )
  
  # get spectra
  P <- abs(spec$S)
  
  # convert to dB
  P <- 20*log10(P)
  
  # get the spectral time step
  timestep <- diff(spec$t[1:2])
  
  # get the spectral samples nearest to the time points
  specsamps <- round(fdat$point_time/timestep)
  
  # get first four spectral moments
  moments <- c()
  for (samp in 1:length(specsamps)) {
    moments <- rbind(moments, emuR::moments(P[,samp]))
  }
  colnames(moments) <- c("COG", "variance", "skew", "kurtosis")
  
  # add the moments to the file data
  fdat[,colnames(moments)] <- moments
  
  # nasal murmur (low/high ratio, 0-320 Hz : 320-5360 Hz) bands
  thresh1 <- which.min(abs(spec$f-320))
  thresh2 <- which.min(abs(spec$f-5360))
  
  # get the spectral amplitude means within the two frequency bands
  low   <- colMeans(P[1:thresh1,specsamps])
  high  <- colMeans(P[thresh1:thresh2,specsamps])
  
  # calculate the murmur ratio and add to the file data
  fdat$murmur <- low/high
  
  # add the file data to the combined data frame
  alldata <- rbind.data.frame(alldata,fdat)
}

saveRDS(alldata,"alldata.Rda")

And now we load the derived data frame, in order to continue with a replicable pipeline:

alldata <- readRDS("alldata.Rda")

Let’s take a quick look at the data to make sure everything looks correct:

summary(alldata)

##    filename              item        speaker            headword        
##  Length:17743       Min.   :   1   Length:17743       Length:17743      
##  Class :character   1st Qu.: 404   Class :character   Class :character  
##  Mode  :character   Median : 807   Mode  :character   Mode  :character  
##                     Mean   : 807                                        
##                     3rd Qu.:1210                                        
##                     Max.   :1613                                        
##                                                                         
##     gloss              V1_env            timepoint   point_vwlpct
##  Length:17743       Length:17743       Min.   : 1   Min.   :  0  
##  Class :character   Class :character   1st Qu.: 3   1st Qu.: 20  
##  Mode  :character   Mode  :character   Median : 6   Median : 50  
##                                        Mean   : 6   Mean   : 50  
##                                        3rd Qu.: 9   3rd Qu.: 80  
##                                        Max.   :11   Max.   :100  
##                                                                  
##    point_time        freq_f1          freq_f2        freq_f3    
##  Min.   : 0.040   Min.   :   0.0   Min.   : 432   Min.   :1222  
##  1st Qu.: 1.581   1st Qu.: 422.0   1st Qu.:1275   1st Qu.:2275  
##  Median : 3.606   Median : 562.0   Median :1440   Median :2407  
##  Mean   : 3.759   Mean   : 553.9   Mean   :1452   Mean   :2449  
##  3rd Qu.: 5.562   3rd Qu.: 656.0   3rd Qu.:1576   3rd Qu.:2553  
##  Max.   :16.537   Max.   :2320.0   Max.   :3830   Max.   :5231  
##                                                                 
##      amp_f1           amp_f2           amp_f3          width_f1     
##  Min.   :-48.66   Min.   :-48.17   Min.   :-51.10   Min.   :   3.0  
##  1st Qu.: 31.27   1st Qu.: 21.92   1st Qu.:  9.42   1st Qu.: 137.0  
##  Median : 36.90   Median : 28.67   Median : 16.40   Median : 249.0  
##  Mean   : 36.22   Mean   : 27.84   Mean   : 16.24   Mean   : 342.3  
##  3rd Qu.: 42.60   3rd Qu.: 35.01   3rd Qu.: 23.37   3rd Qu.: 438.0  
##  Max.   : 62.00   Max.   : 56.13   Max.   : 48.44   Max.   :6051.0  
##                                                                     
##     width_f2       width_f3          amp_p0            a1p0        
##  Min.   :   8   Min.   :   9.0   Min.   :-42.47   Min.   :-44.940  
##  1st Qu.: 118   1st Qu.: 208.0   1st Qu.: 36.57   1st Qu.: -7.700  
##  Median : 196   Median : 360.5   Median : 39.96   Median : -2.280  
##  Mean   : 303   Mean   : 496.3   Mean   : 39.68   Mean   : -3.467  
##  3rd Qu.: 343   3rd Qu.: 617.0   3rd Qu.: 43.47   3rd Qu.:  0.500  
##  Max.   :7932   Max.   :8519.0   Max.   : 59.97   Max.   : 46.730  
##                 NA's   :31                                         
##  a1p0_compensated    p0prominence         amp_p1            a1p1       
##  Min.   :-42.9355   Min.   :-19.570   Min.   :-52.93   Min.   :-38.09  
##  1st Qu.: -5.6284   1st Qu.:  3.605   1st Qu.: 15.65   1st Qu.:  7.54  
##  Median : -0.1619   Median :  7.630   Median : 22.09   Median : 13.66  
##  Mean   : -1.3417   Mean   :  8.106   Mean   : 22.38   Mean   : 13.84  
##  3rd Qu.:  2.8873   3rd Qu.: 11.980   3rd Qu.: 29.34   3rd Qu.: 19.70  
##  Max.   : 48.8675   Max.   : 43.290   Max.   : 55.58   Max.   : 53.83  
##                                                                        
##  a1p1_compensated      a3p0             h1h2            mfcc1        
##  Min.   :-35.55   Min.   :-61.97   Min.   :-44.15   Min.   :-110.10  
##  1st Qu.: 10.70   1st Qu.:-29.32   1st Qu.:  1.80   1st Qu.:  50.87  
##  Median : 16.41   Median :-23.41   Median :  6.71   Median :  56.58  
##  Mean   : 16.62   Mean   :-23.45   Mean   :  6.51   Mean   :  55.49  
##  3rd Qu.: 22.06   3rd Qu.:-17.38   3rd Qu.: 11.71   3rd Qu.:  62.20  
##  Max.   : 64.72   Max.   : 29.44   Max.   : 38.76   Max.   :  88.93  
##                                                                      
##      mfcc2             mfcc3             mfcc4              mfcc5         
##  Min.   :-13.390   Min.   :-20.677   Min.   :-24.3642   Min.   :-21.0224  
##  1st Qu.:  1.158   1st Qu.:  3.128   1st Qu.: -6.1410   1st Qu.: -7.8112  
##  Median :  3.598   Median :  6.624   Median : -2.5414   Median : -4.4083  
##  Mean   :  4.981   Mean   :  6.239   Mean   : -2.5221   Mean   : -3.8751  
##  3rd Qu.:  6.785   3rd Qu.: 10.099   3rd Qu.:  0.8234   3rd Qu.: -0.5944  
##  Max.   : 30.822   Max.   : 26.312   Max.   : 21.1995   Max.   : 27.9880  
##                                                                           
##      mfcc6              mfcc7              mfcc8             mfcc9        
##  Min.   :-27.9145   Min.   :-24.7785   Min.   :-38.379   Min.   :-33.792  
##  1st Qu.: -0.0594   1st Qu.: -3.8065   1st Qu.: -6.955   1st Qu.: -7.237  
##  Median :  6.2691   Median :  0.5021   Median : -2.263   Median : -1.572  
##  Mean   :  5.1518   Mean   :  0.5367   Mean   : -2.708   Mean   : -2.108  
##  3rd Qu.: 11.3075   3rd Qu.:  4.6862   3rd Qu.:  2.131   3rd Qu.:  3.274  
##  Max.   : 31.4070   Max.   : 31.3032   Max.   : 23.930   Max.   : 23.213  
##                                                                           
##      mfcc10            mfcc11            mfcc12             mfcc13       
##  Min.   :-19.478   Min.   :-26.742   Min.   :-29.3881   Min.   :-16.658  
##  1st Qu.: -1.336   1st Qu.: -5.843   1st Qu.: -9.1303   1st Qu.:  5.252  
##  Median :  2.700   Median : -1.673   Median : -4.0143   Median :  9.637  
##  Mean   :  2.851   Mean   : -1.838   Mean   : -4.1285   Mean   :  9.521  
##  3rd Qu.:  6.915   3rd Qu.:  2.326   3rd Qu.:  0.6071   3rd Qu.: 13.939  
##  Max.   : 29.241   Max.   : 20.220   Max.   : 26.8391   Max.   : 36.758  
##                                                                          
##      mfcc14             COG           variance          skew         
##  Min.   :-31.819   Min.   :220.0   Min.   :10517   Min.   :-1.06139  
##  1st Qu.: -3.233   1st Qu.:246.7   1st Qu.:24389   1st Qu.:-0.09699  
##  Median :  1.941   Median :255.0   Median :25565   Median :-0.02016  
##  Mean   :  1.668   Mean   :258.6   Mean   :25209   Mean   :-0.04533  
##  3rd Qu.:  6.845   3rd Qu.:262.7   3rd Qu.:26639   3rd Qu.: 0.06739  
##  Max.   : 26.120   Max.   :381.5   Max.   :32094   Max.   : 0.37271  
##                                                                      
##     kurtosis          murmur      
##  Min.   :-1.587   Min.   :-1.271  
##  1st Qu.:-1.355   1st Qu.: 1.375  
##  Median :-1.312   Median : 1.480  
##  Mean   :-1.264   Mean   : 1.485  
##  3rd Qu.:-1.254   3rd Qu.: 1.600  
##  Max.   : 1.435   Max.   : 2.786  
##

We can see that there are 31 NA values for the width_f3 measurement; these come from time points where Praat was not able to reliably estimate the F3 bandwidth in the process of running the Nasality Automeasure script. There aren’t that many of these occurrences, and they all come from Sydney’s data:

table(alldata$speaker[is.na(alldata$width_f3)])

## 
## syd 
##  31

…which accounts for only 0.2% of this speaker’s total data set:

table(alldata$speaker[is.na(alldata$width_f3)]) / nrow(alldata[alldata$speaker=="syd",])

## 
##         syd 
## 0.002010115

…so we’ll simply set these values to be equal to the overall mean for Sydney’s F3 bandwidth measure:

alldata$width_f3[is.na(alldata$width_f3)] <- mean(alldata$width_f3[alldata$speaker=="syd"], na.rm=T)

Delta features

In order to capture abrupt temporal changes in the acoustic measurements, we will also calculate the delta values for all of these features (i.e. sample-wise differences). We’ll first make a list of the feature names, and then use this to calculate the delta features:

features <- c('freq_f1','freq_f2','freq_f3',
              'amp_f1','amp_f2','amp_f3',
              'width_f1','width_f2','width_f3',
              'amp_p0','a1p0','a1p0_compensated','p0prominence',
              'amp_p1','a1p1','a1p1_compensated',
              'a3p0','h1h2',paste0('mfcc',1:14),
              'COG','variance','skew','kurtosis','murmur')

for (item in unique(alldata$item)) {
  it.dat <- alldata[alldata$item==item,]
  
  for (feature in features) {
    alldata[[paste0(feature,".d")]][alldata$item==item] <- c(0, diff(it.dat[[feature]]) )
  }
}

We’ll add these to the list of feature names, and we can see that we now have 74 features to work with:

features <- c(features, paste0(features,".d") )

length(features)

## [1] 74

Data re-sampling

Re-sampling of the data was crucial for creating balanced data sets across the six vowel environments, while also retaining as much of the data as we are able to retain. There are a number of processes from this point forward that will involve a random component, so we’ll use a random seed for replicability:

myseed <- 123
set.seed(myseed)

Re-sampling function

The re-sampling function is taken from the multivariate resampling procedure, which allows for over-sampling of data by creating Euclidean-distance-based weighted averages of feature vectors of randomly selected nearest neighbors in a multidimensional space. We replicate the function here, as accessed on 21 June, 2022, for the sake of transparency:

multivar_resamp <- function (inputdata, groupby, resampnum, features) {
  
  outputdata <- c()
  for (group in unique(inputdata[[groupby]])) {
    subdata <- inputdata[inputdata[[groupby]]==group,]
    
    if (nrow(subdata) > resampnum) { # more samples than the desired number?
      # undersample to the desired number
      newdata <- subdata[sample(nrow(subdata), resampnum, replace=F), ]
      
      # print a success message
      print(paste0("'",groupby,"' level named '", group, "' has been under-sampled from ", nrow(subdata), " to ", resampnum, " observations"), quote=F)
      
    } else {
      if (nrow(subdata) == resampnum) { # same number of samples as the desired number?
        # keep the original data as-is
        newdata <- subdata
        
        # print a success message
        print(paste0("'",groupby,"' level named '", group, "' has been kept at ", nrow(subdata), " observations"), quote=F)
        
      } else { # fewer samples than the desired number?
        # let's oversample!
        oversamp              <- resampnum - nrow(subdata) # number of new observations to use in oversampling
        sub.scaled            <- subdata # original data to scale
        means                 <- apply(as.matrix(sub.scaled[,features]), 2, mean) # get the original feature means
        stdevs                <- apply(as.matrix(sub.scaled[,features]), 2, sd) # get the original feature standard deviations
        sub.scaled[,features] <- scale(subdata[,features]) # scale the original features
        oversamp.data         <- c() # oversampled data to build
        
        for (samp in 1:oversamp) {
          # randomly choose an observation from the scaled feature matrix
          this.samp   <- sub.scaled[sample(nrow(sub.scaled), 1), ]
          
          # select all of the OTHER observations from the scaled feature matrix
          other.samps <- sub.scaled[which(row.names(sub.scaled)!=row.names(this.samp)), ]
          
          # calculate Euclidean distances between the selected observation and all other observations
          dists <- apply(as.matrix(other.samps[,features]), 1, function(x) sqrt(sum((x - as.matrix(this.samp[,features]))^2)))
          
          # sort by distance
          neighbors <- sort(dists)
          
          # while loop which ensures that no duplicate observations are created in the oversampling process
          n.check <- 0
          while (n.check == 0) {
            # select one of the neighbors from within a Gaussian PDF
            # possible duplicates are ignored in two steps
            n.dist    <- sample(neighbors[neighbors>0], 
                                prob = dnorm(1:length(neighbors[neighbors>0]),0,round(0.3413*length(neighbors[neighbors>0]))),
                                size = 1)
            neighbor  <- which(dists == n.dist)[1]
            
            # create a new observation by calculating a weighted average of the feature vectors 
            # associated with the selected observation and its selected neighbor
            s.dist <- (n.dist-min(dists))/diff(range(dists))
            
            new.features <- (
              s.dist * (sub.scaled[which(row.names(sub.scaled)==row.names(other.samps[neighbor,])), features]) +
                (1-s.dist) * (sub.scaled[which(row.names(sub.scaled)==row.names(this.samp)), features])
            )
            
            # convert weighted features back to their respective original scales
            # using the means and standard deviations of the original features
            new.features <- (new.features * stdevs) + means
            
            # replace old features with new features
            this.samp[,features] <- new.features
            
            # check if it is a duplicate oversampled observation
            dup.check <- duplicated(rbind(oversamp.data,this.samp))
            
            # if it is NOT a duplicate, exit the loop and add it to the data frame
            # if it IS a duplicate, run this loop again with a different neighbor
            if (length(dup.check[dup.check==TRUE])==0) {
              n.check <- 1
            }
          }
          # add new observation to data frame
          oversamp.data <- rbind(oversamp.data,this.samp)
        }
        # add oversampled data to original data frame
        newdata <- rbind(subdata,oversamp.data)
        
        # print a success message
        print(paste0("'",groupby,"' level named '", group, "' has been over-sampled from ", nrow(subdata), " to ", resampnum, " observations"), quote=F)
      } 
    }
    # add resampled data to output dataset
    outputdata <- rbind(outputdata,newdata)
  }
  return(outputdata)
}

Performing re-sampling

Before performing the data re-sampling, we first remove the absolute boundaries of the vowel interval in order to avoid the most extreme effects of co-articulation on the acoustic signal:

alldata <- alldata[!(alldata$timepoint %in% c(1,11)),]

We will begin with Laurie’s data:

speakers <- c("laurie","syd")

speaker <- speakers[1]

subdata <- alldata[alldata$speaker==speaker,]
table(subdata$V1_env)

## 
## #_C #_N C_C C_N N_C N_N 
## 369 441 288 450 180 171

We can see that there is an unbalanced distribution of data samples across the six phonetic environments, from 171 samples in N_N to 450 samples in C_N. Since Laurie’s data is quite limited compared to Sydney’s, and since it is not possible to obtain more data from this speaker, we will maximize the data available to us by over-sampling the five minority categories so that they have an equal number of samples as the C_N environment, i.e. 450:

subdata1 <- multivar_resamp(subdata, "V1_env", max(table(subdata$V1_env)), features)

## [1] 'V1_env' level named '#_N' has been over-sampled from 441 to 450 observations
## [1] 'V1_env' level named '#_C' has been over-sampled from 369 to 450 observations
## [1] 'V1_env' level named 'C_C' has been over-sampled from 288 to 450 observations
## [1] 'V1_env' level named 'C_N' has been kept at 450 observations
## [1] 'V1_env' level named 'N_C' has been over-sampled from 180 to 450 observations
## [1] 'V1_env' level named 'N_N' has been over-sampled from 171 to 450 observations

Now let’s look at the effects of the over-sampling, by comparing the separate probability densities of the original and over-sampled data for the first 20 acoustic features. We’ll plot the probabilities densities of the original data in blue and the over-sampled data in red:

par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))

for (feature in features[1:20]) {
  plot(density(subdata1[[feature]]), col='red',
       main=NA,xlab=feature,
       ylim = range(c(
         density(subdata[[feature]])$y,
         density(subdata1[[feature]])$y)))
  
  lines(density(subdata[[feature]]), col='blue')
}

We can see that the relative distributions of the over-sampled data are similar to the original data, which is a purposeful feature of the re-sampling algorithm. In order to see the most extreme effects of the procedure, let’s make the same comparison for only the N_N environment, which underwent the greatest amount of over-sampling from 171 samples to 450 samples, i.e. a 263% increase:

par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))

for (feature in features[1:20]) {
  plot(density(subdata1[[feature]][subdata1$V1_env=="N_N"]), col='red',
       main=NA, xlab=feature,
       ylim = range(c(density(subdata[[feature]][subdata$V1_env=="N_N"])$y,
                      density(subdata1[[feature]][subdata1$V1_env=="N_N"])$y)))
  
  lines(density(subdata[[feature]][subdata$V1_env=="N_N"]), col='blue')
}

We can see that even in this most extreme case, the over-sampling has largely retained similar distributions of the original acoustic features, with some minor deviations that are closer approximations to a normal distribution, e.g., for features that exhibit bimodality such as F1 frequency and MFCC 1.

Now let’s follow a similar procedure for Sydney’s data:

speaker <- speakers[2]

subdata <- alldata[alldata$speaker==speaker,]

table(subdata$V1_env)

## 
##  #_C  #_N  C_C  C_N  N_C  N_N 
## 1224 1980 5373 1071 1449 1521

Since there is much more data available for Sydney, we are not constrained as strictly by a need to retain as much information as possible. At the same time, we don’t want to completely discard a majority of the data by under-sampling everything to the smallest number, 1071. As a compromise, we will re-sample the data set to be equal to the median value, 1485:

subdata2 <- multivar_resamp(subdata, "V1_env", round(median(table(subdata$V1_env))), features)

## [1] 'V1_env' level named 'C_C' has been under-sampled from 5373 to 1485 observations
## [1] 'V1_env' level named 'N_N' has been under-sampled from 1521 to 1485 observations
## [1] 'V1_env' level named 'C_N' has been over-sampled from 1071 to 1485 observations
## [1] 'V1_env' level named 'N_C' has been over-sampled from 1449 to 1485 observations
## [1] 'V1_env' level named '#_N' has been under-sampled from 1980 to 1485 observations
## [1] 'V1_env' level named '#_C' has been over-sampled from 1224 to 1485 observations

Now let’s compare the probability densities in the same way that we did for Laurie’s data. We can see that the overall distributional properties are retained:

par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))

for (feature in features[1:20]) {
  plot(density(subdata2[[feature]]), col='red',
       main=NA, xlab=feature,
       ylim = range(c(density(subdata[[feature]])$y,
                      density(subdata2[[feature]])$y)))
  
  lines(density(subdata[[feature]]), col='blue')
}

If we look at the extreme cases of the re-sampling, we can first examine the C_N environment, which was over-sampled from 1071 to 1485 samples:

par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))

for (feature in features[1:20]) {
  plot(density(subdata2[[feature]][subdata2$V1_env=="C_N"]), col='red',
       main=NA, xlab=feature,
       ylim = range(c(density(subdata[[feature]][subdata$V1_env=="C_N"])$y,
                      density(subdata2[[feature]][subdata2$V1_env=="C_N"])$y)))
  
  lines(density(subdata[[feature]][subdata$V1_env=="C_N"]), col='blue')
}

And now we examine the C_C environment, which was under-sampled from 5373 to 1485 samples via random selection:

par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))

for (feature in features[1:20]) {
  plot(density(subdata2[[feature]][subdata2$V1_env=="C_C"]), col='red',
       main=NA, xlab=feature,
       ylim = range(c(density(subdata[[feature]][subdata$V1_env=="C_C"])$y,
                      density(subdata2[[feature]][subdata2$V1_env=="C_C"])$y)))
  
  lines(density(subdata[[feature]][subdata$V1_env=="C_C"]), col='blue')
}

In all cases, the re-sampling algorithm generates data which retain the distributional properties of the original data. Thus satisfied with the re-sampled output, we replace our data frame with the re-sampled data and verify that the six environments are now balanced for both speakers:

alldata <- rbind(subdata1, subdata2)

table(alldata$V1_env, alldata$speaker)

##      
##       laurie  syd
##   #_C    450 1485
##   #_N    450 1485
##   C_C    450 1485
##   C_N    450 1485
##   N_C    450 1485
##   N_N    450 1485

XGBoost

Oral and Nasal features

In order to train the XGBoost models, we first need to select training data associated with orality and nasality of the acoustic signal. We will do this by retaining samples at 10% and 90% of the vowel interval, adjacent to either oral or nasal consonants, as appropriate.

Let’s first begin with Laurie’s data:

speaker <- speakers[1]

subdata <- alldata[alldata$speaker==speaker,]

In order to split the samples into a training set and testing set (i.e. a set to generate the NAF values), we need to numerically label the individual observations:

subdata$obs <- 1:nrow(subdata)

We now create a training set using only 10% of the vowel interval (timepoint 2 of 11) and 90% of the vowel interval (timepoint 10 of 11):

training <- subdata[subdata$timepoint %in% c(2,10),]

We now randomly sample 80% of these observations to use in training the model:

set.seed(myseed)
training.samples <- sample(training$obs, round(0.8*nrow(training)))

training <- training[training$obs %in% training.samples,]

Our outcome variable will be a numeric variable 0 or 1: 0 will correspond to oral features, i.e. acoustic features associated with time points adjacent to oral consonants, and 1 will correspond to nasal features, i.e. acoustic features associated with time points adjacent to nasal consonants. We now assign these values based on whether the time points are adjacent to an oral consonant (0) or a nasal consonant (1):

training$nasality <- training$V1_env

# oral contexts
training$nasality[training$nasality=="C_C"] <- 0
training$nasality[training$nasality %in% c("#_C","N_C") & training$timepoint==10] <- 0
training$nasality[training$nasality == "C_N" & training$timepoint==2] <- 0

# nasal contexts
training$nasality[training$nasality=="N_N"] <- 1
training$nasality[training$nasality %in% c("#_N","C_N") & training$timepoint==10] <- 1
training$nasality[training$nasality == "N_C" & training$timepoint==2] <- 1

We now restrict our training set to only these 0 and 1 numeric labels, which will exclude time points at 10% of the vowel interval in word-initial contexts:

training <- training[training$nasality %in% c(0,1),]

Now that we have our training data, we make a testing set for generating the NAF values, which will include the retained 20% of the testing data, the excluded word-initial contexts, and all time points from 20% to 80% of the vowel interval:

testing <- subdata[!(subdata$obs %in% training$obs),]

nrow(training); nrow(testing)

## [1] 402

## [1] 2298

In total, 402 samples will be used for model training and 2298 samples will be used to generate NAF values for Laurie’s data.

We now make a new data frame for model training, which consists only of the [0,1] numeric nasality variable and the 74 acoustic features. We then make an XGBoost pointer which will tell the XGBoost model which data are to be used as predictor variables and which data is to be used as the outcome variable:

train.data  <- training[,c("nasality",features)]
dtrain      <- xgboost::xgb.DMatrix(data = as.matrix(train.data[,features]), label = train.data$nasality)

Tuning hyper-parameters

In order to tune the hyper-parameters of the XGBoost model to find the optimal parameter combination for Laurie’s data, we will do a full grid search through the following 2079 hyper-parameter permutations:

hyper_grid <- expand.grid(max_depth = seq(2, 10, 1),
                          eta = seq(0, 0.5, 0.05),
                          gamma = seq(0, 0.2, 0.1),
                          subsample = seq(0.3, 0.9, 0.1))

We now loop through each permutation of the hyper-parameters and log the RMSE error in a 5-fold cross-validation model built with the respective hyper-parameter values. We use stratified sampling to ensure that an equal number of oral and nasal items are used in each iteration of the model:

xgb_test_error  <- NULL

for (j in 1:nrow(hyper_grid)) {
  xgb_params <- list("objective" = "reg:squarederror",
                     "eval_metric" = "rmse",
                     "max_depth" = hyper_grid$max_depth[j],
                     "eta" = hyper_grid$eta[j],
                     "gamma" = hyper_grid$gamma[j],
                     "subsample" = hyper_grid$subsample[j])
  
  m_xgb_untuned <- xgboost::xgb.cv(
    params = xgb_params,
    data = dtrain,
    set.seed(myseed),
    nrounds = 25,
    set.seed(myseed),
    early_stopping_rounds = 3,
    nfold = 5,
    stratified = T,
    verbose = F
  )
  
  xgb_test_error[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
}

The ideal hyper-parameters for Laurie’s data are the ones which yield the lowest RMSE error value:

hyper_grid[which.min(xgb_test_error),]

##      max_depth  eta gamma subsample
## 1533         4 0.25     0       0.8

We then create a parameter list using these tuned values:

xgb_params <- list("objective" = "reg:squarederror",
                   "eval_metric" = "rmse",
                   "max_depth" = 4,
                   "eta" = 0.25,
                   "gamma" = 0,
                   "subsample" = 0.8)

Optimize final model

One final step is to optimize the number of iterations of model learning which strikes a balance between learning patterns in the data and avoiding over-fitting of the model to the training set. We will find this optimal iteration number by creating one final cross-validation model with early stopping based on lack of improvement in the testing loss function:

tst_model <- xgboost::xgb.cv(params = xgb_params,
                             data = dtrain,
                             set.seed(myseed),
                             nrounds = 1000,
                             early_stopping_rounds = 5,
                             nfold = 5,
                             stratified = T,
                             verbose = F)

Our final model is then created using all of the training data, the tuned hyper-parameters, and the optimized number of iterations:

bst_model <- xgboost::xgb.train(params = xgb_params,
                                data = dtrain,
                                set.seed(myseed),
                                nrounds = tst_model$best_iteration,
                                verbose = F)

Generate NAF values

The NAF values can now be generated by using the final XGBoost model to make response predictions from the acoustic data in the testing set:

testing$NAF <- predict(bst_model, xgboost::xgb.DMatrix(data = as.matrix(testing[,features])) )

We can investigate the NAF values for a single word. Let’s try the word kantha “grass”, which is a C_N vowel environment:

examp <- testing[testing$headword=="kantha",]

plot(examp$timepoint, examp$NAF, type='l')

We observe that the NAF values begin low and increase throughout the vowel interval, suggesting a pattern of anticipatory nasalization. Note that this item has no time points associated with either 10% (timepoint 2) or 90% (timepoint 10) of the vowel interval, indicating that this particular item did not include any of the 20% hold-out data that were excluded from the training set. In other words, the pattern that we observe in this item derives not only from data that were not included in the training data (as is the case for all of the NAF values), but also from data that had no representative timepoints in the model training at all, i.e. no data from timepoints 3-9 were ever included in the XGBoost learning.

We extract the relevant data from Laurie’s results to use for plotting:

plot.data1 <- testing[,c("point_vwlpct","NAF","V1_env",features)]
plot.data1$speaker <- speaker

Do it again!

We now repeat the entire process described above, but for Sydney’s data:

speaker <- speakers[2]

subdata <- alldata[alldata$speaker==speaker,]

subdata$obs <- 1:nrow(subdata)

training <- subdata[subdata$timepoint %in% c(2,10),]

set.seed(myseed)
training.samples <- sample(training$obs, round(0.8*nrow(training)))

training <- training[training$obs %in% training.samples,]

training$nasality <- training$V1_env

# oral contexts
training$nasality[training$nasality=="C_C"] <- 0
training$nasality[training$nasality %in% c("#_C","N_C") & training$timepoint==10] <- 0
training$nasality[training$nasality == "C_N" & training$timepoint==2] <- 0

# nasal contexts
training$nasality[training$nasality=="N_N"] <- 1
training$nasality[training$nasality %in% c("#_N","C_N") & training$timepoint==10] <- 1
training$nasality[training$nasality == "N_C" & training$timepoint==2] <- 1

training <- training[training$nasality %in% c(0,1),]
testing <- subdata[!(subdata$obs %in% training$obs),]

train.data  <- training[,c("nasality",features)]
dtrain      <- xgboost::xgb.DMatrix(data = as.matrix(train.data[,features]), label = train.data$nasality)

nrow(training); nrow(testing)

## [1] 1356

## [1] 7554

In total, 1356 samples will be used for model training and 7554 samples will be used to generate NAF values for Sydney’s data.

The hyper-parameters need to be tuned separately for Sydney’s data set:

xgb_test_error  <- NULL

for (j in 1:nrow(hyper_grid)) {
  xgb_params <- list("objective" = "reg:squarederror",
                     "eval_metric" = "rmse",
                     "max_depth" = hyper_grid$max_depth[j],
                     "eta" = hyper_grid$eta[j],
                     "gamma" = hyper_grid$gamma[j],
                     "subsample" = hyper_grid$subsample[j])
  
  m_xgb_untuned <- xgboost::xgb.cv(
    params = xgb_params,
    data = dtrain,
    set.seed(myseed),
    nrounds = 25,
    set.seed(myseed),
    early_stopping_rounds = 3,
    nfold = 5,
    stratified = T,
    verbose = F
  )
  
  xgb_test_error[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
}

#ideal hyperparamters
hyper_grid[which.min(xgb_test_error),]

##      max_depth  eta gamma subsample
## 1616         6 0.15   0.1       0.8

We subsequently build an optimized XGBoost model for Sydney’s data and generate the corresponding NAF values as predictions:

xgb_params <- list("objective" = "reg:squarederror",
                   "eval_metric" = "rmse",
                   "max_depth" = 6,
                   "eta" = 0.15,
                   "gamma" = 0.1,
                   "subsample" = 0.8)

tst_model <- xgboost::xgb.cv(params = xgb_params,
                             data = dtrain,
                             set.seed(myseed),
                             nrounds = 1000,
                             early_stopping_rounds = 5,
                             nfold = 5,
                             stratified = T,
                             verbose = F)

bst_model <- xgboost::xgb.train(params = xgb_params,
                                data = dtrain,
                                set.seed(myseed),
                                nrounds = tst_model$best_iteration,
                                verbose = F)

testing$NAF <- predict(bst_model, xgboost::xgb.DMatrix(data = as.matrix(testing[,features])) )

We now extract Sydney’s NAF data and add them to Laurie’s NAF data for plotting and statistical treatment:

plot.data2 <- testing[,c("point_vwlpct","NAF","V1_env",features)]
plot.data2$speaker <- speaker

plot.data <- rbind.data.frame(plot.data1, plot.data2)

Results

Qualitative results

We can now use the NAF data to generate the aggregate smoothed NAF values (Figure 4 from the paper):

library(ggplot2)

ggplot(plot.data, aes(x=point_vwlpct,
                      y=NAF,
                      col=V1_env,
                      group=V1_env,
                      linetype=V1_env,
                      fill=V1_env)) + 
  geom_smooth(method="loess") +
  scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
  scale_y_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(0.1,0.9,0.1)) +
  guides(col=guide_legend(title="V env"),
         fill=guide_legend(title="V env"),
         linetype=guide_legend(title="V env")) + 
  ggtitle("NAF predictions by vowel environment") + theme_bw() + theme(legend.key.width=unit(1,"cm"))

Individual speaker results

The NAF data can be plotted separately for Sydney’s data (Figure 5 from the paper):

ggplot(plot.data[plot.data$speaker=="syd",], aes(x=point_vwlpct,
                                                 y=NAF,
                                                 col=V1_env,
                                                 group=V1_env,
                                                 linetype=V1_env,
                                                 fill=V1_env)) + 
  geom_smooth(method="loess") +
  scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
  scale_y_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(-0.1,1.2,0.1)) +
  coord_cartesian(ylim=c(0,1)) +
  guides(col=guide_legend(title="V env"),
         fill=guide_legend(title="V env"),
         linetype=guide_legend(title="V env")) + 
  ggtitle("NAF predictions by vowel environment: Sydney") + theme_bw() + theme(legend.key.width=unit(1,"cm"))

## `geom_smooth()` using formula 'y ~ x'

The NAF data can be plotted separately for Laurie’s data (Figure 6 from the paper):

ggplot(plot.data[plot.data$speaker=="laurie",], aes(x=point_vwlpct,
                                                    y=NAF,
                                                    col=V1_env,
                                                    group=V1_env,
                                                    linetype=V1_env,
                                                    fill=V1_env)) + 
  geom_smooth(method="loess") +
  scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
  scale_y_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(-0.1,1.2,0.1)) +
  coord_cartesian(ylim=c(0,1)) +
  guides(col=guide_legend(title="V env"),
         fill=guide_legend(title="V env"),
         linetype=guide_legend(title="V env")) + 
  ggtitle("NAF predictions by vowel environment: Laurie") + theme_bw() + theme(legend.key.width=unit(1,"cm"))

## `geom_smooth()` using formula 'y ~ x'

Quantitative results

We also plot the probability distributions and box plots of all NAF values within the target vowels, i.e. all values combined, independent of the time course of the vowel interval (Figure 7 from the paper):

ggplot(plot.data, aes(y=V1_env, 
                      x=NAF, 
                      fill=V1_env)) + 
  ggdist::stat_slab(adjust=0.7, height=1, justification=-0.05, 
                    .width=0, lwd=0.5, point_colour=NA, color="black") + 
  geom_boxplot(width=0.2, outlier.shape=NA, show.legend=F, notch=T) +
  guides(color=guide_legend(override.aes=list(shape=21)),
         fill=guide_legend(title="V env")) +
  ylab("Phonetic environment of V") + 
  scale_x_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(-0.2,1.2,0.2)) +
  ggtitle("NAF prediction distribution by vowel environment") + theme_bw()

Linear mixed model

We use the NAF values as the dependent variable in a linear mixed-effects model with the vowel environment as a fixed effect and a full random effect by speaker.

library(lme4)
library(multcomp)

lme.mod <- lmer(NAF ~ V1_env + (1+V1_env|speaker), data = plot.data, 
                control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5))
)

## boundary (singular) fit: see help('isSingular')
## Warning message:
## Model failed to converge with 1 negative eigenvalue: -6.7e+01

The full model fails to converge, so we simplify the model structure to include an intercept-only random effect:

lme.mod <- lmer(NAF ~ V1_env + (1|speaker), data = plot.data, 
                control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5))
)

We compute Tukey pair-wise contrasts with the \(\alpha\) level adjusted using the Bonferroni method in order to reduce Type I error inflation in a maximally conservative manner (Table 2 from the paper):

summary(glht(lme.mod, linfct = mcp(V1_env="Tukey")), test = adjusted("bonferroni"))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lmer(formula = NAF ~ V1_env + (1 | speaker), data = plot.data, 
##     control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05)))
## 
## Linear Hypotheses:
##                 Estimate Std. Error z value Pr(>|z|)    
## #_N - #_C == 0  0.307530   0.008772  35.057  < 2e-16 ***
## C_C - #_C == 0 -0.222951   0.009029 -24.693  < 2e-16 ***
## C_N - #_C == 0  0.064788   0.009006   7.194 9.46e-12 ***
## N_C - #_C == 0  0.169078   0.008993  18.801  < 2e-16 ***
## N_N - #_C == 0  0.437206   0.008981  48.681  < 2e-16 ***
## C_C - #_N == 0 -0.530481   0.009030 -58.746  < 2e-16 ***
## C_N - #_N == 0 -0.242742   0.009007 -26.949  < 2e-16 ***
## N_C - #_N == 0 -0.138452   0.008994 -15.394  < 2e-16 ***
## N_N - #_N == 0  0.129676   0.008982  14.437  < 2e-16 ***
## C_N - C_C == 0  0.287739   0.009257  31.082  < 2e-16 ***
## N_C - C_C == 0  0.392029   0.009244  42.407  < 2e-16 ***
## N_N - C_C == 0  0.660157   0.009233  71.500  < 2e-16 ***
## N_C - C_N == 0  0.104290   0.009222  11.308  < 2e-16 ***
## N_N - C_N == 0  0.372418   0.009211  40.433  < 2e-16 ***
## N_N - N_C == 0  0.268127   0.009198  29.151  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)

Post-hoc analysis

P0-A1

In the post-hoc analysis of the paper, we investigate the time-varying patterns of nasalization suggested by two single-metric acoustic features of nasality. The first feature is formant-compensated A1-P0, which we invert as P0-A1 in order to directly compare with the rest of the results (Figure 8 from the paper):

ggplot(plot.data, aes(x=point_vwlpct,
                      y=-a1p0_compensated,
                      col=V1_env,
                      group=V1_env,
                      linetype=V1_env,
                      fill=V1_env)) + 
  geom_smooth(method="loess") +
  scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
  scale_y_continuous(name="P0-A1", breaks=seq(-5,10,2.5)) +
  guides(col=guide_legend(title="V env"),
         fill=guide_legend(title="V env"),
         linetype=guide_legend(title="V env")) + 
  ggtitle("P0-A1 by vowel environment") + theme_bw() + theme(legend.key.width=unit(1,"cm"))

## `geom_smooth()` using formula 'y ~ x'

F1 bandwidth

The second feature is F1 bandwidth (Figure 9 from the paper):

ggplot(plot.data, aes(x=point_vwlpct,
                      y=width_f1,
                      col=V1_env,
                      group=V1_env,
                      linetype=V1_env,
                      fill=V1_env)) + 
  geom_smooth(method="loess") +
  scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
  scale_y_continuous(name="F1 bandwidth (Hz)", breaks=seq(100,800,100)) +
  guides(col=guide_legend(title="V env"),
         fill=guide_legend(title="V env"),
         linetype=guide_legend(title="V env")) + 
  ggtitle("F1 bandwidth by vowel environment") + theme_bw() + theme(legend.key.width=unit(1,"cm"))

## `geom_smooth()` using formula 'y ~ x'

Supplementary Materials:
“An investigation of the dynamics of vowel nasalization in Arabana using machine learning of acoustic features”

Christopher Carignan

2022-06-27

Data

Additional acoustic measures

Delta features

Data re-sampling

Re-sampling function

Performing re-sampling

XGBoost

Oral and Nasal features

Tuning hyper-parameters

Optimize final model

Generate NAF values

Do it again!

Results

Qualitative results

Individual speaker results

Quantitative results

Linear mixed model

Post-hoc analysis

P0-A1

F1 bandwidth

Supplementary Materials: “An investigation of the dynamics of vowel nasalization in Arabana using machine learning of acoustic features”

Christopher Carignan

2022-06-27

Data

Additional acoustic measures

Delta features

Data re-sampling

Re-sampling function

Performing re-sampling

XGBoost

Oral and Nasal features

Tuning hyper-parameters

Optimize final model

Generate NAF values

Do it again!

Results

Qualitative results

Individual speaker results

Quantitative results

Linear mixed model

Post-hoc analysis

P0-A1

F1 bandwidth

Supplementary Materials:
“An investigation of the dynamics of vowel nasalization in Arabana using machine learning of acoustic features”