Let’s first load the main data frame, which contains information about the headwords, the speaker, the vowel environment, and the acoustic features of nasality obtained from the Nasality Automeasure Praat script created by Will Styler:
data <- readRDS("Arabana_data.Rda")
head(data)
## filename item speaker headword gloss V1_env
## 1 anthunha_pidla_my_name_Maltya_awarda 1 laurie anthunha my, mine #_N
## 2 anthunha_pidla_my_name_Maltya_awarda 1 laurie anthunha my, mine #_N
## 3 anthunha_pidla_my_name_Maltya_awarda 1 laurie anthunha my, mine #_N
## 4 anthunha_pidla_my_name_Maltya_awarda 1 laurie anthunha my, mine #_N
## 5 anthunha_pidla_my_name_Maltya_awarda 1 laurie anthunha my, mine #_N
## 6 anthunha_pidla_my_name_Maltya_awarda 1 laurie anthunha my, mine #_N
## timepoint point_vwlpct point_time freq_f1 freq_f2 freq_f3 amp_f1 amp_f2
## 1 1 0 0.239728 861 1584 2619 39.98 38.71
## 2 2 10 0.251218 861 1584 2619 39.98 38.71
## 3 3 20 0.262708 560 1563 1982 35.70 40.18
## 4 4 30 0.274198 861 1561 2113 36.99 39.08
## 5 5 40 0.285688 581 1472 2057 36.89 38.04
## 6 6 50 0.297178 861 1617 1620 38.41 43.59
## amp_f3 width_f1 width_f2 width_f3 amp_p0 a1p0 a1p0_compensated p0prominence
## 1 9.96 223 58 465 40.45 -0.47 1.57176 2.72
## 2 9.96 223 58 465 40.45 -0.47 1.57176 2.72
## 3 21.13 468 141 1403 43.47 -7.77 -5.72529 4.08
## 4 27.06 462 208 1791 44.66 -7.67 -5.62947 4.54
## 5 24.00 367 254 769 46.90 -10.01 -7.95671 6.32
## 6 43.59 640 84 2726 47.77 -9.36 -7.32675 7.88
## amp_p1 a1p1 a1p1_compensated a3p0 h1h2
## 1 39.98 0.00 4.93580 -30.48983 2.71
## 2 39.98 0.00 4.93580 -30.48983 2.71
## 3 34.73 0.97 3.43353 -22.34065 4.09
## 4 37.55 -0.56 2.15151 -17.60132 4.55
## 5 39.53 -2.64 -0.00981 -22.89949 6.32
## 6 38.41 0.00 2.53667 -4.18175 7.89
For each item, there are 11 separate measurements (i.e. 11 equidistant time points of measure throughout the vowel interval). To obtain a count of the total number of utterances produced by each speaker, we will need to divide the total number of observations for each speaker by 11. Let’s first look at the headwords produced by Laurie, along with their respective English glosses and total counts:
library(dplyr)
data %>%
filter(speaker=="laurie") %>%
group_by(headword, gloss) %>%
summarize(n=n()/11) %>%
print(n=Inf)
## # A tibble: 63 × 3
## # Groups: headword [63]
## headword gloss count
## <chr> <chr> <dbl>
## 1 akuru over there 2
## 2 ampurdu central, middle 1
## 3 ananthara 1pl.Nom 2
## 4 anpa 2sg.Abs 3
## 5 antha 1sg.Abs 18
## 6 anthunha my, mine 22
## 7 anti soon 1
## 8 antu 2sg.Erg 2
## 9 apira white gum 7
## 10 apityi father 1
## 11 apurlu apple 2
## 12 arabana language name 3
## 13 athu 1sg.Erg 17
## 14 awarda that 9
## 15 kadlhu liver 1
## 16 kadlhumpa lungs 1
## 17 kadliti a personal name 1
## 18 kadnha stone 6
## 19 kalka evening 1
## 20 kalta lizard sp 2
## 21 kangi too much 2
## 22 kanta-lhuku pour-Purp 1
## 23 kantha broom 1
## 24 kantimalyu gum, resin 1
## 25 kantunya wallaby sp 1
## 26 kantyara grass 9
## 27 karla creek 2
## 28 madla dog 9
## 29 madlanti bad 1
## 30 madli cold 4
## 31 maka fire 3
## 32 malaru however 1
## 33 maltya not 1
## 34 manga-rnda be.ashamed-Present 3
## 35 mangu forehead 1
## 36 mani to get 4
## 37 mankarra girl 3
## 38 manta-li tie-Habitual 1
## 39 manthara wattle sp 2
## 40 manungka-rda remember-Pres 2
## 41 marka to crawl 1
## 42 marna mouth 3
## 43 pangki ribs 1
## 44 panki-rda happy-Pres 1
## 45 panpa-rda taste-Present 1
## 46 panti-li fight-Habitual 2
## 47 pantya knee 2
## 48 parnda big 15
## 49 thadla frightened 1
## 50 thanthi grandchild 4
## 51 wabma carpet snake 5
## 52 wadna yamstick 3
## 53 wadni corroboree 1
## 54 wakarla crow 1
## 55 wangka word 1
## 56 wanka to rise 1
## 57 wanpa to distrust 1
## 58 wanta to fly 1
## 59 wanti corkwood 2
## 60 wathili proper, real 2
## 61 yalka plant sp 5
## 62 yangkathara thristy 2
## 63 yanhi to speak 1
And now we’ll do the same for Sydney, who has much more data available:
data %>%
filter(speaker=="syd") %>%
group_by(headword, gloss) %>%
summarize(count=n()/11) %>%
print(n=Inf)
## `summarise()` has grouped output by 'headword'. You can override using the
## `.groups` argument.
## # A tibble: 67 × 3
## # Groups: headword [67]
## headword gloss count
## <chr> <chr> <dbl>
## 1 alantha 1dl.Nom.SM 66
## 2 alhakiya 1dl.Nom.DM 65
## 3 amanyi grandmother 31
## 4 anari this way 66
## 5 angka alive 6
## 6 anhaku Dont know 77
## 7 anthunha mine 40
## 8 arla-thi-rnda appear-Pres 5
## 9 kadlara cloud 7
## 10 kadnanundi scorpion 7
## 11 kadnha stone 65
## 12 kadnha-pardi grasshopper sp 7
## 13 kadnhaardi small stones 6
## 14 kala-rrantja lizard sp 5
## 15 kalka choose 6
## 16 kalta lizard sp 5
## 17 kangi persistent 38
## 18 kantyara grass 10
## 19 karla creek 6
## 20 karlatharra bush turkey 7
## 21 kata louse 65
## 22 katharungka white cockatoo 74
## 23 madla bad, dog 68
## 24 madla-yari Sturts desert pea 6
## 25 madlanti bad 6
## 26 maka fire 7
## 27 malka-malka stripey 6
## 28 malya gypsum 7
## 29 mama-rnda grab-Pres 74
## 30 manarni perhaps, maybe, meanwhile 33
## 31 mankarra girl 6
## 32 manuputu ignorant 43
## 33 mapaa-rnda sit.together-Present 6
## 34 marna mouth 7
## 35 marna-kathu-kathu whispery 6
## 36 mathapurda old man 7
## 37 ngadlha cheek 60
## 38 paka-rnda dig-Pres 8
## 39 palkali snatch 12
## 40 palkalya leaves 5
## 41 palyama-rnda carry.on.shoulder-Pres 7
## 42 pangkara reed 4
## 43 pinya vengeance party 2
## 44 thadlarra-ma-rnda fear-Caus-Pres 7
## 45 thalka bilby 6
## 46 wabma carpet snake 8
## 47 wabmarri plain 72
## 48 wadlampu hungry 7
## 49 wadna yamstick 72
## 50 wadna-rnda take.off-Present 6
## 51 wadnakani carpet snake 6
## 52 wakarla crow 4
## 53 wakarra back.of.neck 5
## 54 walpathi-rda hunt-Present 7
## 55 wamparla possum 7
## 56 wangka parda-rnda answer-Pres 5
## 57 wangka-matha-nta speak-together-Reflexive 5
## 58 wangka-tyimpa-rda converse-Pres 6
## 59 wanpa-rda carry-Pres 6
## 60 wanta-rda fly-Present 7
## 61 wapa-rnda seek-Pres 83
## 62 yakarra teeth 6
## 63 yalka plant sp 4
## 64 yampa stranger 5
## 65 yangka hollow 6
## 66 yangkatharra thirsty 12
## 67 yantakarra west 6
In addition to the 18 acoustic features of nasality from the Nasality Automeasure script, we add the first four spectral moments, a measure of nasal murmur based on spectral amplitude ratio, and 14 MFCCs. We are not able to make the original audio files publicly available due to privacy concerns, but we have provided here the code used to create the measurements, and we provide the derived data frame in the supplementary materials data repository:
# names of MFCCs 1-14
mfccs <- paste0("mfcc",1:14)
# prepare data frame
alldata <- c()
# loop through all files
for (file in unique(data$filename)) {
# get the file data
fdat <- data[data$filename==file,]
# read the corresponding audio file
audio <- tuneR::readWave(paste0('audio/',file,'.wav'))
# total duration (in seconds)
totaldur <- length(audio@left)/audio@samp.rate
# extract audio and de-mean to remove DC offset
snd <- audio@left - mean(audio@left)
# audio pre-emphasis
for (n in 2:length(snd)) {
snd[n] <- snd[n] - 0.97*snd[n-1]
}
# replace the wave object audio samples
audio@left <- snd
# calculate MFCCs
melcs <- tuneR::melfcc(audio,
sr = audio@samp.rate,
numcep = length(mfccs),
wintime = 0.01,
hoptime = 0.005)
# get the actual time step (may be slightly different than "hoptime")
timestep <- totaldur/nrow(melcs)
# get the MFCCs samples nearest to the time points
mfsamps <- round(fdat$point_time/timestep)
# add the MFCCs to the file data
fdat[,mfccs] <- melcs[mfsamps,]
# create spectrogram
spec <- signal::specgram(x = audio@left,
n = 1024,
Fs = audio@samp.rate,
window = 256,
overlap = 128
)
# get spectra
P <- abs(spec$S)
# convert to dB
P <- 20*log10(P)
# get the spectral time step
timestep <- diff(spec$t[1:2])
# get the spectral samples nearest to the time points
specsamps <- round(fdat$point_time/timestep)
# get first four spectral moments
moments <- c()
for (samp in 1:length(specsamps)) {
moments <- rbind(moments, emuR::moments(P[,samp]))
}
colnames(moments) <- c("COG", "variance", "skew", "kurtosis")
# add the moments to the file data
fdat[,colnames(moments)] <- moments
# nasal murmur (low/high ratio, 0-320 Hz : 320-5360 Hz) bands
thresh1 <- which.min(abs(spec$f-320))
thresh2 <- which.min(abs(spec$f-5360))
# get the spectral amplitude means within the two frequency bands
low <- colMeans(P[1:thresh1,specsamps])
high <- colMeans(P[thresh1:thresh2,specsamps])
# calculate the murmur ratio and add to the file data
fdat$murmur <- low/high
# add the file data to the combined data frame
alldata <- rbind.data.frame(alldata,fdat)
}
saveRDS(alldata,"alldata.Rda")
And now we load the derived data frame, in order to continue with a replicable pipeline:
alldata <- readRDS("alldata.Rda")
Let’s take a quick look at the data to make sure everything looks correct:
summary(alldata)
## filename item speaker headword
## Length:17743 Min. : 1 Length:17743 Length:17743
## Class :character 1st Qu.: 404 Class :character Class :character
## Mode :character Median : 807 Mode :character Mode :character
## Mean : 807
## 3rd Qu.:1210
## Max. :1613
##
## gloss V1_env timepoint point_vwlpct
## Length:17743 Length:17743 Min. : 1 Min. : 0
## Class :character Class :character 1st Qu.: 3 1st Qu.: 20
## Mode :character Mode :character Median : 6 Median : 50
## Mean : 6 Mean : 50
## 3rd Qu.: 9 3rd Qu.: 80
## Max. :11 Max. :100
##
## point_time freq_f1 freq_f2 freq_f3
## Min. : 0.040 Min. : 0.0 Min. : 432 Min. :1222
## 1st Qu.: 1.581 1st Qu.: 422.0 1st Qu.:1275 1st Qu.:2275
## Median : 3.606 Median : 562.0 Median :1440 Median :2407
## Mean : 3.759 Mean : 553.9 Mean :1452 Mean :2449
## 3rd Qu.: 5.562 3rd Qu.: 656.0 3rd Qu.:1576 3rd Qu.:2553
## Max. :16.537 Max. :2320.0 Max. :3830 Max. :5231
##
## amp_f1 amp_f2 amp_f3 width_f1
## Min. :-48.66 Min. :-48.17 Min. :-51.10 Min. : 3.0
## 1st Qu.: 31.27 1st Qu.: 21.92 1st Qu.: 9.42 1st Qu.: 137.0
## Median : 36.90 Median : 28.67 Median : 16.40 Median : 249.0
## Mean : 36.22 Mean : 27.84 Mean : 16.24 Mean : 342.3
## 3rd Qu.: 42.60 3rd Qu.: 35.01 3rd Qu.: 23.37 3rd Qu.: 438.0
## Max. : 62.00 Max. : 56.13 Max. : 48.44 Max. :6051.0
##
## width_f2 width_f3 amp_p0 a1p0
## Min. : 8 Min. : 9.0 Min. :-42.47 Min. :-44.940
## 1st Qu.: 118 1st Qu.: 208.0 1st Qu.: 36.57 1st Qu.: -7.700
## Median : 196 Median : 360.5 Median : 39.96 Median : -2.280
## Mean : 303 Mean : 496.3 Mean : 39.68 Mean : -3.467
## 3rd Qu.: 343 3rd Qu.: 617.0 3rd Qu.: 43.47 3rd Qu.: 0.500
## Max. :7932 Max. :8519.0 Max. : 59.97 Max. : 46.730
## NA's :31
## a1p0_compensated p0prominence amp_p1 a1p1
## Min. :-42.9355 Min. :-19.570 Min. :-52.93 Min. :-38.09
## 1st Qu.: -5.6284 1st Qu.: 3.605 1st Qu.: 15.65 1st Qu.: 7.54
## Median : -0.1619 Median : 7.630 Median : 22.09 Median : 13.66
## Mean : -1.3417 Mean : 8.106 Mean : 22.38 Mean : 13.84
## 3rd Qu.: 2.8873 3rd Qu.: 11.980 3rd Qu.: 29.34 3rd Qu.: 19.70
## Max. : 48.8675 Max. : 43.290 Max. : 55.58 Max. : 53.83
##
## a1p1_compensated a3p0 h1h2 mfcc1
## Min. :-35.55 Min. :-61.97 Min. :-44.15 Min. :-110.10
## 1st Qu.: 10.70 1st Qu.:-29.32 1st Qu.: 1.80 1st Qu.: 50.87
## Median : 16.41 Median :-23.41 Median : 6.71 Median : 56.58
## Mean : 16.62 Mean :-23.45 Mean : 6.51 Mean : 55.49
## 3rd Qu.: 22.06 3rd Qu.:-17.38 3rd Qu.: 11.71 3rd Qu.: 62.20
## Max. : 64.72 Max. : 29.44 Max. : 38.76 Max. : 88.93
##
## mfcc2 mfcc3 mfcc4 mfcc5
## Min. :-13.390 Min. :-20.677 Min. :-24.3642 Min. :-21.0224
## 1st Qu.: 1.158 1st Qu.: 3.128 1st Qu.: -6.1410 1st Qu.: -7.8112
## Median : 3.598 Median : 6.624 Median : -2.5414 Median : -4.4083
## Mean : 4.981 Mean : 6.239 Mean : -2.5221 Mean : -3.8751
## 3rd Qu.: 6.785 3rd Qu.: 10.099 3rd Qu.: 0.8234 3rd Qu.: -0.5944
## Max. : 30.822 Max. : 26.312 Max. : 21.1995 Max. : 27.9880
##
## mfcc6 mfcc7 mfcc8 mfcc9
## Min. :-27.9145 Min. :-24.7785 Min. :-38.379 Min. :-33.792
## 1st Qu.: -0.0594 1st Qu.: -3.8065 1st Qu.: -6.955 1st Qu.: -7.237
## Median : 6.2691 Median : 0.5021 Median : -2.263 Median : -1.572
## Mean : 5.1518 Mean : 0.5367 Mean : -2.708 Mean : -2.108
## 3rd Qu.: 11.3075 3rd Qu.: 4.6862 3rd Qu.: 2.131 3rd Qu.: 3.274
## Max. : 31.4070 Max. : 31.3032 Max. : 23.930 Max. : 23.213
##
## mfcc10 mfcc11 mfcc12 mfcc13
## Min. :-19.478 Min. :-26.742 Min. :-29.3881 Min. :-16.658
## 1st Qu.: -1.336 1st Qu.: -5.843 1st Qu.: -9.1303 1st Qu.: 5.252
## Median : 2.700 Median : -1.673 Median : -4.0143 Median : 9.637
## Mean : 2.851 Mean : -1.838 Mean : -4.1285 Mean : 9.521
## 3rd Qu.: 6.915 3rd Qu.: 2.326 3rd Qu.: 0.6071 3rd Qu.: 13.939
## Max. : 29.241 Max. : 20.220 Max. : 26.8391 Max. : 36.758
##
## mfcc14 COG variance skew
## Min. :-31.819 Min. :220.0 Min. :10517 Min. :-1.06139
## 1st Qu.: -3.233 1st Qu.:246.7 1st Qu.:24389 1st Qu.:-0.09699
## Median : 1.941 Median :255.0 Median :25565 Median :-0.02016
## Mean : 1.668 Mean :258.6 Mean :25209 Mean :-0.04533
## 3rd Qu.: 6.845 3rd Qu.:262.7 3rd Qu.:26639 3rd Qu.: 0.06739
## Max. : 26.120 Max. :381.5 Max. :32094 Max. : 0.37271
##
## kurtosis murmur
## Min. :-1.587 Min. :-1.271
## 1st Qu.:-1.355 1st Qu.: 1.375
## Median :-1.312 Median : 1.480
## Mean :-1.264 Mean : 1.485
## 3rd Qu.:-1.254 3rd Qu.: 1.600
## Max. : 1.435 Max. : 2.786
##
We can see that there are 31 NA values for the width_f3
measurement; these come from time points where Praat was not able to
reliably estimate the F3 bandwidth in the process of running the
Nasality Automeasure script. There aren’t that many of these
occurrences, and they all come from Sydney’s data:
table(alldata$speaker[is.na(alldata$width_f3)])
##
## syd
## 31
…which accounts for only 0.2% of this speaker’s total data set:
table(alldata$speaker[is.na(alldata$width_f3)]) / nrow(alldata[alldata$speaker=="syd",])
##
## syd
## 0.002010115
…so we’ll simply set these values to be equal to the overall mean for Sydney’s F3 bandwidth measure:
alldata$width_f3[is.na(alldata$width_f3)] <- mean(alldata$width_f3[alldata$speaker=="syd"], na.rm=T)
In order to capture abrupt temporal changes in the acoustic measurements, we will also calculate the delta values for all of these features (i.e. sample-wise differences). We’ll first make a list of the feature names, and then use this to calculate the delta features:
features <- c('freq_f1','freq_f2','freq_f3',
'amp_f1','amp_f2','amp_f3',
'width_f1','width_f2','width_f3',
'amp_p0','a1p0','a1p0_compensated','p0prominence',
'amp_p1','a1p1','a1p1_compensated',
'a3p0','h1h2',paste0('mfcc',1:14),
'COG','variance','skew','kurtosis','murmur')
for (item in unique(alldata$item)) {
it.dat <- alldata[alldata$item==item,]
for (feature in features) {
alldata[[paste0(feature,".d")]][alldata$item==item] <- c(0, diff(it.dat[[feature]]) )
}
}
We’ll add these to the list of feature names, and we can see that we now have 74 features to work with:
features <- c(features, paste0(features,".d") )
length(features)
## [1] 74
Re-sampling of the data was crucial for creating balanced data sets across the six vowel environments, while also retaining as much of the data as we are able to retain. There are a number of processes from this point forward that will involve a random component, so we’ll use a random seed for replicability:
myseed <- 123
set.seed(myseed)
The re-sampling function is taken from the multivariate resampling procedure, which allows for over-sampling of data by creating Euclidean-distance-based weighted averages of feature vectors of randomly selected nearest neighbors in a multidimensional space. We replicate the function here, as accessed on 21 June, 2022, for the sake of transparency:
multivar_resamp <- function (inputdata, groupby, resampnum, features) {
outputdata <- c()
for (group in unique(inputdata[[groupby]])) {
subdata <- inputdata[inputdata[[groupby]]==group,]
if (nrow(subdata) > resampnum) { # more samples than the desired number?
# undersample to the desired number
newdata <- subdata[sample(nrow(subdata), resampnum, replace=F), ]
# print a success message
print(paste0("'",groupby,"' level named '", group, "' has been under-sampled from ", nrow(subdata), " to ", resampnum, " observations"), quote=F)
} else {
if (nrow(subdata) == resampnum) { # same number of samples as the desired number?
# keep the original data as-is
newdata <- subdata
# print a success message
print(paste0("'",groupby,"' level named '", group, "' has been kept at ", nrow(subdata), " observations"), quote=F)
} else { # fewer samples than the desired number?
# let's oversample!
oversamp <- resampnum - nrow(subdata) # number of new observations to use in oversampling
sub.scaled <- subdata # original data to scale
means <- apply(as.matrix(sub.scaled[,features]), 2, mean) # get the original feature means
stdevs <- apply(as.matrix(sub.scaled[,features]), 2, sd) # get the original feature standard deviations
sub.scaled[,features] <- scale(subdata[,features]) # scale the original features
oversamp.data <- c() # oversampled data to build
for (samp in 1:oversamp) {
# randomly choose an observation from the scaled feature matrix
this.samp <- sub.scaled[sample(nrow(sub.scaled), 1), ]
# select all of the OTHER observations from the scaled feature matrix
other.samps <- sub.scaled[which(row.names(sub.scaled)!=row.names(this.samp)), ]
# calculate Euclidean distances between the selected observation and all other observations
dists <- apply(as.matrix(other.samps[,features]), 1, function(x) sqrt(sum((x - as.matrix(this.samp[,features]))^2)))
# sort by distance
neighbors <- sort(dists)
# while loop which ensures that no duplicate observations are created in the oversampling process
n.check <- 0
while (n.check == 0) {
# select one of the neighbors from within a Gaussian PDF
# possible duplicates are ignored in two steps
n.dist <- sample(neighbors[neighbors>0],
prob = dnorm(1:length(neighbors[neighbors>0]),0,round(0.3413*length(neighbors[neighbors>0]))),
size = 1)
neighbor <- which(dists == n.dist)[1]
# create a new observation by calculating a weighted average of the feature vectors
# associated with the selected observation and its selected neighbor
s.dist <- (n.dist-min(dists))/diff(range(dists))
new.features <- (
s.dist * (sub.scaled[which(row.names(sub.scaled)==row.names(other.samps[neighbor,])), features]) +
(1-s.dist) * (sub.scaled[which(row.names(sub.scaled)==row.names(this.samp)), features])
)
# convert weighted features back to their respective original scales
# using the means and standard deviations of the original features
new.features <- (new.features * stdevs) + means
# replace old features with new features
this.samp[,features] <- new.features
# check if it is a duplicate oversampled observation
dup.check <- duplicated(rbind(oversamp.data,this.samp))
# if it is NOT a duplicate, exit the loop and add it to the data frame
# if it IS a duplicate, run this loop again with a different neighbor
if (length(dup.check[dup.check==TRUE])==0) {
n.check <- 1
}
}
# add new observation to data frame
oversamp.data <- rbind(oversamp.data,this.samp)
}
# add oversampled data to original data frame
newdata <- rbind(subdata,oversamp.data)
# print a success message
print(paste0("'",groupby,"' level named '", group, "' has been over-sampled from ", nrow(subdata), " to ", resampnum, " observations"), quote=F)
}
}
# add resampled data to output dataset
outputdata <- rbind(outputdata,newdata)
}
return(outputdata)
}
Before performing the data re-sampling, we first remove the absolute boundaries of the vowel interval in order to avoid the most extreme effects of co-articulation on the acoustic signal:
alldata <- alldata[!(alldata$timepoint %in% c(1,11)),]
We will begin with Laurie’s data:
speakers <- c("laurie","syd")
speaker <- speakers[1]
subdata <- alldata[alldata$speaker==speaker,]
table(subdata$V1_env)
##
## #_C #_N C_C C_N N_C N_N
## 369 441 288 450 180 171
We can see that there is an unbalanced distribution of data samples across the six phonetic environments, from 171 samples in N_N to 450 samples in C_N. Since Laurie’s data is quite limited compared to Sydney’s, and since it is not possible to obtain more data from this speaker, we will maximize the data available to us by over-sampling the five minority categories so that they have an equal number of samples as the C_N environment, i.e. 450:
subdata1 <- multivar_resamp(subdata, "V1_env", max(table(subdata$V1_env)), features)
## [1] 'V1_env' level named '#_N' has been over-sampled from 441 to 450 observations
## [1] 'V1_env' level named '#_C' has been over-sampled from 369 to 450 observations
## [1] 'V1_env' level named 'C_C' has been over-sampled from 288 to 450 observations
## [1] 'V1_env' level named 'C_N' has been kept at 450 observations
## [1] 'V1_env' level named 'N_C' has been over-sampled from 180 to 450 observations
## [1] 'V1_env' level named 'N_N' has been over-sampled from 171 to 450 observations
Now let’s look at the effects of the over-sampling, by comparing the separate probability densities of the original and over-sampled data for the first 20 acoustic features. We’ll plot the probabilities densities of the original data in blue and the over-sampled data in red:
par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))
for (feature in features[1:20]) {
plot(density(subdata1[[feature]]), col='red',
main=NA,xlab=feature,
ylim = range(c(
density(subdata[[feature]])$y,
density(subdata1[[feature]])$y)))
lines(density(subdata[[feature]]), col='blue')
}
We can see that the relative distributions of the over-sampled data are similar to the original data, which is a purposeful feature of the re-sampling algorithm. In order to see the most extreme effects of the procedure, let’s make the same comparison for only the N_N environment, which underwent the greatest amount of over-sampling from 171 samples to 450 samples, i.e. a 263% increase:
par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))
for (feature in features[1:20]) {
plot(density(subdata1[[feature]][subdata1$V1_env=="N_N"]), col='red',
main=NA, xlab=feature,
ylim = range(c(density(subdata[[feature]][subdata$V1_env=="N_N"])$y,
density(subdata1[[feature]][subdata1$V1_env=="N_N"])$y)))
lines(density(subdata[[feature]][subdata$V1_env=="N_N"]), col='blue')
}
We can see that even in this most extreme case, the over-sampling has largely retained similar distributions of the original acoustic features, with some minor deviations that are closer approximations to a normal distribution, e.g., for features that exhibit bimodality such as F1 frequency and MFCC 1.
Now let’s follow a similar procedure for Sydney’s data:
speaker <- speakers[2]
subdata <- alldata[alldata$speaker==speaker,]
table(subdata$V1_env)
##
## #_C #_N C_C C_N N_C N_N
## 1224 1980 5373 1071 1449 1521
Since there is much more data available for Sydney, we are not constrained as strictly by a need to retain as much information as possible. At the same time, we don’t want to completely discard a majority of the data by under-sampling everything to the smallest number, 1071. As a compromise, we will re-sample the data set to be equal to the median value, 1485:
subdata2 <- multivar_resamp(subdata, "V1_env", round(median(table(subdata$V1_env))), features)
## [1] 'V1_env' level named 'C_C' has been under-sampled from 5373 to 1485 observations
## [1] 'V1_env' level named 'N_N' has been under-sampled from 1521 to 1485 observations
## [1] 'V1_env' level named 'C_N' has been over-sampled from 1071 to 1485 observations
## [1] 'V1_env' level named 'N_C' has been over-sampled from 1449 to 1485 observations
## [1] 'V1_env' level named '#_N' has been under-sampled from 1980 to 1485 observations
## [1] 'V1_env' level named '#_C' has been over-sampled from 1224 to 1485 observations
Now let’s compare the probability densities in the same way that we did for Laurie’s data. We can see that the overall distributional properties are retained:
par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))
for (feature in features[1:20]) {
plot(density(subdata2[[feature]]), col='red',
main=NA, xlab=feature,
ylim = range(c(density(subdata[[feature]])$y,
density(subdata2[[feature]])$y)))
lines(density(subdata[[feature]]), col='blue')
}
If we look at the extreme cases of the re-sampling, we can first examine the C_N environment, which was over-sampled from 1071 to 1485 samples:
par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))
for (feature in features[1:20]) {
plot(density(subdata2[[feature]][subdata2$V1_env=="C_N"]), col='red',
main=NA, xlab=feature,
ylim = range(c(density(subdata[[feature]][subdata$V1_env=="C_N"])$y,
density(subdata2[[feature]][subdata2$V1_env=="C_N"])$y)))
lines(density(subdata[[feature]][subdata$V1_env=="C_N"]), col='blue')
}
And now we examine the C_C environment, which was under-sampled from 5373 to 1485 samples via random selection:
par(mfrow=c(5,4), mar=c(4.1,4.5,1.1,1.0))
for (feature in features[1:20]) {
plot(density(subdata2[[feature]][subdata2$V1_env=="C_C"]), col='red',
main=NA, xlab=feature,
ylim = range(c(density(subdata[[feature]][subdata$V1_env=="C_C"])$y,
density(subdata2[[feature]][subdata2$V1_env=="C_C"])$y)))
lines(density(subdata[[feature]][subdata$V1_env=="C_C"]), col='blue')
}
In all cases, the re-sampling algorithm generates data which retain the distributional properties of the original data. Thus satisfied with the re-sampled output, we replace our data frame with the re-sampled data and verify that the six environments are now balanced for both speakers:
alldata <- rbind(subdata1, subdata2)
table(alldata$V1_env, alldata$speaker)
##
## laurie syd
## #_C 450 1485
## #_N 450 1485
## C_C 450 1485
## C_N 450 1485
## N_C 450 1485
## N_N 450 1485
In order to train the XGBoost models, we first need to select training data associated with orality and nasality of the acoustic signal. We will do this by retaining samples at 10% and 90% of the vowel interval, adjacent to either oral or nasal consonants, as appropriate.
Let’s first begin with Laurie’s data:
speaker <- speakers[1]
subdata <- alldata[alldata$speaker==speaker,]
In order to split the samples into a training set and testing set (i.e. a set to generate the NAF values), we need to numerically label the individual observations:
subdata$obs <- 1:nrow(subdata)
We now create a training set using only 10% of the vowel interval (timepoint 2 of 11) and 90% of the vowel interval (timepoint 10 of 11):
training <- subdata[subdata$timepoint %in% c(2,10),]
We now randomly sample 80% of these observations to use in training the model:
set.seed(myseed)
training.samples <- sample(training$obs, round(0.8*nrow(training)))
training <- training[training$obs %in% training.samples,]
Our outcome variable will be a numeric variable 0 or 1: 0 will correspond to oral features, i.e. acoustic features associated with time points adjacent to oral consonants, and 1 will correspond to nasal features, i.e. acoustic features associated with time points adjacent to nasal consonants. We now assign these values based on whether the time points are adjacent to an oral consonant (0) or a nasal consonant (1):
training$nasality <- training$V1_env
# oral contexts
training$nasality[training$nasality=="C_C"] <- 0
training$nasality[training$nasality %in% c("#_C","N_C") & training$timepoint==10] <- 0
training$nasality[training$nasality == "C_N" & training$timepoint==2] <- 0
# nasal contexts
training$nasality[training$nasality=="N_N"] <- 1
training$nasality[training$nasality %in% c("#_N","C_N") & training$timepoint==10] <- 1
training$nasality[training$nasality == "N_C" & training$timepoint==2] <- 1
We now restrict our training set to only these 0 and 1 numeric labels, which will exclude time points at 10% of the vowel interval in word-initial contexts:
training <- training[training$nasality %in% c(0,1),]
Now that we have our training data, we make a testing set for generating the NAF values, which will include the retained 20% of the testing data, the excluded word-initial contexts, and all time points from 20% to 80% of the vowel interval:
testing <- subdata[!(subdata$obs %in% training$obs),]
nrow(training); nrow(testing)
## [1] 402
## [1] 2298
In total, 402 samples will be used for model training and 2298 samples will be used to generate NAF values for Laurie’s data.
We now make a new data frame for model training, which consists only of the [0,1] numeric nasality variable and the 74 acoustic features. We then make an XGBoost pointer which will tell the XGBoost model which data are to be used as predictor variables and which data is to be used as the outcome variable:
train.data <- training[,c("nasality",features)]
dtrain <- xgboost::xgb.DMatrix(data = as.matrix(train.data[,features]), label = train.data$nasality)
In order to tune the hyper-parameters of the XGBoost model to find the optimal parameter combination for Laurie’s data, we will do a full grid search through the following 2079 hyper-parameter permutations:
hyper_grid <- expand.grid(max_depth = seq(2, 10, 1),
eta = seq(0, 0.5, 0.05),
gamma = seq(0, 0.2, 0.1),
subsample = seq(0.3, 0.9, 0.1))
We now loop through each permutation of the hyper-parameters and log the RMSE error in a 5-fold cross-validation model built with the respective hyper-parameter values. We use stratified sampling to ensure that an equal number of oral and nasal items are used in each iteration of the model:
xgb_test_error <- NULL
for (j in 1:nrow(hyper_grid)) {
xgb_params <- list("objective" = "reg:squarederror",
"eval_metric" = "rmse",
"max_depth" = hyper_grid$max_depth[j],
"eta" = hyper_grid$eta[j],
"gamma" = hyper_grid$gamma[j],
"subsample" = hyper_grid$subsample[j])
m_xgb_untuned <- xgboost::xgb.cv(
params = xgb_params,
data = dtrain,
set.seed(myseed),
nrounds = 25,
set.seed(myseed),
early_stopping_rounds = 3,
nfold = 5,
stratified = T,
verbose = F
)
xgb_test_error[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
}
The ideal hyper-parameters for Laurie’s data are the ones which yield the lowest RMSE error value:
hyper_grid[which.min(xgb_test_error),]
## max_depth eta gamma subsample
## 1533 4 0.25 0 0.8
We then create a parameter list using these tuned values:
xgb_params <- list("objective" = "reg:squarederror",
"eval_metric" = "rmse",
"max_depth" = 4,
"eta" = 0.25,
"gamma" = 0,
"subsample" = 0.8)
One final step is to optimize the number of iterations of model learning which strikes a balance between learning patterns in the data and avoiding over-fitting of the model to the training set. We will find this optimal iteration number by creating one final cross-validation model with early stopping based on lack of improvement in the testing loss function:
tst_model <- xgboost::xgb.cv(params = xgb_params,
data = dtrain,
set.seed(myseed),
nrounds = 1000,
early_stopping_rounds = 5,
nfold = 5,
stratified = T,
verbose = F)
Our final model is then created using all of the training data, the tuned hyper-parameters, and the optimized number of iterations:
bst_model <- xgboost::xgb.train(params = xgb_params,
data = dtrain,
set.seed(myseed),
nrounds = tst_model$best_iteration,
verbose = F)
The NAF values can now be generated by using the final XGBoost model to make response predictions from the acoustic data in the testing set:
testing$NAF <- predict(bst_model, xgboost::xgb.DMatrix(data = as.matrix(testing[,features])) )
We can investigate the NAF values for a single word. Let’s try the word kantha “grass”, which is a C_N vowel environment:
examp <- testing[testing$headword=="kantha",]
plot(examp$timepoint, examp$NAF, type='l')
We observe that the NAF values begin low and increase throughout the vowel interval, suggesting a pattern of anticipatory nasalization. Note that this item has no time points associated with either 10% (timepoint 2) or 90% (timepoint 10) of the vowel interval, indicating that this particular item did not include any of the 20% hold-out data that were excluded from the training set. In other words, the pattern that we observe in this item derives not only from data that were not included in the training data (as is the case for all of the NAF values), but also from data that had no representative timepoints in the model training at all, i.e. no data from timepoints 3-9 were ever included in the XGBoost learning.
We extract the relevant data from Laurie’s results to use for plotting:
plot.data1 <- testing[,c("point_vwlpct","NAF","V1_env",features)]
plot.data1$speaker <- speaker
We now repeat the entire process described above, but for Sydney’s data:
speaker <- speakers[2]
subdata <- alldata[alldata$speaker==speaker,]
subdata$obs <- 1:nrow(subdata)
training <- subdata[subdata$timepoint %in% c(2,10),]
set.seed(myseed)
training.samples <- sample(training$obs, round(0.8*nrow(training)))
training <- training[training$obs %in% training.samples,]
training$nasality <- training$V1_env
# oral contexts
training$nasality[training$nasality=="C_C"] <- 0
training$nasality[training$nasality %in% c("#_C","N_C") & training$timepoint==10] <- 0
training$nasality[training$nasality == "C_N" & training$timepoint==2] <- 0
# nasal contexts
training$nasality[training$nasality=="N_N"] <- 1
training$nasality[training$nasality %in% c("#_N","C_N") & training$timepoint==10] <- 1
training$nasality[training$nasality == "N_C" & training$timepoint==2] <- 1
training <- training[training$nasality %in% c(0,1),]
testing <- subdata[!(subdata$obs %in% training$obs),]
train.data <- training[,c("nasality",features)]
dtrain <- xgboost::xgb.DMatrix(data = as.matrix(train.data[,features]), label = train.data$nasality)
nrow(training); nrow(testing)
## [1] 1356
## [1] 7554
In total, 1356 samples will be used for model training and 7554 samples will be used to generate NAF values for Sydney’s data.
The hyper-parameters need to be tuned separately for Sydney’s data set:
xgb_test_error <- NULL
for (j in 1:nrow(hyper_grid)) {
xgb_params <- list("objective" = "reg:squarederror",
"eval_metric" = "rmse",
"max_depth" = hyper_grid$max_depth[j],
"eta" = hyper_grid$eta[j],
"gamma" = hyper_grid$gamma[j],
"subsample" = hyper_grid$subsample[j])
m_xgb_untuned <- xgboost::xgb.cv(
params = xgb_params,
data = dtrain,
set.seed(myseed),
nrounds = 25,
set.seed(myseed),
early_stopping_rounds = 3,
nfold = 5,
stratified = T,
verbose = F
)
xgb_test_error[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
}
#ideal hyperparamters
hyper_grid[which.min(xgb_test_error),]
## max_depth eta gamma subsample
## 1616 6 0.15 0.1 0.8
We subsequently build an optimized XGBoost model for Sydney’s data and generate the corresponding NAF values as predictions:
xgb_params <- list("objective" = "reg:squarederror",
"eval_metric" = "rmse",
"max_depth" = 6,
"eta" = 0.15,
"gamma" = 0.1,
"subsample" = 0.8)
tst_model <- xgboost::xgb.cv(params = xgb_params,
data = dtrain,
set.seed(myseed),
nrounds = 1000,
early_stopping_rounds = 5,
nfold = 5,
stratified = T,
verbose = F)
bst_model <- xgboost::xgb.train(params = xgb_params,
data = dtrain,
set.seed(myseed),
nrounds = tst_model$best_iteration,
verbose = F)
testing$NAF <- predict(bst_model, xgboost::xgb.DMatrix(data = as.matrix(testing[,features])) )
We now extract Sydney’s NAF data and add them to Laurie’s NAF data for plotting and statistical treatment:
plot.data2 <- testing[,c("point_vwlpct","NAF","V1_env",features)]
plot.data2$speaker <- speaker
plot.data <- rbind.data.frame(plot.data1, plot.data2)
We can now use the NAF data to generate the aggregate smoothed NAF values (Figure 4 from the paper):
library(ggplot2)
ggplot(plot.data, aes(x=point_vwlpct,
y=NAF,
col=V1_env,
group=V1_env,
linetype=V1_env,
fill=V1_env)) +
geom_smooth(method="loess") +
scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
scale_y_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(0.1,0.9,0.1)) +
guides(col=guide_legend(title="V env"),
fill=guide_legend(title="V env"),
linetype=guide_legend(title="V env")) +
ggtitle("NAF predictions by vowel environment") + theme_bw() + theme(legend.key.width=unit(1,"cm"))
The NAF data can be plotted separately for Sydney’s data (Figure 5 from the paper):
ggplot(plot.data[plot.data$speaker=="syd",], aes(x=point_vwlpct,
y=NAF,
col=V1_env,
group=V1_env,
linetype=V1_env,
fill=V1_env)) +
geom_smooth(method="loess") +
scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
scale_y_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(-0.1,1.2,0.1)) +
coord_cartesian(ylim=c(0,1)) +
guides(col=guide_legend(title="V env"),
fill=guide_legend(title="V env"),
linetype=guide_legend(title="V env")) +
ggtitle("NAF predictions by vowel environment: Sydney") + theme_bw() + theme(legend.key.width=unit(1,"cm"))
## `geom_smooth()` using formula 'y ~ x'
The NAF data can be plotted separately for Laurie’s data (Figure 6 from the paper):
ggplot(plot.data[plot.data$speaker=="laurie",], aes(x=point_vwlpct,
y=NAF,
col=V1_env,
group=V1_env,
linetype=V1_env,
fill=V1_env)) +
geom_smooth(method="loess") +
scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
scale_y_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(-0.1,1.2,0.1)) +
coord_cartesian(ylim=c(0,1)) +
guides(col=guide_legend(title="V env"),
fill=guide_legend(title="V env"),
linetype=guide_legend(title="V env")) +
ggtitle("NAF predictions by vowel environment: Laurie") + theme_bw() + theme(legend.key.width=unit(1,"cm"))
## `geom_smooth()` using formula 'y ~ x'
We also plot the probability distributions and box plots of all NAF values within the target vowels, i.e. all values combined, independent of the time course of the vowel interval (Figure 7 from the paper):
ggplot(plot.data, aes(y=V1_env,
x=NAF,
fill=V1_env)) +
ggdist::stat_slab(adjust=0.7, height=1, justification=-0.05,
.width=0, lwd=0.5, point_colour=NA, color="black") +
geom_boxplot(width=0.2, outlier.shape=NA, show.legend=F, notch=T) +
guides(color=guide_legend(override.aes=list(shape=21)),
fill=guide_legend(title="V env")) +
ylab("Phonetic environment of V") +
scale_x_continuous(name="Predicted degree of nasalization (NAF)", breaks=seq(-0.2,1.2,0.2)) +
ggtitle("NAF prediction distribution by vowel environment") + theme_bw()
We use the NAF values as the dependent variable in a linear mixed-effects model with the vowel environment as a fixed effect and a full random effect by speaker.
library(lme4)
library(multcomp)
lme.mod <- lmer(NAF ~ V1_env + (1+V1_env|speaker), data = plot.data,
control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5))
)
## boundary (singular) fit: see help('isSingular')
## Warning message:
## Model failed to converge with 1 negative eigenvalue: -6.7e+01
The full model fails to converge, so we simplify the model structure to include an intercept-only random effect:
lme.mod <- lmer(NAF ~ V1_env + (1|speaker), data = plot.data,
control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5))
)
We compute Tukey pair-wise contrasts with the \(\alpha\) level adjusted using the Bonferroni method in order to reduce Type I error inflation in a maximally conservative manner (Table 2 from the paper):
summary(glht(lme.mod, linfct = mcp(V1_env="Tukey")), test = adjusted("bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lmer(formula = NAF ~ V1_env + (1 | speaker), data = plot.data,
## control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05)))
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## #_N - #_C == 0 0.307530 0.008772 35.057 < 2e-16 ***
## C_C - #_C == 0 -0.222951 0.009029 -24.693 < 2e-16 ***
## C_N - #_C == 0 0.064788 0.009006 7.194 9.46e-12 ***
## N_C - #_C == 0 0.169078 0.008993 18.801 < 2e-16 ***
## N_N - #_C == 0 0.437206 0.008981 48.681 < 2e-16 ***
## C_C - #_N == 0 -0.530481 0.009030 -58.746 < 2e-16 ***
## C_N - #_N == 0 -0.242742 0.009007 -26.949 < 2e-16 ***
## N_C - #_N == 0 -0.138452 0.008994 -15.394 < 2e-16 ***
## N_N - #_N == 0 0.129676 0.008982 14.437 < 2e-16 ***
## C_N - C_C == 0 0.287739 0.009257 31.082 < 2e-16 ***
## N_C - C_C == 0 0.392029 0.009244 42.407 < 2e-16 ***
## N_N - C_C == 0 0.660157 0.009233 71.500 < 2e-16 ***
## N_C - C_N == 0 0.104290 0.009222 11.308 < 2e-16 ***
## N_N - C_N == 0 0.372418 0.009211 40.433 < 2e-16 ***
## N_N - N_C == 0 0.268127 0.009198 29.151 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
In the post-hoc analysis of the paper, we investigate the time-varying patterns of nasalization suggested by two single-metric acoustic features of nasality. The first feature is formant-compensated A1-P0, which we invert as P0-A1 in order to directly compare with the rest of the results (Figure 8 from the paper):
ggplot(plot.data, aes(x=point_vwlpct,
y=-a1p0_compensated,
col=V1_env,
group=V1_env,
linetype=V1_env,
fill=V1_env)) +
geom_smooth(method="loess") +
scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
scale_y_continuous(name="P0-A1", breaks=seq(-5,10,2.5)) +
guides(col=guide_legend(title="V env"),
fill=guide_legend(title="V env"),
linetype=guide_legend(title="V env")) +
ggtitle("P0-A1 by vowel environment") + theme_bw() + theme(legend.key.width=unit(1,"cm"))
## `geom_smooth()` using formula 'y ~ x'
The second feature is F1 bandwidth (Figure 9 from the paper):
ggplot(plot.data, aes(x=point_vwlpct,
y=width_f1,
col=V1_env,
group=V1_env,
linetype=V1_env,
fill=V1_env)) +
geom_smooth(method="loess") +
scale_x_continuous(name="Percentage of vowel interval", breaks=seq(10,90,10)) +
scale_y_continuous(name="F1 bandwidth (Hz)", breaks=seq(100,800,100)) +
guides(col=guide_legend(title="V env"),
fill=guide_legend(title="V env"),
linetype=guide_legend(title="V env")) +
ggtitle("F1 bandwidth by vowel environment") + theme_bw() + theme(legend.key.width=unit(1,"cm"))
## `geom_smooth()` using formula 'y ~ x'