data.mat : a Matlab file with reading times from the Dundee corpus, different
           surprisal estimates, and more (see below)
data.csv : contains much of the same data, but without the data points that were
           removed from the analysis and with centered independent variables.
           Confusingly, the variable names sometimes differ between the two files.

----------------------------------- Corpus data -----------------------------------

objects          : cell array of words (with any punctuation attached) as presented
                   in the eye-tracking experiment
pos_per_obj(w)   : number of pos-tags in object w
char_per_word(w) : number of characters in word w of the Dundee corpus (i.e., object w
                   with punctuation removed)
line_pos(w)      : serial number of word w in its line during presentation
sent_pos(w)      : serial number of word w in its sentence
notletters(w)    : is 1 iff word token w contains any non-letters (e.g., because
                   punctuation is attached) or more than one capital letter
logwordprob(w)   : log probability of word w
logforwprob(w)   : log of forward probability of word w
logbackprob(w)   : log of backward probability of word w

These probabilities are based on unigram and bigram frequencies in the BNC and the
Dundee corpus separately, and then averaged over the two.

----------------------------------- Eye-tracking data -----------------------------------

RT[type](w,p) : reading time on word w for subject p. Zero reading times (e.g.,
                nonfixations) are NaN.
                type = fpass  -> first-pass time: total fixation time on the word
                                 before fixating any other word
                       rb     -> right-bounded time: total fixation time on the word
                                 before fixating any later word (i.e., fixations on
                                 earlier words are allowed, but not included)
                       gopast -> go-past time: total fixation time on any word,
                                 starting from the first fixation on the current word
                                 until a fixation on any later word (i.e., fixations on
                                 earlier words are allowed and included), but ignoring
                                 fixations on earlier sentences. The current fixation
                                 does not count if there was an earlier fixation on a
                                 later word of the same sentence.

prevnonfix(w,p) : is 1 iff word w-1 was not fixated by subject p
                  (i.e., isnan(RT(w-1,p)))
nextnonfix(w,p) : is 1 iff word w+1 was not fixated by subject p
                  (i.e., isnan(RT(w+1,p)))
bad_obj(w,p)    : is 1 iff object w should be ignored for subject p because:
                  - it is the first or last on a line
                    (i.e., line_pos(w)==1 | line_pos(w+1)==1)
                  - it is not fixated (i.e., isnan(RT(w,p)))
                  - it contains any non-letter or more than one capital letter
                    (i.e., notletters(w)==1; cf. Demberg & Keller, 2008); this includes
                    words with punctuation attached, such as the last word of a sentence
                  - it is "cannot", which receives two surprisal values
                    (i.e., pos_per_obj(w)==2; other such cases are already removed
                    because they are attached to punctuation or contain a non-letter)
                  A reconstruction of this mask is sketched at the end of this file.

----------------------------------- Surprisal data -----------------------------------

Surprisal of the w-th pos-tag according to:
surp_psga(w,:)  : the 4 PSG-a models (see the PsycSci paper for explanation)
surp_psgs(w,:)  : the 4 PSG-s models
surp_mm(w,n)    : additively smoothed Markov model of order n-1
surp_sgt(w,n)   : Simple Good-Turing smoothed Markov model of order n-1
surp_wb(w,n)    : Witten-Bell smoothed Markov model of order n-1
surp_esn(w,n,:) : the 3 ESNs with 100n hidden units

If an object consists of multiple pos-tags, the corresponding surprisal values are
summed (but note that all these cases are removed from the analysis).

avsurp_         : average surprisal ('linguistic accuracy') estimated by each model,
                  weighted by the number of subjects for which the word is included
                  (i.e., by sum(bad_word(w,:)==0))
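
----------------------------------- Example usage (sketches) -----------------------------------

The following MATLAB sketches illustrate how the variables above fit together. They are
not the original analysis code. In particular, the name RTfpass for the first-pass
reading-time matrix is an assumption; the actual variable in data.mat may be named
differently.

Reconstructing the exclusion mask bad_obj for one subject from its documented components:

    % Minimal sketch, not the authors' code. Assumes data.mat contains the variables
    % documented above; RTfpass (first-pass RT matrix, words x subjects) is an assumed name.
    load('data.mat');

    p  = 1;                                     % example subject
    lp = line_pos(:);                           % force column orientation
    first_on_line = (lp == 1);                  % first word on a line
    last_on_line  = [lp(2:end) == 1; true];     % next word starts a new line, or final word

    my_bad = first_on_line | last_on_line ...   % first/last word on a line
           | isnan(RTfpass(:, p)) ...           % not fixated by subject p
           | notletters(:) == 1 ...             % non-letters or extra capitals
           | pos_per_obj(:) == 2;               % two pos-tags (e.g., "cannot")

    % my_bad should now (approximately) match bad_obj(:, p)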
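The avsurp_ values can be read as a weighted mean over words, with each word weighted by
the number of subjects for which it survives the exclusions. A minimal sketch, assuming
that bad_obj is the word-by-subject matrix referred to as bad_word in the formula above:

    % Minimal sketch of the weighted average surprisal ('linguistic accuracy').
    % Assumes bad_obj is the matrix called bad_word in the description above.
    n = 2;                                      % example: Markov model of order n-1 = 1
    weights = sum(bad_obj == 0, 2);             % per word: number of subjects without exclusion
    avsurp_example = sum(surp_mm(:, n) .* weights) / sum(weights);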
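data.csv is described as the same data without the excluded data points and with centered
independent variables. The sketch below shows how a comparable long-format table could be
assembled from the .mat variables; the column names, the centering of word length, and the
RTfpass name are assumptions, and the actual data.csv may contain different columns.

    % Minimal sketch, not a reconstruction of the actual data.csv.
    keep = (bad_obj == 0) & ~isnan(RTfpass);    % usable (word, subject) data points
    [w_idx, p_idx] = find(keep);                % column-major order, matching RTfpass(keep)

    cpw = char_per_word(:);
    tbl = table(w_idx, p_idx, RTfpass(keep), cpw(w_idx), ...
                'VariableNames', {'word', 'subject', 'fpass', 'nchar'});
    tbl.nchar_c = tbl.nchar - mean(tbl.nchar);  % centered independent variable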