Satoru Ozaki (ikazos)

Research

I'm interested in:

Here is a more detailed list of my specific interests.

Mismatch in VP Ellipsis

A phrase can only be elided if it is "identical" to an antecedent. However, VP Ellipsis is known to tolerate certain cases of "mismatch" between the syntactic forms and/or the semantic contents of the antecedent and elided objects. What do these examples of mismatch tell us about the identity condition on VP Ellipsis?

I have looked at a construction I call conditional wh-questions with VP Ellipsis:

  (1) What Kraftwerk song would Bill play if he were asked to Δ?
  (2) If Bill were asked to Δ, what Kraftwerk song would he play?

The interpretation of the elided object differs in (1) vs. (2). In (1), it is a variable bound by the wh-phrase. In (2), it is a non-specific indefinite. These interpretations are paraphrased in (3) and (4):

  (3) What1 Kraftwerk song would Bill play t1 if he were asked to <play it1>?
  (4) If Bill were asked to <play some Kraftwerk song>, what Kraftwerk song would he play?

In my NELS 55 poster (poster, paper) and GLOW 47 talk (handout, slides), I show that the VP Ellipsis in these constructions is difficult to account for under either a syntactic or a semantic identity condition. A syntactic identity condition prevents the antecedent object (a trace) from undergoing vehicle change into the elided object (a pronoun). A semantic identity condition doesn't tolerate the scope mismatch between the antecedent object (a wide-scope wh-phrase) and the elided object (a narrow-scope indefinite).

In a paper accepted at Natural Language Semantics (preprint), I describe an analysis based on a semantic identity condition that models the elided non-specific indefinite using Skolemized choice functions. This analysis successfully derives a connection between the possible interpretation of the elided object and the relative height at which the if-clause is attached to the wh-question: the elided object has a bound reading iff the if-clause is inside the wh-question.
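
A rough sketch of the core ingredient (my gloss of the standard definitions, not necessarily the paper's exact formulation): a choice function picks a member of any non-empty set, and its Skolemized variant additionally takes an individual argument, so that the choice may covary with that individual. The elided non-specific indefinite in (2) is then interpreted as the value of such a function applied to the set of Kraftwerk songs:

    f(P) \in P \text{ for any non-empty } P                        % choice function
    f_x(P) \in P, \text{ the choice possibly covarying with } x    % Skolemized choice function
    \text{elided object in (2)} \approx f_x(\{\, z : z \text{ is a Kraftwerk song} \,\})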

Structure-building metrics predict difficulty of ambiguity resolution

Garden-path effects (e.g., a slowdown at the word "remained" while reading (5) but not (6)) are the poster child for sentence processing difficulty. Surprisal theory is known to underpredict garden-path effects (Huang et al. 2024).

  (5) The girl fed the lamb remained relatively calm.
  (6) The girl who was fed the lamb remained relatively calm.
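
For reference, surprisal is the negative log probability of a word given its preceding context:

    S(w_t) = -\log_2 P(w_t \mid w_1, \ldots, w_{t-1})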

On the other hand, quantitative metrics that reflect structure-building operations improve the prediction of processing difficulty in eye-tracking and neuroimaging data above and beyond surprisal. Are such structure-building metrics good predictors of sentence processing difficulty in self-paced reading? What about garden-path effects?

In a SCiL 2024 talk (joint work with Aniello De Santo, Tal Linzen and Brian Dillon), we show that a structure-building metric derived from a wide-coverage CCG parser predicts sentence processing difficulty above and beyond surprisal. This metric is called incremental node count: for each word w, it counts the number of CCG rules applied after processing w in a maximally left-branching parse (i.e., the most eager parse) of a given sentence. We combine this metric with a reprocessing account of garden-path effects and test its predictions for garden-path effects using three kinds of constructions (MV/RR, NP/S, NP/Z). It still underpredicts garden-path effects for two out of the three constructions, as well as the variation in the strength of garden-path effects across the three constructions.
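
As a toy illustration of the bookkeeping behind the metric (a sketch only: it assumes the eager, maximally left-branching parse is already available as a sequence of shift/reduce actions, and the action format and toy derivation below are made up rather than real parser output):

    from typing import Iterable, List, Tuple

    # Hypothetical action format: ("shift", word) or ("reduce", rule_name),
    # assumed to come from a maximally left-branching (eager) CCG parse.
    Action = Tuple[str, str]

    def incremental_node_count(actions: Iterable[Action]) -> List[Tuple[str, int]]:
        """For each word, count the CCG rule applications (reduces) performed
        after that word is shifted and before the next word is shifted."""
        counts: List[Tuple[str, int]] = []
        for kind, payload in actions:
            if kind == "shift":
                counts.append((payload, 0))
            elif kind == "reduce" and counts:
                word, n = counts[-1]
                counts[-1] = (word, n + 1)
        return counts

    # Toy derivation for "The girl slept" (rule names are placeholders):
    actions = [
        ("shift", "The"), ("shift", "girl"), ("reduce", "fa>"),
        ("shift", "slept"), ("reduce", "ba<"),
    ]
    print(incremental_node_count(actions))
    # [('The', 0), ('girl', 1), ('slept', 1)]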

Our results add to the converging evidence in the literature that surprisal isn't the single causal factor behind processing difficulty; the cost of algorithmic operations that build structure could be another factor. A search for an independently motivated, accurate predictor of garden-path effects based on structure-building metrics is left for future work.

(Large) language models and human-like abstract generalizations

Language is full of cases where superficially distinct constructions share common properties. A linguist's account of such cases often posits an abstract construct shared by these constructions and derives the shared properties as consequences of its presence. Such an account could be right or wrong, but the fact remains that humans tacitly form abstract generalizations that correctly relate superficially different constructions. For language models to be considered viable grammars or cognitive models, they have to make the same human-like abstract linguistic generalizations. Do they?

Filler-gap dependencies

One example of such an abstract generalization is filler-gap dependencies. English has various kinds of filler-gap dependencies (e.g., wh-questions like (7) and tough-movement like (8)). They look different (e.g., wh-questions need a wh-phrase, tough-movement needs a tough-predicate), but they behave similarly in two ways: the filler-gap dependencies they create are unbounded (they can cross any number of bridge predicates), and they show island effects.

  (7) Who did you meet { t / *a person who knows t in your family }?
  (8) They were difficult to meet { t / *a person who knows t in your family }.

Wilcox et al. (2018) show that two LSTM language models learn Complex NP islands and wh-islands for embedded wh-questions, a specific kind of filler-gap dependency. How do these results generalize to other kinds of filler-gap dependencies?

In a SCiL 2022 paper (joint work with Dan Yurovsky and Lori Levin), we show that these two models learn these island effects to varying degrees for four other kinds of filler-gap dependencies (it-clefts, comparatives, topicalization, tough-movement), but in all cases less robustly than for embedded wh-questions. This suggests that these language models aren't capable of generalizing their acquisition of island effects across constructions.
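
For concreteness, here is a sketch of the surprisal-based 2x2 design that this line of work uses to test whether a model has learned a filler-gap dependency. This is my simplified reconstruction: it uses GPT-2 via HuggingFace transformers as a stand-in model, whole-sentence surprisal instead of region-by-region surprisal, and made-up items.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def surprisal_bits(text: str) -> float:
        """Total surprisal of a string (in bits) under the model."""
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        token_lp = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
        return (-token_lp.sum() / torch.log(torch.tensor(2.0))).item()

    # 2x2 design crossing the presence of a filler with the presence of a gap.
    # These items are illustrative, not the items used in the actual studies.
    conds = {
        ("+filler", "+gap"): "I know who the guests invited to the party.",
        ("+filler", "-gap"): "I know who the guests invited the host to the party.",
        ("-filler", "+gap"): "I know that the guests invited to the party.",
        ("-filler", "-gap"): "I know that the guests invited the host to the party.",
    }
    s = {k: surprisal_bits(v) for k, v in conds.items()}

    # A model that has learned the dependency should penalize a gap without a
    # filler but not with one, yielding a positive licensing interaction.
    interaction = (s[("-filler", "+gap")] - s[("-filler", "-gap")]) \
                - (s[("+filler", "+gap")] - s[("+filler", "-gap")])
    print(f"licensing interaction: {interaction:.2f} bits")

Island effects are then diagnosed by checking whether this interaction shrinks or disappears when the gap sits inside an island (e.g., a complex NP).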

Parasitic gaps

Yet another example is parasitic gaps. A parasitic gap always appears with another gap, as in (9). But not all sentences with two gaps contain a parasitic gap: (10) is an across-the-board (ATB) extraction. Then there are more controversial cases like (11), which not all syntacticians consider a parasitic gap.

  (9) Which novel did the reviewer read t before criticizing t?
  (10) Which novel did the reviewer read t and criticize t?
  (11) This is the novel that every reviewer who reads t criticizes t.

Syntactic priming is speakers' tendency to reuse recently encountered syntactic constructions. Momma et al. (2024) show that prototypical parasitic gaps like (9) don't prime ATB examples like (10), but they do prime more controversial parasitic gaps like (11). These results could suggest that humans form an abstract generalization over (9) and (11) that excludes (10). Do language models behave the same way?
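
In language models, priming is typically quantified as a surprisal difference: does a structurally matching prime lower the surprisal of the target more than a mismatching prime does? Below is a minimal sketch of that idea (my simplification, not necessarily the poster's exact method; the scoring function is assumed to behave like the surprisal_bits sketch above).

    from typing import Callable

    def priming_effect_bits(score: Callable[[str], float],
                            prime_same: str, prime_diff: str, target: str) -> float:
        """Structural priming as a surprisal difference (in bits).

        score(text) is assumed to return the total surprisal of text under some
        language model. Conditional surprisal S(target | prime) is approximated
        via the chain rule as S(prime + target) - S(prime); tokenization at the
        seam is ignored for simplicity. A positive value means the structurally
        matching prime makes the target less surprising, i.e., a priming effect.
        """
        def cond(prime: str) -> float:
            return score(prime + " " + target) - score(prime)

        return cond(prime_diff) - cond(prime_same)

    # Illustrative use (items made up): does a parasitic-gap prime facilitate a
    # parasitic-gap target more than an ATB prime does?
    # priming_effect_bits(surprisal_bits,
    #     prime_same="Which novel did the reviewer read before criticizing?",
    #     prime_diff="Which novel did the reviewer read and criticize?",
    #     target="Which paper did the editor skim before rejecting?")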

In a HSP 2025 poster (joint work with Shota Momma), we show that 12 large language models (GPT-2, GPT-Neo, DialoGPT and Mamba) don't show human-like priming effects between parasitic gaps and non-parasitic gaps. For these models, parasitic gaps prime parasitic gaps like (9) and ATB extractions like (10) alike. Parasitic gaps either don't prime examples like (11), or they do, but with a reduced priming effect. This suggests that these LLMs are too sensitive to superficial similarities and differences between constructions and don't reliably form human-like generalizations.

Other topics