Alex Huth – 2019 CCN Workshop at Dartmouth College


– Hey everyone, yeah, excited to talk to you about this stuff. So, I’m Alex Huth. I’m a professor at UT Austin. I feel like I’m having deja vu ’cause I gave a talk in this room from the same podium
almost exactly a year ago at the summer school here.
But this is not the same talk, so if any of you were here. Okay, so the title of my talk today is, Beyond Distributional Embeddings for Modeling Brain Responses to Language. So this is talking about a couple different ways that we can go further than just using distributional
word embeddings, which have kind of become
a pretty standard tool for trying to understand how the brain represents words in language. So, I’m going to start with
(laughs) a very high level thing that is really maybe not that
meaningful in any real way, but the question that we
want to address here is how do we understand language? How do we take these words
that are coming into our ears, our brain does something with them, something happens inside our heads, there’s an idea maybe, a picture, something happens that is like
the byproduct of language. How does this process work? I don’t know that I can
explain what understanding is, so I’ll leave that for a later point. But I’m going to operationalize
this into a very specific sort of mathematical problem that is much easier to attack than trying to understand understanding. So I’m going to show you a bunch
of pictures of brains today, and I want to sort of orient you on how to read cortical flat maps which is what all these pictures are going to be. So let me start by doing that. So all the work I’m going
to show you is using fMRI to look at how the brain is
processing natural language, and to visualize this data, we’re really only interested in cerebral cortex for the most part. I’m only going to show you
data from cerebral cortex, and cerebral cortex is of
course a sheet of neural tissue. So to visualize this, we can take a high-resolution 3-D MRI scan and reconstruct a three-dimensional version of the surface. We can make relaxation cuts
into the cortical surface after inflating it and then flatten it out so that we can see the
entire cortical surface all at once like so. So this is showing all of
the cortex for one subject. This is the left hemisphere
and the right hemisphere. Occipital cortex, which is in the back of the brain, is in the middle here, and then prefrontal cortex
is out on either end. Areas that we know about in cortex, areas that we have good localizers for, we’ve outlined in white here. So we have like early visual cortex, this is V1 through V4. There’s a lot of visual, sort
of category selective areas around visual cortex. This is auditory cortex,
this is motor cortex, Broca's area and so on. But there are a lot of parts of cortex that aren't outlined here, and we don't have very good ideas or easy ways to localize them, and yet, we'll find they
still respond to language. So, the kind of experiment that I do is a little different from a lot of, sort of neuroscience,
psychology experiments. Instead of having a controlled setting, where you might have people in a number of conditions, sort of look at very specific stimuli, I just have people lay in an MRI scanner and listen to stories. That’s all. So they’re going to lay there, and listen to somebody tell them a story. It’s going to be an interesting story. We took stories from
The Moth Radio Hour, which is a wonderful storytelling podcast, or radio show, if any of you are familiar with it. They're fun, so this is actually a kind of fun experiment to be a subject for, unlike pretty much every other fMRI experiment I've ever been a subject for. They're awful. But this is just: you lay there and listen to a podcast. This is like what we
do all the time anyway. I don’t know. Some of this. So I want to show you a little bit of what this data looks like before I talk about how
we’re going to analyze it. So it’s going to be on
a flat map like this. Here we go. This is a different subject's brain. What I'm showing here is now the activity at each point in time
that we record with fMRI. It’s not neural activity, anyone who calls it
neural is lying to you. fMRI is blood flow. It is a kind of crappy indicator of something that might be vaguely neural, but it’s the best we can
get for whole human brains. Red spots are places that
have above average activity. Blue spots here have
below average activity, where average is just defined as like the average over this scan. It’s going to play out slowly, because of course we’re
recording blood flow, which changes much more slowly than actual neural activity. And you're going to hear the story that the person is hearing and see the brain activity. This is actually not one of the stories that we used in this main experiment. It's just for illustrative purposes. This is 15-second snippets
of a few different stories, stuck together. And the game that I want you guys to try to play, is to
listen and look at the words that are coming here, and try to figure out what the mapping is between these words
and the brain activity, and see if that means anything to you. So let’s give this a shot. – [Narrator] What a crazy world we’re bringing our children into. He thought it sounded
like the kind of statement that brings people closer together, pointing as it did to their common fate. But the sexy mom just glared at him and took the healthy living supplement too without asking. – [Narrator] He put Lily
in charge of the party while he was gone, and then he walked downstairs and there must have been
5000 people milling around, wrapped in furs or long overcoats, or ski parkas, leather jackets, high school, college kids, heavily champagned 60 year-olds, linking arms and singing. – [Narrator] Only the two front windows with white shades lowered
were not somehow blanketed. Your eye was constantly drawn to where the material
converged mid-ceiling, punctured by a dazzling, pink spotlight that looked like it might have
just vaporized a flamingo. She… – All right, so this is kind of tricky. I've stared at this kind of thing a lot. It's very hard to do by eye, but it turns out we
have mathematical tools that can come to the rescue. Okay, so going back to
the original question, which I posed as this very broad like, how does the brain understand language? I’m going to operationalize
that question here, as a mathematical object, which is how well can we
predict brain responses to natural language,
which is what this is, using some kind of model, some kind of mathematical
quantitative model. This doesn’t tell us
exactly how the brain… It’s very far from telling us how the brain actually
understands language, but it’s something that
we can try to optimize. It’s something that we
can try to improve on, and we can compare different models. So let me tell you how
this actually works. Okay, we use a technique
called Voxelwise modeling, that’s been developed over
the past decade or more in Jack Gallant’s lab. I’m going to run you
through the basic flow of how this goes. We take some natural stimuli. In this case, it was natural stories, we play them for our subject
and record fMRI data. Notice that this is sort of the experimental data collection part of this procedure, and there’s actually
no hypothesis embedded in the experiment whatsoever. The experiment itself is hypothesis-free, which is actually very powerful, because that means that
we’re not sort of constrained to testing specific hypotheses by our data collection procedure. But it also is a little bit difficult, because then we have to figure out how to actually instantiate hypotheses. That comes in this sort
of second stream here. So we take these same natural stimuli, we have some hypothesis
about some kind of feature that we think might be
represented in the brain. We extract that feature,
or that set of features, that feature space from the stimuli, and then in this first
stage of Voxelwise modeling, which we call the estimation stage, we’re going to estimate these
Voxelwise regression models, which predict the response of each voxel, in the fMRI data, that’s
each sort of point that we measure fMRI BOLD responses at, using the hypothesized
features, and the fMRI data. So these two come together to give us the Voxelwise
regression models, that I’ll tell you more
detail about in a moment. Then in a second stage, we’re going to take new natural stimuli, we’re going to play them for our subjects, and get new fMRI data. So this is stimuli that the
model hasn’t seen before, and the subject hasn’t seen before. We’re going to project those stimuli into our same hypothesized feature space, extract features from
them in the same way, and then use the regression models that we fit in the earlier stage to predict these new fMRI data. And we can measure how
good these predictions are. We can measure the correlation between our predicted and
actual response time courses for some voxel, and
that will give a score, a very quantitative score, of how well we’re doing
at actually predicting this piece of brain. How well we’re doing at
predicting what it does in response to natural language, which is a very good benchmark for understanding what it does in general, I think.
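To make the two-stage procedure concrete, here is a minimal sketch in Python. It assumes the stimulus features and BOLD responses (X_train, Y_train, X_test, Y_test) have already been extracted, delayed to account for the hemodynamic lag, and split into estimation and held-out sets; the variable names and the fixed regularization value are illustrative, not the lab's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Estimation stage: fit one regularized regression model per voxel.
# X_train: (n_TRs, n_features) hypothesized stimulus features
# Y_train: (n_TRs, n_voxels) BOLD responses
model = Ridge(alpha=100.0)        # in practice the penalty is chosen by cross-validation
model.fit(X_train, Y_train)

# Prediction stage: predict responses to held-out natural stimuli.
Y_pred = model.predict(X_test)    # (n_test_TRs, n_voxels)

# Scoring: correlation between predicted and actual time courses, one r per voxel.
def voxelwise_correlation(y_true, y_pred):
    zt = (y_true - y_true.mean(0)) / y_true.std(0)
    zp = (y_pred - y_pred.mean(0)) / y_pred.std(0)
    return (zt * zp).mean(0)

scores = voxelwise_correlation(Y_test, Y_pred)
```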
Okay, so if this prediction works well, we can then try to understand what it is about the features that actually drives the responses in a particular voxel. We can try to figure out which features maybe have high weights
in a regression model and use that to do some interpretation of what the different parts
of the brain are doing. That’s fine, and I’ll show
you an example of that later. But another thing we can
do which is very powerful, is that we can actually take our stimuli and project them into
different feature spaces. We’re not constrained to
just one feature space here. We’re not constrained to
testing one hypothesis, because our data didn’t
have one hypothesis embedded in it when we
collected it in the first place. So we can extract these
different sets of features, go through this whole procedure with each set of features, and then see which one
explains the brain data better. Which one of these is a better fit to the actual brain data? And the sort of underlying
assumption here, is that the feature space that
best predicts brain activity, is the closest match to what the brain is actually computing. This has a lot of caveats in it, but this is sort of broadly the philosophy of the Voxelwise modeling procedure. Okay, so let’s actually
talk about a specific model for this data. So this is a straw man model, but that's what makes it fun. So a very simple model might be that each voxel responds
some amount to each word. So here I have a voxel
response time course, this R_i(t), and my model for that is going to be a weighted sum, where the w's are going to be indicator variables for different words. So the words are going to be
indexed, j equals one through n, and this is like, suppose
one of the words is penguin, then the indicator variable
for penguin is zero, whenever you hear a
word that's not penguin. Then it's one whenever you hear the word penguin. Now beta is going to be the weight on that word. So this is saying, like, how much does this voxel respond to the word penguin? And we're saying that
the response time course of the voxel is a weighted
sum across all the words. So we’re summing across all
of these different words. So this is like saying
that each part of the brain responds a little bit, maybe response goes up,
maybe the response goes down. Maybe it doesn't do anything at all, in response to every word you hear, which is a very simplistic and kind of crappy assumption, and that's why I'm calling
that the simplest model. Okay, so we know what the words are, because we know what the stories are that we played for our subject. We know what the responses are, because we recorded them, using fMRI. So the only thing we need is the betas. So how do we get that? We’re going to do regularized regression, which I’m presenting here in
a sort of Bayesian framework, because that makes another
explanation easier later on. So the beta-hats, which
are estimated betas, these weights for each voxel, are the betas that maximize the product of a likelihood function, which is the probability of observing this response, given the betas and the words. This is just squared error, more or less, or e to the minus squared error. The product of this and a prior, which is something that we think we know about beta already. We use ridge regression, which is regularized regression, to solve for beta, which means that we have a prior that is… Essentially we assume that these betas come from a normal distribution that has zero mean and identity covariance, which means that every word is unrelated to every other word. That's sort of the core
assumption in this model: that all these words are independent of each other.
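As a sketch of this simplest model, where the response is R_i(t) = sum_j beta_ij * w_j(t) and the prior on the weights is a zero-mean, identity-covariance Gaussian, something like the following (the word-to-TR assignment and the handling of the hemodynamic delay are simplified assumptions):

```python
import numpy as np

def indicator_design_matrix(word_times, vocab, tr_onsets):
    """w_j(t): 1 if word j was heard during TR t, else 0. Simplified; a real
    model would also include lagged copies of each feature to account for
    the slow hemodynamic response."""
    index = {w: j for j, w in enumerate(vocab)}
    W = np.zeros((len(tr_onsets), len(vocab)))
    for word, onset in word_times:                  # (word, onset time in seconds)
        tr = int(np.searchsorted(tr_onsets, onset, side="right")) - 1
        if word in index and 0 <= tr < len(tr_onsets):
            W[tr, index[word]] = 1.0
    return W

def fit_ridge(W, R, lam=1.0):
    """MAP estimate under the N(0, I) prior: beta_hat = (W'W + lam*I)^-1 W'R."""
    return np.linalg.solve(W.T @ W + lam * np.eye(W.shape[1]), W.T @ R)
```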
Okay, now we can fit this model, and we can test it. In the examples I'm going to show, we fit it on about two hours of data of subjects listening to stories, and then we have one story that we held out, about 10 minutes long, and we can predict responses to that story given the models that we fit. And then we can compute the correlation between our predicted
and actual responses, and that’s what this is. So here I’ve colored each voxel, according to its correlation between the predicted and actual response. So it’s how well we’re doing at predicting that piece of brain. The colors are gray for voxels
that are non-significant, red for voxels that are kind of crummy, and yellow for voxels
that are pretty good. You can see that this model, it does kind of okay in
some parts of the brain. So we have like this sort of
higher auditory cortex stuff, does reasonably well. This like, Broca’s area
stuff does reasonably well. A little bit of other
junk in here, but it’s… I don’t know if you have
anything to compare this to, but this is not super good. Like, we can do better. So how do we improve this model? How do we go further than just saying that we have some response to each word, that they're different, and that's all we know? So one thing that I'm
interested in is semantics. I think that’s why we’re all here. So one thing that we can do
to try to improve this model, is to make a guess
before we fit the model, which is that we might
see similar responses to words that have similar meanings. So for instance, the words
month, week, and hour, all have related meanings. They all correspond to durations of time, so we might expect that if some part of the brain responds a lot to the word month, it might also respond to the word week, and to the word hour. That would be a sort of
reasonable supposition that we can make. Similarly for other categories of words. How do we get these similarities, where do these come from? This is kind of the key question. So it turns out we get
this from word embeddings. So we get them from looking
at how words are used across a large corpus of text. This is now a very standard method. It goes back 30 years, I believe, to the earliest sort
of word embedding work, although the sort of
distributional hypothesis this is based on goes back even further. The exact embedding method that we use doesn't matter; in this case I used embeddings that are about 1,000 dimensions for each word. The only thing that matters
is that the similarities between words are computed
as the dot-products between these embedding vectors. And this was actually… I’ll come back to this. Okay, so this is how we estimate sort of which words are similar. It’s based on how these
words are used in text. It’s actually based on the similarity between the contexts, in
which these words appear. That’s what we’re using as our proxy for similarity of meaning, based on, of course, this
distributional hypothesis, “You shall know a word
by the company it keeps.” Okay, so, now how do we
actually incorporate this into our model? How do we actually add in this information about which words mean similar things to this model that
predicts brain activity? So this was the model-fitting procedure that I told you about a moment ago, where we find the betas
that maximize the product of this likelihood function, which is just how well they fit the data, and a prior, which is what we think we know about them beforehand, and I told you that we use
this sort of dumb prior before, which is just that each
word is independent, but that the weights are small, but we can swap that out
for this better prior. So let’s, instead of assuming that all the words are independent, we can capture this intuition that we should have similar responses to words with similar meanings,
by changing our prior, replacing it with a normal distribution where the covariance is given by the similarity of words according to these word embeddings.
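A sketch of that change of prior: if E is the (n_words x n_dims) embedding matrix, putting a N(0, E E^T) prior on the per-word weights, so the prior covariance is the dot-product similarity between embeddings, works out to ordinary ridge regression on the embedded features W E, mapped back to word space. The names here are illustrative:

```python
import numpy as np

def fit_with_embedding_prior(W, R, E, lam=1.0):
    """MAP estimate of per-word weights under a N(0, E @ E.T) prior.
    Equivalent to ridge regression after projecting the word indicators
    into the embedding space, then mapping the weights back to words."""
    X = W @ E                                             # (n_TRs, n_dims) embedded features
    theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ R)
    return E @ theta                                      # implied weight per word, per voxel
```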
This explanation of what we're doing is not actually the explanation that we used to describe this method when we wrote a paper about it. We only realized later that this is actually sort
of the underlying thing that’s happening here. What we did originally,
was we actually fit models, instead of fitting a weight per word, we fit a weight per, sort
of word embedding dimension, and to do that, we have to talk about semantic features, and then things get weird, because these semantic
features that you get out of word-embedding spaces, they don’t necessarily make sense. They don’t mean anything on their own. But it turns out that sort of hiding under this, was this really
nice mathematical fact, that what we were doing implicitly was using the word embeddings to kind of regularize the
models that we were fitting. Anyway this is exciting. We just had a paper come out recently where we had sort of
described this new way of looking at this problem. And I want to point out here that, this is not representational
similarity analysis. It has a kind of flavor of that, it looks like that, we
have a similarity space, but this is not RSA, and in fact we have a paper that is written but not submitted yet, where we argue that there are like sort of deep statistical problems with RSA, especially when you use it to compare models. Stay tuned for that. I don't know, someday. All right, so we fit this new model, that's like the improved model that incorporates semantic similarity, and then we test it as we did before. And it turns out that it
works much better than before. And to remind you, here’s the… This is what the map looked like when we had the sort of
independent words model, and this is what it looks like when we incorporate word similarity, when we allow the semantic similarity of words according to this
distributional hypothesis to influence the weights that we fit. We do much, much, much better at predicting brain responses. All right, so we take
this as kind of evidence, that, this is the same
correlation map shown in 3-D, ’cause it’s pretty. We take this as evidence that
because these brain areas are much better predicted by this model, when we include semantic information, when we include information
about word meaning in that model, this is evidence that these brain areas
actually represent something about the meaning of these words. Or, maybe more broadly, that these brain areas are
fulfilling some function that’s related to the
meanings of the words. Okay, that’s a specific
thing that we can get into at some point, whatever. So let’s briefly go through this question of model interpretation. So now that we have models that fit well, we can ask like, what are
they actually representing? What do different voxels
represent in terms of words? One thing we can do, is we can just pick out one voxel and look at the weights for that voxel, like which words have high
weight for this voxel, and which words have low weight? So this is one voxel
from pre-frontal cortex. It responds to words like
eight, mile, upwards, maximum, nearly, meters. It’s the M & M voxel. It really doesn’t respond to words like politics, religious,
political, appreciated, response, culture, whatever. This voxel probably has something to do with processing like
spatial relationships, maybe, numbers, something like that. That seems to be what you can get out of this list of numbers, rather, this list of words. This is just one voxel out of thousands of well-predicted voxels in one subject, out of multiple subjects. And I don't think I
mentioned that explicitly, but we’re doing all this
on individual subjects. Everything I’ve shown you is
like, on individual people. There's no combining of data so far. Okay, so doing this is
laborious and kind of crummy, so let’s try to just
summarize this information in a way that we can understand it easily. I’m going to skip some steps here and just show you the punchline. We use principal components analysis to reduce these models down
to just three dimensions. We can turn those three
dimensions into a color, by turning them into the red,
green, and blue components of a color, and then we
can visualize each voxel according to this color.
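A sketch of that visualization step, assuming a matrix of fitted model weights with one row per voxel; the rescaling to colors is just one reasonable choice:

```python
import numpy as np
from sklearn.decomposition import PCA

def weights_to_rgb(voxel_weights):
    """Reduce per-voxel weights (n_voxels, n_features) to three principal
    components and rescale each to [0, 1] so they can be drawn as the red,
    green, and blue channels of a flat-map color."""
    pcs = PCA(n_components=3).fit_transform(voxel_weights)
    pcs -= pcs.min(axis=0)
    pcs /= pcs.max(axis=0) + 1e-12
    return pcs                      # (n_voxels, 3) RGB values in [0, 1]
```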
I was actually going to take this out of this talk, because, whatever, I
didn't think I had time, but then after Ray's talk this morning, I thought it was really interesting. So the first principal
component here, more or less, it separates what I often call perceptual, but what is really spatial
things from social things. Like that is the largest
distinction in the brain, in terms of like, which
brain areas respond to which words. It is perceptual or
spatial, versus social. Voxels that are green here, they really respond more to the like, perceptual, spatial things, and voxels that are red or pink respond more to the social things. Sort of crossing that axis, interestingly, we have words that are both perceptual and social, which is
like words for violence, and words for body parts are often like one corner of this space. And the other corner is words that are neither perceptual nor social, and that’s like time words, time and sort of dynamics words, which are interesting. Anyway, okay, so I don’t
want to go into detail here, but we see that these brain areas that were well-predicted by this model, are representing or responding to all different kinds of concepts. We can use this as a sort of map to try to understand what
these brain areas are doing, and maybe even more excitingly, these maps are really
consistent across subjects. So this is three different
subjects, left hemispheres, and their 3-D brains up top. And we see very similar patterns across these different brains. Maybe unsurprising, but this
is showing us a lot of detail. I’m trying to kind of impress upon you, like the level of detail that we can get out of these models. Okay, but let’s ignore sort of, what these models are
telling us about selectivity, and let’s ask, can we do
better than this model? Can we do better than this
word distribution model? Yes, definitely. Definitely we can. So one thing that this
model completely ignored was context, the context
in which words appear. It assumes essentially
the response to each word is independent of the other
words around that word. And that’s wrong. That is not factually how language works. So with my grad student, Shailee Jain, we started working on a way
to try to solve this problem, to try to introduce
context into these models. And we were really
inspired by a set of papers over the last few years, where people have used
neural network models as a way to try to get
at brain representations. So one example that's very nice is this paper from Michael Eickenberg. They used a neural network that was trained to recognize objects in images. This is a multi-layer
neural network model. You can pull out
representations from each layer of this neural network, and then use these as features in this Voxelwise encoding model, and it turns out that
the lower level features from the neural network
predict earlier visual cortex. They predict like V1, V2. The mid-levels predict like V3, V4, and the higher levels
predict like LO, V7, higher-level stuff. So actually the hierarchy
of the neural network, mapped on to the hierarchy
of visual cortex, and we were like, "Ah, that's so cool. That's a cool result."
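For reference, the flavor of that layer-wise analysis in code, sketched with a generic pretrained object-recognition network; the particular network, layer indices, and pooling are illustrative assumptions, not the pipeline from that paper:

```python
import torch
import torchvision.models as models

cnn = models.alexnet(pretrained=True).features.eval()   # any object-recognition CNN works here
chosen_layers = [2, 5, 12]                               # low, mid, and high layers (arbitrary picks)

def layerwise_features(image_batch):
    """Return one feature matrix per chosen layer; each layer's activations
    are spatially averaged so every image gets a fixed-length vector that
    can be used as a separate feature space for voxelwise encoding models."""
    feats, x = {}, image_batch
    with torch.no_grad():
        for i, layer in enumerate(cnn):
            x = layer(x)
            if i in chosen_layers:
                feats[i] = x.mean(dim=(2, 3))            # (n_images, n_channels)
    return feats
```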
There's also this paper, from just last year, from Alex Kell and Josh McDermott, and a bunch of folks. They took a really similar approach. They used a neural
network that was trained to recognize words in audio, and also do a music thing,
which is not important for this, but it was a multi-layer neural network, and then they tried to
predict voxel responses in auditory cortex from the activations at different layers of this network, and they found a similar kind of pattern. So they have like early auditory cortex, corresponds to low levels
in this neural network, and high auditory cortex
responds to higher levels in this neural network. Okay, so we’re kind of looking
for this kind of pattern. But both of these things, sort of they’re going from
one kind of representation to another. In this case, you’re going
from an image that has pixels to, like, the name of a category. In this case, they're going from like
a spectrogram of sound to what word that sound corresponded to. What is sort of, what
can we do with language, with words that might correspond to this? So one thing that’s become really popular in the NLP world in the
last year or two, is the idea of using language models to sort of pre-train for
many different tasks. So a language model is
just a neural network that is trained to predict
the next word from the previous words. You give it a set of words, and it gives you a
probability distribution over the next word in the sequence. We trained a three-layer,
recurrent neural network model, a long short-term memory (LSTM) language model, on a big corpus of text, to predict the next word
from the previous words, and then we used the internal states of this language model,
to model the brain.
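A rough sketch of how those internal states can be turned into stimulus features, assuming a trained LSTM language model; the interface of lstm_lm here (returning hidden states for every layer) is an assumption for illustration, not the actual code:

```python
import torch

def context_features(lstm_lm, word_ids, context_length, layer=1):
    """For each word, run the language model over the preceding
    `context_length` words and keep one layer's hidden state as that
    word's feature vector. lstm_lm(window) is assumed to return hidden
    states shaped (n_layers, seq_len, hidden_size)."""
    feats = []
    for t in range(len(word_ids)):
        start = max(0, t - context_length + 1)
        window = torch.tensor(word_ids[start:t + 1]).unsqueeze(0)   # (1, n_context)
        with torch.no_grad():
            hidden = lstm_lm(window)
        feats.append(hidden[layer, -1])     # state after reading the current word
    return torch.stack(feats)               # (n_words, hidden_size)
```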
Okay, this seems straightforwardly, vaguely similar to what these other folks have done, and I think you might hear
about similar things later from Layla, and I know Maria
had a really cool paper on this last year, and yeah, I don't know. So this approach, it's been getting traction. So one thing that we can vary here also, is we can vary the number of words that the neural network reads
before it makes a prediction. So we can vary the context
length that it has access to, which gives us a nice sort of knob that we can turn to try to understand how important context is, 'cause that's sort of our theory here, is that we want to add
context into this model, so we can ask like how much does it help? How much does context actually
affect these representations? So that’s what we’re showing here. So this is, now instead of showing you a bunch of different flat maps with model performance on them, I’m just taking essentially the sum across the flat map, to
just have one number, which is the total variance explained by each of these models, and that’s shown along the y axis here, and along the x axis, we
have the context length. So how many words the model reads before it makes a prediction. And well, we're not actually using the prediction, but we're using its internal state, like what is going on in
this neural network’s brain, and using it to try to
model our subjects’ brains. Okay, so the black, dotted line here, is the embedding model. That’s the semantic model
that I showed you earlier, and then these three colored lines are the three different
layers of our neural network. So the first layer, the green one here, it doesn’t show a really
strong context effect. It’s actually quite a bit better than the embedding model, and it improves a little bit with context, but doesn’t change too much. The second layer is this pink line. It actually starts off a little worse, but then improves a lot more, and it actually is our
best performing model here, it does really well at
predicting the brain activity. Then the third layer is this blue line, and it starts off a lot worse, and it improves, but it doesn’t reach the same
level as the other two here, which is kind of surprising. We’re like, “What’s going on here?” Andrew? – [Andrew] What are
the numbers in the box? – So this is the total
variance explained. So we essentially take the
r squared in each voxel, and then sum that across the entire brain. – [Andrew] The units? – The units are essentially the number of voxels' worth of variance that we've explained. So 700 would mean that we've explained the equivalent of 700 voxels' worth of variance. This is roughly equivalent to taking the mean correlation across the whole brain; they're going to be monotonically related.
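For clarity, a small sketch of that summary metric, starting from the per-voxel prediction correlations; clipping negative correlations to zero is my assumption about the bookkeeping:

```python
import numpy as np

def total_variance_explained(r):
    """Sum of r^2 across voxels, in units of 'number of voxels' worth of
    variance explained'; monotonically related to the mean correlation."""
    r = np.clip(r, 0.0, None)
    return float(np.sum(r ** 2))
```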
Yeah, so this is a control analysis: if we scramble the context, if we reorder the words
before the current word, then it turns out that that
makes these models worse, which is good, ’cause that’s
what we should expect. Let me show you a comparison
to this on the brain. So this is another flat map, where the voxels are colored according to their performance
with the semantic model and the context model. So voxels that are
white, are well-predicted by both models. Voxels that are blue are
much better predicted by the semantic, like
word embedding models, than the context model. And voxels that are red
are much better predicted by the context, than the semantic model. You’ll notice there are
very few blue voxels, and quite a few red voxels, which means this supports
the thing you saw in the previous graph, which is that this model
that incorporates context, does a much better job at
explaining the brain activity, does a much better job at predicting how the brain is going
to respond to language, than the model that did
not incorporate context. We can also ask this question about context length and how
that is affecting things. So for each voxel, we can compute a sort of context length preference index.
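One simple way such an index could be defined (this exact formula is an illustration, not necessarily the one used) is a performance-weighted average of context length for each voxel:

```python
import numpy as np

def context_length_preference(scores_by_length, context_lengths):
    """scores_by_length: (n_lengths, n_voxels) prediction performance at each
    context length. Returns the performance-weighted mean context length per
    voxel: high values mean the voxel prefers long context."""
    w = np.clip(scores_by_length, 0.0, None)
    lengths = np.asarray(context_lengths, dtype=float)[:, None]
    return (w * lengths).sum(axis=0) / (w.sum(axis=0) + 1e-12)
```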
Here, voxels are colored blue if they do better with short context, if they're better predicted by short context. And you can see an example
of such a voxel here. Interestingly, so these
voxels in auditory cortex tend to be much better predicted by short context. So if you don't include a lot of words, that is actually a better model for what's happening in these sort of higher auditory cortex areas, whereas voxels in a lot of these maybe higher-order areas, this one's in inferior temporal cortex, are much better predicted by long context. Okay, so this is kind of interesting. This is kind of cool. This corroborates other findings that we've seen in the field. Now let me show you the weird finding that came out of this that
we still don’t really know how to explain. But let’s look at layer preference. So remember I told you this model had three layers in it. We expected to see something
like these other papers, that saw a sort of progression from low level to high
level across these layers. This is what we see. So here the voxels are colored according to how well they’re predicted by these three different layers. Voxels that are green,
they’re much better predicted by layer two than the others. Red are much better
predicted by layer one, and blue are much better
predicted by layer three. And the pattern is, it’s
subtle, and it’s weird. What we actually see is that in the sort of higher level areas, these areas that liked
the very long context, they tend to actually be best predicted by maybe layer two and layer one, or somewhere in between those two, whereas these voxels that are
in maybe lower level areas, like auditory cortex, are actually better predicted
by layer one and layer three. So this doesn't match the thing that people had seen in other studies. So in these other studies,
they essentially found, you take some input, you put it in. You have this low-level representation. The neural network learns to interpolate between that and whatever
task you were training it on. You get a high-level representation there, and it pulls out these
intermediate representations that are useful for explaining
intermediate processing in the brain. So sort of the neural
networks have learned to interpolate between an input and whatever the task output is, which kind of makes sense. It’s nice, and it worked really well in both of these cases. But in our case, that
doesn’t really map nicely to what’s going on. So in our case, we have a language model where we have an input, comes in. It spreads through these different layers, but then from the output
of this third layer, it’s actually trying to
predict the next input. It’s sort of loopy. It’s recurrent, but it’s
also like the output is actually in the same
space as the input. You’re not mapping from
one kind of representation to a different one, you’re mapping from one representation, like back to the same representation. And that means that we don’t
see this kind of hierarchy. We don’t see this low-level
to high-level hierarchy, across this neural network. And I know that Maria and
Layla reported the same thing in their paper, and we’re
like, “Ah that’s great, “that’s cool that they’re
corroborating this,” but I don’t know how to explain this. I don’t know like really what it means, so if anybody has ideas,
I'd love to hear it. This, that's what I just said. Okay, so this is one way that we've tried to improve upon these word embedding, these distributional
word embedding models. Now let me tell you about a second way. It’s related to visual grounding. So this is work done by my
grad student, Jerry Tang, who’s here and is giving a
poster on this this afternoon, which you should check out, 'cause he will tell you in more detail, more precisely, what he's done than I can. Okay, so the basic idea… Of course many of you are
familiar with the idea of visual grounding of language. When you hear the word dog, you don’t just dredge
up sort of associations between the word dog and
other words that you’ve heard. Of course this is related also to things like pictures of dogs that you’ve seen. This is my dog, she’s very sweet. So there are a lot of
concepts that we maybe learn about visually, and
not just through language. So how can we capture
that in these models? How can we incorporate
that into these models? So what Jerry did was, he sort of took a page out of this Eickenberg-style analysis. What we can do, is we
can take lots of images, images from ImageNet, map them through some neural network that was trained to recognize objects in images, pull out a representation from that neural network that we think we know, actually from other studies, is representative of how visual cortex, at some intermediate to high level, represents these images, and then look at similarities
of these representations, rather than similarities of
sort of distribution properties of the words. So essentially each word gets mapped to a collection of images. Each collection of images gets pushed through this neural network, and then evaluate the similarity between different words, based on sort of how similar
the related images are. So this gives us a measure of similarity derived from visual properties instead of from
word-distributional properties. But of course this is only, this is not really naturally a model for how we learn about
things in their totality. We do learn about things using both words, and sort of visual input, as well as other modalities. So what we actually end up using is a combination of these two. So we will more or less
concatenate these two sets of features, we get
distributional features for each word, and these
visual features for each word, and then combine them into this visually grounded semantic space.
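A sketch of how such a grounded space can be built, assuming a mapping from words to sets of images and a function that returns an intermediate CNN representation for an image; all names here are illustrative, not the actual pipeline:

```python
import numpy as np

def visually_grounded_features(word_to_images, cnn_features, distributional):
    """For each word: average an intermediate CNN representation over the
    word's associated images (e.g. its ImageNet synset), then concatenate
    that visual vector with the word's distributional embedding."""
    grounded = {}
    for word, images in word_to_images.items():
        visual = np.mean([cnn_features(img) for img in images], axis=0)
        grounded[word] = np.concatenate([distributional[word], visual])
    return grounded
```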
So let me show an example of what this looks like and maybe give you a little bit of intuition. This is a bunch of words, projected into just a
two-dimensional space. I've taken these embeddings and mashed them down into two dimensions using principal components analysis, and what we're showing here
is two different categories, that I think have some kind
of intuitive relationship to what’s going on in these spaces, so we can explain what’s going on. So, the sort of blue-green words here, they’re words for people. And I’m sorry, I’m starting off here with the distributional space. So this is based on how
these words are used in written text in general. The green words, blue-green words here are words for people. So even though all these
people maybe look very similar to each other, they're pretty
far apart in this space, because they occur in
quite different contexts. You don't use the words boy and engineer in the same contexts that often; like, those are not replaceable with each other in general. So they are quite far apart in this space. In pink are words for items of clothing. And it turns out that these things, we talk about them often pretty similarly. They're used in similar ways, even though they look very different. In contrast, the people
who look pretty similar, like a person is a person. An engineer and a doctor probably look much more similar to each other, than like, I don’t know,
a doctor and a shoe. But yeah. So all the clothing items
are kind of packed together in this one little lump right here, while the people are really spread out in this distributional space. But now if we interpolate from here through the grounded
space where everything is a little more complicated, out to our purely visual space, so now the similarities here are just based on how visually similar each pair of words are, and now we have all these, all the people are kind
of more packed together in one corner of the space, because people are people, whereas the clothing items are much more spread out. So this just sort of illustrates what the difference is between our visually grounded space and our distributional space. Okay. This is the same thing, I'm just showing both at the same time. Okay, so now we have two different priors. We have a visually grounded prior that says that words should
have similar representations, if they correspond to
things that look similar, and we have the distributional prior, which is that words should
have similar representations in the brain if they correspond to things that often occur in the same contexts in language, and we can compare these two. Right, we can fit models with each one, and we can ask which model predicts better in each part of the brain. So this is showing a
flat map for one subject with a similar kind of color map to what I showed before. Here black voxels are poorly
predicted by both models. White voxels are well predicted by both. Red voxels are better predicted by the visually grounded model, and blue voxels are better predicted by the distributional model. We can see that this is more mixed than the other picture. So it’s not like the
visually grounded model is better for everything. But what we see is that
actually in the places that are close to visual cortex, so especially in these
areas that are near the edge of visual cortex, we have agglomerations of these voxels that are better predicted by the visually grounded model than by the distributional model. In pre-frontal cortex we see some of those as well, and this actually lines up with places, you know, voxels that are selective for kind of visual concepts, but we see a lot of voxels here that are better predicted by the distributional features than the visual features. Right, and this is especially true close to these sort of known visual areas. This is the parahippocampal place area, extrastriate body area, and retrosplenial complex. This is just showing a bar graph that is like an average across a couple of subjects of the model performance near each of these ROIs. This is not in the ROI, but this is for voxels that are near each of these known ROIs. So for the visual ROIs, we tend to see improvement of the visually grounded model over the distributional model, whereas in these other ROIs that are language-selective things, but not visual, this is sPMv, this is the sort of frontal language area, and Broca's area is here. We don't see that same distinction. We see actually maybe a weak preference for the distributional model. Okay, so Jerry can tell
you more about this, and there’s more analysis
to be done with this, which is pretty cool, like relating words that are very concrete to their closest neighbors that are abstract, in how those things are represented, which is pretty exciting. But yeah, so we're left with some questions from this too, especially this last thing: can we also do this grounding in other modalities, can we ground in like tactile features? What would that look like? Can we extract tactile features from words? That seems trickier. We don't have good techniques for doing this. But these are things to explore. And also, how can we combine this approach with context models? 'Cause I think that would be a nice merger
of these two different streams of research. So that’s it. I want to end by thanking
all the people involved here, especially Shailee and Jerry, and Wendy de Heer and Anwar Nunez, and Jack Gallant's lab. Thank you. (audience applauds) – Thank you very much. It looks like you have a
question, Jim, or a comment. (laughs) – So beautiful work. I really liked the
context model, especially. But the thing that always, the question that I always come up with when I see these kinds of models, based on lexical units, words, when they’re listening to a story, is what about the discourse level? It’s not, and the contextual model, using the recurrent, is a
recurrent neural network, still doesn’t get at,
they’re hearing a story, and there are… People come back, scenes come back. Episodes are referred to again. There’s all this discourse information, which is just what the
people are really listening to, and representing, and I
suspect you’re thinking about this too, and trying to think of how can you incorporate representation of semantic information that exists at the
discourse narrative level, rather than at the single word level, and what I just, what are
your thoughts about this? – Well so, absolutely I
think this is of course, what we’re trying to work toward, that’s what we want to
understand in a way, is how these higher level, larger concepts that are communicated by sentences or whole stories, are represented and what’s actually going on there. For one, we just don’t have
mathematical models of this, like we don’t have a
way to extract features that capture sort of
discourse-level elements. Maybe the best we can do right now, is we can use these same
kinds of language models, like the context model here, but with very long context lengths. So we’ve been experimenting
with this recently, going out to more than 100 words, which is not discourse,
whole discourse level, but it’s better than 20
words that we had here. Interestingly what we found there, is that this is super preliminary, that the amount of data that
we’re training the models on, starts to really matter, so the data that I
showed you, let’s see… Wait. – You started us off talking
about the recapitulation of stuff Ray was showing us
with space versus social words. And then ended on the context versus visual models, so I was wondering if you have thought about
creating a similarity space prior, based on sociality or
something like that with the words. – Yeah, that's interesting. No, we haven't really thought about that. Yeah, how would that work? Yeah, we'd get like– – You can do some polling. Just get people to make judgements, along different social
dimensions I suppose. – [Alex] Yeah, that can
be really interesting. – Some other kind of– – Grounding and social
interaction like that would be– – [Andy] Some like that. – Yeah, interesting. – Yes sir, one thing I like about this, and there’s and as a
comment and a questions, is, that it’s a way of comparing among these word-embedding spaces, is because you have
this external referent, which is the brain, and you can say, "Okay, which one predicts brain activity better?", essentially across the brain. My question is, when it
comes to the second part, which is sort of interpreting
those dimensional spaces and what goes with what, aren’t you getting out
kind of what you put in? Because you’re constraining the estimates and the model to be similar, based on these prior covariance matrices, so you kind of have to see that when you look at the
representations coming out, right? – Yes, in a way, absolutely. Like we are imposing semantic
smoothness on the word space in the sense that… If we go back to the early stages here, yeah, so, when we fit this model, the weights for month and week actually cannot be that different. They can’t, one can’t be
positive and the other negative. Like, that’s just
impossible under this model. Of course this does sort of smooth things and bring them down into essentially some lower dimensional space than the total dimensionality
of the set of words, but so we tried, in doing this sort of principal components analysis, which, let me skip forward to that one. In getting to this point, we did do a test for exactly that issue, which was we can ask about the… We can do the same principal
components analysis on the stimuli themselves,
so on the stories, which should have all of that
same semantic smoothness, all of that same built-in, sort of junk, that is being pushed through
into the final models, here. And then we compared how
much variance was explained in the brain models, from the principal components
of the stimuli themselves, versus the principal
components of the brain data. And it turns out that these
first three dimensions here, explain significantly more variance than the corresponding
pieces of the stimuli, which we took as evidence that, like, these are dimensions that are actually… there is an effect of the
feature space itself, sort of buried in here, but the fact that like this
is the first dimension, is not something that you would get from just the stimulus itself. – [Tor] And just to quickly follow up, but what about hand-coding
those categories, like in some of your earlier papers? Have you left that behind now, or do you think that's still maybe even better? – Hand-coding things is like the worst. It really (all laugh)… I spent, I don't know, two months as a second-year grad student hand-labeling images, which was totally worth it. It was great, but I just don't… I really don't want to do that again. (audience laughs) Yeah, so I mean, I really like to rely on these sort of stimulus-
computable models, like something where we define a function that you can apply to a stimulus, and that gives you a set of numbers, which also makes it practical to move to these very large data
sets like we’re doing now. Like I hand coded two hours of video, but I can’t even imagine doing that for like 20 hours of
video, which is whatever, 20 hours of stimuli like we have now for these subjects which
is, would be nightmarish. So yeah. – I had a couple of I guess,
semi-technical questions. And maybe this is because
I don’t know the full range of your work as well as you do,
which is why I’m asking you. Maybe it’s laziness, but to what extent does the choice of word embedding matter? Okay, if you looked at a
bunch of different options, and in sort of a similar vein, if you just code words by their identity, like have a one-hot matrix of words, how much do you get on top of that from having, let's say, GloVe embeddings or even transformer-created embeddings? – Yeah, so this is essentially
the identity model. This is like a one hot word embedding, where each word is just
an independent vector. That was the old model here that I'm comparing to. So that is a lot worse than
these word embeddings, but– – Well, is an embedding
the same as just one hot? I think that they’ll behave
completely differently, if you have a vector embedding that’s got a bunch of real numbers versus something that’s just like, you know, small. – I thought that, that was
your second question, though, is like how does this compare
to an identity embedding, or identity… – [Adina] Okay, yes.
– Yeah. Yeah, so I think this is exactly that. This is the identity, yeah, matrix, as a word embedding. We’ve compared a bunch of
different word embeddings. This was actually a road
that I started to go down when I started my own lab, which was like, “Ah, let’s use this
method to actually like, “compare embeddings, “and we can do something cool with that.” It turns out, most things
work really similarly, like really similarly. In fact, even the dimensionality
doesn't matter too much. You can squeeze things down to maybe like 100 dimensions, and it still works pretty
okay, with GloVe embeddings. Yeah, and in terms of sort of
what the representations are, that doesn’t change too much either. Like all of that is kind of the same. That’s true for like all these
modern embeddings actually, so for Word2Vec, for GloVe, for the embedding method that I use here, which is some ad hoc thing, which is actually very
similar to the embedding that Marcel and Tom Mitchell used in their 2008 paper, which was a whatever,
small embedding as well, these all seem to work pretty similarly, and they work much better than the sort of old-school embeddings, so like LSA or HAL, those don’t work nearly as well. So there is a big step from, like the sort of simple embeddings to these modern embeddings, I don’t know. I don’t… I could say more technical
things, but yeah. Yeah, in general, like
they’re pretty similar. – Yeah I had one sort of
clarification question, and another sort of speculative comment. But the recurrent neural network model that you used for context, I just want to clarify: the amount, the kind of degree of how far
back in time it considers, and the layers are two independent, sort of like architectural choices, like the fact that it has three layers, but for example, does that middle layer behave differently in any way, depending on how far back… – [Alex] Well let me pull that over here– – I mean so what I’m wondering is, it sort of strikes me as
almost each time point, is almost like auto encoder, but the next stage, but it’s not the input
word, it’s the next word. And so that’s why it seems to me like that middle layer would
be the important one in the same way that it
is in the auto encoder, because you’re essentially going from a word to an intermediate
representation to a word. Is that sensible? – So the middle layer, the input to the middle layer is already an intermediate representation. So the word comes into the bottom layer. The word actually there was an embedding, and then the word goes
into the bottom layer, and then the bottom layer
goes into the middle layer. The middle layers goes into the top layer. The top layer goes into like
an output embedding space, and then outputs from there. Honestly, this whole like issue, made me so deeply question the idea of like layers in
recurrent neural networks, which when you think about it deeply, it’s like what, what? Why is this a good idea? It turns out with one layer
in recurrent neural network, you can compute exactly the same things that you can compute with multiple layers. It’s just a matter of like how you connect the units together, which we wrote a paper
about, that whatever. We went down like a sort of
weird, diverted path there of like what the hell do layers mean? In like feed forward neural networks, it’s like very simple. It’s like this layer
feeds into that layer, but in recurrent network, each layer gets its own
inputs from the past, and from the previous layer. So it’s…
– [Sam] So it’s a little more. – It’s messier. – Yeah, I only have worked
with recurrent neural nets that have one layer, right, and I don’t know how to
interpret it otherwise, but yeah okay. – Yeah, I mean stacked recurrent nets have become like really common in NLP land. It turns out it's just
a method of regularizing what is essentially a
single-layer recurrent net, but yeah, and so I don't think there's anything necessarily, like, that makes it architecturally obvious that this middle layer should
be the best in some way. The like vague idea I have in my head, is something that like… Come on. That, you know, because the
output has to kind of be aligned with the next input, this layer is kind of the freest, in a way to represent things in a way that doesn’t need to be
trivially mapped from the input, and it doesn't need to be trivially mapped to the word embeddings. So it can have this, some kind of weird, high-level representation. That might be why it works so well. But yeah. – I'll save my second one for… – So I just have a question about something that I think is inherited by both the semantic and the context model, to do with how you're building in similarities in the meanings of words, and I'm just curious about what happens with common errors in similarities and meanings, and whether that's something that, basically, the model is oblivious to. It can't discern, so it might be one of those just annoying philosophy questions: to what extent is this
really semantic knowledge, rather than just beliefs
about the meanings that the agents actually have, but if you take like those
common speaker errors, that might be prevalent
across the community, but are still not really
tracking the meaning, how does that, is that
something that just gets… The model is just going to be blind to? – That's a good question. I never looked at that. That's really fun. Like… – Like for instance, factoid, people oftentimes think that
doesn’t have to be false– – [Alex] But that is what factoid means. – It’s not fact, yeah. – Yeah, or like using “begs the question” – [Una] Exactly, yeah. – to mean like raises the question. – [Una] Right. – I’m sure there are some subjects where you just get like an anger response somewhere in the brain where that happens. Yeah, we haven’t looked at this. I suspect, so whatever these models sort of say the representation is, is kind of what… It’s based on how that
word is used on average in the corpus that we put in. So if factoid is used mostly
to mean like small fact. – [Una] Right, that’s
what I would guess, yes. – False fact, which it
almost certainly is, then that’s what the model would… – Right, and so you wouldn’t get something that’s meaning, but something’s that like belief about meaning, and you can also get this
with accidental correlations in, say, phonological similarities. Or words that sound the same, you might assume mean the same, but don't. I say those would also be interesting to– – Yeah, so we have compared to phonological similarity measures. And that predicts a
very specific brain area that is essentially just
the superior temporal gyrus and that’s it. Yeah, so at least that… It can be decorrelated to some extent. – I wonder if you thought
about the ways in which, your conclusions and your big picture are dependent on the fact
that you’re explaining neural responses to Moth stories. There is more to, there are
other kinds of concepts, other kinds of discourse. Imagine a political science lecture. It’s going to have much less, I think, visual content. And each kind of domain of discourse of language, of conceptual space has its own kind of characteristics, and you’re obviously exploring
the semantic characteristics of this space. And oh by the way, Jim’s
very interesting suggestion about event structure is an element, a possibly very important element, of Moth stories– – Because they are narrative stories. Right, they have events, whereas a lecture about
something does not. So we’ve been exploring this lately, so like I mentioned, we have data for at least one subject now, of up to 20 hours of stimuli
for that one subject. This is not all Moth stories. It's about half Moth stories, but we also tried to build a little sort of factorial matrix of the different kinds of stimuli that we wanted to explore. It's like narrative versus factual, written versus sort of
spoken off the cuff, and single speaker versus multi-speaker. So we're using, like,
stories from the Atlantic, as a sort of more factual version of this. Turns out these models work extremely well for those also. – No new dimensions, no new elements. – We haven’t explored
that in enough detail yet, so I definitely can’t say that. I don’t know. But that definitely
doesn’t have the same kind of event structure, and still, at least like the basic
word embedding model still works really well for that data. And actually building
a model on like stories from the Atlantic and then testing it on Moth stories, still works very well. So there’s something that
is sort of core there, to just like the meanings of these words and the semantic
dimensions that they span. – Do you think it would work if you developed an embedding model based on Atlantic stories? – Haven’t tried that yet. That’d be interesting. – [Man In Background] I
think I would let you–
