NeuroSAT: Learning a SAT Solver from Single-Bit Supervision


>>It’s my great pleasure
to introduce Daniel Selsam. Daniel is a student, a PhD candidate at Stanford. He was interning for
us one and a half years ago. I have been collaborating with him
for like three years, on completely different topics. Last time he visited
us he was using formal methods to verify
machine learning algorithms. And today he is
showing the progress he made in trying to
learn a SAT solver, to have a neural network
that learns SAT.>>Thanks Leo. Hi, everybody
thanks for coming today. I’ll be showing you how
to learn an entire SAT Solving Algorithm from
Single-Bit Supervision. And this is all joint work with Leonardo and several others
back at Stanford. So, as most of you know
neural networks have beaten the state-of-the-art on
many important problems just in the last few years. Most notably object recognition
and speech recognition. But there have also been
major successes in diverse, and I think unexpected domains including machine translation,
social network analysis, heuristics for board games
like chess and Go, even predicting
properties of molecules, optimizing data
center efficiency, and surprisingly even learning high-performance
indexing data structures. And yet we still know
very little about what neural networks
can and cannot do. In this work we consider
the following question, can a Neural Network learn to perform precise
logical reasoning? And more specifically, can a neural network learn
to solve SAT problems? Here’s our setup. We’re going to train a neural network as a classifier on a data set
of input output pairs, where the input is
a SAT problem P, and the output is a Single-Bit indicating whether or
not P is satisfiable. At test time we’re
going to feed problems into our trained model
which we call NeuroSAT. And it’s going to predict
as it was trained to do, whether the problems
are satisfiable or not. As we’ll see later in the talk, it’s able to predict
with high accuracy and when it does
predict satisfiable, we can almost always
decode an actual solution, a satisfying assignment from the internal activations
of the network. And we’ll see how
this works shortly. For some background on SAT, I think most of you will be familiar with this
but just in case. A propositional formula can be represented as an AND of ORs. So here’s an example formula: C1 = (not X1 or X2), C2 = (not X2 or X1). Some jargon: X1, not X1, X2, and not X2 are
all called literals. And C1 and C2 are
both called clauses. Now a formula is satisfiable if there exists
a satisfying assignment. So the formula above is
satisfiable because there exists a satisfying
assignment. Specifically, the assignment that maps
X1 and X2 both to true satisfies it, because the first clause is satisfied
because X2 is true, and the second clause is
satisfied because X1 is true. And the SAT problem
is given a formula, determine if it is
satisfiable and if it is, find a satisfying assignment. Now, our goal is to
train a neural network to predict satisfiability
and we’re going to talk about decoding
solutions and actually solving the problems later on. And there are two main
design challenges: what problems do we train on? And with what kind
of architecture? Let’s talk about
training data first. So, the issue is that some problem
distributions might be arbitrarily easy to classify, based on superficial properties. But, a network that’s trained on such a data set
probably will exploit those properties and
won’t generalize and we won’t really have learned
anything semantic about SAT. So we want to construct
difficult problems to force it to learn
something general, to be more systematic about
exploring the SAT problem. So we define a distribution
that we call SR of N over problems such that the problems
come in pairs, one unSAT, one SAT, and they differ
by negating a single literal and a single clause but otherwise have
the same statistics. So, the idea is that no simple superficial
statistical analysis will be able to distinguish
between the two. And specifically
to sample a pair of problems from SR of N, we keep sampling uniformly
random clauses until the problem
is unsatisfiable. At which point we flip a single literal in the final clause
to make it SAT, and we return the pair before and after we
flip that literal. This distribution
is not guaranteed to be hard in any complexity-theoretic sense, but it’s intuitively
statistically hard. For small values of
N it’s still easy for traditional SAT solvers but it will be very hard for
a standard classifier. Now let’s talk about
Network Architecture. So we could treat SAT problems
as character strings.>>You said these are clauses?>>Yeah.>>You keep sampling
random clauses?>>As if they were
random, you know, as if it is uniform
random SAT.>>And then you add
clauses. You can->>We start with
N variables and no clauses.>>Okay.>>We’ll sample
a uniformly random clause on the N variables, and call a SAT solver
to see if it’s satisfiable.>>Right.>>If it’s satisfiable
we do it again. We keep adding
clauses like this.>>So you keep adding
clauses, okay that wasn’t clear here.>>Yeah. Sorry, we keep
sampling random clauses and adding them until.>>And keep adding
them until unSAT.>>Yeah.
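The sampling loop just clarified can be sketched in toy Python. This is not the authors' code: the brute-force `is_satisfiable` below stands in for the incremental SAT-solver check, and the clause-width distribution is a made-up placeholder (the talk only says clauses are uniformly random, about five literals each).

```python
import itertools
import random

def is_satisfiable(clauses, n):
    # Brute-force check over all 2^n assignments; fine for tiny n.
    # A real pipeline would query an incremental SAT solver instead.
    for bits in itertools.product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def sample_sr_pair(n, seed=0):
    # Keep adding random clauses until the problem becomes unSAT, then
    # negate one literal in the final clause to recover a SAT problem.
    rng = random.Random(seed)
    clauses = []
    while True:
        k = min(n, rng.randint(2, 5))            # placeholder clause width
        variables = rng.sample(range(1, n + 1), k)
        clause = [v if rng.random() < 0.5 else -v for v in variables]
        if not is_satisfiable(clauses + [clause], n):
            flipped = [-clause[0]] + clause[1:]  # flip a single literal
            return clauses + [clause], clauses + [flipped]  # (unSAT, SAT)
        clauses.append(clause)
```

Any assignment satisfying the earlier clauses must falsify every literal of the final clause, so negating any one of its literals produces a clause that assignment satisfies, which is why the second problem of the pair is guaranteed satisfiable.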
>>Right.>>Sorry that wasn’t clear.>>The point is that->>It’s going a little fast.>>So we know that before the last clause is introduced, the problem
is still satisfiable. So we know that
flipping any literal in the last clause will preserve the model that was
satisfying previously. So we know we can recover a satisfiable instance by flipping a literal in
the final clause. Yeah. Sorry if I’m going
too fast please anyone stop me at anytime.>>All right I get
it now. Okay thanks.>>So, let’s talk about
Network Architecture. We could treat SAT problems as character strings and use
standard sequence models. We saw that, I wrote
a SAT problem above; it’s characters, so the standard sequence
models can be applied. But there are many invariances
that we want to enforce that the logic of
satisfiability admits. For example, SAT problems are permutation invariant
in a few ways. You can permute literals, you can permute
clauses and you can permute literals
within a clause. And you don’t want to
have a Sequence Model have to learn all of
this from scratch. We can just build
it into the model. They are also
negation invariant, which means that you can
negate every occurrence of any specific variable, also without affecting
satisfiability. So we’re going to try to build these invariances into
our Neural Network Model. And the approach is given
a formula like this. We’re going to construct
an undirected graph with a node for every literal, so there’s a node for X1, not X1, X2, and not X2, and a node for
every clause C1 and C2. There’s going to be
one type of edge between every literal and
every clause it appears in. So there’s an edge between
X1 and C1, and between X2 and C1, because C1 contains
the literals X1 and X2. And there’s also a separate type of edge, which I denote with a dotted edge here, between a
literal and its complement. So X2 and not X2 are
connected by the dotted edge, and X1 and not X1 are
connected by that edge. Now the basic idea of the model
which we call NeuroSAT, is we’re going to maintain
a vector embedding, some inscrutable collection of floating point numbers, at
each node in the graph. And we’re going to
iteratively pass messages along
the edges of the graph. So, a literal is going
to pass messages to the clauses it appears in
and its complement literal. Clauses are going
to pass messages back to the literals
that occur in them. After T time steps, we’re
going to map literals into scalar votes using
a learned network. And then we’re just
going to average the votes to compute
a single logit that’s going to guess
whether the entire problem is satisfiable or not. Now I’m going to go
into the architecture in great detail in
the next few slides. But if you have any high level questions
I’ll take them now. Yeah.>>It’s related to survey->>I’m going to discuss Survey Propagation a
little at the end. Survey Propagation does
not look like this. There are several
architectural differences. We’ll compare Survey
Propagation with this. In Survey Propagation
the message that a clause
sends to a literal depends on which literal it is. Like you marginalize
out the other literals, which you can’t
express in this way. But it’s structurally similar in the sense that it’s
a message-passing algorithm. So, here’s the NeuroSAT model
in more detail. Where we have a bunch
of learned parameters. So, two vectors, the
L init and C init to initialize
the embeddings for each of the literals and clauses. Three multi-layer
perceptrons, L msg, C msg, and L vote. I’ll explain all
of these shortly, and then two layer-norm LSTMs, L update and C update. Now, we’re going to initialize one matrix that stores all
of the literal embeddings. So, it’s 2N by D, so every row contains the embedding for
one of the literals. And we’re going to
just initialize all of the rows to L init, the learned vector and
similarly for clauses. Every row of C is going to be the embedding
for one of the clauses. And we’re going to run
T iterations of message passing. So, every iteration of message passing
consists of two stages. Stage one, literals are going to pass messages
to the clauses. So, here all of
the literals have embeddings and
they’re going to send them to the clauses
that they occur in. Each clause is going to sum
the incoming messages, and also gets its previous state, denoted with a dotted edge, and uses the RNN, the C update layer-norm LSTM,
to update its own embedding. And here’s the matrix notation. It’s hard to read but it
just says what I just said. You’ve got the
literal embeddings. You apply this learned MLP to compute the messages,
one for each literal. M is going to be
the binary adjacency matrix. So, this is just summing the messages and sending
them to the clauses, and then the clause also gets its previous hidden state
and calls the update LSTM. Now, stage 2 is similar. Clauses pass messages
to the literals. The only difference here is that each literal gets messages
from the clauses, gets its old state, and also gets
a different kind of message from its complement because of the extra edges
in the graph. Intuitively, these dotted edges
encode the constraint that they should probably
not be equal. Yes.>>What is a message?
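In rough matrix form, the two update stages and the final vote just described can be sketched as follows. This is a toy NumPy rendering of the structure only: plain tanh updates and random matrices stand in for the learned layer-norm LSTMs and MLPs (L msg, C msg, L vote), and all of the sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lits, n_clauses, d = 8, 6, 16   # toy sizes; the real runs used d = 128

# M[i, j] = 1 iff literal i occurs in clause j (random toy adjacency here)
M = (rng.random((n_lits, n_clauses)) < 0.3).astype(float)

# Random stand-ins for the learned MLPs and layer-norm LSTM updates
W_lmsg = rng.normal(size=(d, d))
W_cmsg = rng.normal(size=(d, d))
W_cupd = rng.normal(size=(2 * d, d))
W_lupd = rng.normal(size=(3 * d, d))
w_vote = rng.normal(size=d)

def flip(L):
    # Pair each literal's row with its complement's row, assuming the
    # layout [x_1..x_n, not x_1..not x_n] (an assumption of this sketch).
    n = L.shape[0] // 2
    return np.concatenate([L[n:], L[:n]], axis=0)

# Every row starts at the same init vector (stand-in for L_init / C_init)
L = np.tile(rng.normal(size=d), (n_lits, 1))
C = np.tile(rng.normal(size=d), (n_clauses, 1))

for _ in range(26):  # T rounds of message passing
    # Stage 1: literals -> clauses; each clause sums incoming messages
    # and combines them with its previous state.
    C = np.tanh(np.concatenate([M.T @ (L @ W_lmsg), C], axis=1) @ W_cupd)
    # Stage 2: clauses -> literals; each literal also sees its complement.
    L = np.tanh(np.concatenate([M @ (C @ W_cmsg), L, flip(L)], axis=1) @ W_lupd)

votes = L @ w_vote                    # one scalar vote per literal
logit = votes.mean()                  # averaged into a single logit
p_sat = 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> predicted P(satisfiable)
```

Training then reduces to minimizing the sigmoid cross-entropy between `p_sat` and the single supervision bit, with gradients flowing through all T rounds.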
What are the messages?>>So, at every time step
every node has some collection of
floating point numbers that represent something potentially inscrutable
about that node. So, at every time step I
have some vector here which represents something hopefully meaningful
about the literal. Now, that’s represented in
this matrix L of t. L msg is a learned Neural Network that
takes those embeddings and computes some other vector of floats that
represents a message. But I have no a
priori knowledge of what that actually
encodes semantically. It has the freedom to choose
what those floats represent.>>And what are the sizes
of the vectors, again?>>I don’t think I said yet.
We eventually run it with 128 but it’s pretty arbitrary.>>It’s a good number.>>If we had bigger
hardware we’d run it.>>Got it. Okay, super.>>As big as you can get.>>It’s related to
problem size potentially.>>Potentially,
unclear actually.>>Oh, okay, okay.>>Because reduction sometimes
[inaudible] you don’t always want. I mean having a smaller vector potentially versus
something distractive.>>Conventional wisdom in
machine learning is used as big a vector as
you can get away with in terms of
hardware and performance. We also found that the bigger we made D,
the faster it trained. But it’s not major. You
can train these with 16, even with eight.>>Okay.>>Yeah.>>And you’re going
to describe how you do the learning itself later?>>Yeah, although
the short answer is backpropagation. But I’ll explain
the signal in a second. So, here’s the literal update
and it’s similar, I won’t go through
this in detail. It’s identical to the clause update except for getting
the complement of the literals as
a separate message. Now, after T iterations, we’re going to compute
a scalar “vote” for each literal by applying this learned multi-layer
perceptron, L vote, that maps each literal
embedding into a single scalar. We’re going to then
average the votes to get a logit which is
just a single scalar for the entire network. And then we’re going to
predict satisfiability using the logit by just
taking the sigmoid function, which is going to return
something between zero and one. Now, we train with
the cross entropy loss. So, we’re using
back propagation to minimize the classification error on a training dataset. Does that->>Where is the input
to your training?>>SAT problems
represented as graphs and a single bit of supervision
for each problem indicating one if it’s SAT, zero if it’s unSAT.>>So you use these pairs that you
were mentioning earlier. Sets of these pairs,
large sets of these pairs.>>Yes. Exactly. But typically, we don’t tell anything
about solutions. We’re just giving it single bit supervision
for each problem. Let’s talk about
experiments now. So, we’re going to train
on SR uniform 10, 40. The uniform part is
not actually essential. It’s just intuitively nice to use a curriculum here because it’s going to be
harder to induce structure on bigger problems. But we train things
without the curriculum and it works just as
well. And we’re going to->>Are those variables
clauses, 10, 40?>>No, sorry. Uniform meaning, we’re sampling the number
of variables from a uniform distribution
between 10 and 40. The number of
clauses is determined by how many we can sample
before it gets unSAT.>>UnSAT, right.>>And then we’re going
to test on SR(40). Just some perspective on SR(40). There are 40 variables,
so there are a trillion possible assignments. About 200 clauses per problem. Roughly 1000
literal occurrences, so about five literals
per clause. And all of the clauses
are uniformly random, so every literal is
treated symmetrically. Even X i and not X i are
both treated symmetrically. And every single problem
in the distribution is a single bit away from flipping
between SAT and unSAT. And just a caveat,
because 40 is small, it is still easy for
state-of-the-art solvers. So MiniSat can solve SR(40) in almost [inaudible] number
of milliseconds. And for results: NeuroSAT predicts
with 85 percent accuracy. We’re going to make that number
a lot better shortly, but first let’s
understand how it even beat 50 percent. Yes.>>So, intuitively
what you learn here, is for any pair of formulas, whether you have
the original unSAT or if there’s one flip
which becomes SAT. That’s what you are
learning here, correct?>>Well, we don’t tell
NeuroSAT that these two, one is SAT one is unSAT. It just gets random problems
from the distribution.>>I see. So, when you
feed for the training, you do not say that this one was unSAT one and this was SAT?>>Correct. We just feed them, feed the individual problems.>>But they’re labeled.>>They’re labeled but
we don’t tell NeuroSAT.>>So, you do tell them.>>We don’t tell NeuroSAT that they should look at this pair.>>So, input is not a pair.>>Every single
individual problem.>>No, it’s a SAT but he
says it’s effectively what. Okay, that’s not a criticism, I’m just trying to see
what the intuition, if there’s an intuition here. What’s the intuition that
you had in your head? What’s actually
been learned here…>>We have actually a very good explanation
of what’s being learned, which I’m going
to get to shortly.>>Okay.>>So, we were all really surprised by this number
and we trained a few baselines including LSTMs and we could not break
50 percent on any of them. We couldn’t train
a model that, even when you fixed the number
of variables and fixed the representation, could even distinguish, on
small training sets, the difference between
the SAT and unSAT problems. So, we actually bombarded this
with experiments and made no progress until we just visualized what it was doing
and then it became clear. So, this is an example run of NeuroSAT on a problem
from SR(20). Here, this is the matrix of scalar literal votes
at each time step. So, this is a 20 by two matrix, because there are 20 variables. So, the i-th row is going to be the scalar vote for X i and then not X i in
the second column. Here, white is zero, red is positive, and
blue is negative. Okay? So, red means SAT, white means unsure,
blue means unSAT. Now, as time goes on, there’s some red and blue here from the
initialization which is an artifact of
the initialization vectors. But very quickly
all of the literals start voting unSAT
with low confidence. So, light blue means
roughly a logit of negative one. So, assigning, say, 16 percent probability
to it being SAT. And you can see there’s a lot
happening each time step. A lot of the dots are
changing and moving around but consistently almost all of the literals are voting unSAT
with low confidence. And we see this on
every problem it runs on. And this happens for a while, unSAT with no confidence. Quite a few things
are happening, a few speckles of very, very light red that
don’t make any difference because they get washed
out by the averaging. But then, all of a sudden there’s a very distinct
phase transition. All of the literals
start voting SAT with extremely high confidence
and the network converges. So, these light blues are
about negative one logit. These reds are about plus 40. So, this is 99.9999999 percent sure that these problems are satisfiable and
the network convergences. We see this for every problem
that it runs on. If it’s unSAT, it looks
like this forever, and if it’s SAT it either
looks like this for a long time or,
sometimes, usually, eventually has
this phase transition where it converges. Yes.>>Just to make sure because
that’s where we’re at. This is when you test basically, not when
you learn, here. Right?>>Yes, this is a problem
that was not trained on.>>Okay, okay.>>Yes. So, somehow around here it becomes extremely confident that this problem
is satisfiable. Now, we weren’t exactly sure what it found
that convinced it. But if you look carefully at the literal votes
after convergence, is a little hard to see
because the resolution is poor. But you can maybe make out a pretty distinct
checkerboard pattern. Meaning that there
are only roughly two, maybe three shades of red here. And for each variable, there’s one literal that is distinctly darker
than the other one. So, you can see
a bit pattern here. Do you guys all see it roughly?>>[inaudible]>>You wouldn’t call it what?>>Sorry, go ahead, go ahead.>>So, we hypothesized that maybe this is encoding
a satisfying assignment. And that it had become so confident because
it had found it. So we made
a script to just try to read this off and see if it’s
a satisfying assignment, and it turns out that it usually is a
satisfying assignment. Which suggests that the reason why it became so
confident is actually because it had detected
a satisfying assignment, and knew that that indicated that the problem
must be satisfiable. Now, at first I thought maybe the literals that are true in the assignment are
voting SAT with more confidence even though I didn’t know why
it would do that, it turns out that
is not the case. Sometimes, the literal
that is voting lower, that is voting SAT
with less confidence, is the one that’s true
in the solution. Yeah.>>[inaudible]>>You try both possibilities. So, you try the thing
that maps the one that’s lower to true everywhere, and then you try
the one that’s- Yeah. But this has turned out to be an unreliable way to
decode the solutions. It turns out that
this bit pattern is just an artifact of the fact that the higher dimensional
literal embeddings cluster according to the assignment. So at convergence, in
the 128 dimensional space, the literals that are true are going to be in one cluster, the literals that are
false are going to be in another cluster. And this bit pattern is just an artifact of the projection preserving that clustering
most of the time.>>Excuse me. If
you have this set, you have two [inaudible]
and run it through this. And one cluster could be the true and the other
could be the false.>>Mm-hm.>>And suppose that you
have two disjoint sets of literals that appear in
disjoint sets of clauses, then would your learning necessarily correlate what is light versus dark in
one cluster or the other?>>That’s a good question. In that case, it is conceivable that you would
end up with four clusters. The true ones for problem one, the false ones for problem one. My guess is that we didn’t test that SR40s are just
fully connected. My guess is that
the two clusters for the two sets of true literals are going to
be closer together, and then for two cluster, you’ll still get
the right solution. But I don’t have
a proof of that. But the clustering
analysis suggests a much more reliable way
to decode solutions. Which is just directly cluster the higher dimensional
embeddings. So here we can watch the two dimensional
PCA projections of the high dimensional
embeddings on a run from SR(40). Now, I’ve colored the literals, these are the
literal embeddings. And I’ve colored
them according to the solution that
it eventually finds. So you can see exactly what
it’s doing over time. These are just the
last nine time steps leading up to convergence. And you see it doesn’t cluster
at all, it’s all mixed up. It’s all mixed up. It’s sort
of starting to separate. Here it’s starting to separate, but one literal is
still incorrect. And then it pulls away, and then at convergence it falls
into two distinct clusters. And now this method of
decoding is very reliable. And 96 percent of the problems in
the test distribution for which it guesses SAT, this simple trick
of two clustering will recover the
satisfying assignment. So let’s talk about
solving. Oh yeah.>>Because it doesn’t really use any supervision for
finding that out.>>Right.>>Do you have
any sort of [inaudible]>>Yeah. So, as I said before,
in the training distribution, the satisfiable problems and the unsatisfiable problems
are statistically identical. There is no way
for it to predict that bit based on
superficial properties. It learned that
the best way to predict that bit is to search for
satisfying assignments. And not to guess SAT until it
found one, at which point it guesses SAT with
extreme confidence and suffers no objective
function cost. Yeah. I think it’s
an artifact of the distribution not permitting any other way of
explaining the data. We’re going to talk
about training it on slightly different distributions
later in the talk, and you’ll see how
the training data affects the procedure
that it learns. So let’s talk about
solving the rest. Sometimes, there are
satisfiable problems for which it does not guess SAT. It will continue to
guess unSAT with low confidence.
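The decoding step described earlier, 2-clustering the high-dimensional literal embeddings and then trying both cluster-to-truth-value mappings, can be sketched like this. The k-means initialization and the [x_1..x_n, not x_1..not x_n] literal layout are assumptions of this sketch, not details from the talk.

```python
import numpy as np

def decode_assignment(L_emb, clauses, n_vars):
    # Cluster the 2n literal embeddings into two groups with plain k-means,
    # then try mapping each cluster to "true" and keep a mapping that
    # satisfies the formula, if either one does.
    far = ((L_emb - L_emb[0]) ** 2).sum(1).argmax()
    centers = np.stack([L_emb[0], L_emb[far]])   # farthest-point init
    for _ in range(20):
        d2 = ((L_emb[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = L_emb[labels == k].mean(0)
    for true_cluster in (0, 1):
        assign = labels[:n_vars] == true_cluster
        if all(any(assign[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return assign
    return None
```

With well-separated clusters this trick works whether or not the per-variable pattern in the scalar votes is readable by eye.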
Here’s an example. Just keeps wandering around, but because NeuroSAT is
inherently iterative, we can just run it for more iterations and
see what happens. So we run this, run this,
run this for a long time. Fast forward to around 137. We only trained
and tested so far on 26 rounds of message passing. So this is five or
six times more rounds than it ever performed
during training. And all the literals are still voting unSAT with
low confidence. Few speckles of red and
then round 150, bam! Phase transition, extremely
confident in SAT, the literal embeddings converge. And once again this bit pattern will usually encode
a satisfying assignment. And the clustering technique is even more reliable as always. Now, here’s a graph showing the percentage
of problems from SR(40) that it solved as a function of the number of
rounds of message passing. So it solves about 70 percent at 26 rounds because that
is what we trained it on. But as we run it for longer, it continues to
solve more and more. By about a thousand rounds
of message passing, it’s solving
just about 100 percent of the problems in
the test distribution SR(40). And this is solving end to end. This is not just
prediction accuracy. We are decoding solutions for almost 100 percent of the problems in
the test distribution. And curves like this are, I think it’s fair to say fairly unprecedented in
machine learning. Usually, algorithms
are not iterative. And even when they
are, they don’t just continue to do better
when you run longer.>>So, what answer can you
give the SAT over there?>>We’re going to
talk about this later. Right now, no. Right now
it wanders around forever. When trained on SR(40), it will never become
confident in unSAT, but we’re going to
talk about this later. There are a few
cool experiments. Now, it would be
implausible to solve much bigger or
harder problems with only 26 rounds of
message passing. You can imagine that
it’s just not possible to reliably solve big hard problems
with so few rounds. But because we can
run for more rounds, and it continues to
search effectively, it’s conceivable
what we can actually solve problems that are harder. So, let’s talk about
scaling to bigger problems. Here, this purple line is the same line I just showed
you on the previous graph. And we are plotting
the same lines for SR of N for larger
N. So SR(80), SR(120), SR(160), and SR(200). And we see similar curves
where for SR(80),
it’s only able to solve maybe 40 percent when you
only run it for 26 iterations. But as you run it
for up to a thousand, it’s solving about 90 percent
of the problems. And even SR 200, which is substantially
harder problem, and many SAT is now starting
to noticeably slow down on. It solves under five percent at the number
of rounds it was trained on. But it gets up
to almost 30 percent when you run it for
a thousand rounds. Now again, although from
a machine learning perspective, the fact that these curves
go up at all when you extrapolate the number of rounds is interesting and impressive. But it also suggests that the algorithm as is
is highly incomplete. It’s possible that this curve spikes up, we didn’t
run it for longer. But it looks like
it’s asymptotic. Yeah.>>What was your training size, how many SAT problems, how many examples do you fit
your training algorithms?>>It’s a stream. So, I don’t actually have
the exact number. Probably.>>Roughly?
>>A few million.>>A few million?>>Yeah.>>Okay.>>But none of the problems
are bigger than SR(40). We are training on
SR(10) to SR(40). So our hypothesis for
why it flattens out here is that during training, it’s essentially performing
some kind of local search, and it never has to
restart during training. SR(40)SR is easy enough. That is heuristics
are essentially complete without needing to
revert when it gets stuck. So when we watch it run, it makes
progress very quickly. In the first few steps,
it will flip many, many literals at
a time and very quickly satisfy over 99 percent
of the clauses. And then, it will start taking much more cautious steps
flipping only a few literals. And we see for
these longer runs and the big problems that it
just stops being aggressive. Now, we think
an architectural change that would make it much more reliable when run on
harder problems for many iterations is giving it access to a source
of randomness. So it has some way to
learn when it’s stuck and when it should inject
noise into its embedding. So, we haven’t tried
this yet. Yeah.>>It’s true what
you see around this is all what you mean
by the iterations? So you have this
recurrence relation between CI and CI plus one, H and these are
the transformations that you buy, and the X-axis?>>Yes. Every iteration
here involves the literals passing messages to the clauses and the clauses passing messages
back to the literals.>>How fast is that dimensions
in the iteration cost?>>Sorry.>>Go ahead.>>So, let me go back
and show you the updates. So, it’s applying
a single neural network here, which is as fast as
applying, that is. In this case, it’s
a three-layer neural network. It’s just basically a couple of matrix multiplies, as well as
matrix multiplies for the graph, for the message passing, and then the C update is
another matrix multiply. So, it depends on the dimension and
the number of literals, but it’s not expensive. Yes.>>But compared
to sets, methods, space, and message
passing of which. I guess it would be
permutation, is one example, and the difference then is
the matrix modification, and where the
modifications come from?>>So, Survey Propagation
is not learned at all. It’s derived mathematically. The structure of
the messages is different. So as I said before,
a clause will pass different messages to
the literals that occur in it. Here, a clause passes the same message
to every literal, so there are a couple architectural
differences like that. Numerically, there’s a
difference: Survey Prop often fails for
numerical reasons, that you’re dividing by zero. And here, there’s
no division. So it’s-.>>But, I thought that
philosophical difference would be these matrix multiplications
that you learn.>>I don’t know if that’s a philosophical
difference, but that’s a big difference. That there’s a lot more
floating point computation being done per iteration here
than in Survey Propagation. Survey Propagation, it’s a few multiplications
and divisions on a single float for
each literal and clause. Yes.>>I’m just trying
to follow up on the comment that you made earlier
about the training. So, for each of these five here, we have different training
sets that you used?>>No, this is all- sorry. Everything we’re talking
about today until the very end is just trained
on SR uniform 10 and 40.>>Okay.>>This is all extrapolation
to bigger problems. It has not been retrained.>>So, SR 10? What is SR 10 and 40?
What is that?>>Sorry, SR uniform 10, 40: curriculum training on SR(N), with N somewhere between 10 and 40. The N is randomly
sampled between 10 and 40.>>So, for 80 for instance,
so the training set. Basically, what’s going on? The algorithm is picking up patterns that are learned from smaller instances
between 10 and 40, like if you give it 200, it tries to detect
these patterns somehow, correct?>>I don’t think that’s
the best way to think of it. It’s not the way I think
about it. I think that there are very few patterns
that it’s exploiting. And rather than
looking for patterns, it has synthesized
a local search algorithm, which is what makes it generalize where
the patterns would->>What evidence
do you have for your interpretation?
So for instance->>Yes. So, I think if we go
back to the video in action. I think this is actually-.>>This is a visualization of the execution
of the algorithm. It doesn’t show the study of
the pattern learned, and that it is being identified
in a test sample. I mean how do you see it?>>I’m not sure I
understand the distinction. I think this is
rock solid evidence that it learns that it needs to somehow search
for satisfying assignments, and that that is the best way to
explain the supervision bit.>>But how does it find this?>>That, I can only speculate.>>It’s by pattern recognition, that’s why I’m asking
where you’ve learned this.>>I have one relevant
experiment here. Trying to tell whether it
was doing a local search or global search. It’s not clear. So, we can decode
the solution by doing this two clustering
technique at the end. It’s not clear
that if we cluster, it’s meaningful
before convergence. But, we hypothesized
that maybe it encodes an inconsistent assignment
before convergence, and we tested what properties
that assignment had. So when we do this, we see that the number of
clauses satisfied increases monotonically very quickly
and rarely goes down, and we see that the number of literals that are flipped at each time step starts out high, and as the number of
clauses satisfied goes up, the number of literals
flipped, goes down. So qualitatively, it looks
a lot like local search, some kind of an nearly. But, it’s not clear
what stronger evidence would constitute
because ultimately, the algorithm is inscrutable. I can’t read
the matrix instance to understand specifically
what it’s doing.>>And we know somehow that it’s also getting stuck from
the graphs and the margins.>>Yes.>>Because it appears
to be plateauing.>>Yes.>>So, it's not learning enough about backtracking, enough to go search some other space, because it seems so limited in its scope, somehow.>>Yes.>>Because of the size.>>I think, because of the lack of needing to do any kind of restart during training, it's just not prepared for such hard problems. Are you satisfied? Do you have another hypothesis
for what it's doing?>>Well, I think an experiment that might be interesting: if you were using just a sequence-to-sequence model based on strings, a string encoding of the formulas, how would- could you approximate your result? Is there->>We implemented such a baseline and it could not beat 50 percent on the classification test set.>>I see.>>There's two reasons why I think those approaches can't get off the ground. One is that there's no obvious way to deliberate. I mean, to deliberate, to run for a long time. A crucial part of NeuroSAT is that it's iterative, that it can think longer. A single pass over a sequence is very unlikely to be able to solve big, hard problems. But also, it just has to spend all of its time learning the invariances that we build in. That's another thing. There are so many different permutation invariances, an exponential number of them.>>But, if you were to normalize the inputs
for instance by sorting the literals in each clause by name, you could get part of that and still keep it straightforward. So that would perhaps help test the pattern theory, so to speak, whether simple pattern recognition alone would prove powerful in this context. I don't know.>>Maybe. I would be very surprised if somebody could beat 50 percent with something that didn't have this iterative formulation, but I would be thrilled to be proved wrong about that. Yes.>>So, with regards to this
picking up the patterns, you mentioned that you used this SR distribution because you don't want it to pick up on superficial patterns.>>Yes. We designed it to try to prevent it from being able to exploit patterns.>>So, did you actually do the other experiment of training on completely randomly-generated data, so that you see what the accuracies are?>>Unfortunately, no. We've trained on the regular random distribution at the empirical threshold, and it does perform well also. But that's almost as patternless as our distribution, because all of the literals have the same statistics. Even though they're not separated by a single bit, most of the problems at the empirical threshold
are going to be-.>>Which data set is this? Which data set is this?>>So, the data set we generate is SR(n).>>Right.
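For concreteness, here is a toy sketch of an SR(n)-style pair generator (my illustration, not the actual code from the project): it brute-forces satisfiability, so it is only viable at toy scale, and the clause-length distribution is a stand-in for the one actually used.

```python
import itertools
import random

def is_satisfiable(clauses, n):
    # Brute-force check over all 2^n assignments; only viable for tiny n.
    for bits in itertools.product([False, True], repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def random_clause(n, rng):
    # Stand-in for the real clause-length distribution: small random k >= 2.
    k = min(n, 2 + rng.randint(0, 3))
    return [v if rng.random() < 0.5 else -v
            for v in rng.sample(range(1, n + 1), k)]

def sr_pair(n, rng):
    """Sample an (unsat, sat) pair of formulas differing in one literal.

    Clauses are added while the formula stays satisfiable; the clause that
    finally makes it unsatisfiable is kept for the UNSAT problem, and one of
    its literals is negated to build the SAT twin. The negation always works:
    any assignment satisfying the earlier clauses falsifies every literal of
    the breaking clause, so it satisfies the negated literal."""
    clauses = []
    while True:
        clause = random_clause(n, rng)
        if is_satisfiable(clauses + [clause], n):
            clauses.append(clause)
        else:
            flipped = [-clause[0]] + clause[1:]
            return clauses + [clause], clauses + [flipped]
```

By construction, every sampled problem sits exactly at its own satisfiability threshold, which is the property discussed next.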
>>It is very similar to the standard uniform random distribution of problems.>>Okay.>>The only difference is that we adjust the number of clauses specifically for each problem to find the exact threshold where it flips from SAT to UNSAT. But what you can do is just sample for a given n. That threshold is usually at the same number of clauses, roughly a 4.27 clauses-per-variable ratio. So, if you just sample that number of clauses for a fixed n, you're going to be only a few flips away from my construction anyway. We trained on that, and it also has similar quality.>>What I'm trying to say is that, if your hypothesis is true, then the other way, training on that other data set, should do significantly better, because you can pick up the patterns specific to the data set?>>Yes. So, I don't
know if the random set has such patterns close to the threshold. But I think one of the promises of this technique is that, if there is a distribution of SAT problems that has special structure that you care about in practice, this could learn a less generic, more custom procedure to exploit the patterns in that data set. Yes. But you have to be careful with the data set, whether you can predict satisfiability on it based on superficial patterns. If you have some degenerate data set where your satisfiable problems come from planning and your unsatisfiable problems come from scheduling, it could potentially predict with perfect accuracy based on patterns, without learning that it needs to search for satisfying assignments. So, it wouldn't be useful.
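One cheap sanity check for that kind of degeneracy (my sketch, not something from the talk): see how well the best single threshold on a superficial statistic, like the clause-to-variable ratio, predicts the label on its own. Accuracy near 50% suggests the bit can't be explained by that statistic alone.

```python
def best_ratio_threshold_accuracy(problems):
    """problems: iterable of (num_clauses, num_vars, is_sat) triples.
    Returns the accuracy of the best classifier of the form
    'predict SAT iff clause/variable ratio <= t' on this data."""
    data = [(m / n, is_sat) for m, n, is_sat in problems]
    thresholds = [float('-inf')] + sorted({r for r, _ in data})
    return max(
        sum((r <= t) == is_sat for r, is_sat in data) / len(data)
        for t in thresholds)

# A degenerate set: satisfiable problems are sparse, unsatisfiable ones dense.
degenerate = [(5, 10, True)] * 4 + [(60, 10, False)] * 4
# A balanced set: both labels occur at exactly the same ratio.
balanced = [(42, 10, True), (42, 10, False)] * 3
```

On `degenerate` this baseline scores 100%, a red flag; on `balanced` it can do no better than 50%, so any classifier doing well there must be using more than this superficial statistic.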
Ideally, the usefulness of the patterns is limited enough that it has to learn to search, but it uses the patterns to help it search better. There's a lot of art in the data set design, I think, and we're going to see networks that are trained on a different data set shortly, to really stress this point. Let me fast forward. Okay. Let's talk about
generalizing to other domains. So, all of the test problems so far have been uniform random, which may have its own statistical idiosyncrasies, or may just be inherently uninteresting to some extent. There is an incredible diversity of possible SAT problems. It's almost hard to conceive how this simple formalism can express such a great variety of different kinds of problems. Any problem in NP can be reduced to SAT, and the structural and statistical properties can really vary widely. So, the natural question is, does NeuroSAT trained on the random problems generalize to other domains, or is it just exploiting properties of random SAT? Now, to test this, we generated a suite of diverse SAT problems. First, we generated random graphs from a bunch of different random
graph distributions: Barabási-Albert, Erdős-Rényi, Forest Fire, and a few others. Small graphs, all of them look roughly like this, about 10 nodes, 17 edges. For each graph, we generated k-coloring problems, k-dominating-set problems, k-clique problems, and k-vertex-cover problems, all for a range of k, where we choose k so that at the hard end, under 50% are satisfiable, and at the easy end, all of them are satisfiable. And if you look at this graph, yeah, if I remember correctly,
this is a four-clique, and I don't think there's a five-clique. At least with pen and paper and some thought, you can solve all of these problems, but they're not perceptually obvious. They require reasoning and some kind of search, even for humans. But again, they're trivial for a SAT solver, just because of the size. And just to really
stress the diversity, I want to show you
a few visualizations. So this is a visualization
of a problem from SR40. Here every red node
is a literal; the red edges are between a
literal and its complement. And there’s a blue edge between any pair of literals that
appear in the same clause. So this is a lossy representation, because we're marginalizing out the clauses. You can see that there's just no structure; it's just a big tangled mess. Here, this beast
is a problem from SR200. You can see again
there’s no structure. It’s just an unbelievable
tangled mess. Now, just a reminder, NeuroSAT gets graphs like this at test time, where flipping a single edge changes whether it's SAT or UNSAT. And it's able to look at this and compute, in at least 30% of the cases, whether it's SAT or UNSAT correctly, and actually return the satisfying assignment that explains that bit. Now, in contrast, here's
a graph coloring problem, which has this crystalline
structure, and also very few edges. Here's a clique detection problem where you've got this special pattern for each of the five cliques. The dominating set problem also has a very distinct structure that is distinctly non-random, and the vertex cover problem also has its own little distinct flair. Just to stress how different these domains are. Now let's look at the results. So in total over
this suite of problems, there were almost 5,000
satisfiable problems. The average number of clauses was 532, so about two and a half times bigger than the biggest problems seen during training. And NeuroSAT solves 85% of the satisfiable problems end to end. So again, it's just trained on small random problems for 26 iterations, and now it's finding solutions to 85% of problems that are bigger, across a suite of different domains. And in contrast,
the percent solved by Survey Propagation end
to end is zero percent. So Survey Propagation just
simply doesn’t work on these. It usually doesn’t converge, or it keeps oscillating
between one-zero and zero-one. It's not a completely fair comparison, because Survey Prop is not designed to solve problems end to end; it's usually used with decimation, where it just provides a heuristic for which variables to set. But I keep these numbers here to stress that, as far as I know, we don't know of
a message passing algorithm with qualitatively
similar behavior to NeuroSAT. So it hasn’t just relearned an algorithm
that we know about, it’s doing something new that we don’t know how to explain. Yeah.>>But for the random problems,
solving problems.>>Yeah, random problems. I haven't run it on SR(n), but on random 3-SAT, Survey Prop will solve a decent fraction.>>Okay.>>Here, it doesn't work. So ironically, Survey Propagation is derived mathematically from the theory of random SAT, and doesn't generalize to non-random SAT problems as well, whereas NeuroSAT is actually trained on random SAT, but there's already something more general there that lets it transfer. Okay now, let's
talk about-. Yeah.>>How do your results compare to the classic results on the phase transition for random SAT instances?>>So, our SR(n) is designed around the fact that the phase transition only really holds asymptotically, as the number of variables
goes to infinity. So for small n, like n equals 40, if you find the empirical threshold, which is something like 4.62, 50% will be satisfiable and 50% will be unsatisfiable, but you are not actually going to be one clause away from flipping in general. As n goes to infinity, adding a single clause will, with high probability, flip you from SAT to UNSAT. But for smaller n, the actual threshold has higher variance. So SR(n) is basically just a way of correcting for that, so that every problem is exactly at its own personal threshold.>>So you're saying that
for small problem instances, the threshold clause density that one would use to guess with high accuracy doesn't work well, or it's too noisy.>>Yes. There's variance in where that threshold is going to be for small n.>>But on the other hand, is there a possible
explanation of how some randomized algorithm would show that perhaps these randomly generated problems that you consider are easy to solve from a probabilistic point of view? Like an equivalent of explaining how the old phase transition phenomenon appears, basically. Something equivalent.>>I have empirical
evidence against that. Which is that, as you would expect, since it's essentially at the threshold, MiniSat really starts to grind as n gets bigger. Because our k's are bigger than three, about five on average, MiniSat starts to grind even more than it would for random 3-SAT as n gets bigger.>>So what about SAT problems that people use in practice, did you run it on those? The SAT competition?>>We didn't. I think the current version, trained only the way we've trained it, is unlikely to be
able to deal with such big problems without
additional training.>>Just because of
that graph. Okay.>>Yeah. I've already told you about all of the problems we've run it on.>>Yes.>>I think the breadth is strong enough to suggest that it's generalizing to different kinds of SAT problems. I don't think it's close to being useful on big problems.>>The thing is, perhaps, your reference to MiniSat, I don't know
exactly what it does. But it's not a proof that such an algorithm doesn't exist, that [inaudible]. It's just, I don't know how it is related to what I have just asked, but anyway. What I'm asking is, is there a randomized algorithm that could basically explain why these classes of random problem instances that you consider are easy or hard to solve, or to guess whether they're SAT or not? Then you answered all these [inaudible]>>I'm not so familiar with the theory on the difficulty of random SAT and such. My understanding, and
maybe Leo can correct me, is that at least as n goes to infinity, at the threshold, the problems are thought to be exponentially difficult. Now, n is small here, so we don't have any specific proof for our n, and we're not actually sampling at the threshold, we're sampling at each problem's specific threshold, but I believe that there are, if not theoretical results, theoretical conjectures.>>So it should be a hard problem in general. I have no way to fact-check what you just said.>>No.>>But I don't want to.>>Near the threshold, if you're sampling the number of clauses below the threshold, it is known to be easy. So Survey Propagation with decimation can solve it.>>Okay, thank you.>>Like 99 percent of
the problems with n of, I think, 100,000, if the clause-to-variable ratio is just a little bit under the threshold. We know those are easy.>>I see. Okay, thank you.>>But we are trying to make these as hard as possible by sampling at the threshold for smaller n. But I don't have any proof that these are hard, and I'm not relying on that for any of the takeaways from NeuroSAT.>>What is this threshold again?>>Yes. So, if you
have n variables, and you’re sampling clauses
uniformly at random, let’s say with all clauses
of three literals in them. If you only sample
a couple of clauses, with high probability it's going to be satisfiable. If you sample a million clauses, with high probability it's going to be unsatisfiable. The point is that for every fixed number of variables, there is some number of clauses where there's a 50 percent chance it will be satisfiable and a 50 percent chance it will be unsatisfiable, and those are the hard regimes. So, if there are a million clauses, it's probably going to be unSAT for a simple reason: you probably sampled some small unSAT core. And if there are no clauses, it's satisfiable for a simple reason: anything you try is going to work. And it's difficult right at the threshold. But again, this is not
particularly relevant to NeuroSAT or the claims or
takeaways from this project. Okay. Let’s talk about
detecting unsatisfiability, this is what you
asked about before. So NeuroSAT at least
trained on SR uniform (10, 40), is never confident in unSAT. It searches indefinitely. It searches poorly indefinitely, meaning it stops taking big moves and it gets stuck, but it searches indefinitely. And it always guesses low-confidence unSAT
until it finds a solution. Now, detecting SAT and unSAT
are very different problems. Now, proofs of SAT are
just variable assignments. In particular, they
fit in NeuroSAT's memory. NeuroSAT's memory is a couple of floats
for each literal. And it just needs to
represent a bit mask on those embeddings to be able to encode
the satisfying assignment. Whereas proofs of unSAT vary. So there are some
trivial proofs of unSAT. Any formula, no matter how big, if it has the clause X1 and the clause not-X1, is unsatisfiable; you don't have to look at the rest of the problem to know it's unsatisfiable. But others require exponential time or space to establish unsatisfiability in some proof calculus, where the proof objects are potentially unbounded and in particular will not fit in NeuroSAT's memory. So even if it were able to use some kind of stack data structure, we wouldn't be able to trivially decode the proofs, because they won't even fit in the final embeddings of NeuroSAT. But we hypothesized that if the unsatisfiable problems in the training set
have short proofs, NeuroSAT will
learn to find them. And to test this, we generated a data set where, like SR(n), the data comes in pairs, one SAT and one unSAT, and they differ by negating a single literal in a single clause, but this time, by construction, the unSAT problems have small unSAT cores. And we do this in a simple way. We define this new distribution SRC(n, u), where u is an unSAT core. We start with the small unSAT core u, over a small number of variables k less than n. We're going to flip one literal to make it SAT; we pick the unSAT core so that this works. Then we add clauses as long as it remains satisfiable. So we have a satisfiable problem here, and then we reintroduce the unSAT core by flipping that single literal back, to construct the unsatisfiable problem. And we return the pair from before and after that final flip. And we train NeuroSAT on SRC(40, u), where u is sampled from a small fixed set of cores. I think we used four cores: pigeonhole 2,1, pigeonhole 3,2, the Rivest unsat core, and one other.>>The core…>>We are going to
talk about this. So, NeuroSAT is only
trained as a classifier, and we’re going to see if
we can decode the cores.>>Oh yes. So you
don't [inaudible]>>NeuroSAT, the interface doesn't change a priori.>>Sorry, I forgot that also for the SAT case we could use this trick.>>Yes. And the results: we get 100 percent accuracy on the test set within five minutes of training. It's a much easier problem than the first distribution,
which you would expect. Now let’s see how
it actually works. So here’s a run. We call this NeuroUNSAT by
the way when we train it on this just to
keep them separate. This is the same kind of
visualization we saw before. And the first six rows correspond to the six variables that are part of the unsat core. So this is P.P. three
pigeonhole principle three, two. And you see that these are voting unsat. It clearly recognizes that those six variables are worth looking at. It does some kind of computation, and eventually concludes that it's not an unsat core. And then all the literals start voting SAT. In contrast, here's
the same problem but with the unsat core
reintroduced. It starts out
almost identically, doing some computation on the six variables
involved in the core. And eventually concludes that
it actually is an unsat core. And all of the variables in the core vote unsat with extremely high confidence. This is black; this is like a negative-100 logit. So this is extremely high confidence, and it completely outweighs these plus-ones and plus-twos. So this suggests that it has learned to at least memorize the unsat cores that we planted in
the data set. Yeah.>>So the variables here are the literals, not the clause variables?>>Yes. These are the literal votes.>>So to select
the unsat core, you have to figure out which clauses are in it…>>Yes. But I'm going to
talk about this right now. So, first let’s talk about just recovering the literals
that are in the unsat core. As you can guess,
you can just look at which ones are voting
unsat with the most confidence. And that is usually going to consist of exactly the literals in the unsat core. But as before, a more reliable approach is to cluster the literal embeddings at the final timestep. And when we do that, we see that, for a problem in SR100 with the same pigeonhole principle core implanted, the 188 literals that
are not in the core are in their own cluster and
the 12 literals that are part of the core are
in their own cluster. And this decoding trick
works 98 percent of the time on the test set SRC(40, u). But as you raised, this is not exactly what you want; you actually want the unsat core. Still, as long as the number of literals is small, you can construct a proof efficiently just by knowing that you should branch on these first. The reason why we
don't get unsat cores is that the literal embeddings are forced to be more semantic, because we're using them to vote about satisfiability. If our aim was to get unsat cores, we would predict based on the clauses instead of the literals, and then the clauses would be more semantic and would cluster. But we didn't test this, partly because during training we batch together problems that have the same number of variables and different numbers of clauses. So, the way we set things up, it was easier to batch when we only predict from the literals, since the number of clauses changes. But in principle, with a more clever setup, you could predict based on the clauses, or both, and then potentially get the unsat core directly. But it clearly has
the right semantic information. It clearly knows
about the unsat core. Let me summarize
the results so far. We train NeuroSAT as a
classifier on random problems for a few dozen iterations per problem and it works
well as a classifier. It learns to solve
the problems on its own in order to explain
the supervision bit, and we can almost always decode a solution
when it guesses SAT. It extrapolates well, so we
can run for more iterations. It can solve
much bigger problems, not with perfect reliability
but quite often, and it can solve problems
from a wide range of domains. Lastly, it learns
different algorithms depending on the training data. So in SR40, where we don’t think there
are small UNSAT cores, it learns to search for
satisfying assignments and in src40 where we
apply the UNSAT cores, it learns to detect
them instead of searching for
satisfying assignments. Now, I want to share
what I think are the main lessons and
takeaways from NeuroSAT. First to answer the question
I posed in the beginning. Yes a Neural network
can learn to perform precise
logical reasoning. It might not be
perfect, it might have some extrapolation errors, but for the most part
it's clearly doing it; it clearly knows when it has found a satisfying assignment. And this is a bit-sensitive property; it's sensitive to
individual bit flips in the input and so forth. Two, I think this is
the most interesting part: stochastic gradient descent, or back propagation alone, can find such programs without any discrete search, even with very weak supervision. So, it's almost amazing, but the search algorithm emerges smoothly and gradually
from back propagation. There’s no moment where
it learns to search. The accuracy just gradually goes up and up during training. And by the end, it has synthesized
a search procedure but there’s no moment
when it does that, there’s no discrete search,
it’s a smooth process. It’s such an alien
computational model that it’s hard to even imagine
but this is what we found. Third, message passing
architectures are, I think, surprisingly
powerful and flexible. So the same
architecture was generic enough to learn two very different algorithms depending on the training data. One is a search
procedure and one is this subgraph detector. And in fact, we have
other experiments where we train with literal level
supervision and it seems to learn an entirely
different algorithm that’s more like a pattern
recognition system with no phase change, no evidence of detecting
satisfying assignments. So I think it’s
a really generic architecture that can encode many different
kinds of computation. And fourth, random training
data can be useful. So, conventional wisdom is that random training data prevents learning patterns, and that if you train on random data, it's extremely unlikely to generalize to problems
use of this property. We use random training
data to prevent learning patterns exactly to force it to learn an algorithm. We force it to be
more systematic. We don’t think that there are enough patterns to exploit in SR40 to be able to predict that bit without
being more systematic, and we use this
to our advantage. And finally, I want to just talk about SAT solving in
practice very briefly. So modern SAT solvers as most of you know are extremely
well engineered. They can solve
real-world problems with literally millions
of variables. They’re really among
the technological marvels of our time. In contrast, NeuroSAT, at least as trained in the regime that we trained it in, is vastly less reliable, efficient, and scalable than the state-of-the-art. I want to make that clear.
However, we still have no idea what NeuroSAT is
and is not capable of. So, we really only trained
on toy scale so far. We train on
small random problems, for a few dozen iterations, for a few hours,
on a single GPU. And conventional wisdom
in machine learning is that scaling alone can
be extremely powerful, and it’s a mistake to assume you know an upper bound
on how well your system is going to do until you really scale it up on more data
and bigger hardware. And also, there’s a lot of potential room for improvement in the architecture itself. We’re talking about giving
it a source of randomness as one way that might
lead it to synthesize a much richer search procedure. And lastly, the
hardware projections are extremely favorable for
this kind of approach. So, hardware used by traditional SAT solvers
as you all know, has stagnated and
the free lunches are over. In contrast, the Neural
network hardware renaissance is just beginning and one day a descendant
of NeuroSAT might actually be able to
beat the state-of-the-art, although we have no evidence of this right now, and I don't think it's anywhere close. I don't think it's right around the corner, but 10 years out, if the hardware improves by 1000x, I think it's possible. And that's all I
have, thank you.>>So, in other neural net models, people have been adding all sorts of external memories, and I mean the obvious thing is the stack.>>Yeah.>>Right? And so do you have->>That's a great question.>>-thoughts on adding new memories to this?>>So, before we came up with the unSAT core experiment, we were trying to get it to do exhaustive search, and we gave it access to a neural stack, and it didn't use it, on SR40. I think we could have come up with training data regimes where there was an incentive to use it, but on SR40, it's able to explain the bit by searching in the number of time steps we allow. But I think it's a good idea. Intuitively, it's what it would need, yeah.>>Right, and in general I mean, I think when you're
learning programs, there's a whole school of thought that, to make the neural net more comprehensible, we should give it some basic data structures we do understand, and have it use those. But I don't know how successful that's been; also, control flow. And program synthesis sort of says, well, let's try to use machine learning and get a readable program out. But well, that's a whole other topic.>>I think the algorithm
it learns could not be converted to a simple, readable program.>>Right. With most neural nets, we don't have a chance of extracting these things right now, though. There's been some work on getting decision trees out; I don't know, I don't really follow that.>>I think this is, this project is
some evidence that we might still be underestimating how powerful the algorithms that it learns, the incomprehensible ones, might be, even without being able to extract them.>>So there's also something
interesting, which is that you are doing fixpoint iteration, whereas a lot of the uses of these trained models are one-shot, right? And so, you're inherently doing a fixpoint with no convergence guarantee. But well, that's also the thing, isn't it? Because in a lot of applications, we just simply don't have the time, right? We need to look at the picture and very quickly classify: there's a pedestrian, right? But here, you're working in a regime where, well, it's fine to take time. I mean, you have a model that's iterable.>>Yes. Partly, it's that
the problems are hard enough to merit
the extra computation.>>Right, exactly.>>There have been a lot of projects in the
last few years that use a similar message
passing architecture.>>Yes.>>But they usually only run
for three or four rounds. They're doing something heuristic, not procedural, like predicting links in social network graphs or something like that, where it's not clear that the extra computation helps, and they haven't shown extrapolation to more rounds improving performance at all. I think part of this is-. Going back to the unSAT
core visualization. In Neuro UNSAT, it does not make use
of more rounds at all. If you run it for
one more round, it loses this information. It doesn't even converge. It's learned this very precise algorithm: end up here at the right time.>>Yes.>>And it has no reason to preserve that information. So it's really an artifact of the training regime whether it was incentivized to learn something that would extrapolate
to more rounds. And part of the reason is that these UNSAT cores are the same UNSAT cores each time, so it's only four different subgraph detectors it needs; it can time how long it takes. Whereas, when it's searching for satisfying assignments, it's going to find them at different iteration numbers. So it needs to learn to
preserve that information. So it learns to
converge around those. Even though it’s not
trained for more iterations, it at least learns convergence. Because sometimes it's
going to find it on iteration five,
sometimes iteration 15, and it’s penalized
very heavily for losing the solution that it has found.>>Okay, cool. So, if you threw randomness into this and parallelized, you might bump up these curves a bit, the ones you showed for the larger instances. That remains to be seen.>>I think for those, I suspect we could get 100% accuracy.>>Really? With randomization,
just adding randomness?>>And just training
on sr100, SR200.>>Oh sure, sure, you’re
doing the training there.>>Yeah.>>But it would be
nice to just train on smaller instances
and then probably just have some
technique to boost.>>Yeah. So I don’t know how, I think empirically, 40 is
not hard enough to make, it needs to learn something
that can do restart. Maybe at 100 it can
learn something really general that would extrapolate
to much bigger problems. I don’t think you
need to train on exactly the size you
are going to test on. But you need to train
on something that is qualitatively in some
similar ballpark of difficulty. Or at least it’s hard
enough to require you to learn the tricks you need to solve the bigger problem. One of the things I really
want to stress is just how sensitive what it learns
is to the training data. It will learn essentially
the simplest thing that lets it predict well
on the training data, and if that simple thing is not what you need to
solve the harder problems, it will not solve
the harder problems. Yeah. Oh, were you
raising your hand? Sorry, I thought this was-. Yeah.>>Is your data also available?>>Yes, so, the model and the training data generators are all online.>>Oh, okay.>>The code is here, and the arXiv paper is there.>>Cool. Okay.>>Okay. Thank you.
