# NeuroSAT: Learning a SAT Solver from Single-Bit Supervision

>>It's my great pleasure to introduce Daniel Selsam. Daniel is a student, a PhD candidate at Stanford. He was interning with us one and a half years ago. I have been collaborating with him for about three years, on completely different topics. Last time he visited us, he was using formal methods to verify machine learning algorithms. And today he is showing the progress he has made in trying to learn a SAT solver, to have a neural network that learns SAT.>>Thanks, Leo. Hi, everybody,

thanks for coming today. I’ll be showing you how

to learn an entire SAT Solving Algorithm from

Single-Bit Supervision. And this is all joint work with Leonardo and several others

back at Stanford. So, as most of you know

neural networks have beaten the state-of-the-art on

many important problems just in the last few years. Most notably object recognition

and speech recognition. But there have also been

major successes in diverse, and I think unexpected domains including machine translation,

social network analysis, heuristics for board games

like chess and Go, even predicting

properties of molecules, optimizing data

center efficiency, and surprisingly even learning high-performance

indexing data structures. And yet we still know

very little about what neural networks

can and cannot do. In this work we consider

the following question, can a Neural Network learn to perform precise

logical reasoning? And more specifically, can a neural network learn

to solve SAT problems? Here’s our setup. We’re going to train a neural network as a classifier on a data set

of input output pairs, where the input is

a SAT problem P, and the output is a Single-Bit indicating whether or

not P is satisfiable. At test time we’re

going to feed problems into our trained model

which we call NeuroSAT. And it’s going to predict

as it was trained to do whether the problems

are satisfiable or not. As we'll see later in the talk, it's able to predict

with high accuracy and when it does

predict satisfiable, we can almost always

decode an actual solution, a satisfying assignment from the internal activations

of the network. And we’ll see how

this works shortly. For some background on SAT, I think most of you will be familiar with this

but just in case. A propositional formula can be represented as an AND of ORs. So here's an example formula: (not X1 or X2) and (not X2 or X1). Some jargon: X1, not X1, X2, and not X2 are all called literals, and C1 and C2 are both called clauses. Now, a formula is satisfiable if there exists

a satisfying assignment. So the formula above is

satisfiable because there exists a satisfying

assignment specifically, the assignment that maps

X one and X two both to true satisfies it because the first clause is satisfied

because X two is true. And the second clause is

satisfied because X one is true. And the SAT problem

is given a formula, determine if it is

satisfiable and if it is, find a satisfying assignment. Now, our goal is to

train a neural network to predict satisfiability

and we’re going to talk about decoding

solutions and actually solving the problems later on. And there are two main

design challenges, what problems do we train on? And with what kind

of architecture? Let’s talk about

training data first. So, the issue is that some problem

distributions might be arbitrarily easy to classify, based on superficial properties. But, a network that’s trained on such a data set

probably will exploit those properties and

won’t generalize and we won’t really have learned

anything semantic about SAT. So we want to construct

difficult problems to force it to learn

something general, to be more systematic about

exploring the SAT problem. So we define a distribution

that we call SR of N over problems such that the problems

come in pairs, one unSAT, one SAT, and they differ

by negating a single literal and a single clause but otherwise have

the same statistics. So, the idea is that no simple superficial

statistical analysis will be able to distinguish

between the two. And specifically

to sample a pair of problems from SR of N, we keep sampling uniformly

random clauses until the problem

is unsatisfiable. At which point we flip a single literal in the final clause

to make it SAT, and we return the pair before and after we

flip that literal. This distribution

is not guaranteed to be hard in any complexity-theoretic sense, but it's intuitively

statistically hard. For small values of

N it’s still easy for traditional SAT solvers but it will be very hard for

a standard classifier. Now let’s talk about

Network Architecture. So we could treat SAT problems

as character strings.>>You said these are clauses?>>Yeah.>>You keep sampling random clauses?>>As if they were, you know, uniformly random, as in uniform random SAT.>>And then you add clauses. You can->>We start with N variables and no clauses.>>Okay.>>We sample a uniformly random clause on the N variables, and check with a SAT solver to see if it's satisfiable.>>Right.>>If it's satisfiable, we do it again. We keep adding clauses like this.>>So you keep adding clauses, okay, that wasn't clear here.>>Yeah. Sorry, we keep sampling random clauses and adding them until->>And keep adding

them until unSAT.>>Yeah.

>>Right.>>Sorry that wasn't clear.>>The point is->>It's going a little fast.>>So we know that before the last clause is introduced, the problem

is still satisfiable. So we know that

flipping any literal in the last clause will preserve the model that was

satisfying previously. So we know we can recover a satisfiable instance by flipping a literal in the final clause, yeah. Sorry if I'm going too fast; please, anyone, stop me at any time.>>All right, I get it now. Okay, thanks.
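Since a few people asked, here is a minimal Python sketch of the pair-sampling procedure just described. The brute-force satisfiability check and the clause-size distribution (averaging around four to five literals per clause, per statistics quoted later in the talk) are assumptions for illustration, not the paper's exact generator.

```python
# Minimal sketch of sampling one SAT/unSAT pair from SR(n), as described above.
# Brute-force enumeration stands in for a real SAT solver; fine for small n.
import itertools
import random

def is_sat(n_vars, clauses):
    # A clause is a list of nonzero ints: +v for x_v, -v for not-x_v.
    for assignment in itertools.product([False, True], repeat=n_vars):
        if all(any(assignment[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def sample_clause(n_vars):
    # Clause size: small, averaging roughly four to five literals (assumed).
    k = 2
    while k < n_vars and random.random() < 0.7:
        k += 1
    vs = random.sample(range(1, n_vars + 1), k)
    return [v if random.random() < 0.5 else -v for v in vs]

def sample_sr_pair(n_vars):
    clauses = []
    while True:
        c = sample_clause(n_vars)
        if is_sat(n_vars, clauses + [c]):
            clauses.append(c)
            continue
        # The problem just became unSAT; flipping any literal of the final
        # clause preserves the assignment that satisfied everything before it.
        sat_clause = [-c[0]] + c[1:]
        return clauses + [sat_clause], clauses + [c]  # (SAT, unSAT) pair
```

For the SR uniform 10, 40 training distribution discussed later, n itself would first be drawn uniformly from 10 to 40 before calling sample_sr_pair.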

>>So, let's talk about Network Architecture. We could treat SAT problems as character strings and use standard sequence models. We saw that: I wrote a SAT problem above as characters, so standard sequence models could be applied. But there are many invariances

that we want to enforce that the logic of

satisfiability admits. For example, SAT problems are permutation invariant

in a few ways. You can permute literals, you can permute

clauses and you can permute literals

within a clause. And you don’t want to

have a Sequence Model have to learn all of

this from scratch. We can just build

it into the model. SAT problems are also negation invariant, which means that you can

negate every occurrence of any specific variable, also without affecting

satisfiability. So we’re going to try to build these invariances into

our Neural Network Model. And the approach is given

a formula like this. We’re going to construct

an undirected graph with a node for every literal, so there's a node for X1, not X1, X2, and not X2, and a node for

every clause C1 and C2. There’s going to be

one type of edge between every literal and

every clause it appears in. So there’s an edge between

X1 and C1 and X2 and C1 because C1 contains the literals X1 and X2. And there's also a separate type of edge, which I denote with a dotted edge here, between a literal and its complement. So X2 and not X2 are connected by the dotted edge, and X1 and not X1 are connected by that edge.
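As a concrete illustration of this encoding, here is a minimal Python sketch that builds the literal-clause adjacency matrix for a formula; the row layout (positive literals first, then the negations) is an assumption for illustration.

```python
# Minimal sketch: build the binary literal-clause adjacency matrix described
# above. Clauses are lists of signed ints, so the example formula
# (not x1 or x2) and (not x2 or x1) is [[-1, 2], [-2, 1]].
import numpy as np

def adjacency(n_vars, clauses):
    # Rows 0..n-1 hold x_1..x_n; rows n..2n-1 hold their negations, so the
    # "dotted" complement edges simply pair row r with row (r + n) mod 2n.
    M = np.zeros((2 * n_vars, len(clauses)), dtype=np.float32)
    for c_idx, clause in enumerate(clauses):
        for lit in clause:
            row = lit - 1 if lit > 0 else n_vars - lit - 1
            M[row, c_idx] = 1.0
    return M
```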

Now, the basic idea of the model, which we call NeuroSAT, is we're going to maintain a vector embedding, some inscrutable collection of floating point numbers, at each node in the graph. And we're going to iteratively pass messages along

the edges of the graph. So, a literal is going

to pass messages to the clauses it appears in

and its complement literal. Clauses are going

to pass messages back to the literals

that occur in them. After T time steps, we're

going to map literals into scalar votes using

a learned network. And then we’re just

going to average the votes to compute

a single logit that's going to guess

whether the entire problem is satisfiable or not. Now I’m going to go

into the architecture in great detail in

the next few slides. But if you have any high level questions

I'll take them now. Yeah.>>It's related to survey->>I'm going to discuss Survey Propagation a little at the end. Survey Propagation does not look like this; there are several architectural differences. We'll compare with Survey Propagation. In Survey Propagation,

the messages that a clause sends to a literal depend on which literal it is. Like, you marginalize

out the other literals, which you can’t

express in this way. But it’s structurally similar in the sense that it’s

a message-passing algorithm. So, here’s the NeuroSAT model

in more detail. Where we have a bunch

of learned parameters. So, two vectors, the

L init and C init to initialize

the embeddings for each of the literals and clauses. Three multi-layer

perceptrons, L msg, C msg, and L vote. I'll explain all

of these shortly, and then two layer-norm LSTMs, L update and C update. Now, we’re going to initialize, one matrix that stores all

of the literal embeddings. So, it’s 2N by D, so every row contains the embedding for

one of the literals. And we’re going to

just initialize all of the rows to L init, the learned vector and

similarly for clauses. Every row of C is going to be the embedding

for one of the clauses. And we’re going to run

T iterations of message passing. So, every iteration of message passing

consists of two stages. Stage one, literals are going to pass messages

to the clauses. So, here all of

the literals have embeddings and

they’re going to send them to the clauses

that they occur in. The clauses are going to sum the incoming messages, and each clause also gets its previous state, and uses the RNN, the C update layer-norm LSTM,

to update its own embedding. And here’s the matrix notation. It’s hard to read but it

just says what I just said. You’ve got the

literal embeddings. You apply this learned MLP to compute the messages,

one for each literal. M is going to be

the binary adjacency matrix. So, this is just summing the messages and sending

them to the clauses, and then each clause also gets its previous hidden state

and it calls the update LSTM. Now, stage 2 is similar. Clauses pass messages to the literals. The only difference here is that each literal gets messages from the clauses, gets its old state, and also gets a different kind of message from its complement, because of the extra edges in the graph. Intuitively, these dotted edges encode the constraint that a literal and its complement should probably not be equal.
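To make the two stages concrete, here is a minimal PyTorch sketch of one round of this message passing. Plain LSTMCells stand in for the layer-norm LSTMs, the message MLPs are collapsed to single linear layers, and D = 128 follows the dimension mentioned shortly; this is an illustration of the scheme as described, not the paper's actual code.

```python
# Minimal sketch of one NeuroSAT message-passing round, per the description
# above. M is the (2n x m) literal-clause adjacency matrix built earlier.
import torch
import torch.nn as nn

D = 128
L_msg = nn.Linear(D, D)           # literal -> clause messages (MLP, simplified)
C_msg = nn.Linear(D, D)           # clause -> literal messages (MLP, simplified)
C_update = nn.LSTMCell(D, D)      # clause state update
L_update = nn.LSTMCell(2 * D, D)  # literal update: clause msgs + complement

def flip(L):
    # Swap each literal's row with its complement's row.
    n = L.shape[0] // 2
    return torch.cat([L[n:], L[:n]], dim=0)

def mp_round(L_state, C_state, M):
    L_h, L_c = L_state
    C_h, C_c = C_state
    # Stage 1: each clause sums incoming literal messages, updates its state.
    C_h, C_c = C_update(M.t() @ L_msg(L_h), (C_h, C_c))
    # Stage 2: each literal sums its clauses' messages and also sees its
    # complement's embedding (the dotted edges), then updates its state.
    lit_in = torch.cat([M @ C_msg(C_h), flip(L_h)], dim=1)
    L_h, L_c = L_update(lit_in, (L_h, L_c))
    return (L_h, L_c), (C_h, C_c)
```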

>>What is a message? What are the messages?>>So, at every time step, every node has some collection of

floating point numbers that represent something potentially inscrutable

about that node. So, at every time step I

have some vector here which represents something hopefully meaningful

about the literal. Now, that’s represented in

this matrix L of t. L msg is a learned Neural Network that

takes those embeddings and computes some other vector of floats that

represents a message. But I have no a

priori knowledge of what that actually

encodes semantically. It has the freedom to choose

what those floats represent.>>And what are the sizes of these vectors, again?>>I don't think I said yet. We eventually run it with 128, but it's pretty arbitrary.>>It's a good number.>>If we had bigger

hardware we’d run it.>>Got it. Okay, super.>>As big as you can get.>>It’s related to

problem size potentially.>>Potentially,

unclear actually.>>Oh, okay, okay.>>Because reduction sometimes

[inaudible] you don’t always want. I mean having a smaller vector potentially versus

something distracting.>>Conventional wisdom in

machine learning is to use as big a vector as

you can get away with in terms of

hardware and performance. We also found that it trains, the bigger we made D

the faster it trained. But it’s not major. You

can train these with 16, even with eight at first.>>Okay.>>Yeah.>>And you're going

to describe how you do the learning itself later?>>Yeah, although

the short answer is backprop. But I'll explain

the signal in a second. So, here’s the literals

and it’s similar, I won’t go through

this in detail. It's identical, except for getting the complement of the literal as a separate message; otherwise it's the same as for the clauses. Now, after T iterations, we're going to compute

a scalar “vote” for each literal by applying this learned multi-layer

perceptron, L vote, that maps each literal

embedding into a single scalar. We’re going to then

average the votes to get a logit which is

just a single scalar for the entire network. And then we’re going to

predict satisfiability using the logit by just

taking the sigmoid function, which is going to return

something between zero and one. Now, we train with

the cross-entropy loss. So, we're using back propagation to minimize the classification error on a training dataset.
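Continuing the sketch, the readout and loss might look like this in PyTorch; L_vote is again collapsed to a single linear layer for illustration, reusing D and the imports from the earlier sketch.

```python
# Minimal sketch of the readout described above: a scalar vote per literal,
# averaged into one logit, trained against the single supervision bit.
import torch
import torch.nn as nn

L_vote = nn.Linear(D, 1)
loss_fn = nn.BCEWithLogitsLoss()

def predict_logit(L_h):
    votes = L_vote(L_h)   # shape (2n, 1): one scalar vote per literal
    return votes.mean()   # average the votes into a single logit

def loss(L_h, is_sat):
    # sigmoid(logit) is the predicted probability that the problem is SAT;
    # cross entropy against the one supervision bit drives backpropagation.
    return loss_fn(predict_logit(L_h), torch.tensor(float(is_sat)))
```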

>>What is the input to your training?>>SAT problems

represented as graphs and a single bit of supervision

for each problem indicating one if it’s SAT, zero if it’s unSAT.>>So you use these pairs that you

were mentioning earlier. Sets of these pairs,

large sets of these pairs.>>Yes, exactly. But importantly, we don't tell it anything

about solutions. We’re just giving it single bit supervision

for each problem. Let’s talk about

experiments now. So, we’re going to train

on SR uniform 10, 40. The uniform part is

not actually essential. It’s just intuitively nice to use a curriculum here because it’s going to be

harder to induce structure on bigger problems. But we train things

without the curriculum and it works just as

well. And we’re going to->>Are those variables

clauses, 10, 40?>>No, sorry. Uniform meaning, we’re sampling the number

of variables from a uniform distribution

between 10 and 40. The number of

clauses is determined by how many we can sample

before it gets unSAT.>>UnSAT, right.>>And then we're going

to test on SR(40). Just some perspective on SR(40). There’s 40 variables,

so there are a trillion possible assignments. About 200 clauses per problem. Roughly 1000

literal occurrences, so about five literals

per clause. And all of the clauses

are uniformly random, so every literal is

treated symmetrically. Including X I and not X I are

both treated symmetrically. And every single problem

in the distribution is a single bit away from flipping

between SAT and unSAT. And just a caveat,

because 40 is small, it is still easy for state-of-the-art solvers. So MiniSat can solve SR(40) in almost [inaudible] number of milliseconds. And the results: NeuroSAT predicts

with 85 percent accuracy. We’re going to make that number

a lot better shortly, but first let’s

understand how it even beat 50 percent. Yes.>>So, intuitively

what you learn here, is for any pair of formulas, whether you have

the original unSAT or if there’s one flip

which becomes SAT. That's what you are

learning here, correct?>>Well, we don’t tell

NeuroSAT that of these two, one is SAT and one is unSAT. It just gets random problems from the distribution.>>I see. So, when you

feed it for training, you do not say that this one was unSAT and this one was SAT?>>Correct. We just feed the individual problems.>>But they're labeled.>>They're labeled, but

we don’t tell NeuroSAT.>>So, you do tell them.>>We don’t tell NeuroSAT that they should look at this pair.>>So, input is not a pair.>>Every single

individual problem.>>No, it's SAT, but, as he says, effectively- Okay, that's not a criticism, I'm just trying to see

what the intuition, if there’s an intuition here. What’s the intuition that

you had in your head? What’s actually

been learned here…>>We have actually a very good explanation

of what’s being learned, which I’m going

to get to shortly.>>Okay.>>So, we were all really surprised by this number

and we trained a few baselines including LSTMs and we could not break

50 percent on any of them. We couldn't train a model that, even when we fixed the number of variables and fixed the representation, could distinguish, even on small training sets, between the SAT and unSAT problems. We actually bombarded this

with experiments and made no progress until we just visualized what it was doing

and then it became clear. So, this is an example run of NeuroSAT on a problem

from SR(20). Here, this is the matrix of scalar literal votes

at each time step. So, this is a 20 by two matrix, because there’s 20 variables. So, the Ith row is going to be the scalar vote for X I and then not X I in

the second column. Here, white is zero, red is positive, and

blue is negative. Okay? So, red means SAT, white means unsure,

blue means unSAT. Now, as time goes on, there’s some red and blue here from the

initialization which is an artifact of

the initialization vectors. But very quickly

all of the literals start voting unSAT

with low confidence. So, light blue means

roughly logit of negative one. So, assigning, say 16 percent probability

just to it being SAT. And you can see there’s a lot

happening each time step. A lot of the dots are

changing and moving around, but consistently almost all of the literals are voting unSAT

with low confidence. And we see this on

every problem it runs on. And this happens for a while: unSAT with low confidence. Quite a few things

are happening, a few speckles of very, very light red that

don’t make any difference because they get washed

out by the averaging. But then, all of a sudden there’s a very distinct

phase transition. All of the literals

start voting SAT with extremely high confidence

and the network converges. So, these light blues are

about negative one logit. These reds are about plus 40. So, this is 99.9999999 percent sure that these problems are satisfiable and

the network converges. We see this for every problem

that it runs on. If it’s unSAT, it looks

like this forever, and if it's SAT, it either looks like this for a long time or, usually, eventually has this phase transition where it converges. Yes.>>Just to make sure, because

that’s where we’re at. This is when you test basically, not when

you learn, here. Right?>>Yes, this is a problem

that was not trained on.>>Okay, okay.>>Yes. So, somehow around here, it becomes extremely confident that this problem

is satisfiable. Now, we weren’t exactly sure what it found

that convinced it. But if you look carefully at the literal votes

after convergence, it's a little hard to see

because the resolution is poor. But you can maybe make out a pretty distinct

checkerboard pattern. Meaning that there

are only roughly two, maybe three shades of red here. And for each variable, there’s one literal that is distinctly darker

than the other one. So, you can see

a bit pattern here. Do you guys all see it, roughly?>>[inaudible]>>You wouldn't call it what?>>Sorry, go ahead, go ahead.>>So, we hypothesized that maybe this is encoding a satisfying assignment, and that it had become so confident because it had found it. So we wrote a script to just read this off and see if it's

a satisfying assignment, and it turns out that it usually is a

satisfying assignment, which suggests that the reason why it became so

confident is actually because it had detected

a satisfying assignment, and knew that that indicated that the problem

must be satisfiable. Now, at first I thought maybe the literals that are true in the assignment are

voting SAT with more confidence even though I didn’t know why

it would do that, it turns out that

is not the case. Sometimes, the literal

that is voting lower, that is voting SAT

with less confidence, is the one that’s true

in the solution. Yeah.>>[inaudible]>>You try both possibilities. So, you try the thing

that maps the one that’s lower to true everywhere, and then you try

the one that’s- Yeah. But this has turned out to be an unreliable way to

decode the solutions. It turns out that

this pair pattern is just an artifact of the fact that the higher dimensional

literal embeddings cluster according to the assignment. So at convergence, in

the 128 dimensional space, the literals that are true are going to be in one cluster, the literals that are

false are going to be in another cluster. And this pair pattern is just an artifact of the projection preserving that clustering

most of the time.>>Excuse me. If

you have this set, you have two [inaudible]

and run it through this. And one cluster could be the true and the other

could be the false.>>Mm-hm.>>And suppose that you

have two disjoint sets of literals that appear

in disjoint sets of clauses, then would your learning necessarily correlate which is light versus dark in

one cluster or the other?>>That’s a good question. In that case, it is conceivable that you would

end up with four clusters: the true ones for problem one, the false ones for problem one, and so on. We didn't test that; SR(40) problems are essentially fully connected. My guess is that the two clusters for the two sets of true literals are going to be closer together, and then with two-clustering you'll still get

the right solution. But I don’t have

a proof of that. But the clustering

analysis suggests a much more reliable way

to decode solutions, which is to just directly cluster the higher-dimensional embeddings. So here we can watch the two-dimensional PCA projections of the high-dimensional embeddings on a run from SR(40). Now, these are the

literal embeddings. And I’ve colored

them according to the solution that

it eventually finds. So you can see exactly what

it’s doing over time. These are just the

last nine time steps leading up to convergence. And you see it doesn't cluster

at all, it’s all mixed up. It’s all mixed up. It’s sort

of starting to separate. Here it’s starting to separate, but one literal is

still incorrect. And then it pulls away, and then at convergence it falls

into two distinct clusters. And now this method of

decoding is very reliable. On 96 percent of the problems in the test distribution for which it guesses SAT, this simple trick of two-clustering will recover the satisfying assignment.
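Here is a minimal sketch of that two-clustering decode, using scikit-learn's KMeans. Reading the assignment off the positive-literal rows and trying both cluster-to-truth-value mappings is an illustrative simplification of what was just described.

```python
# Minimal sketch of decoding a satisfying assignment by 2-clustering the final
# literal embeddings. L_h: (2n, D) numpy array, rows laid out as earlier.
import numpy as np
from sklearn.cluster import KMeans

def decode_assignment(L_h, n_vars, clauses):
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(L_h)
    for true_cluster in (0, 1):  # we don't know which cluster means "true"
        assignment = [labels[i] == true_cluster for i in range(n_vars)]
        if all(any(assignment[abs(l) - 1] == (l > 0) for l in c)
               for c in clauses):
            return assignment  # verified satisfying assignment
    return None  # decoding failed (rare on SR(40), per the 96 percent figure)
```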

So let's talk about solving. Oh, yeah.>>Because it doesn't really use any supervision for finding that out.>>Right.>>Do you have any sort of [inaudible]>>Yeah. So, as I said before, in the training distribution, the satisfiable problems and the unsatisfiable problems

are statistically identical. There is no way

for it to predict that bit based on

superficial properties. It learned that

the best way to predict that bit is to search for

satisfying assignments, and not to guess SAT until it found one, at which point it guesses SAT with extreme confidence and suffers no objective

function cost. Yeah. I think it’s

an artifact of the distribution not permitting any other way of

explaining the data. We’re going to talk

about training it on slightly different distributions

later in the talk, and you’ll see how

the training data affects the procedure

that it learns. So let’s talk about

solving the rest. Sometimes, there are

satisfiable problems for which it does not guess SAT. It will continue to

guess unSAT with low confidence.

Here’s an example. Just keeps wandering around, but because NeuroSAT is

inherently iterative, we can just run it for more iterations and

see what happens. So we run this, run this,

run this for a long time. Fast forward to around round 137. We only trained

and tested so far on 26 rounds of message passing. So this is five or

six times more rounds than it ever performed

during training. And all the literals are still voting unSAT with low confidence. A few speckles of red, and then around round 150, bam! Phase transition, extremely

confident in SAT, the literal embeddings converge. And once again this bit pattern will usually encode

a satisfying assignment. And the clustering technique is even more reliable as always. Now, here’s a graph showing the percentage

of problems from SR(40) that it solved as a function of the number of

rounds of message passing. So it solves about 70 percent at 26 rounds because that

is what we trained it on. But as we run it for longer, it continues to

solve more and more. By about a thousand rounds

of message passing, it’s solving

just about 100 percent of the problems in

the test distribution SR(40). And this is solving end to end. This is not just

prediction accuracy. We are decoding solutions for almost 100 percent of the problems in

the test distribution. And curves like this are, I think it’s fair to say fairly unprecedented in

machine learning. Usually, algorithms

are not iterative. And even when they

are, they don’t just continue to do better

when you run longer.>>So, what answer can it give for unSAT over there?>>We're going to

talk about this later. Right now, no. Right now

it wanders around forever. When trained on SR(40), it will never become confident in unSAT, but we're going to

talk about this later. There are a few

cool experiments. Now, it would be

implausible to solve much bigger or

harder problems with only 26 rounds of

message passing. You can imagine that

it’s just not possible to reliably solve big hard problems

with so few rounds. But because we can

run for more rounds, and it continues to

search effectively, it’s conceivable

what we can actually solve problems that are harder. So, let’s talk about

scaling to bigger problems. Here, this purple line is the same line I just showed

you on the previous graph. And we are plotting

the same lines for SR of N for larger

N. So SR(80), SR(120), SR(160), and SR(200). And we see similar curves where, for SR(80), instead of 70 percent,

it’s only able to solve maybe 40 percent when you

only run it for 26 iterations. But as you run it

for up to a thousand, it’s solving about 90 percent

of the problems. And even SR(200), which is a substantially harder problem, and which MiniSat is now starting to noticeably slow down on: it solves under five percent at the number of rounds it was trained on, but it gets up

to almost 30 percent when you run it for

a thousand rounds. Now again, although from a machine learning perspective the fact that these curves go up at all when you extrapolate the number of rounds is interesting and impressive, it also suggests that the algorithm as-is is highly incomplete. It's possible that this curve spikes up later; we didn't

run it for longer. But it looks like

it's asymptotic. Yeah.>>What was your training size, how many SAT problems, how many examples do you feed your training algorithm?>>It's a stream. So, I don't actually have

the exact number. Probably.>>Roughly?

>>A few million.>>A few million?>>Yeah.>>Okay.>>But none of the problems

are bigger than SR(40). We are training on SR(10) to SR(40). So our hypothesis for

why it flattens out here is that during training, it’s essentially performing

some kind of local search, and it never has to

restart during training. SR(40) is easy enough that its heuristics are essentially complete, without needing to restart when it gets stuck. So when we watch it run, it very quickly

makes big moves. In the first few steps,

it will flip many, many literals at

a time and very quickly satisfy over 99 percent

of the clauses. And then, it will start taking much more cautious steps

flipping only a few literals. And we see for

these longer runs and the big problems that it

just stops being aggressive. Now, we think

an architectural change that would make it much more reliable when run on

harder problems for many iterations is giving it access to a source

of randomness. So it has some way to

learn when it’s stuck and when it should inject

noise into its embedding. So, we haven’t tried

this yet. Yeah.>>Is it true that what we see here is what you mean by the iterations? So you have this recurrence relation between C_i and C_i+1, and these are the transformations that you apply, and that's the X-axis?>>Yes. Every iteration

here involves the literals passing messages to the clauses and the clauses passing messages

back to the literals.>>How fast is that? What does an iteration cost?>>Sorry.>>Go ahead.>>So, let me go back

and show you the updates. So, it’s applying

a single neural network here, which is as fast as applying that is. In this case, it's a three-layer neural network. It's basically a couple of matrix multiplies, sparse matrix multiplies for the graph, for the message passing, and then the C update is another matrix multiply. So, it depends on the dimension and

the number of literals, but it's not expensive. Yes.>>But compared to existing message-passing methods, of which Survey Propagation I guess would be one example, the difference then is the matrix multiplications, and where the multiplications come from?>>So, Survey Propagation

is not learned at all. It’s derived mathematically. The structure of

the messages is different. So as I said before,

a clause will pass different messages to

the literals that occur in it. Here, the clause passes the same message

to every literal, so there are a couple architectural

differences like that. Numerically, Survey Prop often fails for numerical reasons: you're dividing by zero. And here, there's

no division. So it’s-.>>But, I thought that

philosophical difference would be these matrix multiplications

that you get from your net.>>I don't know if that's a philosophical

difference, but that’s a big difference. That there’s a lot more

floating point computation being done per iteration here

than in Survey Propagation. Survey Propagation, it’s a few multiplications

and divisions on a single float for

each literal and clause. Yes.>>I’m just trying

to understand the comment that you made earlier

about the training. So, for each of these five here, we have different training

sets that you used?>>No, this is all- sorry. Everything we’re talking

about today until the very end is just trained

on SR uniform 10 and 40.>>Okay.>>This is all extrapolation

to bigger problems. It has not been retrained.>>So, SR 10? What is SR 10 and 40?

What is that?>>Sorry, SR uniform 10, 40: a curriculum, training on SR(n) for n somewhere between 10 and 40. The n is randomly

sampled between 10 and 40.>>So, for 80 for instance,

so the training set. Basically, what's going on? The algorithm is picking up patterns that are learned from smaller instances between 10 and 40, like if you give it 200, it tries to detect

these patterns somehow, correct?>>I don’t think that’s

the best way to think of it. It’s not the way I think

about it. I think that there are very few patterns that it's exploiting. And rather than looking for patterns, it has synthesized a local search algorithm, which is what makes it generalize where the patterns wouldn't.>>What evidence

do you have for your interpretation?

So for instance->>Yes. So, I think if we go

back to the video in action. I think this is actually->>This is a visualization of the run of the algorithm. It doesn't show that a pattern was learned and is being identified in a test sample. I mean, how do you see it?>>I'm not sure I

understand the distinction. I think this is

rock solid evidence that it learns that it needs to somehow search

for satisfying assignments, and that is the best way to

explain the supervision bit.>>But how does it find this?>>That, I can only speculate.>>It’s by pattern recognition, that’s why I’m asking

what it's learned.>>I have one relevant

experiment here. Trying to tell whether it

was doing a local search or global search. It’s not clear. So, we can decode

the solution by doing this two clustering

technique at the end. It’s not clear

that if we cluster before convergence, it's meaningful. But we hypothesized

that maybe it is an inconsistent assignment

before convergence, and we tested what properties

that assignment had. So when we do this, we see that the number of

clauses satisfied increases very quickly and rarely goes down, and we see that the number of literals that are flipped at each time step starts out high, and as the number of

clauses satisfied goes up, the number of literals

flipped, goes down. So qualitatively, it looks

a lot like local search, some kind of annealing. But it's not clear what stronger evidence would even constitute, because ultimately, the algorithm is inscrutable. I can't read

the matrix entries to understand specifically

what it’s doing.>>And we know somehow that it’s also getting stuck from

the graphs and the margins.>>Yes.>>Because it appears

to be plateauing.>>Yes.>>So, it’s not learning

enough from backtracking, enough to go search some

other space, because it seems so limited in its scope, somehow.>>Yes.>>Because of the size.>>I think, given the lack of needing to do any kind of

restart during training, it's just not prepared

for such hard problems. Are you satisfied? Do you have another hypothesis

for what it's doing?>>Well, I think there's an experiment that might be interesting. If you were using just a sequence-to-sequence model based on strings, a string encoding of the formulas, how would- can you approximate your result? Is there->>We implemented

a baseline and could not beat 50 percent on the classification task.>>I see.>>There's two reasons

why I think those approaches can’t

get off the ground. One is that, there’s

no obvious way to deliberate. I mean to deliberate, to run for a long time. A crucial part of

NeuroSAT is that it's iterative, that it can think longer. A single pass over a sequence is very unlikely to be able to solve big, hard problems. But also, it just has to spend all of its time learning the invariances

that we build in. That's another thing: there are so many different permutation invariances, an exponential number of them.>>But, if you were to normalize the inputs

for instance by sorting each clause's literals by name, you could get part of that, and still have the strings. So that would perhaps help test the pattern theory, so to speak, whether simple pattern recognition would prove powerful in this context. I don't know.>>Maybe. I would

be very surprised if somebody could beat

50 percent with something that didn’t have

this iterative formulation, but I would be thrilled to be proved

wrong about that. Yes.>>So, with regards to this

picking up the patterns, you mentioned that you used this SR distribution because you don’t want to pick up

some superficial patterns.>>Yes. We designed it to

try to prevent it from being able to exploit the patterns.>>So, did you actually do the other experiment

of running on completely randomly-generated data, so that you see what the accuracies are?>>Unfortunately, no. We've trained on regular random SAT at the empirical threshold, and it does perform well also. But that's almost

as patternless as our distribution because all of the literals are of

the same statistics. Even though there's not a single bit separating them, most of the problems at the empirical threshold are going to be->>Which data set is this?>>So, the data set we generate, SR of n.>>Right.

>>It is very similar to the standard uniform random SAT problems.>>Okay.>>The only difference is that we adjust the number of

clauses specifically for each problem to find the exact threshold where

it flips from SAT to unSAT. But what you can do is, you can just sample for a given n. That threshold is usually at the same number of clauses, roughly a 4.27 clause-to-variable ratio. So, if you just sample that number of clauses for a fixed n, you're going to be a few flips away from flipping. We trained on that, and it

also has a similar quality.>>What I'm trying to say is that, if your hypothesis is true, then the other way, training on that other data set, should

do significantly better because you can pick up the patterns specific

to the data set?>>Yes. So, I don’t

know if random SAT has such patterns

close to the threshold. But, I think one of the promises

of this technique is, if there is a distribution

of SAT problems that has special structure that you care about in practice, that this could learn a less generic, more custom procedure to exploit the patterns

in that data set. Yes. But, you have to be

careful with the data set, if you can predict satisfiability on the data set based on superficial patterns. If you have some degenerate data set where

your satisfiable problems come from planning and your unsatisfiable problems

come from scheduling, it could potentially predict with perfect accuracy based on patterns without learning that it needs to search for

satisfying assignments. So, it wouldn’t be useful.

Ideally, the usefulness of the patterns

is limited enough, that it has to learn to search, but it uses the patterns

to help search better. There’s a lot of art in

the data set design I think, and we’re going to see

networks that are trained on a different data set shortly

to really stress this point. Let me fast forward. Okay. Let’s talk about

generalizing to other domains. So, all of the test problems so far have been uniform random, which may have

its own statistical idiosyncrasies or may just, it’s just inherently

uninteresting to some extent. There is an incredible diversity of possible SAT problems. It's almost hard to conceive how this simple formalism can express such a great variety of different kinds of problems. Any problem in NP can

be reduced to SAT, and the structural and

statistical properties can really vary widely. So, an actual question is, does NeuroSAT trained

on the random problems generalize to other

domains or is it just exploiting properties

of random SAT? Now to test this, we generated a suite of diverse SAT problems. So first, we generated random graphs from a bunch of different random

graph distributions: Barabási-Albert, Erdős-Rényi,

Forest Fire, and a few others. Small graphs, all of them, looking roughly like this: about 10 nodes, 17 edges. For each graph, we generated

k-coloring problems, k-dominating-set problems, k-clique problems, and k-vertex-cover problems, all for a range of k, where we choose k so that at the hard end, under 50% are satisfiable, and at the easy end all of

them are satisfiable. And if you look at this graph, yeah if I remember correctly

this is a four clique, and I don’t think

there's a five clique. At least with pen and paper and some thought, you can solve all of these problems, but they're not perceptually obvious. They require reasoning and some kind of search, even for humans. But again, trivial for solvers, just because of the size. And just to really

stress the diversity, I want to show you

a few visualizations. So this is a visualization

of a problem from SR40. Here every red node

is a literal; the red edges are between a

literal and its complement. And there’s a blue edge between any pair of literals that

appear in the same clause. So this is a lousy

representation because we’re marginalizing

out the clauses. You can see that there's

just no structure it’s just a big tangled mess. Here is a, this beast

is a problem from SR200. You can see again

there’s no structure. It’s just an unbelievable

tangled mess. Now just a reminder, NeuroSAT gets graphs like

this at test time, where flipping a single edge changes whether it's SAT or UNSAT. And it's able to look at this and compute, in at least 30% of the cases, whether it's SAT or UNSAT correctly, and actually return

the satisfying assignment that

explains that bit. Now in contrast here’s

a graph coloring problem, which has this crystalline

structure, also very few edges. Here's a clique detection problem, where you've got this special pattern for each of the five cliques. The dominating set problem also has a very distinct structure that is distinctly non-random, and the vertex cover problem also has its own

little distinct flair. Just to stress how

different these domains are. Now let’s look at the results. So in total over

this suite of problems, there were almost 5,000

satisfiable problems. Average number of

clauses is 532. So about two and a half times bigger than the biggest problems

seen during training. And NeuroSAT solves 85% of the satisfiable

problems end to end. So again it’s just trained on small random problems

for 26 iterations, and now it’s solving, and

it’s finding solutions to 85% of problems that are bigger across a suite

of different domains. And in contrast,

the percent solved by Survey Propagation end

to end is zero percent. So Survey Propagation just

simply doesn’t work on these. It usually doesn’t converge, or it keeps oscillating

between one, zero and then zero, one. It’s not completely a fair

comparison because Survey Prop is not designed to solve

problems end to end, it’s usually used in decimation, where it just provides a heuristic for which

variables to set. But I keep these numbers here to stress that, as far as I know, we don't know of

a message passing algorithm with qualitatively

similar behavior to NeuroSAT. So it hasn’t just relearned an algorithm

that we know about, it’s doing something new that we don’t know how to explain. Yeah.>>But for the random problems,

solving problems?>>Yeah, random problems. I haven't run it on SR(n), but on random 3-SAT, Survey Prop will solve a decent fraction.>>Okay.>>Here, it doesn't work. So ironically, Survey Propagation is derived

mathematically for random SAT, and doesn't generalize to non-random SAT problems as well, whereas NeuroSAT was actually trained on random SAT, but has learned something more general that lets it transfer. Okay now, let's

talk about. Yeah.>>How do your results compare to the classic

the overall results on the phase transition for random SAT instances?>>So, the phase transition only really holds asymptotically as the number of variables

goes to infinity. So for small n, like n equals 40, if you find the empirical threshold, which is something like 4.62, 50% will be satisfiable and 50% will be unsatisfiable, but you are not actually going to be one literal away from flipping in general. As n goes to infinity,

you from SAT to UNSAT. But for smaller n,

the actual threshold has higher variance. So SR of n is basically just a way of

correcting for that. So that every problem is exactly at its own

personal threshold.>>So you’re saying that

for small problem instances, the threshold clause density that one would use to guess with high accuracy doesn't work well, or it's too noisy.>>Yes. There's variance

about where that threshold is going

to be for small n.>>But on the other hand, is there possible

explanations of how using randomized

style algorithm, random algorithm that

would explain that these somehow perhaps these randomly generated

problems that you consider are could easy to solve from this probabilistic

point of view. Like equivalent

of explaining how the old phase

transition phenomenon appears basically.

Something equivalent.>>I have empirical

evidence against that, which is that, as you would expect, since it's essentially at the threshold, MiniSat will really start to grind as n gets bigger. Because our clause sizes are bigger than three on average, about five on average, MiniSat will start to grind more than it would even for random 3-SAT

as n gets bigger.>>So what about SAT problems that people use in practice, did you run it? The SAT competition?>>We didn't. I think the current version, trained only the way we've trained it, is unlikely to be

able to deal with such big problems without

additional training.>>Just because of

that graph. Okay.>>Yeah. I’ve already told you about all of

the problems we've run it on.>>Yes.>>I think the breadth is strong enough to suggest that it's generalizing to different kinds of SAT problems. I don't think it's close to being useful on big problems.>>The quick thing is, perhaps, your reference to MiniSat. I don't know

exactly what it does. But it's not a proof that such an algorithm doesn't exist, that [inaudible]. It's just, I don't know how it relates to what I just asked, but anyway. So I'm saying, is there a randomized algorithm that could basically explain why this class of random instances that you consider is easy or hard to solve, or to guess whether it's SAT or not? Then you'd answer all these [inaudible]>>I'm not so familiar with the theory on the difficulty of random SAT and such. My understanding, and

maybe Leo can correct me, is that at least as n goes to infinity, at the threshold, the problems are thought to be exponentially difficult. Now, n is small here, so we don't have any specific proof for our n, and we're not actually sampling at the threshold, we're sampling at each problem's specific threshold, but it's believed, I believe that there are, if not theoretical results, theoretical conjectures.>>So it had to be

a hard problem in general. I have no way to fact-check what you just said.>>No.

>>But I don't want to.>>Near the threshold: if

you’re sampling the number of clauses below the threshold

it is known to be easy. So Survey Propagation

with decimation can solve.>>Okay, thank you.>>Like 99 percent of

the problems with n of, I think, 100,000, if the clause-to-variable ratio was

just a little bit under the threshold. We

know those are easy.>>I see. Okay, thank you.>>But we are trying

to make these as hard as possible by sampling at the threshold for smaller n. But I don’t have

any proof these are hard, and I’m not relying

on it for any of the takeaways from NeuroSAT.>>What is this threshold again?>>Yes. So, if you

have n variables, and you’re sampling clauses

uniformly at random, let’s say with all clauses

of three literals in them. If you only sample

a couple of clauses, with high probability it's

going to be satisfiable. If you sample a million clauses with high probability it’s

going to be unsatisfiable. The point is that for

every fixed number of variables, there are some number

of clauses where there’s a 50 percent chance

it will be satisfiable, and a 50 percent chance

it will be unsatisfiable. And those are the hard regimes. So, if there are a million clauses it’s probably going to be

unSAT for a simple reason. You probably sampled

just some small unSAT core. And if there’s no clauses it’s satisfiable

for a simple reason: anything you try is going to work. And it's difficult

right at the threshold. But again this is not

particularly relevant to NeuroSAT or the claims or

takeaways from this project. Okay. Let’s talk about

detecting unsatisfiability, this is what you

asked about before. So NeuroSAT at least

trained on SR uniform (10, 40), is never confident in unSAT. It searches indefinitely. It searches poorly indefinitely, meaning it stops taking big moves and it gets stuck,

but it searches indefinitely. And it always guesses low confidence unSAT

until it finds a solution. Now, detecting SAT and unSAT

are very different problems. Now, proofs of SAT are

just variable assignments. In particular, they

fit in NeuroSAT's memory; NeuroSAT's memory is a couple of floats

for each literal. And it just needs to

represent a bit mask on those embeddings to be able to encode

the satisfying assignment. Whereas proofs of unSAT vary. So there are some

trivial proofs of unSAT. So, any formula no

matter how big, if it has the clause X1 and the clause not X1, is unsatisfiable. You don't have to

look at the rest of the problem to know

it’s unsatisfiable. But others require

exponential time or space to establish

unsatisfiability, in some proof calculus where the objects are potentially unbounded, and in particular will not fit in

NeuroSAT’s memory. So even if it was able to

use some kind of stack data structure, we wouldn't

be able to trivially decode proofs because they won’t even fit in the final

embeddings for NeuroSAT. But we hypothesized that if the unsatisfiable problems in the training set

have short proofs, NeuroSAT will

learn to find them. And to test this we generated the data set where like SR of N, the data comes in pairs, one SAT and one unSAT, and they differ by negating a single literal and

a single clause, but this time by construction, the unSAT problems

have small unSAT cores. And we do this in a simple way, we define this new

distribution SRC(n, u) where u is an unSAT core. And we start with

the small unSAT core u, over a small number of

variables k less than n. We’re going to flip one

literal to make it SAT. We pick the unSAT core

so that this works. We add clauses as long as

it remains satisfiable, so until adding one more would make it unSAT. Then we flip the same literal back in the unSAT core, to reintroduce the unsat core. So we have a satisfiable problem here, and then we reintroduce the unsat core by flipping a single literal to construct the unsatisfiable problem. And we return the pair before and after that final flip.
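A minimal sketch of this SRC(n, u) construction, reusing is_sat and sample_clause from the SR(n) sketch earlier; the particular core literal that gets flipped is an assumption for illustration.

```python
# Minimal sketch of sampling one SAT/unSAT pair from SRC(n, u) as described.
def sample_src_pair(n_vars, core):
    # core: an unSAT core over variables 1..k (k < n_vars), chosen so that
    # flipping one literal makes it satisfiable; e.g. pigeonhole 2,1 is
    # [[1], [2], [-1, -2]] (two pigeons, one hole).
    flipped_core = [[-core[0][0]] + core[0][1:]] + core[1:]
    assert is_sat(n_vars, flipped_core)  # picked so one flip makes it SAT
    clauses = list(flipped_core)
    while True:
        c = sample_clause(n_vars)
        if not is_sat(n_vars, clauses + [c]):
            break  # stop just before the problem would become unSAT
        clauses.append(c)
    sat_problem = clauses
    # Flip the same literal back to reintroduce the unSAT core.
    unsat_problem = core + clauses[len(core):]
    return sat_problem, unsat_problem
```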

And we train NeuroSAT on SRC(40, u), where u is sampled from a small fixed set of cores. I think we used four cores: pigeonhole 2,1, pigeonhole 3,2, the Rivest unsat core, and one other.>>The core…>>We are going to

talk about this. So, NeuroSAT is only

trained as a classifier, and we’re going to see if

we can decode the cores.>>Oh yes. So you

don't [inaudible]>>NeuroSAT's interface doesn't change, a priori.>>Sorry, I forgot that for the SAT case we could use the same decoding.>>Yes. And the results: we get 100 percent accuracy on the test set within

five minutes of training. It's a much easier problem

which you would expect. Now let’s see how

it actually works. So here’s a run. We call this NeuroUNSAT by

the way when we train it on this just to

keep them separate. This is the same kind of

visualization we saw before. And the first six rows correspond to the six variables that are part of the unsat core. So this is PH(3,2), the pigeonhole principle three, two. And you see that these

are voting unsat. It clearly recognizes that those six variables

are worth looking at. It does some kind

of computation, and eventually concludes

that it's not an unsat core. And then all the literals

start voting SAT. In contrast, here’s

the same problem but with the unsat core

reintroduced. It starts out

almost identically, doing some computation on the six variables

involved in the core. And eventually concludes that

it actually is an unsat core. And all of the variables in

extremely high confidence. This is black. This is

like negative 100 logit. So this is extremely

high confidence and completely outweighs

these plus one, plus twos. So this suggests that it

has learned to at least memorize the unsat cores that we planted in

the data set. Yeah.>>So the rows are the literals, not the clause variables?>>Yes. These are

the literal votes.>>So to select

the unsat core you have to figure out what are

the clauses within…>>Yes. But I’m going to

talk about this right now. So, first let’s talk about just recovering the literals

that are in the unsat core. As you can guess,

you can just look at which ones are voting

unsat, the most confidence. And that is usually

going to consist of exactly the literals

in the unsat core. But as before,

a more reliable approach is to cluster the literal

embeddings at the final timestep. And when we do that, we see, here on a problem from SR(100) with the same pigeonhole principle core implanted, that the 188 literals that are not in the core are in their own cluster, and the 12 literals that are part of the core are in their own cluster. And this decoding trick works 98 percent of the time on the test set SRC(40, u).
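A minimal sketch of this core-literal decode, continuing the clustering sketch from the solution-decoding section; taking the smaller cluster as the core is an assumption for illustration (in the example, 12 core literals versus 188 others).

```python
# Minimal sketch: 2-cluster the final literal embeddings and read off the
# (much smaller) cluster of literals belonging to the planted unSAT core.
import numpy as np
from sklearn.cluster import KMeans

def decode_core_literals(L_h):
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(L_h)
    core_label = 0 if (labels == 0).sum() < (labels == 1).sum() else 1
    return np.flatnonzero(labels == core_label)  # row indices of core literals
```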

But as you raised, this is not exactly what you want. You actually want

the unsat core. Still, you can use this: as long as the number of literals is small, you can construct

a proof efficiently by just knowing that you

should branch on these first. The reason why we

don't get unsat cores is that the literal embeddings are

forced to be more semantic because we’re using them to

vote about satisfiability. If the architecture would… If our aim was to

get unsat cores, we would predict based on the clauses instead

of the literals. And then the clauses

would be more semantic and would cluster. But we don’t test this

partly because during training we batch together problems that have

the same number of variables and different

numbers of clauses. So it’s just, the way we set

things up, it was easier to batch when we only predict from the literals, since the number of clauses changes. But in principle, with

on the clauses or both, and then potentially get

the unsat core directly. But it clearly has

the right semantic information. It clearly knows

about the unsat core. Let me summarize

the results so far. We train NeuroSAT as a

classifier on random problems for a few dozen iterations per problem and it works

well as a classifier. It learns to solve

the problems on its own in order to explain

the supervision bit, and we can almost always decode a solution

when it guesses SAT. It extrapolates well, so we

can run for more iterations. It can solve

much bigger problems, not with perfect reliability

but quite often, and it can solve problems

from a wide range of domains. Lastly, it learns

different algorithms depending on the training data. So in SR40, where we don’t think there

are small UNSAT cores, it learns to search for

satisfying assignments, and in SRC(40, u), where we plant the UNSAT cores, it learns to detect

them instead of searching for

satisfying assignments. Now, I want to share

what I think are the main lessons and

takeaways from NeuroSAT. First to answer the question

I posed in the beginning. Yes a Neural network

can learn to perform precise

logical reasoning. It might not be

perfect, it might have some extrapolation errors, but for the most part

it's clearly doing it; it clearly knows when it has found a satisfying assignment. And this is a bit-sensitive property. It's sensitive to

individual bit flips in the input and so forth. Two, I think this is

the most interesting part: stochastic gradient descent, or back propagation, alone can find such programs without any discrete search, even

with very weak supervision. So, it’s almost amazing but the search algorithm emerges smoothly and gradually

from back propagation. There’s no moment where

it learns to search. The accuracy just gradually goes up and up during training. And in by the end, it has synthesized

a search procedure but there’s no moment

when it does that, there’s no discrete search,

it’s a smooth process. It’s such an alien

computational model that it’s hard to even imagine

but this is what we found. Third, message passing

architectures are, I think, surprisingly

powerful and flexible. So the same

architecture was generic enough to learn two very different algorithms depending on the training data. One is a search

procedure and one is this subgraph detector. And in fact, we have

other experiments where we train with literal level

supervision and it seems to learn an entirely

different algorithm that’s more like a pattern

recognition system with no phase change, no evidence of detecting

satisfying assignments. So I think it’s

a really generic architecture that can encode many different

kinds of computation. And fourth, random training

data can be useful. So conventional wisdom is that random training data prevents learning patterns, and that

if you train on random data, it’s extremely unlikely to generalize to problems

that you care about. But we actually make

use of this property. We use random training

data to prevent learning patterns exactly to force it to learn an algorithm. We force it to be

more systematic. We don’t think that there are enough patterns to exploit in SR40 to be able to predict that bit without

being more systematic, and we use this

to our advantage. And finally, I want to just talk about SAT solving in

practice very briefly. So modern SAT solvers as most of you know are extremely

well engineered. They can solve

real-world problems with literally millions

of variables. They’re really among

the technological marvels of our time. In contrast, NeuroSAT at least as is trained in the regime

that we trained it, is vastly less reliable and efficient and scalable

than state-of-the-art. I want to make that clear.

However, we still have no idea what NeuroSAT is and is not capable of. We've really only trained at toy scale so far: we train on small random problems, for a few dozen iterations, for a few hours, on a single GPU. And conventional wisdom in machine learning is that scaling alone can be extremely powerful, and it's a mistake to assume you know an upper bound on how well your system is going to do until you really scale it up on more data and bigger hardware. Also, there's a lot of potential room for improvement in the architecture itself; we're talking about giving it a source of randomness as one way that might lead it to synthesize a much richer search procedure. And lastly, the hardware projections are extremely favorable for this kind of approach. The hardware used by traditional SAT solvers, as you all know, has stagnated, and the free lunches are over. In contrast, the neural network hardware renaissance is just beginning, and one day a descendant of NeuroSAT might actually be able to beat the state of the art. We have no evidence of this right now, and I don't think it's right around the corner, but ten years out, if the hardware improves by 1000x, I think it's possible. And that's all I have. Thank you.

>>So, in neural nets, people have been adding all sorts of external memories, and I mean the obvious thing is the stack.>>Yeah.>>Right? And so have you considered adding new memories to this?>>That's a great question. Before we came up with the UNSAT core experiment, we were trying to get it to do exhaustive search, and we gave it access to a neural stack, and it didn't use it on SR40. I think we could have come up with training-data regimes where it had an incentive to use it, but on SR40 it's able to explain the bit by searching within the number of time steps we allow. But I think it's a good idea; intuitively, it's what it would need, yeah.
>>Right, and in general, I mean, when you're learning programs, there's a whole school of thought that says: to make the neural net more comprehensible, let's give it some basic data structures we do understand and have it use those. But I don't know how successful that's been. There's also control flow. And program synthesis sort of says, well, let's try to use machine learning and get a readable program out. But that's a whole other topic.
>>I think the algorithm it learns could not be converted to a simple, readable program.
>>Right. With most neural nets, we don't have a chance of extracting these things right now, though. There's been some work on extracting decision trees, but I don't know; I haven't followed that.
>>I think this project is some evidence that we might still be underestimating how powerful the algorithms it learns, the incomprehensible ones, might be, even without being able to extract them.

>>So there's also something interesting, which is that you're doing fixpoint iteration, whereas a lot of uses of trained models are one-shot, right? So you're inherently doing a fixpoint with no convergence guarantee. But that makes sense here, doesn't it? Because in a lot of applications, we simply don't have the time, right? We need to look at the picture and very quickly classify that there's a pedestrian, right? But here, you're working in a regime where, well, you can take the time. I mean, you have a model that's iterable.>>Yes. Partly, the problems are hard enough to merit the extra computation.>>Right, exactly.
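This iterate-and-check regime is easy to picture as a loop. Here is a hypothetical sketch, reusing the `mp_round` and `decode_assignment` helpers sketched earlier, of how the extra test-time computation gets spent: keep passing messages and stop as soon as a decoded assignment verifies.

```python
# Hypothetical test-time loop (glue over the earlier sketches): spend extra
# message-passing rounds until a decoded assignment verifies, or give up.
def solve(L, C, M, clauses, n, weights, max_rounds=1000):
    for rounds in range(1, max_rounds + 1):
        L, C = mp_round(L, C, M, *weights)
        assignment = decode_assignment(L, clauses, n)
        if assignment is not None:   # checked against the formula, so trustworthy
            return assignment, rounds
    return None, max_rounds          # no convergence guarantee; budget exhausted
```

There is indeed no convergence guarantee, but because every candidate is verified, extra rounds can only help, and the loop can be cut off whenever the time budget runs out.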

>>There have been a lot of projects in the last few years that use a similar message-passing architecture.>>Yes.>>But they usually only run for three or four rounds. They're doing something heuristic, not procedural, like predicting links in social network graphs or something like that, where it's not clear that the extra computation helps, and they haven't shown extrapolation to more rounds improving performance at all. I think part of this is, going back to the UNSAT core visualization: NeuroUNSAT does not make use of more rounds at all. If you run it for one more round, it loses this information; it doesn't even converge. It's learned this very precise algorithm to end up there at the right time.

>>Yes.>>And it has no reason to preserve that information. So whether it was incentivized to learn something that extrapolates to more rounds is really an artifact of the training regime. And part of the reason is that it sees the same UNSAT core each time, so it only needs four different subgraph detectors, and it can time how long each takes. Whereas, when it's searching for satisfying assignments, it's going to find them at different iteration numbers, so it needs to learn to preserve that information. So it learns to converge around those solutions. Even though it's not trained for more iterations, it at least learns convergence, because sometimes it's going to find a solution on iteration five, sometimes on iteration 15, and it's penalized very heavily for losing a solution that it has found.

>>Okay, cool. So, if you threw randomness into this and parallelized, you might bump up those curves a bit, the ones you showed for the larger instances?>>I think so; I'd start to wonder whether we could get 100% accuracy.>>Really? With randomization, just adding randomness?>>And by just training on SR100, SR200.>>Oh sure, sure, you're doing the training there.>>Yeah.>>But it would be nice to just train on smaller instances and then have some technique to boost.

>>Yeah. So I don't know. I think empirically, 40 is not hard enough to make it need to learn something that can do restarts. Maybe at 100 it can learn something really general that would extrapolate to much bigger problems. I don't think you need to train on exactly the size you're going to test on, but you need to train on something that is qualitatively in a similar ballpark of difficulty, or at least hard enough to require you to learn the tricks you need to solve the bigger problems. One of the things I really want to stress is just how sensitive what it learns is to the training data. It will learn essentially the simplest thing that lets it predict well on the training data, and if that simple thing is not what you need to solve the harder problems, it will not solve the harder problems. Yeah. Oh, were you raising your hand? Sorry, I thought this was... Yeah.
>>Is your data also available?
>>Yes, the model and the training-data generators are all online.>>Oh, okay.>>The code is here, and the arXiv paper is there.>>Cool, okay.>>Okay. Thank you.
