Optimization With Parity Constraints: From Binary Codes to Discrete Integration
Stefano Ermon*, Carla P. Gomes*, Ashish Sabharwal+, and Bart Selman*
*Cornell University   +IBM Watson Research Center
UAI 2013
High-dimensional integration
• High-dimensional integrals arise in statistics, ML, and physics:
  • Expectations / model averaging
  • Marginalization
  • Partition function / rank models / parameter learning
• Curse of dimensionality: quadrature involves a weighted sum over an exponential number of items (e.g., units of volume)
[Figure: n-dimensional hypercube of side L; the number of volume units grows as L, L^2, L^3, …, L^n]
Discrete Integration
• We are given:
  • A set of 2^n items
  • Non-negative weights w
• Goal: compute the total weight
• Compactly specified weight function: factored form (Bayes net, factor graph, CNF, …)
• Example 1: n = 2 variables, sum over 4 items with weights 5, 0, 2, 1; goal: compute 5 + 0 + 2 + 1 = 8
• Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 items (intractable)
[Figure: 2^n items, drawn so that size visually represents weight; a small factor graph specifies the weights in Example 1]
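A minimal sketch of the brute-force computation described on this slide (the factor tables and names below are illustrative, not from the talk): enumerate all 2^n assignments and sum the factored weights. This is exactly what becomes intractable at n = 100.

```python
from itertools import product

# Hypothetical factored weight function over n = 2 binary variables:
# w(x1, x2) = f1(x1) * f2(x1, x2).  Tables chosen so the four item weights
# are 5, 0, 2, 1, matching the slide's example.
f1 = {0: 1.0, 1: 2.0}
f2 = {(0, 0): 5.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.5}

def weight(x1, x2):
    return f1[x1] * f2[(x1, x2)]

# Brute-force discrete integration: sum w over all 2^n assignments.
Z = sum(weight(x1, x2) for x1, x2 in product([0, 1], repeat=2))
print(Z)  # 8.0 -- but enumeration is hopeless for n = 100 (2^100 terms)
```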
Hardness
• 0/1 weights case:
  • Is there at least one “1”? SAT
  • How many “1”s? #SAT
  • NP-complete vs. #P-complete: much harder
• General weights:
  • Find the heaviest item (combinatorial optimization, MAP)
  • Sum the weights (discrete integration)
• [ICML-13] WISH: Approximate Discrete Integration via Optimization, e.g., partition function via MAP inference
• MAP inference is often fast in practice:
  • Relaxations / bounds
  • Pruning
[Figure: complexity ladder from Easy to Hard — P, NP, PH, P^#P, PSPACE, EXP — alongside small 0/1-weighted and generally weighted item examples]
WISH: Integration by Hashing and Optimization
• The algorithm requires only O(n log n) MAP queries to approximate the partition function within a constant factor
• Outer loop over the n variables (number of random parity constraints added)
• Repeat log(n) times: MAP inference on the model augmented with random parity constraints
• Aggregate the MAP inference solutions
[Figure: augmented model — original graphical model over n binary variables σ ∈ {0,1}^n, plus parity check nodes enforcing A σ = b (mod 2)]
Visual working of the algorithm
• How it works: start from the function to be integrated (mode M0), then add 1, 2, 3, … random parity constraints (n times), repeating each level log(n) times and taking the median of the resulting optima M1, M2, M3, …
• Estimate: M0 + median(M1) × 1 + median(M2) × 2 + median(M3) × 4 + …
[Figure: the weight function progressively thinned by random parity constraints; each level contributes its median MAP value, scaled by a power of 2]
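A compact sketch of this loop, assuming a brute-force MAP oracle purely for illustration (the real algorithm calls a combinatorial/ILP MAP solver); function and variable names are ours, not the paper's.

```python
from itertools import product
import numpy as np

def wish_estimate(weight, n, T=7, seed=0):
    """Sketch of the WISH estimator with a brute-force MAP oracle.

    weight: function mapping a length-n tuple of 0/1 values to a non-negative weight.
    T: repetitions per level (O(log n) in the paper).
    Returns an estimate of Z = sum of weight(x) over all 2^n assignments.
    """
    rng = np.random.default_rng(seed)
    # All 2^n assignments -- only feasible for tiny n; in the real algorithm
    # this enumeration is replaced by a MAP solver on the augmented model.
    xs = np.array(list(product([0, 1], repeat=n)))
    medians = []
    for i in range(n + 1):                     # i random parity constraints
        vals = []
        for _ in range(T):                     # repeat and take the median
            A = rng.integers(0, 2, size=(i, n))
            b = rng.integers(0, 2, size=i)
            ok = ((xs @ A.T) % 2 == b).all(axis=1)   # A x = b (mod 2)
            vals.append(max((weight(tuple(x)) for x in xs[ok]), default=0.0))
        medians.append(float(np.median(vals)))
    # Z_hat = M0 + median(M1)*1 + median(M2)*2 + median(M3)*4 + ...
    return medians[0] + sum(m * 2 ** i for i, m in enumerate(medians[1:]))
```

With the toy weight function from the earlier sketch (wrapped as `lambda x: weight(*x)`), this returns an estimate of the exact total 8; the theorem on the next slide bounds how far off such an estimate can be.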
Accuracy Guarantees
• Theorem [ICML-13]: With probability at least 1 − δ (e.g., 99.9%), WISH computes a 16-approximation of the partition function (discrete integral) by solving Θ(n log n) MAP inference queries (optimization).
• Theorem [ICML-13]: The approximation factor can be improved to (1 + ε) by adding extra variables and factors.
  • Example: factor-2 approximation with 4n variables
• Remark: faster than enumeration only when the combinatorial optimization is efficient
Summary of contributions
• Introduction and previous work:
  • WISH: Approximate Discrete Integration via Optimization
  • Partition function / marginalization via MAP inference
  • Accuracy guarantees
• MAP inference subject to parity constraints:
  • Tractable cases and approximations
  • Integer Linear Programming formulation
  • New family of polynomial-time (probabilistic) upper and lower bounds on the partition function that can be iteratively tightened (eventually reaching a constant-factor approximation)
• Sparsity of the parity constraints:
  • Techniques to improve solution time and bound quality
  • Experimental improvements over variational techniques
MAP inference with parity constraints Hardness, approximations, and bounds
Making WISH more scalable
• Would approximations to the optimization (MAP inference with parity constraints) be useful? YES
• Bounds on MAP (optimization) translate to bounds on the partition function Z (discrete integral):
  • Lower bounds on MAP (e.g., local search) → lower bounds on Z
  • Upper bounds on MAP (e.g., LP, SDP relaxations) → upper bounds on Z
  • Constant-factor approximations of MAP → constant-factor approximations of Z
• Question: Are there classes of problems where we can efficiently approximate the optimization (MAP inference) in the inner loop of WISH?
Error correcting codes
• Communication over a noisy channel: Alice sends codeword x, Bob receives corrupted string y
• Bob: There has been a transmission error! What was the message actually sent by Alice?
  • Must be a valid codeword
  • As close as possible to the received message y
[Figure: Alice sends x = 0100|1, with redundant parity check bit 1 = 0 XOR 1 XOR 0 XOR 0; Bob receives y = 0110|1 through the noisy channel; the check fails because 0 XOR 1 XOR 1 XOR 0 = 0 ≠ 1]
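A minimal sketch of the parity check in the figure (bit strings taken from the slide; the function name is ours):

```python
def parity_bit(bits):
    """XOR of the data bits -- the redundant check bit appended by the sender."""
    p = 0
    for b in bits:
        p ^= b
    return p

sent     = [0, 1, 0, 0]   # Alice's data bits; appended check bit = 1
received = [0, 1, 1, 0]   # Bob's received data bits (one bit flipped in transit)
check    = 1              # received check bit

print(parity_bit(sent))               # 1 -> matches the transmitted check bit
print(parity_bit(received) == check)  # False -> transmission error detected
```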
Decoding a binary code
• Noisy channel: codeword x = 0100|1 sent, y = 0110|1 received
• Max-likelihood decoding = MAP inference on a graphical model combining:
  • A noisy channel model
  • Parity check nodes enforcing that the transmitted string is a codeword
• MAP inference is NP-hard to approximate within any constant factor [Stern, Arora, …]
• Our more general case: max w(x) subject to A x = b (mod 2), with a more complex probabilistic model; equivalent to MAP inference on the augmented model
• LDPC codes: routinely solved in practice (10GBase-T Ethernet, Wi-Fi 802.11n, digital TV, …)
[Figure: ML-decoding graphical model with parity check nodes, next to the augmented model used by WISH]
Decoding via Integer Programming
• MAP inference subject to parity constraints encoded as an Integer Linear Program (ILP):
  • Standard MAP encoding
  • Compact (polynomial) encoding by Yannakakis for the parity constraints (parity polytope)
• LP relaxation: relax the integrality constraints → polynomial-time upper bounds
• ILP solving strategy: cuts + branching + LP relaxations
  • Solve a sequence of LP relaxations
  • Upper and lower bounds that improve over time
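A small sketch of one way to encode parity constraints in an ILP, using PuLP for concreteness (the talk uses CPLEX with the Yannakakis parity-polytope encoding; the auxiliary-integer encoding and unary objective below are simpler stand-ins, shown only to make the idea concrete):

```python
# pip install pulp
from pulp import (LpProblem, LpVariable, LpMaximize, lpSum,
                  LpBinary, LpInteger, value)

n = 4
w = [2.0, 5.0, 1.0, 3.0]             # illustrative unary weights (not from the talk)
A = [[1, 1, 0, 1], [0, 1, 1, 1]]      # parity constraints A x = b (mod 2)
b = [1, 0]

prob = LpProblem("map_with_parity", LpMaximize)
x = [LpVariable(f"x{j}", cat=LpBinary) for j in range(n)]

# Objective: a linear (unary) weight function for simplicity; a real MAP
# objective over a factor graph would introduce variables/terms per factor.
prob += lpSum(w[j] * x[j] for j in range(n))

# Parity constraint sum_j A[i][j] x_j = b_i (mod 2), encoded with an
# auxiliary integer z_i:  sum_j A[i][j] x_j - 2 z_i = b_i.
for i in range(len(A)):
    z = LpVariable(f"z{i}", lowBound=0, upBound=n // 2, cat=LpInteger)
    prob += lpSum(A[i][j] * x[j] for j in range(n)) - 2 * z == b[i]

prob.solve()
print([int(value(xj)) for xj in x], value(prob.objective))  # optimum: x = 1101, value 10
```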
Iterative bound tightening
• Polynomial-time upper and lower bounds on MAP that are iteratively tightened over time
• Recall: bounds on the optimization (MAP) → (probabilistic) bounds on the partition function Z. This yields a new family of bounds.
• WISH: when MAP is solved to optimality (LowerBound = UpperBound), we get a guaranteed constant-factor approximation of Z
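A sketch (notation ours, not from the slides) of how the solver's anytime MAP bounds propagate through the estimator from the “visual working” slide: if the ILP solver has reached bounds L_i ≤ M_i ≤ U_i at each level i, then, since the median is monotone,

```latex
L_0 + \sum_{i=0}^{n-1} \mathrm{median}(L_{i+1})\, 2^{i}
\;\le\;
M_0 + \sum_{i=0}^{n-1} \mathrm{median}(M_{i+1})\, 2^{i}
\;\le\;
U_0 + \sum_{i=0}^{n-1} \mathrm{median}(U_{i+1})\, 2^{i}
```

so the WISH output is sandwiched between two quantities computable from the partial ILP runs, and the probabilistic guarantee on the middle term turns these into (probabilistic) lower and upper bounds on Z that tighten as the solver runs.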
Sparsity of the parity constraints Improving solution time and bounds quality
Inducing sparsity
• Observations:
  • Problems with sparse A x = b (mod 2) are empirically easier to solve (similar to Low-Density Parity Check codes)
  • The quality of the LP relaxation depends on A and b, not just on the solution space. Elementary row operations (e.g., summing two equations) do not change the solution space but do affect the LP relaxation.
• Reduce A x = b (mod 2) to row-echelon form with Gaussian elimination (linear equations over a finite field)
• Greedy application of elementary row operations
[Figure: matrix A in row-echelon form; the dense parity check nodes are replaced by equivalent but sparser ones]
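A minimal sketch of the row-reduction step over GF(2) (function name ours; the greedy row-operation heuristic mentioned on the slide is not shown):

```python
import numpy as np

def gf2_row_echelon(A, b):
    """Reduce [A | b] to (reduced) row-echelon form over GF(2).

    Elementary row operations (XOR of rows, swaps) preserve the solution
    set of A x = b (mod 2) but typically make the system sparser.
    """
    A = np.asarray(A, dtype=int) % 2
    b = np.asarray(b, dtype=int) % 2
    rows, cols = A.shape
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if A[i, c]), None)
        if pivot is None:
            continue
        A[[r, pivot]], b[[r, pivot]] = A[[pivot, r]], b[[pivot, r]]  # swap rows
        for i in range(rows):
            if i != r and A[i, c]:
                A[i] ^= A[r]          # XOR = addition mod 2
                b[i] ^= b[r]
        r += 1
        if r == rows:
            break
    return A, b
```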
Improvements from sparsity
• The quality of the LP relaxations improves significantly
• Integer solutions are found faster (better lower bounds); without sparsification, the solver fails to find integer solutions (lower bounds) at all
[Figure: upper bound improvement and lower bound behavior from sparsification, using the IBM CPLEX ILP solver on a 10x10 Ising grid]
Generating sparse constraints
• We optimize over solutions of A x = b (mod 2) (parity constraints)
• WISH is based on universal hashing:
  • Randomly generate A in {0,1}^(i×n) and b in {0,1}^i
  • Then A x + b (mod 2) is uniform over {0,1}^i and pairwise independent: given distinct variable assignments x and y, the events A x = b (mod 2) and A y = b (mod 2) are independent
• Suppose we instead generate a sparse matrix A, with at most k variables per parity constraint (up to k ones per row of A):
  • A x + b (mod 2) is still uniform, but no longer pairwise independent
  • E.g., for k = 1, A x = b (mod 2) is equivalent to fixing i variables: lots of correlation (knowing A x = b tells me a lot about A y = b)
[Figure: the i×n matrix A multiplying the n-vector x, equal to the i-vector b (mod 2)]
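A quick sketch of the two constraint generators (names and the particular sparsity scheme below are illustrative; the paper studies how to choose the sparse rows):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_parity_constraints(i, n):
    """Fully random A, b: the hash A x + b (mod 2) is uniform over {0,1}^i
    and pairwise independent."""
    return rng.integers(0, 2, size=(i, n)), rng.integers(0, 2, size=i)

def sparse_parity_constraints(i, n, k):
    """At most k ones per row of A: the hash is still uniform (b is random),
    but no longer pairwise independent; k = 1 just fixes i of the variables."""
    A = np.zeros((i, n), dtype=int)
    for row in range(i):
        A[row, rng.choice(n, size=k, replace=False)] = 1
    return A, rng.integers(0, 2, size=i)
```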
Using sparse parity constraints
• Theorem: With probability at least 1 − δ (e.g., 99.9%), WISH with sparse parity constraints computes an approximate lower bound on the partition function.
• PRO: “easier” MAP inference queries. For example, random parity constraints of length 1 (on a single variable) are equivalent to MAP with some variables fixed.
• CON: we lose the upper-bound part; the output can underestimate the partition function.
• CON: no constant-factor approximation anymore.
MAP with sparse parity constraints
• Evaluation of MAP inference with sparse constraints
• ILP and Branch & Bound outperform message-passing (BP, MP, and MPLP)
[Figure: results on a 10x10 attractive Ising grid and a 10x10 mixed Ising grid]
Experimental results • ILP provides probabilistic upper and lower bounds that improve over time and are often tighter than variational methods (BP, MF, TRW)
Conclusions
• [ICML-13] WISH: discrete integration reduced to a small number of optimization instances (MAP)
  • Strong (probabilistic) accuracy guarantees
  • MAP inference is still NP-hard
• Scalability: approximations and bounds
  • Connection with max-likelihood decoding
  • ILP formulation + sparsity (Gaussian-elimination sparsification & universal hashing)
  • New family of probabilistic, polynomial-time computable upper and lower bounds on the partition function, which can be iteratively tightened (eventually reaching a constant-factor approximation)
• Future work:
  • Extension to continuous integrals and variables
  • Sampling from high-dimensional probability distributions