Learning Submodular Functions

Nick Harvey, University of Waterloo. Joint work with Nina Balcan, Georgia Tech.


  1. Learning Submodular Functions. Nick Harvey, University of Waterloo. Joint work with Nina Balcan, Georgia Tech.

  2. Submodular functions
     V = {1, 2, …, n}, f : 2^V → R
     • Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
     Equivalent:
     • Decreasing marginal values: f(S ∪ {x}) - f(S) ≥ f(T ∪ {x}) - f(T) ∀ S ⊆ T ⊆ V, x ∉ T
     Examples:
     • Concave functions: Let h : R → R be concave. For each S ⊆ V, let f(S) = h(|S|).
     • Vector spaces: Let V = {v_1, …, v_n}, each v_i ∈ R^n. For each S ⊆ V, let f(S) = rank(V[S]).

  3. Submodular functions
     V = {1, 2, …, n}, f : 2^V → R
     • Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
     Equivalent:
     • Decreasing marginal values: f(S ∪ {x}) - f(S) ≥ f(T ∪ {x}) - f(T) ∀ S ⊆ T ⊆ V, x ∉ T
     Monotone: f(S) ≤ f(T) ∀ S ⊆ T
     Non-negative: f(S) ≥ 0 ∀ S ⊆ V
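
The definitions above can be checked by brute force on a small ground set. The following is a minimal sketch, not part of the talk: the helper names `subsets`, `is_submodular`, and `is_monotone` are hypothetical, and the check takes time exponential in n, so it is only useful for tiny examples such as the concave-function example f(S) = h(|S|).

```python
from itertools import combinations
import math

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground, tol=1e-9):
    """Check f(S) + f(T) >= f(S & T) + f(S | T) for all pairs S, T."""
    all_sets = list(subsets(ground))
    return all(
        f(S) + f(T) >= f(S & T) + f(S | T) - tol
        for S in all_sets
        for T in all_sets
    )

def is_monotone(f, ground, tol=1e-9):
    """Check f(S) <= f(S + x) for all S and x, which implies f(S) <= f(T) for S ⊆ T."""
    return all(
        f(S) <= f(S | {x}) + tol
        for S in subsets(ground)
        for x in ground
    )

V = range(5)
concave_of_size = lambda S: math.sqrt(len(S))  # h(k) = sqrt(k) is concave
convex_of_size = lambda S: len(S) ** 2         # h(k) = k^2 is convex, so submodularity fails

print(is_submodular(concave_of_size, V), is_monotone(concave_of_size, V))
print(is_submodular(convex_of_size, V))
```

The convex example fails already on S = {0}, T = {1}: f(S) + f(T) = 2, while f(S ∩ T) + f(S ∪ T) = 0 + 4.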

  4. Submodular functions
     • Strong connection between optimization and submodularity
     • e.g.: minimization [C'85, GLS'87, IFF'01, S'00, …], maximization [NWF'78, V'07, …]
     • Algorithmic game theory
     • Submodular utility functions
     • Much interest in the machine learning community recently
     • Tutorials at major conferences: ICML, NIPS, etc.
     • www.submodularity.org is a machine learning site
     • Interesting to understand their learnability

  5. Exact Learning with value queries (Goemans, Harvey, Iwata, Mirrokni, SODA 2009)
     f : {0,1}^n → R, g : {0,1}^n → R
     • Algorithm adaptively queries x_i and receives value f(x_i), for i = 1, …, q, where q = poly(n).
     • Algorithm produces "hypothesis" g. (Hopefully g ≈ f)
     • Goal: g(x) ≤ f(x) ≤ α·g(x) ∀ x ∈ {0,1}^n, with α as small as possible
     [Diagram: the algorithm sends a query x_1 and receives f(x_1).]

  6. Exact Learning with value queries (Goemans, Harvey, Iwata, Mirrokni, SODA 2009)
     • Algorithm adaptively queries x_i and receives value f(x_i), for i = 1, …, q
     • Algorithm produces "hypothesis" g. (Hopefully g ≈ f)
     • Goal: g(x) ≤ f(x) ≤ α·g(x) ∀ x ∈ {0,1}^n, with α as small as possible
     • Theorem (upper bound): ∃ an algorithm for learning a submodular function with α = Õ(n^(1/2)).
     • Theorem (lower bound): Any algorithm for learning a submodular function must have α = Ω̃(n^(1/2)).

  7. Problems with this model
     • In learning theory, one usually tries to predict the value of most points only
     • The GHIM lower bound fails if the goal is to do well on most of the points
     • To define "most", we need a distribution on {0,1}^n
     Is there a distributional model for learning submodular functions?

  8. Our Model
     Distribution D on {0,1}^n; f : {0,1}^n → R+, g : {0,1}^n → R+
     • Algorithm sees examples (x_1, f(x_1)), …, (x_q, f(x_q)), where the x_i's are i.i.d. from distribution D
     • Algorithm produces "hypothesis" g. (Hopefully g ≈ f)
     [Diagram: points x_i drawn from D are labeled f(x_i) and passed to the algorithm.]

  9. Our Model
     Distribution D on {0,1}^n; f : {0,1}^n → R+, g : {0,1}^n → R+
     • Algorithm sees examples (x_1, f(x_1)), …, (x_q, f(x_q)), where the x_i's are i.i.d. from distribution D
     • Algorithm produces "hypothesis" g. (Hopefully g ≈ f)
     • Pr_{x_1,…,x_q}[ Pr_x[ g(x) ≤ f(x) ≤ α·g(x) ] ≥ 1 - ε ] ≥ 1 - δ
     • "Probably Mostly Approximately Correct"
     [Diagram: a fresh point x is drawn from D and the question is whether f(x) ≈ g(x).]
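
The inner probability in the PMAC condition can be estimated empirically on fresh samples. The sketch below is illustrative only: `pmac_success_rate`, the target `f`, and the hypothesis `g` are hypothetical choices, and the uniform distribution stands in for D.

```python
import random

def pmac_success_rate(f, g, alpha, sample_points):
    """Fraction of sampled points x with g(x) <= f(x) <= alpha * g(x)."""
    good = sum(1 for x in sample_points if g(x) <= f(x) <= alpha * g(x))
    return good / len(sample_points)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10

# Assumed target: f(x) = min(|x|, 3) + 1, which is non-negative, monotone and submodular.
f = lambda x: min(sum(x), 3) + 1
# Assumed hypothesis: the constant function 1.
g = lambda x: 1.0

points = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(1000)]
rate_alpha_4 = pmac_success_rate(f, g, alpha=4, sample_points=points)
rate_alpha_2 = pmac_success_rate(f, g, alpha=2, sample_points=points)
print(rate_alpha_4, rate_alpha_2)
```

With α = 4 every point satisfies the sandwich because f takes values in {1, …, 4}; with α = 2 only points with at most one coordinate set qualify, which is rare under the uniform distribution.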

  10. Our Model
     Distribution D on {0,1}^n; f : {0,1}^n → R+, g : {0,1}^n → R+
     • "Probably Mostly Approximately Correct"
     • Impossible if f is arbitrary and # training points ≪ 2^n
     • Possible if f is a non-negative, monotone, submodular function
     [Diagram: a fresh point x is drawn from D and the question is whether f(x) ≈ g(x).]

  11. Example: Concave Functions
     • Concave functions: Let h : R → R be concave.
     [Figure: plot of a concave function h.]

  12. Example: Concave Functions
     • Concave functions: Let h : R → R be concave. For each S ⊆ V, let f(S) = h(|S|).
     • Claim: f is submodular.
     • We prove a partial converse.
     [Figure: f drawn over the subset lattice from ∅ to V.]

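  13. Theorem: Every submodular function looks like this. (Slide annotations qualify the statement with "lots of", "approximately", and "usually".)
     [Figure: f drawn over the subset lattice from ∅ to V.]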

  14. Theorem: Every submodular function looks like this. (Slide annotations qualify the statement with "lots of", "approximately", and "usually".)
     Theorem: Let f be a non-negative, monotone, submodular, 1-Lipschitz function. There exists a concave function h : [0, n] → R s.t., for any ε > 0, for every k ∈ {0, …, n}, and for a 1 - ε fraction of S ⊆ V with |S| = k, we have:
         h(k) ≤ f(S) ≤ O(log₂(1/ε))·h(k).
     In fact, h(k) is just E[f(S)], where S is uniform on sets of size k.
     Proof: Based on Talagrand's Inequality.
     [Figure: f drawn over the subset lattice from ∅ to V; slide annotation: "matroid rank function".]

  15. Learning Submodular Functions under any product distribution
     Product distribution D on {0,1}^n; f : {0,1}^n → R+, g : {0,1}^n → R+
     • Algorithm: Let μ = (Σ_{i=1}^{q} f(x_i)) / q
     • Let g be the constant function with value μ
     • This achieves approximation factor O(log₂(1/ε)) on a 1 - ε fraction of points, with high probability.
     • Proof: Essentially follows from the previous theorem.
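
The algorithm on this slide is short enough to sketch directly. The code below is an illustration under stated assumptions rather than the authors' implementation: the target `f` (a concave function of |S|), the fair-coin product distribution, and the names `learn_constant` and `sample_point` are all hypothetical, and the final check uses a simple two-sided factor-2 band instead of the one-sided guarantee g ≤ f ≤ α·g.

```python
import math
import random

def learn_constant(examples):
    """Return the constant hypothesis g(x) = mu, where mu is the mean of the labels."""
    mu = sum(value for _, value in examples) / len(examples)
    return lambda x: mu

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n, q = 40, 500

def sample_point():
    """One draw from an assumed product distribution: n independent fair bits."""
    return tuple(1 if rng.random() < 0.5 else 0 for _ in range(n))

def f(x):
    """Assumed target: sqrt(|S|), which is monotone, submodular and 1-Lipschitz."""
    return math.sqrt(sum(x))

train = [(x, f(x)) for x in (sample_point() for _ in range(q))]
g = learn_constant(train)

test_points = [sample_point() for _ in range(2000)]
within_factor_2 = sum(
    1 for x in test_points if g(x) / 2 <= f(x) <= 2 * g(x)
) / len(test_points)
print(within_factor_2)
```

The constant hypothesis works here because |S| concentrates around n/2 under a product distribution, so f(S) rarely strays far from its mean.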

  16. Learning Submodular Functions under an arbitrary distribution?
     • The same argument no longer works. Talagrand's inequality requires a product distribution.
     • Intuition: A non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.
     [Figure: f drawn over the subset lattice from ∅ to V.]

  17. A General Upper Bound?
     • Theorem (our upper bound): ∃ an algorithm for learning a submodular function w.r.t. an arbitrary distribution that has approximation factor O(n^(1/2)).

  18. Computing Linear Separators
     • Given {+, –}-labeled points in R^n, find a hyperplane c^T x = b that separates the +s and –s.
     • Easily solved by linear programming.
     [Figure: scatter of + and – points separated by a hyperplane.]

  19. Learning Linear Separators
     • Given a random sample of {+, –}-labeled points in R^n, find a hyperplane c^T x = b that separates most of the +s and –s.
     • Classic machine learning problem.
     [Figure: scatter of + and – points with a separating hyperplane; one misclassified point is marked "Error!".]

  20. Learning Linear Separators
     • Classic Theorem: [Vapnik-Chervonenkis 1971?] Õ(n/ε²) samples suffice to get error ε.
     [Figure: scatter of + and – points with a separating hyperplane; one misclassified point is marked "Error!".]

  21. Submodular Functions are Approximately Linear
     • Let f be non-negative, monotone and submodular
     • Claim: f can be approximated to within factor n by a linear function g.
     • Proof sketch: Let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n·f(S).
     Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) ∀ S, T ⊆ V
     Monotonicity: f(S) ≤ f(T) ∀ S ⊆ T
     Non-negativity: f(S) ≥ 0 ∀ S ⊆ V
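
The sandwich f(S) ≤ g(S) ≤ n·f(S) from the proof sketch can be verified exhaustively on a toy instance. This is a minimal sketch: the coverage function `f` and the dictionary `covers` are made-up examples (coverage functions are a standard family of non-negative, monotone, submodular functions), and `singleton_sum` is a hypothetical name for the linear function g.

```python
from itertools import combinations

def singleton_sum(f, S):
    """The linear function g(S) = sum of f({s}) over s in S."""
    return sum(f(frozenset([s])) for s in S)

# Assumed toy instance: element i of V covers the items in covers[i].
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"a", "d"}}

def f(S):
    """Coverage function: number of distinct items covered by the elements of S."""
    covered = set()
    for s in S:
        covered |= covers[s]
    return len(covered)

V = list(covers)
n = len(V)
all_sets = [frozenset(c) for r in range(n + 1) for c in combinations(V, r)]

# f(S) <= g(S) follows from submodularity and non-negativity;
# g(S) <= n * f(S) follows from monotonicity.
sandwich_holds = all(f(S) <= singleton_sum(f, S) <= n * f(S) for S in all_sets)
print(sandwich_holds, f(frozenset(V)), singleton_sum(f, frozenset(V)))
```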

  22. Submodular Functions are Approximately Linear
     [Figure: the curves n·f, g, and f drawn over V.]

  23. • Randomly sample {S_1, …, S_q} from the distribution
     • Create a + for f(S_i) and a – for n·f(S_i)
     • Now just learn a linear separator!
     [Figure: the curves n·f, g, and f drawn over V, with + points on f and – points on n·f.]

  24. • Theorem: g approximates f to within a factor n on a 1 - ε fraction of the distribution.
     • Can improve to factor O(n^(1/2)) by the GHIM lemma: ellipsoidal approximation of submodular functions.
     [Figure: the curves n·f, g, and f drawn over V.]

  25. A Lower Bound?
     • A non-uniform distribution focuses on fewer points, so the function is less concentrated on those points
     • Can we create a submodular function with lots of deep "bumps"?
     • Yes!
     [Figure: f drawn over the subset lattice from ∅ to V.]

  26. A General Lower Bound
     • Theorem (our general lower bound): No algorithm can PMAC-learn the class of non-negative, monotone, submodular functions with an approximation factor õ(n^(1/3)).
     Plan: Use the fact that matroid rank functions are submodular. Construct a hard family of matroids. Pick A_1, …, A_m ⊂ V with |A_i| = n^(1/3) and m = n^(log n).
     [Figure: sets A_1, A_2, A_3, …, A_L, each marked with value High = n^(1/3) or Low = log² n.]

  27. Matroids
     • Ground set V
     • Family of independent sets ℐ
     • Axioms:
     • ∅ ∈ ℐ ("nonempty")
     • J ⊂ I ∈ ℐ ⇒ J ∈ ℐ ("downwards closed")
     • J, I ∈ ℐ and |J| < |I| ⇒ ∃ x ∈ I∖J s.t. J + x ∈ ℐ ("maximum-size sets can be found greedily")
     • Rank function: r(S) = max { |I| : I ∈ ℐ and I ⊆ S }
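
For an explicitly listed family of sets, the three axioms and the rank function can be checked by brute force. The sketch below is illustrative: `is_matroid` and `rank` are hypothetical helper names, and the two example families (a uniform matroid and a family that violates the exchange axiom) are made up for the demonstration.

```python
from itertools import combinations

def is_matroid(independent_sets):
    """Check the three matroid axioms for an explicitly listed family of sets."""
    fam = {frozenset(I) for I in independent_sets}
    # Axiom 1: the empty set is independent.
    if frozenset() not in fam:
        return False
    # Axiom 2: downwards closed (removing one element at a time suffices by induction).
    for I in fam:
        for x in I:
            if I - {x} not in fam:
                return False
    # Axiom 3: exchange. If |J| < |I| then some x in I \ J keeps J + x independent.
    for J in fam:
        for I in fam:
            if len(J) < len(I) and not any(J | {x} in fam for x in I - J):
                return False
    return True

def rank(independent_sets, S):
    """r(S) = size of the largest independent set contained in S."""
    S = frozenset(S)
    return max(len(I) for I in independent_sets if frozenset(I) <= S)

# Uniform matroid: every subset of {0,1,2,3} with at most 2 elements is independent.
uniform = [frozenset(c) for r in range(3) for c in combinations(range(4), r)]
# Not a matroid: J = {0} cannot be extended by any element of I = {1, 2}.
broken = [frozenset(), frozenset({0}), frozenset({1}), frozenset({2}), frozenset({1, 2})]

print(is_matroid(uniform), is_matroid(broken), rank(uniform, {0, 1, 2}))
```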

  28. f(S) = min{ |S|, k }, i.e.
     r(S) = |S| (if |S| ≤ k)
            k (otherwise)
     [Figure: r drawn over the subset lattice from ∅ to V.]

  29. r(S) = |S| (if |S| ≤ k and S ≠ A)
            k-1 (if S = A)
            k (otherwise)
     [Figure: r drawn over the subset lattice from ∅ to V, with a bump at the set A.]

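  30. 𝒜 = {A_1, …, A_m}, |A_i| = k ∀ i
     r(S) = |S| (if |S| ≤ k and S ∉ 𝒜)
            k-1 (if S ∈ 𝒜)
            k (otherwise)
     Claim: r is submodular if |A_i ∩ A_j| ≤ k-2 ∀ i ≠ j. r is the rank function of a "paving matroid".
     [Figure: r drawn over the subset lattice from ∅ to V, with bumps at A_1, A_2, A_3, …, A_m.]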

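  31. 𝒜 = {A_1, …, A_m}, |A_i| = k ∀ i, |A_i ∩ A_j| ≤ k-2 ∀ i ≠ j
     r(S) = |S| (if |S| ≤ k and S ∉ 𝒜)
            k-1 (if S ∈ 𝒜)
            k (otherwise)
     [Figure: r drawn over the subset lattice from ∅ to V, with bumps at A_1, A_2, A_3, …, A_m.]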
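
The role of the condition |A_i ∩ A_j| ≤ k-2 can be seen on a six-element ground set. The following sketch is illustrative, with hypothetical names `make_rank` and `is_submodular`; it builds the bumped rank function from the slide and checks submodularity exhaustively, once with the condition satisfied and once with it violated.

```python
from itertools import combinations

def make_rank(k, bumps):
    """Rank function from the slide: k-1 on each bump set, min(|S|, k) elsewhere."""
    bump_sets = {frozenset(A) for A in bumps}
    def r(S):
        S = frozenset(S)
        if S in bump_sets:
            return k - 1
        return min(len(S), k)
    return r

def is_submodular(f, ground):
    """Exhaustively check f(S) + f(T) >= f(S & T) + f(S | T)."""
    items = list(ground)
    all_sets = [frozenset(c) for r in range(len(items) + 1)
                for c in combinations(items, r)]
    return all(f(S) + f(T) >= f(S & T) + f(S | T)
               for S in all_sets for T in all_sets)

V = range(6)
k = 3
# Bumps intersect in 0 <= k-2 elements: the condition of the claim holds.
r_good = make_rank(k, [{0, 1, 2}, {3, 4, 5}])
# Bumps intersect in 2 = k-1 elements: the condition is violated.
r_bad = make_rank(k, [{0, 1, 2}, {1, 2, 3}])

print(is_submodular(r_good, V), is_submodular(r_bad, V))
```

In the violating case, S = {0,1,2} and T = {1,2,3} give r(S) + r(T) = 4 but r(S ∩ T) + r(S ∪ T) = 2 + 3 = 5.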

  32. Delete half of the bumps at random. If m is large, the algorithm cannot learn which were deleted ⇒ any algorithm to learn f has additive error 1.
     r(S) = |S| (if |S| ≤ k and S is not a surviving bump)
            k-1 (if S ∈ 𝒜 and wasn't deleted)
            k (otherwise)
     [Figure: bumps at A_1, A_2, A_3, …, A_m between ∅ and V; annotations: "If algorithm sees only these examples" and "Then f can't be predicted here".]

  33. Can we force a bigger error with bigger bumps? Yes! We need to generalize paving matroids. 𝒜 needs to have very strong properties.
     [Figure: bumps at A_1, A_2, A_3, …, A_m between ∅ and V.]

  34. The Main Question
     • Let V = A_1 ∪ ⋯ ∪ A_m and b_1, …, b_m ∈ N
     • Is there a matroid s.t.
     • r(A_i) ≤ b_i ∀ i
     • r(S) is "as large as possible" for S ⊉ A_i (this is not formal)
     • If the A_i's are disjoint, the solution is a partition matroid
     • If the A_i's are "almost disjoint", can we find a matroid that's "almost" a partition matroid?
     Next: formalize this

  35. Lossless Expander Graphs
     • Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
     • Every u ∈ U has degree D
     • |Γ(S)| ≥ (1 - ε)·D·|S| ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u, v} ∈ E }
     "Every small left-set has a nearly-maximal number of right-neighbors"
     [Figure: bipartite graph with left side U and right side V.]

  36. Lossless Expander Graphs
     • Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
     • Every u ∈ U has degree D
     • |Γ(S)| ≥ (1 - ε)·D·|S| ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u, v} ∈ E }
     "Neighborhoods of left-vertices are K-wise-almost-disjoint"
     [Figure: bipartite graph with left side U and right side V.]

  37. Trivial Case: Disjoint Neighborhoods
     • Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
     • Every u ∈ U has degree D
     • |Γ(S)| ≥ (1 - ε)·D·|S| ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u, v} ∈ E }
     • If left-vertices have disjoint neighborhoods, this gives an expander with ε = 0, K = ∞
     [Figure: bipartite graph with left side U and right side V.]
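
The expander definition can be tested directly on small bipartite graphs. This is a brute-force sketch under assumptions: the function name `is_lossless_expander`, the adjacency-dictionary representation, and both example graphs are hypothetical, and the check enumerates all left-sets of size at most K, so it only scales to toy inputs.

```python
from itertools import combinations

def is_lossless_expander(neighbors, D, K, eps):
    """Check the (D, K, eps)-lossless expander conditions by enumeration.

    `neighbors` maps each left vertex u to its set of right neighbors Γ({u}).
    """
    # Every left vertex must have degree exactly D.
    if any(len(nb) != D for nb in neighbors.values()):
        return False
    left = list(neighbors)
    # Every left-set S with |S| <= K must have at least (1 - eps) * D * |S| neighbors.
    for size in range(1, min(K, len(left)) + 1):
        for S in combinations(left, size):
            gamma = set().union(*(neighbors[u] for u in S))
            if len(gamma) < (1 - eps) * D * size:
                return False
    return True

# Disjoint neighborhoods: expansion is perfect for every left-set (eps = 0).
disjoint = {"u1": {0, 1}, "u2": {2, 3}, "u3": {4, 5}}
# Overlapping neighborhoods: u1 and u2 share right vertex 1.
overlapping = {"u1": {0, 1}, "u2": {1, 2}}

print(is_lossless_expander(disjoint, D=2, K=3, eps=0))
print(is_lossless_expander(overlapping, D=2, K=2, eps=0))
print(is_lossless_expander(overlapping, D=2, K=2, eps=0.25))
```

Passing `K=len(neighbors)` plays the role of K = ∞ in the trivial case, since no left-set can be larger than U.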

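  38. Main Theorem: Trivial Case
     • Suppose G = (U ∪ V, E) has disjoint left-neighborhoods.
     • Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }.
     • Let b_1, …, b_m be non-negative integers.
     • Theorem: [formula not captured in the transcript] is the family of independent sets of a matroid (a partition matroid).
     [Figure: left vertices u_1, u_2, u_3 in U with disjoint neighborhoods A_1, A_2 in V, labeled "≤ b_1" and "≤ b_2".]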

  39. Main Theorem
     • Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
     • Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }
     • Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i
     [Figure: overlapping sets A_1 and A_2, labeled "≤ b_1" and "≤ b_2".]

  40. Main Theorem
     • Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
     • Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }
     • Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i
     • "Desired Theorem": ℐ is a matroid, where [formula not captured in the transcript]

  41. Main Theorem
     • Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
     • Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }
     • Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i
     • Theorem: ℐ is a matroid, where [formula not captured in the transcript]

  42. Main Theorem
     • Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
     • Let 𝒜 = {A_1, …, A_m} be defined by 𝒜 = { Γ(u) : u ∈ U }
     • Let b_1, …, b_m satisfy b_i ≥ 4εD ∀ i
     • Theorem: ℐ is a matroid, where [formula not captured in the transcript; surviving annotations on its terms: "= 0", "= 0", "= 1", "= 1"]
     • Trivial case: G has disjoint neighborhoods, i.e., K = ∞ and ε = 0.

  43. LB for Learning Submodular Functions
     • How deep can we make the "valleys"?
     [Figure: function over the subset lattice from ∅ to V with value n^(1/3) at A_1 and value log² n at A_2.]

  44. LB for Learning Submodular Functions
     • Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander, where A_i = Γ(u_i) and
     • |V| = n, |U| = n^(log n)
     • D = K = n^(1/3), ε = log²(n)/n^(1/3)
     • Such graphs exist by the probabilistic method
     • Lower bound proof:
     • Delete each node in U with prob. ½, then use the main theorem to get a matroid
     • If u_i ∈ U was not deleted then r(A_i) ≤ b_i = 4εD = O(log² n)
     • Claim: If u_i was deleted then A_i ∈ ℐ (needs a proof) ⇒ r(A_i) = |A_i| = D = n^(1/3)
     • Since # A_i's = |U| = n^(log n), no algorithm can learn a significant fraction of the r(A_i) values in polynomial time
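
As a worked check of the parameters on this slide (a sketch that takes the slide's "needs a proof" claim as given), substituting ε and D into the bound b_i = 4εD shows the gap between the rank of a deleted set and the rank of a surviving one:

```latex
\begin{align*}
b_i &= 4\varepsilon D
     = 4 \cdot \frac{\log^2 n}{n^{1/3}} \cdot n^{1/3}
     = 4\log^2 n, \\
\frac{r(A_i) \text{ if } u_i \text{ was deleted}}{r(A_i) \text{ if } u_i \text{ was kept}}
    &\ge \frac{n^{1/3}}{4\log^2 n}
     = \tilde{\Omega}\bigl(n^{1/3}\bigr).
\end{align*}
```

This ratio is the multiplicative error any learner must incur on sets A_i whose deletion status it cannot determine, which matches the õ(n^(1/3)) statement of the general lower bound.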

  45. Summary
     • PMAC model for learning real-valued functions
     • Learning under arbitrary distributions:
       • Factor O(n^(1/2)) algorithm
       • Factor Ω̃(n^(1/3)) hardness (info-theoretic)
     • Learning under product distributions:
       • Factor O(log(1/ε)) algorithm
     • New general family of matroids
       • Generalizes partition matroids to non-disjoint parts

  46. Open Questions
     • Improve the Ω̃(n^(1/3)) lower bound to Ω̃(n^(1/2))
     • Explicit construction of expanders
     • Non-monotone submodular functions
       • Any algorithm?
       • Lower bound better than Ω̃(n^(1/3))
     • For the algorithm under the uniform distribution, relax the 1-Lipschitz condition
