Putting Engineering back into Protein Engineering
Jun Liao, UC Santa Cruz
Manfred K. Warmuth, UC Santa Cruz
Jeremy Minshull, DNA 2.0
THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU.
Protein Engineering: Current Paradigms
• Mechanism-based (rational): detailed structural analysis
• Empiricism-based (non-rational): library-based
Mechanism-Based Protein Engineering
Based on thermodynamic principles
• Calculations are approximate
  • calculation cost
  • structures are really not rigid (MDS)
• Calculations are primarily able to predict binding
  • catalysis is a special case of binding to a transition state
• Changes in amino acids are designed based on these principles
  • very small numbers (<5) of new proteins are synthesized and tested
Empiricism-Based Protein Engineering
[Figure: proteins related to wild type, simulated crossover, new variants]
• Uses similar principles to evolution
  • make many variants
  • screen to find those with the best properties
• No mechanistic understanding needed
• Produces large numbers of variants (>1,000), which are very difficult and expensive to screen for practically relevant properties
The Key Challenge in Protein Engineering
[Figure: "Reality" compared with high-throughput screens (surrogate assays) and molecular mechanistic models (which do not model activity)]
What we need is not what we assay for…
What we want in Protein Engineering
Wish List
• No need to develop a surrogate assay
• Variants are tested directly under application conditions
• Rapid process
Requirements
• Identification of appropriate amino acid substitutions
• Design and synthesis of information-rich variants
• Interpretation of quantitative functional data using machine learning techniques
Protein Engineering using Machine Learning
• Starting point: Select a protein with some correct initial properties.
• Initial design: a) Choose substitutions. b) Design an initial variant set (<50) containing those substitutions.
• Reality check: Synthesize and test the variant set for function(s) of interest.
• Machine learning: Model the effect of sequence changes on function(s) of interest.
• New design: Propose a new variant set (<50) based on the model.
• Iterate: Repeat the reality check, machine learning, and new design steps.
• End: Select the best variant(s).
Engineering of Proteinase K
• Long-term goal: engineer proteinase K to degrade polylactic acid
• Member of the serine protease family
• Large amounts of phylogenetic and sequence information available
• Several different measurable activities available for optimization
Expert System for Substitution Selection
[Figure: proteins related to proteinase K]
Expert system:
• Calculation of 9 independent scores that measure changes that have succeeded in other places in Nature
• Weight and combine scores to pick the best changes
19 switches = search space of 2^19 ≈ 500,000
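The search-space figure on this slide follows from treating each of the 19 selected substitutions as an independent on/off switch. A quick check of that arithmetic, with variable names chosen here purely for illustration:

```python
# Each of the 19 candidate substitutions is either absent (0) or present (1)
# in a variant, so the number of distinct combinations is 2 to the power 19.
n_switches = 19
search_space = 2 ** n_switches
print(search_space)  # exact count; the slide rounds this to 500,000

# For scale: a variant set of fewer than 50 sequences (the set size used in
# this workflow) covers only a tiny fraction of the space.
variant_set_size = 50
fraction = variant_set_size / search_space
print(f"{fraction:.6%} of the space per variant set")
```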
Finding Optima in Complex Landscapes: Design of Experiment
[Figure: two panels with axes Aa 1 and Aa 2, marking sampled variants (x): changing 1 amino acid at a time vs. making multiple changes simultaneously]
…Now try to envision doing this not with 2, but with 200 amino acids / dimensions
Back to Proteinase K
Sequence-Activity Modeling: How Does it Work?
1. Represent the sequence as a matrix
Seq1 AGRWGIGAYHKLIMA
Seq2 AGRTGVGVYHKLIMA
Seq3 AGRWGIGVYHRLIMA
Seq4 AGRTGVGAYHRLIMA
becomes
      T   W   V   I   V   A   R   K
      x1  x2  x3  x4  x5  x6  x7  x8
Seq1  0   1   0   1   0   1   0   1
Seq2  1   0   1   0   1   0   0   1
Seq3  0   1   0   1   1   0   1   0
Seq4  1   0   1   0   0   1   1   0
2. Measure the activity or activities of interest under the final application conditions
3. y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi
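The encoding and linear model on this slide can be sketched in a few lines. The four sequences and the column order (T, W, V, I, V, A, R, K) come from the slide; the 0-based positions were read off by comparing the sequences, and the coefficient values are hypothetical, chosen only to show how a prediction is computed.

```python
# The four example sequences from the slide.
seqs = {
    "Seq1": "AGRWGIGAYHKLIMA",
    "Seq2": "AGRTGVGVYHKLIMA",
    "Seq3": "AGRWGIGVYHRLIMA",
    "Seq4": "AGRTGVGAYHRLIMA",
}

# One (0-based position, residue) pair per column x1..x8,
# in the slide's column order: T W V I V A R K.
features = [(3, "T"), (3, "W"), (5, "V"), (5, "I"),
            (7, "V"), (7, "A"), (10, "R"), (10, "K")]

def encode(seq):
    """Return the 0/1 feature vector: 1 where the sequence has that residue."""
    return [1 if seq[pos] == aa else 0 for pos, aa in features]

def predict(x, c):
    """Linear model y = c1*x1 + c2*x2 + ... + ci*xi."""
    return sum(ci * xi for ci, xi in zip(c, x))

X = {name: encode(seq) for name, seq in seqs.items()}
for name, row in X.items():
    print(name, row)

# Hypothetical coefficients (not from the talk), for illustration only.
c = [0.5, -0.2, 0.1, 0.0, 0.3, -0.1, 0.4, 0.0]
print(predict(X["Seq2"], c))
```

In practice the coefficients would be fitted to the measured activities from step 2 rather than chosen by hand.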
Assessing the Proteinase K Sequence-Activity Relationship
[Figure: measured activity plotted against predicted activity; wild type (wt) marked]
y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi
Learning Methods
• Variety of regression methods
  • Ridge Regression & Lasso
  • SVM Regression & LPSVM Regression
  • Matching Loss Regression & One-norm Matching Loss Regression
  • Partial Least Squares Regression
  • LPBoost Regression
• Use bagging to improve the prediction stability
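As one concrete instance of the methods listed, here is a minimal pure-Python sketch of ridge regression combined with bagging. It is not the authors' implementation: the regularization strength `lam`, the number of bootstrap models, the absence of an intercept term, and the toy data are all assumptions made for illustration.

```python
import random

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_fit(X, y, lam=1.0):
    """Ridge weights from the normal equations (X^T X + lam*I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(A, b)

def bagged_ridge(X, y, n_models=25, lam=1.0, seed=0):
    """Bagging: average ridge weights over bootstrap resamples of the data."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    total = [0.0] * d
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        w = ridge_fit([X[i] for i in idx], [y[i] for i in idx], lam)
        total = [t + wi for t, wi in zip(total, w)]
    return [t / n_models for t in total]

# Toy data (hypothetical): rows are 0/1 substitution vectors, y is activity.
X_toy = [[1, 0], [0, 1], [1, 1]]
y_toy = [2.0, 3.0, 5.0]
print(ridge_fit(X_toy, y_toy, lam=1e-9))
print(bagged_ridge(X_toy, y_toy))
```

Because the model is linear with no intercept, averaging the weight vectors is equivalent to averaging the predictions of the individual bootstrap models.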
Variants Design I
• Main issue: Exploitation vs. Exploration
• Optimum design (Exploitation)
  • Take the combination of substitutions predicted to have maximal activity
  • Also consider
    • Substitution frequency in the dataset
    • Variation of weight estimation
  • Used in 2nd & 3rd iterations
Variants Design II
• Diversity design (Exploration)
  • Calculate the combination of substitutions predicted to have maximal activity that is also
    • No more than 5 changes from a sequence that has already been tested
    • No closer than 3 changes from a sequence that has already been tested or selected for synthesis
  • Used in 2nd iteration
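The two design strategies can be sketched as follows, assuming each substitution is an independent 0/1 switch scored by a linear model. The greedy selection, the reading of the distance rules (within 5 changes of at least one tested sequence, at least 3 changes from every tested or selected sequence), and the example weights are assumptions for illustration, not the authors' procedure.

```python
from itertools import product

def hamming(a, b):
    """Number of switches that differ between two variants."""
    return sum(x != y for x, y in zip(a, b))

def predict(x, c):
    """Linear model score for a 0/1 variant vector."""
    return sum(ci * xi for ci, xi in zip(c, x))

def optimum_design(c):
    """Exploitation: switch on every substitution with a positive weight,
    which maximizes a linear score over independent 0/1 switches."""
    return tuple(1 if ci > 0 else 0 for ci in c)

def diversity_design(c, tested, n_select, max_dist=5, min_dist=3):
    """Exploration: greedily take the best-scoring variants that are within
    max_dist changes of some tested sequence and at least min_dist changes
    from every tested or already selected sequence."""
    # Exhaustive enumeration: only practical for a small number of switches.
    candidates = sorted(product((0, 1), repeat=len(c)),
                        key=lambda v: predict(v, c), reverse=True)
    selected = []
    for v in candidates:
        if len(selected) == n_select:
            break
        near = any(hamming(v, t) <= max_dist for t in tested)
        far = all(hamming(v, s) >= min_dist for s in list(tested) + selected)
        if near and far:
            selected.append(v)
    return selected

# Hypothetical weights for 8 switches; the all-zero vector stands in for
# an already tested starting sequence.
c = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2, 0.0, 0.6]
tested = [(0,) * 8]
print(optimum_design(c))
picks = diversity_design(c, tested, n_select=3)
print(picks)
```

The slide also mentions weighing substitution frequency and the variation of weight estimates in the optimum design; those refinements are left out of this sketch.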
Three Iterations of Activity Engineering
• Only 58 variants were tested to allow design of the fourth set, which contained
  • 3 variants 20-30x improved over wild type
  • 50% of variants more active than the best of previous sets
  • 70% of variants more active than wild type
• 3-11 changes found in variants better than WT
[Figure: activity relative to wild type (y-axis, 0 to 50) vs. variants in order synthesized (x-axis); 1st set: 34 variants, 2nd set: 24 variants, 3rd set: 38 variants; wild-type level marked]
Activity Improvement
[Figure: improving activity]
Variants are Improved in Multiple Properties
[Figure: activity (pmol/s/ml) and half-life at 68°C (s) of the variants]
Conclusions
• Machine learning
  • Making a very small number of variants (58) allows a productive search of a total space with 500,000 possible combinations
• Synthetic Biology
  • Recent advances in gene synthesis methods were essential for this type of exploration
The Future
• Proteins are the building blocks of life with a wide array of applications (therapeutics, diagnostics, industrial catalysts)
• Finding a reliable mechanism for optimizing proteins for human applications would be an amazing feat
• We steal ideas about how proteins evolve from nature, but optimize proteins outside their in vivo constraints (the proteins don't have to be compatible with life)