Better Data with Machine Learning and Serverless

Turning raw data files, such as audio or video, into valuable insights has traditionally been a manual and tedious process, and one that produces mixed results because of the human element involved. Thanks to improvements in machine learning systems, coupled with the rapidly deployable nature of serverless technology as a middleware layer, we can build sophisticated data insight platforms that remove much of the time this work has typically required. With this in mind, we’ll look at:
- How to build end-to-end data insight and predictor systems on top of serverless and machine learning systems.
- Best practices for using serverless technology to ferry information between raw data files and machine learning systems through an eventing system.
- Considerations and practical examples for handling the security implications of sensitive information.

jcleblanc

Presentation Transcript


  1. Better Data with Machine Learning and Serverless
     Jonathan LeBlanc (Director of Developer Advocacy @ Box)
     Twitter: @jcleblanc
     Email: jleblanc@box.com

  2. Agenda for Today
     • Building Blocks: How are these systems built?
     • Best Practices: How do we architect the solution?
     • Security Considerations: How do we ensure data security?

  3. Part 1: Building Blocks

  4. What Machine Learning Isn’t

  5. Components of the System
     • Serverless Framework: Provides the compute and data management between the stored data location and the machine learning engine.
     • Machine Learning System: Provides the data enhancement capabilities that improve the underlying source data’s metadata (information about information).

  6. Why Serverless?
     • On Demand: Connections to the machine learning system are only required when files need processing, which may be infrequent.
     • No hosting: You don’t have to run or manage any servers, containers, or VMs of your own.
     • Pricing based on use: Execution resources only run (and are only charged for) when you use them, typically resulting in very low server costs.
     • Different stack options: Multiple serverless systems exist to fit stack needs, including numerous open source options.

  7. Components of the System
     • Webhook / Event Pump System: Handles notifications to the middleware layer when a new file should be processed.
     • Middleware Layer: Handles communication between the data source and machine learning systems.
     • Metadata Layer: The storage facility for machine learning data responses.
     • Token Downscoping System: Allows you to pass tightly scoped read / write tokens through multiple uncontrolled system layers.

  8. How a Data / ML System Works
     [Diagram] Components:
     • Cloud Data: Data store & initial metadata
     • Serverless Framework: Callback handler and code execution
     • Machine Learning: Data processor and enhancer
     Connection labels: Webhook, Execute, Metadata, Callback
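The flow in this diagram can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: `cloudData`, `machineLearning`, and `handleWebhook` are hypothetical stand-ins, and the ML call returns a fixed result so the example runs on its own.

```javascript
// Hypothetical stand-in for the cloud data store and its metadata layer.
const cloudData = {
  metadata: {},
  // Store ML output against a file id (the "Metadata" step).
  saveMetadata(fileId, data) {
    this.metadata[fileId] = data;
  }
};

// Hypothetical ML system: returns a fixed label set for any file.
const machineLearning = {
  async analyze(fileUrl) {
    return { labels: ['example'], source: fileUrl };
  }
};

// Middleware: receives the webhook payload, executes the ML call,
// and writes the response back to the data store as metadata.
const handleWebhook = async (webhookEvent, ml, store) => {
  const { fileId, fileUrl } = webhookEvent;
  const result = await ml.analyze(fileUrl);
  store.saveMetadata(fileId, result);
  return result;
};
```

Passing the ML system and the store in as arguments keeps the middleware independent of any one vendor, which also makes it easy to swap in fakes for testing.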

  9. Common Serverless Frameworks
     • AWS Lambda: https://aws.amazon.com/lambda/
     • Azure Functions: https://azure.microsoft.com/en-us/services/functions/
     • Google Cloud Functions: https://cloud.google.com/functions/
     • IronFunctions: https://github.com/iron-io/functions
     • OpenWhisk: https://openwhisk.apache.org/
     • Fission: https://fission.io/
     Considerations
     1. Your stack
     2. Pricing / free use
     3. Supported languages
     4. Regional support

  10. Machine Learning Frameworks
      Audio / Video / Image
      • [video] MS Video Indexer
      • [audio] Voicebase
      • [face] Hive AI
      • [image] Clarifai
      • [image] Google Vision
      • [mixed] IBM Watson
      • [moderation] MS Content Moderator
      • [face] Kairos
      • [audio] AT&T Speech
      • [image] Amazon Rekognition
      Text Extraction
      • [id] Acuant
      • [invoice] Rossum.AI
      • [contract] eBrevia
      • [lease] Leverton
      • [resume] Textkernel
      • [prediction] AmazonML
      • [analysis] Aylien
      • [classification] MonkeyLearn
      • [natural language] ApiAI
      • [sentiment] AlchemyText
      Open Source
      • TensorFlow
      • Keras
      • Scikit-learn
      • MS Cognitive Toolkit
      • Theano
      • Caffe
      • Torch
      • Accord.NET

  11. Part 2: Best Practices

  12. Program Logic and Serverless Separation
      • Serverless function agnostic: The core logic of the function should be separate from the serverless requirements. Thin handlers / routers may be written on top of the core logic to maintain separation.
      • Service deployments: To allow for deployment amongst numerous serverless technologies, systems like serverless.com may be utilized.
      • Testability: The separation of concerns allows you to test the function separately from the container.
      • Handler: Separate handler from core program logic for testability.

  13. AWS Lambda Handler
      // API Gateway Handler
      exports.handler = (event, context, callback) => {
        // Check for valid event
        if (isValidEvent(event)) {
          // processEvent is expected to invoke callback when it finishes
          processEvent(event, callback);
        } else {
          callback(null, { statusCode: 200, body: 'Event received but invalid' });
        }
      };
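One way the handler / core logic separation might look is sketched below. The `processEvent` function, its `fileId` field, and the response bodies are hypothetical; the point is only that the core logic is a plain function and the handler is a thin wrapper around it.

```javascript
// Core logic: a plain function with no knowledge of Lambda.
// The payload shape and messages are hypothetical.
const processEvent = (payload) => {
  if (!payload || !payload.fileId) {
    return { statusCode: 400, body: 'Missing fileId' };
  }
  return { statusCode: 200, body: `Queued file ${payload.fileId}` };
};

// Thin Lambda-style handler: unpack the event, delegate, and exit.
// In a Lambda deployment this would be assigned to exports.handler.
const handler = (event, context, callback) => {
  let payload = null;
  try {
    payload = JSON.parse(event.body || 'null');
  } catch (err) {
    // Malformed JSON is treated the same as a missing payload.
  }
  callback(null, processEvent(payload));
};
```

Because `processEvent` takes and returns plain objects, it can be unit tested directly, and only the thin wrapper needs to change when moving to another serverless framework.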

  14. Dealing with Cold Starts
      • What is it: The latency experienced when a function is triggered and there isn’t a warm / idle container available to run it. A container is automatically dropped after a period of inactivity.
      • Options: You can either keep the container warm through memory increases and calls, or deal with the cold start.
      • Fewer libraries: The more libraries that are used, the longer it will take to start the container.
      • Smaller functions: Writing smaller functions decreases start time.
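A rough sketch of the "keep warm through calls" option, combined with caching setup work so warm invocations skip it. The `warmup` flag on the event and the `getClient` helper are assumptions made for illustration; the scheduled trigger that sends the ping is not shown.

```javascript
// Module scope is initialised once per container, so anything cached
// here is reused by later invocations while the container stays warm.
let cachedClient = null;
let initCount = 0;

// Hypothetical expensive setup, such as building an API client.
const getClient = () => {
  if (!cachedClient) {
    initCount += 1;
    cachedClient = { name: 'ml-client' };
  }
  return cachedClient;
};

// Assumption: a scheduled keep-warm ping sends { warmup: true }.
const handler = (event, context, callback) => {
  if (event && event.warmup) {
    // Exit immediately so the ping costs as little as possible.
    return callback(null, { statusCode: 200, body: 'warm' });
  }
  const client = getClient();
  return callback(null, { statusCode: 200, body: `handled by ${client.name}` });
};
```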

  15. Exit Callback Hygiene
      • Error logging: With many serverless environments, proper callback use will provide full data logging.
      • Reliability: Failing to exit properly can result in your function executing until a timeout is hit. Timeouts may also cause subsequent invocations to require a cold start, which results in additional latency.
      • Cost: If a timeout occurs, you will be charged for the entire timeout time.

  16. Processing AWS Lambda Exit Callbacks
      // Success Callback
      callback(null, { statusCode: 200, body: 'Event processed' });

      // Error Callback
      callback({ statusCode: 400, body: 'Event error' });
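To make sure one of these two callbacks fires on every code path, the core work can be wrapped as in the sketch below. `withExit` and `work` are hypothetical names; the success and error shapes follow the slide above.

```javascript
// Wrap core logic so the callback fires exactly once on every path.
// `work` is a hypothetical function that returns a value or a promise,
// or throws on failure.
const withExit = (work) => (event, context, callback) => {
  let done = false;
  const exit = (err, response) => {
    if (done) return; // guard against a second exit
    done = true;
    callback(err, response);
  };
  Promise.resolve()
    .then(() => work(event))
    .then((body) => exit(null, { statusCode: 200, body }))
    .catch((err) => exit({ statusCode: 400, body: err.message }));
};
```

A handler built as `withExit(someWork)` then exits promptly on both success and failure instead of running until the timeout.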

  17. Writing Stateless Single Purpose Functions
      • Error isolation: Debugging and error handling are easier with function / concern isolation.
      • Scaling: With monolithic functions, you have to optimize the entire function for all of its elements, rather than just the specific functionality receiving the most calls / traffic.
      • Planning and testing: It’s easier to plan and write test plans for functions with singular concerns.

  18. Valid Event Function
      /**
       * Check for a valid event.
       * @param {object} indexerEvent - indexer event
       * @return {boolean} - true if valid event
       */
      const isValidEvent = (indexerEvent) => {
        // Coerce to a boolean so the return value matches the documented type
        return Boolean(indexerEvent.body || indexerEvent.queryStringParameters);
      };
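Because the check is stateless and has a single concern, it can be exercised with plain sample objects. The sketch below uses a variant that also guards against a missing event and coerces the result to a boolean; the sample events are hypothetical.

```javascript
// Variant of the check that tolerates a null event and returns a boolean.
const isValidEvent = (indexerEvent) =>
  Boolean(indexerEvent && (indexerEvent.body || indexerEvent.queryStringParameters));

// Hypothetical sample events, loosely shaped like API Gateway events.
const samples = [
  { body: '{"fileId":"123"}' },
  { queryStringParameters: { fileId: '123' } },
  { headers: {} },
  null
];

const results = samples.map(isValidEvent);
console.log(results); // [ true, true, false, false ]
```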

  19. Part 3: Security Considerations

  20. Security Considerations
      • Serverless use consideration: Are serverless systems a viable / approved mechanism within your organization?
      • Token exposure: Many API auth systems are token based, with broadly scoped tokens, leading to the potential of token leakage.
      • Credential exposure: With the use of numerous APIs, each with auth credentials, we have the potential of credential leakage.
      • Sensitive information exposure: Data is being passed through multiple systems and we have to be aware of how the information is used / stored.

  21. Middleware System
      • Serverless Solution: All compute functionality is offloaded to the serverless framework.
      • On-prem Solution: All compute functionality (and connection to the ML system) is run off of existing internal servers.

  22. Protecting Credentials
      • Use Secure Storage: Use a secure system to store API credentials or tokens, such as the AWS Systems Manager Parameter Store.
      • Least Privilege Principle: Functions requiring access to credentials should follow the least privilege principle, meaning they have access to only as much data as they absolutely need.
      • Separate Environment Credentials: Credentials used in a more open developer environment should not be the same as those used in a production deployment.
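A minimal sketch of how these three points might combine in code. The `/ml-pipeline/<stage>/<key>` naming scheme is hypothetical, and `fetchParameter` is an injected placeholder for whatever call reads a single secret from your secure store; no real SDK call is shown.

```javascript
// Build an environment-specific name so development and production
// never share a credential. The naming scheme is hypothetical.
const parameterName = (stage, key) => `/ml-pipeline/${stage}/${key}`;

// `fetchParameter` is an injected, hypothetical async function that reads
// one named secret from the secure store. Values are cached per container
// and each function only asks for the keys it actually needs.
const createCredentialLoader = (fetchParameter, stage) => {
  const cache = new Map();
  return async (key) => {
    const name = parameterName(stage, key);
    if (!cache.has(name)) {
      cache.set(name, await fetchParameter(name));
    }
    return cache.get(name);
  };
};
```

In use, a function would create one loader for its stage (for example `createCredentialLoader(readFromStore, 'prod')`) and request only its own key, which keeps the access policy for that function narrow.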

  23. Token Downscoping
      • Access Token: Fully scoped access token
      • Downscoped Token: Tightly scoped child token
      • Channel Transmission: Transmit through uncontrolled channels

  24. Token Downscoping Components
      • Tightly scoped for single file: A token should only be scoped for the item needed for processing, such as a file.
      • Short lived: Downscoped tokens should only live for their natural useful time (e.g. 1 hour).
      • Revocable: Downscoped tokens may be revoked before natural expiration through the API.
      • Split read / write functions: To further scope token exposure, separate read / write tokens can be issued.
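As an illustration only, the sketch below builds the form body for an OAuth 2.0 token exchange request (RFC 8693), one common way to trade a fully scoped token for a narrower child token. The slides do not name a specific API, so the scope names (`file_read`, `file_write`) and the resource URL are hypothetical; a real provider defines its own values and token endpoint.

```javascript
// Build the form body for an OAuth 2.0 token exchange (RFC 8693) request
// that trades a fully scoped token for a child token limited to one item.
const buildDownscopeRequest = (accessToken, scope, resourceUrl) => {
  const params = new URLSearchParams();
  params.set('grant_type', 'urn:ietf:params:oauth:grant-type:token-exchange');
  params.set('subject_token', accessToken);
  params.set('subject_token_type', 'urn:ietf:params:oauth:token-type:access_token');
  params.set('scope', scope);          // hypothetical scope name
  params.set('resource', resourceUrl); // the single file the token may touch
  return params.toString();
};

// Separate read and write tokens for the same (hypothetical) file.
const fileUrl = 'https://api.example.com/files/123';
const readBody = buildDownscopeRequest('PARENT_TOKEN', 'file_read', fileUrl);
const writeBody = buildDownscopeRequest('PARENT_TOKEN', 'file_write', fileUrl);
```

Each body would be sent as `application/x-www-form-urlencoded` to the provider's token endpoint; the lifetime and revocation of the returned token are controlled by the provider.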

  25. Sensitive Information Exposure
      • Data in the files: What information is being transmitted through the channels in the files, and is it sensitive information?
      • Are channels secure: Are all connections between your systems, the serverless framework, and the machine learning system secure?
      • How the ML system handles data: Does the machine learning system store any data long-term, and how secure is that storage?
      • Logging sensitive information: Are you logging sensitive information during general program flow unintentionally?
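For the logging question in particular, one possible safeguard is to mask known sensitive fields before anything is written to the log. The sketch below is an assumption-laden example: the list of sensitive key names and the `safeLog` helper are hypothetical and would need to match your own data.

```javascript
// Field names treated as sensitive; this list is an assumption.
const SENSITIVE_KEYS = new Set(['token', 'access_token', 'authorization', 'password', 'email']);

// Return a deep copy of `value` with sensitive fields masked.
const redact = (value) => {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, redact(inner)]
      )
    );
  }
  return value;
};

// Log helper that never sees the raw values.
const safeLog = (message, data) => console.log(message, JSON.stringify(redact(data)));
```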

  26. Tokenisation Specification
      [Diagram] Components:
      • Data Request: Sensitive information request
      • Cloud Data API: Data hosting service API
      • Secure Data Vault: Secure vault hosting data files
      Numbered steps:
      1. PAN
      2. PAN
      3. Token / Status
      4. Token / Status
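The exchange in this diagram, where a sensitive value goes in and a token plus status comes back, can be sketched with an in-memory stand-in for the vault. Everything here is hypothetical: `createVault`, the digit-length check on the PAN, and the status strings are illustrative choices, and a real vault would be a separate hardened service.

```javascript
const { randomBytes } = require('crypto');

// In-memory stand-in for the secure data vault.
const createVault = () => {
  const byToken = new Map();
  return {
    // The PAN goes in; a random token and a status come back.
    tokenize(pan) {
      // Assumed format check: 8 to 19 digits.
      if (!/^\d{8,19}$/.test(pan)) {
        return { status: 'rejected', token: null };
      }
      const token = randomBytes(16).toString('hex');
      byToken.set(token, pan);
      return { status: 'ok', token };
    },
    // Only code with access to the vault can recover the original value.
    detokenize(token) {
      return byToken.get(token);
    }
  };
};
```

Downstream systems, including the serverless layer and the ML system, would then handle only the token, which reveals nothing about the original value if it leaks.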

  27. Wrapup Topics
      • Building Blocks: How are these systems built?
      • Best Practices: How do we architect the solution?
      • Security Considerations: How do we ensure data security?

  28. Better Data with Machine Learning and Serverless
      Slides: http://bit.ly/ato-bdml
      Jonathan LeBlanc (Director of Developer Advocacy @ Box)
