Introduction to Massive Upgrades and Changes Instructors: Tom Limoncelli With Material From: “The Practice of System and Network Administration” by Limoncelli & Hogan http://www.EverythingSysadmin.com
Class Exercise Multi-Purpose Server Upgrade Select a machine from your network and walk through what would be involved in upgrading the OS. Intro to Massive Upgrades and Changes -- www.EverythingSysadmin.com
Our gift to all attendees • The Paper-O-Matic • (paperclip not included)
Exercise • Service Checklist – list services • Is each service supported on new OS? • Document Verification (test) procedure • Document Back-out plan • Schedule the big event – when & how long • Announce as appropriate – where and when? • Test, Upgrade, Test • Communicate Completion
Introductions Your instructors: • Tom Limoncelli – SA since 1988, UNIX since 1991. Currently Director of Network Operations, Lumeta Corp. Previously at Bell Labs. • Co-author of “The Practice of System and Network Administration”
Definition of “Massive” Scope larger than “normal” projects • Impacts a large number of customers • Failure will be highly visible Examples: • Upgrading a server • Rolling out a new application • Renumbering IP networks • Changes on a large WAN • Day-long reorganization
Other Commonalities of Massive Changes • Large number of SAs on team • Highly visible to customers • Expensive • Potential for expensive mistakes
What causes failure? • Lack of planning -> chaos • Miscommunication -> chaos • Lack of documentation -> chaos • LACK OF PROCESS -> chaos Change management reins in chaos
OVERVIEW • Class Exercise: Upgrading 1 Server (40 min) • Introductions (5 min) • Change Management Basics (30 min) • Service Conversion Theory (15 min) • BREAK • Class Discussion: Nagano (10 min) • Technique: IP Renumbering (30 min) • Managing Maintenance Windows (40 min)
Definition: Change Management • The process that ensures effective planning, implementation, and post-event analysis of changes made to a system. • Changes should be documented, reproducible, backed by a back-out plan, and communicated as appropriate.
What’s it all about? • To the casual observer: • Documented change requests, approved or rejected before implementation. • Change management is: • Scheduling – for least impact • Communication – within team, to customers, to management • Planning – all eventualities covered
Formal or Informal? • The larger the site, the more formality is required. • Large sites often have a change-control council that meets weekly to approve requests. • Smaller sites simply need a manager’s approval.
Change Requests Handout #1: Quiet Time Is Coming • A written document • What will be changed • What is the expected impact/outage • When is change needed by • Who requests change, who is it for • Back-out plan • Responsible people
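The fields of a written change request can be captured in a small record so that incomplete requests are caught before review. This is only a sketch: the class name `ChangeRequest`, its field names, and the `missing_fields` helper are illustrative assumptions, not a format taken from the handout.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class ChangeRequest:
    """Hypothetical record mirroring the change-request bullets."""
    what_will_change: str
    expected_impact: str               # expected impact/outage
    needed_by: Optional[date]          # when the change is needed by
    requested_by: str                  # who requests the change
    requested_for: str                 # who the change is for
    back_out_plan: str
    responsible_people: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of fields left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


request = ChangeRequest(
    what_will_change="Upgrade OS on mail server",
    expected_impact="2-hour mail outage",
    needed_by=date(2024, 1, 15),
    requested_by="alice",
    requested_for="sales",
    back_out_plan="",                  # not written yet
)
print(request.missing_fields())
```

A reviewer (or a script run before the CM meeting) could refuse to schedule any request for which `missing_fields()` is non-empty.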
Types of Changes: • Routine Updates • Major Updates • Sensitive Updates
Routine Updates • Can happen at any time. Invisible to customers • Ex: Updating a directory/authentication server, debugging a printing problem, altering monitoring systems, enabling an existing router interface. • Failure scope: minimal • Communication needed: None
Major Updates • Affect a large number of systems, or require a significant system, network, or service outage • Ex: upgrading authentication systems, changes to email or printing infrastructure, upgrading core network infrastructure, installing a new (non-hot-plug) router interface. • Failure scope: affects many, many people • Communication needed: email or similar
Sensitive Updates • Do not seem major, but would cause significant outages if something went wrong. • Ex: Altering router configurations, global access policies, firewall configurations, alterations to a critical server, installing a card in a router that “should be hot-plug”. • Communication needed: “pull” mechanism like a web site or newsgroup; forewarn helpdesk
Classification Notes: • Definitions differ between sites, and between parts of the same site. • One e-commerce company considered adding a new host to the corporate network “routine”, but adding one to the customer-visible network “sensitive”.
When to do updates? • Major Updates – based on the organization’s maintenance windows and SLAs • Sensitive Updates – should happen outside of peak usage times to minimize impact and maximize time to discover & rectify problems • Routine Updates – any time (what about network quiet times?)
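The three update types, with their timing and communication rules, fit naturally into a lookup table. The sketch below is one way to encode them; the names `UpdateType`, `POLICY`, and `policy_for` are assumptions made for illustration, and the policy strings simply paraphrase the slides.

```python
from enum import Enum


class UpdateType(Enum):
    ROUTINE = "routine"
    MAJOR = "major"
    SENSITIVE = "sensitive"


# Timing and communication per update type, paraphrased from the slides.
POLICY = {
    UpdateType.ROUTINE: {
        "when": "any time",
        "communication": "none",
    },
    UpdateType.MAJOR: {
        "when": "maintenance window, per SLA",
        "communication": "email or similar",
    },
    UpdateType.SENSITIVE: {
        "when": "outside peak usage",
        "communication": "pull mechanism (web site, newsgroup); forewarn helpdesk",
    },
}


def policy_for(update_type: UpdateType) -> dict:
    """Look up when a change may run and how it must be announced."""
    return POLICY[update_type]


print(policy_for(UpdateType.SENSITIVE)["when"])
```

Because classifications differ between sites, a real table would be site-specific; the point is only that the classification drives both scheduling and communication.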
Network Quiet Times • Official days where all changes (outside of repairing outages) are forbidden. Sometimes global, often local. • Examples: • The last 15 days before tax filings due each quarter • 2 weeks before major software release scheduled to ship (and until 3 days after shipment)
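A quiet-time rule like the software-release example can be checked mechanically before a change is scheduled. The following is a minimal sketch under assumed names (`release_quiet_period`, `change_allowed`) and an assumed ship date; it treats a quiet period as an inclusive date range and lets outage repairs through, as the definition above allows.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple

Period = Tuple[date, date]  # inclusive (start, end)


def release_quiet_period(ship_date: date) -> Period:
    """Quiet from 2 weeks before shipment until 3 days after it."""
    return ship_date - timedelta(weeks=2), ship_date + timedelta(days=3)


def change_allowed(day: date, quiet_periods: Iterable[Period],
                   is_outage_repair: bool = False) -> bool:
    """Changes are forbidden during quiet times, except repairs to outages."""
    if is_outage_repair:
        return True
    return not any(start <= day <= end for start, end in quiet_periods)


# Hypothetical ship date, used only for illustration.
periods = [release_quiet_period(date(2024, 6, 14))]
print(change_allowed(date(2024, 6, 1), periods))                         # inside the quiet period
print(change_allowed(date(2024, 6, 1), periods, is_outage_repair=True))  # repairs are exempt
```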
Handout #2: SAMPLE CM POLICY A policy you can adopt NOW
The CM Meeting: • Meetings where proposed changes are reviewed, discussed, and scheduled (if approved). • Typically weekly or monthly depending on quantity of change.
Sidebar: Daily CM Meetings? • .COM had stability problems significant enough to be front-page news • Had daily meetings due to extreme growth rate. (Mostly Change Control rather than CM) • Postponed CM Requests on days of “bad weather”. • Daily meetings let them deduce “what changed” when problems sprang up.
Meetings formally document: • What will be done and when • How long will the change take • What can go wrong • Testing procedures • Back-out plans Side benefit: Forces you to think these things out.
Meetings Communicate: • Make other people aware of changes • They can recognize potential source of problems • Meeting should include representatives from across the company • They can then communicate within their own group about the changes
CM Meeting’s Global Impact • Attendees develop an overall view of what’s happening within the company • Senior SAs/managers can spot problems before they happen • Reduces entropy and leads to stability
Communicating Changes • How are CM issues communicated to customers? (Email, Newsletters, etc.?) • When to communicate: • When there will be an outage • When procedures/software will change • Communication via email should only be to customers that will be affected
Explicit approval vs. objection An email pre-announcing an outage gives customers the opportunity to request a reschedule: • Explicit Objection – Outage will happen unless someone explicitly objects • Email: “To request that this maint. window be rescheduled, please contact Joe Smith.” • vs. Explicit Approval – Outage will happen only after explicit approval. • In person: Request at the CM board meeting.
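The difference between the two modes is the default when nobody responds. A small sketch makes that explicit; the names `ApprovalMode` and `outage_proceeds` are illustrative assumptions.

```python
from enum import Enum


class ApprovalMode(Enum):
    EXPLICIT_OBJECTION = "objection"  # proceeds unless someone objects
    EXPLICIT_APPROVAL = "approval"    # proceeds only once approved


def outage_proceeds(mode: ApprovalMode, objected: bool, approved: bool) -> bool:
    """Decide whether the announced outage goes ahead."""
    if mode is ApprovalMode.EXPLICIT_OBJECTION:
        return not objected
    return approved


# With no response at all, only the objection-based mode proceeds.
print(outage_proceeds(ApprovalMode.EXPLICIT_OBJECTION, objected=False, approved=False))
print(outage_proceeds(ApprovalMode.EXPLICIT_APPROVAL, objected=False, approved=False))
```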
Case Study: WAN CM (handout #1) • “The secret to a reliable WAN is good procedure.” • Maintained schedule of outages and “Quiet Times” • Scope (Global or local), Impact & Risk • All changes to backbone routers required approval by CM Request Board • LAN routers only needed CM approval if an outage was expected.
Case Study: Network Life-cycle • “Build-out” (birth) • Entry -- New construction • CM in the form of documenting rather than approval • Goal -- Get to certification • “Certification” (a series of tests) • “Certified” • Entry -- Installation complete, testing done, check-list of requirements met (VRRP/HSRP, ) • Goal -- Maintain uptime/reliability/performance • “Decommission” • Entry -- Elvis has left the building • Goal -- Eliminate dependence, in order, by deadline
OVERVIEW • Class Exercise: Upgrading 1 Server (40 min) • Introductions (5 min) • Change Management Basics (30 min) • Service Conversion Theory (15 min) • BREAK • Class Discussion: Nagano (10 min) • Technique: IP Renumbering (30 min) • Managing Maintenance Windows (40 min)
Service Conversion Theory • Definition • Prepare the customers • Minimize Intrusiveness • Flash cuts vs. phased approach • Theory of Pillars vs. Layers • Back-out plans • Grouping changes Ex: “Rioting Mob”
Definition: Service Conversion Any change that requires touching many hosts to make a single, or many, changes • The same 1 change on hundreds of hosts • The same 50 changes on hundreds of hosts
Examples: • Service being replaced: • New client software on each host • Or, Each client re-pointed to new server • Rolling out new software to each client • IP Renumbering • Enabling new feature: Moving to DHCP • Splitting customers over a new server • To load-balance or to divide company before spin-off
Prepare the customers Does the new service require customers to change work methods? • Can they use the old client? • Is training available? • Is new documentation complete and distributed? • Is the helpdesk trained on • potential conversion problems? • the new software itself?
Minimize SA Intrusiveness Ultimately, you want to minimize intrusiveness to the humans • Does the conversion require service outage? • Can outage be avoided? • Can you minimize the outage duration? • Can the outage be scheduled out of hours? • Will we visit the customer’s PC more than once? • Can the visits be avoided? Combined?
Flash cut vs. phased approach • Flash cut – Change all at once • Upgrade a server “in place” • (implies little/no ability to back out) • Phased approach – Slower and safer • Provide old and new service for a period (like new area codes) • Or, budget for duplicate hardware, install off-line, move clients over slowly
Successful Flash-cuts The secret to successful flash-cuts is testing, testing, testing Example: The new calendar system doesn’t communicate with the old system, so data will be exported and all clients will be required to switch on a specific day. • New calendar system on new hardware. • Major amounts of load-testing performed. • Trial users test new system (with understanding that data will be wiped on conversion day). • QA metrics defined and met. • Documentation & training for customers. • Helpdesk trained on new software and conversion.
Successful Phased Conversion • “One, Some, Many” Technique • Test conversion w/successively larger groups. • If entire group converted successfully, move to larger group. • If any failures, revise process, shrink group. • “One, some, many” • One – My machine. Large incentive to get right • Some – Co-workers and SAs that can give feedback. • Many – Larger and larger groups, starting with the least risk averse
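The “one, some, many” loop can be sketched as a function that walks through successively larger groups and halts at the first group with a failure, leaving the process revision to the humans. The function name `one_some_many`, the `convert` callback, and the host names are assumptions for illustration, not tooling from the course.

```python
from typing import Callable, List, Sequence


def one_some_many(groups: Sequence[Sequence[str]],
                  convert: Callable[[str], bool]) -> List[str]:
    """Convert groups in order; stop after the first group with any failure.

    Returns the hosts that converted successfully.
    """
    converted: List[str] = []
    for group in groups:
        results = [(host, convert(host)) for host in group]
        converted.extend(host for host, ok in results if ok)
        if not all(ok for _, ok in results):
            break  # revise the process and shrink the group before continuing
    return converted


groups = [
    ["my-pc"],                      # one: my own machine
    ["sa1", "sa2"],                 # some: co-workers and SAs
    ["h1", "h2", "h3"],             # many: least risk-averse customers first
    ["h4", "h5", "h6", "h7", "h8"],
]
# Stand-in conversion that fails on one host, to show the early stop.
done = one_some_many(groups, convert=lambda host: host != "h2")
print(done)
```

In this run the failure on `h2` stops the rollout before the largest group is touched, which is the safety property the technique is after.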
Pillars vs. Layers Suppose: 50 tasks to be done on all hosts Layered approach – Perform one task for all hosts before moving on to next task. Pillars approach – Perform all required tasks on a host before moving to next host.
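The two approaches differ only in which loop is on the outside. A short sketch, using the assumed helper names `layered_order` and `pillared_order` and toy host and task names, shows the resulting order of work:

```python
from typing import List, Sequence, Tuple


def layered_order(hosts: Sequence[str], tasks: Sequence[str]) -> List[Tuple[str, str]]:
    """Layers: finish one task on every host before starting the next task."""
    return [(host, task) for task in tasks for host in hosts]


def pillared_order(hosts: Sequence[str], tasks: Sequence[str]) -> List[Tuple[str, str]]:
    """Pillars: finish every task on one host before moving to the next host."""
    return [(host, task) for host in hosts for task in tasks]


hosts = ["a", "b"]
tasks = ["t1", "t2"]
print(layered_order(hosts, tasks))
print(pillared_order(hosts, tasks))
```

With pillars, each host (and therefore each customer) is visited in one contiguous block of work; with layers, every host is revisited once per task.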
What to layer or pillar? Layer tasks that are not intrusive to customers. Pillar tasks that are. Example: a new calendar server • Layer: creating accounts • Pillar: visiting the customer to install new software, freeze their schedule and convert it to the new system, then log in and initialize the password.
Pillar benefits Pillared approach means scheduling one period with each customer. Less annoying to customer. • Scheduling and re-visiting missed customers has extremely high overhead • Two 5-minute meetings are more work than a single 10-minute meeting • Multiple visits = multiple annoyances
The Rioting Mob Technique • Tom’s group needed to make many changes to 1000 hosts in 1 month. • UNIX: Script written and tested (1,s,m) • Windows: 5-6 manual changes • Other devices: Ad hoc (mostly IP addr) • Layered all server-side changes. • How could we do the pillars?
Rioting Mob Example • Numbered the hallways & announced schedule • Mon: Convert a hallway • Tue: Fix problems and improve process • Wed: Convert another hallway • Thu: Fix problems and improve process • Fri: No changes (so we couldn’t break anything and ruin our weekend). Catch up with other work and email.
First Try: • 9am: entire team starts at hallway • 2 PC techs went office-to-office down left-hand side making changes. • 2 UNIX techs went office-to-office down left-hand side making changes. • Similar pairs went down right-hand side • 2 senior SAs available to debug and/or handle oddball hosts • SAs called into “command central” to request IP addresses
Tuesday • Cleaned up anything we broke • Brainstormed on how to improve • Detailed what happened minute by minute • Detailed problems • Brainstormed on solutions • Revised process
New Process • Make initial pass through hallway: • Give customer a gentle warning to log out • Call in requests for IP addresses • Identify non-standard machines for senior SAs to focus on • Second pass through hallway: • Do actual conversion
Results • Conversion much smoother, customers happier • “Tue/Thu brainstorms” eventually nil as process perfected. • Soon conversions done by noon, Tue/Thu used for planning