
ai-config-team report

ai-config-team report, 28/08/2014. Puppet run incident. What we know: Puppet runs start to fail when a puppetdb query in base starts timing out. The puppetdb postgres backend maxes out cpu, with this one class of query responsible for the majority of the load.



  1. ai-config-team report 28/08/2014

  2. Puppet run incident
     • What we know:
       • Puppet runs start to fail when a puppetdb query in base starts timing out
       • The puppetdb postgres backend maxes out cpu, with this one class of query responsible for the majority of the load
       • Load balancers become overloaded with queued requests
       • Spiral of death: the LB stops responding to lbd, the DNS entry is removed, the ENC is not reachable, it comes back, puppetdb replace_facts storm, PDB slows to a crawl, repeat

  3. ai-pdb raw /v3/facts --query '["and", ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_2"], ["=", "value", "adm"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_0"], ["=", "value", "bi"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_1"], ["=", "value", "inter"]]]]]]'
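The query on this slide has a regular shape: one `select-facts` subquery per fact, joined under a top-level `and`. A minimal Python sketch of how such a query could be assembled is shown here. The helper name `facts_query` is hypothetical and is not part of ai-pdb or puppetdb; the sketch only builds the JSON and does not contact any server.

```python
import json


def facts_query(pairs):
    """Build a nested v3-style query: one select-facts subquery per (name, value).

    Hypothetical helper for illustration; it mirrors the structure of the
    query shown on the slide and nothing more.
    """
    clauses = [
        ["in", "certname",
         ["extract", "certname",
          ["select-facts",
           ["and", ["=", "name", name], ["=", "value", value]]]]]
        for name, value in pairs
    ]
    return ["and"] + clauses


# Same facts, in the same order, as the query on the slide.
query = facts_query([
    ("hostgroup_2", "adm"),
    ("hostgroup_0", "bi"),
    ("hostgroup_1", "inter"),
])

# JSON text in the form passed to --query.
print(json.dumps(query))
```

Each fact adds one more subquery, so the three-fact expression above results in three `select-facts` subselects that the backend has to evaluate and intersect.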

  4. Actions and plans
     • CRM-623: remove the "allow ssh from aiadm" rule, which included “$aiadm_nodes = query_nodes('hostgroup_0="bi" and hostgroup_1="inter" and hostgroup_2="adm"', ipaddress)”
     • Reduced the number of fact names (thanks Dan), cleaned up foreman (thanks Nacho)
     • Longer term: reduce the amplification effect from the load balancers
     • Read-only puppetdb for API access
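To make the link between the `query_nodes` expression in CRM-623 and the load on puppetdb more concrete, here is a toy Python sketch that splits such an expression into its fact/value clauses. It is an assumption-laden illustration: `parse_conjunction` is a hypothetical name, and the regex handles only the simple `name="value" and ...` form quoted above, not the real grammar used by `query_nodes`.

```python
import re

# Matches clauses of the form name="value"; deliberately simplistic.
CLAUSE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_conjunction(expr):
    """Return the (fact_name, value) pairs found in a simple 'and' expression."""
    return CLAUSE.findall(expr)


expr = 'hostgroup_0="bi" and hostgroup_1="inter" and hostgroup_2="adm"'
pairs = parse_conjunction(expr)

for name, value in pairs:
    print(f"fact {name} must equal {value}")
print(f"{len(pairs)} fact clauses in total")
```

Since the rule lived in base, this three-clause lookup would have been evaluated on every puppet run of every node that includes base.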

  5. Things we don’t know
     • What triggers the problem? Normally load on the db is minimal
     • Perhaps updating new facts? New fact name across lots of plant this week. Looking at previous events
     • Will engage upstream, but we are behind on puppetdb versions due to dependencies

  6. Other activity
     • Postgres dbod slave for puppetdb
       • So far no stable replication
     • Updates to puppetdb & postgres modules to support r/o puppetdb
     • Raising issues with upstream for foreman issues with hostgroup filtering in the new version
     • New teigi::secret::sub_file type: testers required

  7. In QA
     • CRM-401 Add an option to enable UDT for gridftp servers
     • CRM-567 Smartd Puppet module
     • CRM-575 Add smartd to the base pluginsync whitelist
     • CRM-576 Including the smartd module into the hardware module
     • CRM-577 Deploy blockdevice driver monitoring in QA for EL5 and EL6
     • CRM-611 Update of site.pp to support 10-deep hostgroups
     • CRM-613 Drop alarmed fact from sapp_puppetmaster
     • CRM-615 Removing megacli and adding storcli for vendors transtec and viglen
     • CRM-620 New cern_hwcontract function to extract contractid from hwdb cache
     • CRM-622 New 'ssds' facts
     • CRM-623 Emergency backout of allow ssh from aiadm

  8. QA-Prod
     • CRM-591 Do not clobber ADFS-metadata.xml with puppet
     • CRM-595 Enable buildMap="1" for new (3.5) shibboleths when memcache is enabled
     • CRM-604 facter 1.7.4 -> 1.7.6 upgrade
     • CRM-605 Upgrade mcollective filemgr, package and service plugins
     • CRM-606 Add fact to expose the tenant name
     • CRM-607 Drop active installation nrpe mcollective plugin
     • CRM-608 New Redhat/7.yaml hiera file
     • CRM-609 Add CentOS (7) support to osrepos
     • CRM-610 CentOS as valid OS name
     • CRM-612 Update of hiera config to support 10-deep hostgroups
     • CRM-616 ai-tools 8.2-1
     • CRM-617 Update module to upstream version 1.7.9
     • CRM-618 RHEL5 repo fixes for osrepos
