Building a Proactive Monitoring and Alerting System Using Native IBM Domino Tools

Andy Pedisich, Technotics

Why Do This Session … • Many admins want to take advantage of native Notes monitoring solutions, but they just don’t have the bandwidth to explore them • “Free time” is very rare these days • This jumpstart will show you: • How to collect stats • How to analyze stats • How to go behind the scenes • How to set up monitors and alerts • And how to capture just about any little event you are interested in • And finally, how to configure and work with DDM • Let’s get started

What We’ll Cover … • Looking at the big picture of server monitoring • Understanding statistic generation • Designing an efficient and sensible collection infrastructure • Pulling useful information from statistical data • Using cluster stats to keep clusters reliable • Understanding the essentials of event monitoring • Determining the best notification methods • DDM: Understanding how it fits into your environment • DDM: Crafting a perfect DDM data collection hierarchy • DDM: Looking at DDM events and probes • Wrap-up

Driving Your Domino Servers • You can learn a lot about the importance of monitoring from driving your car • Your car tells you a lot about what’s going on • And you know the indicators are important because you pay attention to them • You fill the gas tank when it’s low • Unless you are Rob Axelrod (ask Rob) • And (usually) pay attention to the speedometer so you won’t get a ticket • Or maybe you’re the driver who thinks that red light on the dash is just for ambience while you’re driving at night • Uh oh

Domino Servers Are Obsessed with Statistics • Domino servers are constantly spewing stats • Just like your car telling you how fast you’re going • Except with Domino there are literally several hundred statistics generated • Most of them are updated continuously • Many administrators don’t know which ones are important • Or how to tell the good readings from the bad ones • Or what to do about them when they are bad

The Truth About Monitoring • A good administrator shouldn’t have to look very hard • And you can be notified about most problems automatically • You can be proactive about fixing them • When you’re proactive, you put out fewer fires • Firefighting dilutes your effort • But being notified requires that you monitor your environment for events and issues • And events depend on statistics • And statistics need to be collected • And too many sites don’t collect stats correctly • Some don’t collect them at all

Perpetual Statistics • Domino servers constantly generate statistics • They track data at a surprising level of detail • On almost every aspect of server operations • Agent manager • Mail and calendaring • The server’s platform • SMTP and Notes mail • LDAP • HTTP • Network • And lots more, too

Server Statistics Are Organized Hierarchically • Stats are gathered into major categories like these • And then each one has a multitude of subcategories

Subcategories of Statistics • Here’s a snapshot from the Administrator client showing some of the statistical hierarchy • This gives you a snapshot of the stats on your server • Use Refresh to get another snapshot

Statistics Come in Basic Types • The basic types of statistics are: • Stats that never change once the server is started • Snapshot stats – reflect what’s going on right now • Cumulative stats that grow from the moment the server is started • These stats are available to you for: • Your Domino servers • The platform your server is running on • Your network environment

Static Statistics • Statistics that don’t change usually represent the operating environment of the server • Server.Version.Notes = Release 8.5.3FP3 • Server.Version.OS = Windows NT 5.0 • Server.CPU.Type = Intel Pentium • Disk.D.Size = 71,847,784,448 • Mem.PhysicalRAM = 527,433,728

Amazing Detail, Yours Free! • This includes OS platform, Domino version, RAM • Lots of information about disks in use • Platform.LogicalDisk.TotalNumofDisks = 3 • Platform.LogicalDisk.2.AssignedName = E • Disk.C.Size = 80,023,715,840 • And even Network Interface Card (NIC) information • Platform.Network.1.AdapterName = Intel[R] PRO_1000 MT Server Adapter • Platform.Network.2.AdapterName = Broadcom NetXtreme Gigabit Ethernet _2 • Platform.Network.3.AdapterName = Broadcom NetXtreme Gigabit Ethernet

What Good Are These Static Stats? • Think these static stats aren’t helpful? • Guess again • They are extremely valuable • If you are collecting stats correctly from all your servers, you can take a pretty detailed server inventory • Without leaving your desk • From servers all around the world, just by looking at the data we’re going to collect in the Monitoring Results database • This database is also known by its filename: STATREP.NSF
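The inventory idea can be sketched in a few lines of Python. This is only an illustration, assuming you have the stats as plain `Name = Value` text (for example, pasted from a console); the helper names `parse_show_stat` and `inventory_row` and the server name `Server01/Example` are hypothetical, and a real inventory would read the same fields from the documents in STATREP.NSF.

```python
# Hypothetical helpers: build one inventory row from "Name = Value" stat text.

def parse_show_stat(text):
    """Parse 'Name = Value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not name/value pairs
            stats[name.strip()] = value.strip()
    return stats

# Static stats taken from the examples in this session.
INVENTORY_FIELDS = [
    "Server.Version.Notes",
    "Server.Version.OS",
    "Server.CPU.Type",
    "Mem.PhysicalRAM",
]

def inventory_row(server_name, stats):
    """Pick the static stats that describe the server's environment."""
    row = {"Server": server_name}
    for field in INVENTORY_FIELDS:
        row[field] = stats.get(field, "(not reported)")
    return row

sample = """Server.Version.Notes = Release 8.5.3FP3
Server.Version.OS = Windows NT 5.0
Server.CPU.Type = Intel Pentium
Disk.D.Size = 71,847,784,448
Mem.PhysicalRAM = 527,433,728"""

row = inventory_row("Server01/Example", parse_show_stat(sample))
for field, value in row.items():
    print(f"{field}: {value}")
```

Running the same function over one stat report per server gives a table you can sort by version or RAM.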

Snapshot Statistics • Snapshot stats show what’s happening at the moment you ask for them • They are changing all the time • Disk.E.Free = 18,679,414,784 • Server.Users = 280 • Mem.Free = 433,614,848 • MAIL.Waiting = 250 • The best part about this is that you get lots of Domino-related stats you wouldn’t get by looking at the operating system’s performance tools

Cumulative Stats • Some stats are cumulative • They start counting from zero when you start the server • Server.Trans.Total = 31,915 • SMTP.MessagesProcessed = 966 • Other stats, such as averages and peaks, are calculated from the cumulative ones • Server.Users.Peak.Time = 02/21/2006 07:50:33 MST • Platform.Memory.PagesPerSec.Peak = 1,364.1

Resetting Statistics • Some of these cumulative stats can be reset using the following console command: • Set Statistics statisticname • You can’t use wildcards (*) with this argument! • Here’s an example of why you might want to reset a stat: • Set Stat Server.Trans.Total • Resets the Server.Trans.Total statistic to 0 • You might want to reset this stat if: • You are starting to benchmark a new application • You are debugging an agent and want to see if it is more efficient after changes to its design
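If you would rather not reset a counter, a benchmark can also be measured as the difference between two readings of a cumulative stat. The sketch below is an alternative to the reset approach on this slide, not a replacement for it; the function names are hypothetical, and the handling of a counter that goes backwards is an assumption about how you would want to treat a restart or reset between readings.

```python
# Hypothetical helpers: measure growth of a cumulative stat between two readings.

def to_number(value):
    """Convert a stat value such as '31,915' to a float."""
    return float(value.replace(",", ""))

def cumulative_delta(before, after, name):
    """Return how much a cumulative stat grew between two snapshots."""
    delta = to_number(after[name]) - to_number(before[name])
    if delta < 0:
        # Assumption: the counter went backwards because the server was
        # restarted or the stat was reset, so it restarted from zero.
        return to_number(after[name])
    return delta

before = {"Server.Trans.Total": "31,915"}
after = {"Server.Trans.Total": "32,415"}
print(cumulative_delta(before, after, "Server.Trans.Total"))
```

Take the first reading before the agent or application runs and the second one afterwards.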

Platform Stats, Too • Platform stats vary widely from OS to OS • Getting platform stats from within Notes has great value • Track Domino server performance on an OS level even if your servers run on a variety of operating systems • For example, it’s very common to have a mix of AIX and Wintel servers • In a few minutes, we’ll be discussing threshold tracking • You’ll be able to set notification thresholds universally from within Notes to track these platform stats

Getting to Platform Statistics • Domino releases 6, 7, and 8 track platform stats automatically • In earlier versions, they had to be explicitly enabled and many times were disabled due to problems with servers crashing • These problems are gone • To see all platform stats – enter this console command • Show stat platform

A Word About Platform Stats on Partitioned Servers • Domino collects platform stats that pertain to the whole system • Not to an individual partition • The only statistics that are specific to a partition are those that reflect tasks, such as process statistics • One partition might run 10 tasks, while another partition runs 15 tasks

Confirming Stats with Other Tools • Be careful when trying to confirm platform statistics using other performance monitoring tools • Because of the differences in sampling intervals, you cannot use native monitoring tools to confirm platform statistics • There will be discrepancies between platform statistics and those obtained … • Using Perfmon – for Windows 2000 • Or a system command, such as one of these UNIX commands: • iostat / vmstat / netstat

See Server Statistics • Quickest way to see all server stats is to enter console command: • Show stat • Any place you can get to a console, you can access stats that can tell you a lot about the current state of the server • A SHOW STAT command gives you every statistic the Domino server has • Several hundred of them! • That’s really too many to deal with at once

Can I See That in a Smaller Size? • Get a better view of the stats showing just what you’re looking for using the asterisk wildcard • You can ask directly for the top level of the hierarchy • Show stat server • That shows all of the stat hierarchy under “server”

You Might Want Only Part of the Data • Getting a select list of just the stats under the top level requires wildcards in your console commands • If you only want the Server.Users hierarchy, use the global “*” • Show stat server.users.*

Pushing the Wildcards • If you want a closer look, like just grabbing particular sub-levels of stats, get clever with the wildcard • For example, use the following command to find out about mail waiting • Show stat mail.wait* • MAIL.Waiting = 1 • Mail.WaitingForDeliveryRetry = 1 • MAIL.WaitingForDIR = 0 • MAIL.WaitingForDNS = 0 • MAIL.WaitingRecipients = 1 • 5 statistics found
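The same kind of wildcard selection is handy once stats have been saved outside the console. Here is a minimal Python sketch, assuming the stats are already in a dictionary; `match_stats` is a hypothetical helper, and it matches case-insensitively only because the sample output above shows `mail.wait*` returning both `MAIL.` and `Mail.` names.

```python
from fnmatch import fnmatchcase

# Hypothetical helper: select stats by a console-style wildcard pattern.
def match_stats(stats, pattern):
    """Return the stats whose names match the pattern, ignoring case."""
    wanted = pattern.lower()
    return {
        name: value
        for name, value in stats.items()
        if fnmatchcase(name.lower(), wanted)
    }

sample_stats = {
    "MAIL.Waiting": "1",
    "Mail.WaitingForDeliveryRetry": "1",
    "MAIL.WaitingForDIR": "0",
    "MAIL.WaitingForDNS": "0",
    "MAIL.WaitingRecipients": "1",
    "Server.Users": "280",
}

for name, value in match_stats(sample_stats, "mail.wait*").items():
    print(f"{name} = {value}")
```

This only models the wildcard form; it does not imitate `Show stat server`, which returns a whole hierarchy without a wildcard.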

Take It to the Next Level • Now that we know where the statistics are, it’s time to kick it up a notch • Let’s set up a collection architecture • Some Notes shops do not collect server statistics at all! • How in the world can they: • Determine what is causing performance issues? • Plan for future growth? • Have a grip on whether their server platforms are configured correctly? • Do they just make the stuff up and go with it?

The Two Things Needed • There are two things that are needed for statistics collection to happen: • The Events4 database must have a Server Statistic Collection document • The Collect task must be running on the server that is designated to collect the statistics

Details, Details, Details • Events4, the Monitoring Configuration database, needs a Server Statistic Collection document for each server collecting stats • This database should replicate to every server in the domain • A server will know it is supposed to collect stats because of this document • But it won’t automatically load the Collect server task • We have to make sure that happens

Server Statistics Collection Docs • Use a Server Statistic Collection doc to indicate the server that will collect stats • And the servers you want the stats collected from

Set the Statistics Collection Interval • Use the collection report interval on the Options tab to set up how often statistics should be gathered • Generally, collecting once an hour is sufficient • If you are upgrading or changing the environment, it’s better to collect every 30 minutes • Or even every 15 minutes, if you are trying to fix problems

A Single Document Looks Like Many in the View • This single document, with a multi-value field containing all the servers, will look like it is multiple documents in the Events4 database • Make sure administrators know this, or they might delete everything by mistake • Guess how I know this?

Centralize Your Domain’s Statistic Collection • Ideally, use just a few key servers to do the collection • You might even be able to get away with just one! • Your network topology will have a profound effect on which servers you select • So will the load currently running on the servers • Avoid collecting stats over long, slow links • Be careful of WAN routes that are already packed with other network traffic

Configure Key Collect Points • If you have offices in London and Tokyo, then pick a collection server from each city • That server will collect stats from all servers in that region • Collect stats in a database created from the Monitoring Results template • The databases don’t have to be called Statrep • Voilà! Centralized data at your fingertips

Remember to Add the Collect Task • The Collect server task must be running on the servers you selected as collectors • Use LOAD COLLECT from the console to get it started • Add the Collect task to the ServerTasks= line in the selected servers’ Notes.ini to make it permanent • Remove Collect from ServerTasks= from all other servers! • Want the servers to start collecting stats immediately? • Use the following console command: • Tell Collector Collect • It will kick off a statistic collection of all the servers you specified
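The ServerTasks= change can be sketched as a small text transformation. This is an illustration only: `set_collect_task` is a hypothetical helper that works on a copy of the Notes.ini text rather than the live file, and the task list in the sample is made up for the example.

```python
# Hypothetical helper: add or remove Collect on the ServerTasks= line.
def set_collect_task(ini_text, enabled):
    """Return Notes.ini text with Collect added to or removed from ServerTasks."""
    result = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            tasks = [task.strip() for task in value.split(",") if task.strip()]
            # Drop any existing entry first so Collect is never listed twice.
            tasks = [task for task in tasks if task.lower() != "collect"]
            if enabled:
                tasks.append("Collect")
            line = "ServerTasks=" + ",".join(tasks)
        result.append(line)
    return "\n".join(result)

# Sample task list is illustrative, not a recommended configuration.
ini = "ServerTasks=Update,Replica,Router,AMgr\nKeyFilename=server.id"
print(set_collect_task(ini, True))
```

Call it with `True` for the collector servers and `False` for every other server, then review the result before applying it.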

The Collect Task Should Not Run on Every Server • Stat collection can be set up so each server collects its own stats • And puts them into a local Statrep Monitoring Results database • This method has the following drawbacks: • You have to run the Collect task on every server • You must visit Statrep on each server to analyze statistics • This is a real pain in the neck • And it makes analysis harder • Statistics have the most value when collected into a central location where they can be easily analyzed

Demonstration: Setting Up the Collect Task

Let’s Start by Looking at Disk Stats • If I get a call about server performance, I check disk stats first • Bad disk utilization can seriously tank a server • One stat to track is Percent Utilization • A very busy disk can mean a very busy server • But it might mean something else is wrong • Perhaps a controller is beginning to fail or the drive cache is misconfigured • Disk stat names depend on the platform, but they have PctUtil in them • It could be Logical Disk or Physical Disk • Like Platform.LogicalDisk.1.PctUtil.Avg • This should rarely hit 60% on Wintel boxes • On AIX and iSeries, it depends on the disk subsystem configuration • They often can run 90%+ without issues
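A quick way to apply these rules of thumb to collected data is a small filter like the one below. It is a sketch, not a standard: `busy_disks` is a hypothetical helper, the 60% Wintel limit comes from this slide, and the 90% limit for AIX and iSeries is an assumption that you should adjust to your own disk subsystem.

```python
# Limits per platform. 60 comes from the slide; 90 is an assumed starting point.
PCT_UTIL_LIMITS = {"wintel": 60.0, "aix": 90.0, "iseries": 90.0}

# Hypothetical helper: flag disk utilization stats above the platform limit.
def busy_disks(stats, platform):
    """Return the PctUtil disk stats that exceed the limit for the platform."""
    limit = PCT_UTIL_LIMITS[platform]
    flagged = {}
    for name, value in stats.items():
        # Covers both LogicalDisk and PhysicalDisk stat names.
        if name.startswith("Platform.") and "Disk" in name and "PctUtil" in name:
            percent = float(value.replace(",", ""))
            if percent > limit:
                flagged[name] = percent
    return flagged

sample_stats = {
    "Platform.LogicalDisk.1.PctUtil.Avg": "72.5",
    "Platform.LogicalDisk.2.PctUtil.Avg": "12",
    "Server.Users": "280",
}
print(busy_disks(sample_stats, "wintel"))
```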

Average Disk Queue Length • This is a major statistic! • Platform.LogicalDisk.1.AvgQueueLen, .Avg and .Peak • Queue lengths of more than a couple of waiting requests mean your disks can’t really keep up with the action • You can hit high peaks occasionally without issues • But constant highs mean it’s time to move users or apps • Balance these disk stats against CPU/Memory stats • Because memory = virtual disk • And constant thrashing of disks might mean you need more RAM • Problem is, Statrep doesn’t have a view that shows these important statistics
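The difference between an occasional peak and a constant high can be checked over a series of collected readings. The sketch below is one hedged way to do it: `queue_is_sustained` is a hypothetical helper, and both the limit of 2.0 (a reading of "more than a couple") and the 80% fraction are assumptions to tune for your own servers.

```python
# Hypothetical helper: tell sustained high queue lengths from occasional peaks.
def queue_is_sustained(samples, limit=2.0, min_fraction=0.8):
    """Return True when most readings of AvgQueueLen are above the limit."""
    if not samples:
        return False
    high = sum(1 for reading in samples if reading > limit)
    return high / len(samples) >= min_fraction

# One spike among quiet readings: an occasional peak.
print(queue_is_sustained([0.5, 9.0, 0.4, 0.6]))
# Every reading above the limit: a constant high.
print(queue_is_sustained([3.1, 4.0, 2.5, 3.3, 2.2]))
```

Feed it the readings from successive stat reports for one disk, for example a day of hourly collections.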

There’s a Lot of Stuff That Isn’t There • Before we get any further, it’s important to point out something that is hidden • Statistical data – in the Monitoring Results database • STATREP.NSF • Statrep’s views simply don’t show data that is as useful as it could be • The data is there, it’s just not in the views • However, it’s important to know that every document in the database contains every statistic you see when you issue a SHOW STAT command at the console • It’s just a matter of showing it in a view

Take Home This View • But now you have a version of Statrep with a view that does contain those important stats! • A specially-crafted version of the Statrep template with a view like the one below is available • You can download it from my blog • You’ll probably have to modify the columns based on the disk configurations of your own systems

Processor Statistics • Platform.Memory.RAM stats will disclose memory usage • Don’t just think you might need more memory: be certain by checking this out • On Wintel systems, this number should rarely hit 60% • But on iSeries and AIX, it can be much higher • On iSeries it can actually run quite nicely at 90%

CPU Stats Are There for Each Task • Platform.Process.ActiveDomino.TotalCpuUtil • Gives you the big picture of how Domino is using processors • There is a Platform.Process.$$$.PctCpuUtil stat for each task you run on your Domino servers • Platform.Process.Amgr.PctCpuUtil • Platform.Process.Router.PctCpuUtil • Platform.Process.Process.PctCpuUtil • … And so on
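To find the biggest CPU consumer quickly, the per-task stats can be ranked. This is a minimal sketch assuming the stats are already in a dictionary; `top_cpu_tasks` is a hypothetical helper and the percentages in the sample are invented for illustration.

```python
import re

# Matches Platform.Process.<task>.PctCpuUtil and captures the task name.
TASK_CPU = re.compile(r"^Platform\.Process\.([^.]+)\.PctCpuUtil$")

# Hypothetical helper: rank Domino tasks by their CPU utilization stat.
def top_cpu_tasks(stats, count=3):
    """Return (task, percent) pairs, highest CPU utilization first."""
    usage = []
    for name, value in stats.items():
        found = TASK_CPU.match(name)
        if found:
            usage.append((found.group(1), float(value.replace(",", ""))))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:count]

# Sample values are made up for the example.
sample_stats = {
    "Platform.Process.ActiveDomino.TotalCpuUtil": "55.0",
    "Platform.Process.Amgr.PctCpuUtil": "41.2",
    "Platform.Process.Router.PctCpuUtil": "3.5",
}
for task, percent in top_cpu_tasks(sample_stats):
    print(f"{task}: {percent}%")
```

The total stat is skipped on purpose because its name ends in TotalCpuUtil rather than PctCpuUtil.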

Using These Stats • You might find that the Agent Manager is the biggest hog because of user personal agents! • You could move busy user agents to a different server • These stats don’t show in the Lotus version of Statrep • But they are on the Technotics85Statrep.NTF version • You can download it from my blog • www.andypedisich.com

Why Wouldn’t the Failover Replica Be Up to Date? • When primary server is down, users are directed to a replica on a failover server • But sometimes that replica is not up to date • Cluster replication keeps primary server in sync with failover • It’s an event-driven process – occurs automatically when a change is made to a database • Changes to a database are pushed to the replica on failover • Deletion stubs are not replicated • That’s why you also need a scheduled replication doc between servers in a cluster • It’s vital that these replicas are synchronized • But by default, clusters only have 1 cluster replicator task

Not Now … I’m Too Busy • Occasionally, there is too much data changing to be replicated efficiently by a single cluster replicator • If cluster replicators are too busy, replication is queued until more resources are available • Your databases get out of sync and stale • Adding a cluster replicator will help fix this problem • Use this parameter in the Notes.ini • CLUSTER_REPLICATORS=# • But how do you tell if there’s a potential problem? • Adding too many cluster replicators will have a negative effect on server performance

Key Stats for Vital Information About Cluster Replication

What to Do About Stats Over the Limit • Acceptable Replica.Cluster.SecondsOnQueue • Queue is checked every 15 seconds • Under light load, should be less than 15 seconds • Under heavy load, if the number is larger than 30, another cluster replicator should be added • If the above statistic is low, and Replica.Cluster.WorkQueueDepth is constantly higher than 10 • Perhaps your network bandwidth is too low • Consider setting up a private LAN for cluster replication traffic
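These guidelines can be written down as a simple decision rule. The sketch below is one reading of the slide, not an official formula: `cluster_advice` is a hypothetical helper, "low" is assumed to mean under 15 seconds, and "constantly" is assumed to mean every collected WorkQueueDepth reading is above 10.

```python
# Hypothetical helper: turn the cluster replication guidelines into advice.
def cluster_advice(seconds_on_queue, depth_samples, heavy_load):
    """Suggest an action from SecondsOnQueue and WorkQueueDepth readings."""
    if heavy_load and seconds_on_queue > 30:
        return "add a cluster replicator"
    # Assumption: 'low' means under the 15-second light-load guideline.
    if seconds_on_queue < 15:
        # Assumption: 'constantly' means every reading is above 10.
        if depth_samples and all(depth > 10 for depth in depth_samples):
            return "check network bandwidth"
        return "ok"
    return "keep watching"

print(cluster_advice(45, [3, 4], heavy_load=True))
print(cluster_advice(5, [12, 15, 11], heavy_load=False))
```

Remember that adding replicators has its own cost, so treat the first result as a prompt to investigate rather than an automatic change.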
