Framework Functionality, Tuning and Troubleshooting

doris-burns

Presentation Transcript


  1. Framework Functionality, Tuning and Troubleshooting

  2. Purpose Tuning is a delicate and complex process. The variables are numerous, and tuning must begin at the operating system and network levels. Before applications can be streamlined, it is crucial to ensure that the operating system can service requests and that the network is stable. An understanding of the tuning components is imperative, and only careful analysis and implementation will ensure that the environment is streamlined and scalable. All of these variables are related to each other, and balancing them is key. Every environment is unique with regard to the applications and functions it serves. This document explains the components that must be tuned. As in most scenarios, a baseline configuration is implemented first and then refined until a final working configuration is reached.

  3. Agenda • Operating System Tuning • File Descriptors • Process Per User • Network Tuning • Memory Buffer • Socket Buffer • TCP • UDP • Tivoli Tuning • Oserv • Gateway • Dump/Cores • Endpoint Manager • MDist/MDist2 • Endpoints • Endpoint Communication Reliability • 3.7.1 – 4.1 • 4.1.1 – 4.3.1

  4. Functionality, Tuning and Troubleshooting • Gateway Daemon • Jobq threads • TMF_Dispatch Thread • Threads Categorized • Gateway Functionality - logstatus • epact.bdb – Endpoint Activity Database • Gateway Commands • Gateway Tuning Considerations • Gateway tuning script • Examples of errors

  5. Operating System Tuning • File Descriptors • Handle created by a process when a file is opened. • Released when the file is closed or the process terminates. • Commands to view descriptors • Commands to change descriptors • Make the change permanent. • Sockets • Formulas • Processes Per User • Predefined Limit • Maximum number of concurrent processes per user ID. • Formulas

  6. Operating System Tuning cont’d • Network Tuning • Network Options • Memory Buffer • Formula • Socket Buffer Maximum • Formula • TCP • sendspace • recvspace • Formula • UDP • sendspace • recvspace • Formula

  7. Tivoli Tuning • ITM/DM • request_manager.threads • taskengine.max_threads • Endpoints • ifs_ignore • Login thread usage • Endpoint command usage • continue_eplogin_onerror • Oserv • rpc_max_threads • iom_by_name • set_force_bind • Gateway • rpc_maxthreads • max_concurrent_jobs • max_concurrent_logins • eplogin_timeout • continue_eplogin_onerror • tcp_backlog • Endpoint Manager • epmgr_rpc_max_threads • login_limit • timeout_interval • MDist/MDist2 • max_sessions • mem_max • net_load • max_conn • packet size • rpt_dir • disk_dir

  8. Endpoint Communication Reliability • Login Process for TMF 3.7.x or 4.1 • LCFD • Sends login request to the gateway • Receives login results from the gateway • GATEWAY • Receives endpoint login request and puts it in the JOBQ • JOBQ handler starts login thread • Is the endpoint login a normal login? • Yes: processes the normal login and sends the result back to the endpoint • No: for initial, migratory or isolated logins, endpoint login threads contact the endpoint manager; sends new login_info to the endpoint • EPMGR • Processes initial, migratory or isolated logins • Sends login_info to the gateway

  9. Endpoint Communication Reliability continued • Login Process for TMF 4.1.1 • LCFD • Sends login request to the gateway • Receives login timeout • Sets login interval • Receives login results from the gateway, or a -1 timeout if the endpoint manager is busy, and resets the timeout in the login cycle • GATEWAY • Receives endpoint login request and puts it in the endpoint login JOBQ • JOBQ handler starts login thread • Is the endpoint login a normal login? • Yes: processes the normal login and sends the result back to the endpoint • No: initial, migratory or isolated logins are forwarded to the endpoint manager; sends new login_info, or a -1 timeout if the endpoint manager is busy, to the endpoint • EPMGR • Processes initial, migratory or isolated logins • Sends login_info to the gateway • If the endpoint manager is busy, throws an exception

  10. Gateway Log Messages • Gateway Boot Process • Read • Initialize • Opens • Contacts Endpoint Manager • Threads started • Reconnect Thread • Listens for incoming requests from the endpoints • Login Thread • Gateway process listens to the broadcast login requests from endpoints • Reader Thread • Processes all gateway requests assigned to the reader queue • Logstatus Thread • At every logstatus_interval, logs the status of JOBQs and endpoints

  11. Endpoint Login Thread • Endpoint Data Functions • Data Stored • Downcall Process • Checks credentials of the user • Resolves method dependencies • Opens Connection • Endpoint Performs Downcall • Returns Results • Upcall Process • Reconnect Thread gets Upcall • Reader Thread creates Job • JOBQ Monitoring Threads Starts • Updates • Upcall Method resolves • Gateway Returns Results. • Endpoint Login Thread • Login request • Runs login_filter • Compares endpoint data • Login_info • Upgrade • Login Policy Boot Method • Multi-Cast • Notifies Repeater • Login_status • Endpoint Logout Process • Logout Packet

  12. Gateway Purposes and Impact • Gateway Process • Designed to handle a large percentage of management functions • Gateway Stability • Number of management units affected • Ability to manage endpoints • Crucial to 24x7 operations • Products Impacted by Gateway Stability • Tivoli Software Distribution • Tivoli Inventory • System Availability

  13. Gateway Daemon • Gateway Thread Usage • Fixed Remote Procedure Call (RPC) • Gateway method spawned from an oserv thread • gateway_method_in() • Two types of gateway threads execute gateway methods • Jobq threads • Tmf_dispatch thread

  14. Gateway Daemon (cont’d) • Jobq Threads • new_job() • Processed by pthreads • max_concurrent_jobs • Pthreads best practices • Pthreads=max_concurrent_jobs=number of hosting gateways >= 200 <= 1000 (or to the nearest 100) • Monitoring memory usage of gateway process. • Monitoring the endpoint manager and management regions methods • TMF_Dispatch Threads • tmf_dispatch()

  15. Gateway Thread Execution

  16. Gateway Function - logstatus • wgateway <gw_label> logstatus • Jobq_threads in queue • Jobq_threads running • Reader_threads in queue • GW methods running • Examples of logstatus responses: • bash#$ wgateway gbl_gw logstatus • Status Data --------- • Jobq_threads in queue : 0 • Jobq_threads running : 0 • Reader_threads in queue : 0 • GW methods running : 1

  17. Gateway Function – logstatus (cont’d) • Logstatus_interval • Sets interval for gatelog update • Set smaller than the default for more useful data • Entries are written to the gatelog regardless of debug level • Example of logstatus entry in gatelog: • 2002/09/18 23:03:48 +05 010275A0: STATUS DATA: jobqq= 0 jobqr= 0 reqdq= 0 gwmethods= 0 • Setting the logstatus_interval • wgateway <gw_label> logstatus_interval 600
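
The STATUS DATA entry shown on this slide is regular enough to parse mechanically, for example to chart queue depth over time. The following is a minimal Python sketch, assuming every gatelog status line has exactly the `name= value` layout of the example above; the function name `parse_status_line` is hypothetical and is not part of any Tivoli tooling.

```python
import re

# Matches "name= value" pairs such as "jobqq= 0" in a STATUS DATA entry.
_PAIR = re.compile(r"(\w+)=\s*(\d+)")


def parse_status_line(line):
    """Return the counters from a gatelog STATUS DATA line, or None.

    Assumption: the line layout matches the example on the slide.
    """
    marker = "STATUS DATA:"
    pos = line.find(marker)
    if pos == -1:
        return None  # not a status entry
    tail = line[pos + len(marker):]
    return {name: int(value) for name, value in _PAIR.findall(tail)}


sample = ("2002/09/18 23:03:48 +05 010275A0: STATUS DATA: "
          "jobqq= 0 jobqr= 0 reqdq= 0 gwmethods= 0")
counters = parse_status_line(sample)
print(counters)
```

Lines that do not contain the STATUS DATA marker return `None`, so the function can be applied to every line of a gatelog without pre-filtering.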

  18. Logstatus of a Gateway • [Diagram labels: Gateway Method Calls, gateway_method_in(), tmf_dispatch(), Job Queue, Running Jobs, Exit] • The reconnect_thread listens to TCP communication from an endpoint • The read queue processes only endpoint logins, endpoint logouts, upcalls, upcall proxy and endpoint control packets

  19. Gateway Commands • set_session_timeout seconds • Can override with task, software package times, etc. • set_rpc_maxthreads count • Currently must be (max_concurrent_jobs + 50) • set_max_concurrent_jobs count • Currently max = 2000 • set_max_concurrent_logins count • logstatus_interval seconds • Interval at which statistics are written to the gatelog • set_method_trace_time seconds • Disable, or set the interval at which epact.bdb is refreshed with endpoint method date and time information • set_debug_level • Keep the value at 1 when not troubleshooting to reduce the load on system resources.

  20. Gateway Tuning Considerations • Operating System – File Descriptors • Oserv – RPC threads • Gateway operations – RPC threads and jobq • Endpoint Manager – RPC threads

  21. RPC Threads • [Diagram labels: OS File Descriptors, Limits, Admin Console, SP Editor, Web Gateway, Admin Console Agent, Oserv RPC Threads, EPMgr RPC Threads, Gateway RPC Threads] • The number of OS file descriptors limits the number of all other RPC threads in Tivoli.

  22. Operating System File Descriptors • Each thread created consumes OS file descriptors • NT/W2K (all versions) - testing reveals that the maximum is 2038. • AIX - Not affected if the file descriptors (nofiles(descriptors)) are set to unlimited, but set only the soft limit. • Solaris - Has limits on how large the file descriptors (nofiles) can be set; the best practice is to ensure that the value is at least twice the value of the oserv RPC threads.
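
The "at least twice the oserv RPC threads" rule of thumb can be checked from a script before raising `rpc_max_threads`. The sketch below is a hedged illustration in Python, not a Tivoli utility: the helper name `descriptors_sufficient` is hypothetical, the rule is the one stated on this slide for Solaris, and the `resource` module used to read the current limit is available on Unix only.

```python
import resource


def descriptors_sufficient(soft_limit, rpc_max_threads):
    """True if the soft nofiles limit is at least twice the RPC thread count.

    Assumption: the slide's Solaris rule of thumb (descriptors >= 2 x threads).
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True  # "unlimited" always satisfies the rule
    return soft_limit >= 2 * rpc_max_threads


# Read the limits of the current process (soft limit, hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# 250 is the default oserv rpc_max_threads quoted later in this presentation.
print(descriptors_sufficient(soft, 250))
```

The check uses the soft limit because that is the value a running process is actually held to; the hard limit is only the ceiling to which the soft limit may be raised.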

  23. Procedures to change File Descriptors • NT/W2K (all versions) • Default desktop memory heap is 512 • Check value • regedt32 • HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems • The default data for this registry value will look something like the following: %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16 • Double-click Windows and change the SharedSection to 1024,3072,1024 • This changes the third value from 512 to 1024 • Close regedt32 • Reboot the system

  24. Procedures to change File Descriptors (cont’d) • AIX: • ulimit -a lists the current values for the current user ID. • The data will look something like the following: • file(blocks) 100 • data(kbytes) 523256 • stack(kbytes) 512 • coredump(blocks) 200 • nofiles(descriptors) 64 • memory(kbytes) unlimited • ulimit -n sets the descriptors for the current user ID dynamically (e.g. ulimit -n unlimited) • The oserv needs to be recycled (NOT reexec) before changes will take effect • ulimit -Sn is the command to put in the system startup files so that the value is set statically at boot time to accommodate Tivoli RPC threads. • The operating system will need to be rebooted before changes will take effect.

  25. Procedures to change File Descriptors (cont’d) • Solaris • ulimit -a lists the current values for the current user ID. • The data will look something like the following: • time(seconds) unlimited • file(blocks) 2097151 • data(kbytes) 131072 • stack(kbytes) 32768 • memory(kbytes) 32768 • coredump(blocks) 2097151 • nofiles(descriptors) 2000 • ulimit -n sets the descriptors for the current user ID dynamically (e.g. ulimit -n unlimited) • The oserv needs to be recycled (NOT reexec) before changes will take effect • ulimit -Sn is the command to put in the system startup files so that the value is set statically at boot time to accommodate Tivoli RPC threads. • The operating system will need to be rebooted before changes will take effect.

  26. Oserv RPC Threads • Set rpc_max_threads as high as the OS will allow • (default is 250) • Viewing this parameter • odadmin get_rpc_max_threads #### • Setting this parameter • odadmin set_rpc_max_threads ####

  27. RPC Threads and Jobq • Every gateway method is handled by a gateway RPC thread, initially spawned via an oserv thread. • The gateway RPC thread determines what type of thread executes the method: • Queued • Note: the valid range for max_concurrent_jobs on a gateway is 200-2500, and it cannot be greater than (rpc_maxthreads - 50) • To set the value • wgateway <gw_label> set_max_concurrent_jobs <value> • Non-Queued • Note: the valid range for rpc_maxthreads is 250-2000, and it cannot be less than (max_concurrent_jobs + 50) • Formula: rpc_maxthreads > max_concurrent_jobs + max_concurrent_logins + 20 • To set the value • wgateway <gw_label> set_rpc_maxthreads <value>
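
The ranges and the formula on this slide can be combined into a single sanity check to run before calling `wgateway`. This is a minimal Python sketch under stated assumptions: the function name `check_gateway_threads` is hypothetical, and the limits are copied from this slide as written (note that an earlier slide quotes a maximum of 2000 for max_concurrent_jobs, so the limits in a given TMF level should be confirmed first).

```python
def check_gateway_threads(rpc_maxthreads, max_concurrent_jobs,
                          max_concurrent_logins):
    """Return a list of rule violations; an empty list means all rules hold.

    Assumption: ranges and formula are exactly those quoted on the slide.
    """
    problems = []
    if not 200 <= max_concurrent_jobs <= 2500:
        problems.append("max_concurrent_jobs is outside 200-2500")
    if not 250 <= rpc_maxthreads <= 2000:
        problems.append("rpc_maxthreads is outside 250-2000")
    if max_concurrent_jobs > rpc_maxthreads - 50:
        problems.append("max_concurrent_jobs exceeds rpc_maxthreads - 50")
    if rpc_maxthreads <= (max_concurrent_jobs + max_concurrent_logins + 20):
        problems.append(
            "rpc_maxthreads is not greater than "
            "max_concurrent_jobs + max_concurrent_logins + 20")
    return problems


# Example values chosen only for illustration.
for problem in check_gateway_threads(250, 200, 50):
    print(problem)
```

With the illustrative values 250, 200 and 50, both range checks and the 50-thread headroom check pass, but the formula fails because 250 is not greater than 200 + 50 + 20 = 270.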

  28. Endpoint Manager Threads • Endpoint Manager Policy Threads • max_install – maximum number of concurrent allow_install_policy scripts • max_sgp – maximum number of concurrent select_gateway_policy scripts • max_after – maximum number of concurrent after_install_policy scripts • Viewing the Endpoint Manager RPC threads • wepmgr get max_epmgr_rpc_threads • Yields something like the following: max_install = 10 max_sgp = 10 max_after = 10 login_interval = 300 stanza_interval = 720 max_iom_records = 500 epmgr_flags = 1 max_epmgr_rpc_threads = 300 automigrate = off migrate_max = 0 chk_cntl_chars = 0 labelspace = “ invalid_chars = “ • Setting the Endpoint Manager RPC threads • wepmgr set max_epmgr_rpc_threads <value>
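
Output of this `name = value` kind is easy to load into a dictionary for comparison against recommended values. The Python sketch below is illustrative only: it assumes the command prints one `name = value` pair per line (the transcript has flattened the original layout, so this is a guess), the helper `parse_settings` is hypothetical, and the sample text is a subset of the values shown on this slide.

```python
def parse_settings(text):
    """Parse 'name = value' lines into a dict of strings.

    Assumption: one setting per line; lines without '=' are skipped.
    """
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a setting line
        settings[key.strip()] = value.strip()
    return settings


sample = """max_install = 10
max_sgp = 10
max_after = 10
login_interval = 300
max_epmgr_rpc_threads = 300
"""
settings = parse_settings(sample)
print(int(settings["max_epmgr_rpc_threads"]))
```

Values are kept as strings because some settings, such as automigrate, are not numeric; callers convert the ones they need.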

  29. Gw_tune.pl script • Usage: gw_tune.pl [-h | -r | -s ] • Calculate the recommended values of RPC threads for the oserv, epmgr and gateway. Additionally, calculate the max_concurrent_jobs setting for the gateways. (Read the script header for the methodology, or the IBM/Tivoli Gateway Tuning Field Guide.) • This script only works on the TMR it is being run on. It should be run on each TMR in an environment. • No options: this message • -r: calculate and show the recommendations ONLY. • -s: calculate the recommendations AND actually set them in the TMR. • -h: this usage statement. • NOTE: This script can ask you to change a value that requires operating system changes.

  30. Errors and Examples • Gateway thread usage errors in the gatelog • 2003/09/20 14:04:22 -01 #RPC Request rejected outstanding threads 250 • Summary of Queued Methods • Downcalls to endpoints • Excessive isolated logins • Delayed distributions • Summary of Non-Queued Methods • Note: the following methods (except for login_policy) are bound by an RPC thread limit of 250 on the gateway. • The foremost method is the login_policy method • Deleting and adding endpoints uses RPC threads • Mass migrations of endpoints use up RPC threads on the old gateway. • Debug levels: • Endpoints: • Debug level 3 is typically the best level to troubleshoot effectively. • Debug level 4 will provide the necessary data to identify a network issue in addition to Tivoli issues. • Gateways: • Debug level 6 is typically the best level to troubleshoot effectively. • Debug level 9 provides more granular errors for troubleshooting and diagnosis. • NOTE: There is a policy script mechanism that I engineered to help with endpoint login and streamlining of endpoint maintenance.
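
A quick way to see whether a gateway is hitting its RPC thread ceiling is to scan the gatelog for the rejection message shown on this slide. The following Python sketch is a hedged example: it assumes the message text is exactly "RPC Request rejected outstanding threads" followed by a number, as in the sample entry, and the function name `rejected_thread_counts` is hypothetical.

```python
import re

# Assumption: the message text matches the sample gatelog entry on the slide.
_REJECT = re.compile(r"RPC Request rejected outstanding threads\s+(\d+)")


def rejected_thread_counts(lines):
    """Return the thread count reported by each rejection entry found."""
    counts = []
    for line in lines:
        match = _REJECT.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


log = [
    "2003/09/20 14:04:22 -01 #RPC Request rejected outstanding threads 250",
    "2002/09/18 23:03:48 +05 010275A0: STATUS DATA: "
    "jobqq= 0 jobqr= 0 reqdq= 0 gwmethods= 0",
]
print(rejected_thread_counts(log))
```

The length of the returned list gives the number of rejections in the scanned period, and the values show which thread limit was in force when each request was rejected.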

  31. The Tivoli Factors • Endpoint Instability • Gateway Instability/Unresponsiveness • Endpoint Manager Instability/Unresponsiveness • Oserv instability

  32. The Tivoli Factors - Oserv Instability • Resource exhaustion of RPC threads (rpc_max_threads) • Limits • Windows (hard-coded in TMF at 2038) • Unix (flavour specific, limited by the number of OS file descriptors) • FWK commands that contact all MNs, such as wchkdb and wgateway, will exhaust a poorly configured parameter • ITM’s Request Manager uses an oserv thread for each endpoint downcall it manages. • request_manager.threads (the default is 10, which is usually too low) • Be cautious when changing. • The tec_gateway uses RPC threads to receive upcalls from endpoints. • Parameter GWThreadCount (tec_gateway max RPC threads) range = 250-10000 (default 250) • Be very cautious when changing.

  33. The Tivoli Factors - Oserv Instability • Disk/File System Space • $DBDIR • Corrupt/inconsistent odb.bdb • Memory – on AIX • Default maximum process size of 256MBytes (function of LDR_CNTRL and hard limit) • Memory leaks can cause core dumps when limit is reached • Transaction log size • Odb.log

  34. The Tivoli Factors - Endpoint Manager Instability • Resource Exhaustion of RPC threads (max_epmgr_rpc_threads) • Too many endpoint calls can exhaust this resource quickly • Logins (Isolation, Initial, Orphaned) • Migrations • Long running policy scripts and w-commands • The real story about max_sgp, max_allow and max_after • Hung/Hanging gateway methods • Restricted/Impacted by the availability of oserv RPC threads • So exercise some caution when changing

  35. The Tivoli Factors - Gateway Instability • Resource exhaustion of RPC threads (rpc_maxthreads) • Each gateway method uses an RPC thread until its type is determined. • Each MDist2 session uses an RPC thread (rpt <-> app, rpt <-> rpt) • Restricted/impacted by the availability of oserv RPC threads. • JOBQ threads (pthreads) – max_concurrent_jobs, max_concurrent_logins • Too many upcalls can result in this resource being overextended • Results in requests being queued • Can result in isolation logins

  36. The Tivoli Factors - Gateway Instability • Disk Space • Depot directory (rpt_dir) • ‘depot’ and ‘states’ directories share same file system • Corrupt/inconsistent MDist2.bdb in states directory • reconnect_thread overloaded • Single threaded • TCP Backlog full • TCP requests get a slow response • No failures from a framework perspective • Applications fail because of delay

  37. The Tivoli Factors - Gateway workflow

  38. The Tivoli Factors - Endpoint Instability • Gateway unavailable • Down • Results in an Isolation login to alternate gateway • Unmanageable until it completes login • Busy • Results in queued requests • Delays • Retries • Port saturation • Upcall and downcall clash. • ITM heartbeat is a downcall that had a large dependency list • Fixed in 5.1.1-ITM-FP05 • Upcalls not possible during downcall.

  39. Software Distribution • Slow reporting • Distributions and reporting use MDist2 sessions • Consider increasing sessions to ease contention • report_thread_limit controls the number of mdist2_result methods that run from SH on TMR server • When report_thread_limit is reached, SWDIST throws an exception and messages in queue go into INTERRUPTED state • Parameter conn_retry_interval is used to trigger reprocessing of messages • Default of 900 seconds too high. Decrease to 120 seconds • no endpoints on gateway • customers using 20 seconds with good results

  40. Distributed Monitoring • Too many upcalls because of too many monitors • saturate the reader_thread on gateway • Symptoms • JOBQ_THREAD_EXCEPTION errors because listening port no longer available to reciprocating downcall. • Not many jobs in JOBQ • Upcall request delays seen in lcfd.log (no errors)

  41. IBM Tivoli Monitoring • ITM Heartbeat downcall takes too long (fixed – just an example) • Upcalls on endpoints can’t succeed because of downcalls • When an application upcall fails, it forces the endpoint to become isolated • Cascading effect of many isolation logins causing gateways to become unresponsive because they are busy • Incorrectly configured Web Health Console and Heartbeat • Too many downcalls generated

  42. Inventory • Collection Table of Contents (CTOC) processing • Uses pthreads which are configured by max_input_threads and max_output_threads • Maximum of 100 input and output threads respectively • Each pthread opens an IOM channel to pick up a datapack • same IOM channel used if datapack source is the same as previous connection • new IOM channel opened if datapack source is different from previous connection • Maximum of 250 concurrent IOM channels • RPC threads used to maintain IOM channels for upcall (CTOC) and downcall (datapack) • RIM connections • RDBMS deadlocks caused by too many connections

  43. Best Practices - Stability • Monitor Disk/File System Space and Transaction Log Size • To avoid odb.bdb corruption and oserv instability • To avoid gateway distribution failures because of MDist2.bdb corruption • Don’t stop oserv if odb.log is too large • INVESTIGATE! • Inventory • 1 RIM and 5 connections as a starting point • Increase connections and RIMs as performance requires • Software Distribution • 50,000 endpoints per distribution • 7,000-10,000 endpoints per activity plan • 5-10 activities per plan*

  44. ifs_ignore • Gateway will listen on one or all interfaces • Gateway listens to the same interfaces as the oserv based on the odadmin “set_force_bind” setting • Use one or all, no other solution • More flexibility and control • Can now select which interfaces you want the gateway to use • Was available in 3.7.1 and 4.1 as an idlattr • wgateway <gwlabel> set ifs_ignore <IP address> • wgateway <gwlabel> run ifs_ignore_remove <IP address>

  45. mcache_bwcontrol • Method/dependency downloads were not bandwidth controlled • Method/dependency downloads flooded slow links • Preloading/prestaging of cache directory workarounds • Method/dependency downloads can now use mdist2 bandwidth • wgateway <gwlabel> mcache_bwcontrol TRUE • Default is FALSE.

  46. run sync_login_interval • Change to EPMGR login_interval • Gateway would use old value • Gateway loads login_interval at boot time • Hence a restart of gateway needed to effect changes • wgateway <gwlabel> run sync_login_interval • Gateway’s login_interval is synchronized with EPMGR • No restart of gateway needed to effect change.

  47. New Fixes/Features • In TMF 4.3.1 the RDBMS interface module (RIM) uses ODBC to connect to a Microsoft SQL Server database rather than using a Microsoft SQL Server client. Therefore, you must create an ODBC data source to enable RIM to connect to a Microsoft SQL Server database. For more information about creating an ODBC data source, see the chapter about using RIM objects in the Tivoli Enterprise Installation Guide.

  48. New Fixes/Features • Update JRE to 1.4.2 in TMF 4.3.1 • As JRE 1.3.1 is going out of service and new platforms require JRE 1.4, TMF 4.3.1 will distribute the new JRE 1.4.2. TCM Java components such as Software Package Editor, APM, CCM, GUIs and ISMP will be rebuilt using the new JDK 1.4 and will run with JRE 1.4.2.

  49. New Fixes/Features • The following new platforms are supported in 4.3.1 • AIX 6.1 (Server/Gateway) • Windows 2008 (Server/Gateway, ISMP) • Windows Vista EP (added Desktop formerly not supported due to JRE) • SLES 10 and RHEL 5 (EP, Server/Gateway, ISMP) • Linux PPC (Server/Gateway) • Solaris x86 (Server/Gateway)

  50. New Fixes/Features • Manage 64-bit Windows OS in TMF 4.3.1 • TCM/TMF are 32-bit applications, but may need to run 64-bit binaries or .vbs scripts and directly access the 64-bit registry and file system on Windows endpoints. • Wrappers are provided that take as input the name of a script or binary and the required parameters, disable registry and file system redirection, run the binary or script, and then restore the redirection (wrun-AMD64.exe and wrun-IA64.exe), under directory: lcf_bundle.43100\bin\w32-ix86\tools
