A good quick reference guide on Veritas Cluster.
A very good reference guide on VxVM.
Important VCS configuration files
tail -f /var/VRTSvcs/log/engine_A.log - VCS error logs
Log message tags:
TAG_A: VCS internal error, requires immediate attention
TAG_B: Messages indicating errors and exceptions
TAG_C: Messages indicating warnings
TAG_D: Messages indicating normal operations
TAG_E: Informational messages from agents
Log files:
/var/VRTSvcs/log/engine_A.log
/var/VRTSvcs/log/hashadow_A.log
/var/VRTSvcs/log/AgentName_A.log - agent logs
Configuration files and paths:
/etc/VRTSvcs/conf/config/main.cf - Main configuration file
/etc/VRTSvcs/conf/config/types.cf - Configuration file
/etc/llttab - sets the frequency of heartbeats
/etc/gabtab - sets the number of nodes required to be running GAB to seed the cluster before the VCS engine will start
/etc/sysname - Optional configuration file
/opt/VRTSvcs/bin - Command path
/app/VCS/VRTSvcs/bin - CS customised command path
How to Check the Veritas Cluster Status
# hastatus -sum |more
# hasys -list - list all participating cluster nodes
# hasys -display - detailed info on system attributes
# hatype -display - detailed info on the resources types
# hagrp -state | more - show the state of the groups
# hagrp -display <group_name> - display the details of the group
# hagrp -resources <group_name> - show resources associated with group
# hares -state <resource_name> - show the state of a resource
# hares -display <resource_name> - display the details of the resource
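The summary output can also be filtered for groups that need attention. The following is a minimal sketch, assuming that group lines in `hastatus -sum` output start with "B" and carry the state in the sixth column; that layout and the sample text are assumptions for illustration, so verify them against your own release. The function name `unhealthy_groups` is hypothetical.

```shell
# List service groups whose state contains FAULTED or PARTIAL.
# Reads hastatus -sum style text on stdin. Assumes group lines start
# with "B" and that the state is the sixth whitespace-separated column.
unhealthy_groups() {
    awk '$1 == "B" && $6 ~ /FAULTED|PARTIAL/ { print $2, $3, $6 }'
}

# Illustrative sample only, not captured from a real cluster.
sample='-- GROUP STATE
-- Group           System    Probed  AutoDisabled  State
B  sg_app          node1     Y       N             ONLINE
B  sg_app          node2     Y       N             OFFLINE|FAULTED
B  sg_db           node1     Y       N             PARTIAL'

printf '%s\n' "$sample" | unhealthy_groups
```

On a cluster node the same function would be fed with `hastatus -sum | unhealthy_groups`.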
How To Failover node when it is in Partial / Faulted Stage.
How to do Failover from one node to Other
1. Check the dependent resource group.
# hagrp -dep sg_PSGITDBA01
#Parent Child Relationship
sg_PSGCSGRD01 sg_inf_PSGCSGRD01 online local firm
sg_PSGEQHLT02 sg_inf_PSGEQHLT02 online local firm
Offline sequence:
1. Parent
2. Child_1
3. Child_2
Online sequence:
1. Child_2
2. Child_1
3. Parent
If there is a dependency, follow this sequence:
a) First offline the "Parent" resource group.
b) Switch the respective "Child" resource group.
c) Online the "Parent" resource group.
# hagrp -switch <Resource Group> -to <target node> - switches the resource group to the target node
OR
# hagrp -offline <Inf Resource Group> -sys <current node> - offlines the resource group on the specified node
# hagrp -online <Inf Resource Group> -sys <target node> - onlines the resource group on the specified node
NOTE :
1. If a "Resource" in a "Resource Group" is in the "Faulted" state, the "Resource Group" will not come online until you bring that "Resource" online from the "Faulted" state.
2. A "Resource Group" will fail to come online if any of its resources is in the "Faulted" state.
#hastatus -sum |grep -i Partial
#hastatus -sum |grep -i faulted
# hagrp -dep
Parent Child Relationship
sg_PSGCSGRD01 sg_inf_PSGCSGRD01 online local firm
sg_PSGEQHLT02 sg_inf_PSGEQHLT02 online local firm
Offline sequence:
1. Parent
2. Child_1
3. Child_2
Online sequence:
1. Child_2
2. Child_1
3. Parent
If there is a dependency, follow this sequence:
a) First offline the "Parent" resource group. ( hagrp -offline <Parent Resource Group> -sys <current node> )
b) Clear the faulted resource flag. ( hagrp -clear <group> [-sys <system>] )
c) Switch the respective "Child" resource group. ( hagrp -switch <Child Resource Group> -to <target node> )
d) Online the "Parent" resource group. ( hagrp -online <Parent Resource Group> -sys <target node> )
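The sequence above can be sketched as a dry-run helper that only prints the commands in order, so the plan can be reviewed before anything is run. The function name `failover_plan` and the node names are hypothetical, and clearing the fault on the child group is an assumption; clear whichever group actually shows the fault.

```shell
# Print (do not run) the commands for moving a child group that has a parent.
# Arguments: parent group, child group, current node, target node.
failover_plan() {
    parent=$1
    child=$2
    current=$3
    target=$4
    echo "hagrp -offline $parent -sys $current"   # a) offline the parent
    echo "hagrp -clear $child -sys $current"      # b) clear the fault (assumed to be on the child)
    echo "hagrp -switch $child -to $target"       # c) switch the child
    echo "hagrp -online $parent -sys $target"     # d) online the parent
}

# Group names taken from the dependency listing; nodeA/nodeB are placeholders.
failover_plan sg_PSGCSGRD01 sg_inf_PSGCSGRD01 nodeA nodeB
```

Printing the plan first keeps the helper safe to try on any machine; the lines can then be pasted one at a time on a cluster node.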
How to Freeze the System, i.e. Maintenance Mode
While a group is frozen you can take down your services, IPs, filesystems, databases and applications, and VCS will NOT monitor them.
To freeze a Group:
#haconf -makerw
#hagrp -freeze <Group name> -persistent
#haconf -dump -makero
To unfreeze a Group:
#haconf -makerw
#hagrp -unfreeze <Group name> -persistent
#haconf -dump -makero
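Both procedures wrap a single hagrp call between opening the configuration read-write and dumping it read-only again. A minimal sketch of that pattern follows; the `HA_RUN` variable and the function names are hypothetical, and by default the sketch only echoes the commands instead of running them.

```shell
# HA_RUN defaults to "echo", so the commands are printed (dry run).
# Set HA_RUN="" on a real cluster node to execute them. Hypothetical knob.
HA_RUN=${HA_RUN-echo}

freeze_group() {
    $HA_RUN haconf -makerw
    $HA_RUN hagrp -freeze "$1" -persistent
    $HA_RUN haconf -dump -makero
}

unfreeze_group() {
    $HA_RUN haconf -makerw
    $HA_RUN hagrp -unfreeze "$1" -persistent
    $HA_RUN haconf -dump -makero
}

# sg_example is a placeholder group name.
freeze_group sg_example
```

Keeping the three steps in one function makes it harder to leave the configuration writable by accident.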
VCS Concepts
How to start VCS when one of the cluster nodes has a problem.
In a two-node cluster, the cluster may not come up if one node is down with some problem. In this case we need to seed the cluster manually when starting it. Run this command on each node that is up:
# /sbin/gabconfig -cx
VCS should then start up. You may have to online some service groups manually.
# hagrp -online <Service Group> -sys <hostname>
If the gabconfig command doesn't work, reconfigure GAB and LLT and try again. Do the following on both nodes. Make sure had and hashadow are not in the process table. Check "ps -ef" and kill them if you have to.
# /sbin/gabconfig -U
# /sbin/lltconfig -U (answer yes)
# /sbin/lltconfig -c
# /sbin/gabconfig -cx
# hastart
VCS should then start up on each node that is up.
Stop & Start Cluster Services
hagrp -freeze <Service Group> -persistent
hastop -local -force or hastop -all -force ( Only VCS services will stop )
hastop -local ( VCS services will stop on the node from which the command is run )
hastop -all ( Stops all VCS services and resource groups )
hastop -all -force ( Stops all VCS services )
hastart ( Starts VCS; needs to be run on each cluster node )
GAB : Group Membership and Atomic Broadcast - dynamically tracks the overall cluster topology: it monitors cluster membership, tracks cluster state, and distributes information whenever a system changes state (reboot, power off, fault, join).
# cat /etc/gabtab
/sbin/gabconfig -c -n 2
Two systems are required to be running GAB to seed the cluster. The seed is the number of systems that must be running GAB before the VCS engine will start. Generally, the rule is n/2 + 1, n being the number of nodes.
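The n/2 + 1 rule uses integer division. A small sketch of the arithmetic follows; the function name `seed_count` is hypothetical.

```shell
# Seed count per the n/2 + 1 rule; shell arithmetic truncates the division.
seed_count() {
    echo $(( $1 / 2 + 1 ))
}

for n in 2 3 4 5; do
    echo "$n nodes -> seed $(seed_count "$n")"
done
```

For the two-node example this gives 2, which matches the `-n 2` in the gabtab line.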
# /sbin/gabconfig -a - reports GAB status, should see a Port a and Port h
- if no Port a, gab is not running and cannot communicate with other nodes
- if no Port h, had is not running
- membership 012 (012 = node IDs 0, 1 and 2 are members)
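The port checks above can be scripted. This is a sketch that reads `gabconfig -a` style text on stdin and reports whether Port a and Port h are listed; the function name `check_gab_ports` is hypothetical and the sample text is illustrative, assuming port lines begin with the word "Port" followed by the port letter.

```shell
# Report whether Port a (GAB) and Port h (had) appear in gabconfig -a output.
# Assumes each port line starts with "Port <letter>".
check_gab_ports() {
    awk '
        $1 == "Port" && $2 == "a" { a = 1 }
        $1 == "Port" && $2 == "h" { h = 1 }
        END {
            print (a ? "port a: present (GAB is running)" : "port a: missing (GAB is not running)")
            print (h ? "port h: present (had is running)" : "port h: missing (had is not running)")
        }'
}

# Illustrative sample with only Port a listed.
sample='GAB Port Memberships
===============================================================
Port a gen a36e0003 membership 012'

printf '%s\n' "$sample" | check_gab_ports
```

On a node, pipe the real output in with `/sbin/gabconfig -a | check_gab_ports`.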
LLT : Low Latency Transport - fast kernel-to-kernel communications; also monitors network communications. Distributes ("load-balances") internode communication across private network links. Traffic distribution, heartbeat, communication notification.
HAD : High Availability Daemon - the primary process running (sometimes referred to as the "VCS engine"). It receives information from various agents regarding resources, forwards it to each member node and updates its own "view" of the cluster. It also monitors hashadow and restarts it if necessary.
/etc/llttab - set frequency of heartbeats
cat /etc/llthosts - to list nodes of a cluster
/sbin/lltconfig -a list
/sbin/lltstat -nvv - can the nodes communicate?
/etc/VRTSvcs/conf/sysname
/etc/llttab
set-node train1 - set system ID numbers (node name)
set-cluster 10 - set cluster ID number
link qfe /dev/qfe:0 - ether - private network using qfe
link hme0 /dev/hme:0 - ether - private network using hme0
/etc/llthosts
0 nys01d-0001a
1 nys01d-0001b
2 nys01d-0001c
hashadow – monitors HAD and, when required, restarts HAD. This also runs on each node of the cluster
Split Brain (Partitioned Cluster) - when multiple systems run a failover application simultaneously, a split-brain condition results. The two application instances are unaware that they are each writing to shared storage. To protect against a split-brain condition, VCS requires a minimum of two Ethernet heartbeat links.
Preventing split brain:
Two or more Ethernet heartbeats
Public Ethernet networks
Disk partitions
Service group heartbeats
Shared disks configured using VCS SCSI reservation agent.
How To Stop & Start Complete Cluster on running Host
/opt/VRTSvcs/bin/hastop -local -force ( This will stop the cluster only )
/sbin/gabconfig -U ( This will unconfigure GAB )
/sbin/lltconfig -U ( This will unconfigure LLT )
/usr/sbin/modinfo | egrep 'llt|gab' ( Check the module IDs )
/usr/sbin/modunload -i "gab_module_id"
/usr/sbin/modunload -i "llt_module_id"
/etc/rc2.d/S70llt start
/etc/rc2.d/S92gab start
hastart
# fsck -F ufs -o f /dev/rdsk/c1t1d0s0
** /dev/rdsk/c1t1d0s0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
MISSING '.' I=1335684 OWNER=65535 MODE=40755
SIZE=512 MTIME=Nov 30 10:30 2006
DIR=/user/ecs/PRSTAT/vmstat/0112
CANNOT FIX, FIRST ENTRY IN DIRECTORY CONTAINS prstat.032501
MISSING '..' I=1335684 OWNER=65535 MODE=40755
SIZE=512 MTIME=Nov 30 10:30 2006
DIR=/user/ecs/PRSTAT/vmstat/0112
CANNOT FIX, SECOND ENTRY IN DIRECTORY CONTAINS prstat.033000
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
229700 files, 6000159 used, 6397069 free (215861 frags, 772651 blocks, 1.7% fragmentation)
Take a look at the output. The corrupted directory is /user/ecs/PRSTAT/vmstat/0112 and the related inode is 1335684. So by clearing inode 1335684 we can clear the directory /user/ecs/PRSTAT/vmstat/0112.
How to clear an inode:
# clri [filesystem device file name] [inode to be cleared]
Eg:
# clri /dev/dsk/c0t0d0s0 1335684
Then run fsck on the filesystem to repair the inconsistency.
# fsck /dev/rdsk/c0t0d0s0
# vxassist shrinkto vg01 2000; vxassist shrinkby vg01 2000
# vxassist growto vg01 2000; vxassist growby vg01 2000
# vxassist addlog volume-name
Example: # vxassist addlog vg01
# vxassist mirror vg01 disk80 disk90
Example to make a 50 MB mirror on the volume called vg01 using any two free disks:
# vxassist mirror vg01 50m layout=mirror
# vxassist make vg01 100m
Example to make a volume called vg01 that is 100 MB in size using the disk disk80:
# vxassist make vg01 100m disk80
# vxassist make vg01 100m layout=raid5
# vxassist mirror vg01
# vxassist maxgrow volume-name # vxassist maxgrow vg01
# vxassist make vg01 50m layout=mirror disk80 disk90
# vxassist -g rootdg mirror vol80 vol90
# vxassist maxsize layout=stripe
# vxassist maxsize layout=raid5
# vxassist move vg01 !disk90
# vxassist make vg01 50m layout=mirror,stripe,log disk80 disk90 disk92 disk95
# vxedit -rf rm volume_name Example #vxedit -rf rm vg01
# vxedit -g rootdg rename disk90 disk80
# vxedit -g homedg set spare=on disk90
# vxedit set comment="comments are here" subdisk01-01
# vxedit set user=ep group=epgrp mode=0666 vg01
# vxdisk clearimport c?t?d?s? Example #vxdisk clearimport c0t0d0s0
# vxdisk list
# vxdisk rm disk## #vxdisk rm disk88
# vxdisk rm c?t?d? #vxdisk rm c0t0d0
5)How to add or bring a disk under Veritas Volume Manager control
# vxdiskadd c?t?d?
OR
# vxdisksetup -i c?t?d?
Note: It might help to newfs the s2 slice of the disk and perform a vxdctl enable to get it to add a disk.
Example: # vxdiskadd c0t0d0; vxdisksetup -i c0t0d0
6)Interactive front end to the vxdisk program in Veritas Volume Manager
# vxdiskadm
# vxvol rdpol prefer volume-name plex-name Example #vxvol rdpol prefer vg01 plex-80
2)How to set a round robin read policy on the volume in Veritas Volume Manager
# vxvol rdpol round volume_name Example #vxvol rdpol round vg01
3)How to put a volume in maintenance mode in Veritas Volume Manager
# vxvol maint volume_name Example #vxvol maint vg01
4)How to stop a volume in a disk group in Veritas Volume Manager
# vxvol -g disk-group stop volume-name Example #vxvol -g homedg stop vg01
Using fsadm:
# fsadm -F vxfs /dir_name
If you need to enable largefile support:
# fsadm -F vxfs -o largefiles /dir_name
# vxrecover -s
# vxrecover -b volume Example #vxrecover -b vg01
# vxrecover -s volume-name Example vxrecover -s vg01
# vxtrace volume-name Example #vxtrace vg01
# vxinfo volume-name Example #vxinfo vg01
# vxsd aslog disk-name volume-name Example #vxsd aslog disk90 vg01
To join subdisk-88 and subdisk-77 to create the new bigger subdisk-99: # vxsd join subdisk-88 subdisk-77 subdisk-99
# vxsd mv subdisk-90 subdisk-80
# vxplex att volume_name plex-name Example #vxplex att vg01 plex-80
# vxiod set 10
# vxiod Example #vxiod set 10
Description vxconfigd is the main daemon of Veritas Volume Manager which must be running at all times. It is started at system startup. You can verify it is running with a ps command: # ps -ef | grep vxconfigd
# vxprint -ht
# vxprint -l plex-name OR # vxprint -lp Example # vxprint -l plex-80; vxprint -lp
# vxprint -l diskname-## OR # vxprint -st Example # vxprint -l disk80; vxprint -st
# vxprint -l volumename OR # vxprint -vl OR # vxprint -vt Example # vxprint -l vg01; vxprint -vl; vxprint -vt
# vxprint -t -v -e 'aslist.aslist.sd_disk="boot-disk-name"' Example #vxprint -t -v -e 'aslist.aslist.sd_disk="bootdisk"'
Description To report disk statistics in Veritas Volume Manager: # vxstat -d
# vxmend off plex-name Example # vxmend off plex-80
# vxmend on plex-name Example # vxmend on plex-80
# vxmend fix clean plex-name Example # vxmend fix clean plex-80
Description To mirror all the volumes on the disk rootdisk to disk90 in Veritas Volume Manager: # vxmirror rootdisk disk90
# vxmake plex plex-name sd=sub-disk-name Example: # vxmake plex plex-80 sd=subdisk-80
# vxmake sd subdisk-80 disk80,0,10000
If you wanted to put another subdisk on this disk then you would have an offset of the size of the previous subdisk (10000 in our case):
# vxmake sd subdisk-81 disk80,10000,20000
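The offset arithmetic generalises: each new subdisk on the same disk starts at the previous subdisk's offset plus its length. A tiny sketch, with the hypothetical helper name `next_sd_offset`, using the disk,offset,length values from the vxmake examples:

```shell
# Offset for the next subdisk = previous offset + previous length.
next_sd_offset() {
    echo $(( $1 + $2 ))
}

# subdisk-80 is disk80,0,10000, so the next subdisk starts at:
next_sd_offset 0 10000
# subdisk-81 is disk80,10000,20000, so a third subdisk would start at:
next_sd_offset 10000 20000
```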
Description To rebuild the partition table after recovering from a root disk failure after re-mirroring the disk in Veritas Volume Manager: # vxmksdpart -g rootdg diskpart 1 0x03 0x01
To display free space on the disks in Veritas volume Manager: # vxdg free
To add the physical disk c0t0d0 in the disk group homedg calling it disk90 in Veritas Volume Manager: # vxdg -g homedg adddisk disk90=c0t0d0
To remove a disk, disk90, from a disk group, homedg, in Veritas Volume Manager: # vxdg -g homedg rmdisk disk90
4).Use the vxdg command to create disk groups. Use the -s option to specify shared mode, as in the following example:
# vxdg -s init logdata c0t3d2
5).Verify the configuration with the following command:
# vxdg list
NAME STATE ID
rootdg enabled 971995699.1025.node1
logdata enabled,shared 972078742.1084.node2
6).Activate the disk group, as follows, before creating volumes:
# vxdg -g logdata set activation=ew
Some important VCS problems and their solutions:
1) How to change the name of a cluster ?
Details:
Use these commands to change the name of a cluster:
# haconf -makerw
# haclus -modify ClusterName [New_ClusterName]
# haconf -dump -makero
Note:
For RAC environments, the cluster name (similar to the hostid) gets stamped onto the private region of the disks. Therefore, if you change the name of the cluster, also update the cluster name on the disks by following these steps:
1. Confirm you are on the master node:
# vxdctl -c mode
2. Update the cluster name stamped on the disks:
# vxdg deport [disk_group]
# vxdg -Cs import [disk_group]
2) VCS CRITICAL V-16-1-10029 VxFEN driver not configured. VCS Stopping. Manually restart VCS after configuring fencing
Note: According to the Veritas Cluster Server Bundled Agent's Guide for VCS 5.0, a SAN disk group is only supported in the Storage Foundation Volume Set (SFVS) environment and therefore needs a different license.
After rebooting the nodes in a cluster, Veritas Cluster Server (VCS) fails to start and the following messages are seen in the /var/VRTSvcs/log/engine_A.log file:
2009/10/22 05:16:47 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:17:02 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:17:17 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:17:33 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:17:48 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:18:03 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:18:18 VCS CRITICAL V-16-1-10029 VxFEN driver not configured. VCS Stopping. Manually restart VCS after configuring fencing
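When the engine log is long, it can help to count CRITICAL entries per message ID before reading them one by one. A minimal sketch, assuming the field layout seen in the lines above (date, time, "VCS", severity, message ID); the function name `count_critical_by_id` is hypothetical.

```shell
# Count CRITICAL engine log entries per message ID.
# Assumes fields: date, time, "VCS", severity, message ID, text.
count_critical_by_id() {
    awk '$3 == "VCS" && $4 == "CRITICAL" { n[$5]++ }
         END { for (id in n) print id, n[id] }' | sort
}

# Three of the log lines shown above.
sample='2009/10/22 05:17:48 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:18:03 VCS CRITICAL V-16-1-10037 VxFEN driver not configured. Retrying...
2009/10/22 05:18:18 VCS CRITICAL V-16-1-10029 VxFEN driver not configured. VCS Stopping. Manually restart VCS after configuring fencing'

printf '%s\n' "$sample" | count_critical_by_id
```

Against a live system the input would be `/var/VRTSvcs/log/engine_A.log` instead of the sample.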
Attempting to start I/O Fencing manually results in the following error:
# ./S97vxfen start
Starting vxfen..
Starting vxfen.. Done
[/etc/rc2.d]# VCS FEN vxfenconfig NOTICE Driver will use SCSI-3 compliant disks.
VCS FEN vxfenconfig ERROR V-11-2-1016 There exists the potential for a preexisting split-brain
The coordinator disks list no nodes which,
are in the current membership. However, they,
also list nodes which are not in the,
current membership.
I/O Fencing Disabled!
This is a clear indication there are pre-existing keys left on the coordinator disks.
Resolution:
Clear these keys to start I/O fencing, using these commands:
# /opt/VRTSvcs/vxfen/bin/vxfenclearpre
# reboot
Note: Reboot all nodes in the cluster
===============================================================
3) Import failed: License has expired or is not available for operation
This error message is generated when failing over a disk group from one server to another in Veritas Cluster Server (VCS) 5.0.
Workaround:
1. Check that the license for Veritas Volume Manager (VxVM) and VCS is valid and not expired.
2. Verify main.cf for the disk group:
DiskGroup test (
DiskGroup = oradg_u01
DiskGroupType = SAN
3. Change DiskGroupType to private with these commands:
# haconf -makerw
# hares -modify test DiskGroupType private
# haconf -dump -makero
The disk group now can fail over to another server.
===============================================================
4) Error when starting cluster with hastart
Details:
Error when starting the cluster with hastart. When the name of the system is changed, the sysname file might not be changed with it. The cluster software checks the names in the configuration files to see if they are consistent. Check that the sysname file has the same system name as the other cluster configuration files. The location of the configuration file is: /etc/VRTSvcs/conf
===============================================================
5)How to change a heartbeat on the fly?
Need to change a heartbeat during the production timeframe. Low Latency Transport (LLT) reads the /etc/llttab file when started; this loads the proper devices for LLT to monitor. It is possible to change the devices that LLT monitors while it is active. To add a new high priority link while LLT is active, use the following command:
lltconfig -t <alias> -d <device> -b ether
To see your results instantly, run:
lltstat -vvn
The new device will show in the output.
===============================================================
6) Getting an error while trying to switch the service group
Exact Error Message
# hagrp -switch asdoraSG -to gsun908
VCS WARNING V-16-1-10484 Group dependency is violated if group asdoraSG goes offline on system gsun909
Details:
Cause:
In an online global dependency, a child service group must be online on a system in the cluster before the parent service group can come online.
The child service group cannot be taken offline while the parent service group is online, however the parent service group can be taken offline while the child service group is online.
Solution:
1. Determine the child and parent service groups.
i.e.
# hagrp -dep asdoraSG
#Parent Child Relationship
asddmSG asdoraSG online global firm
2. Offline the parent service group, switch the child service group to another system and then online the parent service group again.
i.e.
# hagrp -offline asddmSG -sys gsun909
# hagrp -switch asdoraSG -to gsun908
# hagrp -online asddmSG -sys gsun909
===============================================================
7) A disk group under Veritas Cluster Server (VCS) control cannot be deported.
Details:
The DiskGroup resource had the attributes StartVolumes = 0 and StopVolumes = 0.
With these attributes set, the DiskGroup agent uses:
vxdg flush <diskgroup>
vxdg deport <diskgroup>
Manually running vxdg flush <diskgroup> never completed.
Resolution:
Remove and reinstall Storage Foundation.
=====================================================================================================================
8) How to clear a faulted disk group agent?
The output from the command
hastatus -sum
shows the disk group agent faulted on one system.
Workaround:
Clear the fault by stopping and restarting the agent:
# haagent -stop DiskGroup -sys <system>
# haagent -start DiskGroup -sys <system>
===============================================================
9) A mount resource faults. The file system mounts successfully when not under Veritas Cluster Server (VCS) control
Details:
This issue is caused by a syntax error in the main.cf file:
Mount xxx-xxx-Mount (
Critical = 0
MountPoint = "/xxx"
BlockDevice = "dev/vx/dsk/xxxx-dg/xxx"
FSType = vxfs
MountOpt = rw
FsckOpt = "-y"
There is no leading "/" before "dev" on the BlockDevice line.
The BlockDevice line should read:
BlockDevice = "/dev/vx/dsk/xxxx-dg/xxx"
Workaround:
Correct the BlockDevice line, then run the command
hacf -verify /etc/VRTSvcs/conf/config ----> { to verify the syntax of the main.cf file }
===============================================================
10) The Veritas Cluster Server (VCS) utility 'hastatus -sum' shows that a node is stuck in REMOTE_BUILD status
Details:
After upgrading network cards that require a new Solaris network driver (e1000g), a node joining a VCS cluster is stuck in a REMOTE_BUILD state.
Symptoms:
/opt/VRTSvcs/bin/hastatus -sum reports that a node is in a REMOTE_BUILD state.
Conditions:
The Maximum Transmission Unit (MTU) size on the new driver/card is set greater than the corresponding MTU value on the network switch.
Cause:
The MTU size is too high.
Workaround:
1.Change the MTU of the new driver/card to less than or equal to the MTU value on the network switch.
2. Restart LLT.
===============================================================
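Returning to the main.cf error from item 9 (a BlockDevice value missing its leading "/"): a quick scan for that specific mistake can be sketched as below. The function name `check_block_devices` and the sample fragment are hypothetical, and the sketch assumes each attribute sits on its own line in the form `BlockDevice = "value"`.

```shell
# Flag BlockDevice values in main.cf style text that are not absolute paths.
# Assumes one attribute per line, written as: BlockDevice = "value"
check_block_devices() {
    awk '$1 == "BlockDevice" && $2 == "=" {
        v = $3
        gsub(/"/, "", v)
        if (v !~ /^\//)
            print "line " NR ": BlockDevice is not an absolute path: " v
    }'
}

# Hypothetical fragment reproducing the mistake.
sample='Mount app-Mount (
    MountPoint = "/app"
    BlockDevice = "dev/vx/dsk/app-dg/appvol"
    FSType = vxfs
    )'

printf '%s\n' "$sample" | check_block_devices
```

On a node the input would be `/etc/VRTSvcs/conf/config/main.cf`; a clean file produces no output.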
11) How to add the include line to the main.cf file via the command line?.
Details:
Run the following command to create the main.cmd file:
hacf -typetocmd <types.cf>
The main.cmd file is a file of commands that is needed to add the include line to the main.cf file. Each command in main.cmd needs to be run to add the include line to the main.cf file. This can be a lengthy process.
===============================================================
12) How to turn off the notifier?
Details:
In order to disable the notifier, the following can be run from any node in the cluster:
haconf -makerw
hares -modify ntfr Enabled 0
hares -modify ntfr Critical 0
haconf -dump -makero
To re-enable the notifier, follow the above steps by replacing the 0 with a 1 in steps 2 and 3.
The service group will always show partial after the above steps are done. It will never show fully online.
===============================================================
13) The largefiles option does not work ?
Details:
The issue: An error message in /var/VRTSvcs/log/engine_A.log states that the mount option is incompatible with the file system.
Change: The largefiles option was added to the MountOpt attribute for the mount resource.
Resolution: The largefiles option must be enabled for the file system at the operating system level before it can be configured for largefiles within Veritas Cluster Server (VCS). Enable largefiles with this command:
/usr/lib/fs/vxfs/fsadm -o largefiles <mount>
===============================================================
14) Kernel message: Dazed and confused, but trying to continue
Details:
Symptoms: System panic with error messages on boot:
kernel: LLT INFO V-14-1-10009 LLT Protocol available
kernel: device eth1 entered promiscuous mode
kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
kernel: tg3: eth1: Flow control is on for TX and on for RX.
kernel: device eth1 left promiscuous mode
kernel: Uhhuh. NMI received for unknown reason 35 on CPU 0.
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
Cause: Defective hardware was used in the loading of the LLT module.
Resolution: Replace CPU 0.
===============================================================
15) Multi Network Interface Card B (MultiNICB) resource demands high CPU time with Solaris IP Multipathing (IPMP)
Details:
Solaris IP Multipathing (IPMP) is in use and the MultiNICB resource is configured with the UseMpathd attribute enabled (set to 1).
Cause:
During every monitor cycle, the MultiNICB agent checks the system process table for the IPMP daemon process, in.mpathd.
If several MultiNICB resources are configured in the cluster, the agent checks for the IPMP daemon many times every minute, resulting in a higher CPU demand.
Workaround:
To decrease CPU demand:
Increase the MonitorInterval attribute for the MultiNICB resource from the default 10 seconds to 30 seconds.
In cluster configurations that have more than three MultiNICB resources, change the NumThreads attribute from the default of 10 to 1 or 2.
Enhancement request e426856 addresses this issue in a future product or patch release
16) ‘Stale NFS handle’ accessing files exported by NFS clients
Exact Error Message
Stale NFS handle
Details:
On SLES 10 SP1 systems, whenever an NFS resource is faulted and fails over to the second node, the clients, on accessing the exported file system, may see 'Stale NFS handle' errors. This has been resolved by mounting the special file system nfsd before starting the NFS daemons.
Resolution:
Download the 4.1MP4+e1023246.tar_288679.gz file from the Download Now link (ftp.support.veritas.com/pub/support/products) and then unpack the file:
# mv 4.1MP4+e1023246.tar_288679.gz 4.1MP4+e1023246.tar.gz
# cksum 4.1MP4+e1023246.tar.gz
608184018 3487 4.1MP4+e1023246.tar.gz
# gzip -d 4.1MP4+e1023246.tar.gz
# tar -xf 4.1MP4+e1023246.tar
# cd 4.1MP4+e1023246
# cat README
POINT PATCH FOR VERITAS CLUSTER SERVER 4.1 MP4
==============================================
NAME: 4.1MP4+e1023246
DATE: 2007-May-14
VCS RELEASE: 4.1MP4
LINUX RELEASE: SLES 10 SP1
RELEVANT ARCHITECTURES: i586, x86_64, and ia64
ETRACK REFERENCE: 1023246
PROBLEM DESCRIPTION: 'Stale NFS handle' errors seen on accessing files exported by NFS clients when a service group configured with NFS fails over
PATCH CONTAINS
--------------
.
|__ online
|__ README
PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Install this patch after installing Veritas Cluster Server 4.1 MP4, following these steps:
The default value of $VCS_HOME is /opt/VRTSvcs
1. Log in as superuser to the system where the point patch is to be installed.
2. Go to the directory $VCS_HOME/bin/NFS:
# cd $VCS_HOME/bin/NFS
3. Copy online as online.orig on all nodes of the cluster:
# cp online online.orig
4. On each node of the cluster, copy the "online" from this patch to $VCS_HOME/bin/NFS/online:
# cp /PointPatchDir/online ./online
===============================================================
17) Veritas Cluster Server (VCS) I/O Fencing parameters for racing (Solaris)
Details:
When communication between cluster nodes fails, causing the cluster to be divided into sub-clusters, these sub-clusters start a race to grab coordinator disks for data protection (VCS I/O Fencing). vxfen has a mechanism that enables cluster administrators to give larger sub-clusters better odds to win this race. This document describes the differences in implementation between VCS versions and their tunable parameters.
Note: While this mechanism can be used to give larger sub-clusters much better odds to win the race, it cannot be used to guarantee that the larger sub-cluster will always win.
1. How to give the odds
Prior to 4.1 MP2
If the number of nodes in a sub-cluster is less than the number of nodes leaving the original cluster, the sub-cluster repeats reading the coordinator disks to delay the start of the race. By default, the number of reads is calculated as the cube of (the number of leaving nodes). For example, if a 5-node cluster is divided into a 3-node and a 2-node cluster, the 2-node sub-cluster repeats reading coordinator disks 27 (= 3 cubed) times. A tunable parameter, max_read_coord_disk, can be used to change this value, as described later.
4.1 MP2 and 5.0 or later
If the number of nodes in a sub-cluster is less than the number of nodes leaving the original cluster, the sub-cluster waits for a number of seconds before joining the race. This wait time is calculated as (the number of leaving nodes) x 5. For example, if a 5-node cluster is divided into a 3-node and a 2-node cluster, the 2-node sub-cluster will wait for 15 (3 times 5) seconds. Tunable parameters min_delay and max_delay (4.1MP2), or vxfen_min_delay and vxfen_max_delay (5.0 or later), can be employed to change this wait time, as described later.
2. Tunable parameters
The following parameters can be specified in the file /kernel/drv/vxfen.conf. Use/tune these parameters only in situations where you often see a larger sub-cluster losing the vxfen race.
Note that careful and ample testing is required to determine the optimal values for a specific environment.
Prior to 4.1 MP2
max_read_coord_disk: The maximum number of times vxfen will loop reading coordinator disks. If the calculated repeat count exceeds this limit, this value will be used instead.
Default = 25 Min = 1 Max = 1000
4.1MP2
min_delay: The lower limit of the wait time in seconds. If the calculated wait time is below this limit, this value will be used instead.
Default = 1 Min = 1 Max = N/A
max_delay: The upper limit of the wait time in seconds. If the calculated wait time exceeds this limit, this value will be used instead.
Default = 600 Min = N/A Max = 600
Limitations: The implementation in 4.1 MP2 is a subset of the 5.0 (or later) implementation and has some limitations. vxfen in 4.1 MP2 will "round down" the min_delay and max_delay values to a number that is a multiple of 5. For example, if the calculated delay time is 20 and the max_delay specified is 18, the wait time value chosen will be 18. However, vxfen will only wait for 15 seconds, ignoring the remaining 3 seconds. Therefore, to avoid confusion, it is recommended that only numbers that are a multiple of 5 be specified.
Note: With 4.1 MP2, the default and minimum values implemented for min_delay are a bit confusing, as they are both 1 and not a multiple of 5. Magic in the code prevents this value from being rounded down to 0, so specifying 1 here, or using the default value of 1, is safe and will not be a problem. VCS 5.0 or later allows more granularity, so the delay can be specified in any number of seconds within the minimum and maximum boundaries.
5.0 or later
vxfen_min_delay: The lower limit of the wait time in seconds. If the calculated wait time is below this, this value will be used instead.
Default = 1 Min = 1 Max = 600
vxfen_max_delay: The upper limit of the wait time in seconds. If the calculated wait time exceeds this, this value will be used instead.
Default = 60 Min = the value of vxfen_min_delay Max = 600
=====================================================================================================================
On Veritas Cluster Server 4.1/4.1MP1 and Solaris 10 encapsulated rootdisk, the haremajor command prevents the system from starting up due to missing devices under the /dev/vx/dsk directory
Details:
Example:
# haremajor -vx 320 321
LLT INFO V-14-1-10009 LLT Protocol available
GAB INFO V-15-1-20021 GAB available
haremajor 1.1
Using the following major number(s): 320 321
Do you want to continue [y/n]? y
Updating /etc/name_to_major
Backing up swapvol
Generating the new device: swapvol 320 62003
Backing up swapvol
Generating the new device: swapvol 320 62003
Backing up rootvol
Generating the new device: rootvol 320 0
Backing up rootvol
Generating the new device: rootvol 320 0
Backing up var
Generating the new device: var 320 62000
Backing up var
Generating the new device: var 320 62000
Backing up home
Generating the new device: home 320 62002
Backing up home
Generating the new device: home 320 62002
Backing up opt
Generating the new device: opt 320 62001
Backing up opt
Generating the new device: opt 320 62001
If there are any problems, you can backout the changes by restoring the following files:
- /etc/name_to_major.off.363
- /dev/vx/dsk/bootdg/off.swapvol
- /dev/vx/rdsk/bootdg/off.swapvol
- /dev/vx/dsk/bootdg/off.rootvol
- /dev/vx/rdsk/bootdg/off.rootvol
- /dev/vx/dsk/bootdg/off.var
- /dev/vx/rdsk/bootdg/off.var
- /dev/vx/dsk/bootdg/off.home
- /dev/vx/rdsk/bootdg/off.home
- /dev/vx/dsk/bootdg/off.opt
- /dev/vx/rdsk/bootdg/off.opt
To complete re-majoring, reboot your machine with the following command: reboot
**********
Rebooting with command: boot -s
Boot device: /sbus@3,0/SUNW,fas@3,8800000/sd@2,0:a File and args: -s
SunOS Release 5.10 Version Generic_118833-18 64-bit
Copyright 1983-2005 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Hostname: olds14.example.com
NOTICE: VxVM vxdmp V-5-0-34 added disk array DISKS, datype = Disk
The / file system (/dev/vx/rdsk/bootdg/rootvol) is being checked.
WARNING - Unable to repair the / filesystem. Run fsck manually (fsck -F ufs /dev/vx/rdsk/bootdg/rootvol).
Jan 8 16:04:11 svc.startd[7]: svc:/system/filesystem/usr:default:Method "/lib/svc/method/fs-usr" failed with exit sta.
[ system/filesystem/usr:default failed fatally (see 'svcs -x' for details) ]
Requesting System Maintenance Mode
Console login service(s) cannot run
Solution:
A hotfix is available to fix this issue. Please contact Symantec Technical Support to obtain this hotfix.
#!/bin/sh
#TIME=`date "+%Y/%m/%d %H:%M:%S"`
# date needs a leading + before the format string
YYMMDD=`date "+%y%m%d"`
OUTPUT_PATH=${LOGPATH:-/usr/amk/log}
#location of the output file to be stored
LOGFILE=${OUTPUT_PATH}/HW_error_$YYMMDD
# if dir does not exist, create it
if [ ! -d ${OUTPUT_PATH} ];
then
mkdir -pm 755 ${OUTPUT_PATH}
fi
# If logfile does not exist, create it
if [ ! -f ${LOGFILE} ];
then
touch ${LOGFILE}
fi
egrep "temperature|Fan Failure|Insufficient|unconfigured|Intermittent|disconnected|i\
tran_err|Fibre Channel Loop is Down|Persistent|needs maintenance|\
OVERHEATING|fatal|uncorrectable write error|Transport Errors|RAIDarray.driver.3004\
|reset failed|disk not responding|NIC failure|Device voltage problem" \
/var/adm/messages >> ${LOGFILE}
prtdiag -v | grep -i on | grep amber >> ${LOGFILE}
prtdiag -v | grep -i failed >> ${LOGFILE}
iostat -En |grep -i hard >> ${LOGFILE}
An excellent Perl one-liner that I use to find large files and directories on CentOS/RHEL 5:
du -k | sort -n | perl -ne 'if ( /^(\d+)\s+(.*$)/){$l=log($1+.1);$m=int($l/log(1024)); printf ("%6.1f\t%s\t%25s %s\n",($1/(2**(10*$m))),(("K","M","G","T","P")[$m]),"*"x (1.5*$l),$2);}'
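The unit scaling that the one-liner performs can also be sketched as a small shell function built on awk. This is only a rough equivalent: the function name human_kb is hypothetical, and it reproduces the K/M/G/T/P scaling but not the bar of asterisks.

```sh
#!/bin/sh
# human_kb (hypothetical helper): scale a size given in KB to K/M/G/T/P.
human_kb() {
    awk -v kb="$1" 'BEGIN {
        split("K M G T P", unit, " ")
        i = 1
        # Divide by 1024 until the value fits the current unit.
        while (kb >= 1024 && i < 5) { kb /= 1024; i++ }
        printf "%.1f%s\n", kb, unit[i]
    }'
}

# Example: scale every line of du -k output.
# du -k | sort -n | while read -r kb path; do
#     printf '%8s  %s\n' "$(human_kb "$kb")" "$path"
# done
```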
Fs_mark
The tool we will be using is called fs_mark. It was developed by Ric Wheeler (now at Red Hat) to test file systems. Fs_mark will test various aspects of the file system performance, which is interesting, but not the focus of this test. However, in running the tests, fs_mark will conveniently create the file system, which is what is needed.
Using fs_mark, the file system is filled and tested. fs_mark has a large number of options, but we will focus on only a few of them. An example command line for creating 100 million files is the following:
# fs_mark -n 10000000 -s 400 -L 1 -S 0 -D 10000 -N 1000 -d /mnt/home -t 10 -k
where the options are:
- -n 10000000: The number of files to be tested per thread (more on that later)
- -s 400: Each file will be 400KB
- -L 1: Loop the testing once (fs_mark testing)
- -S 0: Issue no sync() or fsync() during the creation of the file system. Since fs_mark is not being used for testing file systems, we just care about creating the file system quickly
- -D 10000: There are 10,000 subdirectories under the main root directory
- -d /mnt/home: The root directory for testing; for this particular test, we are using only 1 root directory
- -N 1000: 1,000 files are allocated per directory
- -t 10: Use 10 threads for building the file system
- -k: Keep the file system after testing
With these options, each of the 10,000 directories holds 1,000 files, for a total of 10 million files. Note that the “-n” option is also set to only 10 million files, because each thread produces “-n” files. Since we have 10 threads and 10 million files per thread, the run creates a total of 100 million files.
Since we have 100 million files and each file is 400KB, the file system uses a total of 40TB. This is about half of the 80TB in the largest file system. With the goal of filling at least 50 percent of the file system for the specified number of files, the resulting file sizes are listed below.
1) 80TB XFS File System
100 Million files: 400KB file size
50 Million files: 800 KB file size
10 Million files: 4,000 KB (4MB) file size
2) 40TB XFS File System
100 Million files: 200KB file size
50 Million files: 400KB file size
10 Million files: 2,000 KB (2MB) file size
3) 10 TB ext4 File System
100 Million Files: 5KB file size
50 Million Files: 10KB file size
10 Million Files: 50KB file size
4) 5TB ext4 File System
100 Million Files: 3KB file size
50 Million Files: 6KB file size
10 Million Files: 30KB file size
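The arithmetic behind the 80TB example can be checked with a few lines of shell. This is a sketch under stated assumptions: it takes the article's reading of -s as kilobytes, uses decimal units (1TB = 10^9 KB), and the helper name half_fill_size_kb is hypothetical.

```sh
#!/bin/sh
# Values taken from the fs_mark command line above.
files_per_thread=10000000   # -n
threads=10                  # -t
file_size_kb=400            # -s, read as KB as in the text (assumption)

total_files=$((files_per_thread * threads))
total_kb=$((total_files * file_size_kb))
# Decimal units, matching the article's arithmetic (1 TB = 10^9 KB).
total_tb=$((total_kb / 1000000000))

echo "total files: $total_files"
echo "total size:  ${total_tb} TB"

# half_fill_size_kb (hypothetical helper): file size in KB needed so that
# file_count files fill half of a file system of fs_tb terabytes.
half_fill_size_kb() {  # usage: half_fill_size_kb fs_tb file_count
    echo $(( $1 * 1000000000 / 2 / $2 ))
}

half_fill_size_kb 80 100000000
half_fill_size_kb 80 10000000
```

The last two calls reproduce the 400KB and 4,000KB entries listed for the 80TB file system.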
I have been using the Veritas Cluster Server Java Console on Windows for quite some time, and I had the urge to get it working on Ubuntu 10.10. Since Symantec provides the Symantec HA Java Console as an rpm package, I had to convert the .rpm package to .deb to install it on my Ubuntu-based notebook.
The steps I followed to get it installed on Ubuntu 10.10:
1) Alien is a program that converts between the rpm, dpkg, stampede slp, and slackware tgz file formats. If you want to use a package from a distribution other than the one installed on your system, you can use alien to convert it to your preferred package format and install it.
2) Install Alien using: sudo apt-get install alien
3) The alien syntax provided on several websites failed for me, so I had to dig through the man page to get the syntax right:
sudo alien -t -k --scripts --veryverbose VCS_Cluster_Manager_Java_Console_5.1_for_Linux.rpm
The above command creates a tarball as well as a .deb package, and that's it.
To install the package: dpkg -i vrtscscm_5.1.00.20-1_all.deb (the install was complete)
To start the Console Manager: sudo /opt/VRTSvcs/bin/hagui &
Linux provides multiple mechanisms to rescan the SCSI bus and recognize SCSI devices exposed to the system. In the 2.4 kernel solutions, these mechanisms were generally disruptive to the I/O since the dynamic LUN scanning mechanisms were not consistent.
With the 2.6 kernel, significant improvements have been made and dynamic LUN scanning mechanisms are available. Linux currently lacks a kernel command that allows for a dynamic SCSI channel reconfiguration like drvconfig or ioscan.
The mechanisms for reconfiguring devices on a Linux host include:
◆ System reboot
◆ Unloading and reloading the modular HBA driver
◆ Echoing the SCSI device list in /proc
◆ Executing a SCSI scan function through attributes exposed to /sys
◆ Executing a SCSI scan function through HBA vendor scripts
1) System reboot
Rebooting the host allows reliable detection of newly added devices.
The host may be rebooted after all I/O has stopped, whether the
driver is modular or statically linked.
2) HBA driver reload
By default, the HBA drivers are loaded in the system as modules.
This allows for the module to be unloaded and reloaded, causing a
SCSI scan function in the process. In general, before removing the
driver, all I/O on the SCSI devices should be quiesced, file systems
should be unmounted, and multipath services need to be stopped. If
there are agents or HBA application helper modules, they should also
be stopped on the system. The Linux utility modprobe provides a
mechanism to unload and load the driver module.
3) SCSI scan function in /proc
In the 2.4 kernel, the /proc file system provides a listing of available
SCSI devices. If SCSI devices exposed to the system are reconfigured,
then these changes can be reflected on the SCSI device list by echoing
the /proc interface.
To add a device, the host, channel, target ID, and LUN numbers for
the device to be added to /proc/scsi/scsi must be identified.
The command to be run follows this format:
# echo "scsi add-single-device 0 1 2 3" > /proc/scsi/scsi
where:
0 is the host ID
1 is the channel ID
2 is the target ID
3 is the LUN
This command will add the new device to the file /proc/scsi/scsi. If
one does not already exist, a device filename may need to be created
for this newly added device in the /dev directory.
To remove a device, use the appropriate host, channel, target ID, and
LUN numbers and issue a command similar to the following:
# echo "scsi remove-single-device 0 1 2 3" > /proc/scsi/scsi
where:
0 is the host ID
1 is the channel ID
2 is the target ID
3 is the LUN
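Because a mistyped number sends the wrong device address to the kernel, it can help to build the string with a small wrapper that checks its arguments first. The following is a minimal sketch; scsi_proc_cmd is a hypothetical helper name, and the function only prints the string, leaving the redirection into /proc/scsi/scsi to the caller.

```sh
#!/bin/sh
# scsi_proc_cmd (hypothetical helper): print the string that would be echoed
# into /proc/scsi/scsi. It prints only; it does not touch /proc itself.
scsi_proc_cmd() {  # usage: scsi_proc_cmd add|remove host channel target lun
    [ $# -eq 5 ] || {
        echo "usage: scsi_proc_cmd add|remove host channel target lun" >&2
        return 1
    }
    action=$1; shift
    case $action in
        add|remove) ;;
        *) echo "action must be add or remove" >&2; return 1 ;;
    esac
    # Every remaining argument must be a non-negative integer.
    for n in "$@"; do
        case $n in
            ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
        esac
    done
    echo "scsi ${action}-single-device $1 $2 $3 $4"
}

# Example (2.4 kernel, as root):
# scsi_proc_cmd add 0 1 2 3 > /proc/scsi/scsi
```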
4) SCSI scan function in /sys
The Host Bus Adapter driver in the 2.6 kernel exports the scan
function to the /sys directory which can be used to rescan the SCSI
devices on that interface. The scan function is available as follows:
# cd /sys/class/scsi_host/host4/
# ls -al scan
# echo "- - -" > scan
The three dash marks refer to channel, target, and LUN numbers. The
above action causes a scan of every channel, target, and LUN visible
through host-bus adapter instance ‘4’.
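To repeat the same scan on every HBA instance rather than only host4, the echo can be wrapped in a loop. This is a sketch, not a vendor-supplied procedure: rescan_scsi_hosts is a hypothetical helper name, and the optional directory argument exists only so the loop can be tried against a scratch directory before touching /sys.

```sh
#!/bin/sh
# rescan_scsi_hosts (hypothetical helper): write the wildcard "- - -" to the
# scan attribute of every SCSI host under the given directory.
# The directory defaults to /sys/class/scsi_host; passing another path is
# only meant for dry runs against a scratch directory.
rescan_scsi_hosts() {
    base=${1:-/sys/class/scsi_host}
    for scan in "$base"/host*/scan; do
        [ -e "$scan" ] || continue      # the glob matched nothing
        echo "- - -" > "$scan" && echo "rescanned ${scan%/scan}"
    done
}

# Example (2.6 kernel, as root):
# rescan_scsi_hosts
```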
5) SCSI scan through Linux distributor provided scripts
Novell’s SuSE Linux Enterprise Server (SLES) provides a script
named /bin/rescan-scsi-bus.sh. It can be found as part of the SCSI
utilities package.
# rpm -qa | grep scsi
yast2-iscsi-server-2.13.26-0.3
yast2-iscsi-client-2.14.42-0.3
open-iscsi-2.0.707-0.44
scsi-1.7_2.36_1.19_0.17_0.97-12.21
xscsi-1.7_2.36_1.19_0.17_0.97-12.21
The following is an example from SLES 10 SP2:
# /bin/rescan-scsi-bus.sh -h
Usage: rescan-scsi-bus.sh [options] [host [host ...]]
Options:
-l activates scanning for LUNs 0-7 [default: 0]
-L NUM activates scanning for LUNs 0-NUM [default: 0]
-w scan for target device IDs 0 .. 15 [default: 0-7]
-c enables scanning of channels 0 1 [default: 0]
-r enables removing of devices [default: disabled]
-i issue a FibreChannel LIP reset [default: disabled]
--remove: same as -r
--issue-lip: same as -i
--forceremove: Remove and readd every device (DANGEROUS)
--nooptscan: don't stop looking for LUNs is 0 is not found
--color: use coloured prefixes OLD/NEW/DEL
--hosts=LIST: Scan only host(s) in LIST
--channels=LIST: Scan only channel(s) in LIST
--ids=LIST: Scan only target ID(s) in LIST
--luns=LIST: Scan only lun(s) in LIST
Host numbers may thus be specified either directly on cmd line (deprecated) or
or with the --hosts=LIST parameter (recommended).
LIST: A[-B][,C[-D]]... is a comma separated list of single values and ranges
(No spaces allowed.)
On Symmetrix and CLARiiON storage systems, you can use a PowerPath pseudo (emcpower) device located on external storage as a root device (the device that contains the startup image). To use a PowerPath pseudo device as the root device, the device must be under LVM control. Once the PowerPath drivers have been loaded, using a PowerPath pseudo device as the root device provides load balancing and path failover for the root device.
Configuring PowerPath in a boot-from-SAN setup (RHEL)
To configure a PowerPath root device using the LVM on a RHEL host:
1. Install RHEL on the host. Configure a single active path to the boot LUN during the initial installation. You attach additional LUNs and configure additional paths at the end of this procedure.
2. Install and configure PowerPath.
3. If you are working on RHEL 5 or IBM PowerPC hosts, no changes to the /etc/fstab file are necessary. By default, the /boot partition is configured to mount by label, as shown in the following example:
/dev/system/LogVol00 / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/system/LogVol01 swap swap defaults 0 0
Use the default mount by label device to mount the /boot partition. If you are working in RHEL 4, you need to add the pseudo device and mount point to /etc/fstab. Note that the mount by label multipath device function does not depend on the PowerPath version installed because it is a feature that the underlying operating system either supports or does not support.
4. Configure additional paths to the storage devices. Attach additional LUNs to the host.
Upgrading the Linux kernel in a boot-from-SAN setup
To upgrade the Linux kernel in a boot-from-SAN setup:
1. Upgrade the kernel, following the steps provided by RedHat and Novell for upgrading the kernel in the host.
2. Before restarting the host, edit the /etc/fstab file to comment out entries that refer to the PowerPath pseudo (emcpower) names. An example /etc/fstab file with a commented out entry for the /boot partition is shown below.
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
#/dev/emcpowera1 /boot ext3 defaults 1 2
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0
none /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol01 swap swap defaults 0 0
/dev/hda /media/cdrom auto pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0 0
/dev/fd0 /media/floppy auto pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0 0
3. Restart the host. On a PowerPath 5.0.1 and later host, PowerPath detects that a new version of Linux has been installed on the host and automatically reinstalls the PowerPath drivers. You do not need to reinstall PowerPath after upgrading Linux.
4. Uncomment all entries in the /etc/fstab file that refer to PowerPath pseudo (emcpower) devices. A modified /etc/fstab file is shown below:
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
/dev/emcpowera1 /boot ext3 defaults 1 2
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0
none /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol01 swap swap defaults 0 0
/dev/hda /media/cdrom auto pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0 0
/dev/fd0 /media/floppy auto pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0 0
5. Run mount -a to ensure that all emcpower partitions in the /etc/fstab file are mounted.
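The comment-out and uncomment steps in the kernel upgrade procedure can be scripted with sed instead of editing by hand. This is a minimal sketch, not part of the EMC procedure: the function names are hypothetical, and both print the edited file to stdout so that the original can be kept as a backup.

```sh
#!/bin/sh
# Both helpers are hypothetical names; they print the edited file to stdout
# and leave the input file untouched.

# Prefix every entry that starts with /dev/emcpower with a '#'.
comment_emcpower() {    # usage: comment_emcpower /etc/fstab
    sed 's|^/dev/emcpower|#&|' "$1"
}

# Remove the '#' from entries that were commented out above.
uncomment_emcpower() {  # usage: uncomment_emcpower /etc/fstab
    sed 's|^#\(/dev/emcpower\)|\1|' "$1"
}

# Example, run before restarting into the upgraded kernel:
# cp /etc/fstab /etc/fstab.pre-upgrade
# comment_emcpower /etc/fstab.pre-upgrade > /etc/fstab
```

After the restart, uncomment_emcpower can be applied the same way, followed by mount -a.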
#!/bin/sh
fila_post=`postqueue -p |grep -e "^[A-Z,0-9]" | grep -v "Mail" |wc -l`
crit=30
warn=25
if [ $fila_post -gt $warn ]; then
if [ $fila_post -le $crit ]; then
echo "WARNING: postqueue -p -> $fila_post"
exit 1
fi
fi
if [ $fila_post -gt $crit ]; then
echo "CRITICAL: postqueue -p -> $fila_post"
exit 2
fi
if [ $fila_post -le $warn ]; then
echo "OK: postqueue -p -> $fila_post"
exit 0
fi
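The threshold logic of the check above can be isolated in one function, which makes the boundary cases easier to verify without a running Postfix. This is only a sketch: queue_status is a hypothetical name, and the defaults of 25 and 30 are copied from the script.

```sh
#!/bin/sh
# queue_status (hypothetical helper): map a queue length to a Nagios-style
# status word and return code, using the same thresholds as the script above.
queue_status() {  # usage: queue_status count [warn] [crit]
    count=$1 warn=${2:-25} crit=${3:-30}
    if [ "$count" -gt "$crit" ]; then
        echo CRITICAL; return 2
    elif [ "$count" -gt "$warn" ]; then
        echo WARNING; return 1
    fi
    echo OK; return 0
}

# Example:
# queue_status "$(postqueue -p | grep -c '^[A-Z0-9]')"
```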
The script below reports memory statistics on Solaris:
#!/bin/sh
totalmb=`/usr/sbin/prtconf|/usr/bin/grep "Memory size"|/usr/bin/awk '{print $3}'`
freekb=`/usr/bin/vmstat 1 2|/usr/bin/tail -1|/usr/bin/awk '{print $5}'`
freemb=`/usr/bin/echo $freekb/1024 | /usr/bin/bc`
usedmb=`/usr/bin/echo $totalmb-$freemb|/usr/bin/bc`
/usr/bin/echo "Total Memory : $totalmb MB"
/usr/bin/echo "Used Memory : $usedmb MB"
/usr/bin/echo "Free Memory : $freemb MB"
This script performs a few routine checks of the specified database and sends an email message to the specified user if anything goes wrong.
This script is suitable for running via cron. As such, it generates no standard output. All output is written to a log. You will need to edit this script before using it, either to invoke your password lookup facility or to otherwise supply the correct Oracle password.
#!/bin/ksh
#
# dbmon.sh
# ========
#
#
# Usage: dbmon.sh -d sid -n notify_list
# sid is the ORACLE_SID of the database to monitor.
# notify_list is a list of comma-separated email addresses where email
# should be sent if any problems are discovered.
#
#
#
# The quit_dbmon function sends a failure email message, cleans up, and exits.
#
function quit_dbmon
{
echo $1 >> $LOGFILE
mailx -s"Database $ORACLE_SID requires attention" $NOTIFY < $LOGFILE
rm -f $TMPLOG $LOGFILE $TMPFILE $ALRTFILE
exit 1
}
#
# Parse the command line arguments.
#
USAGE="Usage: `basename $0` -d sid -n notify_list"
SID=""
NOTIFY=""
while getopts :d:n: opt
do
case "$opt" in
"d") SID="$OPTARG" ;;
"n") NOTIFY="$OPTARG" ;;
":"|"?") echo "$USAGE" 1>&2 ; exit 1 ;;
esac
done
let i=$#+1
if [ "$i" != "$OPTIND" -o -z "$SID" -o -z "$NOTIFY" ]
then
echo "$USAGE" 1>&2
exit 1
fi
#
# Set up the Oracle environment.
#
export ORACLE_SID="$SID"
export ORAENV_ASK="NO"
export PATH="$PATH:/usr/local/bin"
. /usr/local/bin/oraenv
# Note: If oraenv does not set $ORACLE_BASE, you must set it here.
#
# Set up the general environment.
#
STATFILE="$ORACLE_BASE/local/tools/.dbmon${ORACLE_SID}.stat"
# Note: You may need to edit the above assignment; choose an appropriate location
# for the stat file.
LOGFILE="/tmp/dbmon$$.log"
TMPLOG="/tmp/dbmon$$.logtmp"
TMPFILE="/tmp/dbmon$$.tmp"
ALRTFILE="/tmp/dbmon$$.alr"
> $TMPFILE
> $ALRTFILE
cat <<EOF > $LOGFILE
Date: `date`
Server: `uname -n`
Database: $ORACLE_SID
EOF
#
# Get the necessary Oracle password.
#
ORAPWD="`<enter your password lookup service here> $ORACLE_SID system 2>&1`"
# If you want to hardcode the Oracle password, change the above line to read:
# ORAPWD="<enter your password here>"
if [ "$?" != "0" ]
then
echo $ORAPWD >> $LOGFILE
quit_dbmon "Unable to get Oracle password for system schema"
fi
#
# Prepare a SQL*Plus script to check various things on the database.
#
echo "system/$ORAPWD" > $TMPFILE
chmod 700 $TMPFILE
cat <<\EOF >> $TMPFILE
WHENEVER SQLERROR EXIT FAILURE 1
WHENEVER OSERROR EXIT FAILURE 2
DEFINE extent_threshold = 10
DEFINE space_threshold = 52428800
SET TERMOUT OFF
SET SERVEROUTPUT ON SIZE 100000
SET HEADING OFF
SET TRIMSPOOL ON
SET FEEDBACK OFF
SET VERIFY OFF
SET PAGESIZE 0
SPOOL &1
SELECT A.value || '/alert_' || B.value || '.log'
FROM v$parameter A, v$parameter B
WHERE A.name = 'background_dump_dest'
AND B.name = 'db_name';
SPOOL OFF
SET TERMOUT ON
SET PAGESIZE 999
SELECT 'SEGMENTS WITHIN &extent_threshold EXTENTS OF MAXEXTENTS'
FROM SYS.dual
WHERE EXISTS
(
SELECT 'x'
FROM SYS.dba_segments
WHERE owner NOT IN ('SYS', 'TECHWEB')
AND extents > max_extents - &extent_threshold
);
SET HEADING ON
COL owner FORMAT a8
COL segment_type FORMAT a6 TRUNCATE
COL segment_name FORMAT a28 TRUNCATE
COL extents FORMAT 990
COL max_extents FORMAT 990
COL bytes FORMAT 9,999,999,990
SELECT owner, segment_type, segment_name, extents, max_extents, bytes
FROM SYS.dba_segments
WHERE owner NOT IN ('SYS', 'TECHWEB')
AND extents > max_extents - &extent_threshold
ORDER BY owner, segment_type, segment_name;
DECLARE
CURSOR c_ts IS
SELECT tablespace_name, SUM (bytes) siz
FROM SYS.dba_data_files
GROUP BY tablespace_name
ORDER BY tablespace_name;
CURSOR c_segments IS
SELECT owner, segment_type, segment_name, tablespace_name, next_extent
FROM SYS.dba_segments
ORDER BY owner, segment_type, segment_name;
TYPE varchararray IS TABLE OF VARCHAR2(30) INDEX BY BINARY_INTEGER;
TYPE numberarray IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
v_ts_names varchararray;
v_ts_sizes numberarray;
v_free_spaces numberarray;
v_biggest_frees numberarray;
v_ts_count NUMBER := 0;
v_first_display BOOLEAN := TRUE;
i NUMBER;
j NUMBER;
BEGIN
FOR r IN c_ts LOOP
v_ts_count := v_ts_count + 1;
v_ts_names(v_ts_count) := r.tablespace_name;
v_ts_sizes(v_ts_count) := r.siz;
SELECT NVL (SUM(bytes), 0), NVL (MAX(bytes), 0)
INTO v_free_spaces(v_ts_count), v_biggest_frees(v_ts_count)
FROM SYS.dba_free_space
WHERE tablespace_name = r.tablespace_name;
--
-- If a tablespace other than TEMP has sufficient free space
-- but it is not contiguous, then coalesce the free space for
-- the tablespace and check how much contiguous space is free
-- after the coalesce.
--
IF r.tablespace_name != 'TEMP' AND
v_biggest_frees(v_ts_count) < &space_threshold AND
v_free_spaces(v_ts_count) > &space_threshold THEN
i := dbms_sql.open_cursor;
dbms_sql.parse (i, 'ALTER TABLESPACE ' || r.tablespace_name ||
' COALESCE', dbms_sql.native);
j := dbms_sql.execute (i);
dbms_sql.close_cursor (i);
SELECT NVL (SUM(bytes), 0), NVL (MAX(bytes), 0)
INTO v_free_spaces(v_ts_count), v_biggest_frees(v_ts_count)
FROM SYS.dba_free_space
WHERE tablespace_name = r.tablespace_name;
END IF;
--
-- On the TEMP tablespace we just look for ample free space. On all other
-- tablespaces, we look for ample contiguous free space.
--
IF (r.tablespace_name != 'TEMP' AND
v_biggest_frees(v_ts_count) < &space_threshold) OR
(r.tablespace_name = 'TEMP' AND
v_free_spaces(v_ts_count) < &space_threshold) THEN
IF v_first_display THEN
dbms_output.put_line (CHR(9));
dbms_output.put_line ('TABLESPACES WITH FEWER THAN ' ||
'&space_threshold CONTIGUOUS BYTES OF FREE SPACE');
dbms_output.put_line (CHR(9));
dbms_output.put_line ('TABLESPACE_NAME TOTAL_SIZE ' ||
' FREE_SPACE BIGGEST_FREE');
dbms_output.put_line ('-------------------- -------------- ' ||
'-------------- --------------');
v_first_display := FALSE;
END IF;
dbms_output.put_line (RPAD (SUBSTR (v_ts_names(v_ts_count), 1, 20), 20)
|| ' ' || TO_CHAR (v_ts_sizes(v_ts_count), '9,999,999,990')
|| ' ' || TO_CHAR (v_free_spaces(v_ts_count), '9,999,999,990')
|| ' ' || TO_CHAR (v_biggest_frees(v_ts_count), '9,999,999,990'));
END IF;
END LOOP;
v_first_display := TRUE;
FOR r IN c_segments LOOP
i := v_ts_count;
LOOP
EXIT WHEN i = 0 OR v_ts_names(i) = r.tablespace_name;
i := i - 1;
END LOOP;
IF i > 0 THEN
IF v_biggest_frees(i) < r.next_extent THEN
IF v_first_display THEN
dbms_output.put_line (CHR(9));
dbms_output.put_line ('SEGMENTS WHERE NOT ENOUGH FREE SPACE ' ||
'EXISTS TO ALLOCATE ANOTHER EXTENT');
dbms_output.put_line (CHR(9));
dbms_output.put_line ('OWNER TYPE SEGMENT_NAME ' ||
' TABLESPACE_NAME DESIRED_NEXT');
dbms_output.put_line ('---------- -------- ---------------' ||
'--------------- --------------- ------------');
v_first_display := FALSE;
END IF;
dbms_output.put_line (RPAD (SUBSTR (r.owner, 1, 10), 10) || ' ' ||
RPAD (SUBSTR (r.segment_type, 1, 8), 8)
|| ' ' ||
RPAD (r.segment_name, 30) || ' ' ||
RPAD (SUBSTR (r.tablespace_name, 1, 15), 15) || ' ' ||
TO_CHAR (r.next_extent, '999,999,990'));
END IF;
END IF;
END LOOP;
END;
/
EXIT 7
EOF
#
# Run the SQL*Plus script.
#
sqlplus -s @$TMPFILE $ALRTFILE > $TMPLOG 2>&1
RETCODE="$?"
cat $TMPLOG >> $LOGFILE
case "$RETCODE" in
"0") quit_dbmon "Unable to connect to database" ;;
"7") ;;
*) quit_dbmon "SQL*Plus exited with error code $RETCODE" ;;
esac
#
# Look for the most recent ORA-600 or ORA-12012 error in the alert log.
#
if [ -s "$ALRTFILE" ]
then
ALERT_LOG="`cat $ALRTFILE`"
grep -n "ORA-00600" $ALERT_LOG | tail -1 | cut -f 1 -d : | read NEW_LINE_NO1
grep -n "ORA-12012" $ALERT_LOG | tail -1 | cut -f 1 -d : | read NEW_LINE_NO2
if [ "$NEW_LINE_NO1" -gt "$NEW_LINE_NO2" ]
then
NEW_LINE_NO="$NEW_LINE_NO1"
else
NEW_LINE_NO="$NEW_LINE_NO2"
fi
if [ -s "$STATFILE" ]
then
OLD_LINE_NO="`cat $STATFILE`"
else
OLD_LINE_NO=""
fi
if [ "$NEW_LINE_NO" != "$OLD_LINE_NO" ]
then
echo "\nCheck alert log $ALERT_LOG\nfor ORA-00600 or ORA-12012 error at line $NEW_LINE_NO"\
>> $LOGFILE
echo $NEW_LINE_NO > $STATFILE
echo "x" > $TMPLOG
fi
fi
#
# If any problems were found, then send email.
#
[ -s "$TMPLOG" ] && quit_dbmon ""
#
# The monitor was completed successfully. Now clean up and exit.
#
rm -f $TMPLOG $LOGFILE $TMPFILE $ALRTFILE
exit 0
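The alert log check in dbmon.sh boils down to two steps: find the line number of the newest ORA-00600 or ORA-12012 entry, then compare it with the number saved on the previous run. The following is a simplified sketch of that idea in plain sh, not a drop-in replacement; both function names are hypothetical, and unlike the original it stays quiet when the log contains no matching error at all.

```sh
#!/bin/sh
# last_error_line (hypothetical helper): print the line number of the most
# recent ORA-00600 or ORA-12012 entry in a log file, or nothing if none exists.
last_error_line() {  # usage: last_error_line /path/to/alert.log
    grep -n -e 'ORA-00600' -e 'ORA-12012' "$1" | tail -1 | cut -d: -f1
}

# error_is_new (hypothetical helper): succeed only when the newest error line
# differs from the one recorded in the stat file, then record the new value.
error_is_new() {     # usage: error_is_new alert.log statfile
    new=$(last_error_line "$1")
    [ -n "$new" ] || return 1
    old=$(cat "$2" 2>/dev/null || true)
    [ "$new" != "$old" ] || return 1
    echo "$new" > "$2"
}

# Example:
# error_is_new "$ALERT_LOG" "$STATFILE" && echo "new error in $ALERT_LOG"
```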
The script below lists the crontab of every user. It has been tested on Solaris 10 servers (SPARC and x86).
#!/bin/sh
getent passwd | cut -d: -f1 | perl -e'while(<>){chomp;$l = `crontab -l $_ 2>/dev/null`;print "$_\n$l\n" if $l}'
The script below lists all cron jobs run by the different users; it must be run with root privileges. It has been tested on RHEL-based Linux servers.
#!/bin/bash
# System-wide crontab file and cron job directory. Change these for your system.
CRONTAB='/etc/crontab'
CRONDIR='/etc/cron.d'
# Single tab character. Annoyingly necessary.
tab=$(echo -en "\t")
# Given a stream of crontab lines, exclude non-cron job lines, replace
# whitespace characters with a single space, and remove any spaces from the
# beginning of each line.
function clean_cron_lines() {
while read line ; do
echo "${line}" |
egrep --invert-match '^($|\s*#|\s*[[:alnum:]_]+=)' |
sed --regexp-extended "s/\s+/ /g" |
sed --regexp-extended "s/^ //"
done;
}
# Given a stream of cleaned crontab lines, echo any that don't include the
# run-parts command, and for those that do, show each job file in the run-parts
# directory as if it were scheduled explicitly.
function lookup_run_parts() {
while read line ; do
match=$(echo "${line}" | egrep -o 'run-parts (-{1,2}\S+ )*\S+')
if [[ -z "${match}" ]] ; then
echo "${line}"
else
cron_fields=$(echo "${line}" | cut -f1-6 -d' ')
cron_job_dir=$(echo "${match}" | awk '{print $NF}')
if [[ -d "${cron_job_dir}" ]] ; then
for cron_job_file in "${cron_job_dir}"/* ; do # */ <not a comment>
[[ -f "${cron_job_file}" ]] && echo "${cron_fields} ${cron_job_file}"
done
fi
fi
done;
}
# Temporary file for crontab lines.
temp=$(mktemp) || exit 1
# Add all of the jobs from the system-wide crontab file.
cat "${CRONTAB}" | clean_cron_lines | lookup_run_parts >"${temp}"
# Add all of the jobs from the system-wide cron directory.
cat "${CRONDIR}"/* | clean_cron_lines >>"${temp}" # */ <not a comment>
# Add each user's crontab (if it exists). Insert the user's name between the
# five time fields and the command.
while read user ; do
crontab -l -u "${user}" 2>/dev/null |
clean_cron_lines |
sed --regexp-extended "s/^((\S+ +){5})(.+)$/\1${user} \3/" >>"${temp}"
done < <(cut --fields=1 --delimiter=: /etc/passwd)
# Output the collected crontab lines. Replace the single spaces between the
# fields with tab characters, sort the lines by hour and minute, insert the
# header line, and format the results as a table.
cat "${temp}" |
sed --regexp-extended "s/^(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(.*)$/\1\t\2\t\3\t\4\t\5\t\6\t\7/" |
sort --numeric-sort --field-separator="${tab}" --key=2,2 --key=1,1 |
sed "1i\mi\th\td\tm\tw\tuser\tcommand" |
column -s"${tab}" -t
rm --force "${temp}"
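The step that inserts the user name between the five time fields and the command is the least obvious part of the script, so here it is on its own. This is a sketch with a hypothetical function name; it uses [^ ] instead of \S so that it does not depend on GNU sed extensions, and it assumes the user name contains no characters that are special to sed (such as / or &).

```sh
#!/bin/sh
# add_user_field (hypothetical helper): read cleaned crontab lines on stdin
# and insert the given user name between the five time fields and the command.
add_user_field() {  # usage: crontab -l -u alice | add_user_field alice
    sed -E "s/^(([^ ]+ +){5})(.+)$/\1$1 \3/"
}

# Example:
# echo '0 5 * * * /usr/bin/backup' | add_user_field alice
```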












