The SAN Guy

tips and tricks from an IT veteran…

First day at EMC World 2013

It’s been a great first day at EMC World 2013 so far.  I’ve been to three breakout sessions, two of which were very informative and useful. I focused my day today on learning more about best practices for the EMC technologies that I already use.  I’m not going to go into great detail now as I need to get to the next session. :) The notes I made are specific to a brocade switch implemenation as that’s what I use. Here are some useful bullet points from a few of my sessions today.

Storage Area Networking Best Practices:

1.  Single Initiator Zoning.  Only put one initiator and the targets it will access in one zone.  I’ve already practiced this for many years, but it was interesting to hear the reasons for doing it that way.  It drastically reduces the number of server queries to the switch.

2.  Dynamic Interface Management.  Use Brocade port fencing.  It can prevent a SAN outage by giving you the ability to shut down a single host or port.

3.  Monitor for Congestion.  Congestion from one host can spread an cause problems in other operating environments.  Definitely enable Brocade bottleneck detection.

4.  Periodic SAN health checks.  Use the EMC SANity or Brocade SAN health tools regularly.  If you’re reading this, you’re verly likely already doing that. :)

5.  Monitor for bit errors.  These can lead to bb_credit loss.  The Brocade parameter portcvglongdistance should be set (bb_credit recovery option), which will prevent the problem.  Bit errors could still be a problem on F Ports, however.

6.  Cable Hygeine.  Always clean your cables!  They are a major contributor to physical connectivity problems.

7.  Target Releases.  Always run a target release of Firmware.

VNX NAS Configuration Best Practices:

1. For transactional NAS, always use the correct transfer size for your workload.  The latest VNX OE release supports up to 1MB.

2. Use Jumbo frames end to end.

3. 10G is available now. Don’t use 1G anymore!

4. Use link aggregation for redundancy and load balancing, failsafe for high availability.

5. Use AVM for typical NAS configurations, MVM for transactional NAS where you have high concurrency from a single NFS stream to a single NFS export. 

6. Use Direct Writes for apps that use concurrent IO or are asynchronous (Oracle, VMWare, etc.).

7. For NAS Pool LUNs, use thick LUNs that are all the same size, balance the SP designation for each one, use the same tiering policy for each one, use approximately 1 LUN for every 4 drives, and use thin enabled file systems to maximize the consumption of the highest tier.

8. Always enable FASTCache for NAS LUNs.  They benefit from it even more than block LUNs.

 

Magic Quadrant for Storage Vendors

I was reading over the Gartner report today that dives into specifics on the strengths and weaknesses of the big players in the storage market.  The original Gartner article can be viewed here: http://www.gartner.com/technology/reprints.do?id=1-1EL3WXN&ct=130321&st=sb.

It came as no surprise to me that EMC was ranked as the market leader, with NetApp as #2, Hitachi #3, and IBM #4.  I’ve been very pleased with the performance and reliability of EMC products in my environment and their global support is top notch.  Below is the magic quadrant from the article. marketleaders

Relocating Celerra Checkpoints

On our NS-960 Celerra we have multiple storage pools defined and one of them that is specifically defined for checkpoint storage. Out Of the 100 or so file systems that we run a checkpoint schedule on, about 2/3 of them were incorrectly writing their checkpoints to the production file system pool rather than the defined checkpoint pool, and the production pool was starting to fill up. I started to research how you could change where checkpoints are stored. Unfortunately you can’t actually relocate existing checkpoints, you need to start over.

In order to change where checkpoints are stored, you need to stop and delete any running replications (which automatically store root replication checkpoints) and delete all current checkpoints for the specific file system you’re working on. Depending on your checkpoint schedule and how often it runs you may want to pause it temporarily. Having to stop and delete remote replications was painful for me as I compete for bandwidth with our data domain backups to our DR site. Because of that I’ve been working on these one at a time.

Once you’ve deleted the relevant checkpoints and replications, you can choose where to store the new checkpoints by creating a single checkpoint first before making a schedule. From the GUI, go to Data Protection | Snapshots | Checkpoints tab, and click create. If a checkpoint does not exist for the file system, it will give you a choice of which pool you’d like to store it in. Below is what you’ll see in the popup window.

——————————————————————————–
Choose Data Mover [Drop Down List ▼]  
Production File System [Drop Down List ▼]  
Writeable Checkpoint [Checkbox]:
Data Movers: server_2
Checkpoint Name: [Fill in the Blank]

Configure Checkpoint Storage:
There are no checkpoints currently on this file system. Please specify how to allocate
storage for this and future checkpoints of this file system.

Create from:
*Storage Pool [Option Box]
*Meta Volume [Option Box]

——————————————————————————–
Current Storage System: CLARiiON CX4-960 APM00900900999
Storage Pool: [Drop Down List ▼]    <—*Select the appropriate Storage Pool Here*
Storage Capacity (MB): [Fill in the Blank]

Auto Extend Configuration:
High Water Mark: [Drop Down List for Percentage ▼]  

Maximum Storage Capacity (MB): [Fill in the Blank]

——————————————————————————–

Powerpath Install / Upgrade Issues

I recently had several issues when attempting to upgrade a Windows 2008 server from Powerpath v5.3 to v5.5 SP1. I uninstalled 5.3 using the windows utility, rebooted, then reinstalled v5.5 SP1. After the reboot, the server did not come back up. In order to get it to boot, the “last known good configuration” option had to be chosen. I opened an SR with EMC, and they determined that the uninstall process was not completing correctly.

To resolve the problem, you need to run the executable from the command line and add a few parameters. The name of the Powerpath install file will vary depending on the version you are installing, but the command looks like this:

EMCPower.Net32.signed.5.3.b310.exe /v”/L*v C:\logs\PPremove.log NO_REBOOT=1 PPREMOVE=ALL”

In this example, the c:\logs directory must exist before you run it. After running that command to uninstall powerpath and then reinstalling the new version, I no longer had the problem of the server not booting correctly.

After properly installing it, I continued to have a problem with Powerpath administrator not properly recognizing the devices. All of the devices showed up as “DEV ??”. I also saw “harddisk ??” when running powermt display dev=all. To resolve the problem I ran through the following steps:

1. Open Device Manager under Disks, and right-click the device drive that had a yellow ‘!’.
2. Choose “Update Driver Software”.
3. Click on “Browse my comptuer for driver software”.
4. Click on “Let me pick from a list of device drivers on my computer”.
5. In the next screen make sure the “Show compatible hardware” box is checked.
6. Under the Model list you should see the ‘PowerPath Devices’ driver. Highlight it and click next. This will install the PowerPath Driver. When it is done it will require a reboot. Once the server has come back online run another: ‘powermt display dev=all’ to see that the harddisk?? will have changed to harddisk## as expected.

Close requests fail on data mover when using Riverbed Steelhead appliance

We recently had a problem with one of our corporate applications having file close requests fail, resulting in 200,000+ open files on our production data mover.  This was causing numerous issues within the application.  We determined that the problem was a result of our Riverbed Steelhead appliance requiring a certain level of DART code in order to properly close the files.  The Steelhead applicance would fail when attempting to optimize SMBV2 connections.

Because a DART code upgrade is required to resolve the problem, the only temporary fix is to reboot the data mover.  I wrote a quick script on the Celerra to grab the number of open files, write it to a text file, and publish to our internal web server.  The command to check how many open files are on the data mover is below.

This command provides all the detailed information:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Timestamp   Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
11:15:36     3379      905     9584        11        9      272        30         1856     4915   

server_2    CIFS     CIFS     CIFS    CIFS Avg   CIFS     CIFS    CIFS Avg     CIFS       CIFS
Summary     Total    Read     Read      Read     Write    Write    Write       Share      Open
            Ops/s    Ops/s    KiB/s   Size KiB   Ops/s    KiB/s   Size KiB  Connections   Files
Minimum      3379      905     9584        11        9      272        30         1856     4915   
Average      3379      905     9584        11        9      272        30         1856     4915   
Maximum      3379      905     9584        11        9      272        30         1856     4915   

Adding a grep for Maximum and using awk to grab only the last column, this command will output only the number of open files, rather than the large output above:

/nas/bin/server_stats server_2 -monitor cifs-std -interval 1 -count 1 | grep Maximum | awk ‘{print $10}’

The output of that command would simply be ‘4915’ based on the sample full output I used above.

The solution number from Riverbed’s knowledgebase is S16257.  Your DART code needs to be at least 6.0.60.2 or 7.0.52.  You will also see in your steelhead logs a message similar to the one below indicating that the close request has failed for a particular file:

Sep 1 18:19:52 steelhead port[9444]: [smb2cfe.WARN] 58556726 {10.0.0.72:59207 10.0.0.72:445} Close failed for fid: 888819cd-z496-7fa2-2735-0000ffffffff with ntstatus: NT_STATUS_INVALID_PARAMETER

Collecting info on active shares, clients, protocols, and authentication on the Celerra

I had a comment in one of my Celerra posts asking for more specific info on listing and counting shares, clients, protocol and authentication information, as well as virus scan information.  I knew the answers to most of those questions however  I’d need to pull out the Celerra documentation from EMC for virus scan info.  I thought it might be more useful to put this information into a new post rather than simply replying to a comment on an old post.

How to list and count shares/exports (by protocol):

You can use the server_export command to list and count how many shares you have.

server_export server_2 -Protocol cifs -list -all:  This command will give you a list of all CIFS Shares across all of your CIFS servers.  It will count a file system twice if it’s shared on more than one CIFS server.

server_export server_2 -Protocol cifs -list -all | grep <CIFS_Server_Name>:  This will give you a list of all CIFS shares on a specific CIFS server

server_export server_2 -Protocol cifs -list -all | grep <CIFS_Server_Name> | wc:  This will give you the number of CIFS shares on a specific CIFS server.  The “wc” command will output three numbers, the first number listed is the number of shares.

server_export server_2 -Protocol nfs -list -all:  This command will give you a list of all NFS exports.  It’s just like the previous commands, you can add “| wc” at the end to get a count.

How to list client connections by OS type:

To obtain information for the number of client connections you have by OS type, you’d have to use the server_cifs audit command.

To get a full list of every active connection by client type, use this command:

server_cifs server_2 -option audit,full | grep “Client(“

The output would look like this:

|||| AUDIT Ctx=0x022a6bdc08, ref=2, W2K Client(10.0.5.161) Port=49863/445
|||| AUDIT Ctx=0x01f18efc08, ref=2, XP Client(10.0.5.42) Port=3890/445
|||| AUDIT Ctx=0x02193a2408, ref=2, W2K Client(10.0.5.61) Port=59027/445
|||| AUDIT Ctx=0x01b89c2808, ref=2, Fedora6 Client(10.0.5.194) Port=17130/445
|||| AUDIT Ctx=0x0203ae3008, ref=2, Fedora6 Client(10.0.5.52) Port=55731/445

In this case, if I wanted to count only the number of Fedora6 clients, I’d use the command server_cifs server_2 -option audit,full | grep “Fedora6 Client”.  I could then add “| wc” at the end to get a count.

To do a full audit report:

The command server_cifs server_2 -option audit,full will do a full, detailed audit report and should capture just about anything else you’d need.  Every connection will have a detailed audit included in the report. Based on that output, it would be easy to run the command with a grep statement to pull only the information out that you need to create custom reports.

Below is a subset of what the output looks like from that command:

|||| AUDIT Ctx=0×0177206407, ref=2, W2K8 Client(10.0.0.1) Port=65340/445
||| CIFSSERVER1[DOMAIN] on if=<interface_name>
||| CurrentDC 0x0169fee808=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=c2de9f99-1945-11e2-a512-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0×1 NTcred(0x0125fc9408 RC=2 KERBEROS Capa=0×2) ‘DOMAIN\Username’
|| Cnxp(0x0230414c08), Name=FileSystem1, cUid=0×1 Tid=0×1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem1
| NTFSExtInfo: shareFS:fsid=49, rc=830, listFS:nb=1 [fsid=49,rc=830]
 
|||| AUDIT Ctx=0x0210f89007, ref=2, W2K8 Client(10.0.0.1) Port=51607/445
||| CIFSSERVER1[DOMAIN] on <interface_name>
||| CurrentDC 0x01f79aa008=<Domain_Controller>
||| Proto=SMB2.10, Arch=Win2K8, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| Client GUID=5b410977-bace-11e1-953b-005056af0
||| SMB2 credits: Granted=31, Max=500
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0×1 NTcred(0x01c2367408 RC=2 KERBEROS Capa=0×2) ‘DOMAIN\Username’
|| Cnxp(0x0195d2a408), Name=Filesystem2, cUid=0×1 Tid=0×1, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=0
| Absolute path of the share=\Filesystem2
| NTFSExtInfo: shareFS:fsid=4214, rc=19, listFS:nb=0
 
|||| AUDIT Ctx=0x006aae8408, ref=2, XP Client(10.0.0.99) Port=1258/445
||| CIFSSERVER1[DOMAIN] on if=<interface_name>
||| CurrentDC 0x01f79aa008=<Domain_Controller>
||| Proto=NT1, Arch=Win2K, RemBufsz=0xffff, LocBufsz=0xffff, popupMsg=1
||| 0 FNN in FNNlist NbUsr=1 NbCnx=1
||| Uid=0x3f NTcred(0x01ccebc008 RC=2 KERBEROS Capa=0×2) ‘DOMAIN\Username’
|| Cnxp(0x019edd7408), Name=Filesystem8, cUid=0x3f Tid=0x3f, Ref=1, Aborted=0
| readOnly=0, umask=22, opened files/dirs=3
| Absolute path of the share=\Filesystem8
| NTFSExtInfo: shareFS:fsid=35, rc=43, listFS:nb=0
| Fid=2901, FNN=0x0012d46b40(FREE,0×0000000000,0), FOF=0×0000000000  DIR=\Directory1
|    Notify commands received:
|    Event=0×17, wt=1, curSize=0×0, maxSize=0×20, buffer=0×0000000000
|    Tid=0x3f, Pid=0×2310, Mid=0x3bca, Uid=0x3f, size=0×20
| Fid=3335, FNN=0x00193baaf0(FREE,0×0000000000,0), FOF=0×0000000000  DIR=\Directory1\Subdirectory1
|    Notify commands received:
|    Event=0×17, wt=0, curSize=0×0, maxSize=0×20, buffer=0×0000000000
|    Tid=0x3f, Pid=0×2310, Mid=0xe200, Uid=0x3f, size=0×20
| Fid=3683, FNN=0x00290471c0(FREE,0×0000000000,0), FOF=0×0000000000  DIR=\Directory1\Subdirectory1\Subdirectory2
|    Notify commands received:
|    Event=0×17, wt=0, curSize=0×0, maxSize=0×20, buffer=0×0000000000
|    Tid=0x3f, Pid=0×2310, Mid=0×3987, Uid=0x3f, size=0×20

 

 

Frequent 0×622 and 0×606 errors in the SP Event Logs

During some routine checking of the SP Event logs on our NS-40 I was noticing a large number of alerts. Every few seconds I was seeing these three alerts pop in:

0x60a Internal Information Only. A logical unit has been enabled
0×622 Background Verify Aborted
0×606 Unit Shutdown for trespass
 

After a bit of investigation, I narrowed down the cause to several large LUNs that had just been added to a new ESX host.  It turns out that the LUNs were still running the background zeroing process, and that’s what was causing all the alerts in the SP Log. When you create a new LUN and the disks have been previously used for other LUNs, the new LUN needs to be “zeroed” (filled with all zeros to clear data). This takes place in the background and it is part of the LUN initialization.  Once this background zeroing process completed on my new LUNs the alert messages stopped.  I was unaware of that process, so I did a bit of research on it.

LUNs are immediately available for use after a bind (using “Fastbind”), however all the operations associated with a bind can take a long time to finish.  The duration of a LUN bind is dependent on these things:

  • LUN’s bind time background verify priority (rate)
  • Size of the LUN being bound
  • Type of drives in the LUN’s RAID Group
  • Potential disabling of initial verify on bind
  • State of the Storage System (Idle or Load)
  • Position of the LUN on the hard disks of the RAID Group

From that list, priority, LUN size, drive type and verification selection all have the greatest effect on duration.  You can calculate the approximate duration of the bind process with this formula:

Time = Bound LUN Capacity * Bind Rate

Here are the Average Bind Rates for FC and SATA disks:

Disk Type ASAP Bind Rate High Bind Rate Medium (default) Bind Rate Low Bind Rate
FC 83 MB/s 7.54 MB/s 5.02 MB/s 4.02 MB/s
SATA 61.7 MB/s 7.47 MB/s 5.09 MB/s 3.78 MB/s

If we were to calculate how many hours it would take to bind a 2000GB LUN on a five disk RAID5 group composed of SATA drives set to a medium (default) bind rate, here’s what the formula would look like:

Time = 2000 GB * ((1/5.09 MB/s) * 1024 MB->GB * (1/3600 sec->hrs) = 111.76 Hours.

There is a detailed white paper that covers this topic from EMC called “The Effect of Priorities on LUN Management Operations” that you can view here:  http://www.emc.com/collateral/hardware/white-papers/h4153-influence-priorities-emc-clariion-lun-wp.pdf.  That’s where I gathered the information above.

Adding a Celerra to a Clariion storage domain from the CLI

If you’re having trouble joining your Celerra to the storage domain from Unisphere, there is an EMC service Workaround for Joining it from the Navisphere CLI. When attempting it from Unisphere, it would appear to work and allow me to join but would never actually show up on the domain list.  Below is a workaround for the problem that worked for me. Run these commands from the Control Station.

Run this first:

curl -kv “hps://<Celerra_CS_IP>/cgi-bin/set_incomingmaster?master=<Clariion_SPA_DomainMaster_IP>,”

Next, run the following navicli command to the domain master in order to add the Celerra Control Station to the storage domain:

/nas/sbin/naviseccli -h 10.3.215.73 -user <userid> -password <password> -scope 0 domain -add 10.32.12.10

After a successful Join the /nas/http/domain folder should be populated with the domain_list, domain_master, and domain_users files.

Run this command to verify:

ls -l /nas/http/domain

You should see this output:

-rw-r–r– 1 apache apache 552 Aug  8  2011 domain_list
-rw-r–r– 1 apache apache  78 Feb 15  2011 domain_master
-rw-r–r– 1 apache apache 249 Oct  5  2011 domain_users
 

You can also check the domain list to make sure that an entry has been made.

Run this command to verify:

/nas/sbin/naviseccli -h <Clariion_SPA_DomainMaster_IP> domain -list

You should then see a list of all domain members.  The output will look like this:

Node:                     <DNS Name of Celerra>
IP Address:           <Celerra_CS_IP>
Name:                    <DNS Name of Celerra>
Port:                        80
Secure Port:          443
IP Address:           <Celerra_CS_IP>
Name:                    <DNS Name of Celerra>
Port:                        80
Secure Port:          443

Automating Storage Processor % Utilization alerts with EMC Performance Manager

I was tasked with coming up with a way to get email alerts whenever our SP utilization breaks a certain threshold.  Since none of the monitoring tools that we own will do that right now, I had to come up with a way using custom scripts.  This is my 2nd post on the same subject, I removed my post from yesterday as it didn’t work as I intended.  This time I used EMC’s Performance Manager rather than pulling data from the SP with the Navisphere CLI.

First, I’m running all of my bash scripts on a windows sever using cygwin.  These should run fine on any linux box as well, however.  Because I don’t have a native sendmail configuration set up on the windows server, I’m using the control station on the Celerra to actually do the comparison of the utilization numbers in the text files and then email out an alert.  The Celerra control station automatically pulls the file via FTP from the windows server every 30 minutes and sends out an email alert if the numbers cross the threshold.  A description of each script and the schedule is below.

Windows Server:

Export.cmd:

This first windows batch script runs an export (with pmcli) from EMC Performance Manager that does a dump of all the performance stats for the current day.

For /f “tokens=2-4 delims=/ ” %%a in (‘date /t’) do (set date=%%c%%a%%b)

C:\ECC\Client.610\PerformanceManager\pmcli.exe -export -out c:\cygwin\home\scripts\sputil999_interval.csv -type interval -class clariion -date %date% -id APM00400500999

Data.cmd:

This cygwin/bash script manipulates the file export from above and ultimately creates two single text files (one for SPA and one for SPB) with a single numerical value of the most recent SP Utilization.  There are a few extra steps at the beginning of the script that are irrelevant to the SP utilization, they’re there for other purposes.

#This will pull only the timestamp line from the top
grep -m 1 “/” /home/scripts/sputil/0999_interval.csv > /home/scripts/sputil/timestamp.csv
# This will pull out only the “disk utilization” line.
grep -i “^% Utilization” /home/scripts/sputil/0999_interval.csv >> /home/scripts/sputil/stats.csv
# This will pull out the disk/LUN title info for the first column
grep -i “Data Collected for DiskStats -” /home/scripts/sputil/0999_interval.csv > /home/scripts/sputil/diskstats.csv
grep -i “Data Collected for LUNStats -” /home/scripts/sputil/0999_interval.csv > /home/scripts/sputil/lunstats.csv
# This will create a column with the disk/LUN number
cat /home/scripts/sputil/diskstats.csv /home/scripts/sputil/lunstats.csv > /home/scripts/sputil/data.csv
# This combines the disk/LUN column with the data column
paste /home/scripts/sputil/data.csv /home/scripts/sputil/stats.csv > /home/scripts/sputil/combined.csv
cp /home/scripts/sputil/combined.csv /home/scripts/sputil/utilstats.csv
 
#  This removes all the temporary files
rm /home/scripts/sputil/timestamp.csv
rm /home/scripts/sputil/stats.csv
rm /home/scripts/sputil/diskstats.csv
rm /home/scripts/sputil/lunstats.csv
rm /home/scripts/sputil/data.csv
rm /home/scripts/sputil/combined.csv
 
# This next line strips the file of all but the last two rows, which are SP Utilization.
# The 1 looks at the first character in the row, the D specifies “starts with D”, then deletes rows meeting those conditions.
awk -v FS=”" -v OFS=”" ‘$1 != “D”‘ < /home/scripts/sputil/utilstats.csv > /home/scripts/sputil/sputil.csv
 
#This pulls the values from the last column, which would be the most recent.
awk -F, ‘{print $(NF-1)}’ < /home/scripts/sputil/sputil.csv > /home/scripts/sputil/sp_util.csv
 
#pull 1st line (SPA) into separate file
sed -n 1,1p < /home/scripts/sputil/sp_util.csv > /home/scripts/sputil/spAutil.txt
#pull 2nd line (SPB) into separate file
sed -n 2,2p < /home/scripts/sputil/sp_util.csv > /home/scripts/sputil/spButil.txt
#The spAutil.txt/spButil.txt files now contain only a single numerical value, which would be the most recent %utilization from the Control Center/Performance Manager dump file.
 
#Copy files to web server root directory
cp /home/scripts/sputil/*.txt /cygdrive/c/inetpub/wwwroot
 

Celerra Control Station:

CelerraArray:/home/nasadmin/sputil/ftpsp.sh

The script below connects to the windows server and grabs the current SP utilization text files via FTP every 30 minutes (via a cron job).

#!/bin/bash
cd /home/nasadmin/sputil
ftp windows_server.domain.net <<SCRIPT
get spAutil.txt
get spButil.txt
quit
SCRIPT
 

CelerraArray:/home/nasadmin/sputil/spcheck.sh:

This script does the comparison check to see if the SP utilization is over our threshold. If it is, it sends an email alert that includes the %Utilization number in the subject line of the email. To change the threshold setting, you’d need to change the THRESHOLD=<XX> line in the script.  The line containing printf “%2.0f” converts the floating point value to an integer, as bash scripts don’t recognize floating point values.

#!/bin/bash

SPB=`cat /home/nasadmin/sputil/spButil.txt`
SPBcheck= printf “%2.0f” $SPB > /home/nasadmin/sputil/spButil2.txt
SPB=`cat /home/nasadmin/sputil/spButil2.txt`
echo $SPB
THRESHOLD=50
if [ $SPB -eq 0 ] && [ $THRESHOLD -eq 0 ]
then
        echo “Both are zero”
elif [ $SPB -eq $THRESHOLD ]
then        
        echo “Both Values are equal”
elif [ $SPB -gt $THRESHOLD ]
then         
        echo “SPB is greater than the threshold.  Sending alert” 
        uuencode spButil.txt | mail -s “<array_name> SPB Utilization Alert: $SPB % above threshold of $THRESHOLD %” notify@domain.com
else        
echo “$SPB is lesser than $THRESHOLD”
fi

CelerraArray Crontab schedule:

The FTP script is currently set to pull SP utilization files.  Run “crontab –e” to edit the scheduler.  I’ve got the alert script set to run at the top of the hour and half past the hour, and the updated SP files from the web server are FTP’d in a few minutes prior.

[nasadmin@CelerraArray sputil]$ crontab –l
58,28 * * * * /home/nasadmin/sputil/ftpsp.sh
0,30 * * * * /home/nasadmin/sputil/spcheck.sh
 

Overall Scheduling:

Windows Server:

Performance Manager Dump runs 15 minutes past the hour (exports data)
Data script runs at 20 minutes past the hour (processes data to get SP Utilization)

Celerra Server:

FTP script pulls new SP utilization text files at 28 minutes past the hour
Alert script runs at 30 minutes past the hour

The cycle then repeats at minute 45, minute 50, minute 58, and minute 0.

 

12/14/12 Update:

There is an alternate method to gather data for creating alerts if you don’t have EMC’s Control Center. I don’t have scripts written that use this command, however. The Navisphere CLI command to get busy/idle ticks for the Storage processors is naviseccli -h <SP_IPaddress> getcontrol -cbt.

The output looks like this:

Controller busy ticks: 1639432
Controller idle ticks: 1773844

The SP utilization statistics outputted are an average of the utilization across all the cores of the SP’s processors since the last reset. To get the actual point-in-time SP CPU utilization from this output requires a calculation. You need to poll twice, create a delta for the individual counters by subtracting the earlier value from the later, and apply this formula:

Utilization = Busy Ticks / (Busy Ticks + Idle Ticks)

SAN Performance Metrics

I often get requests from application owners to review performance stats.  I thought I’d give a quick overview of some of the things I look at, what the myriad of performance metrics in Navisphere Analyzer and ECC Performance Manager mean, and how you might use some of them to investigate a performance problem.  Performance analysis is very much an art (not a science) and it’s sometimes difficult to pinpoint exact causes based on the mix of applications and workload on the array. Taking all of the metrics into account with a holistic view is needed to be successful.  Performing data collection of application workloads over time is recommended because application workload characteristics will likely vary over time. If you have a major problem, I would always recommend opening an SR with EMC.

This post is just an overview of SAN performance metrics and isn’t meant to dive in to every possible scenario from every angle. EMC already has excellent guides for performance best practices that you can read here:

Because we have EMC’s Performance Manager tool installed in our environment, I always go to that tool first rather than Navisphere Analyzer.  Both use the same metrics, so the following information will be useful regardless of which method you use.

The first thing I do is look at the Storage Processors.  This will give you a good indication of the overall health of the array before you dive into the specific LUN (or LUNs) used by the application.

  • SP Cache Dirty Pages (%). These are pages in write cache that have received new data from hosts but have not yet been flushed to disk.  You should have a high percentage of dirty pages as it increases the chance of a read coming from cache or additional writes to the same block of data being absorbed by the cache. If an IO is served from cache the performance is better than if the data had to be retrieved from disk.  That’s why the default watermarks are usually around 60/80% or 70/90%.  You don’t want dirty pages to reach 100%, they should fluctuate between the high and low watermarks (which means the Cache is healthy).  Periodic spikes or drops outside the watermarks are ok, but consistently hitting 100% indicates that the write cache is overstressed.
  • SP Utilization (%). Check and see if either SP is running higher than about 75%.  If either is running that high application response time will be increased.  Also, both will need to be under 50% for non-disruptive upgrades. We had to do a large scale migration of data from one SAN to another at one point in order to get a NDU accomplished.  You’ll also want to check for proper balance.  If one is much higher than the other, you should consider migrating LUNs from one SP owner to another.  I check SP balance on all of our arrays on a daily basis.
  • SP Response time (ms). Make sure again that both SPs are even and that response time is acceptable. I like to see response times under 10ms.  If you see that one SP has high utilization and response time but the other SP doesn’t, look for LUNs owned by the busier SP that are using more array resources. Looking at total IO on a per LUN basis can help confirm If both SPs have relatively similar throughput but one SP has much higher bandwidth.  That could mean that there is some large block IO occurring.
  • SP Port Queue Full Count. This represents the number of times that a front end port issued a QFULL response back to the hosts. If you are seeing QFULL’s it could mean that the Queue Depth on the HBA is too large for the LUNs being accessed.  A Clariion/VNX front end port has a queue depth of 1600 which is the maximum number of simultaneous IO’s that port can process.  Each LUN on the array has a maximum queue depth that is calculated using a formula based on the number of data disks in the RAID Group. For example, a port with 512 queues and a typical LUN queue depth of 32 can support up to: 512 / 32 = 16 LUNs on 1 Initiator (HBA) or 16 Initiators (HBAs) with 1 LUN each or any combination not to exceed this number. Configurations that exceed this number are in danger of returning QFULL conditions. A QFULL condition signals that the target/storage port is unable to process more IO requests and thus the initiator will need to throttle IO to the storage port. As a result of this, application response times will increase and IO activity will decrease.

The next thing I do is look at the specific LUNs that the application owner is asking about. The list below includes the basic performance metrics that I most often look at when investigating a performance problem.

  • Utilization (%) represents the fraction of an observation period during which a LUN has any outstanding requests. When the LUN becomes the bottleneck, the utilization will be at or close to 100%. However, since I/Os can get serviced by multiple disks an increase in workload might still result in a higher throughput.  Utilization by itself is not a very good indicator of the overall performance of the LUN, it needs to be factored in with several other things. For example, If you are writing to a LUN (100% Writes) and the location of the data is in a small physical space on the LUN, it may be possible to get to 100% with write cache re-hits. This means that all writes are being serviced by the write cache and since you are writing data to the same locations over and over you do not flush any of the data to the disks. This can cause your LUN Utilization to be 100% but there will actually be no IO to the disks. Utilization is very affected by caching, both read and write. The LUN can be very busy but may not have a problem. Use Utilization to assist in identifing busy LUNs then look at queuing and response times to see if there really is an issue.
  • Queue Length is the average number of requests within a polling interval that are outstanding to this LUN. A queue length of zero indicates an idle LUN. If three requests arrive at an idle LUN at the same time, only one of them can be served immediately; the other two must wait in the queue. That scenario would result in a queue length of 3.  My general guideline for “bad performance” on a LUN is a queue length greater than 2 for a single disk drive.
  • Average Busy Queue Length is the average number of outstanding requests when the LUN was busy. This does not include any idle time. This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. Since this queue length is counted only when the LUN is not idle, the value indicates the frequency variation (burst frequency) of incoming requests. The higher the value, the bigger the burst and the longer the average response time at this component. In contrast to this metric, the average queue length does also include idle periods when no requests are pending. If you have 50% of the time just one outstanding request, and the other 50% the LUN is idle, the average busy queue length will be 1. The average queue length however, will be ½.
  • Response Time (ms) is the average time, in milliseconds, that a request to this LUN is outstanding, including its waiting time. The higher the queue length for a LUN, the more requests are waiting in its queue, thus increasing the average response time of a single request. For a given workload, queue length and response time are directly proportional.  Keep in mind that cache re-hits bring down the average response time (and service times), whether they are reads or writes. LUN Response time is a good starting point for troubleshooting. It gives a good indicator of what the host system is experiencing. Usually if your LUN response time (Response time = queue length * service time) is good then the host performance is good. High response times don’t always mean that the CLARiiON is busy, it can also indicate that you’re having issues with your host or Fabric.  We use the Brocade Health report on a regular basis to identify hosts that have an excessive amount of traffic, as well as running the EMC HEAT report on hosts that have reported issues (which can identify incorrect HBA Drivers, Bad HBA, etc).These are my general guidelines for response time:
    Less than 10 ms: very good
    Between 10 – 20 ms: okay
    Between 20 – 50 ms: slow, needs attention
    Greater than 50 ms:  I/O bottleneck
  • Service Time (ms) represents the Time, in milliseconds, a request spent being serviced by a component. It does not include time waiting in a queue. Service time is mainly a characteristic of the system component. However, larger I/Os take longer and therefore usually result in lower throughput (IO/s) but better bandwidth (Mbytes/s). In general, Service time is simply the time it takes to actually send the I/O request to the storage and get an answer back. In general, I like to see service times below 20ms.
  • Total Throughput (IO/sec) is the average number of host requests that is passed through the LUN per second. This includes both read and write requests. Smaller requests usually result in a higher total throughput than larger requests.  Examining total throughput (along with %Utilization) is a good way to identify the busiest LUNs on the array. In general, here are the IOPs limits by drive type:
RPM        Drive Type      IOPs
7,200      SATA,NL-SAS     ~80
10,000     SATA,NL-SAS     ~130
10,000     FC,SAS          ~140
15,000     FC,SAS          ~180
N/A        EFD             ~1500 (Read/Write, 60/40)
N/A        EFD             ~6000 (Read)
N/A        EFD             ~3000 (Write)
  • Write Throughput (I/O/sec) The average number of host write requests that is passed through the LUN per second. Smaller requests usually result in a higher write throughput than larger requests.  When troubleshooting specific LUNs, check the write IO size and see if the size is what you would expect for the application you are investigating. Extremely large IO sizes coupled with high IOPS may cause write cache contention.
  • Read Throughput (I/O/sec) The average number of host read requests that is passed through the LUN per second. Smaller requests usually result in a higher read throughput than larger requests.
  • Total Bandwidth (MB/s) The average amount of host data in Mbytes that is passed through the LUN per second. This includes both read and write requests. Larger requests usually result in a higher total bandwidth than smaller requests.
  • Read Bandwidth (MB/s) The average amount of host read data in Mbytes that is passed through the LUN per second. Larger requests usually result in a higher bandwidth than smaller requests.
  • Write Bandwidth (MB/s) The average amount of host write data in Mbytes that is passed through the LUN per second. Larger requests usually result in a higher bandwidth than smaller requests. Keep in mind that writes consume many more array resources than reads.
  • Read Size (KB) The average read request size in Kbytes seen by the LUN. This number indicates whether the overall read workload is oriented more toward throughput (I/Os per second) or bandwidth (Mbytes/second). For a finer distinction of I/O sizes, use an IO Size Distribution chart for this LUN.
  • Write Size (KB) The average write request size in Kbytes seen by the LUN. This number indicates whether the overall write workload is oriented more toward throughput (I/Os per second) or bandwidth (Mbytes/second). For a finer distinction of I/O sizes, use an IO Size Distribution chart for the LUNs.

Below is an explanation of additional performance metrics that I don’t use as frequently, but I’m including them for completeness.

  • Forced Flushes/s Number of times per second the cache had to flush pages to disk to free up space for incoming write requests. Forced flushes are a measure of how often write requests will have to wait for disk I/O rather than be satisfied by an empty slot in the write cache. In most well performing systems this should be zero most of the time. 
  • Full Stripe Writes/s Average number of write requests per second that spanned a whole stripe (all disks in a LUN). This metric is applicable only to LUNs that are part of a RAID5 or RAID3 group.
  • Used Prefetches (%) The percentage of prefetched data in the read cache that was read during the last polling interval.
  • Disk Crossing (%) Percentage of host requests that require I/O to at least two disks compared to the total number of host requests. A single disk crossing can involve more than two disk drives.
  • Disk Crossings/s Number of times per second that a request requires access to at least two disk drives. A single disk crossing can involve more than two disks.
  • Read Cache Hits/s Average number of read requests per second that were satisfied by either read or write cache without requiring any disk access. A read cache hit occurs when recently accessed data is re-referenced while it is still in the cache.
  • Read Cache Misses/s Average number of read requests per second that did require one or more disk accesses.
  • Reads From Write Cache/s Average number of read requests per second that were satisfied by write cache only. Reads from write cache occur when recently written data is read again while it is still in the write cache. This is a subset of read cache hits which includes requests satisfied by either the write or the read cache.
  • Reads From Read Cache/s Average number of read requests per second that were satisfied by the read cache only. Reads from read cache occur when data that has been recently read or prefetched is re-read while it is still in the read cache. This is a subset of read cache hits which includes requests satisfied by either the write or the read cache.
  • Read Cache Hit Ratio The fraction of read requests served from both read and write caches vs. the total number of read requests. A higher ratio indicates better read performance.
  • Write Cache Hits/s Average number of write requests per second that were satisfied by the write cache without  requiring any disk access. Write requests that are not write cache hits are referred to as write cache misses.
  • Write Cache Misses/s Average number of write requests per second that did require one or multiple disk accesses. Write requests that cause forced flushes or that bypass the write cache due to their size are examples of write cache misses.
  • Write Cache Rehits/s Average number of write requests per second that were satisfied by the write cache since they had been referenced before and not yet flushed to the disks. Write cache rehits occur when recently accessed data is referenced again while it is still in the write cache. This is a subset of Write Cache Hits.
  • Write Cache Hit Ratio The ratio of write requests that the write cache satisfied without requiring any disk access vs. the total number of write requests to this LUN. A higher ratio indicates better write performance.
  • Write Cache Rehit Ratio The ratio of write requests that the write cache satisfied since they have been referenced before and not yet flushed to the disks vs. the total number of write requests to this LUN. This is a measure of how often the write cache succeeded in eliminating a write operation to disk. While improving the rehit ratio is useful it is more beneficial to reduce the number of forced flushes.
Follow

Get every new post delivered to your Inbox.