
Chronicles of a Cluster Troubleshooter

Applications Performance Improved with Snoop Filter Off on Intel 5000x Chipset

Chronicles of a Cluster Troubleshooter Volume One #1

Date: October 5, 2007

Issue: The Eclipse application performs worse on Windows Compute Cluster Server V1 than it does on the same hardware under Linux.

Hardware Platform: HP ProLiant DL140 cluster with Voltaire InfiniBand interconnect.

Software Platform: Windows Server 2003 Compute Cluster Edition Service Pack 2 with Compute Cluster Service Pack 1.

Application: Schlumberger Eclipse reservoir simulation package.

Initial Conditions: Cluster hardware is set up, with the O/S and Cluster Pack installed. Application performance under Windows does not match that under Linux. A large test case fails when run on all nodes.

Investigation and Remediation:

  1. Install the uSane cluster sanity test suite on the head node.
  2. Run the uSane MpiHi test.
    1. Runs on a small case.
    2. Fails immediately when testing the whole cluster.
      1. The mpiexec error message says that it cannot get credentials from compute node 2.
      2. The HP engineer notes that this is the same message that the large application generates. He had assumed this was a Windows CCS issue.
  3. Pause node 2, then run MpiHi across the whole network. It runs fine.
  4. Replace node 2, on the theory that the problem may be in the private network NIC or possibly the node software. A fresh image on new hardware is guaranteed to fix either issue in one service action. The post mortem showed that the NIC was bad.
  5. Using all nodes in the cluster, run the uSane tests: cNodeLate, the MPI network latency test; cNodeBand, the MPI network bandwidth test; and cFlood, the MPI switch flood test with all links active at one time.
    1. Latencies near 10 microseconds.
    2. Bandwidth near 900 MB/s with “-env MPICH_SOCKET_SBUFFER_SIZE 0” on the mpiexec command line.
    3. Conclude that the network is fine.
  6. The HP engineer adds “-env MPICH_SOCKET_SBUFFER_SIZE 0” to the application mpiexec line. Performance is still subpar.
  7. Discussions with another HP engineer, who had run the application on Linux, revealed that she used the Linux utility ‘setpci’ to improve Linux performance. She used it not to set anything specific to the PCI bus or the InfiniBand HCAs, but to disable the Snoop Filter feature of the Intel Greencreek 5000X chipset memory bus.
  8. Used the BIOS to disable Snoop Filter.
  9. Reran the application. Performance was now on par with Linux.

Cores   Snoop On    Snoop Off   Improvement
        (seconds)   (seconds)
    2       5490         4520           21%
    4       5337         3899           37%
    8       2944         2241           31%
   16       1721         1298           32%
   32       2919         2698            8%
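The post does not say how the Improvement column was calculated. As a rough check, the Python sketch below recomputes it from the timings in two ways. The figures in the table look closer to the speedup form (time with Snoop Filter on divided by time with it off, minus one) than to the fraction of run time saved, but that is an inference from the numbers, not something stated here. The function names are illustrative only.

```python
# Wall-clock times in seconds, copied from the table above:
# cores -> (Snoop Filter on, Snoop Filter off)
timings = {
    2: (5490, 4520),
    4: (5337, 3899),
    8: (2944, 2241),
    16: (1721, 1298),
    32: (2919, 2698),
}


def speedup_percent(on_seconds, off_seconds):
    """Percent speedup from disabling the Snoop Filter: (on / off - 1) * 100."""
    return (on_seconds / off_seconds - 1.0) * 100.0


def time_saved_percent(on_seconds, off_seconds):
    """Percent of the original run time saved: (on - off) / on * 100."""
    return (on_seconds - off_seconds) / on_seconds * 100.0


for cores, (on, off) in sorted(timings.items()):
    print(f"{cores:>2} cores: {speedup_percent(on, off):5.1f}% faster, "
          f"{time_saved_percent(on, off):5.1f}% less run time")
```

Either definition tells the same story: the gain is large up to 16 cores and much smaller at 32.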

  10. Checked with the application developer and discovered that the test case provided was not expected to scale past 16 cores.
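The scaling limit shows up in the timings themselves. The sketch below takes the Snoop Off column from the table and computes speedup and parallel efficiency relative to the 2-core run, since the table has no single-core baseline. The helper names are hypothetical, and the efficiency formula is the usual textbook one rather than anything taken from the application vendor.

```python
# Snoop Filter off run times in seconds, from the table: cores -> seconds
snoop_off_seconds = {2: 4520, 4: 3899, 8: 2241, 16: 1298, 32: 2698}


def relative_speedup(times, baseline_cores):
    """Speedup of each run relative to the run at baseline_cores."""
    base = times[baseline_cores]
    return {cores: base / seconds for cores, seconds in times.items()}


def parallel_efficiency(times, baseline_cores):
    """Speedup divided by the ideal speedup (cores / baseline_cores)."""
    base = times[baseline_cores]
    return {
        cores: (base * baseline_cores) / (seconds * cores)
        for cores, seconds in times.items()
    }


def best_core_count(times):
    """Core count with the shortest run time."""
    return min(times, key=times.get)


speedups = relative_speedup(snoop_off_seconds, baseline_cores=2)
efficiencies = parallel_efficiency(snoop_off_seconds, baseline_cores=2)

for cores in sorted(snoop_off_seconds):
    print(f"{cores:>2} cores: speedup {speedups[cores]:.2f}x, "
          f"efficiency {efficiencies[cores]:.0%}")

print("Fastest run:", best_core_count(snoop_off_seconds), "cores")
```

On these numbers the 32-core run is slower than the 16-core run, which is consistent with what the developer reported about this test case.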

 

For an interesting article on Snoop Filter performance impact see: http://www.dell.com/downloads/global/power/ps3q06-20060362-Radhakrishnan.pdf

Until next time, good shooting.

 

Frankie

 

Published Wednesday, October 24, 2007 4:46 PM by Frankie


About Frankie

Globe-trotting, bug-swatting HPC gadfly, Frankie is now doing HPC stuff at Microsoft after a lifetime of doing HPC stuff elsewhere. Frankie is more interested in today and tomorrow than yesterday, but will speak dead languages with proper lubrication.
