Saturday, March 8, 2014

Cisco VSS: connected ports work only after clearing ARP

Now this problem was a little unique in the Cisco VSS setup. Last week, connectivity to some of the HP iLO, IPS, ASA and Cisco wireless devices was being lost intermittently. The loss affected only the management VLAN. We had also brought up connectivity to one of the new sites last week, but I did not think that was related to these issues.

All these devices were otherwise working fine; only management port reachability was an issue. We tracked the ports and found that they would sometimes stop responding to ping within 5 minutes and at other times keep working for a few hours.

When we saw that ping replies came back the moment we did a clear arp on the VLAN interface, the solution seemed evident based on my experience.

I logged in and saw that the ARP timeout on the VLAN interface was 4 hours while the mac-address aging timer was at its default of 5 minutes, so I raised the mac-address aging timer to 4 hours, or 14400 seconds.
After this we cleared the ARP again, but the issue persisted. Ping worked for 30-40 minutes and then connectivity dropped again.
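
The timer mismatch described above can be put in numbers. This is only a rough sketch, assuming a host that stays silent after its last frame: the switch still holds the ARP entry but has already aged out the MAC entry, so traffic to that host is flooded as unknown unicast for the difference between the two timers. The function name is mine and purely illustrative.

```python
# Timers taken from the post: ARP timeout 4 hours, MAC aging default 5 minutes.
ARP_TIMEOUT_S = 4 * 60 * 60        # 14400 seconds
MAC_AGING_DEFAULT_S = 5 * 60       # 300 seconds


def flood_window(arp_timeout_s, mac_aging_s):
    """Seconds during which the ARP entry can outlive the MAC entry
    for a host that sends no traffic (worst case, illustrative only)."""
    return max(0, arp_timeout_s - mac_aging_s)


if __name__ == "__main__":
    print("default timers :", flood_window(ARP_TIMEOUT_S, MAC_AGING_DEFAULT_S), "s")
    print("aligned timers :", flood_window(ARP_TIMEOUT_S, ARP_TIMEOUT_S), "s")
```

Aligning the two timers removes that window, which is why it was a reasonable first thing to try even though it turned out not to be the cause here.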

We checked the DFC, SUP and module software versions against known bugs.

CORESW01#sh mod switch 1
 Switch Number:     1   Role:   Virtual Switch Active
----------------------  -----------------------------
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE      SAL13442G92
  3   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX     SAL1425KZ9U
  5    5  Supervisor Engine 720 10GE (Active)    VS-S720-10G        SAL1426LNQC
  7    8  Intrusion Detection System             WS-SVC-IDSM-2      SAL1423K01P
  8    6  Firewall Module                        WS-SVC-FWM-1       SAL1419HLQ0

Mod MAC addresses                       Hw    Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
  1  0026.9925.bf58 to 0026.9925.bf5f   2.1   12.2(18r)S1  12.2(33)SXI4 Ok
  3  c84c.7570.0fa0 to c84c.7570.0fcf   3.4   12.2(18r)S1  12.2(33)SXI4 Ok
  5  0026.cb61.4b48 to 0026.cb61.4b4f   3.2   8.5(4)       12.2(33)SXI4 Ok
  7  5475.d062.6160 to 5475.d062.6167   6.5   7.2(1)       7.0(4)E4     Ok
  8  5475.d062.4bb8 to 5475.d062.4bbf   4.5   7.2(1)       4.0(12)      Ok

Mod  Sub-Module                  Model              Serial       Hw     Status
---- --------------------------- ------------------ ----------- ------- -------
  1  Distributed Forwarding Card WS-F6700-DFC3C     SAL13442GEF  1.4    Ok
  3  Distributed Forwarding Card WS-F6700-DFC3C     SAL1426L9Y9  1.4    Ok
  5  Policy Feature Card 3       VS-F6K-PFC3C       SAL1426LM7Y  1.1    Ok
  5  MSFC3 Daughterboard         VS-F6K-MSFC3       SAL1426LMXY  5.0    Ok
  7  IDS 2 accelerator board     WS-SVC-IDSUPG      71100440010  2.5    Ok

Mod  Online Diag Status
---- -------------------
  1  Pass
  3  Pass
  5  Pass
  7  Pass
  8  Pass

Then we decided to take a 10-minute Wireshark capture and mirrored the problematic interface to the port where our laptop was connected.

After analyzing the dump, we found nothing suspicious; the host was sending the ping replies as well.

We then checked which interfaces the MAC address of the device was being learned on.

sh mac- address d8d3.8561.0991 all de

========================================
PI_E RM  RMA Type Alw-Lrn Trap Modified Notify Capture Flood   Mac  Address  Age  Pvlan  SWbits Index  XTag
----+---+---+----+-------+----+--------+------+-------+------+--------------+----+------+------+------+----
switch 1 Module 1:
Yes  Yes Yes  DY    No     No     Yes     No     No     No    d8d3.8561.0991 0x9F 100    0      0x102E 0 
switch 1 Module 3:
Yes  Yes Yes  DY    No     No     Yes     No     No     No    d8d3.8561.0991 0x1F 100    0      0x102E 0 
Supervisor switch 1 Module 5
 No  Yes  No  DY    No     No     Yes     No     No     No    d8d3.8561.0991 0xFB 100    0      0x102E 0 
switch 2 Module 1:
Yes  Yes Yes  DY    No     No     Yes     No     No     No    d8d3.8561.0991 0xAC 100    0      0x102E 0 
switch 2 Module 3:
Yes  Yes Yes  DY    No     No     Yes     No     No     No    d8d3.8561.0991 0xFF 100    0      0x102E 0 
Supervisor switch 2 Module 5
 No  Yes  No  DY    No     No     Yes     No     No     No    d8d3.8561.0991 0xFA 100    0      0x102E 0 


sh mac- add d8d3.8561.0991

This output showed that the age on Switch 2 Module 1 drops to 0 whenever the issue occurs.

Legend: * - primary entry
        age - seconds since last seen
        n/a - not available

  vlan   mac address     type    learn     age              ports
------+----------------+--------+-----+----------+--------------------------
switch 1 Module 1:
*  100  d8d3.8561.0991   dynamic  Yes        250   Po200
switch 1 Module 3:
*  100  d8d3.8561.0991   dynamic  Yes        200   Po200
switch 2 Module 1:
*  100  d8d3.8561.0991   dynamic  Yes          0   Po200
switch 2 Module 3:
*  100  d8d3.8561.0991   dynamic  Yes        175   Po200
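
Watching the age column by hand across several modules gets tedious. Here is a minimal sketch, assuming the output format shown above, that parses the per-module entries and lists the modules where the age has just reset. The function names and the 5-second threshold are my own choices for illustration, not anything the switch provides.

```python
import re

# Sample copied from the output above.
SAMPLE = """\
switch 1 Module 1:
*  100  d8d3.8561.0991   dynamic  Yes        250   Po200
switch 1 Module 3:
*  100  d8d3.8561.0991   dynamic  Yes        200   Po200
switch 2 Module 1:
*  100  d8d3.8561.0991   dynamic  Yes          0   Po200
switch 2 Module 3:
*  100  d8d3.8561.0991   dynamic  Yes        175   Po200
"""

HEADER = re.compile(r"^(?:Supervisor\s+)?switch\s+(\d+)\s+Module\s+(\d+):?$", re.I)
ENTRY = re.compile(
    r"^\*?\s*(\d+)\s+([0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})"
    r"\s+(\w+)\s+(\w+)\s+(\d+)\s+(\S+)$",
    re.I,
)


def parse_mac_ages(text):
    """Return one dict per MAC entry, tagged with its switch and module."""
    rows, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = (int(header.group(1)), int(header.group(2)))
            continue
        entry = ENTRY.match(line)
        if entry and current is not None:
            rows.append({
                "switch": current[0],
                "module": current[1],
                "vlan": int(entry.group(1)),
                "mac": entry.group(2).lower(),
                "age": int(entry.group(5)),
                "port": entry.group(6),
            })
    return rows


def recently_refreshed(rows, threshold=5):
    """Modules whose entry age is at or below the (assumed) threshold."""
    return [(r["switch"], r["module"]) for r in rows if r["age"] <= threshold]


if __name__ == "__main__":
    print(recently_refreshed(parse_mac_ages(SAMPLE)))
```

Running this against output collected every few seconds would show whether the reset always happens on the same module when ping stops.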

We enabled this command to find out whether the MAC was moving within the switch.

mac address-table notification mac-move


CORESW01#sh sw vir
Switch mode                  : Virtual Switch
Virtual switch domain number : 100
Local switch number          : 1
Local switch operational role: Virtual Switch Active
Peer switch number           : 2
Peer switch operational role : Virtual Switch Standby

After that we did a packet capture.

service internal
show platform capture elam trigger dbus ipv4 if L3_PT=ICMP IP_DA=10.1.1.100 ICMP_TYPE=0x8
sh plat cap elam start

In this way we found that the DMAC was 5475.d0e5.4500, which is not the MAC of the device we were inspecting.

We also applied no ip redirects on the VLAN interface facing the issue.

When we tracked this MAC, we found that it belonged to one of the interfaces of the ASA.
We then logged in to the ASA and found the cause: the ASA was doing proxy ARP on this interface, so we disabled proxy ARP on the ASA management interface.

sysopt noproxyarp management

Bingo. Problem Solved.
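
In hindsight, the symptom to look for is two different MACs answering ARP for the same IP. Below is a small sketch, assuming you have already pulled (IP, MAC) pairs for ARP replies out of a capture; the pairing of 10.1.1.100 with the host MAC and the third entry are illustrative assumptions, and the function name is mine.

```python
from collections import defaultdict


def conflicting_arp_owners(replies):
    """Map each IP to its MACs when more than one MAC answered for it."""
    owners = defaultdict(set)
    for ip, mac in replies:
        owners[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in owners.items() if len(macs) > 1}


# Illustrative input: the host MAC and the ASA MAC from this post both
# answering for the pinged address, plus one unrelated, made-up entry.
replies = [
    ("10.1.1.100", "d8d3.8561.0991"),
    ("10.1.1.100", "5475.d0e5.4500"),
    ("10.1.1.101", "0026.9925.bf58"),  # hypothetical pairing
]

if __name__ == "__main__":
    for ip, macs in conflicting_arp_owners(replies).items():
        print(ip, "answered by", ", ".join(macs))
```

Any IP that shows up here with a firewall or router MAC next to the real host MAC is a good candidate for a proxy ARP check.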





Tuesday, July 16, 2013

Windows 2008 R2 and CIFS 3.0: a marriage made in hell

Yesterday I spent too much time identifying what broke the communication between Samba and the Windows clients. In our environment all the data resides on HP-UX, and the Windows servers do the processing. The communication between HP-UX CIFS and Windows was working fine until Thursday night (nightmare patching night).

The obvious move would be to roll back the patches and expect things to work again, but it is not always as simple as it looks. I decided to first look at the Samba logging to see whether the requests were arriving and why they were being rejected.

What surprised me was that the version we were running did not offer any detailed debugging option, so I decided to upgrade the server from 2.0 to 3.0 by installing the new depot.

After the installation, this at least proved to me that the issue was on the Windows side, since the requests were not reaching Samba at all.

Anyhow, the lesson learned is that SMB signing is a feature to be used only in a pure Windows environment.

For all those facing these issues, here are the steps I followed to isolate the problem.

-          First I checked the /etc/opt/samba/smb.conf file and ran the parser on it on the UAT server

-          swlist and swinstall -s depot

-          I upgraded the samba to the 3.1.0 http://hp-ux-br.blogspot.com/2012/07/configuring-cifs-server-samba.html so that i can get detailed logging of the issue and find out the root cause

-          CIFS-CLIENT                   A.02.02.02     HP CIFS Client

-          CIFS-SERVER                   A.03.02.00     HP CIFS Server

-          I reconfigured the users and gave root access to the Samba directories as well

-          Then in the log I found that the request was not reaching the server.

-          Then we disabled the endpoint protection

-          Opened a case with HP, where they said they do not support individual software, so I could open a forum case instead.

-          In the forum, HP said it is a Microsoft issue for which Microsoft has released a patch; they told me to refer to these links

-          http://www.networksteve.com/forum/topic.php/KB2536276_kills_SMB_access_to_old_Linux_Samba/?TopicId=22232&Posts=1

-          And download and install this patch

-          http://support.microsoft.com/kb/2560452/en-us

-          Even after the patch was installed, the request did not reach the Unix server.

-          Then we checked and found that we could mount the shares from other servers but not from these UAT servers

-          In the meantime, a similar issue was reported on production after a restart.

-          Then I checked for any SMB blocking that we had done as part of the vulnerability management remediation.

-          We referred to the following link http://www.joseftschiggerl.name/2012/09/server-message-block-smb-signing-enable

-          The setting took effect only now because the server had only just been restarted.

-          Then we changed the AD setting

-          Did gpupdate /force a couple of times and restarted the server

-          We were able to map the drives without any issue, and without Microsoft TAC help
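
Since the whole thing came down to an SMB signing mismatch, it helps to know what the Samba side is set to before touching the Windows policy. Here is a rough sketch that reads the server signing parameter from the [global] section of smb.conf text. The sample configuration is made up for illustration, the helper name is mine, and I am assuming the HP CIFS Server uses the standard Samba parameter name.

```python
def read_global_option(conf_text, option):
    """Return the value of `option` in [global], or None if it is not set."""
    wanted = " ".join(option.lower().split())
    in_global = False
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":          # skip blanks and comments
            continue
        if line.startswith("[") and line.endswith("]"):
            in_global = line[1:-1].strip().lower() == "global"
            continue
        if in_global and "=" in line:
            key, _, val = line.partition("=")
            if " ".join(key.lower().split()) == wanted:
                value = val.strip()               # last occurrence wins
    return value


# Hypothetical smb.conf content, not taken from the real server.
SAMPLE_CONF = """\
[global]
   workgroup = EXAMPLE
   ; signing is off on this made-up server
   server signing = disabled

[data]
   path = /data
"""

if __name__ == "__main__":
    print("server signing:", read_global_option(SAMPLE_CONF, "server signing"))
```

If the server side has signing disabled while the Windows clients are forced by policy to always sign, the two will not talk, which matches what we saw after the restart.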