
Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54
Closed, ResolvedPublic

Assigned To
None
Authored By
cmooney
Nov 21 2024, 4:32 PM
Referenced Files
F57745316: image.png
Nov 25 2024, 11:28 AM
F57745141: image.png
Nov 25 2024, 10:18 AM
F57731376: image.png
Nov 22 2024, 9:54 AM
F57729536: image.png
Nov 21 2024, 4:32 PM

Description

Hi @VRiley-WMF @Jclark-ctr,

We've had some link errors on a long-run fibre from rack D5 to E4 for cloud services. Stats from the QSFP+ modules (temperature, light levels) look OK, but I still think we need to swap them out. There have been just two spikes this week, but they caused considerable disruption to traffic in the cloud racks:

image.png (703×1 px, 84 KB)

Today the actual link failed for a few minutes in the middle of this burst of errors:

Nov 21 15:33:36  cloudsw1-d5-eqiad fpc0 Local fault detected on port 65 (et-0/0/52)
Nov 21 15:36:01  cloudsw1-d5-eqiad l2cpd[17035]: LLDP_NEIGHBOR_UP: A neighbor has come up for interface et-0/0/52. Now, this interface has 1 neighbor/s .
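For reference, the length of that outage can be worked out from the two log timestamps above. This is only a rough sketch: it treats the LLDP neighbor-up message as the moment the link recovered (an assumption, since LLDP may lag the physical link slightly), and `outage_seconds` is a hypothetical helper name, not an existing tool.

```python
from datetime import datetime

# Syslog timestamps carry no year, so one is supplied explicitly.
# 2024 is taken from the date of this task.
FMT = "%Y %b %d %H:%M:%S"


def outage_seconds(down_ts: str, up_ts: str, year: int = 2024) -> float:
    """Return the seconds elapsed between two syslog-style timestamps."""
    down = datetime.strptime(f"{year} {down_ts}", FMT)
    up = datetime.strptime(f"{year} {up_ts}", FMT)
    return (up - down).total_seconds()


# Local fault at 15:33:36, LLDP neighbor back at 15:36:01
print(outage_seconds("Nov 21 15:33:36", "Nov 21 15:36:01"))  # 145.0
```

That works out to roughly two and a half minutes between the local fault and the LLDP neighbor coming back.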

So we need to get 2 x 40GBase-LR4 QSFP+ modules (like this) and install them at either end of the link.

Traffic has been shifted away from the link so this can happen any time, but ping us first so we can downtime the switches to avoid noise. Thanks!

Event Timeline

cmooney triaged this task as High priority. Nov 21 2024, 4:32 PM
cmooney created this task.

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

This port bounced again overnight:

cmooney@cloudsw1-d5-eqiad> show log messages.1.gz | match "10.64.147.5|2620:0:861:fe0e::2|et-0/0/52" | match "ifOper" | except ".0$" 
Nov 22 00:22:55  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:22:56  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:22:58  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:22:59  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:23:04  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:23:06  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:23:06  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:23:07  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:23:11  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:23:13  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:23:14  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:23:16  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:23:20  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:23:20  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
Nov 22 00:23:21  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52
Nov 22 00:23:22  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52
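The excerpt above shows eight link-down traps on et-0/0/52 in under 30 seconds. A minimal sketch of how such flaps could be counted from saved log lines follows. It assumes the `mib2d` `SNMP_TRAP_LINK_DOWN`/`SNMP_TRAP_LINK_UP` line format shown above; `count_flaps` and the regular expression are hypothetical helpers written for illustration, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the mib2d link-trap lines as they appear in the log excerpt above.
TRAP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"mib2d\[\d+\]:\s+SNMP_TRAP_LINK_(?P<state>DOWN|UP):"
    r".*ifName\s+(?P<ifname>\S+)"
)


def count_flaps(lines):
    """Count LINK_DOWN traps per (host, interface) pair."""
    downs = Counter()
    for line in lines:
        m = TRAP_RE.match(line.strip())
        if m and m.group("state") == "DOWN":
            downs[(m.group("host"), m.group("ifname"))] += 1
    return downs


# Two lines copied from the log above: one down event, one up event.
sample = [
    "Nov 22 00:22:55  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_DOWN: "
    "ifIndex 682, ifAdminStatus up(1), ifOperStatus down(2), ifName et-0/0/52",
    "Nov 22 00:22:56  cloudsw1-d5-eqiad mib2d[17046]: SNMP_TRAP_LINK_UP: "
    "ifIndex 682, ifAdminStatus up(1), ifOperStatus up(1), ifName et-0/0/52",
]

for (host, ifname), n in count_flaps(sample).items():
    print(f"{host} {ifname}: {n} link-down event(s)")
```

Counting only the down traps avoids double-counting each bounce, since every flap in the log produces a down/up pair.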

Errors on the link are also still not insignificant. They are far fewer in absolute terms, but still high given that there is no traffic on it:

image.png (451×1 px, 61 KB)

Icinga downtime and Alertmanager silence (ID=6b283bec-74b8-4f8c-9a46-f9f60c9c4026) set by cmooney@cumin1002 for 1:00:00 on 2 host(s) and their services with reason: replace optics on faulty WMCS link from D5 to E4

cloudsw1-d5-eqiad.mgmt,cloudsw1-e4-eqiad.mgmt

Replaced the transceiver in cloudsw1-e4-eqiad et-0/0/54. Will test to see if that works. Trying to locate another one for cloudsw1-d5-eqiad et-0/0/52.

Thanks @VRiley-WMF! Seems OK so far (clean since the swap), but we can make a call Monday based on whether we see errors on the link.

If you can check for others or get a count of spares that would be good though, we'll need to replace the one we just used and probably want a few in total on site so good to know how many are there if doing an order.

Link has been clean since the optic was replaced:

image.png (570×1 px, 64 KB)

I'll suggest to WMCS we put traffic back on the link.

OK, the BGP downpref policy has been reverted, and routed traffic is running over the link again. So far so good: a few Gbps of traffic in either direction and still no errors on either side.

image.png (503×1 px, 79 KB)

Has this still been performing as expected? If so, are we able to close it?

cmooney reopened this task as Open. Edited Nov 26 2024, 4:10 PM

Has this still been performing as expected? If so, are we able to close it?

Yep all is good with the link thanks!

@VRiley-WMF in terms of closing, we need to order at least one replacement 40GBase-LR4 optic. However, we probably want to order more than that if (as it seems) we are running low. Did you manage to get a total count of how many of that type we have?

I guess we can close this and spin up another task for the replacement order then.

Understood, I will close this and ask for a replacement!
