Netmanias Technical Document: Network Failure - Definition of Terms and Case Studies
September 20, 2007
NMC Consulting Group(tech@netmanias.com)
1. Definition of Terms
2. Failure Classification Methods
3. Primary Failures & Symptoms
4. Examples of Network Failure
5. Network Failure Statistics
6. Lessons from Failure Examples
Contents
Term: Defect
  Description: A decrease in the ability of a network element (link, component, node, ...) to perform a required function.
  Example: A link defect may cause poor link quality, leading to error detection and resulting in packet retransmissions.

Term: Failure
  Description: The termination of the ability of a network element to perform a required function; a network failure therefore happens at one particular moment.
  Example: A link defect may result in a link failure. If the defective network element exhibits a gradual degradation, the time of the failure is the moment the degradation reaches an unacceptable level.

Term: Fault (or Outage)
  Description: The inability of a network element itself to perform a required function. The fault lasts until the network element is repaired, implying that a network fault covers a time interval, in contrast to a network failure.

Term: Primary Failure (= Root Failure)
  Description: The basic, original failure occurring in the network.
  Example: A cable cut.

Term: Secondary Failure (= Symptom)
  Description: Failures caused by the root failure.
  Example: BGP session broken, Telnet session broken, SNMP response unavailable, etc.
[Timeline figure] A network element is Operational while Defect 1 and Defect 2 occur; at the moment of Failure it becomes Not Operational; the resulting Fault lasts until Repair, after which the element is Operational again.
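To make the timing relationship above concrete, here is a minimal illustrative Python sketch (timestamps are invented) that reflects the definitions: a failure is an instant, while the resulting fault covers an interval that ends at repair.

  # Minimal illustrative sketch: a failure is an instant, the resulting fault
  # (outage) is the interval lasting until repair. Timestamps are made up.
  from datetime import datetime

  failure_time = datetime(2007, 9, 20, 10, 15)  # degradation reaches an unacceptable level
  repair_time  = datetime(2007, 9, 20, 13, 40)  # the network element is repaired

  fault_duration = repair_time - failure_time   # the fault covers this time interval
  print(f"Failure at {failure_time}, repaired at {repair_time}; "
        f"the fault/outage lasted {fault_duration}")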
1. Definition of Terms
Type of Outage/Failure: Planned Outage (= Intentional Failure)
  Description: Caused by operational or maintenance actions intentionally performed by the operator.
  Examples: S/W update (followed by a reset); removing/adding a network element or component; sabotage.

Type of Outage/Failure: Unplanned Outage (= Unintentional Failure)
  Description: Network outages without the operator's intention.
  Examples: natural disasters (earthquakes, fires, floods); wear-out (e.g., Layer 1 faults); overload; S/W bugs; human errors (configuration mistakes, design errors).
- This classification is based on whether the network outage was planned in advance.
- "Planned Outage"
  - Part of the network is intentionally taken out of service according to a plan prepared in advance, e.g., for network expansion or changes.
  - In networks with redundant/multiplexed configurations, the work should be planned so that service continues over the remaining paths, bypassing the section or equipment being worked on.
- An "Unplanned Outage" is a case where service is interrupted unexpectedly by a disaster or accident.
2. Failure Classification: (1) Planned vs. Unplanned
Type of Failure: Hardware Failure
  Description: Network failure due to hardware defects internal to the system.
  Examples: defects of electronic or optical components; hardware design errors; manufacturing errors.

Type of Failure: Software Failure
  Description: Network failure due to software defects or flaws internal to the system.
  Examples: buffer overflow; memory overflow; routing table overflow; CPU hang-up due to an infinite loop in the source code, an attempt to take a semaphore twice, too many instructions in an ISR, etc.
- This classification is based on whether the element that triggers the failure, in terms of the equipment's internal architecture, is hardware or software.
- Failures caused by hardware problems
  - Can result from hardware design errors, defective components, manufacturing defects, physical wear of parts, etc.
  - Resolved by replacing the equipment itself or some of its components.
- Failures caused by software problems
  - Abnormal operation or an operational outage caused by errors in the source code, abnormal behavior caused by missing memory-resource management, etc.
  - The remedy is to correct and reinforce the source code to remove the functional problem and then perform a software/firmware upgrade.
2. Failure Classification: (2) Hardware Failure vs. Software Failure
Type of Failure: Failure by Internal Causes
  Description: Failure caused by a network-internal imperfection.
  Examples: design errors; defects of electronic or optical components; a battery breakdown.

Type of Failure: Failure by External Causes
  Description: Failure caused by surrounding events.
  Examples: electricity breakdown; lightning, storms, earthquakes, floods; digging accidents; vandalism / sabotage.
- This classification depends on whether the cause of the failure lies in factors internal to the network operator itself or in external factors.
- Failures from internal causes
  - Design errors, defective equipment components, problems with DC power-supply equipment, etc.
  - Problems arising from factors that were within the operator's own operational control and could have been avoided with careful, well-planned operation.
  - Systematic, routine management and inspection should be performed so that problems caused by internal carelessness or incompetence are eliminated completely.
- Failures from external causes
  - Problems arising from factors beyond the network operator's control.
  - To prepare for unexpected external events, the operator should secure high availability (HA) of the network, apply redundant configurations, build a disaster recovery framework, and establish and internalize response manuals for major failures.
2. Failure Classification: (3) Internal Causes vs. External Causes
Possible Root Failures
- Fiber cut
- Power down
- CPU overload (in terms of usage)
- Memory overflow
- Routing table overflow
- Traffic overload (data plane)
- CPU traffic overload (control plane)
- Configuration mistakes
- S/W crash (S/W bug)

Typical Symptoms
1. CPU hang-up
2. All or some prefixes have no routes in the router RIB/FIB
3. The CPU is too busy to process important protocol messages/events (such as routing updates or Hello/Keepalive messages), which in turn causes loss of neighborship and routing blackholes
4. Control messages are lost (due to either lack of prioritization or insufficient queue resources to hold critical messages)
5. Routing convergence (rerouting) is triggered, which can cause temporary loops or packet reordering
6. Customers' traffic is discarded
7. Protocol interoperation with neighboring devices is disturbed (even though the physical connectivity is fine)
3. Primary Failures & Symptoms
1. Failure by Disaster
   - Hanshin/Awaji earthquake in 1995
   - Hurricane Katrina in 2005
   - Taiwan earthquake in 2006
2. Failure by Physical Damage (or Wear-Out)
   - Submarine cable break
3. Failure by Terrorist Attack (9/11)
4. Failure by Hardware Fault
   - Powercomm subscriber network
   - US regional ISP outage
5. Failure by Software Bug
   - Cisco IOS bug in the VLSM function
   - NTT East case
6. Failure by Erroneous Software Update
7. Failure by Human Error
8. Failure by Security Vulnerability
9. Failure by Design Error
4. Examples of Network Failure
Jan 17, 1995, Richter scale 7.1
6,434 people died, 34,626 were injured
Overview
11 local telephone switches were put out of action for > 24hr (due to lack of power)
Cut off as many as 285,000 subscriber lines in the Hanshin area
Impact of the Earthquake on Telecom
Long-distance communications services in Japan were not directly hindered by the disaster because of the automatic protection mechanisms installed in NTT’s core network
NTT also provided 5,000 emergency transmission lines to meet the critical communications needs in the region
Managed to repair more than 50% of the damage within 2 weeks
Measures Taken
4.1 Failure by Disaster: (1) Hanshin/Awaji Earthquake (1995)
Hurricane hits Louisiana on Aug 29, 2005
135MPH winds; 20-foot storm surge sent inland; 55-foot surges logged in Gulf prior to landfall
Levee failures create secondary crisis
2.3M homes without power
1,090 fatalities in Louisiana recorded to date
Overview
1.75M lines down immediately following Katrina
Thirty-eight 9-1-1 Centers inoperable initially
1,000 cellular transmission towers out
Internet2/Abilene link from Houston to Atlanta initially out and restored on Sept. 8, 2005
The fiber-optic path on the Lake Pontchartrain Bridge went offline due to Hurricane Katrina
WiFi, WiMAX and VoIP play key role in area communications
Impact of Katrina on Telecom
4.1 Failure by Disaster: (2) Hurricane Katrina (2005)
Over 1,000 Amateur Radio operators assisted with communication into and out of New Orleans
AT&T/SBC companies deployed 140 technicians to assist in recovery and provided in-kind technology matching valued at $4M per month in services
Cisco Systems donated cash, products, technical expertise, and solution design valued at over $3M to the Red Cross, FEMA (Federal Emergency Management Agency), and shelters such as the Katrina Help Center and Community Voice Mail
BellSouth prioritized hardest hit areas and restored service in phases to New Orleans metro region
MCI (acquired by Verizon in 2006) recognized by FAA for performance in restoring Air Traffic Control systems after Katrina devastation to facilitate recovery
NORTEL matched donations up to $250K and donated equipment and services to FEMA, the U.S. Army, and the Air National Guard
Measures Taken
4.1 Failure by Disaster: (2) Hurricane Katrina (2005)
Large earthquakes hit the Luzon Strait, south of Taiwan, on 26 December 2006
Earthquake magnitude: 7.1
Seven of the nine international cables lying in the Luzon Strait, between Taiwan and the Philippines, were severed
18 faults to repair
Overview
Even though many of the cables are ring-protected, both legs pass through the Bashi Channel (the earthquake epicenter), so the cable systems suffered multiple failures that took entire cable systems out of service
No cables were available to offer restoration; operators had to wait for cable repairs
No cables had been repaired as of Jan 16, 2007
Impacts of the quake
4.1 Failure by Disaster: (3) Taiwan Earthquake (2006)
In the case of EAC, service was restored after an outage of 11 hours and 49 minutes; the other submarine cables suffered physical damage and were not fully repaired until considerably later.
4.1 Failure by Disaster: (3) Taiwan Earthquake (2006)
Impacted Cable Systems
- Six major cable systems were affected, including resilience paths/cables
- The impacted area was around 300 km by 150 km
- Traffic connecting to southern Taiwan was severely affected; communications in and out of Hong Kong and Southeast Asia were severely affected
- Traffic going through northern Taiwan to Japan was not affected

Interesting Stories During the Quake
- France Telecom (AS5511) provided temporary transit to Bharti (AS9498) from Dec 27 to Jan 5
- Indonesian routes moved to INDOSAT (AS4761, AS4795), with transit mostly from DTAG (AS3320)
- China Netcom (AS4134) temporarily used Sprint (AS1239) and DTAG (AS3320) as transits, then dropped them in favour of UUNet (AS701) and Savvis (AS3561)
- Telecom Italia (AS6762) and Cable & Wireless (AS1273) were big winners, adding Singapore Telecom (AS7473) and the Communication Authority of Thailand (AS4651) as customers
- Sprint (AS1239) reached China Telecom (AS4134) through HiNet (AS9680) and Chunghwa Telecom (AS3462), i.e., 1239 9680 3462 4134

Affected Nations
- Hong Kong, Taiwan, India, Viet Nam, China, Indonesia, Singapore, Pakistan, Thailand, Bangladesh, Malaysia, Japan, Korea (in the order in which they experienced major outages)
- Worst impacted: China, Hong Kong
- Least impacted: Korea, Japan, Malaysia

Conclusions
- The quake illustrates the fragility of the global Internet
- "Local" events can have broad impact
- Physical failures can be difficult to remedy
- Asia is particularly vulnerable
- The impact will be felt long after the repairs are complete: new business relationships, new cable systems, and renewed interest in redundancy
4.1 Failure by Disaster: (3) Taiwan Earthquake (2006)
On July 5, 2002, a submarine cable break affected the Asia Pacific Cable Network 2 (APCN-2), which connects the Philippines to the Internet.
APCN-2 is a 19,000 km underwater fiber-optic cable system that stretches from Japan to Singapore. It covers major countries and regions in Asia, including China, South Korea, Hong Kong, Japan, Malaysia, Taiwan, and the Philippines.
The failure caused a considerable slowdown of the operator's services but did not disrupt them completely. The repair was delayed by poor weather conditions; on July 16, the network was completely repaired.
4.2 Failure by Physical Damage: Submarine Cable Break of APCN-2
The net needs electricity
- Electric substations and grid damaged
- Outside plant carrier equipment not connected to the best available backup power source
- Batteries don't last a week
- Generator failures
  - Operator turned off generator to save fuel
  - Fuel delivery problems
- Cooling (HVAC) equipment power supply

Lack of Diversity & Avoidance
- SONET ring through WTC tower 1 and alternate path through WTC tower 2
- Damage to 140 West Street central office and surrounding underground infrastructure
- Backup circuit routed through same facility
4.3 Failure by Terrorist Attack (9/11) - What Didn't Work
Security & Authentication
- Dialup authentication problems: users could connect but couldn't log in, because central authentication servers were located in other regions
- Several registration/pay news web sites suspended authentication checks (as a public service and to improve performance)
- Difficulties verifying the authenticity of requests from the "government" (possible social engineering, or just FUD - fear, uncertainty and doubt)

Congestion
- Well-known news web sites initially overloaded (content was cached by other sources)
- Government web site overloaded (FBI tip site)
- Unicast (distributed and single-source) streaming news sources overloaded
- Generally a point-source problem, not a backbone capacity issue

Recommendations
- Diversity, diversity, diversity
- "Outside plant" network transport equipment should be connected to building generator(s)
- Centralized login can create a denial-of-service vulnerability during a crisis
- Pre-plan load-shedding procedures to prevent shutting off critical equipment (and specify what counts as "critical equipment")
4.3 Failure by Terrorist Attack (9/11) - What Didn't Work
Image corruption due to faulty flash memory
- Occurred on small DSLAMs made by domestic vendor "C"
- The image stored in flash was corrupted by a flash memory defect
- On boot, the device prints an error message and drops into console mode (not the normal CLI prompt)

Error message:
  --> .....Flash Boot Selected
  Copying from Flash to SDRAM... Complete!
  Starting mkflash image [len=200000]
  checksum=a04e9591, origin=717a4eb7
  Current file system: 0x1e000 - 0x200000 (1928 kBytes)
  Flash boot failed.
  Entered console ... No, or bad, ATMOS images.
  ]]]]]

Faulty Ethernet controller
- Due to a defective Ethernet controller, the device halts during boot with the message "Individual address setting"

Error message (the device halts at the last line):
  ADSL DSP Downloading
  DX6512 P82559 init
  ETHERNET POOL Found OK
  Last Alloc Addr = 0x100d0fb0, prefix=0x10
  Alloc For LAN : 0x100d15c0 - 0x100f8e90 ( 0x278d0 )
  Alloc For LAN : 0x100f8e90 - 0x100f8eb0 ( 0x20 )
  Alloc For LAN : 0x100f8eb0 - 0x100f8ec0 ( 0x10 )
  Alloc For LAN : 0x100f8ec0 - 0x100f8ec8 ( 0x8 )
  Alloc For LAN : 0x100f8ec8 - 0x100f8f0c ( 0x44 )
  Lithium PCI init
  MAC 00:90:A3:B0:57:C2
  sword : i82559 Diagnose : OK
  i82559 Configure : OK
  Individual address setting :
4.4 Failure by Hardware Fault: (1) Yahoo BB
Faulty MMU (Memory Management Unit)
- Due to a defective MMU, the device halts during boot with abnormal messages

Error message:
  ------------------------------
  1 - download image
  2 - booting
  ------------------------------
  --> .....Flash Boot Selected
  Copying from Flash to SDRAM... Complete!
  Starting mkflash image [len=200000]
  checksum OK
  Current file system: 0x10000 - 0x200000 (1984 kBytes)
  NPn
  Found valid boot information block
  Peripheral bus clock changed to 24MHz
  (NullProc): addfullmmuentry: misaligned vaddr/paddr/length 00000000/00000000/003cb400
  (NullProc): addfullmmuentry: misaligned vaddr/paddr/length 10000000/10000000/003cb400
4.4 Failure by Hardware Fault: (1) Yahoo BB
ISP Outage Explanation
This is a Post Mortem to inform you of the outage for RRRRR (the router model name).

SUMMARY OF OUTAGE:
65 IP customers down due to bad hardware on RRRRR router.
Customers off of PPPPP were down on 2 separate occasions:
- Oct 29, 1999: 8:5am - 11:41am (3 hrs & 31 min)
- Nov 1, 1999: 3:47am - 11:53am (8 hrs & 7 min)
Customers off of QQQQQ were bouncing constantly starting on October 29, 1999 and continuing through October 31, 1999.

DETAILED ACCOUNT OF TROUBLESHOOTING:
On Friday, October 29, 1999, beginning at 8:40AM EST, we began experiencing outages in New York. The problem originally appeared to be due to a circuit outage, but after several iterations of troubleshooting we discovered the problem to be with the XXXX (this ISP) router - RRRRR. Several organizations within XXXX were pulled together to attack the issue - Field Support, VVVVV, TTTTTT (names of departments in this ISP) - as well as engineers from our first, second, and third level support. SSSS, the manufacturer of our RRRRR router, also has a case open to investigate problems on this router. Several pieces of hardware were replaced, and several hours of testing between 2 different POP sites and 2 operations centers were conducted over the course of 3 days. Service was temporarily restored at different intervals during the troubleshooting process; however, multiple problems were found and had to be corrected. The last outage occurred and was restored on Monday, November 1st. We believe that all outstanding problems have been addressed, and our second and third level engineers are continuing to closely monitor the router. The case with SSSS is still open and under investigation. We are sorry for any inconvenience we have caused. If you have any further questions or concerns, feel free to contact your sales representative or contact us via e-mail at support@XXXX.net.
The above is an apology e-mail sent by a US ISP between October 29 and November 1, 1999, explaining that a problem with the router it was using caused an outage for 65 enterprise customers.
Source: http://danbricklin.com/log/ispoutage.htm
4.4 Failure by Hardware Fault: (2) US Regional ISP Outage
Problem Description
Adding a 25th Netmask to a Network Database May Cause a Software Forced Crash
A router that has acquired routes with 25 different subnet masks (/8 ... /32) within the same major network (e.g., 11.0.0.0/8, 11.0.0.0/9, ..., 11.0.0.0/32) may experience memory corruption when installing those routes into the routing table.
This memory corruption triggers a software forced crash on the router.
Workaround/Solution
To mitigate this problem and protect the network from another occurrence, customers are recommended to filter the /31 address routes with a prefix list to prevent those routes from being added to the routing table. This will limit the number of possible distinct subnet masks to 24.
This protection should be deployed both on external and internal neighbors.
The software fix for Cisco bug ID CSCdt72474 is available in Cisco IOS Software Releases 12.0(17)S1, 12.0(18)S, 12.1(8a)E, 12.1(8), 12.2(2)B, and later.
Illustration
- With routes 11.0.0.0/8, 11.0.0.0/9, 11.0.0.0/10, ..., 11.0.0.0/30, 11.0.0.0/31, 11.0.0.0/32 installed (a maximum of 25 distinct subnet masks), the system crashes.
- Filtering /31 with a distribute-list leaves 11.0.0.0/8, 11.0.0.0/9, 11.0.0.0/10, ..., 11.0.0.0/30, 11.0.0.0/32 (a maximum of 24 distinct subnet masks), which avoids the crash (see the sketch below).
- A /31 subnet is useless in most cases: 11.0.0.0/31 is the network address and 11.0.0.1/31 is the broadcast address, so a /31 subnet has no host address unless RFC 3021 ("Using 31-Bit Prefixes on IPv4 Point-to-Point Links") is implemented.
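The mask-counting logic behind the workaround can be shown with a short Python sketch (illustrative only, using the standard ipaddress module and the example prefixes above; it is not the Cisco fix itself):

  import ipaddress

  # Hypothetical received routes: 11.0.0.0/8 through 11.0.0.0/32 (25 distinct masks),
  # mirroring the example in the advisory text above.
  received = [ipaddress.ip_network(f"11.0.0.0/{plen}") for plen in range(8, 33)]

  def distinct_masks(routes):
      """Count distinct prefix lengths among routes of the same major network."""
      return len({route.prefixlen for route in routes})

  print(distinct_masks(received))   # 25 -> enough to trigger the CSCdt72474 crash

  # Emulate the recommended /31 filter (a prefix list or distribute-list on the router):
  filtered = [route for route in received if route.prefixlen != 31]
  print(distinct_masks(filtered))   # 24 -> stays below the crash threshold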
4.5 Failure by Software Bug: (1) Cisco IOS Bug in VLSM function
Outage of all NTT East core routers due to routing table overflow (May 15, 2007)
Cisco routers were the source of a major outage May 15 in an NTT network in Japan, according to an investment firm bulletin.
Between 2,000 and 4,000 Cisco routers went down for about 7 hours in the NTT East network after a switchover to backup routes triggered the routers to rewrite routing tables, according to a bulletin from CIBC World Markets. The outage disconnected millions of broadband Internet users across most of eastern Japan.
Cisco said it could not say which specific router models were involved.
"Cisco is working closely with NTT East to identify the specific cause of the outage and help prevent future occurrences," a Cisco spokesman said in an e-mailed reply. "At this time, Cisco and NTT have not determined the specific cause of the problem" (as of May 16, 2007).
NTT East and NTT West, both group companies of Japanese telecom giant Nippon Telegraph and Telephone (NTT), are in the process of finalizing their decisions on a core router upgrade, according to the report.
Report from www.NetworkWorld.com
The routing table rewrite overflowed the routing tables and caused the routers’ forwarding process to fail, the CIBC report states.
"Clearly, this failure doesn't reflect well on (Cisco) and at the very least highlights the need for two vendors," states CIBC analyst Ittai Kidron in the report. Kidron states that NTT West is evaluating Juniper core routers while East evaluates the Cisco platforms.
"That said, we don't expect the failure at NTT East to influence its decision with respect to its choice of core router vendor," Kidron states in the bulletin. "In fact, as router capacity was partly responsible for the failure, it is possible the outage could accelerate NTT's transition to Cisco's newer core router, the CRS-1."
NTT was one of the initial testers of the CRS-1 when the product was launched three years ago.
"We don't believe the decisions would change based on this event," Kidron concluded. "Juniper still remains a leading contender at NTT West and Cisco at NTT East."
Report from www.NetworkWorld.com
4.5 Failure by Software Bug: (2) NTT East Case
Nation-wide Failure of AT&T’s Frame-relay Network (1998)
A catastrophic nation-wide failure of most of AT&T’s frame-relay network.
More than 5,000 corporations were unable to complete network-based business operations such as credit-card payment.
AT&T engineers focused first on identifying and isolating the problem.
They found out that the problem was caused by a computer command to upgrade software code in one of the network switch’s circuit cards.
The upgrade was performed but malfunctioned; this created a faulty communication path, which generated a large volume of administrative messages to the other network switches.
As a result, these switches became overloaded and stopped routing data from customers' applications (the outage lasted 6 to 26 hours before the network was fully restored).
The communications to many smaller companies were left completely dead until the outage was rectified.
Summary of Outage
Detailed Account of Troubleshooting
4.6 Failure by Erroneous Software Update - AT&T Case
Bell Atlantic accidentally disconnects its customer’s Internet Service
It only took moments for a Bell Atlantic Internet employee to accidentally cripple a fellow online provider's service. But it was 27 hours before the company's Net access was fully restored.
Bell Atlantic was processing a billing change Monday when a clerical error caused it to partially cut off service to its client, Mountain.net, an ISP based in West Virginia.
"The change was supposed to be made in billing orders only. But the appropriate flags were not put on the order and it was treated as a 'disconnect' order," Harry Mitchell, a Bell Atlantic spokesman, said today. "We're working to find out why it took so long to fix."
ISPs across the country have suffered from both uncontrollable service problems and outages that could have been avoided by the three Ps: proper, prior planning.
Source: Cnet News (Jan 7, 1998), http://news.com.com/Telco+error+causes+ISP+outage/2100-1033_3-206897.html
From Cnet News
[Diagram] Bell Atlantic (Telco/NSP) -> Mountain.net (ISP in West Virginia)
- An unintended "disconnect" order was placed by Bell Atlantic
- Service was cut off to its client ISP, Mountain.net, in West Virginia
- It took 27 hours to fix
4.7 Failure by Human error
The "1.25 Internet Crisis" caused by the MS-SQL Slammer worm (January 25, 2003)

Attack on the MS-SQL buffer overflow vulnerability
- A worm attack exploiting a buffer overflow vulnerability in the SQL Monitor service of MS-SQL Server 2000 / MSDE 2000 systems generated massive UDP traffic toward random destination IP addresses, paralyzing the Korean Internet.
- The data in a 404-byte packet arriving on the SQL Monitor port (UDP 1434) exceeded the buffer length prepared in the server's system code, so the SQL Slammer worm code contained in the UDP packet infected system memory and was executed.
- Once the infected code runs, it sends traffic to UDP port 1434 of random destination IP addresses, generating 10,000-50,000 PPS depending on server performance (a rough bandwidth estimate follows below).
- When such a UDP packet reaches another server with the same SQL buffer overflow vulnerability, the infection and propagation process repeats.
- The resulting traffic prevented servers in major IDCs and DNS traffic bound overseas from being processed normally, escalating into an Internet-wide crisis.

An automated DDoS attack by a worm
- The attack behaved as an automated DDoS attack propagating between systems sharing the buffer overflow vulnerability in the SQL Monitor service.
- 1,603 servers, 40.3% of all SQL servers in the major IDCs, were infected.
- Countermeasures: access lists denying UDP 1434 traffic were applied in ISP backbones, and the worm code running in memory on infected servers (inside sqlservr.exe) was removed.
- Microsoft had already released a patch for this vulnerability; the incident would not have occurred if the servers had simply been patched.
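As a rough check on why this traffic overwhelmed IDC and backbone links, the per-server figures above (404-byte packets at 10,000-50,000 PPS) can be converted into bandwidth. The sketch below assumes 404 bytes is the on-the-wire packet size and ignores link-layer overhead:

  # Back-of-the-envelope estimate of the UDP/1434 flood from one infected server,
  # using the packet size and PPS range quoted above (link-layer overhead ignored).
  PACKET_BYTES = 404

  for pps in (10_000, 50_000):
      mbps = pps * PACKET_BYTES * 8 / 1_000_000
      print(f"{pps:>6} pps -> about {mbps:6.1f} Mbit/s per infected server")

  # With 1,603 infected servers in the major IDCs, even the low end of the range
  # adds up to tens of Gbit/s of aggregate flood traffic.
  aggregate_gbps = 1_603 * 10_000 * PACKET_BYTES * 8 / 1e9
  print(f"Aggregate at 10,000 pps each: about {aggregate_gbps:.1f} Gbit/s")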
4.8 Failure by Security Vulnerability - The "1.25 Internet Crisis"
Affected equipment: Cisco Catalyst 45XX series

Failure cause
- Routing protocol packets were handled in the low queue instead of the high queue, so routing protocol packets were lost whenever the CPU became overloaded.
- The root cause was that the interfaces had not been configured as "trust ports": the DSCP field of received routing protocol packets was ignored and overwritten with 0.

Details
- On the Catalyst 45XX, routing protocol packets such as OSPF and BGP can be marked with DSCP 6 so that they are processed with priority in the CPU's L3 Rx high queue.
- Although QoS was enabled globally, the "qos trust dscp" setting was missing on each interface, so DSCP 6 was rewritten to 0 and the packets were handled in the low queue, indistinguishable from ordinary traffic.
- When the CPU became overloaded (CPU HIGH), the control packets sitting in the low queue were dropped as well, and the BGP sessions went down.
- Three situations are possible with respect to the QoS configuration (modeled in the sketch below):

  Case | Global QoS | Trust DSCP | CPU Rx Queue | DSCP value
   1   |     No     |     -      |     Low      | preserved
   2   |     Yes    |     No     |     Low      | rewritten to 0
   3   |     Yes    |     Yes    |     High     | preserved

- Notes on the table: Global QoS must be enabled so that QoS is applied to control packets, and to honor the DSCP marking on packets sent by other routers, the "trust dscp" setting must be applied on the receiving interfaces.

Actions taken
- The failure cleared automatically: the BGP-down condition isolated the device, and once the CPU overload subsided the BGP sessions came back up normally.
- The configuration was changed so that routing protocol messages are classified into the high-priority queue.
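The queue-selection behavior summarized in the table can be expressed as a small Python sketch (illustrative model only, not Catalyst software):

  # Minimal model of the three QoS cases above: which CPU Rx queue a received
  # routing protocol packet (marked DSCP 6) lands in, and what happens to its DSCP.
  def cpu_rx_queue(global_qos: bool, trust_dscp: bool, dscp_in: int = 6):
      """Return (queue, dscp_after_ingress) for a received control packet."""
      if not global_qos:
          return "low", dscp_in      # case 1: QoS disabled; DSCP kept but not honored
      if not trust_dscp:
          return "low", 0            # case 2: untrusted port rewrites DSCP 6 -> 0
      return "high", dscp_in         # case 3: trusted port; DSCP 6 keeps its priority

  for case, (global_qos, trust_dscp) in enumerate([(False, False), (True, False), (True, True)], start=1):
      queue, dscp = cpu_rx_queue(global_qos, trust_dscp)
      print(f"case {case}: global_qos={global_qos}, trust_dscp={trust_dscp} -> queue={queue}, dscp={dscp}")

Under CPU overload, anything sitting in the low queue, including BGP keepalives, is dropped first, which is exactly the failure described above.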
4.9 Failure by Design Error - Powercomm Metro-Ethernet Network
[Chart] Network failure causes in the first half of 2005: number of incidents and affected circuits by cause (data in the table below)

Analysis of Internet (Metro-Ethernet) network failures by cause, first half of 2005:

  Cause              | Incidents | Affected circuits | Share of incidents (%)
  Overload           |    38     |       231         |   22
  Routing failure    |     9     |        54         |    5
  Line defect        |    23     |        34         |   13
  Line cut           |    23     |        59         |   13
  Equipment fault    |    52     |       428         |   30
  Power              |     4     |         8         |    2
  Operator error     |    17     |       306         |   10
  Work-related fault |     2     |         9         |    1
  Unknown            |     8     |         2         |    5
  Total              |   176     |     1,131         |  100
ISP's own assessment
- "The main causes of recent network failures are operator mistakes, overload caused by abnormal traffic, and faults in the equipment itself; together these account for more than half of all incidents and for roughly 85% of the affected circuits." (See the worked check below.)
- "Failures caused by operator mistakes are relatively infrequent (about 10% of incidents), but they account for about 30% of all affected circuits, so firm preventive measures are urgently needed in this area, and operators themselves must strive to minimize mistakes during work."

[Figure] Dacom Metro-Ethernet circuit service network
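The shares quoted in the assessment can be checked against the table above; the following small Python sketch (figures taken from that table) reproduces them:

  # Worked check of the ISP's assessment, using the H1 2005 figures from the table above.
  incidents = {"overload": 38, "routing": 9, "line defect": 23, "line cut": 23,
               "equipment": 52, "power": 4, "operator error": 17,
               "work-related": 2, "unknown": 8}
  circuits  = {"overload": 231, "routing": 54, "line defect": 34, "line cut": 59,
               "equipment": 428, "power": 8, "operator error": 306,
               "work-related": 9, "unknown": 2}

  top3 = ("operator error", "overload", "equipment")
  inc_share  = sum(incidents[c] for c in top3) / sum(incidents.values())
  circ_share = sum(circuits[c]  for c in top3) / sum(circuits.values())
  print(f"top 3 causes: {inc_share:.0%} of incidents, {circ_share:.0%} of circuits")   # ~61%, ~85%

  op_inc  = incidents["operator error"] / sum(incidents.values())
  op_circ = circuits["operator error"]  / sum(circuits.values())
  print(f"operator error alone: {op_inc:.0%} of incidents, {op_circ:.0%} of circuits")  # ~10%, ~27%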
5. Network Failure Statistics - Domestic ISP Metro-Ethernet Service Network
By the frequency of each cause (share of trouble tickets):
- Circuit: 38.64%
- Unknown: 31.82%
- Equipment: 22.73%
- Routing: 4.55%
- Operation mistake: 1.14%
- Maintenance: 1.14%

By the amount of network downtime:
- Unknown: 40.62%
- Equipment: 36.36%
- Circuit: 15.97%
- Routing: 6.68%
- Operation mistake: 0.37%

Category definitions:
- Maintenance: planned outage
- Equipment: software and hardware problems
- Circuit: link failure
- Operation mistake: human error
- Routing: peering problems between two ASs (possibly beyond the control of a single AS)

Source: "Analysis of Trouble Tickets Issued by APAN JP NOC" (Jin Tanaka, KDDI, 2003)
APAN (Asia-Pacific Advanced Network)
A large research-only network established in 1996, connecting 13 Asian countries (including Korea, Japan, China, the Philippines, Taiwan, and Thailand) with the US and Europe; provided mainly to universities in each country
Used not only for information and communication technology research but also for diverse research purposes such as agriculture, earth and ecosystem observation, and education
5. Network Failure Statistics - APAN Japan NOC
Do You Know These 8 Startling Outage Statistics?
1. Percentage of all network outages caused by natural disasters: 11%
2. Percentage of network downtime caused by natural disasters: 62%
3. Percentage caused by human error: 49%
4. Increase in telecom repair costs, 1994-2002: 133%
5. 99.5% network reliability rate, in minutes of downtime per month: 216
6. 99.99% network reliability rate, in minutes of downtime per month: 4.5 (items 5 and 6 are checked in the sketch below)
7. Ratio of how fast a bad reputation spreads to how quickly a good reputation spreads: 24:1
8. Average cost of disruption to wireless service, per hour: $4.8 million
Sources: 1-3, IEEE; 4, U.S. Census Bureau; 5-7, Telephony Online; 8, Wireless Review
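Items 5 and 6 follow directly from the definition of availability; a quick sketch, assuming a 30-day month (43,200 minutes):

  # Minutes of downtime per month implied by a given availability (30-day month assumed).
  MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes

  for availability in (0.995, 0.9999):
      downtime = (1 - availability) * MINUTES_PER_MONTH
      print(f"{availability:.2%} availability -> {downtime:.1f} minutes of downtime per month")

  # 99.50% -> 216.0 minutes (item 5); 99.99% -> 4.3 minutes, close to the 4.5 quoted in
  # item 6 (4.5 minutes corresponds to a month of roughly 31.25 days).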
5. Network Failure Statistics - "8 Startling Outage Statistics"
Failure-Proof Strategy
1. Secure a network architecture that guarantees high availability
2. Eliminate security vulnerabilities that can cause failures or service interruptions
3. Establish a disaster recovery scheme
   1) Secure geographical redundancy and load balancing
   2) Secure warm-standby or hot-standby redundancy
4. Perform sufficient advance testing before network configuration changes, capacity expansions, and software updates
5. Establish baseline information about the network and build a routine monitoring system
   1) Maintain thorough knowledge of the network configuration and of the interworking state of control-plane and data-plane protocols
   2) Prepare a systematic checklist and perform routine monitoring so that abnormal conditions, or early signs of them, can be detected early
6. Prepare and internalize response manuals for handling failures
   1) Master the essential troubleshooting skills
   2) Establish and internalize response scenarios for major failure situations
6. Lessons from Failure Examples
End of Document