Transcript
Netmanias Technical Document: IP Network Failure Recovery Techniques [2] Traditional IP Restoration
March 15, 2009
NMC Consulting Group(tech@netmanias.com)
Traditional IP Restoration Link Failure: (1) IS-IS Rerouting procedure
[Figure: topology of ER1, R1-R7 and ER2, showing ① the link failure and ② the Link State PDUs (LSPs) flooded as a result]
1. Detection of link failure
   The router detects the change in link state. A topology change is detected:
   - by reception of a new LSP
   - by being notified of loss of link signal
   Multi-Path Routing (IRR): local repair is possible where ECMP is implemented.
2. IS-IS LSP flooding
3. Scheduling SPF calculation
   On reception of an LSP, a router 1) floods the LSP, then 2) checks whether an update is needed and schedules the SPF calculation. How many seconds or milliseconds later the calculation is scheduled is implementation dependent; the default of Cisco's "spf-interval" is 5.5 s.
   Reasons for scheduling the SPF calculation:
   - To prevent the SPF calculation from being repeated too often because of LSPs triggered by an unstable link or node.
   - To handle two or more simultaneous topology changes in a single calculation.
   - When a node reboots, the newly booted router and each of its neighboring routers generate LSPs for the new links.
4. SPF calculation & updating RIB
   RIB: NHOP (210.10.1/24) = R6
[Figure: Hyehwa, Dunsan, Guro and Daegu nodes with link metrics of 1 and 3]
Elapsed time: 5.5 s + α = 6 s or more
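The batching effect of the SPF delay can be illustrated with a minimal Python sketch. It assumes a simplified model in which the first LSP received while no SPF is pending arms a single timer, and every LSP arriving before that timer fires is absorbed into the same SPF run. The function name `schedule_spf_runs` and the arrival times are hypothetical; actual implementations differ.

```python
def schedule_spf_runs(lsp_arrivals_ms, spf_delay_ms=5500):
    """Return the times (ms) at which SPF would run in a simplified model.

    Assumption: the first LSP received while no SPF is pending arms a timer
    of spf_delay_ms; LSPs arriving before the timer fires are batched into
    that run. This is an illustration, not a vendor algorithm.
    """
    runs = []
    pending_until = None  # time at which the currently scheduled SPF fires
    for t in sorted(lsp_arrivals_ms):
        if pending_until is None or t > pending_until:
            pending_until = t + spf_delay_ms
            runs.append(pending_until)
    return runs


# Hypothetical arrivals: two LSPs caused by one link failure arrive 30 ms
# apart, then a third LSP arrives 9 s later.
arrivals = [0, 30, 9000]
spf_runs = schedule_spf_runs(arrivals)
print(spf_runs)
```

In this model the first two LSPs cost one SPF run instead of two, at the price of waiting the full 5.5 s before the RIB changes.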
Traditional IP Restoration Link Failure: (2) Failover in PIM-SM
5. The PIM process detects the unicast topology change
   [Figure: route processor (RP) holding IS-IS, RIB, PIM-SM and TIB, and line card holding FIB and MFIB, annotated with steps ①-⑤]
   ① The PIM process is notified of the RIB change.
   ② The RPF update is performed 500 ms later.
   ③ If the upstream neighbor has changed, a PIM Join is sent to the new upstream neighbor.
   ④ If the upstream neighbor of a group has changed, the change is recorded in the TIB.
   ⑤ The changed TIB contents are written to the MFIB.
   - Polling: before Cisco IOS 12.0(22)S. The RPF is updated by polling every 5 seconds.
   - Trigger: Cisco IOS 12.0(22)S or higher. The RPF is updated within 500 ms ("Sub-second convergence"). Supported on Cisco 10000/12000/7200/7500.
6. Send PIM Join to new upstream router
   Once the RIB has been updated, PIM consults it to update the RPF (Reverse Path Forwarding) path, so that the PIM Join is sent through the interface closest to the multicast source.
   [Figure: topology of ER1, R1-R7 and ER2 with {Src=10.10.10.1, Group=225.1.1.1}, the Listener and the RP]
   ① R3 sends a PIM Join to its new upstream neighbor, R6. R4 also sends a PIM Join to its new upstream neighbor, R6.
   ② R6 sends a PIM Join to the RP (R2).
7. Multicast stream flows again (duplicated, though)
   The multicast stream is duplicated at R4.
8. Unnecessary stream is pruned (by sending PIM Prune)
   R4 forwards only the multicast stream received on the interface that lies on the shortest path from the source, and discards the copy sent by R3.
The resulting IS-IS + PIM convergence time is 7 s or more (6 s + 500 ms + a small PIM PDU propagation delay).
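The RPF update in steps ③ and ④ can be sketched in Python as follows. This is a simplified illustration, not router code: the TIB is modeled as a dictionary keyed by (source, group), the unicast RIB lookup is a plain callable, and the old upstream neighbor `R2` is assumed for the example. The name `rpf_update` is hypothetical.

```python
def rpf_update(tib, rpf_lookup):
    """Recompute the RPF (upstream) neighbor for every (source, group) entry.

    tib        : dict mapping (source, group) -> current upstream neighbor
    rpf_lookup : callable returning the next hop toward an address, standing
                 in for a unicast RIB lookup (hypothetical interface)
    Returns (new_tib, joins), where joins lists (new_upstream, source, group)
    for each entry whose upstream neighbor changed.
    """
    new_tib = {}
    joins = []
    for (source, group), old_upstream in tib.items():
        new_upstream = rpf_lookup(source)
        new_tib[(source, group)] = new_upstream
        if new_upstream != old_upstream:
            # Upstream changed: a PIM Join must go to the new neighbor.
            joins.append((new_upstream, source, group))
    return new_tib, joins


# View from R3: the old upstream (R2) is an assumption for illustration;
# after SPF the RIB points toward R6.
tib = {("10.10.10.1", "225.1.1.1"): "R2"}
rib_after_spf = {"10.10.10.1": "R6"}
new_tib, joins = rpf_update(tib, rib_after_spf.get)
print(joins)
```

Entries whose upstream neighbor is unchanged produce no Join, which keeps the PIM signaling proportional to the number of affected groups.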
PIM Recovery Procedure (Link Failure Case)
[Figure: routers R1, R2, R3 and R6, each shown with a route processor (IS-IS, RIB, PIM-SM, TIB) and a line card (FIB, MFIB), with the multicast stream flowing between them]
Legend: RIB: Routing Information Base; TIB: Tree Information Base; FIB: Forwarding Information Base; MFIB: Multicast Forwarding Information Base; RP: Route Processor
PIM Recovery Procedure (Link Failure Case) (cont)
[Figure: same four-router diagram (R1, R2, R3, R6)]
1) Link Failure
PIM Recovery Procedure (Link Failure Case) (cont)
[Figure: same four-router diagram (R1, R2, R3, R6), annotated with steps 1 to 7]
1) Link Failure
2) Failure Detection
3) Notification timer (Carrier delay)
4) Notify IS-IS protocol stack
5) LSP generation timer (lsp-gen-interval)
6) LSP flooding
7) SPF Delay (spf-interval)
In the figure, steps 2) to 6) appear twice and step 7) four times.
PIM Recovery Procedure (Link Failure Case) (cont)
[Figure: same four-router diagram (R1, R2, R3, R6)]
1) Link Failure
8) SPF Calculation/RIB Update
PIM Recovery Procedure (Link Failure Case) (cont)
[Figure: same four-router diagram (R1, R2, R3, R6) with interfaces pos1, pos2 and pos3, PIM Join messages and a PIM Prune message]
9) Trigger RPF Update
10) MFIB update
11) MFIB update
12) MFIB update
MFIB entries shown in the figure:
Step | Multicast Src IP | Group | Incoming i/f | Outgoing interface list
10) | 1.1.1.1 | 224.1.1.10 | (pos1->) pos2 | pos3
11) | 1.1.1.1 | 224.1.1.10 | pos1 | pos2
(unnumbered) | 1.1.1.1 | 224.1.1.10 | pos1 | pos2, pos3
12) | 1.1.1.1 | 224.1.1.10 | pos1 | pos3
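The MFIB changes listed for this slide (a new incoming interface, an interface added to the outgoing list by a Join, an interface removed by a Prune) can be modeled with a small Python sketch. The class `MfibEntry` and its method names are hypothetical, and which router holds which entry is not recoverable from the slide, so the sequence below is only an illustration built from the interface names shown.

```python
from dataclasses import dataclass, field


@dataclass
class MfibEntry:
    source: str
    group: str
    iif: str                               # incoming (RPF) interface
    oil: set = field(default_factory=set)  # outgoing interface list

    def change_rpf_interface(self, new_iif):
        """Move the incoming interface; it must not remain in the OIL."""
        self.iif = new_iif
        self.oil.discard(new_iif)

    def add_downstream(self, interface):
        """A PIM Join arrived on `interface`: start forwarding there."""
        if interface != self.iif:
            self.oil.add(interface)

    def prune_downstream(self, interface):
        """A PIM Prune arrived on `interface`: stop forwarding there."""
        self.oil.discard(interface)


# Router whose RPF interface moves from pos1 to pos2 (hypothetical).
downstream = MfibEntry("1.1.1.1", "224.1.1.10", "pos1", {"pos3"})
downstream.change_rpf_interface("pos2")

# Router that gains pos3 through a Join and later loses pos2 through a Prune.
upstream = MfibEntry("1.1.1.1", "224.1.1.10", "pos1", {"pos2"})
upstream.add_downstream("pos3")
after_join = sorted(upstream.oil)
upstream.prune_downstream("pos2")
print(downstream.iif, sorted(downstream.oil), after_join, sorted(upstream.oil))
```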
Restoration Time for a single link failure
1) Link Failure
2) Failure Detection: 10s of msec
3) Notification timer (Carrier delay): default value link up 12 sec, link down 2 sec
4) Notify IS-IS protocol stack
5) LSP generation timer (lsp-gen-interval): default value 50 msec
6) LSP flooding
7) SPF Delay (spf-interval): default value 5.5 sec
8) SPF Calculation/RIB Update: 100-400 msec
9) Trigger RPF Update: default value 500 msec
10) MFIB update
[Figure: routers R1, R2, R3 and R6 annotated with the step numbers above, PIM Join messages and MFIB updates]
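Adding up the delays that the slide states explicitly gives a rough restoration budget. The Python sketch below uses only the listed default values; steps without a number on the slide (notifying IS-IS, LSP flooding, MFIB update) and the "10s of msec" failure detection are left out, so the result is a lower bound. The dictionary and function names are hypothetical.

```python
# Per-step delays in milliseconds as (low, high), taken from the slide.
# Steps with no stated value are omitted, so the totals are a lower bound.
STEPS_MS = {
    "carrier delay (link down)": (2000, 2000),
    "lsp-gen-interval": (50, 50),
    "spf-interval": (5500, 5500),
    "SPF calculation / RIB update": (100, 400),
    "RPF update trigger": (500, 500),
}


def restoration_budget_ms(steps):
    """Sum the low and high bounds of every step."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low, high = restoration_budget_ms(STEPS_MS)
print(f"{low / 1000:.2f} s to {high / 1000:.2f} s")
```

With these defaults the stated timers alone add up to between 8.15 s and 8.45 s, of which the SPF delay accounts for 5.5 s and the carrier delay for 2 s.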
Restoration Time for a single link failure (cont)
Name of time component | Description | Typical value
① Failure Detection time | Time elapsed to detect failure | Immediately (a few ms)
② LSP Flooding time | Process LSP, bundle LSPs and flooding | N x 10 ms (roughly 10 ms per node)
③ SPF Delay | Minimum time between LSP reception and start of SPF calculation | 5.5 s (implementation dependent; Cisco's default is 5.5 s)
④ SPF calculation time | Time taken by the SPF calculation | A few ms to tens of ms (depends on the number of nodes and routing entries)
⑤ FIB Update time | From end of LSP processing to end of new routes installation | 20 entries/ms (measured on a Cisco GSR)*
⑥ RPF Update delay | Delay between the unicast RIB update and the RPF check of the TIB | 500 ms (Cisco's default)
⑦ RPF Update time | Time taken to update the RPF neighbors in the TIB | A few ms to tens of ms (should not be large with about 200 multicast channels)
⑧ PIM PDU processing delay | Delay in processing and forwarding PIM PDUs | 10s of ms per node
⑨ MFIB update time | MFIB update time | 10s of ms**
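Two rows of the table are rates rather than fixed times, so their contribution depends on the network. A minimal Python sketch of the arithmetic follows; the function names and the example network size (4 nodes, 10,000 changed routes) are hypothetical.

```python
def lsp_flooding_time_ms(nodes, per_node_ms=10):
    """Flooding time across `nodes` routers at roughly 10 ms per node."""
    return nodes * per_node_ms


def fib_update_time_ms(entries, entries_per_ms=20):
    """Time to install `entries` routes at 20 entries per millisecond."""
    return entries / entries_per_ms


# Hypothetical network: an LSP crosses 4 routers and 10,000 routes change.
flooding = lsp_flooding_time_ms(4)
fib_update = fib_update_time_ms(10_000)
print(flooding, fib_update)
```

Both values stay far below the 5.5 s SPF delay, which is why the delay timer dominates the total.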
Traditional IP Restoration Node Failure: (1) IS-IS Rerouting procedure
1. Detection of node failure: by expiration of the "Holding time"
   [Figure: topology of ER1, R1-R7 and ER2 exchanging Hellos; R3 is absent from the panels after the failure]
   - Default IS-IS holding time = 30 sec (Cisco): Hello interval 10 s x Hello multiplier 3 = 30 s.
   - Juniper, point-to-point: Hello interval 9 s, holding time 27 s.
   - If no Hello is received within the holding time (or holdtime), the adjacency formed between the neighbors is torn down.
   - Worst case of a node failure: the RP's CPU halts and, at the same time, the line cards stop forwarding data.
2. LSP flooding
3. Schedule SPF calculation
   On receiving an LSP, a router 1) floods the LSP, then 2) checks whether an update is needed and schedules the SPF calculation. How many seconds or milliseconds later the calculation is scheduled is implementation dependent; the default of Cisco's "spf-interval" is 5.5 s.
   Reasons for scheduling the SPF calculation:
   - The timer prevents the SPF calculation from being repeated too often because of LSPs triggered by an unstable link or node.
   - To handle two or more simultaneous topology changes in a single calculation.
   - When a node reboots, the newly booted router and each of its neighboring routers generate LSPs for the new links.
4. SPF calculation & updating RIB
   RIB: NHOP (210.10.1/24) = R6
   [Figure: link metrics of 1 and 3]
Elapsed time: 36 s or more
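The detection time for a node failure follows directly from the Hello timers. The Python sketch below reproduces the arithmetic; the function names are hypothetical, the Juniper multiplier of 3 is inferred from the stated 9 s interval and 27 s holding time, and the total assumes the worst case in which the full holding time elapses before the SPF delay starts.

```python
def holding_time_s(hello_interval_s, hello_multiplier=3):
    """Holding time = Hello interval x Hello multiplier."""
    return hello_interval_s * hello_multiplier


def node_failure_reroute_time_s(hello_interval_s, hello_multiplier=3,
                                spf_delay_s=5.5):
    """Worst-case detection (full holding time) plus the SPF delay."""
    return holding_time_s(hello_interval_s, hello_multiplier) + spf_delay_s


cisco_hold = holding_time_s(10)                  # Cisco defaults from the slide
juniper_hold = holding_time_s(9)                 # multiplier 3 is inferred
cisco_total = node_failure_reroute_time_s(10)    # before LSP flooding and SPF run time
print(cisco_hold, juniper_hold, cisco_total)
```

The Cisco total of 35.5 s, plus flooding and SPF computation, corresponds to the "36 s or more" figure.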
Traditional IP Restoration Node Failure: (2) Failover in PIM-SM
5. The PIM process detects the unicast topology change
   [Figure: topology of ER1, R1-R7 and ER2 with {Src=10.10.10.1, Group=225.1.1.1}, the Listener and the RP; R3 is absent after the failure]
   [Figure: route processor (RP) holding IS-IS, RIB, PIM-SM and TIB, and line card holding FIB and MFIB, annotated with steps ①-⑤]
   ① The PIM process is notified of the RIB change.
   ② The RPF update is performed 500 ms later.
   ③ If the upstream neighbor has changed, a PIM Join is sent to the new upstream neighbor.
   ④ If the upstream neighbor of a group has changed, the change is recorded in the TIB.
   ⑤ The changed TIB contents are written to the MFIB.
   - Polling: before Cisco IOS 12.0(22)S. The RPF is updated by polling every 5 seconds.
   - Trigger: Cisco IOS 12.0(22)S or higher. The RPF is updated within 500 ms ("Sub-second convergence"). Supported on Cisco 10000/12000/7200/7500.
6. Send PIM Join to new upstream router
   Once the RIB has been updated, PIM consults it to update the RPF (Reverse Path Forwarding) path, so that the PIM Join is sent through the interface closest to the multicast source.
7. Multicast stream flows again
The resulting IS-IS + PIM convergence time is 37 s or more (36 s + 500 ms + a small PIM PDU propagation delay).
PIM Recovery Procedure (Node Failure Case)
[Figure: routers R1, R2, R3 and R6, each shown with a route processor (IS-IS, RIB, PIM, TIB) and a line card (FIB, MFIB), exchanging IS-IS Hellos and carrying the multicast stream]
A node failure occurs at R2. Both the control plane and the data plane are assumed to have stopped working (worst case of a node failure).
R1, R3 and R6 detect the neighbor failure (30 sec) and then follow the same procedure:
5) LSP generation timer (lsp-gen-interval): default value 50 msec
6) LSP flooding
7) SPF Delay (spf-interval): default value 5.5 sec
8) SPF Calculation/RIB Update: 100-400 msec
9) Trigger RPF Update: default value 500 msec
10) MFIB update
Current KORNET
Case: the primary RP fails and the router switches over to the backup RP (dual RP).
[Figure: routers R1, R2, R3 and R6 carrying the multicast stream, each with a route processor (IS-IS, RIB, PIM, TIB) and a line card (FIB, MFIB). One router has a primary RP (P-RP) and a backup RP (B-RP), both running IS-IS and PIM. Labels: "Sync" (twice) and "Empty" (on a RIB and a TIB).]
Current KORNET (cont)
Case: the primary RP fails and the router switches over to the backup RP (dual RP).
[Figure: same dual-RP diagram (R1, R2, R3, R6), annotated with the labels below]
- P-RP failure detected
- flush (shown twice)
- New Hello (shown three times)
- The neighbors' adjacencies are torn down (shown three times)
- LSP generation timer (lsp-gen-interval) (shown three times)
- LSP flooding (shown three times)
Timings shown: 10 msec, 10 msec, 50 msec, 10 msec
Current KORNET (cont)
Case: the primary RP fails and the router switches over to the backup RP (dual RP).
[Figure: same dual-RP diagram (R1, R2, R3, R6), annotated with steps 7 to 12 and PIM Join messages]
7) SPF Delay (spf-interval)
8) SPF Calculation/RIB Update
9) Trigger RPF Update
10) MFIB update
11) MFIB update
12) MFIB update
MFIB entries shown in the figure:
Step | Multicast Src IP | Group | Incoming i/f | Outgoing interface list
10) | 1.1.1.1 | 224.1.1.10 | (pos1->) pos2 | pos3
11) | 1.1.1.1 | 224.1.1.10 | pos1 | pos2
(unnumbered) | 1.1.1.1 | 224.1.1.10 | pos1 | pos2, pos3
Timings shown: 10 msec, 10 msec, 50 msec, 10 msec, 5.5 sec, 100-400 msec, 10 msec, 10 msec, 10 msec
Self Protection
During SPF calculation (100-400 msec), the CPU utilization jumps to 100%.
[Figure: two plots of CPU load (up to 100%) against time t, with a utilization peak at each SPF run triggered by LSP arrivals. Without an SPF interval: fast convergence but unstable. With an SPF interval between runs: slow convergence but stable.]
Running the SPF calculation and the RIB update every time an LSP arrives puts a heavy load on the CPU.
The SPF interval deliberately spaces out consecutive SPF calculations to protect the router control plane.
Cisco SPF Interval vs. Juniper SPF Delay
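Cisco IOS: exponential interval behavior
   spf-interval <MaxInt> [<InitWait> <Inc>]
   <MaxInt>: maximum number of seconds between SPF runs
   <InitWait>: milliseconds between the first trigger and the SPF run (default value = 5.5 sec)
   <Inc>: milliseconds between the first and second SPF runs
   ex. spf-interval 5 200 1000
   [Figure: SPF runs after waits of 200 msec, 1000 msec, 2000 msec and 4000 msec]
Juniper: hold-down behavior with two modes, fast and slow (3x short, after that long)
   ex. spf-delay 200
   [Figure: three SPF runs after 200 msec each, then an SPF run after 5000 msec]
   For the first network topology changes, Juniper calculates SPF quickly and updates the RIB, which shortens the convergence time. When topology changes keep arriving persistently, it delays the SPF calculation by 5 seconds for self protection.
   The reasoning is as follows. Link failures make up 99% of cases and require SPF to be calculated for two LSPs. These two LSPs arrive at the router within a very small time window, so one or two SPF calculations are enough, and that much work is acceptable.
   Node failures make up the remaining 1% and generate many LSPs (as many as there are adjacent routers). These LSPs arrive at the router spread out over time, so in this case the SPF calculation is deferred by 5 seconds for self protection.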
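The two back-off behaviors can be compared with a short Python sketch. It models only what the slide's diagrams show: for Cisco, the first wait is `<InitWait>`, the second is `<Inc>`, and each later wait doubles until it reaches `<MaxInt>`; for Juniper, three fast runs are followed by a 5 s hold-down. The function names are hypothetical, and when either vendor resets to the fast mode is not modeled.

```python
def cisco_spf_waits_ms(max_interval_s, initial_wait_ms, increment_ms, runs):
    """Waits before successive SPF runs under exponential back-off.

    Simplified model: the first wait is initial_wait_ms, the second is
    increment_ms, and each later wait doubles, capped at max_interval_s.
    """
    cap_ms = max_interval_s * 1000
    waits = []
    for n in range(runs):
        if n == 0:
            wait = initial_wait_ms
        else:
            wait = increment_ms * 2 ** (n - 1)
        waits.append(min(wait, cap_ms))
    return waits


def juniper_spf_waits_ms(spf_delay_ms, runs, rapid_runs=3, holddown_ms=5000):
    """Waits under the two-mode behavior: a few fast runs, then hold-down."""
    return [spf_delay_ms if n < rapid_runs else holddown_ms
            for n in range(runs)]


cisco = cisco_spf_waits_ms(5, 200, 1000, runs=6)   # spf-interval 5 200 1000
juniper = juniper_spf_waits_ms(200, runs=5)        # spf-delay 200
print(cisco)
print(juniper)
```

Under this model Cisco slows down gradually while Juniper switches abruptly from the fast to the slow mode after the third run.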
If the SPF Interval Is Reduced
[Figure: topology of R1-R11 flooding LSPs originated by R1, R2, R3, R5, R7 and R10]
Default value = 5.5 sec: slow but stable.
Parameter tuning: spf-interval 10s 10ms 500ms
[Figure: SPF runs at 10 msec, 500 msec, 1 sec and 2 sec]
Consequences shown:
- CPU load (router control plane stability problem)
- PIM re-routing (multicast network stability problem)
* SPF Calculation: 100-400 msec
End of Document