1 背景
1.1 報警情況
最近整理筆記,打算全部遷移到EVERNOTE。整理到鎖這一部分,里邊剛好有個自己記錄下來的案例,重新整理分享下給大家。
某日中午,收到報警短信,DB死鎖異常,單分鐘死鎖120個。
死鎖的xml文件如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
|
< deadlock-list > < deadlock victim = "process810b00cf8" > < process-list > < process id = "process810b00cf8" taskpriority = "0" logused = "0" waitresource = "RID: 13:1:1541136:62" waittime = "7682" ownerId = "3396587959" transactionname = "UPDATE" lasttranstarted = "2016-01-08T12:03:51.067" XDES = "0xa99746d08" lockMode = "U" schedulerid = "41" kpid = "17308" status = "suspended" spid = "108" sbid = "0" ecid = "0" priority = "0" trancount = "2" lastbatchstarted = "2016-01-08T12:03:51.067" lastbatchcompleted = "2016-01-08T12:03:51.067" lastattention = "1900-01-01T00:00:00.067" clientapp = "Microsoft SQL Server Management Studio - 查詢" hostname = "test-server" hostpid = "1433" loginname = "xinysu" isolationlevel = "read committed (2)" xactid = "3396587959" currentdb = "13" lockTimeout = "4294967295" clientoption1 = "671098976" clientoption2 = "390200" > < executionStack > < frame procname = "adhoc" line = "7" stmtstart = "214" stmtend = "484" sqlhandle = "0x020000003acf4f010561e479685209fb09a7fd15239977c60000000000000000000000000000000000000000" > UPDATE FinanceReceiptNoRule SET NowSeqValue=@ReturnNum,ISRUNNING='0',LastWriteTime=GETDATE() WHERE IsRunning='1' AND SeqCode=@SeqCode </ frame > </ executionStack > < inputbuf > declare @SeqCode varchar(60) declare @ReturnNum bigint set @SeqCode='CGJS20160106' while(1=1) begin UPDATE FinanceReceiptNoRule SET NowSeqValue=@ReturnNum,ISRUNNING='0',LastWriteTime=GETDATE() WHERE IsRunning='1' AND SeqCode=@SeqCode end </ inputbuf > </ process > < process id = "process18fd5d8cf8" taskpriority = "0" logused = "248" waitresource = "KEY: 13:72057594040090624 (b3ade7c5980c)" waittime = "4" ownerId = "3396522828" transactionname = "user_transaction" lasttranstarted = "2016-01-08T12:03:05.310" XDES = "0x18c1db63a8" lockMode = "U" schedulerid = "57" kpid = "16448" status = "suspended" spid = "161" sbid = "0" ecid = "0" priority = "0" trancount = "2" lastbatchstarted = "2016-01-08T12:03:58.737" lastbatchcompleted = "2016-01-08T12:03:33.847" lastattention = "2016-01-08T12:03:33.850" clientapp = "Microsoft SQL Server Management Studio - 查詢" hostname = "test-server" hostpid = "1433" loginname = "xinysu" isolationlevel = "read committed (2)" xactid = "3396522828" currentdb = "13" lockTimeout = "4294967295" clientoption1 = "671090784" clientoption2 = "390200" > < executionStack > < frame procname = "adhoc" line = "6" stmtstart = "210" stmtend = "400" sqlhandle = "0x020000001b4f23368af7bba99098c10dec46585804f1b4ce0000000000000000000000000000000000000000" > Update dbo.FinanceReceiptNoRule Set [IsRunning]='1' where SeqCode=@SeqCode and IsRunning='0' </ frame > </ executionStack > < inputbuf > declare @SeqCode varchar(60) declare @ReturnNum bigint set @SeqCode='CGJS20160106' while(1=1) begin Update dbo.FinanceReceiptNoRule Set [IsRunning]='1' where SeqCode=@SeqCode and IsRunning='0' end </ inputbuf > </ process > </ process-list > < resource-list > < ridlock fileid = "1" pageid = "1541136" dbid = "13" objectname = "fin_test.dbo.FinanceReceiptNoRule" id = "lock51e8a3980" mode = "X" associatedObjectId = "72057594040025088" > < owner-list > < owner id = "process18fd5d8cf8" mode = "X" /> </ owner-list > < waiter-list > < waiter id = "process810b00cf8" mode = "U" requestType = "wait" /> </ waiter-list > </ ridlock > < keylock hobtid = "72057594040090624" dbid = "13" objectname = "fin_test.dbo.FinanceReceiptNoRule" indexname = "PK_FINANCERECEIPTNORULE" id = "lock7b2c6bc80" mode = "U" associatedObjectId = "72057594040090624" > < owner-list > < owner id = "process810b00cf8" mode = "U" /> </ owner-list > < waiter-list > < waiter id = "process18fd5d8cf8" mode = "U" requestType = "wait" /> </ waiter-list > </ keylock > </ resource-list > </ deadlock > </ deadlock-list > |
表格結構跟模擬數據如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
--涉及表格: CREATE TABLE [dbo].[FinanceReceiptNoRule]( [SeqCode] [ varchar ](60) NOT NULL , [NowSeqValue] [ bigint ] NULL , [SeqDate] [ varchar ](14) NOT NULL , [IsRunning] [ varchar ](1) NULL , [LastWriteTime] [datetime] NULL , [Prefix] [ varchar ](4) NULL ) ON [ PRIMARY ] GO --數據模擬 INSERT [dbo].[FinanceReceiptNoRule] ([SeqCode], [NowSeqValue], [SeqDate], [IsRunning], [LastWriteTime], [Prefix]) VALUES (N 'TEST20150108' , 1469, N '20150108' , N '0' , CAST (N '2015-01-08 05:05:49.163' AS DateTime), N 'TEST' ) GO INSERT [dbo].[FinanceReceiptNoRule] ([SeqCode], [NowSeqValue], [SeqDate], [IsRunning], [LastWriteTime], [Prefix]) VALUES (N 'TEST20150109' , 1377, N '20150109' , N '0' , CAST (N '2015-01-09 04:50:26.610' AS DateTime), N 'TEST' ) GO ALTER TABLE [dbo].[FinanceReceiptNoRule] ADD CONSTRAINT [pk_FinanceReceiptNoRule] PRIMARY KEY NONCLUSTERED ( [SeqCode] ASC ) WITH (PAD_INDEX = OFF , STATISTICS_NORECOMPUTE = OFF , SORT_IN_TEMPDB = OFF , IGNORE_DUP_KEY = OFF , ONLINE = OFF , ALLOW_ROW_LOCKS = ON , ALLOW_PAGE_LOCKS = ON ) ON [ PRIMARY ] GO |
1.2 如何監控
捕獲死鎖有多種方式可以捕獲,這里介紹2種:SQL SERVER Profiler工具跟Extended Events。Profiler相對比較耗資源,但是由于只監控死鎖這一項,所以性能影響不是很大,其可視化界面較易上手;Extended Events耗費資源較少,實時記錄到倒數第二個死鎖,同時需要SQL語句來分析查詢記錄文件。
如何使用 Profiler監控?
打開 SSMS,點擊<工具>,選擇 <SQL Server Profiler>,如下圖。
登錄到需要監控的DB實例,填寫相應的跟蹤屬性,首先是<常規>頁面,如下圖。這里注意2個方面,第一,選擇 <TSQL-Locks>模板,這個模板即可以用來監控死鎖,也可以拿來觀察 鎖申請與釋放情況,非常詳細,有事沒事可以多拿來看SELECT UPDATE DELETE等語句對鎖的申請及釋放情況;第二,監控結果存儲,建議可以存放到某個表格中去,方便定期分析與統計。
接著填寫<事件選擇>項,只需要選擇 <deadlock graph> Events,其他都不需要打勾,最后點擊運行就可以開始監控了。
可以用一個萬年常用的例子來檢查是否監控正常,開3個查詢窗口,按照以下順序執行則會發生資源占用及申請互斥導致死鎖,執行完第5步,等待1-3s則發生死鎖。腳本提供如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
--session 1 CREATE TABLE Test_DL( id int not null primary key , name varchar (100)); INSERT INTO Test_DL(id, name ) select 1, 'a' ; INSERT INTO Test_DL(id, name ) select 2, 'b' ; --session2 2 2 2 2 2 2 2 2 2 BEGIN TRANSACTION UPDATE Test_DL SET Name = 'a-test' WHERE ID=1 --session3 3 3 3 3 3 3 3 3 3 BEGIN TRANSACTION UPDATE Test_DL SET Name = 'b-test' WHERE ID=2 --session2 2 2 2 2 2 2 2 2 2 SELECT * FROM Test_DL WHERE ID=2 --session3 3 3 3 3 3 3 3 3 3 SELECT * FROM Test_DL WHERE ID=1 模擬死鎖SQL |
監控到的死鎖界面如下:
如何使用Extended Events監控?
建立擴展事件監控的腳本如下:(擴展事件很贊,2012版支持可視化操作,感興趣的可以上 MSDN了解:https://msdn.microsoft.com/zh-cn/library/bb630282.aspx,本文就不分析語法等知識點了)
1
2
3
4
5
6
|
CREATE EVENT SESSION [DeadLock] ON SERVER ADD EVENT sqlserver.xml_deadlock_report ADD TARGET package0.event_file( SET filename=N 'F:eventsdeadlockdeadlock.xel' ,max_file_size=(20)), ADD TARGET package0.ring_buffer( SET max_events_limit=(100),max_memory=(10240),occurrence_number=(50)) WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY= OFF ,STARTUP_STATE= ON ) GO |
查詢SQL如下,這里需要注意:查詢是基于buffer還是基于filer分析,一般buffer存儲的個數都是有限的,比如上文我們只分配了4M存儲,file分析則是完整的,但是要看保留的文件個數。這里我們給出buffer的查詢SQL如下,file的查詢大家感興趣的可以動手寫下。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
|
DECLARE @deadlock_xml XML SELECT @deadlock_xml=( SELECT ( SELECT CONVERT (XML, target_data) FROM sys.dm_xe_session_targets st JOIN sys.dm_xe_sessions s ON s.address = st.event_session_address WHERE s. name = 'deadlock' AND st.target_name = 'ring_buffer' ) AS [x] FOR XML PATH( '' ) , TYPE ) SELECT dateadd( hour ,+6,tb.col.value( '@timestamp[1]' , 'varchar(max)' )) TimePoint, tb.col.value( '(data/value/deadlock/process-list/process/executionStack/frame)[1]' , 'VARCHAR(MAX)' ) statement_parameter_k, tb.col.value( '(data/value/deadlock/process-list/process/executionStack/frame)[2]' , 'VARCHAR(MAX)' ) statement_k, tb.col.value( '(data/value/deadlock/process-list/process/executionStack/frame)[3]' , 'VARCHAR(MAX)' ) statement_parameter, tb.col.value( '(data/value/deadlock/process-list/process/executionStack/frame)[4]' , 'VARCHAR(MAX)' ) [statement], tb.col.value( '(data/value/deadlock/process-list/process/@waitresource)[1]' , 'VARCHAR(MAX)' ) waitresource_k, tb.col.value( '(data/value/deadlock/process-list/process/@waitresource)[2]' , 'VARCHAR(MAX)' ) waitresource, tb.col.value( '(data/value/deadlock/process-list/process/@isolationlevel)[1]' , 'VARCHAR(MAX)' ) isolationlevel_k, tb.col.value( '(data/value/deadlock/process-list/process/@isolationlevel)[2]' , 'VARCHAR(MAX)' ) isolationlevel, tb.col.value( '(data/value/deadlock/process-list/process/@waittime)[1]' , 'VARCHAR(MAX)' ) waittime_k, tb.col.value( '(data/value/deadlock/process-list/process/@waittime)[2]' , 'VARCHAR(MAX)' ) waittime, tb.col.value( '(data/value/deadlock/process-list/process/@clientapp)[1]' , 'VARCHAR(MAX)' ) clientapp_k, tb.col.value( '(data/value/deadlock/process-list/process/@clientapp)[2]' , 'VARCHAR(MAX)' ) clientapp, tb.col.value( '(data/value/deadlock/process-list/process/@hostname)[1]' , 'VARCHAR(MAX)' ) hostname_k, tb.col.value( '(data/value/deadlock/process-list/process/@hostname)[2]' , 'VARCHAR(MAX)' ) hostname FROM @deadlock_xml.nodes( '//event' ) as tb(col) |
這個SQL可以查詢的出非常詳細的資源爭奪情況,如果想要有效的使用擴展事件,建議大家詳細查看下官網的xml語法(SQL SERVER對xml的支持也是棒棒噠,期待2016版中的json支持)
是不是很清晰,一目了然,有了這個就可以去分析拉!
2 分析
根據xml文件內容或者擴展事件的監控內容,都可以整理為以下信息(開頭的那個死鎖分析):
查看事務1及事務2的執行計劃如下:
結合表格及執行計劃,可以大致推測死鎖過程:
會話1:
- 根據主鍵SeqCode查找到鍵值所在的 索引頁 Index_Page,找到該頁上面的 keyhashvalue 鍵值行 Index_key,對Index_Page持有IU鎖,對Index_key持有U鎖;
- 由于該表是堆表,bookmark lookup是通過 RID查找 ,即通過行標識符查找,找到RID所對應的行數據所在的 數據頁 Data_Page,然后在該頁面上找到RID指向槽號上的行數據,對該行數據持有U鎖;
- 這個時候,已經查找到了需要更新的行數據,可以把數據頁 Data_Page上的IU鎖 升級為IX鎖,RID指向的行數據 從U鎖升級為X鎖,升級結束后,釋放索引頁跟鍵值行上面的 IU鎖及U鎖。
- 則此時,會話1 持有 Data_Page 上的IX鎖、RID行上的 X鎖.
這個過程中,剛好會話2進行這樣的鎖申請:
-
找出事務2中持有鎖資源
是哪個索引,可以根據sys.partitions 可以查看到72057594038910976是主鍵pk_FinanceReceiptNoRule,主鍵列是:SeqCode。
- 根據主鍵SeqCode查找到鍵值所在的 索引頁 Index_Page,找到該頁上面的 鍵值行 Index_key,對Index_Page持有IU鎖,對Index_key持有U鎖;
- 由于該表是堆表,bookmark lookup是通過 RID查找 ,即通過行標識符查找,找到RID所對應的行數據所在的 數據頁 Data_Page,然后在該頁面上找到RID指向槽號上的行數據,準備該行數據持有U鎖,但是發現RID行上被會話1持有了X鎖,導致其申請 U鎖 Timeout。
- 則此時 會話2 持有 Index_Page上的IU鎖、Index_key上的U鎖、Data_Page上的IU鎖,請求 RID行的 U鎖。
假設這個時候,會話1 中又執行了一次update操作(同一個事務中):
根據主鍵SeqCode查找到鍵值所在的 索引頁 Index_Page,找到該頁上面的 鍵值行 Index_key,對Index_Page持有IU鎖,準備對Index_key持有U鎖,但是發現 Index_key被會話2持有了U鎖。
那么這個時候死鎖就產生了(詳見下圖):
- 會話1 持有 Data_Page 上的IX鎖、RID行上的 X鎖,申請 Index_key 的U鎖(等待會話2釋放)
- 會話2 持有 Index_Page上的IU鎖、Index_key上的U鎖、Data_Page上的IU鎖,請求 RID行的 U鎖(等待會話1釋放)
3 解決
想法子除去RID查找,直接index就找到數據,就不會發生這個死鎖,也就是,在主鍵上面重新建立聚集索引,丟棄原先的非聚集索引主鍵。因為這樣排除了RID的U鎖申請與持有,直接是保持X鎖 直至事務結束,同時可以直接根據主鍵來修改鍵值所在的數據頁,減少的RID查詢行的時間。
修改后的執行計劃如下:
其鎖申請釋放的流程如下(詳見截圖):
- 根據主鍵SeqCode查找到鍵值所在的 索引頁 Index_Page,找到該頁上面的 keyhashvalue 鍵值行 Index_key,對Index_Page持有IU鎖,對Index_key持有U鎖;
- 由于該表已經是聚集索引表,主鍵所在的頁上包含 行數據,則可以直接 對Index_Page持有IU鎖升級為IX鎖,對Index_key持有U鎖升級為X鎖,避免了RID逐個找行數據的鎖申請
以上就是本文的全部內容,希望本文的內容對大家的學習或者工作能帶來一定的幫助,同時也希望多多支持服務器之家!