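A SUSE kernel bug case: update_group_power: cpu_power
Introduction:
Five of our production servers recently went down one after another within three days. The root cause turned out to be a kernel bug in SUSE 11 SP1.
System error messages:
The system messages log reported the following errors: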
Jun 10 14:00:07 sharedbpro kernel: [ 282.962529] update_group_power: cpu_power = 3925366004
Jun 10 14:00:07 sharedbpro kernel: [ 282.962559] update_group_power: cpu_power = 3925397578
Jun 10 14:00:07 sharedbpro kernel: [ 282.965803] update_group_power: cpu_power = 3928638515
Jun 10 14:00:07 sharedbpro kernel: [ 282.966201] update_group_power: cpu_power = 3929034454
Jun 10 14:00:07 sharedbpro kernel: [ 282.966369] update_group_power: cpu_power = 3929206061
Jun 10 14:00:07 sharedbpro kernel: [ 282.966397] update_group_power: cpu_power = 3929235611
Jun 10 14:00:07 sharedbpro kernel: [ 282.966507] update_group_power: cpu_power = 3929344069
Jun 10 14:00:07 sharedbpro kernel: [ 282.966535] update_group_power: cpu_power = 3929373135
Jun 10 14:00:07 sharedbpro kernel: [ 282.969804] update_group_power: cpu_power = 3932639635
Jun 10 14:00:07 sharedbpro kernel: [ 282.970188] update_group_power: cpu_power = 3933021527
Jun 10 14:00:07 sharedbpro kernel: [ 282.970353] update_group_power: cpu_power = 3933189985
Jun 10 14:00:07 sharedbpro kernel: [ 282.970381] update_group_power: cpu_power = 3933218987
Jun 10 14:00:07 sharedbpro kernel: [ 282.970490] update_group_power: cpu_power = 3933327365
Jun 10 14:00:07 sharedbpro kernel: [ 282.970518] update_group_power: cpu_power = 3933356585
Jun 10 14:00:07 sharedbpro kernel: [ 282.973789] update_group_power: cpu_power = 3936624686
Jun 10 14:00:07 sharedbpro kernel: [ 282.974194] update_group_power: cpu_power = 3937026810
Jun 10 14:00:07 sharedbpro kernel: [ 282.974360] update_group_power: cpu_power = 3937196506
Jun 10 14:00:07 sharedbpro kernel: [ 282.974388] update_group_power: cpu_power = 3937226236
Jun 10 14:00:07 sharedbpro kernel: [ 282.974496] update_group_power: cpu_power = 3937333589
Jun 10 14:00:07 sharedbpro kernel: [ 282.974525] update_group_power: cpu_power = 3937363466
Jun 10 14:00:07 sharedbpro kernel: [ 282.977789] update_group_power: cpu_power = 3940624812
Jun 10 14:00:07 sharedbpro kernel: [ 282.978185] update_group_power: cpu_power = 3941017715
Jun 10 14:00:07 sharedbpro kernel: [ 282.978351] update_group_power: cpu_power = 3941187161
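Symptoms:
The system log fills with errors of the form "update_group_power: cpu_power = xxxxxxxx". The errors usually continue for more than 10 minutes without interruption and flood the log, so that page after page contains nothing else.
After a certain amount of time the system goes down. I logged in through iLO right away and found the console showing a black screen with the system hung. I rebooted the system immediately, started the database, and observed that everything was back to normal.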
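Because the errors appear in the log well before the machine hangs, the log can be scanned for them as an early warning. The following is a minimal Python sketch, not a tool from the original incident: the function name `summarize` and the regular expression are illustrative assumptions, written to match the line format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, for example:
# Jun 10 14:00:07 sharedbpro kernel: [ 282.962529] update_group_power: cpu_power = 3925366004
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+update_group_power: cpu_power = (?P<value>\d+)"
)


def summarize(lines):
    """Summarize update_group_power messages found in an iterable of log lines.

    Returns None when no line matches; otherwise a dict with the number of
    matches, the first and last kernel uptime stamps (in seconds), and the
    wall-clock second that contains the most matches.
    """
    uptimes = []
    per_second = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            uptimes.append(float(match.group("uptime")))
            per_second[match.group("stamp")] += 1
    if not uptimes:
        return None
    return {
        "count": len(uptimes),
        "first_uptime": uptimes[0],
        "last_uptime": uptimes[-1],
        "busiest_second": per_second.most_common(1)[0],
    }
```

To use it, pass an open log file to `summarize`, for instance the messages log (assumed here to be `/var/log/messages`; adjust the path for your system). A non-None result suggests the host is showing this symptom and should be scheduled for a kernel update before it hangs.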
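Solution:
According to the vendor's assessment, this behavior is confirmed to be a bug.
The fix is to update the kernel to a stable version, either SP2 or the latest SP1 kernel, or to upgrade the whole system to SP2.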
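Tips:
I (卤肉) want to stress one point here: as DBAs doing operations work, we should put the business first. Restore the application first, then investigate the cause. That said, any information gathering that can be done in a short time (within a minute or two) is still worth doing.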