Hi,
We have several core dumps in our product. They are reproducible and
always occur in the same place: the library call
std::basic_istream<char,std::char_traits<char>>::getline. The pstack
output for the core dump is:
% pstack core | c++filt
core 'core' of 12214: ../bin/QBE_V5 -X 30017
ffffffff7b944318 __type_0 std::__find_if<const
(__type_0,__type_0,__type_1,const std::random_access_iterator_tag&) (ffffffff7a000bc7, ffffffff7a0014fd, a00000000000001, ffffffff7a0014fd, a, 936) + 20
ffffffff7b952c18 long
,std::_Scan_for_char_val<std::char_traits<char> > >(std::basic_istream<__type_0,__type_1>*,std::basic_streambuf<__type_0,__type_1>*,long,__type_0*,__type_2,__type_3,bool,bool,bool) (ffffffff7a000bc7, 1001e7170, 3ff, 100243bd0, 0, a00000000000005) + 84
ffffffff7b9537c4 std::istream &std::istream::getline(char*,long,char)
(1001e7160, 100243bd0, 400, a, 1001e7160, 0) + 7c
ffffffff7e6b905c int Service::readObj(std::ifstream
&,XEPersistentObj*&) (1001e7160, ffffffff7ffff2a8, 2, 188d04,
ffffffff7bac93b8, 10) + 54
000000010001df08 int initialize(unsigned) (1001ce030, 0, 1001ce1d0,
1001ce1d0, 3400, 1) + 660
000000010002c37c main (3, ffffffff7ffffac8, ffffffffffecb898,
1001bb020, 1001959e8, 134400) + ac
000000010001b6dc _start (0, 0, 0, 0, 0, 0) + 17c
When we debug it with dbx, dbx reports a SIGBUS with the description
"object specific hardware error". The dbx output is:
t@null (l@1) program terminated by signal BUS (object specific hardware error)
0xffffffff7b944318: __find_if+0x0020: ldsb [%o4], %o0
(dbx) regs
current thread: t@null
current frame: [1]
g0-g1 0x0000000000000000 0xffffffff7b953748
g2-g3 0x0000000000000000 0x000000010022ae0c
g4-g5 0x0000000000000001 0x0000000000000936
g6-g7 0x0000000000000000 0xffffffff7de02000
o0-o1 0xffffffff7a000bc7 0xffffffff7a0014fd
o2-o3 0x000000000000000a 0xffffffff7fffebbe
o4-o5 0xffffffff7a000bc7 0x000000000000024d
o6-o7 0xffffffff7fffe221 0xffffffff7b9442e8
l0-l1 0xffffffff7de02000 0x0000000000000000
l2-l3 0x000000010023ef40 0xffffffff7b3ebec4
l4-l5 0x0000000000000000 0x0000000000000000
l6-l7 0x0000000000000001 0x0000000000000000
i0-i1 0xffffffff7a000bc7 0xffffffff7a0014fd
i2-i3 0x0a00000000000001 0xffffffff7a0014fd
i4-i5 0x000000000000000a 0x0000000000000936
i6-i7 0xffffffff7fffe3c1 0xffffffff7b952c18
y 0x0000000000000000
ccr 0x0000000000000044
pc 0xffffffff7b944318:__find_if+0x20 ldsb [%o4], %o0
npc 0xffffffff7b94431c:__find_if+0x24 cmp %o0, %o2
(dbx) examine $o4 /s
dbx: warning: unknown language, 'c' assumed
0xffffffff7a000bc7: "EngineCkptInput 99 f " ...
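The addresses in the register dump above can be checked with a little arithmetic. The following is a minimal sketch, not part of our product: the helper names are hypothetical, and the 8 KB page size is an assumption (the base page size on sun4u). It shows that the range passed to __find_if is 0x936 bytes long, that the faulting address is the first byte of that range, and that the whole range lies within one page.

```cpp
#include <cstdint>

// Values copied from the dbx register dump of the faulting frame.
const std::uint64_t kFirst = 0xffffffff7a000bc7ULL;  // %o0 and %o4: start of range, faulting address
const std::uint64_t kLast  = 0xffffffff7a0014fdULL;  // %o1: end of range

// Number of bytes in [first, last).
std::uint64_t range_length(std::uint64_t first, std::uint64_t last)
{
    return last - first;
}

// True if addr is a multiple of size. A 1-byte access is always aligned.
bool is_aligned(std::uint64_t addr, std::uint64_t size)
{
    return addr % size == 0;
}

// Start of the page containing addr; page_size must be a power of two.
// 8192 is an assumption for this machine, not something shown in the dump.
std::uint64_t page_base(std::uint64_t addr, std::uint64_t page_size)
{
    return addr & ~(page_size - 1);
}
```

For example, range_length(kFirst, kLast) gives 0x936, which matches the last argument shown for the __find_if frame in the pstack output, and page_base(kFirst, 8192) gives the page start that could be compared against the mappings of the process.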
int Service::readObj( ifstream& strm, XEPersistentObj*& retObj )
{
    char* tmp = 0;
    tmp = new char [BUFSIZ];
    strm.getline(tmp, BUFSIZ);
Looking at the code, we cannot find anything suspicious. It is purely a
call into the system library. I searched Google for similar cases and
found two links:
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread...
and
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread....
A Sun engineer said, "It's an error returned by software somewhere deep
down the VM system's hat layer; without knowledge of the mapping at the
address, how it was accessed, it's hard to tell what really is the
matter. Basically, the HAT layer is very low-level part of the virtual
memory system. HAT information describes how a memory page is mapped
on the physical side of the VM (i.e. RAM)." He also suggested, "Start
by finding out which address is giving the problem, which instruction
is using the address and how."
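One way to follow that suggestion at run time is a handler installed with sigaction and SA_SIGINFO, which receives the faulting address (si_addr) and the reason (si_code). The following is a minimal sketch, assuming a POSIX system; the names describe_bus_code, to_hex, emit, on_sigbus and install_sigbus_handler are hypothetical and not part of our product. The "object specific hardware error" text that dbx prints is the description of the si_code value BUS_OBJERR.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Map the POSIX si_code values for SIGBUS to readable names.
const char* describe_bus_code(int code)
{
    switch (code) {
    case BUS_ADRALN: return "BUS_ADRALN (invalid address alignment)";
    case BUS_ADRERR: return "BUS_ADRERR (nonexistent physical address)";
    case BUS_OBJERR: return "BUS_OBJERR (object specific hardware error)";
    default:         return "unknown si_code";
    }
}

// Format v as 16 hex digits plus a terminating '\0'; no library calls,
// so it can be used inside a signal handler.
void to_hex(std::uintptr_t v, char out[17])
{
    static const char digits[] = "0123456789abcdef";
    for (int i = 15; i >= 0; --i) {
        out[i] = digits[v & 0xf];
        v >>= 4;
    }
    out[16] = '\0';
}

// Write a string to stderr with write(2) instead of stdio.
void emit(const char* s)
{
    ssize_t ignored = write(STDERR_FILENO, s, std::strlen(s));
    (void)ignored;
}

// SIGBUS handler: report the reason and the faulting address, then exit.
extern "C" void on_sigbus(int, siginfo_t* info, void*)
{
    char addr[17];
    to_hex(reinterpret_cast<std::uintptr_t>(info->si_addr), addr);
    emit("SIGBUS: ");
    emit(describe_bus_code(info->si_code));
    emit(" at 0x");
    emit(addr);
    emit("\n");
    _exit(1);
}

// Install the handler; returns 0 on success, -1 on failure (as sigaction does).
int install_sigbus_handler()
{
    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, 0);
}
```

Calling install_sigbus_handler() once at the start of main would be enough for a test build; the handler exits instead of returning because the faulting load would otherwise be retried.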
In the implementation of getline, a large buffer is allocated and data
is loaded into it. The data is then scanned byte by byte for the
delimiter character. The ldsb instruction loads one byte of the buffer
into a register: the byte at the address held in %o4 is loaded into
%o0, and then %o0 is compared with %o2 (0xa, the newline delimiter) to
check whether the condition is met.
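The scan described above can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not the actual library source: scan_line is a hypothetical helper, and std::find stands in for the internal __find_if seen in the stack trace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: copy bytes from the stream buffer range [first, last)
// into dst until delim is found or dst is full, then terminate dst with '\0'.
// Returns the number of characters stored, not counting the '\0'.
std::size_t scan_line(const char* first, const char* last,
                      char* dst, std::size_t dst_size, char delim = '\n')
{
    assert(dst_size > 0);
    // std::find reads one byte at a time and compares it with delim; each
    // read corresponds to a byte load like the ldsb shown in the dump.
    const char* hit = std::find(first, last, delim);
    // Leave room for the terminator, as getline(buf, 0x400) stores at most
    // 0x3ff characters.
    std::size_t n = std::min(static_cast<std::size_t>(hit - first), dst_size - 1);
    std::copy(first, first + n, dst);
    dst[n] = '\0';
    return n;
}
```

The point of the sketch is that the faulting load reads from the stream's internal buffer range [first, last), not from the tmp array that readObj allocates.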
According to the SPARC instruction set, ldsb loads a signed byte from
memory into a register. A byte load has no alignment requirement, so it
cannot cause a core dump through a misaligned address. The faulting
address also shows the correct content, loaded from services.dat, when
inspected with the dbx command "examine". So we really don't know why
the core dump happened.
Our product will be delivered to the customer in a few days, so this is
very urgent for us. Any input or help would be highly appreciated.
P.S. The OS version is Solaris 10, 64-bit.
% /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Netra t 1400/1405 (4 X
UltraSPARC-II 440MHz)
System clock frequency: 110 MHz
Memory size: 4096 Megabytes
Best Regards
Leslie