Help: Cannot acquire data by split


Amy Lee

Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

On every line the fields are separated by tabs, and I used the following
syntax to get the first and second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could you tell me how to fix that? Thanks very much.

Amy
 

Tad J McClellan

Amy Lee said:
Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];


You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?
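For illustration, a short and complete program of the kind requested might look like the sketch below. It is not Amy's actual code; the names @lines and @pairs are made up for the example, and the sample records are built with explicit "\t" separators so there is no doubt about the delimiter.

```perl
use strict;
use warnings;

# Sample records joined with real tab characters (assumed input format).
my @lines = (
    join( "\t", qw(ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2) ),
    join( "\t", qw(dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2) ),
);

my @pairs;
for my $line (@lines) {
    # One split, then a list slice picks columns 0 and 1.
    my ( $query_seq, $sub_seq ) = ( split /\t/, $line )[ 0, 1 ];
    push @pairs, [ $query_seq, $sub_seq ];
    print "$query_seq $sub_seq\n";
}
```

With genuinely tab-separated input this prints the first two columns of each record, which is one way to confirm whether the problem lies in the data rather than in the split.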
 

Josef Moellers

Amy said:
Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Are you absolutely sure it's TAB delimited? It looks more like the first
delimiter is replaced by a blank.
It looks as if there is no white space in your fields, so maybe you can
get away with using "\s+" as the splitting pattern.

Josef
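One way to check the delimiter question raised above is to count the literal tabs in a line and compare the two split patterns. The line below is a hypothetical record, assumed to have a single space after the first field and tabs elsewhere; the variable names are for illustration only.

```perl
use strict;
use warnings;

# Hypothetical line: a space after the first field, tabs between the rest.
my $line = "ath-MIR166f Marcal1-PA\t100.00\t18";

# tr/// with an empty replacement list counts matches without changing $line.
my $tab_count = ( $line =~ tr/\t// );

my @by_tab   = split /\t/,  $line;    # first "field" still contains the space
my @by_space = split /\s+/, $line;    # any run of whitespace separates fields

print "tabs: $tab_count\n";
print "split on tab:        $by_tab[0]\n";
print "split on whitespace: $by_space[0] | $by_space[1]\n";
```

If the tab count is lower than the number of columns minus one, at least one separator is not a tab.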
 

Amy Lee

Amy Lee said:
Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];


You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Thank you all. I'm sure that I made a mistake. The first blank is just a
space. So sorry for that.

Amy
 

Tim Greer

Amy said:
Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Amy

That inconsistent output is likely caused by some fields being separated
not by an actual tab character but by one or more spaces. If that's a
risk, it's safer to use \s+ in place of \t, because it matches both
spaces and tabs between the fields.

Also, you can save some typing, and just write it as:
my ($query_seq, $sub_seq) = (split /\s+/)[0,1];

I assume you're splitting on $_ somewhere in your code.
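A small sketch of the "splitting on $_" assumption: when split is given no string argument it operates on $_. The records here are hypothetical, and the example uses the special pattern ' ' (a single space in quotes), which splits on runs of whitespace like /\s+/ but also skips leading whitespace.

```perl
use strict;
use warnings;

my @seen;
for ( "  rno-mir-320 CG15161-PA\t100.00", "cbr-mir-268\tnop5-PA 95.65" ) {
    # No string argument: split works on $_. The ' ' pattern ignores
    # leading whitespace and splits on any run of spaces or tabs.
    my ( $query_seq, $sub_seq ) = ( split ' ' )[ 0, 1 ];
    push @seen, "$query_seq,$sub_seq";
}
print "$_\n" for @seen;

# By contrast, /\s+/ yields an empty leading field when the string
# starts with whitespace.
my @naive = split /\s+/, "  a b";
print scalar(@naive), "\n";    # 3
```

The difference only matters when a line can begin with whitespace; for clean input both patterns give the same fields.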
 

sln

Amy Lee said:
Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
                ^          ^
                list
my $sub_seq = (split /\t/)[1];
              ^          ^
              list again
You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record when you
only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln
 

sln

Amy Lee said:
Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again
You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record when you
only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply by mistake; I thought I had Amy's original.
But this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln
 

Tim Greer

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2


And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again
You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];


But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record
when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will be about as efficient as the
regex, since it only takes elements [0,1]; in fact, I'd wager that the
split method in the examples would probably be faster. Feel free to
benchmark. Whether over a million iterations or even 10 million, I'm
betting that split would be slightly faster than the regex. Also, you
probably want to start your regex with /^\s* (less for speed than for
the obvious reason of anchoring the match), and use \s in place of \t,
since the inconsistent results will be due to non-tabs in the original
example.
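A minimal sketch of the anchored, whitespace-based regex suggested here, assuming fields contain no whitespace themselves. The sample line and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical record with leading spaces and mixed separators.
my $line = "  dme-mir-100\tCG6630-PA 100.00";

# ^\s* anchors at the start of the line and skips leading whitespace;
# each (\S+) captures one whitespace-free field.
my ( $first, $second ) = $line =~ /^\s*(\S+)\s+(\S+)/;
print "$first $second\n";    # dme-mir-100 CG6630-PA

# A line with fewer than two fields does not match, and the match
# returns an empty list in list context.
my @miss = "onlyonefield" =~ /^\s*(\S+)\s+(\S+)/;
print scalar(@miss), "\n";   # 0
```

Checking for the empty list is a simple way to skip malformed lines instead of silently working with undefined values.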
 

sln

On Mon, 12 Jan 2009 08:48:48 -0600, Tad J McClellan


Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2


And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again


You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];


But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record
when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t, since the inconsistent
results will be due to non tabs in the original example.

Interesting. I was just thinking that if split() is implemented deep in C,
you may be right. Then I started thinking that it uses the regex engine.
Nope, not faster then!

sln
 

sln

On Mon, 12 Jan 2009 08:48:48 -0600, Tad J McClellan


Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2


And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again


You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];


But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record
when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t,
Can't, Amy has spaces in her elements, the delimiter is \t. \s is a slight
safety factor IMO.
since the inconsistent
results will be due to non tabs in the original example.

sln
 

Tim Greer

On Mon, 12 Jan 2009 20:50:09 GMT, (e-mail address removed) wrote:

On Mon, 12 Jan 2009 08:48:48 -0600, Tad J McClellan


Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0
58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0
64 81 282 265 0.007 36.2


And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again


You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];


But the output


The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.


Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here
frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient
for what you seem to want. To parse every single element in the
record when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t,
Can't, Amy has spaces in her elements, the delimiter is \t. \s is a
slight safety factor IMO.
since the inconsistent
results will be due to non tabs in the original example.

sln

Sorry, I can't make out where you replied to me above. Were you saying
the OP had spaces in one or more of their fields? If so, then yes,
tab-delimited files would need to use \t, but her reported inconsistent
results would indicate that they weren't actual tabs. If it's a matter
of fields containing a single space, perhaps try using \t|\s{2,}
(without capturing parentheses, since split returns captured separators
as extra fields) so you can split on either, but not on a single space
inside a field? Or, for that matter, \t|\s{4,} so it'll only split on
tabs or on runs of 4 or more white space characters (which some editors
write in place of tabs) and not risk splitting on a field's white space?
Of course, I'm only guessing at their problem.
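To make the alternation idea concrete, here is a hedged sketch on a made-up record whose first field contains a single space and whose separators are a tab and a run of two spaces. It also shows why capturing parentheses are best avoided in a split pattern: split returns whatever the group captured as extra list elements.

```perl
use strict;
use warnings;

# Hypothetical record: "dme mir-278" is one field with an embedded space.
my $line = "dme mir-278\tGABA-B-R3-PA  100.00";

# Split on a tab or on two or more whitespace characters; a single
# space inside a field is left alone.
my @fields = split /\t|\s{2,}/, $line;

# With a capturing group the separators come back as fields too.
my @captured = split /(\t|\s{2,})/, $line;

print scalar(@fields), "\n";      # 3
print scalar(@captured), "\n";    # 5
print "$fields[0]\n";             # dme mir-278
```

If grouping is needed inside the pattern, a non-capturing group (?:...) keeps the separators out of the result.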
 

sln

Amy said:
Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Amy

That inconsistent output would be likely caused by some fields not being
separated by an actual tab character, but more than one white space.
If that's a risk, it's safer to just use \s+ in place of \t+, because
it will cover both white space and tab separations of the fields.

Also, you can save some typing, and just write it as:
my ($query_seq, $sub_seq) = (split /\s+/)[0,1];
                                   ^^^^
                                   won't work, spaces in fields
I assume you're splitting on $_ somewhere in your code.

sln
 

Tim Greer

Amy said:
Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Amy

That inconsistent output would be likely caused by some fields not
being separated by an actual tab character, but more than one white
space. If that's a risk, it's safer to just use \s+ in place of \t+,
because it will cover both white space and tab separations of the
fields.

Also, you can save some typing, and just write it as:
my ($query_seq, $sub_seq) = (split /\s+/)[0,1];
^^^^
won't work, spaces in fields
I assume you're splitting on $_ somewhere in your code.

sln

Where were there spaces in their fields in their example? And, tabs
won't work either, if they really don't have tabs as the delimiters --
which is why I offered a solution for that in another post.
 

Eric Pozharski

(e-mail address removed) wrote:
*SKIP*
Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t, since the inconsistent
results will be due to non tabs in the original example.

Remembering strange benchmarks the previous time, I've done it thrice --
all the same (for me)

perl -wle '
use Benchmark qw|countit cmpthese timethese|;
my $lit = qq{abc\txyz};
my $t = timethese 1_000_000, {
    split     => sub { @arr = split /\t/, $lit; },
    plugged   => sub { @arr = $lit =~ /\s*([^\t]+)\s+([^\t]+)\s*/; },
    unplugged => sub { @arr = $lit =~ /(\S+)\s+(\S+)/; },
};
cmpthese $t;
'
Benchmark: timing 1000000 iterations of plugged, split, unplugged...
   plugged: 12 wallclock secs (10.65 usr + 0.03 sys = 10.68 CPU) @  93632.96/s (n=1000000)
     split:  5 wallclock secs ( 3.04 usr + 0.01 sys =  3.05 CPU) @ 327868.85/s (n=1000000)
 unplugged: 13 wallclock secs ( 9.83 usr + 0.03 sys =  9.86 CPU) @ 101419.88/s (n=1000000)

              Rate   plugged unplugged     split
plugged    93633/s        --       -8%      -71%
unplugged 101420/s        8%        --      -69%
split     327869/s      250%      223%        --
 

Tim Greer

Eric said:
Remembering strange benchmarks the previous time, I've done it thrice
-- all the same (for me)

perl -wle '
use Benchmark qw|countit cmpthese timethese|;
my $lit = qq{abc\txyz};
my $t = timethese 1_000_000, {
    split     => sub { @arr = split /\t/, $lit; },
    plugged   => sub { @arr = $lit =~ /\s*([^\t]+)\s+([^\t]+)\s*/; },
    unplugged => sub { @arr = $lit =~ /(\S+)\s+(\S+)/; },
};
cmpthese $t;
'
Benchmark: timing 1000000 iterations of plugged, split, unplugged...
   plugged: 12 wallclock secs (10.65 usr + 0.03 sys = 10.68 CPU) @  93632.96/s (n=1000000)
     split:  5 wallclock secs ( 3.04 usr + 0.01 sys =  3.05 CPU) @ 327868.85/s (n=1000000)
 unplugged: 13 wallclock secs ( 9.83 usr + 0.03 sys =  9.86 CPU) @ 101419.88/s (n=1000000)

              Rate   plugged unplugged     split
plugged    93633/s        --       -8%      -71%
unplugged 101420/s        8%        --      -69%
split     327869/s      250%      223%        --

I did some benchmarks on an older, slower system, and while split was
faster, as I suspected, the gap wasn't that large. I ran the
benchmark on a newer, faster system and got results similar to yours
above, where it was markedly faster. Thanks for the additional
confirmation (showing that the speed increase wasn't as slight as I
initially reported, but large).
 
