Help: Cannot acquire data by split

Amy Lee · Jan 12, 2009

Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

On every line the fields are separated by tabs, and I used the following
syntax to get the first and second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could you tell me how to fix that? Thanks very much.

Amy

Tad J McClellan · Jan 12, 2009

Amy Lee said:
Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?
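
For illustration, here is a minimal sketch of the kind of short, complete program being asked for. It is not Amy's actual code, which was never posted; the helper name first_two and the in-memory sample data are assumptions, and every field is assumed to be separated by a real tab.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample records built with real tab characters between the fields.
# (Assumption: a shortened version of the data shown in the question.)
my $data = join '', map { join("\t", @$_) . "\n" } (
    [qw(ath-MIR166f Marcal1-PA 100.00 18)],
    [qw(dme-mir-278 GABA-B-R3-PA 100.00 18)],
);

# Hypothetical helper: return the first two tab-separated fields of a line.
sub first_two {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    return @fields[0, 1];
}

# Read the sample through an in-memory filehandle, as one would read a file.
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";
while (my $line = <$fh>) {
    my ($query_seq, $sub_seq) = first_two($line);
    print "$query_seq $sub_seq\n";
}
close $fh;
```

A program in this shape includes both the input and the output statement, so anyone can run it and see exactly what is printed.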

Josef Moellers · Jan 12, 2009

Amy said:
Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Are you absolutely sure it's TAB delimited? It looks more like the first
delimiter is replaced by a blank.
It looks as if there is no white space in your fields, so maybe you can
get away with using "\s+" as the splitting pattern.
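
A small sketch of the difference, assuming a line whose first delimiter is a single space while the remaining delimiters are tabs (the sample line is made up for the illustration):

```perl
use strict;
use warnings;

# Assumed input: a space after the first field, tabs between the rest.
my $line = "ath-MIR166f Marcal1-PA\t100.00\t18";

my @by_tab   = split /\t/,  $line;   # splits on tabs only
my @by_space = split /\s+/, $line;   # splits on any run of whitespace

print "tab split:        [$by_tab[0]] [$by_tab[1]]\n";
print "whitespace split: [$by_space[0]] [$by_space[1]]\n";
```

With /\t/ the first two names stay glued together in element 0, while /\s+/ separates them, provided no field contains whitespace of its own.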

Josef

Amy Lee · Jan 12, 2009

Amy Lee said:

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Thank you all. I'm sure that I made a mistake. The first blank is just a
space. Sorry about that.

Amy

Tim Greer · Jan 12, 2009

Amy said:
Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Amy

That inconsistent output is likely caused by some fields not being
separated by an actual tab character, but by one or more spaces.
If that's a risk, it's safer to just use \s+ in place of \t, because
it will match both spaces and tabs between the fields.

Also, you can save some typing, and just write it as:
my ($query_seq, $sub_seq) = (split /\s+/)[0,1];

I assume you're splitting on $_ somewhere in your code.
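
As a sketch of that point, split can also be given its target string explicitly rather than relying on $_. The sample line and its leading spaces are assumptions, added to show one difference between /\s+/ and the special ' ' pattern:

```perl
use strict;
use warnings;

# Assumed input with leading spaces, to show how the two patterns differ.
my $line = "  dme-mir-278 GABA-B-R3-PA\t100.00";

# Pass the string explicitly instead of relying on $_.
my @regex_fields = split /\s+/, $line;   # leading whitespace yields an empty first field
my @awk_fields   = split ' ',   $line;   # the special ' ' pattern skips leading whitespace

print scalar(@regex_fields), " fields, first is '$regex_fields[0]'\n";
print scalar(@awk_fields),   " fields, first is '$awk_fields[0]'\n";
```

If input lines may start with whitespace, split ' ' avoids the empty leading field that /\s+/ produces.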

sln · Jan 12, 2009

Amy Lee said:

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];

^ ^
list

my $sub_seq = (split /\t/)[1];

^ ^
list again

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record when you
only want the first two is overkill.

while (<DATA>)
{
    # capture the text before the first tab and between the first two tabs
    my ($element1, $element2) = /(.*?)\t(.*?)\t/;
}

sln
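
If the worry is splitting more fields than needed, split also accepts a LIMIT argument. The sketch below assumes a tab-delimited record and uses a LIMIT of 3 so that splitting stops after the first two fields; the variable $rest is only there to show what is left over. According to perldoc -f split, when split is assigned directly to a list of scalars without a LIMIT, Perl supplies a limit of one more than the number of variables.

```perl
use strict;
use warnings;

# Assumed tab-delimited record, taken from the sample data in the question.
my $line = join "\t", qw(cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2);

# LIMIT 3: split stops after two fields; the third element is the unsplit remainder.
my ($query_seq, $sub_seq, $rest) = split /\t/, $line, 3;

print "$query_seq $sub_seq\n";
print "remainder: $rest\n";
```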

sln · Jan 12, 2009

Amy Lee said:

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used following
syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];

^ ^
list

my $sub_seq = (split /\t/)[1];

^ ^
list again

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record when you
only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply by mistake; I thought I had Amy's original.
But this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

Tim Greer · Jan 12, 2009

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];

^ ^
list

my $sub_seq = (split /\t/)[1];

^ ^
list again

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record
when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will be about the same in
efficiency as the regex, since it only takes elements [0,1]; in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that over a million
iterations (or even 10 million), split would probably be slightly
faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as for the
obvious reason), and use \s in place of \t, since the inconsistent
results will be due to non-tab delimiters in the original example.

sln · Jan 12, 2009

(e-mail address removed) said:

On Mon, 12 Jan 2009 08:48:48 -0600, Tad J McClellan

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record
when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t, since the inconsistent
results will be due to non tabs in the original example.

Interesting. I was just thinking that if split() is implemented deep in C
you may be right. Then I started thinking that it uses the regex engine.
Nope, not faster then!

sln

sln · Jan 12, 2009

(e-mail address removed) said:

On Mon, 12 Jan 2009 08:48:48 -0600, Tad J McClellan

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient for
what you seem to want. To parse every single element in the record
when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t,

Can't. Amy has spaces in her elements; the delimiter is \t. \s is a slight
safety factor IMO.

since the inconsistent
results will be due to non tabs in the original example.

sln

Tim Greer · Jan 12, 2009

(e-mail address removed) said:

On Mon, 12 Jan 2009 20:50:09 GMT, (e-mail address removed) wrote:

On Mon, 12 Jan 2009 08:48:48 -0600, Tad J McClellan

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
^ ^
list
my $sub_seq = (split /\t/)[1];
^ ^
list again

You can do it all at once:

my($query_seq, $sub_seq) = (split /\t/)[0, 1];

But the output

The code you have shown does not make any output!

I expect that you have made an error in the code that you have not
shown us.

Show us your output statement.

Please post a short and complete program *that we can run* that
demonstrates your problem.

Have you seen the Posting Guidelines that are posted here
frequently?

Hey Amy.
Wow, you don't want to do that. In fact split is very inefficient
for what you seem to want. To parse every single element in the
record when you only want the first two is overkill.

while (<DATA>)
{
($element1,$element2) = /(.*?)\t(.*?)\t/;
}

sln

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t,

Can't. Amy has spaces in her elements; the delimiter is \t. \s is a
slight safety factor IMO.

since the inconsistent
results will be due to non tabs in the original example.

sln

Sorry, I can't make out where you replied to me above? Were you saying
the OP had spaces in one or more of their fields? If so, then yea, tab
delimited files would need to use \t, but her inconsistent results
reported would indicate that they weren't actual tabs. If it's a
matter of fields with a white space, perhaps try using (?:\t|\s{2,}) so
you can split on either, but not on fields with a single white space?
Or, for that matter, \s{4,} so it'll only split on tabs or white space
with 4 or more (seen as tabs to some editors) and not risk splitting on
a field's white space? Of course, I'm only guessing at their problem.
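
A minimal sketch of that idea, assuming a hypothetical record in which one field contains a single space and the separators are either a tab or a run of two or more spaces. The pattern is written without capturing parentheses because, in split, a capturing group makes the separators themselves appear in the returned list. Putting \s{2,} before \t lets a tab that is followed by more whitespace be consumed as one separator.

```perl
use strict;
use warnings;

# Hypothetical record: the first field contains a single space; the
# separators are a tab in one place and a run of spaces in another.
my $line = "mmu mir-199a\tCG7806-PA    100.00";

# Split on two or more whitespace characters, or on a single tab.
# No capturing group, so only the fields are returned.
my @fields = split /\s{2,}|\t/, $line;

print join('|', @fields), "\n";
```

This only helps when the data really follows such a convention; a field containing two consecutive spaces would still be split.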

sln · Jan 13, 2009

Amy said:

Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Amy

That inconsistent output would be likely caused by some fields not being
separated by an actual tab character, but more than one white space.
If that's a risk, it's safer to just use \s+ in place of \t+, because
it will cover both white space and tab separations of the fields.

Also, you can save some typing, and just write it as:
my ($query_seq, $sub_seq) = (split /\s+/)[0,1];

^^^^
won't work, spaces in fields

I assume you're splitting on $_ somewhere in your code.

sln

Tim Greer · Jan 13, 2009

Amy said:

Hello,

Here's the data I use.

ath-MIR166f Marcal1-PA 100.00 18 0 0 58 75 2226 2243 0.008 36.2
dme-mir-278 GABA-B-R3-PA 100.00 18 0 0 64 81 282 265 0.007 36.2
dme-mir-100 CG6630-PA 100.00 19 0 0 28 46 5108 5126 0.002 38.2
cbr-mir-268 nop5-PA 95.65 23 1 0 57 79 1428 1406 0.001 38.2
mmu-mir-199a-2 CG7806-PA 100.00 18 0 0 8 25 107 124 0.008 36.2
rno-mir-320 CG15161-PA 100.00 18 0 0 2 19 265 248 0.006 36.2

And at every line the contents are separated by tab, and I used
following syntax to get the first, second columns.

my $query_seq = (split /\t/)[0];
my $sub_seq = (split /\t/)[1];

But the output is quite odd. The output is

Marcal1-PA Marcal1-PA
GABA-B-R3-PA GABA-B-R3-PA
CG6630-PA CG6630-PA
nop5-PA nop5-PA
CG7806-PA CG7806-PA
CG15161-PA CG15161-PA

So could tell me how to fix that? Thanks very much.

Amy

That inconsistent output would be likely caused by some fields not
being separated by an actual tab character, but more than one white
space. If that's a risk, it's safer to just use \s+ in place of \t+,
because it will cover both white space and tab separations of the
fields.

Also, you can save some typing, and just write it as:
my ($query_seq, $sub_seq) = (split /\s+/)[0,1];

^^^^
won't work, spaces in fields

I assume you're splitting on $_ somewhere in your code.

sln

Where were there spaces in their fields in their example? And, tabs
won't work either, if they really don't have tabs as the delimiters --
which is why I offered a solution for that in another post.

Eric Pozharski · Jan 13, 2009

(e-mail address removed) wrote:
*SKIP*

Sorry, I grabbed Tad's reply it was a mistake, thought I had Amy's
original. But, this could be even quicker:
($element1,$element2) = /\s*([^\t]+)\s+([^\t]+)\s*/;

sln

I'd imagine that the split example will actually be about the same in
efficiency as the regex, since it's splitting at [0,1] and, in fact,
I'd wager that the split method in the examples would probably be
faster. Feel free to benchmark. I'm betting that be it iterations of
a million times (or even 10 million), that split would probably be
slightly faster than the regex. Also, you probably want to use /^\s*
to start your regex (not that it'll matter for speed as much as the
obvious reason), and use \s in place of \t, since the inconsistent
results will be due to non tabs in the original example.

Remembering the strange benchmarks from the previous time, I've run it
three times, with the same results each time (for me):

perl -wle '
use Benchmark qw|countit cmpthese timethese|;
my $lit = qq{abc\txyz};
my $t = timethese 1_000_000, {
split => sub { @arr = split /\t/, $lit; },
plugged => sub { @arr = $lit =~ /\s*([^\t]+)\s+([^\t]+)\s*/; },
unplugged => sub { @arr = $lit =~ /(\S+)\s+(\S+)/; }, };
cmpthese $t;
'
Benchmark: timing 1000000 iterations of plugged, split, unplugged....

plugged: 12 wallclock secs (10.65 usr + 0.03 sys = 10.68 CPU) @ 93632.96/s (n=1000000)

split: 5 wallclock secs ( 3.04 usr + 0.01 sys = 3.05 CPU) @ 327868.85/s (n=1000000)

unplugged: 13 wallclock secs ( 9.83 usr + 0.03 sys = 9.86 CPU) @ 101419.88/s (n=1000000)

              Rate   plugged unplugged     split
plugged    93633/s        --       -8%      -71%
unplugged 101420/s        8%        --      -69%
split     327869/s      250%      223%        --

Tim Greer · Jan 13, 2009

Eric said:
Remembering the strange benchmarks from the previous time, I've run it
three times, with the same results each time (for me):

perl -wle '
use Benchmark qw|countit cmpthese timethese|;
my $lit = qq{abc\txyz};
my $t = timethese 1_000_000, {
split => sub { @arr = split /\t/, $lit; },
plugged => sub { @arr = $lit =~ /\s*([^\t]+)\s+([^\t]+)\s*/; },
unplugged => sub { @arr = $lit =~ /(\S+)\s+(\S+)/; }, };
cmpthese $t;
'
Benchmark: timing 1000000 iterations of plugged, split, unplugged...

plugged: 12 wallclock secs (10.65 usr + 0.03 sys = 10.68 CPU) @ 93632.96/s (n=1000000)

split: 5 wallclock secs ( 3.04 usr + 0.01 sys = 3.05 CPU) @ 327868.85/s (n=1000000)

unplugged: 13 wallclock secs ( 9.83 usr + 0.03 sys = 9.86 CPU) @ 101419.88/s (n=1000000)

              Rate   plugged unplugged     split
plugged    93633/s        --       -8%      -71%
unplugged 101420/s        8%        --      -69%
split     327869/s      250%      223%        --

I did some benchmarks on an older, slower system, and while split was
faster, as I suspected, the gap wasn't that large. I then ran the
benchmark on a newer, faster system and got results similar to yours
above, where split was markedly faster. Thanks for the additional
confirmation (showing that it isn't the slight speed increase I
initially reported, but a large one).
