Renumbering

  • Thread starter Francesco Pietra

Francesco Pietra

Hi;

I would like to renumber column 6 starting from 1 (i.e., 428 becomes
1, 429 becomes 2, etc., for a very long list):

ATOM 3424 N LEU B 428 143.814 87.271 77.726 1.00115.20 2SG3426
ATOM 3425 CA LEU B 428 142.918 87.524 78.875 1.00115.20 2SG3427
ATOM 3426 CB LEU B 428 141.559 88.057 78.392 1.00115.20 2SG3428
ATOM 3427 CG LEU B 428 140.577 88.341 79.544 1.00115.20 2SG3429
ATOM 3428 CD1 LEU B 428 141.102 89.464 80.454 1.00115.20 2SG3430
ATOM 3429 CD2 LEU B 428 139.159 88.615 79.017 1.00115.20 2SG3431
ATOM 3430 C LEU B 428 142.680 86.253 79.615 1.00115.20 2SG3432
ATOM 3431 O LEU B 428 142.725 86.226 80.842 1.00115.20 2SG3433
ATOM 3432 N SER B 429 142.432 85.155 78.878 1.00134.86 2SG3434
ATOM 3433 CA SER B 429 142.175 83.908 79.534 1.00134.86 2SG3435
ATOM 3434 CB SER B 429 141.666 82.805 78.590 1.00134.86 2SG3436
ATOM 3435 OG SER B 429 140.392 83.155 78.069 1.00134.86 2SG3437
ATOM 3436 C SER B 429 143.451 83.432 80.141 1.00134.86 2SG3438
ATOM 3437 O SER B 429 144.543 83.756 79.676 1.00134.86 2SG3439

The distinguishing value is in column 5: only lines containing "B"
there should be renumbered.

As you can see, the number of lines for a particular value in column 6
changes from situation to situation, and may even be different for the
same name in column 4. For example, LEU can have a different number of
lines depending on the position of this amino acid (leucine).

I was unable to set non-proportional characters, sorry.

Thanks for help

francesco pietra
 
John Machin

Francesco Pietra said:
> Hi;
>
> I would like to renumber, starting from 1, column 6 (i.e, 428 become
> 1, 429 becomes 2, etc for a very long list)
>
> ATOM 3424 N LEU B 428 143.814 87.271 77.726 1.00115.20 2SG3426
[snip]
> ATOM 3437 O SER B 429 144.543 83.756 79.676 1.00134.86 2SG3439
>
> Distinctive character is column 5, i.e., it must be set that only
> lines containing "B" should be renumbered.
>
> As you can see, the number of lines for a particular value in column 6
> changes from situation to situation, and may even be different for the
> same name in column 4. For example, LEU can have a different number of
> lines depending on the position of this amino acid (leucine).

The above paragraph is extremely unclear.

> Thanks for help

You haven't asked a question, and haven't given any background. What
is your experience with Python? Any other language? Are you expecting
somebody to write a script for you, or just give you a few hints? Is
this homework? Have you put any effort into trying to write a script
yourself? Etc etc etc
 
bearophileHUGS

Francesco Pietra, a few notes:
- In Python and C, item numbering generally starts from 0, so you would
talk about column 0, 1, etc.
- You can also use the Italian Python newsgroup if you know Italian.
- The number of lines with a particular number doesn't seem important
to solve your problem.
- You don't need to try to set non-proportional characters on Usenet.

This is a first try at a solution; tell us whether it's correct:

data = """\
ATOM 3424 N LEU B 428 143.814 87.271 77.726 1.00115.20 2SG3426
ATOM 3425 CA LEU B 428 142.918 87.524 78.875 1.00115.20 2SG3427
ATOM 3426 CB LEU B 428 141.559 88.057 78.392 1.00115.20 2SG3428
ATOM 3427 CG LEU B 428 140.577 88.341 79.544 1.00115.20 2SG3429
ATOM 3428 CD1 LEU B 428 141.102 89.464 80.454 1.00115.20 2SG3430
ATOM 3429 CD2 LEU B 428 139.159 88.615 79.017 1.00115.20 2SG3431
ATOM 3430 C LEU B 428 142.680 86.253 79.615 1.00115.20 2SG3432
ATOM 3431 O LEU B 428 142.725 86.226 80.842 1.00115.20 2SG3433
ATOM 3432 N SER B 429 142.432 85.155 78.878 1.00134.86 2SG3434
ATOM 3433 CA SER B 429 142.175 83.908 79.534 1.00134.86 2SG3435
ATOM 3434 CB SER B 429 141.666 82.805 78.590 1.00134.86 2SG3436
ATOM 3435 OG SER B 429 140.392 83.155 78.069 1.00134.86 2SG3437
ATOM 3436 C SER B 429 143.451 83.432 80.141 1.00134.86 2SG3438
ATOM 3437 O SER B 429 144.543 83.756 79.676 1.00134.86 2SG3439"""

# Split every record into whitespace-separated fields (0-based indices).
lines = (line.split() for line in data.splitlines())
for parts in lines:
    # Field 4 is the chain letter; only chain "B" is renumbered.
    if parts[4] == "B":
        # Shift the residue number so that 428 becomes 1.
        parts[5] = str(int(parts[5]) - 427)
    # Pad the atom name to four characters, then print the record.
    parts[2] = parts[2].ljust(4)
    print(" ".join(parts))

It prints:

ATOM 3424 N    LEU B 1 143.814 87.271 77.726 1.00115.20 2SG3426
ATOM 3425 CA   LEU B 1 142.918 87.524 78.875 1.00115.20 2SG3427
ATOM 3426 CB   LEU B 1 141.559 88.057 78.392 1.00115.20 2SG3428
ATOM 3427 CG   LEU B 1 140.577 88.341 79.544 1.00115.20 2SG3429
ATOM 3428 CD1  LEU B 1 141.102 89.464 80.454 1.00115.20 2SG3430
ATOM 3429 CD2  LEU B 1 139.159 88.615 79.017 1.00115.20 2SG3431
ATOM 3430 C    LEU B 1 142.680 86.253 79.615 1.00115.20 2SG3432
ATOM 3431 O    LEU B 1 142.725 86.226 80.842 1.00115.20 2SG3433
ATOM 3432 N    SER B 2 142.432 85.155 78.878 1.00134.86 2SG3434
ATOM 3433 CA   SER B 2 142.175 83.908 79.534 1.00134.86 2SG3435
ATOM 3434 CB   SER B 2 141.666 82.805 78.590 1.00134.86 2SG3436
ATOM 3435 OG   SER B 2 140.392 83.155 78.069 1.00134.86 2SG3437
ATOM 3436 C    SER B 2 143.451 83.432 80.141 1.00134.86 2SG3438
ATOM 3437 O    SER B 2 144.543 83.756 79.676 1.00134.86 2SG3439

Your data is probably in a file, so you have to change the first line
of the code:

lines = (line.split() for line in open("namefile.txt"))
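
For a long list it may be handier to write the result straight into a second file. What follows is only a minimal Python 3 sketch under the same assumption of whitespace-separated fields; the function name `renumber_file`, its parameters, and the output file name in the usage line are made up for illustration and are not from the thread.

```python
def renumber_file(src_path, dst_path, chain="B", offset=427):
    """Copy src_path to dst_path, shifting field 5 on lines of one chain."""
    # Both files are closed automatically when the with-block ends.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            parts = line.split()
            # Only touch lines that have enough fields and belong to the chain.
            if len(parts) > 5 and parts[4] == chain:
                parts[5] = str(int(parts[5]) - offset)
                # Changed lines are re-joined with single spaces.
                dst.write(" ".join(parts) + "\n")
            else:
                # Every other line is passed through untouched.
                dst.write(line)
```

It could be called as `renumber_file("namefile.txt", "renumbered.txt")`. Note that the changed lines lose their original spacing, because they are re-joined with single spaces.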

If your output file must have hard-coded columns (and generally those
files don't need such a property), then you have to make my code
more complex...
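
One possible shape for such a column-preserving variant, offered only as a sketch: instead of splitting and re-joining, it locates the sixth field with a regular expression and overwrites it in place, right-justified to its old width, so everything else on the line stays where it was. It assumes every field before the number is non-empty, and it derives the offset from the first matching line rather than hard-coding 427. The names `FIELD_RE` and `renumber_keep_layout` are hypothetical.

```python
import re

# Groups: 1 = first four fields with their gaps, 2 = chain field,
# 3 = gap before the number, 4 = the residue number itself.
FIELD_RE = re.compile(r"^((?:\S+\s+){4})(\S+)(\s+)(\d+)")

def renumber_keep_layout(lines, chain="B"):
    """Yield lines with the chain's numbers shifted so the first becomes 1."""
    offset = None
    for line in lines:
        m = FIELD_RE.match(line)
        if m is None or m.group(2) != chain:
            # Not a matching record: pass it through unchanged.
            yield line
            continue
        old = m.group(4)
        if offset is None:
            # The first number seen for this chain maps to 1.
            offset = int(old) - 1
        # Right-justify to the old width so later columns do not move.
        new = str(int(old) - offset).rjust(len(old))
        yield line[:m.start(4)] + new + line[m.end(4):]
```

Because the same offset is applied to every line, gaps in the original numbering are kept as gaps; whether that is wanted depends on the data.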

Bye,
bearophile
 
bearophileHUGS

John Machin:
Is this homework? Have you put any effort into trying to write a script
yourself? Etc etc etc

You are right, I am sorry -.-

Bye,
bearophile
 
Philipp Pagel

Francesco Pietra said:
ATOM 3424 N LEU B 428 143.814 87.271 77.726 1.00115.20 2SG3426
ATOM 3425 CA LEU B 428 142.918 87.524 78.875 1.00115.20 2SG3427 [...]

As you can see, the number of lines for a particular value in column 6
changes from situation to situation, and may even be different for the
same name in column 4. For example, LEU can have a different number of
lines depending on the position of this amino acid (leucine).

Others have already given good hints, but I would like to add a bit of
advice.

The data you show appears to be a PDB protein structure file. It is
important to realize that these are fixed-width files and columns can be
empty, so splitting on tabs or whitespace will often fail. It is also
important to know that the residue numbering (cols 23-26) is not
necessarily contiguous and is not even unique without taking into
account the 'insertion code' in column 27, which happens to be empty in
your example. I would recommend using a full-blown PDB parser to read
the data and then iterating over the residues to do whatever you would
like to accomplish. Biopython has such a parser:

www.biopython.org

cu
Philipp
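
To illustrate the fixed-width point without any third-party library, here is a rough standard-library sketch that slices by column position instead of splitting on whitespace. It takes the residue number from columns 23-26 and the insertion code from column 27 as described above, and assumes the chain ID sits in column 22 as in the PDB format. It gives each distinct (number, insertion code) pair in the chain the next consecutive number and blanks the insertion code. The function name `renumber_chain` and its parameters are hypothetical, and this is not a substitute for a real parser.

```python
def renumber_chain(lines, chain="B", start=1):
    """Renumber the residues of one chain consecutively by fixed columns."""
    counter = start - 1
    previous = None  # (resSeq, iCode) of the last residue seen in the chain
    for line in lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        # Column 22 (index 21) holds the chain ID.
        if not is_atom or len(line) < 27 or line[21] != chain:
            yield line
            continue
        # Columns 23-26 are the residue number, column 27 the insertion code.
        residue_id = (line[22:26], line[26])
        if residue_id != previous:
            counter += 1
            previous = residue_id
        # Write the new number right-justified and blank the insertion code.
        yield line[:22] + "%4d" % counter + " " + line[27:]
```

Unlike a constant offset, this closes gaps in the numbering, which may or may not be what is wanted. It also leaves other record types such as TER untouched and would overflow the four-column field beyond 9999 residues.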
 
