Use of HTMLparser to change language

pranav

Greetings All,

I have a huge number of HTML files, all in English. I also have their
counterpart files in Spanish. The non-English files have a slightly
different look and feel from their English counterparts.

My task is to make sure that the English HTML files contain the
Spanish text, while retaining the English look and feel.

The most obvious and stupid way is to open the English and Spanish
files in some HTML editor, look for the English text, find its
counterpart in Spanish and then replace it. (I don't know Spanish, but
as I said the look and feel is only a little different, so I can easily
guess which text is which, plus Google Translate.)

I am sure there is a Python way of solving this problem.

Can anyone help me out with some solution?

Thanks,

Pranny
 
Marco Mariani

pranav said:
I am sure there is a python way of solving this problem.

The common sense approach (nothing to do with python) would be to
rewrite everything to be dynamically generated with a template language
- in python those would be TAL, mako, genshi, jinja, whatever ...
anything is better than the current solution.
 
pranav

Marco Mariani said:
The common sense approach (nothing to do with python) would be to
rewrite everything to be dynamically generated with a template language
- in python those would be TAL, mako, genshi, jinja, whatever ...
anything is better than the current solution.

Hmm, yes, that would be THE best thing if this were an application. These
are plain HTML files and need to be shipped on CDs. Also, it is not in
the design to use anything executable.
 
Vlastimil Brom

2009/3/20 pranav said:
Greetings All,

I have huge number of HTML files, all in english. I also have their
counterpart files in Spanish. The non english files have their look
and feel a little different than their english counterpart.

My task is to make sure that the English HTML files contain the
Spanish text, with retaining the English look and feel.
....

Pranny

Hi, I guess this task probably cannot be solved fully automatically
unless the HTML has some exact structure, which doesn't seem
likely.
If you would prefer to work with static sources, you can try to
identify the differences in the markup of the English and Spanish pages,

e.g. using BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
or, at least approximately, with regular expressions,
e.g.:
import re
tags_only_source = re.findall(r"<[^>]+>", html_source)
which should return the tag source for simple code (neglecting nesting,
commented-out code, strings containing tag source ...)
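A minimal, runnable version of the findall() idea might look like the sketch below. The sample page in html_source is made up for illustration, and the limitations named above (nesting, comments, tag-like text inside strings) still apply.

```python
import re

# Hypothetical sample page; in practice this would be the text of one file.
html_source = (
    '<html><body><h1 class="title">Hello</h1>'
    "<p>Some <b>bold</b> text.</p></body></html>"
)

# Match "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")

# The tags in document order, with all text between them dropped.
tags_only_source = TAG_RE.findall(html_source)

for tag in tags_only_source:
    print(tag)
```

The resulting list keeps opening tags with their attributes, so two pages that differ only in a class name will still show up as different.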

The difflib library could then help in identifying the differences in the code, cf.:
http://docs.python.org/library/difflib.html
....
  a
  b
+ Q
  c
  a
  d
- e
- f
  s
  d
  f
+ A
+ A

(The sample strings used here as arguments for ndiff could equally be the
lists of strings returned by findall() above.)
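The call that produced the listing is not shown, so the two input strings in this sketch are an assumption, worked out backwards from the "+" and "-" lines. The second half applies the same difflib.ndiff() call to two made-up tag lists of the kind findall() returns.

```python
import difflib

# Assumed inputs, inferred from the diff listing (the original call is elided).
old = "abcadefsdf"
new = "abQcadsdfAA"

# ndiff() accepts any two sequences of strings; a plain string is treated
# as a sequence of one-character strings.
char_diff = list(difflib.ndiff(old, new))
print("\n".join(char_diff))

# The same call on hypothetical tag lists: the English page wraps its
# paragraph in a <div>, the Spanish page does not.
english_tags = ["<html>", "<body>", "<div>", "<p>", "</p>", "</div>", "</body>", "</html>"]
spanish_tags = ["<html>", "<body>", "<p>", "</p>", "</body>", "</html>"]

# Keep only the lines that mark a difference ("- " or "+ ").
tag_changes = [
    line
    for line in difflib.ndiff(english_tags, spanish_tags)
    if line.startswith(("- ", "+ "))
]
print(tag_changes)
```

Lines starting with two spaces are common to both inputs, "- " marks items found only in the first and "+ " items found only in the second.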

If you are lucky and the differences are rather small and regular, you
can then try to modify the markup in the Spanish pages to be more
similar to the English ones,
again possibly using BeautifulSoup or even re.sub(...)
(of course, saving the modified sources as new files in some other directory).
(The opposite, taking the English markup and feeding it with Spanish
text, would be more tricky, I guess.)
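For the "opposite" direction, a rough sketch with the standard library's html.parser could look like this. It is only an illustration under strong assumptions: both pages must contain the same number of text segments in the same order, and pages with inline script or style content are not handled. The class and function names (TextCollector, TextSwapper, swap_text) are made up for this example.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-blank text segments of a page in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []

    def handle_data(self, data):
        if data.strip():
            self.segments.append(data.strip())


class TextSwapper(HTMLParser):
    """Re-emit the markup as it was, replacing each text segment in turn."""

    def __init__(self, replacements):
        super().__init__(convert_charrefs=True)
        self.replacements = list(replacements)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            self.out.append(data)  # keep whitespace between tags
            return
        start = data.index(stripped)
        lead, trail = data[:start], data[start + len(stripped):]
        new_text = escape(self.replacements.pop(0), quote=False)
        self.out.append(lead + new_text + trail)


def collect_text(html_source):
    collector = TextCollector()
    collector.feed(html_source)
    collector.close()
    return collector.segments


def swap_text(english_html, spanish_html):
    """Return the English markup carrying the Spanish text segments."""
    spanish_text = collect_text(spanish_html)
    if len(collect_text(english_html)) != len(spanish_text):
        raise ValueError("text segment counts differ; page needs manual review")
    swapper = TextSwapper(spanish_text)
    swapper.feed(english_html)
    swapper.close()
    return "".join(swapper.out)


# Hypothetical page pair with different markup but parallel text.
english = '<div class="en"><h1>Welcome</h1><p>Good <b>morning</b></p></div>'
spanish = "<table><tr><td><h2>Bienvenido</h2></td></tr><tr><td>Buenos <i>días</i></td></tr></table>"
print(swap_text(english, spanish))
```

Pages where the segment counts differ raise an error instead of being silently mangled, which leaves them for the "manual" part of the job.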

However, all that is likely to help with only part of the task,
which will almost certainly require more or less "manual" work.
Someone more experienced can probably propose a more effective
approach...

hth,
vbr
 
Marco Mariani

pranav said:
Hmm ya, that is THE best thing, if this were an application. These are
plain HTML files and needs to be shipped with CDs.

So what? Template engines are perfectly able to generate files instead
of sending them off the net.
pranav said:
Also, it is not in the design to use anything executable.

Do you mean "put executables on the CDs"? I'm not proposing that.

Anyway, I would probably start by running all of the files through 'tidy',
to get the same formatting and hopefully the same attribute order, then
examine a few of them with something like 'vimdiff' and see if I could come
up with some rules to implement with BeautifulSoup. False positives (i.e.
files that should be equal but aren't) are OK because they can suggest new
rules to implement.
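The sketch below is not 'tidy' or 'vimdiff'; it is a standard-library stand-in for the same idea: bring the tags of both pages into one canonical spelling, then list the page pairs whose structure still differs. SkeletonParser, skeleton() and mismatched_pairs() are hypothetical names, and the page pairs are made up.

```python
from html.parser import HTMLParser


class SkeletonParser(HTMLParser):
    """Record only the tags, in a canonical spelling, and drop the text."""

    def __init__(self):
        super().__init__()
        self.skeleton = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names; sorting the
        # attributes removes differences in attribute order.
        pairs = sorted((name, value or "") for name, value in attrs)
        parts = [tag] + ['%s="%s"' % pair for pair in pairs]
        self.skeleton.append("<%s>" % " ".join(parts))

    def handle_startendtag(self, tag, attrs):
        # Treat a self-closing tag such as <br/> like a plain <br>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.skeleton.append("</%s>" % tag)


def skeleton(html_source):
    """Return the canonical tag sequence of one page."""
    parser = SkeletonParser()
    parser.feed(html_source)
    parser.close()
    return parser.skeleton


def mismatched_pairs(pairs):
    """Return the names of the page pairs whose tag skeletons differ."""
    return [
        name
        for name, (english, spanish) in sorted(pairs.items())
        if skeleton(english) != skeleton(spanish)
    ]


# Hypothetical page pairs: name -> (English source, Spanish source).
pairs = {
    "intro.html": ('<P Class="note" id="a">Hello</P>', '<p id="a" class="note">Hola</p>'),
    "setup.html": ("<div><p>Setup</p></div>", "<table><tr><td>Instalación</td></tr></table>"),
}
print(mismatched_pairs(pairs))  # ['setup.html']
```

Pairs that drop out of the list need only a text swap; the remaining ones are the candidates for new rules.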

With the same reverse-engineered rules I could create the templates from
the static files.
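To illustrate the point that a template can produce plain files for the CD, here is a small sketch. It uses string.Template from the standard library as a stand-in for the engines named earlier (TAL, mako, genshi, jinja); the page layout, the texts and the build_site() helper are all invented for the example.

```python
import tempfile
from pathlib import Path
from string import Template

# One shared layout; only the $placeholders change per language.
PAGE = Template("""<html>
<head><title>$title</title></head>
<body class="manual">
<h1>$title</h1>
<p>$intro</p>
</body>
</html>
""")

# Hypothetical per-language texts. Real text would need HTML escaping.
TEXTS = {
    "en": {"title": "Installation", "intro": "Insert the CD and open index.html."},
    "es": {"title": "Instalación", "intro": "Inserte el CD y abra index.html."},
}


def build_site(out_dir):
    """Write one static index.html per language and return the paths."""
    out_dir = Path(out_dir)
    written = []
    for lang, strings in TEXTS.items():
        target = out_dir / lang / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(PAGE.substitute(strings), encoding="utf-8")
        written.append(target)
    return written


# Demo run in a throwaway directory; a real build would use a fixed folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for path in build_site(demo_dir):
        print(path.relative_to(demo_dir))
```

The output is ordinary HTML; nothing executable has to ship, only the generated files.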
 
Terry Reedy

pranav said:
Greetings All,

I have huge number of HTML files, all in english. I also have their
counterpart files in Spanish. The non english files have their look
and feel a little different than their english counterpart.

My task is to make sure that the English HTML files contain the
Spanish text, with retaining the English look and feel.

But then they will not be English files, but Spanish files.

pranav said:
The most obvious and stupid way is to open the English and Spanish
files in some HTML Editor. Look for the english text, see its
counterpart in spanish and then replace it. (I don't know spanish, but
as i said the look and feel is only little different, so i can easily
guess which text is what + google translate).

So it seems to me that your task is to convert the look and feel of the
Spanish files to match the look and feel of the English files (which
someone prefers). So you could think of the problem as changing the
markup in the Spanish files rather than as changing the text in the
English files. If there is a consistent style in both sets, then you
should be able to formulate a set of rules to convert the Spanish files.
If the English files are each idiosyncratic, then I do not envy you.
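As a rough sketch of what such a set of rules could look like in code, the example below rewrites Spanish-page tags into the tags the English pages are assumed to use, leaving the text alone. The RULES table and the MarkupConverter class are hypothetical; real rules would have to come from comparing the two sets of files.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rules: Spanish-page tag -> (English-page tag, attributes to use).
RULES = {
    "h2": ("h1", {"class": "title"}),
    "i": ("em", {}),
}


class MarkupConverter(HTMLParser):
    """Re-emit a page, renaming tags according to a rule table."""

    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.rules:
            new_tag, new_attrs = self.rules[tag]
            attr_text = "".join(
                ' %s="%s"' % (name, escape(value)) for name, value in new_attrs.items()
            )
            self.out.append("<%s%s>" % (new_tag, attr_text))
        else:
            self.out.append(self.get_starttag_text())  # no rule: keep as written

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags pass through unchanged in this sketch.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        new_tag = self.rules[tag][0] if tag in self.rules else tag
        self.out.append("</%s>" % new_tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape text.
        self.out.append(escape(data, quote=False))


def convert(html_source, rules=RULES):
    converter = MarkupConverter(rules)
    converter.feed(html_source)
    converter.close()
    return "".join(converter.out)


# Hypothetical Spanish fragment.
spanish = '<h2 align="center">Bienvenido</h2><p>Buenos <i>días</i></p>'
print(convert(spanish))
```

Comments and doctype declarations are dropped by this sketch, so it would need extending before being run over whole files.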

[Comment: stylesheets (or templates) make it a LOT easier to quickly
change the style of multiple files.]

tjr
 
