Use of HTMLparser to change language

pranav · Mar 20, 2009

Greetings All,

I have a huge number of HTML files, all in English. I also have their
counterpart files in Spanish. The look and feel of the non-English files
is a little different from that of their English counterparts.

My task is to make sure that the English HTML files contain the
Spanish text while retaining the English look and feel.

The most obvious and stupid way is to open the English and Spanish
files in some HTML editor, look for the English text, find its
counterpart in Spanish, and then replace it. (I don't know Spanish, but
as I said the look and feel is only a little different, so I can easily
guess which text is which, plus Google Translate.)

I am sure there is a Python way of solving this problem.

Can anyone help me out with some solution?

Thanks,

Pranny

Marco Mariani · Mar 20, 2009

pranav said:
I am sure there is a Python way of solving this problem.

The common sense approach (nothing to do with Python) would be to
rewrite everything to be dynamically generated with a template language
- in Python those would be TAL, Mako, Genshi, Jinja, whatever ...
anything is better than the current solution.

pranav · Mar 20, 2009

Marco Mariani said:
The common sense approach (nothing to do with Python) would be to
rewrite everything to be dynamically generated with a template language
- in Python those would be TAL, Mako, Genshi, Jinja, whatever ...
anything is better than the current solution.

Hmm ya, that is THE best thing, if this were an application. These are
plain HTML files and need to be shipped on CDs. Also, it is not in
the design to use anything executable.

Vlastimil Brom · Mar 20, 2009

2009/3/20 pranav said:
Greetings All,

I have a huge number of HTML files, all in English. I also have their
counterpart files in Spanish. The look and feel of the non-English files
is a little different from that of their English counterparts.

My task is to make sure that the English HTML files contain the
Spanish text while retaining the English look and feel.
....

Pranny

Hi, I guess this task probably cannot be solved fully automatically
unless the HTML has some exact structure, which doesn't seem likely.
If you would prefer to work with static sources, you can try to
identify the differences in the markup of the English and Spanish pages.

e.g. using BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
or at least approximately with regular expressions,
e.g.:
import re
tags_only_source = re.findall(r"<[^>]+>", html_source)
This should return the tags for simple markup (neglecting nesting,
commented-out code, strings containing tag-like text ...)

The difflib module could then help in identifying the differences in the markup, cf.:
http://docs.python.org/library/difflib.html
....
  a
  b
+ Q
  c
  a
  d
- e
- f
  s
  d
  f
+ A
+ A

(The sample strings used here as arguments for ndiff can also be lists of
strings, such as those returned by findall() above.)
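To make the combination of findall() and ndiff concrete, here is a minimal sketch. The two page strings and the helper name tag_sequence are made up for illustration; real input would be read from a pair of English and Spanish files. It only compares the order of tags and, as noted above, ignores nesting and comments.

```python
import difflib
import re


def tag_sequence(html_source):
    """Return the tags of html_source in document order, dropping the text."""
    return re.findall(r"<[^>]+>", html_source)


# Hypothetical sample pages, not taken from the real files.
english = "<html><body><h1>Welcome</h1><p>Hello</p></body></html>"
spanish = "<html><body><h2>Bienvenido</h2><p>Hola</p><br></body></html>"

# ndiff accepts lists of strings, so each tag is treated as one "line".
diff = list(difflib.ndiff(tag_sequence(english), tag_sequence(spanish)))

# Keep only the tags that were removed ("- ") or added ("+ ").
changed = [line for line in diff if line.startswith(("+ ", "- "))]
print("\n".join(changed))
```

Here the changed list shows that the Spanish page uses h2 where the English page uses h1 and has an extra br, which is the kind of regular difference a conversion rule could target.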

If you are lucky and the differences are rather small and regular, you
can then try to modify the markup in the Spanish pages to be more
similar to the English ones,
again possibly using BeautifulSoup or even re.sub(...)
(of course, saving the modified sources as new files in some other directory).
(The opposite, taking the English markup and feeding it with Spanish
text, would be trickier, I guess.)

However, all that is likely to help with only part of the task,
which will almost certainly require more or less "manual" work.
Someone more experienced can probably propose a more effective
approach...

hth,
vbr

Marco Mariani · Mar 20, 2009

pranav said:
Hmm ya, that is THE best thing, if this were an application. These are
plain HTML files and need to be shipped on CDs.

So what? Template engines are perfectly able to generate files instead
of serving them over the net.
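As an illustration of generating static files from a template, here is a small sketch. It uses string.Template from the standard library as a stand-in for a real template engine such as Mako or Jinja, and the page layout, the TEXTS dictionary and the build_site function are all hypothetical. Only the generated HTML files would go on the CD; the script itself stays on the build machine.

```python
import tempfile
from pathlib import Path
from string import Template

# One shared layout: the "look and feel" lives here, once.
PAGE = Template("<html><body><h1>$title</h1><p>$body</p></body></html>")

# Hypothetical per-language strings; real ones would come from the
# existing English and Spanish pages.
TEXTS = {
    "en": {"title": "Welcome", "body": "Insert the CD to begin."},
    "es": {"title": "Bienvenido", "body": "Inserte el CD para comenzar."},
}


def build_site(out_dir):
    """Write one static index.html per language below out_dir."""
    written = []
    for lang, strings in TEXTS.items():
        target = Path(out_dir) / lang / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(PAGE.substitute(strings), encoding="utf-8")
        written.append(target)
    return written


# Demo: generate into a throwaway directory and list the results.
with tempfile.TemporaryDirectory() as tmp:
    for path in build_site(tmp):
        print(path.relative_to(tmp))
```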

Also, it is not in the design to use anything executable.

Do you mean "put executables on the CDs"? I'm not proposing that.

Anyway, I would probably start by running all of the files through 'tidy'
to get the same formatting and hopefully attribute order, then examine a
few of them with something like 'vimdiff' and see if I could come up with
some rules to implement with BeautifulSoup. False positives (i.e. files
that should be equal but aren't) are OK because they can suggest new rules
to implement.
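A rough stand-in for the "normalize, then compare" step, using only the standard library rather than tidy and vimdiff, could look like this. The Skeleton class, the skeleton and differing_pairs functions and the sample pages are assumptions made for the sketch. It reduces each page to its tags with attributes sorted, so attribute order and tag case do not count as differences, and then lists the file pairs that still differ.

```python
from html.parser import HTMLParser


class Skeleton(HTMLParser):
    """Collect the tags of a page, with attributes in sorted order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive with value None; treat them as "".
        pairs = sorted((name, value or "") for name, value in attrs)
        attr_text = "".join(' %s="%s"' % pair for pair in pairs)
        self.tags.append("<%s%s>" % (tag, attr_text))

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


def skeleton(html_source):
    parser = Skeleton()
    parser.feed(html_source)
    parser.close()
    return parser.tags


def differing_pairs(pages):
    """pages maps a file name to (english_html, spanish_html).
    Return the names whose markup skeletons are not identical."""
    return sorted(name for name, (en, es) in pages.items()
                  if skeleton(en) != skeleton(es))


# Hypothetical sample pairs standing in for real files.
pages = {
    "a.html": ('<p id="x" class="y">Hi</p>', '<P class="y" id="x">Hola</P>'),
    "b.html": ("<h1>Hi</h1>", "<h2>Hola</h2>"),
}
print(differing_pairs(pages))
```

Each name this reports is a candidate for a new rule, in the sense described above.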

With the same reverse-engineered rules I could create the templates from
the static files.

Terry Reedy · Mar 20, 2009

pranav said:
Greetings All,

I have a huge number of HTML files, all in English. I also have their
counterpart files in Spanish. The look and feel of the non-English files
is a little different from that of their English counterparts.

My task is to make sure that the English HTML files contain the
Spanish text while retaining the English look and feel.

But then they will not be English files, but Spanish files.

The most obvious and stupid way is to open the English and Spanish
files in some HTML editor, look for the English text, find its
counterpart in Spanish, and then replace it. (I don't know Spanish, but
as I said the look and feel is only a little different, so I can easily
guess which text is which, plus Google Translate.)

So it seems to me that your task is to convert the look and feel of the
Spanish files to match the look and feel of the English files (which
someone prefers). So you could think of the problem as changing the
markup in the Spanish files rather than as changing the text in the
English files. If there is a consistent style in both sets, then you
should be able to formulate a set of rules to convert the Spanish files.
If the English files are each idiosyncratic, then I do not envy you.
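If the two sets are consistent, such a rule set could be as simple as an ordered list of substitutions applied to every Spanish file. The sketch below is only an illustration: the two rules (b to strong, and a class name "titulo" to "title") are invented, and real rules would have to be derived from the actual differences between the files. Regular expressions are a blunt tool for HTML, so this only suits small, regular differences.

```python
import re

# Hypothetical conversion rules: (compiled pattern, replacement).
# They are applied in order to the Spanish markup.
RULES = [
    # Spanish pages use <b>, English pages use <strong>.
    (re.compile(r"<(/?)b>", re.IGNORECASE), r"<\1strong>"),
    # Spanish pages name the heading class differently.
    (re.compile(r'class="titulo"'), 'class="title"'),
]


def apply_rules(html_source, rules=RULES):
    """Return html_source with every rule applied in sequence."""
    for pattern, replacement in rules:
        html_source = pattern.sub(replacement, html_source)
    return html_source


print(apply_rules('<h1 class="titulo">Hola <b>mundo</b></h1>'))
```

The converted pages would be written to a separate directory so the original Spanish files stay untouched.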

[Comment: stylesheets (or templates) make it a LOT easier to quickly
change the style of multiple files.]

tjr
