Clean up an XHTML OpenOffice document for web publishing using RegEx
Preparing a document saved as XHTML from OpenOffice (StarOffice/OpenOffice format) for Drupal or another website or wiki content management system.
Resolution
This uses SubEthaEdit, but TextWrangler or any text editor with regular expression support ought to work (recommendations for free, open source text editors for the Mac would be great to know, if you care to tell us).
Removing all the junk with extremely rudimentary regex skills:
First, delete everything from <head> through </head>, including both tags. This isn't hard with regular expressions, but I'd have to look it up; I did it manually here.
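For anyone who would rather not do the <head> removal by hand, here is a minimal sketch in Python (not something from the original workflow, which was done manually in the editor). The function name strip_head and the sample string are made up for illustration. The key points are the non-greedy `.*?` and the DOTALL flag, which lets `.` match across line breaks.

```python
import re

# DOTALL lets "." match newlines, since <head> normally spans many lines.
# The non-greedy ".*?" stops at the first </head> instead of the last one.
HEAD_RE = re.compile(r"<head\b[^>]*>.*?</head>", re.DOTALL | re.IGNORECASE)

def strip_head(xhtml):
    """Remove the <head> element, including both tags (hypothetical helper)."""
    return HEAD_RE.sub("", xhtml)

sample = "<html><head>\n<title>t</title>\n<style>p{}</style>\n</head><body><p>Hi</p></body></html>"
print(strip_head(sample))
```

The `\b` after `head` keeps the pattern from matching a tag such as <header>.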
Find:
style="margin-left:1.25cm;"
Replace: (nothing)
Find:
class="P..." style="margin-left:1.25cm;"
Replace: (nothing)
Find:
<p class="P..." style="margin-left:0.25cm;">
Replace:
\n
Find:
</p>
Replace: (nothing)
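The find-and-replace steps above could be scripted roughly like this. It is a sketch in Python rather than the editor, and it makes some assumptions: the dots in `P...` are read as regex wildcards (each matching any one character), the dots in the margin values are treated as literal and escaped, and the names RULES and apply_rules are made up for illustration. The rules run in the order listed, which means the second rule may find nothing left to match once the first has removed the style attribute.

```python
import re

# (pattern, replacement) pairs, in the same order as the steps above.
# Wildcard dots in P... are kept; literal dots in 1.25cm are escaped.
RULES = [
    (r'style="margin-left:1\.25cm;"', ""),
    (r'class="P..." style="margin-left:1\.25cm;"', ""),
    (r'<p class="P..." style="margin-left:0\.25cm;">', "\n"),
    (r'</p>', ""),
]

def apply_rules(text, rules=RULES):
    """Apply each rule in turn to the whole text (hypothetical helper)."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sample = ('<p class="P123" style="margin-left:0.25cm;">First line</p>'
          '<p style="margin-left:1.25cm;">Indented</p>')
print(repr(apply_rules(sample)))
```

Keeping the rules in a plain list makes it easy to reorder or extend them without touching the loop.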
Also find each of these and replace it with nothing:
<span class="T...">
<span class="T..">
</span>
<div class="Sect.">
<div style="text-align:right">
<div style="text-align:left">
</div>
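The list above can also be collapsed into one pass. This is only a sketch in Python, again assuming the dots in `T...`, `T..` and `Sect.` are regex wildcards standing for one character each; TAG_PATTERNS and strip_wrappers are illustrative names, not anything from OpenOffice or Drupal.

```python
import re

# The patterns from the list above, unchanged; each "." matches one character.
TAG_PATTERNS = [
    r'<span class="T...">',
    r'<span class="T..">',
    r'</span>',
    r'<div class="Sect.">',
    r'<div style="text-align:right">',
    r'<div style="text-align:left">',
    r'</div>',
]

# Join them into a single alternation so the text is scanned only once.
STRIP_RE = re.compile("|".join(TAG_PATTERNS))

def strip_wrappers(text):
    """Delete the span and div wrapper tags, keeping their contents."""
    return STRIP_RE.sub("", text)

sample = '<div class="Sect1"><span class="T12">Hello</span> <span class="T123">world</span></div>'
print(strip_wrappers(sample))
```

Only the tags are removed; the text between them is left in place.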
Epilogue:
Is there a better way? Probably, but all the links on the page that describes it are broken!
The basic idea of the potentially better way is that, instead of trying to retroactively fix OpenOffice's, err, overzealous XHTML and CSS, you fix its XML transformation file, "XSLT for Import" (in OO.o2.2 see Tools » XML Filter Settings).
On my setup the file would seem to be here: /Applications/NeoOffice.app/Contents/share/xslt/export/xhtml/ooo2xhtml.xsl
On the other hand, putting all these regex rules in PHP, to be applied to documents that people simply export the standard way from OpenOffice, would be really great.
A Drupal module that lets you keep different sets of regex rules for different versions of OO? Even, heaven forbid, for Microsoft Word's HTML export?
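To make the idea of per-source rule sets concrete, here is a rough sketch. It is written in Python only to show the shape of the thing; an actual Drupal module would be PHP, and nothing like this exists yet as far as this post knows. The keys "openoffice-2.2" and "msword-html", the RULE_SETS table, and the clean function are all hypothetical, and the Word rule set is deliberately left empty because no rules for it have been worked out here.

```python
import re

# Hypothetical registry: one ordered list of (pattern, replacement) per source.
RULE_SETS = {
    "openoffice-2.2": [
        (r"(?s)<head\b.*?</head>", ""),   # (?s) lets "." span line breaks
        (r'<span class="T...">', ""),
        (r'<span class="T..">', ""),
        (r"</span>", ""),
    ],
    "msword-html": [],  # placeholder, no rules written yet
}

def clean(text, source):
    """Run the rule set registered for the given source over the text."""
    for pattern, replacement in RULE_SETS[source]:
        text = re.sub(pattern, replacement, text)
    return text

sample = '<head><title>x</title></head><span class="T12">Hi</span>'
print(clean(sample, "openoffice-2.2"))
```

Because each rule set is just data, site administrators could in principle edit or add sets without changing any code.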
If nothing else such a Drupal module would be great for helping people learn regular expressions... Agaric wants it! We'll put up money, or accept money, to help code it.