Clean up an XHTML OpenOffice document for web publishing using RegEx

How to prepare a document exported as XHTML from OpenOffice (StarOffice/OpenOffice format) for Drupal or another web site, wiki, or content management system.

Resolution

SubEthaEdit is used here, but TextWrangler or any text editor that supports regular expressions ought to work (recommendations for free, open source text editors for Mac are welcome, if you care to tell us).

Removing all the junk with extremely rudimentary regex skills:

First, delete everything from <head> through </head>, inclusive. This isn't hard to do with a regular expression, but I would have to look it up, so I did it manually here.
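For reference, the head removal can be done with one non-greedy pattern. Below is a minimal sketch in Python, chosen only for illustration; the function name `strip_head` is made up for this example. The important part is letting the dot match newlines so the pattern can span the whole multi-line head block.

```python
import re

def strip_head(xhtml: str) -> str:
    """Remove everything from <head ...> through </head>, inclusive."""
    # re.DOTALL lets "." match newlines, so the match can cross lines.
    # The non-greedy ".*?" stops at the first </head> it finds.
    # "\b" keeps the pattern from matching tags such as <header>.
    return re.sub(r"<head\b.*?</head>", "", xhtml,
                  flags=re.DOTALL | re.IGNORECASE)

sample = "<html><head>\n<title>x</title>\n</head><body><p>Hi</p></body></html>"
print(strip_head(sample))
```

In an editor's find dialog the same idea is often written as `<head>[\s\S]*?</head>`, though whether that works depends on the editor's regex flavor.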

Find:
style="margin-left:1.25cm;"
Replace: (nothing)

Find:
class="P..." style="margin-left:1.25cm;"
Replace: (nothing)

Find:
<p class="P..." style="margin-left:0.25cm;">
Replace:
\n

Find:
</p>
Replace: (nothing)

Find all these and replace with nothing also:
<span class="T...">
<span class="T..">

</span>

<div class="Sect.">
<div style="text-align:right">
<div style="text-align:left">
</div>
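The find-and-replace steps above can also be run in one pass as an ordered list of rules. This is a minimal Python sketch under a few assumptions: the dots in `P...`, `T...`, `T..` and `Sect.` are treated as regex wildcards (one dot, one character), the decimal points in the margin values are escaped, and the `class="P..." style=...` rule runs before the bare `style=...` rule, because otherwise the bare rule would already have removed the text the longer one looks for. The names `RULES` and `clean` are made up for this example.

```python
import re

# (pattern, replacement) pairs, applied in order.
RULES = [
    (r'class="P..." style="margin-left:1\.25cm;"', ""),
    (r'style="margin-left:1\.25cm;"', ""),
    (r'<p class="P..." style="margin-left:0\.25cm;">', "\n"),
    (r"</p>", ""),
    (r'<span class="T...">', ""),
    (r'<span class="T..">', ""),
    (r"</span>", ""),
    (r'<div class="Sect.">', ""),
    (r'<div style="text-align:right">', ""),
    (r'<div style="text-align:left">', ""),
    (r"</div>", ""),
]

def clean(text: str) -> str:
    """Apply every rule to the text, one after another."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

sample = ('<div class="Sect1"><p class="P101" style="margin-left:0.25cm;">'
          '<span class="T12">Hello</span></p></div>')
print(repr(clean(sample)))
```

Because each dot matches exactly one character, `P...` only catches style names such as P101 and misses P1 or P12; a pattern like `P\d+` would be one way to loosen that. The rules also leave stray fragments such as `<p >` behind, which fits the "rudimentary" label.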

Epilogue:

Is there a better way? Probably, but all the links on the page that describes it are broken!

The basic idea of the potentially better way is that, instead of retroactively fixing OpenOffice's, err, overzealous XHTML and CSS, you fix its XML transformation file, "XSLT for Import" (in OO.o 2.2, see Tools » XML Filter Settings).

On my setup the file would seem to be here: /Applications/NeoOffice.app/Contents/share/xslt/export/xhtml/ooo2xhtml.xsl

On the other hand, it would be really great to put all these regex rules in PHP so they could be applied to documents from people who just export the standard way from OpenOffice.

How about a Drupal module that lets you keep different sets of regex rules for different versions of OO? Or even, heaven forbid, for Microsoft Word's HTML export?
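To make the idea concrete, here is one hypothetical way such per-source rule sets could be organized. It is sketched in Python for illustration rather than as Drupal/PHP module code. The keys `openoffice-2.2` and `msword-html` and the names `RULE_SETS` and `clean_with` are all invented for this example, and the Word entry is deliberately left empty because this post does not give any rules for it.

```python
import re

# Each source gets its own ordered list of (pattern, replacement) rules.
RULE_SETS = {
    "openoffice-2.2": [
        (r'<span class="T..">', ""),
        (r"</span>", ""),
    ],
    "msword-html": [],  # placeholder: no rules worked out yet
}

def clean_with(source: str, text: str) -> str:
    """Apply the rule set registered for the given source, in order."""
    for pattern, replacement in RULE_SETS[source]:
        text = re.sub(pattern, replacement, text)
    return text

print(clean_with("openoffice-2.2", '<span class="T12">a</span>'))
```

Keeping the rules as plain data means a new export format only needs a new list, not new code.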

If nothing else, such a Drupal module would be great for helping people learn regular expressions... Agaric wants it! We'll put up money, or accept money, to help code it.

Searched words: 
OpenOffice HTML export convert to Drupal wiki regular expressions converting openoffice xhtml to web
