Enter the Unholy: Microsoft Word to the Web

Summary of the issue below, and then the beginnings of a search for a solution.

Currently TinyMCE with the WYSIWYG API is in use.

Problems with the WYSIWYG editor, which is a life-or-death issue for the site:

Problem 1: gibberish code inserted after cut and paste from Word
Problem 2: arbitrary line breaks inserted into text – they are invisible in the editor, visible in preview and in publish, but are not at all fixable by using the editor.

We will have busy people posting live, and the standard method we know they will use is cutting and pasting from a Word document. Several of the standard ways of doing this lead to incorporating a mountain of gibberish formatting code from Word into the post, which is invisible until you 'preview' or 'publish'. If they publish, they will publish gibberish that they cannot alter (part of the workflow of the site), and if they preview, they will see gibberish. They will also see arbitrary line breaks that are not fixable by any means I can discover.

The problem is obvious: this result leads the user to believe that the software she's using is buggy and doesn't work. If I were her, I wouldn't spend any time thinking about how to fix it; I'd just give up and maybe send a message to the Editor saying that the editor software is broken.

Developers have implemented solutions that remove gibberish from some of the cut&paste paths – I list the paths below and note if they now work (that is, paste without gibberish) or not. Please note that a persistent problem with ALL uses of the editor is the arbitrary insertion of bad line breaks.

There are several standard ways of cutting and pasting. All start with using the mouse to select a block of text. Then there are different paths:
1. cut and paste using ctrl-c & ctrl-v (this gets rid of gibberish, but introduces bad line breaks)
2. cut and paste using the mouse button (gibberish + bad breaks)
3. hit the 'paste from word' icon in the editor (this gets rid of gibberish, but introduces bad line breaks)
4. (least likely) cut and paste using the browser 'edit' bar. (gibberish + bad breaks)

We need to address all the possibilities: restricting user options, making an obligatory ‘paste from Word into this box’ step that then transfers the text into the WYSIWYG editor, anything we can think of.

General thoughts, aside from the observation that Microsoft is the root of all evil:

The particular issue is migrating content from Word into a WYSIWYG editor and ultimately into HTML. Word documents can use so many special features that there will always be an edge case that the WYSIWYG editor cannot handle gracefully, and for which it will return bad markup.

I'd say the masters of this right now would be Google Docs, and people still bring up its flaws. [The same goes for its conversion to regular HTML, as opposed to what is displayed in its own rich text editor, which relies heavily on JavaScript.] It does have improvements to the workflow that we don't have, but I do not know how we can integrate those improvements, or whether they've even open-sourced that knowledge. Additionally, even Google Docs has problems with some of these points, namely rejecting some commands depending on the browser you're using.

Google handles the issue of importing content from Word by actually having an "import your document" interface. Their import is not perfect either and also succumbs to bad formatting on the edge-cases of pushing the limits of Word's functionality.

The best that the developers we have at hand can do is fix each bug as it is discovered through hacking the code, and hacking in add-on functionality (like perhaps a Word import feature).

If we do need to have a perfect WYSIWYG interface, we may need to find a WYSIWYG expert.

Another restatement of the problem:

The problem

The vast majority of content on the website will be text previously written for other purposes (publications, etc). When an average user (who is nearly technically illiterate, with a low patience level) is inputting content for use on the website, 80-95% of the time they will want to use content that currently exists in a Word document (or other rich-text format). We need to ensure that what the user inputs will 1) retain only a limited amount of formatting (lists, emphasis, headers, etc), and 2) not contain garbled text (like random Word formatting notes).
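
To make requirement 1 concrete, here is a minimal sketch of a tag whitelist. The list of allowed tags and the function name `keepAllowedTags` are my own assumptions for illustration (the requirement only says "lists, emphasis, headers, etc"), and a regular expression is not a real HTML parser, so treat this as a sketch rather than something ready for the site.

```typescript
// Hypothetical whitelist: the post only specifies "lists, emphasis, headers, etc".
const ALLOWED_TAGS: ReadonlyArray<string> = [
  "p", "br", "em", "strong", "ul", "ol", "li", "h2", "h3", "h4",
];

// Keep whitelisted tags (dropping all of their attributes) and delete every
// other tag. The text between tags is left untouched.
function keepAllowedTags(html: string): string {
  return html.replace(
    /<(\/?)([a-zA-Z][a-zA-Z0-9:]*)\b[^>]*>/g,
    (_match: string, slash: string, name: string): string => {
      const tag = name.toLowerCase();
      return ALLOWED_TAGS.indexOf(tag) !== -1 ? "<" + slash + tag + ">" : "";
    }
  );
}
```

For example, `keepAllowedTags('<p class="MsoNormal"><span style="color:red">Hi</span> <b>there</b></p>')` gives back `<p>Hi there</p>`: the paragraph survives without its Word class, and the span and bold tags are dropped while their text is kept.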

We are using TinyMCE as our WYSIWYG editor in a Drupal environment (using the WYSIWYG API, Better Formats and Image Assist modules). We are able to get text into TinyMCE that is formatted very close to how we want it by using the "Paste From Word" functionality (the edge-case failures when degrading Word formatting are not in the scope of this proposal).

The problem is that the "Paste from Word" functionality causes a popup box to appear which instructs the user to press "ctrl+v" to paste their text into the dialog box and then press OK. The user is still able to paste in text while bypassing that functionality, which may leave them wondering why things don't look correct or, worse, output a garbled Word mess. We could force the user to go through the "Paste from Word" dialog for every single pasting method. However, that would greatly degrade the editing experience: someone who just wanted to move a paragraph around within the WYSIWYG interface, for example, would be forced through the dialog box for each cut and paste.

What we want is to hide the dialog box workflow for the user and just automatically filter all text.

Proposed solution

Extract the filtering capability from the "Paste from Word" functionality, and make it possible to run the filter on the entire text that is within TinyMCE. Whenever we 'detect' that a user has pasted new content, we will temporarily disable the textarea and run that function for them automatically.
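
As a rough idea of what such a standalone filter might look like, here is a minimal sketch. It is not TinyMCE's actual "Paste from Word" code; the function name `stripWordMarkup` and the particular patterns are my own assumptions, based on the kind of markup Word typically emits (conditional comments, `<o:p>` tags, `Mso` classes). Note that it removes every inline `style` attribute, not only Word's.

```typescript
// Sketch only: strips common Word-specific markup from pasted HTML.
// The patterns are assumptions for illustration, not TinyMCE's own rules.
function stripWordMarkup(html: string): string {
  return html
    // Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    .replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, "")
    // Any remaining HTML comments
    .replace(/<!--[\s\S]*?-->/g, "")
    // Embedded <style> and <xml> blocks, including their contents
    .replace(/<(style|xml)\b[\s\S]*?<\/\1>/gi, "")
    // Namespaced Office tags such as <o:p> and </o:p>
    .replace(/<\/?[a-z]+:[^>]*>/gi, "")
    // class="MsoNormal" and similar Word class names
    .replace(/\s+class=("Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)/gi, "")
    // All quoted inline style attributes (Word's mso-* styles among them)
    .replace(/\s+style=("[^"]*"|'[^']*')/gi, "");
}
```

Because it is a plain string-to-string function, it could be run over the whole contents of the editor without any dialog box being involved, which is the point of the proposal.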

The paste detection will be done by calculating how much the text has changed in the past 50ms. If the text has changed by more than a user could possibly enter by hand in that time (say, 10 characters or so), then we will assume the user has pasted in new content and will run the filter.
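
The heuristic itself is small enough to sketch. The `Sample` type and the function name `looksLikePaste` are hypothetical, and how the samples would be taken from TinyMCE is left open here. The 50ms window and 10-character threshold are the figures from the proposal. This version only counts growth, so a paste that replaces a larger selection would be missed.

```typescript
// One observation of the editor contents: text length and when it was measured.
interface Sample {
  length: number;
  timeMs: number;
}

// Returns true when the text grew by at least minChars within windowMs,
// which is assumed to be faster than anyone can type by hand.
function looksLikePaste(
  prev: Sample,
  curr: Sample,
  windowMs: number = 50,
  minChars: number = 10
): boolean {
  const elapsed = curr.timeMs - prev.timeMs;
  const grew = curr.length - prev.length;
  return elapsed >= 0 && elapsed <= windowMs && grew >= minChars;
}
```

A jump from 100 to 400 characters within 40ms would count as a paste; a single extra character in the same window, or the same jump spread over five seconds, would not.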

We need someone who is proficient enough with JavaScript to work with TinyMCE and make it behave in this manner.

Ben is currently looking into the TinyMCE codebase to see if it will be possible for him to do this. However, if he finds by the end of the day that this task requires JavaScript expertise, we will need assistance.

[So far I've just been doing it.]

And now we return you to our regularly scheduled scouring of the internet.

Not finding anything good out there

Commercial software:
http://www.technoriversoft.com/wordtohtmlconversion.html

Old post:
http://www.technoriversoft.com/wordtohtmlconversion.html

Findings

Having the WYSIWYG editor automatically upload pasted graphics does not appear to be possible, period, so we need to make sure they don't show, or give a notice that they have to be uploaded separately.

Resolution

Searched words: 
how does google import word documents open source convert from word

Comments

DocVert

Or another option would be to have some REST service running to do the DOC -> HTML conversion.

Have a look at http://holloway.co.nz/docvert/

Roel

This might help

http://drupal.org/project/safehtml
It's a good start to clean up the garbage BEFORE it's saved to database. Hope this helps!

Haris

Word puts strange code instead of text when I paste into google

Word puts strange code instead of text when I paste into google search box.
Problem - a simple request for Google is to find a zip code for an address, or maybe map it. No matter how I copy the text from my Word 2003 document, it pastes as unintelligible garbage!!! If Microsoft can't fix it, maybe Google should be smarter & convert it to actual text.
