Creating an ePub of a Complex Scholarly Book from MS Word

For the previous books, the ePub had been generated from InDesign. A separate InDesign book had to be prepared from the same InDesign files used in the book for print format, with certain modifications to prepare for the conversion. Would it be possible to do an effective conversion from Word that would pass all the tests needed for distribution?

I started with a copy of the final Word .docx used to generate the print-ready PDF. I removed manual line breaks and manual page breaks by global replacement. I should not have done this globally, since it caused more work later when I found places in the ePub where a few of these should have been retained. Only the breaks inserted for better final layout should have been removed. I tried to substitute Times New Roman for Minion Pro by simply redefining the default font and the Normal style of this document, but this did not work. I converted a few styles manually for the change of font, but there were too many styles to deal with, and the changes did not seem to have the cascading effect I expected. So I just did a global search and replace. It took more than one try to eliminate all the Minion Pro, or at least to appear to do so.
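One way to check whether a font has really been eliminated is to look inside the .docx itself rather than rely on Word's search. The following is a minimal sketch, not something used for this book: it assumes the fonts of interest are recorded in w:rFonts elements in word/document.xml and word/styles.xml, and the function name fonts_in_docx is hypothetical. Fonts recorded elsewhere, for example in theme files, would not be found.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml and word/styles.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARTS = ("word/document.xml", "word/styles.xml")


def fonts_in_docx(docx):
    """Return {font name: number of w:rFonts attributes that name it}.

    `docx` may be a path or a binary file-like object.
    """
    counts = {}
    with zipfile.ZipFile(docx) as package:
        present = set(package.namelist())
        for part in PARTS:
            if part not in present:
                continue
            root = ET.fromstring(package.read(part))
            # w:rFonts also occurs inside w:pPr/w:rPr, the formatting of the
            # paragraph mark, which is easy to overlook in Word itself.
            for rfonts in root.iter(f"{W}rFonts"):
                for slot in ("ascii", "hAnsi", "cs", "eastAsia"):
                    name = rfonts.get(f"{W}{slot}")
                    if name:
                        counts[name] = counts.get(name, 0) + 1
    return counts
```

Calling fonts_in_docx on the working copy before and after a global replacement would show whether any references to Minion Pro remain.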

I had learned that the free program calibre could convert a Word file to ePub, but one source said this did not work with .docx, only with .rtf. Therefore, after the adjustments just mentioned, I resaved the .docx as .rtf, added the latter file to calibre, and used the conversion command. The conversion failed with the error message that calibre had encountered unexpected features in the RTF. I then noted that the latest version of calibre I had downloaded had a setting for converting from .docx. When I added the .docx to calibre and converted, the process completed. If the .rtf version had worked, it probably would have had much cleaner html than what resulted from .docx.
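The conversion described above was done in the calibre application. calibre also installs a command-line tool, ebook-convert, which takes an input file and an output file, so the same step could be scripted. This is only a sketch under that assumption: the file names are placeholders, the helper names are hypothetical, and no conversion options are set.

```python
import shutil
import subprocess
from pathlib import Path


def convert_command(source, target):
    """Build the argument list: ebook-convert <input file> <output file>."""
    return ["ebook-convert", str(source), str(target)]


def convert(source, target):
    """Run the conversion; raises if calibre's tool is missing or fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    subprocess.run(convert_command(source, target), check=True)
    return Path(target)


# Example with placeholder names:
# convert("book.docx", "book.epub")
```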

It was possible to make some edits of the result right in calibre: I restored some line breaks and I edited the css file to remove references to various fonts that were not supposed to be in my document at all (Calibri, Tahoma, Arial, Palatino Linotype). I believe they must have been concealed in a few paragraph symbols in Word, since I had searched previously in Word to find and replace fonts like Calibri. This appears to be another indication that global searching in Word 2016 is not fully reliable, or perhaps it has something to do with the extraordinary messiness (and overkill of tagging) of the XML used by Microsoft for .docx. But I prefer to work on an ePub archive in Oxygen XML Editor or BBEdit. In order to get an ePub archive outside of calibre to work on (and eventually upload), the proper command to use in calibre is Save to disk “as EPUB only in single directory.”
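The css cleanup mentioned above was done by hand. For a larger stylesheet, a short script could do the same job. The sketch below assumes that each stray font appears in an ordinary font-family declaration and that deleting the whole declaration is acceptable. The function name drop_font_families is hypothetical.

```python
import re

# One font-family declaration, up to the closing semicolon or brace.
DECL = re.compile(r"font-family\s*:\s*([^;}]*);?\s*", re.IGNORECASE)


def drop_font_families(css, unwanted):
    """Remove font-family declarations that mention any unwanted font name."""
    lowered = [name.lower() for name in unwanted]

    def replace(match):
        value = match.group(1).lower()
        # Delete the declaration only if it names one of the stray fonts.
        return "" if any(name in value for name in lowered) else match.group(0)

    return DECL.sub(replace, css)


stray = ["Calibri", "Tahoma", "Arial", "Palatino Linotype"]
cleaned = drop_font_families('.c1 { font-family: "Calibri", sans-serif; font-size: 1em }', stray)
```

Declarations that do not name one of the listed fonts are left exactly as they were, so the script can be run repeatedly without further effect.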

Here were some of the actions needed to get the ePub to look right (I checked with iBooks and Adobe Digital Editions) and to pass epubcheck validation and validation on the Lulu site.

1. The file toc.ncx created by calibre had only one navPoint, for Notes. As with previous books, this file needed to be edited to have the proper chapter titles and, in this case, the proper navPoints. It may be noted that the footnotes are all at the end of the book with calibre’s conversion, whereas the previous ePubs created in InDesign had notes for each chapter at the end of each chapter.

2. In various places I saw unwanted changes in the size of the font. When I explored the html for those locations, I found an enormous amount of unnecessary coding in the file: spans were applied to stretches of text with no rationale, including a number of the type <span id="id_OLE_LINK21"> … </span>. I guess that this is something caused by Word rather than calibre. Eliminating all spans containing OLE did no harm and removed many, but not all, of the font size anomalies. Others were removed when I changed the class of a span to the same as that of a nearby portion of text that was in the correct size. I could detect no reason why a different style had been applied to the two stretches of text, nor why the css settings for these two styles differed from each other. Eventually, I turned to the css file and removed all the font-size statements that indicated an enlarged font (1.2em or 2.223em). I have no idea what the origin of these was, but in the end the fonts in the book as displayed in iBooks or Adobe Digital Editions were of uniform size.

3. The Appendix to chapter 1 contained some very extensive two-column tables. Somehow, in one of the tables colspan="2" had been added as an attribute to a number of cells, producing an XML error that caused the rest of the chapter not to be displayed at all. Removing all the colspan="2" attributes in the file solved this problem.

4. In the printed book the logo on the title page is in .eps format, and calibre converted this to .emf. But that would not display in the reading applications. A jpeg version had to be used instead, with changes to the manifest in content.opf and the html of the page itself. The plates had been in .tiff format in the book, and those displayed in iBooks but not in Adobe Digital Editions. I replaced them with jpeg versions. Since I use the .jpg suffix when I save a jpeg file, I initially mistyped the media-type in the manifest as image/jpg rather than the required image/jpeg. epubcheck caught that, as well as my failure to remove the manifest entry for the .emf version of the logo.

5. I deleted the ugly cover image that calibre had created, waiting for the simple cover that is created when processing the ePub at Lulu.com.

6. Because of my overenthusiastic removal of page breaks in the Word version before conversion, the seven plates with their captions were not appearing on separate pages and were all in one html file. Adding a horizontal rule and a little more space between them was a slight improvement, but not good enough. So I divided the one html file into seven, one for each plate, and adjusted the manifest and spine in content.opf so that they appeared in the right place. Where there had been one file named by calibre’s converter as “index_split_019.html”, the file of that name now contained only Plate 1 and the remaining plates were in files named with the ending “_019a” through “_019e”. In the manifest element the original file had the attribute id="id2456", and six new files likewise had a through e added to the filename. In the spine element, six lines of the type <itemref idref="id2456[a-e]"/> had to be added to follow the original <itemref idref="id2456"/>.

7. Although epubcheck once again completed without errors, the validation that takes place when uploading to Lulu revealed that there was still an extraneous calibre file that needed to be deleted (META-INF/calibre_bookmarks.txt), since it was not in the manifest and certainly not needed. Also it was necessary to add to the dc:date element the attribute opf:event="publication". After these changes Lulu validation worked and the ePub version was published.
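Several of the fixes in the list above were made by hand in an editor, but they could also be scripted. The sketch below covers three of them: unwrapping the OLE_LINK spans (item 2), auditing the manifest for wrong media types and entries for missing files (item 4), and rewriting the archive without the leftover calibre file (item 7). It is an illustration under stated assumptions, not the procedure actually followed. All function names are hypothetical, the media-type table is deliberately short, and the manifest check assumes that content.opf sits at the root of the archive so that each href matches a file name in the package.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"
SPAN_TAG = re.compile(r"<span\b[^>]*>|</span\s*>", re.IGNORECASE)
# Assumed table of file suffixes and the media-type expected for each.
EXPECTED = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
            ".png": "image/png", ".gif": "image/gif"}


def unwrap_ole_spans(html):
    """Item 2: drop span tags that mention OLE_LINK but keep their contents."""
    out, stack, pos = [], [], 0
    for match in SPAN_TAG.finditer(html):
        out.append(html[pos:match.start()])
        pos = match.end()
        tag = match.group(0)
        if tag.startswith("</"):
            dropped = stack.pop() if stack else False
            if not dropped:
                out.append(tag)
        else:
            drop = "OLE_LINK" in tag
            if not tag.endswith("/>"):
                stack.append(drop)  # remember whether the matching </span> goes too
            if not drop:
                out.append(tag)
    out.append(html[pos:])
    return "".join(out)


def manifest_problems(opf_text, files_present):
    """Item 4: report wrong media types and manifest entries with no file."""
    problems = []
    for item in ET.fromstring(opf_text).iter(f"{OPF}item"):
        href, media = item.get("href"), item.get("media-type")
        suffix = "." + href.rsplit(".", 1)[-1].lower()
        if suffix in EXPECTED and media != EXPECTED[suffix]:
            problems.append(f"{href}: media-type {media}, expected {EXPECTED[suffix]}")
        if href not in files_present:
            problems.append(f"{href}: listed in manifest but not in the package")
    return problems


def copy_epub_without(src, dst, unwanted=("META-INF/calibre_bookmarks.txt",)):
    """Item 7: copy the archive, leaving out the unwanted entries."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        names = [name for name in zin.namelist() if name not in unwanted]
        if "mimetype" in names:
            # The mimetype entry has to come first and be stored uncompressed.
            names.remove("mimetype")
            zout.writestr("mimetype", zin.read("mimetype"),
                          compress_type=zipfile.ZIP_STORED)
        for name in names:
            zout.writestr(name, zin.read(name), compress_type=zipfile.ZIP_DEFLATED)
```

The span unwrapper keeps a stack of open spans so that a span nested inside an OLE_LINK span keeps its own closing tag. None of this replaces running epubcheck on the result.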

The process probably took about the same amount of time as producing an ePub using InDesign, or perhaps a little longer because this was the first time using calibre and the first time investigating what had gone wrong in the css file and in some of the tagging assigned to spans of text.

Addendum January 19, 2018

The ePub was not accepted for wider distribution until certain fixes were made at the request of Lulu’s system for internal validation.

1. The message from Lulu said that the author’s middle initial was missing from metadata but present on the title page and marketing image. When I checked the metadata in the file I had uploaded, the middle initial was already in the metadata. I did not realize until a second warning that in processing my uploaded file, the system used by Lulu had changed the metadata file (reordered items and also changed content). It had created a different dc:creator element using the first name field on the first page of the setup sequence for a Lulu publication, where no middle initial was present. This is where the correction was needed, not in the ePub that I was uploading.

2. The first content file of the ePub I uploaded began with some empty paragraphs rather than the title, and the title was in a <p> element. It has to be an <h1> instead, with no other items before it.

3. The first file listed in toc.ncx had the <text> element set to Front Matter. This text needed to be changed to the title of the book.
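The second and third points above can be checked before uploading. The following sketch assumes that the first content file and toc.ncx are well-formed XML that Python's standard parser can read, which would not hold if a file uses named entities such as &nbsp;. The function names and the title "My Book" are placeholders.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
NCX = "{http://www.daisy.org/z3986/2005/ncx/}"


def starts_with_title(xhtml_text, title):
    """True if the first element inside <body> is an <h1> holding the title."""
    body = ET.fromstring(xhtml_text).find(f"{XHTML}body")
    if body is None or len(body) == 0:
        return False
    first = body[0]
    text = "".join(first.itertext()).strip()
    return first.tag == f"{XHTML}h1" and text == title


def first_nav_label(ncx_text):
    """Return the <text> of the first navPoint in toc.ncx, or None."""
    root = ET.fromstring(ncx_text)
    label = root.find(f"{NCX}navMap/{NCX}navPoint/{NCX}navLabel/{NCX}text")
    return None if label is None else (label.text or "").strip()
```

If starts_with_title returns False, or first_nav_label returns something other than the book title, the file would probably draw the same requests from Lulu described here.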

[A Russian translation of this page is available here, courtesy of Timur Kadirov.]