Monthly Archives: December 2010

HTML to ePub using Sigil

I was looking for a way to convert HTML books into an ePub file. The general layout of the file should be preserved (including images), while all the stuff that doesn’t make sense on an ebook reader (such as navigation elements and the usual “back to top” links) should be removed.

After trying Calibre rather extensively, I came across an app named Sigil, which does exactly what I want: You just throw in your HTML files (it automatically imports images referenced by them) and add some metadata.

Before proceeding, you should use your favorite scripting language (or modify the attached quick-and-dirty PHP script) to remove everything but the main part of the chapter from the HTML files. Make sure to also remove any tables or divs that wrap the entire content, because they might break page-by-page navigation on your ebook reader.
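If you would rather not touch PHP, here is a minimal Python sketch of the same clean-up idea. It is not the attached script. It assumes that the chapter text sits inside a single element with a known id ("content" is a made-up placeholder, your files will likely use something else) and it returns only what is inside that element, so the wrapping div or table is dropped along with the navigation around it.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects the inner HTML of the first element whose id matches."""

    def __init__(self, content_id):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.content_id = content_id
        self.depth = 0       # 0 means we are outside the wanted element
        self.done = False    # True once the wanted element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.content_id:
            self.depth = 1   # the wrapper itself is not copied

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are copied without changing the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # closing tag of the wrapper: stop collecting
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_main(html, content_id="content"):
    """Return the inner HTML of the element with the given (assumed) id."""
    parser = MainContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

The result is a bare fragment, so you would still wrap it in a minimal html/body skeleton before saving it for Sigil. HTML comments are dropped, and badly nested markup may confuse the depth counter, so check a few chapters by hand.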

Sigil works very smoothly if your HTML files are in alphabetical order. If they’re not, don’t despair: take the index.html file that (hopefully) came with them and use your favorite scripting language (or modify the attached quick-and-dirty PHP script) to grab all the links from it (be sure to remove anchors and duplicates) and generate an XML structure like this:

<spine toc="ncx">
<itemref idref="file1.html" />
<itemref idref="file2.html" />
</spine>

Each idref has to match the id of that file’s entry in the manifest section of content.opf. Manually replace the spine section in the content.opf file inside the generated ePub with the lines you just created. Then re-open the ePub in Sigil and check whether it found any HTML files you forgot to include (they will show up at the top of the file list). If there are any, move them to the place where you want them.
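For the link-grabbing step, a rough Python sketch could look like the one below. It is not the attached PHP script. The function names are made up for illustration, and it assumes (as in the example above) that the manifest ids in content.opf are simply the file names. If your manifest uses different ids, you would have to map the file names to them first.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlsplit
from xml.sax.saxutils import quoteattr


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def spine_from_index(index_html):
    """Build a <spine> block from the links found in index.html."""
    collector = LinkCollector()
    collector.feed(index_html)
    collector.close()

    seen = set()
    lines = ['<spine toc="ncx">']
    for href in collector.hrefs:
        target = urldefrag(href).url  # strip the "#anchor" part
        # Skip same-page anchors, external links (http:, mailto:, ...) and repeats.
        if not target or urlsplit(target).scheme or target in seen:
            continue
        seen.add(target)
        # Assumption: the manifest id equals the file name.
        lines.append(f"<itemref idref={quoteattr(target)} />")
    lines.append("</spine>")
    return "\n".join(lines)


def replace_spine(opf_text, new_spine):
    """Swap the first <spine>...</spine> block in the content.opf text."""
    return re.sub(r"<spine\b.*?</spine>", lambda m: new_spine,
                  opf_text, count=1, flags=re.DOTALL)
```

You would feed `spine_from_index` the text of index.html and paste the result into content.opf, or let `replace_spine` do the swap on the extracted file. Putting the modified content.opf back into the ePub archive is left out here on purpose.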

Once you have everything the way you want it, check the auto-generated table of contents using the TOC Editor option. Chances are that every entry shows up twice if the links in your index.html file are recognized as chapter headlines. In that case, just uncheck the duplicates. If you don’t feel like unchecking 500 items, I’ve attached an AppleScript that does it for you: select the bottom-most line you want unchecked and adjust the number of lines inside the script.