
Re: (ITS#7614) Markup error in slapd.conf.5

Howard Chu <hyc@symas.com>:
> Hm. I use my own man2html http://highlandsun.com/hyc/man2html.c
> which gives pretty good looking output for us. Really, if you're
> developing a tool that claims to read troff input, it has to
> actually do so. I mean, the point of tools such as this is to be
> able to convert existing documents without modifying them, isn't it?

There's a subtle difference between tools that translate purely at a
presentation level and tools that do content analysis.

A purely presentation-level translator such as your man2html is
indeed less likely to be thrown by weird troff markup.  And much of
the time it will produce output that doesn't look bad, especially on a
relatively small collection of pages in a consistent house style.

But there are things such a tool cannot do that become more important
when you are translating a very large corpus of man pages with
multiple authors - such as an entire Linux distribution's man page tree.

Here is an example: the treatment of file paths in FILES sections.
Some man-page authors mark them up as bold text.  Some mark them up
as italic. And some give them no highlight at all.

A purely presentation-level translator will simply translate any font
change from troff to HTML.  In the resulting output, filenames will
have three different visual signatures.  Readers will thus have to
work a little harder than they should to recognize filenames.
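
To make that concrete, here is a minimal sketch of the presentation-level
approach. It is an illustration only, not how man2html actually works: the
FONT_TAGS table, the present() function, and the sample FILES lines are all
hypothetical, and real troff has many more font-change forms than .B and .I.

```python
import html

# Hypothetical mapping from troff font macros to HTML tags.
FONT_TAGS = {".B": "b", ".I": "i"}

def present(line):
    """Translate one troff line to HTML purely by its font macro."""
    macro, _, rest = line.partition(" ")
    tag = FONT_TAGS.get(macro)
    if tag is None:
        # No font macro: the text passes through with no highlight.
        return html.escape(line)
    return f"<{tag}>{html.escape(rest)}</{tag}>"

# Three authors, three conventions for the same kind of thing.
files_section = [".B /etc/passwd", ".I /etc/group", "/etc/shadow"]
rendered = [present(line) for line in files_section]
```

The three filenames come out bold, italic, and plain respectively, because
the translator only ever sees font changes, never "this is a filename".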

A tool that does content analysis, on the other hand, can recognize
presentation-level cliches that mean "this is a filename", such as a
line in FILES beginning with .B or .I and containing a /, and map it
to a DocBook <filename> pair.  The DocBook stylesheet will then ensure
that all filenames are visually marked in the *same* way in generated
HTML, rather than three different ways.
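
The filename cliche described above could be sketched roughly like this.
It is a simplified illustration under stated assumptions, not the code of
any real translator: the to_docbook() function and the FONT_MACRO pattern
are hypothetical, the caller is assumed to track the current section name,
and only the one heuristic named above (.B or .I plus a slash, inside
FILES) is implemented.

```python
import re
from xml.sax.saxutils import escape

# Matches a line that starts with .B or .I followed by its argument text.
FONT_MACRO = re.compile(r"^\.[BI]\s+(.+)$")

def to_docbook(section, line):
    """Return a DocBook <filename> element if the line matches the
    'highlighted path in FILES' cliche, otherwise None."""
    match = FONT_MACRO.match(line)
    if section == "FILES" and match and "/" in match.group(1):
        return f"<filename>{escape(match.group(1).strip())}</filename>"
    return None

bold = to_docbook("FILES", ".B /etc/passwd")
italic = to_docbook("FILES", ".I /etc/group")
```

Both the bold and the italic variant map to the same element, so the
stylesheet, not the individual author, decides how filenames look. A line
outside FILES, or one without a slash, is left for other rules to handle.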

Now multiply this effect by all the different things that can be 
recognized by content - things like Unix error codes, program listings, 
command synopses, C function prototypes, references to other manual pages.

The effect is a higher-quality and more visually uniform translation.  The
gain in quality increases with the diversity of the author population. For
very large collections like an entire Linux distribution's man page tree
it is quite significant.

The tradeoff is that, while a presentation-level translator will
cheerfully produce a visual garble from ill-formed troff, a tool with
a real parser and a content analyzer will have a few more cases in
which it just can't cope at all.  

Not *many* cases, mind you; in the decade I've been developing mine,
perhaps 2-3%.  But these are worth fixing anyway, because they're
likely to break third-party man-page readers.  Nothing but troff
itself interprets troff perfectly.
		Eric S. Raymond <http://www.catb.org/~esr/>