Tear down this PDF

The PDF document format is digital publishing’s worst enemy. For a large part, the news industry still relies on this 18-year-old format to sell its content online. PDF is to e-publishing what the steam locomotive is to the high-speed train. In our business, progress is called XML and HTML5.

Picture today’s smartphone reading experience. We’ll start with a newspaper purchased on a digital kiosk. For a broadsheet, a format still largely used by dailies, the phone’s “window” covers 1/60th of the paper’s page. Multiply by 30 pages of news. You’ll need 1800 pans and zooms to cover the entire publication (plus, each time you pinch out, you can take a leisurely sip of your coffee as the image redraws).
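
For readers who like to check the arithmetic, here is a minimal sketch of the estimate above. The two figures (1/60th of a page per screen, 30 pages) come from the paragraph; the function name is only illustrative, and the count assumes every part of every page is viewed exactly once, with no overlap between views.

```python
from fractions import Fraction

# Figures taken from the paragraph above.
SCREEN_SHARE_OF_PAGE = Fraction(1, 60)  # the phone "window" covers 1/60th of a broadsheet page
PAGES = 30                              # pages of news in the issue


def views_needed(screen_share: Fraction, pages: int) -> int:
    """Screen-sized views needed to see every part of every page once (no overlap)."""
    views_per_page = 1 / screen_share   # 60 views for one page
    return int(views_per_page * pages)


print(views_needed(SCREEN_SHARE_OF_PAGE, PAGES))  # 1800
```

In practice the number would be higher, since real panning overlaps and readers backtrack.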

Next, we have two iPhone screen captures of American Photo, purchased on Zinio. The more compact magazine format doesn’t help. Note that you need to scroll laterally to read a full line (as for the “Text” function, meant to ensure easier reading, it is ineffective):

Am I being too derisive, or can we say this is not the best way to read?

The battle for online news will be won on mobility. We’re just at the beginning of the smartphone era. We can count on better screens, faster processors combined with extended battery life, more storage, better networks… The bulk of news consumption will come from people on the move, demanding constant updates and taking a quick glance at what is stored in their mobile device, regardless of network conditions. Speed, lightness and versatility will be key success factors. There won’t be much tolerance for latency.

In that respect, PDF is just a lame duck.

Back in 1993, the Portable Document Format was a fantastic digital publishing breakthrough. All of a sudden, using a sophisticated mathematical description of images, texts, typefaces and layout elements, the most complex graphic creation could be encapsulated into a single file. Large font sets and dedicated software were no longer needed. The PDF reader, licensed from Adobe Systems under the name of Acrobat, soon became free or pre-loaded on various OS platforms. PDF became an open standard in 2008. As for the performance, it was stunning: see a 6400% magnification below:

Great for high-quality book publishing… And a completely pointless stunt for a mobile news product.

The newspaper industry jumped on PDF. The new format let a production crew send the full publication to the printing plant using huge, high-definition PDF files directly transferred to the printing plates. When the web arose, the industry kept using the same format to make the publication available for downloading. After years of file optimization, a newspaper or a magazine still weighs 20 to 50 megabytes. The download is manageable over ADSL or cable, but impractical on a mobile network. But wait, it can get worse: on the Android platform, for example, the reader can actually add weight to the original PDF file. This is the consequence of a good intention: giving the publisher the choice between a finished product that is easier to leaf through, but requires a heavier file, and one that downloads faster, but is more difficult to read.
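
To put rough numbers on “manageable” versus “impractical”, here is a small sketch. Only the 20 to 50 megabyte range comes from the paragraph above; the two link speeds are assumptions chosen purely for illustration, and the calculation ignores latency, protocol overhead and signal drops, all of which make mobile downloads worse in practice.

```python
def download_seconds(size_megabytes: float, link_megabits_per_second: float) -> float:
    """Ideal transfer time: 1 megabyte = 8 megabits, no overhead or latency."""
    return size_megabytes * 8 / link_megabits_per_second


# File sizes from the paragraph above.
ISSUE_SIZES_MB = (20, 50)

# ASSUMED, illustrative link speeds; real throughput varies widely.
ASSUMED_LINKS_MBPS = {
    "fixed line (assumed 8 Mbit/s)": 8.0,
    "mobile network (assumed 1 Mbit/s)": 1.0,
}

for label, speed in ASSUMED_LINKS_MBPS.items():
    for size in ISSUE_SIZES_MB:
        print(f"{size} MB over {label}: {download_seconds(size, speed):.0f} s")
```

Under these assumed speeds, the same issue that arrives in under a minute at home takes several minutes on the move.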

Publishers’ inclination to keep using PDF is based on one idea: the graphical elements of a publication — layout, typefaces — are an essential component of a printed brand. By extension this visual identity is seen as a “label of trust” for the news brand, with the design-perfect PDF being the medium of choice.

Now, three things:
#1, this widely shared assertion is not supported by strong facts. There is no survey (to my knowledge) that links visual identity to reader loyalty, to feelings of trust;
#2, on this matter, if there remains any lingering bond with readers, it will fade away with the new generation of news consumers: they are much less sensitive than their elders to the notion of “trusted brand”, let alone to any design associated with it;
#3, the web has evolved. The HTML5 standard has shown the ability to render any graphic design without the PDF format’s downsides (see this previous Monday Note: Rebooting Web Publishing Design).

Why not, therefore, jump off the PDF train? The short answer is XML management. Our techiest Monday Note readers will forgive this shortcut: the Extensible Markup Language is a version of the web language readable by both machines and humans. An article encoded in XML is not an image but a set of character strings associated with various “tags” that describe what they are and where they belong; the description also provides contextual information to be retrieved at will. In theory, any publishing system, big or small, should be able to produce clean XML files. It should also be able to generate a “zoning file” that maps the coordinates of a story, or any other element in the page (see the red box below that indicates the position of the story in a newspaper front page). Armed with such position data, smartphone software can provide the right reading experience, limiting the need for the painful panning and zooming I mentioned above.
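
To make the idea concrete, here is a minimal sketch of what a tagged article and a zoning file might look like, and how reader software could use them. Every tag name, attribute and coordinate below is invented for illustration; none of it is taken from a real editorial system or an industry schema. The sketch uses Python’s standard `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL markup: tag and attribute names are invented for illustration.
ARTICLE_XML = """
<article id="story-1" section="front-page">
  <headline>Example headline</headline>
  <byline>Example Author</byline>
  <body>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </body>
</article>
""".strip()

# HYPOTHETICAL zoning file: one box per story, in page units.
ZONING_XML = """
<page number="1" width="1000" height="1600">
  <zone ref="story-1" x="100" y="200" width="400" height="600"/>
</page>
""".strip()


def read_article(xml_text):
    """Return the tagged parts of a story as plain strings."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "headline": root.findtext("headline"),
        "byline": root.findtext("byline"),
        "paragraphs": [p.text for p in root.iter("p")],
    }


def read_zone(xml_text, story_id):
    """Return (x, y, width, height) of the box that holds the given story."""
    root = ET.fromstring(xml_text)
    for zone in root.iter("zone"):
        if zone.get("ref") == story_id:
            return tuple(float(zone.get(k)) for k in ("x", "y", "width", "height"))
    raise KeyError(story_id)


def zoom_to_fit_width(zone, screen_width):
    """Scale factor that makes the story's box exactly as wide as the screen."""
    _, _, width, _ = zone
    return screen_width / width


article = read_article(ARTICLE_XML)
zone = read_zone(ZONING_XML, article["id"])
print(article["headline"], zone, zoom_to_fit_width(zone, 320))
```

With the text available as strings, the software can reflow the story to any screen; with the box coordinates, it can jump straight to the story on the page image instead of leaving the reader to pan and zoom.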

Unfortunately, no one lives in theory’s wonderland.

In fact, very few newspapers are able to produce usable XML or zoning files. Part of the reason lies in outdated editorial systems that were not designed (nor upgraded) to handle such sophisticated, web-friendly files. IT managers have been slow to embrace the web engineering culture and it didn’t occur to publishers that a “human upgrade” was badly needed deep in the bowels of their company… (This, by the way, leaves another wide open field to internet pure players and their web-savvy tech teams).

This backwardness has created its own ecosystem… in low-wage countries. Every night, all over the world, highly specialized contractors collect the PDF files of hundreds of newspapers and send them to India, Romania or Madagascar. Down there, it takes a few hours to electronically dismantle the image files and to convert them to dynamic XML text files, with proper tagging and zoning. Thanks to the time difference, the converted static newspaper is sent back to the publishers by dawn, ready to be uploaded on an internet platform, right before the physical version hits the streets.

Many will find these shortcomings appalling. For a large part, they are. The good news is the evolution has merely begun. Still, very few publishers realize that upgrading their production chain is a crucial competitive asset. As for the PDF, it remains immensely useful for many applications, but it is no longer suitable for news content that thrives on nomadic uses.

frederic.filloux@mondaynote.com

22 Comments

  1. Henrik Holmegaard, technical writer
    Posted February 7, 2011 at 1:43 pm | Permalink

    Jean-Louis Gassée was director of development at Apple when the developer discussion for and against the Adobe font model and the Adobe document model erupted into the pages of the New York Times and Seybold Publications. Drilling down through the developer discussions, it was clear with mathematical certainty that Adobe’s models would not work even for conventional English composition.
    1. Adobe’s Type 1 font model draws glyph alternates by substituting characters instead of substituting glyphs so that ‘Adobe Offices’ with an ffi ligature is composed by changing U+0066 U+0066 U+0069 into any arbitrary character such as U+0059. Adobe’s OpenType model does the same, only now glyph alternates are drawn by code points in the Private Use Area of ISO-IEC 10646-Universal Standard Character Set. Adobe has no product declaration for its type software, unsurprising since product declaration would make the type impossible to sell.
    2. Adobe’s document models from PostScript level 1 through PostScript level 3 and PDF version 1.0 to PDF version 1.6 are incompatible with the intelligent font model of the Unicode Consortium. These document models encode glyph information and do not encode character information. What actually happens is that the glyph run mapping from the input of character information through the selectors for alternate glyph appearances in the intelligent font through to the output of imageable composition is lost. The intelligent font itself is not embedded intact. Only the glyph data is embedded with PostScript-compatible glyph names, either by subsetting the glyph data into sets of 256 and embedding the sets as several fonts or by tiling the sets together and saving as a single font (so-called CID format).
    3. A design decision shared by Xerox Interpress and Adobe PostScript is that the ISO 646 non-printing control characters for tele-typewriters (TAB, SPACE) are seen as device dependent and not encoded in the document model, where they are replaced by MOVETO commands. This poses the problem that in PostScript and PDF the difference between kerning and tracking in the font model and tabulating and word spacing in the character model is lost. Just as the basic information about what the character was that was represented by any particular glyph has to be inferred from glyph names, so the basic information about whether a move in graphic coordinate geometry represents a word space or a span table space at the level of logical organisation, or a kern or track at the level of layout organisation, is lost and has to be inferred by the PDF consumer.
    According to the New York Times, there was no marketing or management available at Apple and Microsoft who could (or would) sum up the argument for TrueType over Adobe Type 1 and Adobe PostScript. Only lead engineer David Opstad at Apple posted contributions to the developer discussion, beginning in April 1992 in the weeks up to the World Wide Developer conference that introduced the idea of drawing glyph alternates by glyph substitution instead of by character substitution in order to keep the source character string inviolate.
    PostScript cannot encode the intelligent font model of the International Colour Consortium. PDF 1.3 can encode the intelligent font model of the International Colour Consortium, but PDF has prehistoric font machinery. Synthesising the customer’s input of character information from PostScript glyph identifiers doesn’t even work for English.
    Reference:
    King to Holmegaard http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html
    /hh

  2. Henrik Holmegaard, technical writer
    Posted February 7, 2011 at 1:47 pm | Permalink

    For:
    >PostScript cannot encode the intelligent font model of the International Colour Consortium. PDF 1.3 can encode the intelligent font model of the International Colour Consortium, but
    Read:
    PostScript cannot encode the intelligent profile model of the International Colour Consortium. PDF 1.3 can encode the intelligent profile model of the International Colour Consortium, but
    /hh

  3. Posted February 7, 2011 at 5:07 pm | Permalink

    Great article. I love the New York Times, for example, and read it on my iPhone every day. Except that on days when it has taken seconds (!) to download, I have instead gone to other sources.

  4. ralphg
    Posted February 7, 2011 at 5:42 pm | Permalink

    Perhaps you are having PDF problems, because you view the files on an iOS product. On my Android phone, Adobe’s PDF reader has a Reflow option that makes the line width match the screen width.

    As for publishers taking to PDF (and I am a technical publisher), it is really the only format that accurately reflects the typeset page. WYSIWYG.

    My one frustration with PDF is its inability to limit copying and printing of individual .pdf files. I’ve spoken with Adobe twice on this matter, and they have been unable to come up with an internal locking mechanism.

    Thus, for publishers like myself, the greatest failing of PDF is that it enables piracy of our technical documents.

  5. Posted February 7, 2011 at 6:32 pm | Permalink

    Excellent. I had sections in ctndigital.com that railed against zmags and zinio for all these reasons and then some, but removed that and instead crafted an internal paper on “Design for Digital”, primarily for use-case and conversion analysis, and carrying the assertion forward for “Column Glide” reading. NOOK COLOR (just got one, you should too, it’s quite a learning thing) exposes text reads in “ArticleView”, a mode that is nearly identical to our ColumnGlide.

    Perhaps the solution is as you describe above, overflow a column for sliding, and both presentation and user needs fit the device appropriately for 4″, 7″ and 10″ and even GoogleTV readable user experiences.

    We as content creators need the Framework designers out there to act upon what you have crafted here, and do what needs to happen.

  6. Posted February 7, 2011 at 6:49 pm | Permalink

    > There is no survey (to my knowledge) that links visual identity to reader loyalty, to feelings of trust;

    Maybe nothing is meeting your criteria, but a general, fuzzy understanding that good design fosters trust and improves readability is being expanded and clarified with serious research. Here’s just one, but it’s among the most cited, so is a good start:
    http://portal.acm.org/citation.cfm?id=998272

    You might well say that good design can be achieved by any good reader and everyone sticking to standards. But I presume that publishers educated enough to understand this will see no reason it would be better than web browsers. And their content routinely looks awful on one browser or another.

    I also would object to your insistence that PDF works too hard because you can zoom so far and get sharp type. All smartphones (AFAIK) use glyphs, not bitmapped fonts, and can do this with ANY application that allows that much zooming. So can any desktop app. It’s inherent in the OS, regardless of sending the type descriptions over which PDF does.

    We’re also starting to get printing from handsets, display on larger devices (and I mean TV sized, not just tablets). It better be able to support that as well.

  7. Bob Forsberg
    Posted February 7, 2011 at 7:20 pm | Permalink

    Comes across as if you had a glitch with PDF and told yourself you would “show them” by writing a rambling article.

  8. Posted February 7, 2011 at 11:31 pm | Permalink

    I remember in the late 90s or early 2000s, when CSS was gaining traction, people were talking about how HTML files could be rendered for different screen sizes based on browser detection and separate CSS optimized for each type. I believe that Gawker’s mobile page (m.gawker.com) uses an iPhone-specific CSS:

    … But I wonder how much the mobile application store trend, with dedicated news apps and APIs not based on Web standards, undermines not only mobile CSS called from HTML4/XHTML but also HTML5.

  9. pedant
    Posted February 8, 2011 at 4:18 am | Permalink

    RalphG seems to want it both ways.

    PDF is the only format that “reflects the typeset page”.

    Android has a reader which will “reflow” that page – that is, re-lay out the page.

    Which is it?

    The real problem with PDF, from many consumer users’ perspective, is that it has chosen the former over the latter. Reflowable PDF files are really an admission that the document should have been in HTML instead, where the author does *not* solely mandate the layout.

  10. Posted February 8, 2011 at 5:26 am | Permalink

    “PDF is to e-publishing what the steam locomotive is to the high-speed train. “ Okay, whatever…. ;-)

    Very long, but it sounds like you’re saying “the format is wrong” because some uses of that format are not yet the best they can be.

    If so, then couldn’t the same be said of text in weblogs…?

    jd/adobe

  11. Posted February 8, 2011 at 5:38 am | Permalink

    About 15 years ago, Apple bought NeXT Computer, and with it, a total commitment to Display PostScript (which they had helped Adobe invent) and later, to PDF as the base 2D display technology for Mac OS X. At the same time, they abandoned Apple’s own, superior graphics technology, based on Object-Oriented principles, instead of graphics derived from a printer language. The internet has been held back for 15 years because of this decision. Before anyone decides on a new technology to replace PDF, people should look at Apple’s old QuickDraw GX technology, which was explicitly designed to be better than anything Adobe could come up with, bound as it was to the requirements and limitations of existing printers.

  12. Posted February 8, 2011 at 12:32 pm | Permalink

    Nice piece.

    I was one of those who predicted and expected a lot from the “paper-size PDF to devices” conversion during the iPhone/iPad and tablet launches. But when we look back at why publishers did it, the first reason that crosses my mind is that they lacked the time to develop another way.

    Many publishers were just too worried about launching their “online presence” in time for the device release, or before their competitors. It’s again, I’m afraid, a matter of conflict between the short-term vision and the long-term one. I hope they were not just lazy.

    I am one of those who believe that web standards should prevail, since they offer great flexibility on screen devices.

  13. VeeTee
    Posted February 8, 2011 at 1:05 pm | Permalink

    This is a joke, right?

    CSS was NOT designed as a layout language, it’s a simple styling language – the visual experience is poor, it doesn’t support even such basics as columns, ascending selectors, scopes for positions or expressions, the fluid model is a pain – nearly everything must be manually coded, so it costs hours to look at least OK in major browsers. Instead you can make a perfectly nice layout in a few seconds/minutes in a professional layout environment. I can’t imagine someone typesetting hundreds of pages in HTML..

    There is no good layout language for HTML or XML, and that’s part of problem why monetizing is so hard – web things look visually inferior to pdf, there’s no way to “make” them look better, I have tablet PC and reading pdf is SUPERIOR in every single way to reading webpage – I can even make comments in pdf with stylus and send it with comments to anyone else..

    There’s no way any publisher will use HTML and there’s just one right direction – to invent some new content/styling/layout language, which addresses our current needs…

  14. Posted February 8, 2011 at 1:38 pm | Permalink

    PDF is a widespread format because it is a good technology to protect content and/or manage rights attached to documents. Maybe the solution resides in a mix of PDF and web technologies. PDF chunks of content could be embedded into an HTML file. Is there anybody here who knows of a solution that mixes PDF and web?

  15. ralphg
    Posted February 8, 2011 at 6:38 pm | Permalink

    pedant: Of course, I want it both ways. I don’t understand why someone would want an inflexible approach to viewing documents.

    When needed (ie, most of the time), the PDF presents the document precisely as its creator intended. Flip a switch, and the text adapts itself to the device’s screen.

    To me, this sounds like a solution, not a problem. But there is always going to be an Eeyore in every crowd.

  16. Mike
    Posted February 8, 2011 at 6:43 pm | Permalink

    Interesting piece. What do you (and the commenters) think of TeX? That remains one of my favorite writing technologies…

  17. Henrik Holmegaard, technical writer
    Posted February 10, 2011 at 2:31 pm | Permalink

    Saijanai wrote
    > people should look at Apple’s old QuickDraw GX

    Apple introduced an integrated graphics library and document model in 1994. Thus the graphics primitives and commands to manipulate the graphics primitives in the system-level library were a superset of what was used in the document model. This way the problem that Apple QuickDraw and Adobe PostScript were graphics models with different graphics constructs was solved. Apple also introduced TrueType 2 and ColorSync 2 in 1994 as advanced table-based appearance transforms so that in drawing on digital graphic devices it is possible to specify characters and colours at the level of information processing, and apply user-selectable appearances at the level of image presentation. In the presentation image for ‘Adobe Offices’ where ffi is a ligature, Adobe’s approach was to substitute character codes for other character codes (Adobe OYces) whereas Apple’s was to substitute glyph codes for other glyph codes, which protected the level of information processing.

    So why did the Apple Portable Digital Document model not become an overnight success in the Macintosh customer market when it was introduced in September 1994, at the same time as Adobe Acrobat 2.0? Apple PDD did not support search a. because the input of character information could be in several character sets at the same time with no assurance of mapping to ISO10646/Unicode, and b. because glyph codes were considered private and font-dependent so that the glyph codes were not a fallback with which to infer ISO10646/Unicode character codes. Adobe PDF allowed that Adobe on the one hand sold type software that changed the customer’s input of character information and on the other hand could recover the character information from font-independent glyph identifiers.

    Unfortunately, when Adobe began to sell type software in the intelligent font model of the Unicode Consortium that is supposed to draw glyph alternates, not by substituting character codes but by substituting glyph codes, Adobe chose to sell type software where glyph alternates are mapped to code points in the Private Use Area of ISO10646/Unicode – without a product declaration that warns the customer. Again this allows Adobe to claim that PDF supports search, and indeed Adobe does have support for search when overwriting the international standard character set, but the issue is how to get the ISO10646/Unicode character string, the glyph runs that map from the character information through the table-based appearance transforms of the intelligent font model, and the intelligent font model itself into the document model. This is the core of the current commercial conflict between Microsoft XPS (that does have this ability) and Adobe PDF (that does not).

    What the New York Times in 1989 dubbed the Font War was not a war about spline programming languages (Type 1 versus TrueType), but a war about the character model, the font model, and the document model for the change from Print, then Distribute to Distribute, then Print. The war has had cold periods and hot periods, but it has never come to a close.

    /hh

  18. Dan b
    Posted June 12, 2011 at 5:00 am | Permalink

    FlexPaper is probably the best option out there for anyone who wants to publish and host their own documents.

    http://flexpaper.devaldi.com
