Step by step: getting the most out of callas products

Main Features

This chapter does not go into technical details but presents an overview of the main features of pdfChip so that you know what is possible. Use the pdfChip Reference Manual to look up the details of any feature you want to use.

HTML, CSS and Javascript

Because pdfChip is built on top of a WebKit HTML rendering engine, it supports almost all features of HTML and CSS and can take full advantage of Javascript. In general pdfChip follows all HTML rules, so you can include CSS, Javascript, images etc. just as you would while designing a web site. Except for the pdfChip specific features (customised HTML elements and CSS properties) you can even preview your HTML file in a browser or in an HTML design tool.

As with regular HTML for a web site, you can choose to insert your CSS and Javascript in your main HTML file or you can separate them in CSS and Javascript files; pdfChip will process them correctly either way.

To define items not supported by HTML or CSS, pdfChip uses custom HTML elements and custom CSS properties. Their name always begins with cchip, data-cchip or -cchip and they are normally ignored by a browser (the reason there are different prefixes is to keep the HTML and CSS W3C-standard compliant). The tutorial examples provide interesting techniques to use this fact so that the HTML looks one way in a browser but another way when pdfChip converts it into PDF.

Page sizes and page boxes

While HTML normally lives in a browser, PDF is often meant to be printed. This means that the page size is very important and that additional page boxes may have to be defined; a page box is a rectangle that has special meaning and is used in a particular way by professional publishing solutions. pdfChip supports this by using the @page CSS rule set:

@page {
    size: 229mm 317mm;
    margin: 20mm;
    -cchip-trimbox: 10mm 10mm 209mm 297mm;
}

This example defines a trimmed page 20.9 cm wide by 29.7 cm tall, which is very nearly A4 (using the -cchip-trimbox property), and provides an additional 1 cm of white area around that page for information that shouldn't end up on the finished page, such as the name of the document, time and date, color bars or printer marks (using the size property). Check the pdfChip Reference Manual for all page box definitions.

Additionally the margin property is used to provide 2 cm of margin inside the page, keeping the HTML content away from the edge of the page.
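If several page sizes are needed, the arithmetic behind the rule above can be scripted. The following is a minimal sketch, assuming (as the example suggests) that the four -cchip-trimbox values are a horizontal offset, a vertical offset, a width and a height. The function name pageRule and its parameters are hypothetical and not part of pdfChip; check the pdfChip Reference Manual for the exact semantics.

```javascript
// Hypothetical helper, not part of pdfChip: builds an @page rule for a
// trim size surrounded by a uniform extra area on all four sides.
// Assumption: -cchip-trimbox takes x offset, y offset, width, height,
// as the example above suggests.
function pageRule(trimWidthMm, trimHeightMm, extraMm, marginMm) {
    var pageWidth = trimWidthMm + 2 * extraMm;
    var pageHeight = trimHeightMm + 2 * extraMm;
    return [
        "@page {",
        "    size: " + pageWidth + "mm " + pageHeight + "mm;",
        "    margin: " + marginMm + "mm;",
        "    -cchip-trimbox: " + extraMm + "mm " + extraMm + "mm " +
            trimWidthMm + "mm " + trimHeightMm + "mm;",
        "}"
    ].join("\n");
}

// Rebuilds the rule from the example: 209 x 297 mm trim, 10 mm extra area.
console.log(pageRule(209, 297, 10, 20));
```

The generated text could be written into a style element or a separate CSS file before conversion.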

Professional color

HTML and CSS provide a number of properties to define color, such as the color property to set the foreground color and the background-color property to set the background color. These properties accept either predefined names (such as "white" or "yellow") or an RGB color code.

pdfChip provides additional color definitions in order to be able to create PDF documents that will print correctly or that comply with standards. The example below shows a background and foreground color for a paragraph element.

p {
    background-color: -cchip-cmyk(1.0,0.0,0.0,0.0);
    color: -cchip-cmyk( 'Spot Black', 0.0, 0.0, 0.0, 1.0, 0.5 );
}

The background color is defined as CMYK using the -cchip-cmyk value; with the color values provided, the background is set to a pure cyan CMYK color. The foreground color uses an extended form of the same value that causes pdfChip to generate a spot color (or named color) called "Spot Black" and sets the foreground color to 50% of that spot color.

Keep in mind that both color declarations will be completely ignored if you open the HTML in a browser, as the browser will not understand the -cchip-cmyk value. This works only because pdfChip uses a customised processing engine.
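When colors come from data rather than from a hand-written style sheet, the value strings can be assembled in Javascript. The sketch below only builds strings in the form shown in the example above; the function names cmyk and spot are hypothetical, and the reading of the last spot argument as the tint is taken from the description above rather than from the Reference Manual.

```javascript
// Hypothetical helpers, not part of pdfChip: they only build CSS value
// strings in the form used in the example above.
function formatComponent(value) {
    // Rejects NaN as well as out-of-range numbers.
    if (typeof value !== "number" || !(value >= 0 && value <= 1)) {
        throw new RangeError("color components must be numbers between 0 and 1");
    }
    return value.toFixed(2);
}

function cmyk(c, m, y, k) {
    return "-cchip-cmyk(" + [c, m, y, k].map(formatComponent).join(", ") + ")";
}

// Argument order follows the example above: name, four CMYK components,
// then the tint (assumed to be the final value).
// Assumption: the name contains no single quotes.
function spot(name, c, m, y, k, tint) {
    var numbers = [c, m, y, k, tint].map(formatComponent).join(", ");
    return "-cchip-cmyk('" + name + "', " + numbers + ")";
}

console.log("background-color: " + cmyk(1, 0, 0, 0) + ";");
console.log("color: " + spot("Spot Black", 0, 0, 0, 1, 0.5) + ";");
```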

Font support

CSS already allows you to specify high-quality fonts for your web pages. The only problem is that not all browsers support the same types of fonts, which often leads to very convoluted @font-face definitions. Because pdfChip depends only on WebKit, a CSS file for use by pdfChip typically requires just a simple font definition, such as the following:

@font-face {
    font-family: 'FreeUniversal-Regular';
    src: url('../fonts/freeuniversal-regular.ttf');
    font-weight: normal;
    font-style: normal;
}

This specifies the location of the Free Universal font and defines how it can be used in the remainder of the CSS file. The WebKit engine supports most modern fonts (including TrueType and OpenType) and because you only need to make sure it works correctly in pdfChip, testing is much simpler too.

Using PDF and SVG

Of course you can use various types of images, such as PNG and JPEG images in your HTML. pdfChip inserts those images in the generated PDF. If you use an image more than once, it will only be inserted once in the resulting PDF document.

Browsers can include SVG (Scalable Vector Graphics) files just as they can images; the following example includes an SVG image and works correctly in all browsers and in pdfChip.

<img src="./images/penguin.svg"/>

You can also embed the SVG code immediately into your HTML file and this too is supported by browsers and pdfChip alike. The following example inserts a five-pointed star using SVG.

<svg width="100mm" height="100mm">
    <polygon points="100,10 40,198 190,78 10,78 160,198"
             style="fill:lime;stroke:purple;
                    stroke-width:5;
                    fill-rule:nonzero;"/>
</svg>

It's important to note that pdfChip doesn't rasterize the SVG; it isn't converted into an image. Instead it is inserted in the PDF so that there is no quality loss, even if the PDF is afterwards scaled up or printed on a high-resolution output device.

But pdfChip goes further than supporting regular images and SVG files; it also supports the insertion of PDF documents directly. Look at the following HTML fragment:

<img src="./images/callas-logo.pdf#page=2"/>

A regular browser will not display this image, because it doesn't support PDF documents as the source for images. pdfChip does, and for this example will insert the second page (page=2) of the given PDF document into the result PDF.

Again no rasterisation takes place. Even better, the PDF file is taken as is and inserted into the result PDF with as few changes as possible. This means that pdfChip can easily be used for imposition, for example (a process where a large sheet is filled with pages from an input PDF so that it can be printed and afterwards cut and folded into a magazine or newspaper). But even for less advanced workflows, it means that a resolution-independent graphic (a PDF) can be used instead of a plain image.
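Placing many pages of a PDF, as an imposition layout would, means writing many such img elements. A minimal sketch of generating them follows; the helper name pdfPageImages is hypothetical, and the page count has to be supplied by the caller because this sketch does not inspect the PDF itself.

```javascript
// Hypothetical helper: returns one <img> element per page of a PDF,
// using the #page=N suffix shown above (page numbers start at 1).
// Assumption: pdfPath contains no characters that need HTML escaping.
function pdfPageImages(pdfPath, pageCount) {
    var tags = [];
    for (var page = 1; page <= pageCount; page++) {
        tags.push('<img src="' + pdfPath + '#page=' + page + '"/>');
    }
    return tags;
}

console.log(pdfPageImages("./images/callas-logo.pdf", 2).join("\n"));
```

Positioning the resulting elements on the sheet is then ordinary CSS layout work.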

ISO compliant PDF

Over the years the ISO (International Organization for Standardization) has developed a number of important standards around PDF; the two most important ones are:

PDF/X: a standard to allow optimal file exchange in graphic arts workflows and,
PDF/A: a standard to allow long-term (50 years or more) PDF file archival.

pdfChip supports both standards through custom HTML elements. Consider the following example HTML:

<meta property="cchip-pdfx" content="PDF/X-1a">
<link rel="cchip-outputintent" 
      href="./templates/outputintent.pdf"/>

The custom meta element with its property attribute set to "cchip-pdfx" instructs pdfChip that the PDF it outputs should have the correct PDF/X identification tags inserted. The content attribute is set to "PDF/X-1a", which identifies the PDF/X version more precisely as PDF/X-1a, currently the most commonly used version of that standard.

The link element is also important here; PDF/X files need to contain an output intent and the link element points to a PDF document that contains the output intent we want for our resulting file (a template file, if you like). pdfChip will parse the PDF file that is pointed to ("outputintent.pdf" in our example) and copy its output intent into the PDF file it generates.

pdfChip supports more standards; you can find the full list and instructions in the pdfChip Reference Manual. Beware of a potential pitfall however: when pdfChip sees these instructions, it merely inserts the correct standard tags to identify the file it generates as a standards-compliant file. It is still your responsibility to ensure that all content in the generated PDF conforms to that standard!

Inserting custom metadata

Metadata is often very important in document workflows and PDF uses XMP (Extensible Metadata Platform) to carry metadata inside the PDF document. Because metadata is so important, pdfChip has a way to insert it into the resulting PDF document.

<meta property="" content="callas documentation" 
            data-cchip-xmp-ns="http://www.gwg.org/jt/xmlns/"  
            data-cchip-xmp-prefix="gwg-at" 
            data-cchip-xmp-property="Publication" 
            data-cchip-xmp-type="Text">

This custom meta element inserts an XMP tag called "gwg-at:Publication" which is of type "Text" and has the value "callas documentation". The prefix links to a namespace defined as "http://www.gwg.org/jt/xmlns/".

It's important to note that some standards (such as PDF/A) require that every piece of metadata inserted in a PDF document is also clearly defined by a metadata definition and pdfChip will correctly insert that information as well. The pdfChip Reference Manual has more details on the subject.

Support for JavaScript

Perhaps the most powerful aspect of pdfChip and its WebKit foundation is that Javascript is fully supported, and that you can use it just as you would in a browser environment. WebKit really does behave like a browser in almost every aspect and that means you can include Javascript functions to examine and change the HTML DOM (Document Object Model), for example. Thus you can use Javascript to change properties of elements in your HTML file or to insert completely new elements altogether.

You can insert script tags in your HTML file or, just as you may be used to on a web site, you can link to separate script files, whether you have written them yourself or downloaded them from the Internet. In some of the tutorial examples you'll see jQuery used to manipulate HTML elements and insert new elements. You'll see such advanced scripting functionality come back again as we discuss supporting MathML in the following section.

In the tutorial samples jQuery was downloaded and included in the sample's file structure. However, you can also refer to online Javascript; just be careful if the Javascript calls you make are asynchronous. pdfChip provides support functions to make sure this works well during conversion.

Javascript also allows implementing scenarios where a lot of external data (data coming from a database for example) needs to be integrated. While your Javascript functionality will not be able to extract data from the database directly, there are ways to connect to a URL and gather data (using proxy classes written in another scripting language such as PHP to interrogate the database and return the information requested as XML) and there are ways to read, for example, CSV files. Together with the possibility of easily creating as many pages as you want during conversion to PDF, this is ideal for many variable data or transactional printing workflows.
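As an illustration of the CSV route, the following minimal sketch turns CSV text into an array of records keyed by column name. It is a simplified reader written for this example, not a pdfChip function: it assumes the first line holds the column names and that no field contains quoted commas or line breaks. How the CSV text is loaded is left open here.

```javascript
// Minimal CSV reader (illustrative only): the first line holds the column names.
// Assumption: fields contain no quoted commas or embedded line breaks.
function parseCsv(text) {
    var lines = text.split(/\r?\n/).filter(function (line) {
        return line.trim() !== "";
    });
    if (lines.length === 0) {
        return [];
    }
    var headers = lines[0].split(",").map(function (header) {
        return header.trim();
    });
    return lines.slice(1).map(function (line) {
        var cells = line.split(",");
        var record = {};
        headers.forEach(function (header, i) {
            // Missing trailing cells become empty strings.
            record[header] = (cells[i] || "").trim();
        });
        return record;
    });
}

var records = parseCsv("name,city\nAda,London\nGrace,New York\n");
console.log(records.length + " records, first name: " + records[0].name);
```

Each record can then drive one modification of the HTML DOM before a page is generated.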

Beautiful formulas with MathML

In some workflows it is important to be able to include nicely formatted mathematical formulas in the generated PDF document (think of textbooks, for example). HTML can define formulas using MathML. The following is a MathML representation of probably the most famous formula of all time, thanks to Albert Einstein:

<math xmlns="http://www.w3.org/1998/Math/MathML">
    <mrow>
        <mi>E</mi>
        <mo>=</mo>
            <mrow mathcolor='#cc0000'>
                <mi>m</mi>
                <mo>⁢</mo>
                <msup><mi>c</mi><mn>2</mn></msup>
            </mrow>
    </mrow>
</math>

Converting this MathML into a beautiful formula can be done in a number of different ways; the tutorial shows how to use the MathJax Javascript library to accomplish this.

Inserting barcodes

Barcodes have become almost omnipresent on printed material and the variety of barcodes used is staggering. Annoyingly, barcodes are not supported in HTML; there are work-arounds through the use of barcode fonts, but these sometimes lack quality and are limited in the types of barcode they can represent. There is no good solution for 2D barcodes, QR codes being just one example.

pdfChip itself does support barcodes, through the use of the barcode generator TBarCode from TEC-IT Datenverarbeitung GmbH (www.tec-it.com). Just about any barcode you can think of is supported by inserting a custom object in the HTML file like this:

<object class="barcode" type="application/barcode" 
        style="width:30mm; height:30mm;">
    <param name="type" value="QR-Code">
    <param name="data" value="http://www.callassoftware.com">
</object>

To use this functionality, you must have an object element in your HTML file and its type must be set to "application/barcode". The different param nodes of this object then provide the necessary input for the barcode generator, most importantly the type of barcode you want to insert and its value. pdfChip converts the above example into a QR code that links to the callas software web site.

The pdfChip Reference Manual provides full information on all of the supported barcode types and what their parameters should be. It is very important to stress however that pdfChip does no barcode validation, so the parameters you specify should be correct and suitable for the type of barcode you want. If not, pdfChip will return an error or create an incorrect barcode.
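If barcodes are created from data, the object markup can be generated rather than typed. The sketch below reproduces the structure of the example above; the function names are hypothetical, only the "type" and "data" parameters from that example are used, and because pdfChip does not validate barcodes the values passed in must already be correct for the chosen barcode type.

```javascript
// Escapes a value for use inside a double-quoted HTML attribute.
function escapeAttribute(value) {
    return String(value)
        .replace(/&/g, "&amp;")
        .replace(/"/g, "&quot;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Hypothetical helper: builds a square barcode object like the one above.
function barcodeObject(type, data, sizeMm) {
    return [
        '<object class="barcode" type="application/barcode"',
        '        style="width:' + sizeMm + 'mm; height:' + sizeMm + 'mm;">',
        '    <param name="type" value="' + escapeAttribute(type) + '">',
        '    <param name="data" value="' + escapeAttribute(data) + '">',
        '</object>'
    ].join("\n");
}

console.log(barcodeObject("QR-Code", "http://www.callassoftware.com", 30));
```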

Generating multiple pages

How can you generate multiple 'copies' of your HTML content? If you have a business card layout in HTML, or a form letter... how can you generate a PDF file with thousands of pages, where each page has been tweaked (for example to change names, or addresses or background images or...)?

pdfChip supports this through the use of a predefined Javascript function called cchipPrintLoop(). If you define this function in your HTML file or in one of the Javascript files included in your HTML file, it will be called automatically by pdfChip. In it you can set up a loop that modifies the HTML DOM (replacing placeholder elements with data you load from a CSV file perhaps) and then calls the cchip.printPages function. This is a member function of the cchip object; it outputs your HTML file in the state it is in at that moment and inserts the generated PDF into the output PDF. You can call cchip.printPages multiple times and each time the generated PDF pages will be added to your output. A simple example could look like this:

function cchipPrintLoop() {
    for (var i=0; i < 10; i++) {
        /* Modify HTML DOM here */
        cchip.printPages();
    }
}

In this example, the HTML DOM isn't actually modified (there's just the comment explaining where you could do this) so the output PDF will consist of 10 identical copies, all concatenated together into your output PDF document. The tutorial contains a few examples of more complex setups where you can see how this could be used to create variable data type documents for example.

Note that the generated PDF in this example isn't necessarily 10 pages long! If you have an HTML file which converts into a multiple-page document, you'll get 10 multiple-page PDF files concatenated together. So if your HTML generates a two-page letter, the resulting PDF from the above print loop function will be 10 times 2 pages, or 20 pages.
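To show where the DOM modification would go, here is a sketch of a print loop driven by a list of records. Only cchipPrintLoop and cchip.printPages come from the description above; the helper printRecords, the element ID "recipient" and the sample records are hypothetical, and the loop logic is kept in a separate function so it can be tried outside pdfChip.

```javascript
// Hypothetical helper: applies each record to the document, then prints.
// applyRecord changes the DOM; printPages is expected to call
// cchip.printPages when running inside pdfChip.
function printRecords(records, applyRecord, printPages) {
    var passes = 0;
    for (var i = 0; i < records.length; i++) {
        applyRecord(records[i], i);
        printPages();
        passes++;
    }
    return passes;
}

// Called automatically by pdfChip, as described above.
function cchipPrintLoop() {
    // Sample data; in practice this might come from a CSV file.
    var records = [{ name: "Ada" }, { name: "Grace" }];
    printRecords(records, function (record) {
        // Assumption: the HTML contains an element with the ID "recipient".
        document.getElementById("recipient").textContent = record.name;
    }, function () {
        cchip.printPages();
    });
}
```

With a two-page letter and two records, this would append 2 times 2 pages, or 4 pages, to the output.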

Advanced pagination

Unlike the previous section, advanced pagination comes into play not when you want multiple copies of the same document, but when you have a long document which paginates into multiple pages. Think of a book, for example: very long HTML that generates a PDF file with potentially hundreds of pages.

The problem with such files is how to add features such as running headers or page numbers, and pdfChip has special support for such environments through something called overlays and underlays. How does this work?

The problem with pagination

The problem with pagination is that you cannot place page numbers in your original HTML file for example, because you do not yet know how the content will be paginated. And it's hard to predict (and guessing is never a good strategy) where an advanced layout engine such as WebKit will break content into pages.

What you need to overcome this is a sort of two-stage process, where your HTML file is first divided into pages and you then get the opportunity to add additional content to your document. That is exactly what pdfChip allows; it actually even has a three-stage process.

Multiple processing steps

In the first chapter of this book, the command-line for pdfChip was introduced as:

pdfChip <Path to HTML file> <Path to PDF file>

This command-line provides the simple one-stage conversion process that is also used in most tutorial examples. But the command-line allows additional arguments like this:

pdfChip <Path to HTML file> 
        --underlay=<Path to underlay HTML file>
        --overlay=<Path to overlay HTML file>
        <Path to PDF file>

We still start with the main HTML file. This is the HTML that contains the content we want to convert into a PDF file. But this is followed by an --underlay and/or --overlay argument (both are completely optional). If one of these arguments is present, pdfChip does a second and/or third processing step.

First the main HTML file is converted into PDF; once this is done, the pagination is known. The HTML has been laid out by the WebKit layout engine, so it is now known exactly how the document is divided into PDF pages. The additional passes for the underlay or overlay can use this information to their advantage. When all conversions are done, the underlay PDF document is inserted into the output PDF document; all of its content is inserted underneath the content that is already there (hence the name underlay). The same happens with the PDF generated by the conversion of the overlay HTML, but this content is added on top of the output PDF.
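When pdfChip is started from a script, the argument list can be assembled from the pieces described above. This is a sketch only: the helper pdfChipArgs and the file names are hypothetical, and it assumes nothing beyond the argument order shown in the command line above.

```javascript
// Hypothetical helper: builds the argument list in the order shown above.
// underlay and overlay are optional and are left out when not given.
function pdfChipArgs(options) {
    var args = [options.html];
    if (options.underlay) {
        args.push("--underlay=" + options.underlay);
    }
    if (options.overlay) {
        args.push("--overlay=" + options.overlay);
    }
    args.push(options.pdf);
    return args;
}

// File names are placeholders for this example.
console.log(pdfChipArgs({
    html: "book.html",
    overlay: "pagenumbers.html",
    pdf: "book.pdf"
}).join(" "));
```

In a Node.js script, for instance, such a list could be handed to child_process.execFile together with the path to the pdfChip executable.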

The cchip object

During the first pass, pdfChip stores a lot of information about the document in the cchip object, and the print loop of the underlay or overlay HTML file can use this (we already mentioned the cchip object when introducing multi-page PDF generation earlier). Consider this simple example of an overlay print loop:

function cchipPrintLoop() {
    for (var i=0; i < cchip.pages.length; i++) {
        $('#overlay-pagelabel p:first').text("page " + (i+1));
        cchip.printPages();
    }
}

Our overlay HTML is a very simple one-page file for this example. The print loop queries the cchip object to figure out how many pages resulted from paginating the main HTML file. Then it generates the same number of pages, but each time there is a jQuery expression to change the page number (an object in the overlay HTML identified by the ID "overlay-pagelabel") to the correct value. The result is a paginated file that gets the page numbers neatly added in the second pass pdfChip makes.
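Because the total page count is known during the overlay pass, the label can also show it. The variation below is a sketch under the same assumptions as the example above (jQuery is loaded and the overlay HTML contains the "overlay-pagelabel" element); the helper pageLabel is hypothetical.

```javascript
// Hypothetical helper: formats a label from a zero-based index and a total.
function pageLabel(index, total) {
    return "page " + (index + 1) + " of " + total;
}

// Overlay print loop: one overlay page per page of the main document.
function cchipPrintLoop() {
    var total = cchip.pages.length;
    for (var i = 0; i < total; i++) {
        $('#overlay-pagelabel p:first').text(pageLabel(i, total));
        cchip.printPages();
    }
}
```

Keeping the formatting in its own function makes it easy to change the wording without touching the loop.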

Limitations

While pdfChip is very similar to a browser and while WebKit gives it a lot of flexibility and power, there are still a few limitations you should keep in mind.

Columns

The CSS properties to generate multiple columns are not supported by pdfChip. Basically pdfChip behaves like the printable version of such content which normally always has one column. Specifically this means that you should not rely on the column-count, column-gap and column-rule properties.

There is a potential work-around through the CSS regions concept, even though this is not an integral part of the CSS standard yet. But WebKit supports it and it is a very powerful layout technology.

Canvas

The HTML5 canvas is an HTML element that allows drawing graphics on the fly somewhere in an HTML page. It's a powerful technique but you should not, or only after lots of testing, use it in combination with pdfChip. The reason mainly has to do with how the canvas is converted into PDF and most of the time that will be through rasterisation. This means you end up with a PDF document that contains a rasterised version of your canvas content which is typically not what you want.

In most cases, consider SVG instead: it is a more powerful technique for including arbitrary drawings in your HTML file, and the drawing remains vector content when converted to PDF.