Converting HTML to PDF using iText

I'm posting this question because many developers ask more or less the same question in different forms. I will answer it myself (I'm the founder / CTO of iText Group), so that it can serve as a "Wiki answer". If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.

The source file:

I am trying to convert the following HTML file to PDF:

<html>
    <head>
        <title>Colossal (movie)</title>
        <style>
            .poster { width: 120px;float: right; }
            .director { font-style: italic; }
            .description { font-family: serif; }
            .imdb { font-size: 0.8em; }
            a { color: red; }
        </style>
    </head>
    <body>
        <img src="img/colossal.jpg" class="poster" />
        <h3>Colossal (2016)</h3>
        <div class="director">Directed by Nacho Vigalondo</div>
        <div class="description">Gloria is an out-of-work party girl
            forced to leave her life in New York City, and move back home.
            When reports surface that a giant creature is destroying Seoul,
            she gradually comes to the realization that she is somehow connected
            to this phenomenon.
        </div>
        <div class="imdb">Read more about this movie on
            <a href="www.imdb.com/title/tt4680182">IMDB</a>
        </div>
    </body>
</html>

In a browser, this HTML looks like this:

The problems I ran into:

HTMLWorker doesn't take CSS into account at all

When I used HTMLWorker, I needed to create an ImageProvider to avoid an error telling me that the image couldn't be found. I also needed to create a StyleSheet instance to change some of the styles:

public static class MyImageFactory implements ImageProvider {
    public Image getImage(String src, Map<String, String> h,
            ChainedProperties cprops, DocListener doc) {
        try {
            return Image.getInstance(
                String.format("resources/html/img/%s",
                    src.substring(src.lastIndexOf("/") + 1)));
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }    
}

public static void main(String[] args) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
    document.open();
    StyleSheet styles = new StyleSheet();   
    styles.loadStyle("imdb", "size", "-3");
    HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
    HashMap<String,Object> providers = new HashMap<String, Object>();
    providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
    htmlWorker.setProviders(providers);
    htmlWorker.parse(new FileReader("resources/html/sample.html"));
    document.close();   
}

The result looks like this:

For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all; I have to define all the styles in my code, using the StyleSheet object.

When I look at my code, I see that plenty of objects and methods I'm using are deprecated:

So I decided to upgrade to using XML Worker.

Images aren't found when using XML Worker

I tried the following code:

public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";
public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(HTML));
    document.close();
}

This resulted in the following PDF:

Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.

With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
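Why the image goes missing can be illustrated without iText: the src value img/colossal.jpg is relative to the HTML file, whereas a plain file lookup resolves it against the working directory of the Java process. The following is only a sketch of that mismatch; the names ImagePathSketch and resolveAgainstHtml are hypothetical and not part of any iText API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ImagePathSketch {
    // Resolves an <img src> value against the directory that contains the HTML file.
    static Path resolveAgainstHtml(String htmlFile, String src) {
        Path htmlDir = Paths.get(htmlFile).toAbsolutePath().getParent();
        return htmlDir.resolve(src).normalize();
    }

    public static void main(String[] args) {
        String src = "img/colossal.jpg";
        // What a lookup relative to the working directory would try:
        System.out.println(Paths.get(src).toAbsolutePath());
        // Where the file actually lives, next to sample.html:
        System.out.println(resolveAgainstHtml("resources/html/sample.html", src));
    }
}
```

The two printed paths differ by the resources/html/ prefix, which is exactly the information an image provider has to supply.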

Not all CSS styles are supported in XML Worker

I adapted my code like this:

public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";
public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();

    CSSResolver cssResolver =
            XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return IMG_PATH;
        }
    });

    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream(HTML));

    document.close();
}

My code is much longer, but now the image is rendered:

The image is larger than when I rendered it using HTMLWorker, which tells me that the CSS property width for the poster class is taken into account, but the float property is ignored. How do I fix this?

The remaining question:

So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?

Additional question:

When my HTML contains form elements (such as <input>), those form elements are ignored.

Answer 1

Why your code doesn't work

As explained in the introduction of the HTML to PDF tutorial, HTMLWorker was deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.

Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
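Because XML Worker is an XML parser at heart, input that is not well-formed XHTML (for instance an <img> tag that is never closed) typically fails to parse. One way to catch this early is to run the markup through the JDK's own XML parser first. This is a rough pre-check under that assumption, not XML Worker's actual parsing logic; the class name XhtmlCheck is made up for illustration, and the JDK parser may be stricter than XML Worker in some respects (for example with HTML entities such as &nbsp;).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class XhtmlCheck {
    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.newDocumentBuilder()
                   .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // SAXException for malformed markup; other exceptions are also treated as "not usable"
            return false;
        }
    }

    public static void main(String[] args) {
        // Self-closed <img />, as in the sample file from the question
        System.out.println(isWellFormed(
            "<html><body><img src=\"img/colossal.jpg\" class=\"poster\" /></body></html>"));
        // Unclosed <img>, fine for browsers but not well-formed XML
        System.out.println(isWellFormed(
            "<html><body><img src=\"img/colossal.jpg\"></body></html>"));
    }
}
```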

Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons; the most important one in this context is pdfHTML.

How to solve the problem

Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0), the code to convert the HTML from the question to PDF is reduced to this snippet:

public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";
public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

The result looks like this:

As you can see, this is pretty much the result you'd expect. As of iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is respected: the image now floats on the right.

Some additional thoughts.

Developers often push back against upgrading to a newer iText version when I advise them to move to iText 7 / pdfHTML 2. Allow me to answer the top 3 arguments I hear:

I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.

iText 7 is released under the AGPL, just like iText 5 and XML Worker. The AGPL allows free use, in the sense of free of charge, in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; it is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all. Regarding pdfHTML: the first versions were indeed only available as closed source software. We had a heated discussion within iText Group. On the one hand, there were people who wanted to avoid massive abuse by companies that don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were people who argued that we shouldn't punish developers for the wrong behavior of their bosses. In the end, the people in favor of open sourcing pdfHTML, that is, the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.

I need to maintain a legacy system, and I have to use an old iText version.

Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code required when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.

I've only just started and I didn't know about iText 7; I only found out after I finished my project.

That's why I'm posting this question and answer. Think of yourself as an eXtreme programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof, because iText 5 is being phased out. We still offer support to paying customers, but eventually we'll stop supporting iText 5 altogether.

Answer 2

Use iText 7 and this code:

public void generatePDF(String htmlString) {
    // Destination file; dirPath is assumed to be a field of the enclosing class
    File dest = new File(dirPath + "/USER-16-PF-Report.pdf");
    try (FileOutputStream fileOutputStream = new FileOutputStream(dest)) {
        PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
        ConverterProperties converterProperties = new ConverterProperties();
        PdfDocument pdfDocument = new PdfDocument(pdfWriter);

        // Set the page size before converting
        pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));

        // The String overload expects the HTML markup itself, not a file path
        Document document = HtmlConverter.convertToDocument(htmlString, pdfDocument, converterProperties);
        // Closing the document also finishes and closes the underlying PDF
        document.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
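For reference, page sizes in PDF are expressed in user units, which by default are points of 1/72 inch. A small stand-alone sketch shows where the dimensions of an ISO A3 page, 297 mm by 420 mm, come from. The PageUnits class and the mmToPoints method are hypothetical helpers for illustration, not iText API.

```java
class PageUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    // Converts a length in millimetres to PDF points (1/72 inch).
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // ISO A3 is 297 mm x 420 mm
        System.out.printf("A3: %.1f x %.1f pt%n", mmToPoints(297), mmToPoints(420));
    }
}
```

The conversion gives roughly 841.9 by 1190.6 points; libraries usually store such sizes rounded to whole numbers.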

Answer 3

Convert a static HTML page with any CSS styling:

HtmlConverter.convertToPdf(new File("./pdf-input.html"), new File("demo-html.pdf"));

For Spring Boot users: convert a dynamic HTML page using Spring Boot and Thymeleaf:

@RequestMapping(path = "/pdf")
public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
    /* Do business logic */
    Order order = OrderHelper.getOrder();

    /* Create HTML using the Thymeleaf template engine */
    WebContext context = new WebContext(request, response, servletContext);
    context.setVariable("orderEntry", order);
    String orderHtml = templateEngine.process("order", context);

    /* Set up the target stream and the converter properties */
    ByteArrayOutputStream target = new ByteArrayOutputStream();
    ConverterProperties converterProperties = new ConverterProperties();
    converterProperties.setBaseUri("http://localhost:8080");

    /* Call the convert method */
    HtmlConverter.convertToPdf(orderHtml, target, converterProperties);

    /* Extract the output as bytes */
    byte[] bytes = target.toByteArray();

    /* Send the response as a downloadable PDF */
    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
            .contentType(MediaType.APPLICATION_PDF)
            .body(bytes);
}
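The base URI set on ConverterProperties matters because the HTML is passed in as a string, so relative references such as img/colossal.jpg have no location of their own. How such references resolve against a base can be sketched with java.net.URI alone. This shows the generic URL resolution rules, not pdfHTML's internal resolver, and the BaseUriSketch class is hypothetical. The sketch uses a base with a trailing slash so that relative paths are appended to the root path.

```java
import java.net.URI;

class BaseUriSketch {
    // Resolves a reference found in the HTML (src, href) against the base URI.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/";
        // Relative path: appended to the base
        System.out.println(resolve(base, "img/colossal.jpg"));
        // Root-relative path: replaces the path of the base
        System.out.println(resolve(base, "/css/order.css"));
        // Absolute URL: the base is ignored
        System.out.println(resolve(base, "https://example.com/logo.png"));
    }
}
```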

2023-03-19 13:47 java