POI 4.1.2: Converting Word to HTML (Keeping Styles and Images)

2021-11-20 22:30:27

Tags: word String html POI new options important

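I needed Word-to-HTML conversion for a task, but POI 3.1.2 conflicted with the POI 4.1.2 already in my project, so I tried writing the conversion against 4.1.2. POI supports both doc and docx and works very well. This article covers docx to HTML; doc is comparatively simple, so try it yourself if you are interested.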

For now, this article covers docx to HTML only!

Development tool: IDEA
Build tool: Maven
Enough talk, straight to the code.

1. First, configure pom.xml as follows

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
    <version>2.0.2</version>
</dependency>

The poi and poi-ooxml versions must be the same.

2. Writing the utility class

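    /**
     * Converts a Word (.docx) file to HTML.
     *
     * @param wordFile the Word file to convert
     * @return a map holding the HTML URL under "htmlPath" plus one entry per image
     *         (file name to URL), or null if the file could not be converted
     */
    public static Map<String, String> docsToHtml(File wordFile) {
        log.info("word to html beginning....");
        // Name of the generated HTML file. ColumnEnum.FileColumn.HTML is my own constant, i.e. ".html"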
        String htmlName = RandomNum.getRandomString(8) +
                System.currentTimeMillis() + ColumnEnum.FileColumn.HTML;
        int dateTime = LocalDateUtil.getCurrentDateValue();
        Map<String, String> fileCollect = new HashMap<>(0);
        try {
            // Only .docx input is supported; check before opening any stream
            if (!wordFile.getName().endsWith(ColumnEnum.FileColumn.DOCX)) {
                return null;
            }
            @Cleanup InputStream in = new FileInputStream(wordFile);
            // Temporary directory: the drive root taken from user.dir (e.g. "C:") plus /logback/template/
            String templatePath = System.getProperty("user.dir").substring(0, System.getProperty("user.dir").indexOf(":") + 1) + "/logback/template/";
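            // 1. Load the docx into an XWPFDocument
            XWPFDocument document = new XWPFDocument(in);
            // 2. Create the XHTML conversion options
            XHTMLOptions options = XHTMLOptions.create();
            // 3. Store the images from the Word file on the cloud server; this needs attention (see the notes after the code)
            options.setImageManager(new ImageManagerImpl());
            options.setIgnoreStylesIfUnused(false);
            options.setFragment(true);
            // Open the output file; the <head> is written to it first
            @Cleanup OutputStream out = new FileOutputStream(templatePath + htmlName);
            // Pay attention to this part; it is explained later in the article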
            out.write(("<head>" +
                    "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>" +
                    "<meta name=\"viewport\" content=\"initial-scale=1.0, maximum-scale=1, minimum-scale=1, user-scalable=no,uc-fitscreen=yes\" />" +
                    "<style>body {margin: 0; font-family: \"Noto Sans SC\";}" +
                    "img {width: 100% !important;height: auto !important;} " +
                    // Adding "font-size: 40px;" here would force a uniform font size
                    "span{overflow-wrap: break-word}" +
                    "div{  width: auto !important; margin-left: 10% !important;  margin-right: 10% !important;}" +
                    "</style>" +
                    "</head>").getBytes(StandardCharsets.UTF_8));
            // Convert the XWPFDocument to XHTML
            XHTMLConverter.getInstance().convert(document, out, options);
            File htmlFile = new File(templatePath + htmlName);
            // Collect the image URLs referenced by the generated HTML
            getImagePath(htmlFile, fileCollect);
            // Upload the HTML file to OSS
            OssUtil.upload(htmlFile, AliConfig.PATH + dateTime + "/" + htmlName);
            // Mark the temporary file for deletion (deleteOnExit() removes it when the JVM exits)
            htmlFile.deleteOnExit();
            log.info("word to html completed....");
        } catch (IOException e) {
            log.error("word to html has some problems", e);
            // Conversion or upload failed, so do not return a URL for it
            return null;
        }
        // Record the URL under which the HTML was stored
        fileCollect.put("htmlPath", AliConfig.OSS_LINK + AliConfig.PATH + dateTime + "/" + htmlName);
        // fileCollect now holds the HTML URL and the image URLs; use whichever you need
        return fileCollect;
    }

    /**
     * Collects the image URLs referenced in the HTML file.
     *
     * @param htmlFile  the generated HTML file
     * @param resultMap receives one entry per image: file name to src URL
     */
    private static void getImagePath(File htmlFile, Map<String, String> resultMap) {
        // Matches the src attribute of every <img> tag; compiled once, not once per line
        Pattern pattern = Pattern.compile("<img\\b[^<>]*?\\bsrc\\s*=[\\s'\"]*([^\\s'\"]*)");
        try {
            // Read the HTML file line by line
            List<String> lines = FileUtils.readLines(htmlFile, StandardCharsets.UTF_8);
            for (String line : lines) {
                if (StringUtils.isEmpty(line)) {
                    continue;
                }
                Matcher matcher = pattern.matcher(line);
                while (matcher.find()) {
                    String src = matcher.group(1);
                    String fileName = src.substring(src.lastIndexOf("/") + 1);
                    resultMap.put(fileName, src);
                }
            }
        } catch (IOException e) {
            log.error("get image path has error", e);
        }
    }
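
The temporary-file handling in docsToHtml() can also be done with java.nio. The following is only a minimal sketch, assuming the system temp directory is an acceptable replacement for /logback/template/. TempHtmlFile, withTempHtml and FileConsumer are hypothetical names, and the upload is represented by a callback instead of OssUtil. It closes the stream before the file is read and deletes the file right away instead of waiting for the JVM to exit.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempHtmlFile {

    private TempHtmlFile() {
    }

    /** Hypothetical callback that receives the finished file, e.g. to upload it. */
    interface FileConsumer {
        void accept(Path file) throws IOException;
    }

    /**
     * Writes the HTML to a temporary file, hands the closed file to the consumer,
     * and deletes the file afterwards even if the consumer throws.
     */
    static void withTempHtml(String html, FileConsumer consumer) throws IOException {
        Path tmp = Files.createTempFile("word2html-", ".html");
        try {
            // Close the stream before anyone reads the file, so the content is complete
            try (OutputStream out = Files.newOutputStream(tmp)) {
                out.write(html.getBytes(StandardCharsets.UTF_8));
            }
            consumer.accept(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

In the real method the consumer would be the place to call getImagePath() and the OSS upload.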

The first thing to note is these two lines:
XHTMLOptions options = XHTMLOptions.create();
options.setImageManager(new ImageManagerImpl());

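options.setImageManager() sets where the images embedded in the Word file are stored. The generated HTML, including its images, has to be reachable from anywhere, so I defined my own class that extends ImageManager and decide the image storage location there, as the following code shows.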

public class ImageManagerImpl extends ImageManager {

    private byte[] picture;
    private String suffix;

    public ImageManagerImpl() {
        super(new File(""), "");
    }


    @Override
    public void extract(String imagePath, byte[] imageData) throws IOException {
        // Keep the extension (text from the last dot) and the bytes until resolve() is called
        int dot = imagePath.lastIndexOf('.');
        this.suffix = dot >= 0 ? imagePath.substring(dot) : "";
        this.picture = imageData;
    }

    @Override
    public String resolve(String uri) {
        // Target path of the image on OSS
        String path = AliConfig.PATH + LocalDateUtil.getCurrentDateValue() + "/" +
                RandomNum.getRandomString(6) + System.currentTimeMillis() + suffix;
        OssUtil.upload(path, picture);
        // Return the stored URL; it replaces the src attribute in the generated HTML
        return AliConfig.OSS_LINK + path;
    }
}
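
The suffix handling in extract() is easy to get subtly wrong, so here it is in isolation. This is a minimal sketch, assuming the converter passes paths that look like word/media/image1.png. ImageSuffix and suffixOf are hypothetical names, not part of POI or xdocreport. Taking the text after the last dot keeps working when the file name contains more than one dot, where splitting on "." and taking element 1 would not.

```java
final class ImageSuffix {

    private ImageSuffix() {
    }

    /**
     * Returns the extension of the last path segment including the dot,
     * or an empty string when that segment has no extension.
     */
    static String suffixOf(String imagePath) {
        int slash = Math.max(imagePath.lastIndexOf('/'), imagePath.lastIndexOf('\\'));
        int dot = imagePath.lastIndexOf('.');
        // A dot that sits before the last separator belongs to a directory name
        return dot > slash ? imagePath.substring(dot) : "";
    }

    public static void main(String[] args) {
        System.out.println(suffixOf("word/media/image1.png"));
        System.out.println(suffixOf("word/media/chart.v2.jpeg"));
    }
}
```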

The interesting part is this:

out.write(("<head>" +
        "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>" +
        "<meta name=\"viewport\" content=\"initial-scale=1.0, maximum-scale=1, minimum-scale=1, user-scalable=no,uc-fitscreen=yes\" />" +
        "<style>body {margin: 0; font-family: \"Noto Sans SC\";}" +
        // Image size
        "img {width: 100% !important;height: auto !important;} " +
        "span{overflow-wrap: break-word}" +
        // Margins
        "div{ width: auto !important; margin-left: 10% !important; margin-right: 10% !important;}" +
        "</style>" +
        "</head>").getBytes(StandardCharsets.UTF_8));

I want the generated HTML to adapt to both phones and desktop browsers, so I write this block to the output before the conversion runs.

Getting the image paths

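Because the images are uploaded to the cloud server, deleting an HTML file later means I also have to find its image URLs and delete those images. That is why I wrote getImagePath(). I don't know whether there is a more convenient way; if you know one, send me a message.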
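
The matching done by getImagePath() can be tried on a plain string without any files. This is a minimal sketch with hypothetical names (ImgSrcExtractor, extract) and a slightly tidied pattern; it only assumes the generated tags carry an ordinary quoted src attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcExtractor {

    // Same idea as the pattern in getImagePath(): capture the src value of each <img> tag
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^<>]*?\\bsrc\\s*=\\s*['\"]?([^\\s'\"<>]+)");

    private ImgSrcExtractor() {
    }

    /** Maps each image file name to its full src URL, in document order. */
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            String src = matcher.group(1);
            // The file name is whatever follows the last slash of the URL
            result.put(src.substring(src.lastIndexOf('/') + 1), src);
        }
        return result;
    }
}
```

Another option, which I have not tried here, would be to let ImageManagerImpl remember every URL it returns from resolve(), so the HTML never has to be parsed again.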

In closing:

This code certainly has shortcomings. Corrections and suggestions are very welcome. Peace, everybody!

Source: https://blog.csdn.net/LiberiG/article/details/121446341
