文件提取利器-Apache Tika

Description: The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
功能描述:Apache Tika 可以发现和提取各种各样类型文件的元数据(meatadata)和文件内容,例如:PPT,XLS 和PDF等等。

Using Tika as a Maven dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
  </dependency>

If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you’ll want to depend on tika-parsers instead:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
  </dependency>

使用案例:利用Tika获取文件的Content-Type

String[] files = new String[]{
        "/Users/lisanlai/Desktop/file.docx",
        "/Users/lisanlai/Desktop/file.xlsx",
        "/Users/lisanlai/Desktop/file.pptx",
        "/Users/lisanlai/Desktop/file.md",
        "/Users/lisanlai/Desktop/file.txt",
        "/Users/lisanlai/Desktop/Test.java"
};

for (int i = 0; i < files.length; i++) {
    TikaConfig tikaConfig = TikaConfig.getDefaultConfig();;
    Metadata metadata = new Metadata();

    MimeTypes mimeRegistry = tikaConfig.getMimeRepository();
    String filename = files[i];
    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
    System.out.println(i+ " = ["
            + mimeRegistry.detect(null, metadata) + "]");
}

System.out.println("=========================================");


for (int i = 0; i < files.length; i++) {
    TikaConfig tikaConfig = TikaConfig.getDefaultConfig();;
    Metadata metadata = new Metadata();
    String filename = files[i];
    Path path = FileSystems.getDefault().getPath(filename);
    InputStream stream = TikaInputStream.get(path);
    Detector detector = tikaConfig.getDetector();
    MediaType mediaType = detector.detect(stream, metadata);
    System.out.println(i+ " = ["
            + detector.detect(stream, metadata) + "]" + " ==== " + mediaType.getSubtype());
}

打印结果如下:

0 = [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
1 = [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
2 = [application/vnd.openxmlformats-officedocument.presentationml.presentation]
3 = [text/x-web-markdown]
4 = [text/plain]
5 = [text/x-java-source]
=========================================
0 = [application/vnd.openxmlformats-officedocument.wordprocessingml.document] ==== vnd.openxmlformats-officedocument.wordprocessingml.document
1 = [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet] ==== vnd.openxmlformats-officedocument.spreadsheetml.sheet
2 = [application/vnd.openxmlformats-officedocument.presentationml.presentation] ==== vnd.openxmlformats-officedocument.presentationml.presentation
3 = [text/plain] ==== plain
4 = [text/plain] ==== plain
5 = [text/plain] ==== plain

热评文章