Java Program to Extract Content from an ODF File
Original: https://www.geeksforgeeks.org/java-program-to-extract-content-from-a-odf-file/
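ODF stands for Open Document Format. It is a family of international standards that succeeds common, now-deprecated vendor-specific document formats such as .doc, .wpd, and .xls. ODF documents tend to be smaller than their equivalents in those formats, because an ODF file is a ZIP-compressed package of XML parts. To extract content from an ODF file, use the OpenDocumentParser class from the Apache Tika library.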
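To make the packaging concrete, here is a minimal sketch that uses only JDK classes to list the entries of a ZIP-based package and read its `mimetype` entry. The class `OdfPackageSketch` and its methods are hypothetical helpers, not part of Tika, and the in-memory sample is a stand-in rather than a valid ODF document (a real file stores `mimetype` first and uncompressed, and has more parts).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Tika.
final class OdfPackageSketch {

    // Returns the entry names of a ZIP package, in the order they are stored.
    static List<String> listEntries(byte[] pkg) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    // Returns the UTF-8 text of the named entry, or null if it is absent.
    static String readEntry(byte[] pkg, String name) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.getName().equals(name)) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                for (int n = zip.read(chunk); n != -1; n = zip.read(chunk)) {
                    out.write(chunk, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return null;
    }

    // Builds a tiny stand-in package in memory so the sketch runs without a real .odt file.
    static byte[] buildSamplePackage() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            put(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            put(zip, "content.xml", "<placeholder/>");
            put(zip, "meta.xml", "<placeholder/>");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    private static void put(ZipOutputStream zip, String name, String body) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(body.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    public static void main(String[] args) {
        byte[] sample = buildSamplePackage();
        System.out.println(listEntries(sample));
        System.out.println(readEntry(sample, "mimetype"));
    }
}
```

In a real .odt file, the document text lives in `content.xml` and the document properties in `meta.xml`; a parser such as Tika's reads those parts so that you do not have to handle the ZIP and XML layers yourself.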
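Methods used:
- BodyContentHandler(): Creates a content handler that writes XHTML body character events to an internal string buffer (limited to 100,000 characters by default).
- Metadata(): Constructs a new, empty metadata object.
- ParseContext(): Creates a ParseContext object, which is used to pass context information to Tika parsers.
- parse(): Called on an instantiated parser object; it takes the input stream, content handler, metadata object, and parse context.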
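To illustrate what "writing XHTML body character events to a string buffer" means, here is a minimal sketch built only on the JDK's SAX classes. `BodyTextCollector` is a hypothetical stand-in, not Tika's BodyContentHandler, and the XHTML string is made-up sample input; the sketch only assumes that a parser reports the document as SAX events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; this is NOT Tika's BodyContentHandler.
final class BodyTextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting below it.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only character events inside <body> reach the buffer.
        if (bodyDepth > 0) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    // Parses an XHTML string and returns the text collected from its body.
    static String bodyTextOf(String xhtml) {
        BodyTextCollector handler = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>ignored</title></head>"
                + "<body><p>Hello ODF</p></body></html>";
        System.out.println(bodyTextOf(xhtml)); // prints "Hello ODF"
    }
}
```

The same pattern explains why the extracted text is read with `handler.toString()` after parsing: the handler accumulates text as events arrive, and the buffer is only complete once the parser has finished.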
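The following dependencies are required to run the Java code below (the tika-core JAR of the same version must also be on the classpath, since classes such as Metadata and ParseContext are defined there):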
tika-parsers-1.24.1.jar
commons-io-2.8.0.jar
slf4j-api-2.0.0-alpha0.jar
Implementation:
Java
// Java program to extract content from an ODF file
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class OdfContentExtractor {
    public static void main(String[] args)
    {
        // try-with-resources closes the stream even if parsing fails.
        // Here .odt is the OpenDocument text format.
        try (FileInputStream inputstream
             = new FileInputStream(new File("F:\\geeks.odt"))) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext parsecontext = new ParseContext();

            // Parsing the open document.
            OpenDocumentParser opendocumentparser
                = new OpenDocumentParser();

            // Passing the InputStream, ContentHandler,
            // Metadata and ParseContext to the parse method.
            opendocumentparser.parse(inputstream, handler,
                                     metadata, parsecontext);
            System.out.println("Content in the document :"
                               + handler.toString());

            // Displaying the metadata of the ODF file.
            System.out.println("Metadata of the document:");
            for (String name : metadata.names()) {
                System.out.println(
                    name + " : = " + metadata.get(name));
            }
        }
        catch (IOException | SAXException | TikaException e) {
            System.out.println(
                "failed to extract content due to " + e);
        }
    }
}
Output:
Content in the document :Geekforgeeks has a great content on DSA.
Metadata of the document:
date : = 2020-11-21T05:38:00Z
meta:paragraph-count : = 1
meta:word-count : = 6
meta:initial-author : = Mohan Sai
initial-creator : = Mohan Sai
dc:creator : = Mohan Sai
generator : = MicrosoftOffice/15.0 MicrosoftWord
Word-Count : = 6
dcterms:created : = 2020-11-21T05:36:00Z
dcterms:modified : = 2020-11-21T05:38:00Z
Last-Modified : = 2020-11-21T05:38:00Z
nbPara : = 1
Last-Save-Date : = 2020-11-21T05:38:00Z
meta:character-count : = 40
Paragraph-Count : = 1
meta:save-date : = 2020-11-21T05:38:00Z
modified : = 2020-11-21T05:38:00Z
Edit-Time : = PT0S
nbCharacter : = 40
nbPage : = 1
nbWord : = 6
Content-Type : = application/vnd.oasis.opendocument.text
creator : = Mohan Sai
meta:author : = Mohan Sai
meta:creation-date : = 2020-11-21T05:36:00Z
Creation-Date : = 2020-11-21T05:36:00Z
xmpTPg:NPages : = 1
Character Count : = 40
editing-cycles : = 3
Page-Count : = 1
Author : = Mohan Sai
meta:page-count : = 1
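The metadata listing above comes out in no obvious order, which makes two runs hard to compare. As a minimal sketch, assuming the order of the names is not guaranteed, the names can be sorted before printing. `SortedMetadataSketch` is a hypothetical helper that uses a plain map as a stand-in for Tika's Metadata object; the sample values are taken from the output above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; the Map stands in for Tika's Metadata.
final class SortedMetadataSketch {

    // Formats each entry as "name : = value", one per line, ordered by name.
    static String format(Map<String, String> metadata) {
        String[] names = metadata.keySet().toArray(new String[0]);
        Arrays.sort(names); // natural String order: uppercase sorts before lowercase
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            out.append(name).append(" : = ").append(metadata.get(name)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("meta:word-count", "6");
        sample.put("dc:creator", "Mohan Sai");
        sample.put("Content-Type", "application/vnd.oasis.opendocument.text");
        System.out.print(format(sample));
    }
}
```

With the program above, the equivalent change would be to call `Arrays.sort` on the `String[]` returned by `metadata.names()` before the printing loop.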