How To Parallelly Process Large XML Files In Java
An efficient approach to parsing, changing, and writing large XML files in parallel in Java using the StAX parser and concurrent programming
Dealing with XML files is always challenging. Many file formats exist, but XML still leads the list. We at Clairvoyant understand that working with it in a time- and memory-efficient fashion is not easy. This blog post is our effort to document what we learned on this topic.
The Problem
We were dealing with hundreds of XML files packaged in a zip file. The zip contained multiple XML files, and each file contained thousands of records. After unzipping, the folder size was up to a GB. The job can be broken down as follows:
- Read the zip file and unzip it into a folder for processing
- Go over each file and parse it
- Identify data, validate using external service, change if required, and write the result file to a different folder
- Zip the result folder and copy it to the destination location
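The first and last steps of the list above can be sketched with the JDK's java.util.zip classes. This is only a minimal illustration, not the code used in the project: the class name ZipFoldersSketch and both method names are hypothetical, and checked exceptions are wrapped for brevity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not taken from the original project.
public class ZipFoldersSketch {

    /** Extracts every regular file of the archive into targetDir and returns the extracted paths. */
    public static List<Path> unzip(Path zipFile, Path targetDir) {
        List<Path> extracted = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(zipFile))) {
            Files.createDirectories(targetDir);
            Path root = targetDir.toAbsolutePath().normalize();
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries that would be written outside the target folder.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry outside target folder: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    // Copies the bytes of the current entry only; the zip stream stays open.
                    Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(target);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extracted;
    }

    /** Writes every regular file under sourceDir into a new archive at zipFile. */
    public static void zip(Path sourceDir, Path zipFile) {
        try {
            List<Path> files;
            try (Stream<Path> paths = Files.walk(sourceDir)) {
                files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipFile))) {
                for (Path file : files) {
                    // Zip entry names use forward slashes on every platform.
                    String entryName = sourceDir.relativize(file).toString().replace('\\', '/');
                    out.putNextEntry(new ZipEntry(entryName));
                    Files.copy(file, out);
                    out.closeEntry();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The archive passed to zip should live outside sourceDir, otherwise the walk may pick up the archive that is being written.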
The XML schema was not too nested and contained data that looked something like this:
Also, we stored all the processing information in the database to be able to backtrack later on. Though more data and processing were involved, for ease of understanding we can simplify the core problem as follows: for each <Request>, if validation of its <ID> fails, remove the <ACKNOWLEDGEMENT> element.
The solution
Tapping into Clairvoyant's experience of dealing with similar challenges, we arrived at the following solution:
We preferred to map each <Request> element to a corresponding Java object called Request. We mapped the zip file to an InputFile object and each XML file to a DataFile object. The code snippets shown in the following sections are simplified to show the core problem mentioned earlier and skip a few details. This solution can be extended to similar problems by grouping the XMLEvents generated by the StAX parser in different ways.
Processing individual XML file
To process an individual XML file, we chose the following approach:
Phase-1: Parse the XML file and populate the Java objects, i.e. Request, InputFile, and DataFile, with data
Phase-2: Validate each Request with the help of an external REST API
Phase-3: Parse the XML a second time, remove data if required, and write to the new file
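To make the hand-off between the phases concrete, here is a minimal sketch of Phase-2. It assumes a much smaller Request than the real one, and the Predicate named idValidator is only a placeholder for the call to the external REST API, whose details the article does not give. The class name RequestValidationSketch is hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; the real Request object carries more data.
public class RequestValidationSketch {

    /** Simplified stand-in for the article's Request object. */
    public static class Request {
        private final String id;
        private boolean valid;

        public Request(String id) {
            this.id = id;
        }

        public String getId() {
            return id;
        }

        public boolean isValid() {
            return valid;
        }

        public void setValid(boolean valid) {
            this.valid = valid;
        }
    }

    /**
     * Phase-2: records the validation result on each request.
     * idValidator is a placeholder for the external REST call.
     */
    public static void validateAll(List<Request> requests, Predicate<String> idValidator) {
        for (Request request : requests) {
            request.setValid(idValidator.test(request.getId()));
        }
    }

    /** Counts the requests whose ACKNOWLEDGEMENT would be kept in Phase-3. */
    public static long countValid(List<Request> requests) {
        return requests.stream().filter(Request::isValid).count();
    }
}
```

Phase-3 then only needs to look up the valid flag for each ID while it rewrites the file.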
For parsing and changing XML data, we can use either the DOM-based approach or the streaming-based approach. The DOM-based approach requires the entire document to be in memory, which immediately rules it out for our problem. In the streaming approach, XML infosets are transmitted and parsed serially, resulting in a smaller memory footprint.
Stream parsing can be of two types, push-based or pull-based, which differ in whether the XML parser sends events to the application or the application code requests events from the parser. With pull parsing, the client controls the application thread and can call methods on the parser when needed, which works perfectly for our case. The other advantages of the pull-based approach and comparisons between the different available parsers can be found in the below-mentioned document.
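As a minimal sketch of pull parsing, the loop below asks the parser for one event at a time and collects the text of every <ID> element. It uses the StAX implementation bundled with the JDK rather than Woodstox so that it runs without extra dependencies; the class name PullParsingSketch and the method readIds are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch using the JDK's built-in StAX implementation.
public class PullParsingSketch {

    /** Pulls events one by one and returns the text content of every ID element. */
    public static List<String> readIds(String xml) {
        List<String> ids = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Coalescing makes the parser report adjacent text as a single Characters event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, true);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The application, not the parser, decides when the next event is read.
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "ID".equalsIgnoreCase(event.asStartElement().getName().getLocalPart())) {
                    XMLEvent next = reader.nextEvent();
                    if (next.isCharacters()) {
                        ids.add(next.asCharacters().getData().trim());
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return ids;
    }
}
```

Because only the current event is held in memory, the same loop works for files far larger than the heap.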
For our case, we used Woodstox's implementation of StAX.
It is fast and easy to use, and it also provides a way to validate our XML document against a DTD. Here are some code snippets:
1. Setting the DTD validator for the XML file.
XMLInputFactory2 inputFactory = (XMLInputFactory2) XMLInputFactory2.newInstance();
inputFactory.setXMLResolver(
    new XMLResolver() {
        @Override
        public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) {
            // Serve the DTD from the classpath instead of resolving its system ID.
            if (systemID != null && systemID.contains("test.dtd")) {
                return getClass().getClassLoader()
                        .getResourceAsStream("schema/test.dtd");
            }
            // Returning null lets the parser fall back to its default resolution.
            return null;
        }
    }
);
Note: The 2 suffix used in the code, i.e. XMLInputFactory2, refers to the StAX2 API (package org.codehaus.stax2), an extension of the standard StAX API provided by Woodstox.
2. Parsing the XML document and saving the results in Request objects.
// Needs java.io.*, javax.xml.stream.*, javax.xml.stream.events.* and org.codehaus.stax2.XMLInputFactory2.
try (InputStream in = new FileInputStream(xmlFile)) {
    XMLInputFactory2 inputFactory = (XMLInputFactory2) XMLInputFactory2.newInstance();
    // Passing a stream instead of a FileReader lets the parser detect the encoding declared in the XML.
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    Request request = null;
    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        if (event.isStartElement()) {
            String name = event.asStartElement().getName().getLocalPart();
            if ("REQUEST".equalsIgnoreCase(name)) {
                // A new <Request> starts: collect its data in a fresh object.
                request = new Request();
            } else if (request != null && "ID".equalsIgnoreCase(name)) {
                XMLEvent xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    request.setID(xmlEvent.asCharacters().getData());
                }
            } else if (request != null && "ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                XMLEvent xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    request.setAck(xmlEvent.asCharacters().getData());
                }
            }
        } else if (event.isEndElement()) {
            EndElement endElement = event.asEndElement();
            if ("REQUEST".equalsIgnoreCase(endElement.getName().getLocalPart())) {
                // The <Request> is complete: keep it for validation in Phase-2.
                requestList.add(request);
            }
        }
    }
    eventReader.close();
} catch (IOException | XMLStreamException e) {
    // Error handling is simplified here.
    throw new IllegalStateException("Failed to parse " + xmlFile, e);
}
3. Changing data in the XML file and writing to the new file.
// Write to newXmlFile itself; passing only newXmlFile.getName() would create the file in the working directory.
try (InputStream in = new FileInputStream(xmlFile);
     FileWriter fileWriter = new FileWriter(newXmlFile)) {
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    XMLEventWriter writer = outputFactory.createXMLEventWriter(fileWriter);
    boolean insideAck = false;   // true between <ACKNOWLEDGEMENT> and its end tag
    boolean ackNeeded = false;   // true if the buffered element must be written
    String id = null;
    List<XMLEvent> ackElementEvents = new ArrayList<>();
    while (eventReader.hasNext()) {
        XMLEvent xmlEvent = eventReader.nextEvent();
        if (xmlEvent.isStartElement()) {
            String name = xmlEvent.asStartElement().getName().getLocalPart();
            if ("ID".equalsIgnoreCase(name)) {
                writer.add(xmlEvent);
                xmlEvent = eventReader.nextEvent();
                if (xmlEvent.isCharacters()) {
                    id = xmlEvent.asCharacters().getData();
                }
            } else if ("ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                insideAck = true;
                String finalId = id;
                ackNeeded = requestList.stream()
                        .filter(r -> r.getId().equalsIgnoreCase(finalId))
                        .findFirst()
                        .map(Request::isValid)
                        .orElse(false);
            }
        }
        if (insideAck) {
            // Buffer every event of the element, including its text, until the end tag.
            ackElementEvents.add(xmlEvent);
            if (xmlEvent.isEndElement() && "ACKNOWLEDGEMENT".equalsIgnoreCase(
                    xmlEvent.asEndElement().getName().getLocalPart())) {
                if (ackNeeded) {
                    for (XMLEvent event : ackElementEvents) {
                        writer.add(event);
                    }
                }
                ackElementEvents.clear();
                insideAck = false;
                ackNeeded = false;
            }
        } else {
            writer.add(xmlEvent);
        }
    }
    writer.close();       // flush pending output before the FileWriter is closed
    eventReader.close();
}
Here we have created a separate list, ackElementEvents, to store the XML events that we conditionally want to add to the result XML file. All events of the element are buffered, so that a rejected <ACKNOWLEDGEMENT> does not leave its text content behind in the output.
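The buffering idea can be tried end to end on an in-memory document. The sketch below is a simplified, self-contained variant under several assumptions: it uses the JDK's bundled StAX implementation instead of Woodstox, it takes the set of valid IDs as a parameter instead of looking them up in requestList, and the names AckFilterSketch and filter are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch using the JDK's built-in StAX implementation.
public class AckFilterSketch {

    /** Copies the XML, dropping each ACKNOWLEDGEMENT whose Request ID is not in validIds. */
    public static String filter(String xml, Set<String> validIds) {
        StringWriter out = new StringWriter();
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            inputFactory.setProperty(XMLInputFactory.IS_COALESCING, true);
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

            List<XMLEvent> buffered = new ArrayList<>();
            boolean insideAck = false;
            boolean readingId = false;
            String currentId = null;

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if ("Request".equalsIgnoreCase(name)) {
                        currentId = null;          // forget the ID of the previous request
                    } else if ("ID".equalsIgnoreCase(name)) {
                        readingId = true;
                    } else if ("ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                        insideAck = true;
                    }
                } else if (readingId && event.isCharacters()) {
                    currentId = event.asCharacters().getData().trim();
                }

                // Events inside ACKNOWLEDGEMENT are held back; everything else is copied.
                if (insideAck) {
                    buffered.add(event);
                } else {
                    writer.add(event);
                }

                if (event.isEndElement()) {
                    String name = event.asEndElement().getName().getLocalPart();
                    if ("ID".equalsIgnoreCase(name)) {
                        readingId = false;
                    } else if ("ACKNOWLEDGEMENT".equalsIgnoreCase(name)) {
                        if (currentId != null && validIds.contains(currentId)) {
                            for (XMLEvent kept : buffered) {
                                writer.add(kept);
                            }
                        }
                        buffered.clear();
                        insideAck = false;
                    }
                }
            }
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not rewrite XML", e);
        }
        return out.toString();
    }
}
```

The buffer only ever holds the events of one <ACKNOWLEDGEMENT> element, so memory use stays small regardless of file size.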
Processing all files in parallel
For parallel execution, we used the ExecutorService provided by Java with a configurable number of threads. We iterated over all the XML files in a single zip file and created a Callable task for processing each XML file. For synchronization, we used CountDownLatch. A sample implementation is provided in the code snippet below:
File[] allXmlFiles = dataFileFolder.listFiles();
// One count per file; each XMLTask counts down when it finishes.
CountDownLatch latch = new CountDownLatch(requireNonNull(allXmlFiles).length);
for (File xmlFile : allXmlFiles) {
    XMLTask parseXMLTask = new XMLTask(xmlFile, latch);
    Future<DataFile> xmlFileProcessingDetailsFuture = executorService.submit(parseXMLTask);
    dataFileFutures.add(xmlFileProcessingDetailsFuture);
}
// Block until every task has called latch.countDown().
latch.await();
processDataFiles(dataFileFutures);
Here XMLTask is the main entity responsible for processing an individual XML file. Once all the processing is done in XMLTask, we call latch.countDown(); doing this in a finally block ensures that a failed task cannot leave latch.await() waiting forever. With latch.await() we wait until all the files are processed, at which point dataFileFutures contains all the processing information that can be saved to the DB.
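The same pattern can be shown in a self-contained form. The following sketch is an assumption-laden simplification: the generic processor function stands in for XMLTask, the class and method names are hypothetical, and failures are simply rethrown instead of being recorded in the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; processor plays the role of the article's XMLTask.
public class ParallelFilesSketch {

    /** Runs processor on every item using a fixed pool and returns the results in input order. */
    public static <T, R> List<R> processAll(List<T> items, Function<T, R> processor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch latch = new CountDownLatch(items.size());
        List<Future<R>> futures = new ArrayList<>();
        try {
            for (T item : items) {
                Callable<R> task = () -> {
                    try {
                        return processor.apply(item);
                    } finally {
                        // Count down even if processing throws, so await() cannot hang.
                        latch.countDown();
                    }
                };
                futures.add(pool.submit(task));
            }
            latch.await();
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());   // returns immediately: all tasks have finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Since Future.get() also blocks until a task completes, the latch is not strictly required here; it is kept to mirror the approach described above.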
Summary
The key takeaway here is the use of the StAX parser for manipulating XML. The approach described in this document can also be extended to many similar problems.
Source: https://blog.clairvoyantsoft.com/how-to-parallelly-process-large-xml-files-in-java-9ff3a2d32e90