[HOW TO] Parse large XML files in Ruby

In one of the projects I am working on, there was a requirement to parse a very large XML file (around 1.2 GB) in Ruby. The traditional approach, in which the whole XML file is loaded into memory and then parsed, was not feasible for a file of this size.

So, I started exploring different methods for XML parsing and came across the libxml library.

Parsing with the libxml SAX parser is event based: the parser reads the file sequentially, without loading all of it into memory, and looks for XML elements. When an element is encountered, an event is fired. To process the contents of the file, these events need to be handled.
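
To make the event flow concrete, here is a minimal sketch that does not need the library at all. A hypothetical EventLog class records each callback, and the calls at the bottom are made by hand to stand in for what a SAX parser would do for a one-book snippet. The callback names match the ones used in the skeleton further down; the class name and the snippet are only illustrative.

```ruby
# Hypothetical handler that only records the order in which callbacks arrive.
class EventLog
  attr_reader :events

  def initialize
    @events = []
  end

  def on_start_element(element, attributes)
    @events << [:start, element, attributes]
  end

  def on_characters(chars)
    @events << [:chars, chars]
  end

  def on_end_element(element)
    @events << [:end, element]
  end
end

# Stand-in for the parser: these are the calls a SAX parser would make,
# in document order, for <book id="bk101"><title>XML</title></book>.
log = EventLog.new
log.on_start_element('book', { 'id' => 'bk101' })
log.on_start_element('title', {})
log.on_characters('XML')
log.on_end_element('title')
log.on_end_element('book')

log.events.each { |event| p event }
```

The log ends up with five entries: the start of book, the start of title, the characters, the end of title and the end of book. The handler never sees the document as a whole, only one event at a time.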

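To get started, we need to install the following. The gem builds a native extension against libxml2, so the system packages go first (the commands below are for Debian or Ubuntu):

sudo apt-get install libxml2
sudo apt-get install libxml2-dev libxslt1-dev
gem install libxml-ruby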

The structure of a typical program to parse using libxml is as follows:

require 'libxml'
include LibXML

class Parser
 include XML::SaxParser::Callbacks 

 def initialize
  # Constructor
 end

 def on_start_element(element, attributes)  
  # This event is fired when the start of an element is found.
 end

 def on_cdata_block(cdata)
  # This event is fired when a CDATA block is found.
 end

 def on_characters(chars)
  # This event is fired when characters are encountered between the start and end of an element.
 end

 def on_end_element(element)
  # This event is fired when the end of an element is found.
 end

end

parser = XML::SaxParser.file("large_file.xml")
parser.callbacks = Parser.new
parser.parse

Let us try this out with an example. For this example I am using the XML file from the following location:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms762271%28v=vs.85%29.aspx

I have saved the XML file as large_file.xml. As this is just an example, I am using a small file; however, the code above will work for large files too without any change.

Sample from the XML file:

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

So the code to parse an XML file containing book elements like the one shown above is as follows:

require 'libxml'
include LibXML

class Parser
 include XML::SaxParser::Callbacks

 def initialize
  # No text is collected until one of the field elements starts.
  @read_string = nil
 end

 def on_start_element(element, attributes)
  if element.to_s == "catalog"
   puts "Catalog Started"
  end

  if element.to_s == "book"
   puts "ID : " + attributes["id"].to_s
  end

  # For each field element, start collecting its text in a fresh buffer.
  if element.to_s == "author"
   @read_string = ""
  end

  if element.to_s == "title"
   @read_string = ""
  end

  if element.to_s == "genre"
   @read_string = ""
  end

  if element.to_s == "price"
   @read_string = ""
  end

  if element.to_s == "publish_date"
   @read_string = ""
  end

  if element.to_s == "description"
   @read_string = ""
  end
 end

 def on_cdata_block(cdata)
  puts "CDATA Found: " + cdata.to_s
 end

 def on_characters(chars)
  # Append rather than assign: text may arrive in more than one call.
  if @read_string != nil
   @read_string = @read_string + chars
  end
 end

 def on_end_element(element)
  if element.to_s == "catalog"
   puts "Catalog Ended"
  end

  if element.to_s == "book"
   # Blank line between books.
   puts "\n"
  end

  if element.to_s == "author"
   puts "Author :" + @read_string
   @read_string = nil
  end

  if element.to_s == "title"
   puts "Title :" + @read_string
   @read_string = nil
  end

  if element.to_s == "genre"
   puts "Genre :" + @read_string
   @read_string = nil
  end

  if element.to_s == "price"
   puts "Price :" + @read_string
   @read_string = nil
  end

  if element.to_s == "publish_date"
   puts "Publish Date :" + @read_string
   @read_string = nil
  end

  if element.to_s == "description"
   puts "Description :" + @read_string
   @read_string = nil
  end
 end

end

parser = XML::SaxParser.file("large_file.xml")
parser.callbacks = Parser.new
parser.parse

As you can see above, the event handlers process the XML file element by element.
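
If the goal is to work with whole records rather than print each field, the same callbacks can build one Hash per book and hand it to a block as soon as the closing book tag arrives. The following is only a sketch of that idea: BookCollector and FIELDS are hypothetical names, and the calls at the bottom drive the handler by hand so the snippet runs without the library. To attach it to the real parser it would also need include XML::SaxParser::Callbacks, as in the Parser class, and would be assigned to parser.callbacks.

```ruby
# Hypothetical handler: builds one Hash per <book> and passes it to a block.
class BookCollector
  # Child elements of <book> whose text should be kept.
  FIELDS = %w[author title genre price publish_date description].freeze

  def initialize(&on_book)
    @on_book = on_book
    @book = nil
    @text = nil
  end

  def on_start_element(element, attributes)
    name = element.to_s
    if name == 'book'
      @book = { 'id' => attributes['id'] }
    elsif @book && FIELDS.include?(name)
      @text = String.new
    end
  end

  # Text may arrive in several chunks, so append instead of assigning.
  def on_characters(chars)
    @text << chars if @text
  end

  def on_end_element(element)
    name = element.to_s
    if name == 'book'
      @on_book.call(@book)
      @book = nil # drop the finished record so it can be garbage collected
    elsif @text && FIELDS.include?(name)
      @book[name] = @text
      @text = nil
    end
  end
end

# Hand-driven stand-in for the parser, with the title split into two chunks.
books = []
collector = BookCollector.new { |book| books << book }
collector.on_start_element('book', { 'id' => 'bk101' })
collector.on_start_element('title', {})
collector.on_characters('XML Developer')
collector.on_characters("'s Guide")
collector.on_end_element('title')
collector.on_end_element('book')

p books
```

Here the block stores every book in an array only for demonstration. For a very large file the block would normally process each record (write it to a database, for instance) and let it go, so that only one book is held in memory at a time.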

Sample output of the Parser class above, run against large_file.xml:

Catalog Started
ID : bk101
Author :Gambardella, Matthew
Title :XML Developer's Guide
Genre :Computer
Price :44.95
Publish Date :2000-10-01
Description :An in-depth look at creating applications 
      with XML.

ID : bk102
Author :Ralls, Kim
Title :Midnight Rain
Genre :Fantasy
Price :5.95
Publish Date :2000-12-16
Description :A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.

ID : bk103
Author :Corets, Eva
Title :Maeve Ascendant
Genre :Fantasy
Price :5.95
Publish Date :2000-11-17
Description :After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
.....
.....
Catalog Ended

The code for the parser above can be found at the following location:

https://github.com/rohitsden/XMLParser
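
To try the parser on a larger input without hunting for a big file, one option is to generate a synthetic catalog. This is a minimal sketch and not part of the repository above: write_catalog, the output file name and the field values are all made up for illustration, and only the element names follow the sample file.

```ruby
# Hypothetical helper: writes `count` synthetic <book> elements to `path`,
# using the same element names as the sample catalog.
def write_catalog(path, count)
  File.open(path, 'w') do |file|
    file.puts '<?xml version="1.0"?>'
    file.puts '<catalog>'
    1.upto(count) do |i|
      file.puts %(   <book id="bk#{i}">)
      file.puts "      <author>Author #{i}</author>"
      file.puts "      <title>Title #{i}</title>"
      file.puts '      <genre>Computer</genre>'
      file.puts '      <price>9.95</price>'
      file.puts '      <publish_date>2000-10-01</publish_date>'
      file.puts "      <description>Synthetic record #{i}.</description>"
      file.puts '   </book>'
    end
    file.puts '</catalog>'
  end
end

# 1,000 books is only a smoke test; raise the count to grow the file.
write_catalog('generated_catalog.xml', 1_000)
puts "Wrote #{File.size('generated_catalog.xml')} bytes"
```

Pointing XML::SaxParser.file at generated_catalog.xml instead of large_file.xml then runs the same callbacks over the generated data.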

Hope this helps. Let me know if you need further information.
