lxml Library Reference

Introduction

An LFE XML Library

lxml is an LFE library which wraps erlsom’s XML-parsing capabilities and adds an API for easily accessing the parsed data.

Wrapping erlsom

lxml makes only one call to erlsom, in the private function lxml:parse-body-raw. All of lxml’s exported parsing functions ultimately pass through this function. As referenced in the introduction, lxml only supports erlsom’s simple form, and this is what parse-body-raw calls.

erlsom:simple_form/1 takes either string or binary data and, upon a successful parse, returns an LFE data structure which represents the original XML. Both the inputs and outputs of lxml:parse are covered in more detail below.
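As a rough illustration of the call being wrapped, the sketch below invokes erlsom:simple_form/1 directly and keeps only the parsed tree. The helper name simple-tree is hypothetical and is not part of lxml; the sketch also assumes erlsom is available on the code path.

```lfe
(defun simple-tree (xml)
  ;; Hypothetical helper, not part of lxml.
  (case (erlsom:simple_form xml)
    ;; Success: erlsom returns the tree plus any trailing input.
    ((tuple 'ok tree _tail) tree)
    ;; Anything else is handed back to the caller unchanged.
    (other other)))

(simple-tree "<xml>data</xml>")
```

The tree that comes back should have the same string-keyed shape as the raw results shown later in this document.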

Inputs

To demonstrate string and binary inputs, let’s define some data:

> (set data-1 "<xml>data</xml>")
"<xml>data</xml>"
> (set data-2 (binary "<xml>data</xml>"))
#B(60 120 109 108 62 100 97 116 97 60 47 120 109 108 62)

Now we can parse these and show that lxml (via erlsom) handles string and binary input in the same way:

> (=:= (lxml:parse data-1)
       (lxml:parse data-2))
true

lxml handles the following types of input: XML strings, XML binaries, XML files, and URLs.

For the first two, you simply pass lxml the data you want to parse. If you want to parse an XML file or URL, you’ll need to pass a tuple to parse with the appropriate option set.

For more information on the parse function, see the “API” section.

Options

When calling parse you can simply pass the data (or URL or filename), or you can also pass a second argument: an “options” property list. The supported options are:

Outputs

TBD

The API

start

Starting up lxml for parsing URLs and following media links:

> (lxml:start)
(#(inets ok) #(ssl ok) #(lhttpc ok) #(lhc ok) #(lxml ok))

This function simply starts the lhc LFE HTTP client. If you plan on parsing XML from URLs or if you want to be able to call the get-linked function, you will first need to call start.

parse

For this and subsequent API functions, we will use the following sample data, assumed to be saved to ./data.xml:

<life>
  <bacteria division="domain">
    <bacterium>spirochetes</bacterium>
    <bacterium>proteobacteria</bacterium>
    <bacterium>cyanobacteria</bacterium>
  </bacteria>
  <archaea division="domain">
    <archaum></archaum>
  </archaea>
  <eukaryota division="domain">
    <eukaryotum>slime molds</eukaryotum>
    <eukaryotum>fungi</eukaryotum>
    <eukaryotum>plants</eukaryotum>
    <eukaryotum>animals</eukaryotum>
  </eukaryota>
</life>

Parse the data normally:

> (lxml:parse #(file "data.xml"))
(#(tag "life")
 #(attr ())
 #(content
   #(life ()
     (#(bacteria
        (#(division "domain"))
        (#(bacterium () ("spirochetes"))
         #(bacterium () ("proteobacteria"))
         #(bacterium () ("cyanobacteria"))))
      #(archaea (#(division "domain")) (#(archaum () ())))
      #(eukaryota
        (#(division "domain"))
        (#(eukaryotum () ("slime molds"))
         #(eukaryotum () ("fungi"))
         #(eukaryotum () ("plants"))
         #(eukaryotum () ("animals")))))))
 #(tail "\n"))

Parse the data, requesting a raw result:

> (lxml:parse #(file "data.xml") '(#(result-type raw)))
#(ok
  #("life"
    ()
    (#("bacteria"
       (#("division" "domain"))
       (#("bacterium" () ("spirochetes"))
        #("bacterium" () ("proteobacteria"))
        #("bacterium" () ("cyanobacteria"))))
     #("archaea" (#("division" "domain")) (#("archaum" () ())))
     #("eukaryota"
       (#("division" "domain"))
       (#("eukaryotum" () ("slime molds"))
        #("eukaryotum" () ("fungi"))
        #("eukaryotum" () ("plants"))
        #("eukaryotum" () ("animals"))))))
  "\n")

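When parsing the data normally, a 4-element property list is returned. As the output above shows, its keys are tag (the name of the root element), attr (the root element’s attributes), content (the parsed tree), and tail (any text following the root element).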

Note that XML tags with no attributes simply have an empty list in the “attributes” portion of the parsed results.

Normal parsing also converts nested tag names from strings to atoms.
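Because the normal result is an ordinary property list keyed by atoms, the standard proplists module can be used to pull out individual parts. The sketch below uses a small hand-written stand-in for a parsed result rather than a real call to lxml:parse, so the data is an assumption made for illustration only. It is written for the LFE REPL, where set is available.

```lfe
;; A hand-written stand-in for the result of a normal parse.
(set parsed
  '(#(tag "life")
    #(attr ())
    #(content #(life () (#(archaea (#(division "domain")) ()))))
    #(tail "\n")))

(proplists:get_value 'tag parsed)      ; the root tag name
(proplists:get_value 'attr parsed)     ; the root attributes
(proplists:get_value 'content parsed)  ; the parsed tree
(proplists:get_value 'tail parsed)     ; text after the root element
```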

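When asking for raw results, the XML data is passed directly to erlsom and the result is returned without any filtering. As the raw output above shows, the data from erlsom is returned as a 3-tuple of the form #(ok tree tail): the atom ok, the parsed tree in erlsom’s simple form (with tag names and attribute keys as strings), and any text remaining after the root element.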

get-data

By default, get-data operates on a parsed result set:

> (lxml:get-data (lxml:parse #(file "data.xml")))
#(life ()
  (#(bacteria
     (#(division "domain"))
     (#(bacterium () ("spirochetes"))
      #(bacterium () ("proteobacteria"))
      #(bacterium () ("cyanobacteria"))))
   #(archaea (#(division "domain")) (#(archaum () ())))
   #(eukaryota
     (#(division "domain"))
     (#(eukaryotum () ("slime molds"))
      #(eukaryotum () ("fungi"))
      #(eukaryotum () ("plants"))
      #(eukaryotum () ("animals"))))))

However, by passing options, one may call get-data in the following, more concise manner:

> (lxml:get-data #(file "data.xml"))
#(life ()
  (#(bacteria ...)))

Or:

> (lxml:get-data #(url "http://example.com/data.xml"))
#(life ()
  (#(bacteria ...)))

Or:

> (lxml:get-data #(xml "<life> ... </life>"))
#(life ()
  (#(bacteria ...)))

The get-data function is provided as a convenience, as more often than not, one cares about the #(content ...) tuple in the parsed property list, and this is what get-data returns.

get-data also provides a convenience wrapper around parse: if you provide a tuple with any of the following keys, parse is called under the covers, relieving the user of any need to make that call. The valid keys are file, url, and xml.
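The two calling styles can be pictured as a small dispatch on the argument's type. The following is only a sketch of that idea, not lxml's actual implementation: the function name content-of is hypothetical, and the second clause assumes that lxml:parse accepts each kind of source tuple.

```lfe
(defun content-of
  ;; Already-parsed result: a property list, so look up the content.
  ((parsed) (when (is_list parsed))
   (proplists:get_value 'content parsed))
  ;; Source tuple such as #(file "data.xml"): parse first, then
  ;; look up the content in the result.
  ((source) (when (is_tuple source))
   (content-of (lxml:parse source))))
```

With this shape, (content-of (lxml:parse #(file "data.xml"))) and (content-of #(file "data.xml")) would be expected to produce the same value.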

get-in

First let’s set some data:

> (set data (lxml:get-data #(file "data.xml")))
#(life ()
  (#(bacteria ...)))

Using get-in trivially, with one key:

> (lxml:get-in '(life) data)
(#(bacteria
   (#(division "domain"))
   (#(bacterium () ("spirochetes"))
    #(bacterium () ("proteobacteria"))
    #(bacterium () ("cyanobacteria"))))
 #(archaea (#(division "domain")) (#(archaum () ())))
 #(eukaryota
   (#(division "domain"))
   (#(eukaryotum () ("slime molds"))
    #(eukaryotum () ("fungi"))
    #(eukaryotum () ("plants"))
    #(eukaryotum () ("animals")))))

Using get-in with two keys:

> (lxml:get-in '(life bacteria) data)
(#(bacterium () ("spirochetes"))
 #(bacterium () ("proteobacteria"))
 #(bacterium () ("cyanobacteria")))

Using get-in with three:

> (lxml:get-in '(life bacteria bacterium) data)
"spirochetes"

Or, to get the third bacterium:

> (lxml:get-in '(life bacteria 3) data)
"cyanobacteria"

For each of these, we could also have used file, url, or xml:

> (lxml:get-in '(life bacteria) #(file "data.xml"))
(#(bacterium () ("spirochetes"))
 #(bacterium () ("proteobacteria"))
 #(bacterium () ("cyanobacteria")))

get-in is inspired by the Clojure function of the same name and provides an easy means of extracting data nested to any depth. A list of keys is provided, as well as parsed XML data. It is expected that each subsequent key references a child element. The parsed XML tree is walked until the last key is reached, at which point the data at that node is returned.

Keys may either be atoms (keys in the proplist) or integers (1-based indices). This allows for situations when sibling elements have the same name and can only be distinguished by index.
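The walk described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not lxml's implementation: the names children, step, and walk are hypothetical, every element is assumed to be a #(name attrs children) tuple, and no error handling is attempted for keys that do not match.

```lfe
(defun children
  ;; An element is #(name attrs children); return its child list.
  (((tuple _name _attrs kids)) kids))

(defun step
  ;; Integer keys select a child by 1-based position.
  ((key kids) (when (is_integer key))
   (children (lists:nth key kids)))
  ;; Atom keys select the first child element with that name.
  ((key kids) (when (is_atom key))
   (children (lists:keyfind key 1 kids))))

(defun walk (keys root)
  ;; Wrap the root in a list so the first key can match it by name.
  (lists:foldl (lambda (key kids) (step key kids))
               (list root)
               keys))

;; A cut-down version of the sample data, for trying the sketch out.
(set tree
  #(life ()
    (#(bacteria
       (#(division "domain"))
       (#(bacterium () ("spirochetes"))
        #(bacterium () ("proteobacteria"))
        #(bacterium () ("cyanobacteria")))))))

(walk '(life bacteria) tree)
(walk '(life bacteria 3) tree)
```

Note that this sketch always returns the child list of the final node, so the last call yields a one-element list containing "cyanobacteria", whereas get-in in the examples above returns the bare string.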

As you may have guessed, get-in provides the same optional use as that of get-data and supports the same tuple keys: file, url, and xml.

get-attr-in

TBD

get-linked

TBD

map

Get the raw data of our sample XML file, without any of lxml’s post-parse-processing:

> (set `#(ok ,data ,_) (lxml:parse #(file "data.xml") '(#(result-type raw))))
#(ok
  #("life"
    ()
    (#("bacteria"
       (#("division" "domain"))
       (#("bacterium" () ("spirochetes"))
        #("bacterium" () ("proteobacteria"))
        #("bacterium" () ("cyanobacteria"))))
     #("archaea" (#("division" "domain")) (#("archaum" () ())))
     #("eukaryota"
       (#("division" "domain"))
       (#("eukaryotum" () ("slime molds"))
        #("eukaryotum" () ("fungi"))
        #("eukaryotum" () ("plants"))
        #("eukaryotum" () ("animals"))))))
  "\n")

Using lxml:map/2, we can modify the content of all tags. For example, the following will uppercase all the content data:

> (lxml:map #'string:to_upper/1 data)
#("life"
  ()
  (#("bacteria"
     (#("division" "domain"))
     (#("bacterium" () ("SPIROCHETES"))
      #("bacterium" () ("PROTEOBACTERIA"))
      #("bacterium" () ("CYANOBACTERIA"))))
   #("archaea" (#("division" "domain")) (#("archaum" () ())))
   #("eukaryota"
     (#("division" "domain"))
     (#("eukaryotum" () ("SLIME MOLDS"))
      #("eukaryotum" () ("FUNGI"))
      #("eukaryotum" () ("PLANTS"))
      #("eukaryotum" () ("ANIMALS"))))))

In the raw data we got back, all the tags and attribute keys are strings. What if we’d like to convert those to atoms? We can use lxml:map/4 and just use the identity function for processing the content, since we don’t want that modified:

> (lxml:map #'list_to_atom/1 #'lxml:key->atom/1 #'lxml:ident/1 data)
#(life ()
  (#(bacteria
     (#(division "domain"))
     (#(bacterium () ("spirochetes"))
      #(bacterium () ("proteobacteria"))
      #(bacterium () ("cyanobacteria"))))
   #(archaea (#(division "domain")) (#(archaum () ())))
   #(eukaryota
     (#(division "domain"))
     (#(eukaryotum () ("slime molds"))
      #(eukaryotum () ("fungi"))
      #(eukaryotum () ("plants"))
      #(eukaryotum () ("animals"))))))
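To make the roles of the three functions concrete, here is a minimal sketch of a tree-aware map over erlsom's simple form. It is an illustration only and may differ from how lxml:map is written: map-tree and key->atom are hypothetical names defined here, and anything that is not a 3-tuple is assumed to be text content.

```lfe
(defun map-tree
  ;; Element: transform the tag and each attribute, then recurse
  ;; into the children.
  ((tag-fn attr-fn content-fn (tuple tag attrs kids))
   (tuple (funcall tag-fn tag)
          (lists:map attr-fn attrs)
          (lists:map (lambda (kid)
                       (map-tree tag-fn attr-fn content-fn kid))
                     kids)))
  ;; Anything that is not a 3-tuple is treated as text content.
  ((_tag-fn _attr-fn content-fn text)
   (funcall content-fn text)))

(defun key->atom
  ;; Hypothetical attribute function: atomize the key, keep the value.
  (((tuple key value)) (tuple (list_to_atom key) value)))

(map-tree #'list_to_atom/1
          (lambda (attr) (key->atom attr))
          (lambda (text) text)
          #("xml" (#("lang" "en")) ("data")))
```

Passing identity functions for the tag and attribute arguments reduces this to the single-function behavior shown earlier, where only the content is changed.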

map functions typically operate over lists of arbitrary data; the function passed to map is expected to know something about the list data. The map functions in lxml differ significantly from standard map functions in the following ways:

As a result of the last bullet point, lxml:map comes in three arities: