Introduction
An LFE XML Library
lxml is an LFE library which wraps erlsom's XML-parsing capabilities and adds an API for easily accessing the parsed data.
Wrapping erlsom
lxml only makes one call to erlsom: in the private function lxml:parse-body-raw.
It is through this function that all of lxml's exported parsing functions
ultimately pass. As referenced in the introduction, lxml only supports erlsom's
simple form, and this is what parse-body-raw calls.
erlsom:simple_form/1 takes either string or binary data and, upon a successful
parse, returns an LFE data structure which represents the original XML. Both
the inputs and outputs of lxml:parse are covered in more detail below.
Inputs
To demonstrate string and binary inputs, let's define some data:
lfe> (set data-1 "<xml>data</xml>")
"<xml>data</xml>"
lfe> (set data-2 (binary "<xml>data</xml>"))
#"<xml>data</xml>"
Now we can parse these and show that lxml (via erlsom) handles both string and binary input the same:
lfe> (=:= (lxml:parse data-1)
(lxml:parse data-2))
true
lxml handles the following types of input:
- string XML data
- binary XML data
- a file whose data is XML
- a URL that points to XML data
For the first two, you simply pass lxml the data you want to parse. If you want
to parse an XML file or URL, you'll need to pass a tuple to parse with the
appropriate option set.
For more information on the parse function, see the "API" section.
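As a rough illustration of how the different input types could be told apart, here is a minimal sketch. The module input-sketch and the function load-xml are hypothetical names made up for this example and are not part of lxml; the url case is left out because it needs an HTTP client.

```lfe
(defmodule input-sketch
  (export (load-xml 1)))

;; Hypothetical helper (not lxml's actual code): turn a supported input
;; into XML data that could then be handed to a parser.
(defun load-xml
  ;; #(xml data): the XML is already in hand.
  (((tuple 'xml data)) data)
  ;; #(file filename): read the file contents as a binary.
  (((tuple 'file filename))
   (let (((tuple 'ok contents) (file:read_file filename)))
     contents))
  ;; Plain string or binary input is passed through unchanged.
  ((data) (when (is_list data)) data)
  ((data) (when (is_binary data)) data))
```

Dispatching on the tuple's first element keeps the plain string and binary cases free of any wrapping.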
Options
When calling parse you can simply pass the data (or URL or filename), or
you can also pass a second argument: an "options" property list. The supported
options are:
- result-type - can be the atom raw. When this is the case, no processing is done on the results; they are simply returned as obtained from erlsom. Without the #(result-type raw) option set, lxml will return keys as atoms (with it, they are left as-is: strings).
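Since the options are an ordinary property list, checking for the raw result type can be sketched with the standard proplists module. The module opts-sketch and the function raw? are hypothetical names used only for illustration; lxml's own option handling may differ.

```lfe
(defmodule opts-sketch
  (export (raw? 1)))

;; Hypothetical helper (not part of lxml): true when the options property
;; list asks for raw results, e.g. '(#(result-type raw)).
;; proplists:get_value/2 returns the atom undefined when the key is absent,
;; so an empty options list yields false.
(defun raw? (options)
  (=:= (proplists:get_value 'result-type options) 'raw))
```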
Outputs
TBD
The API
start
Starting up lxml for parsing URLs and following media links:
lfe> (lxml:start)
#(ok (lxml))
This function simply starts the lhc LFE HTTP client. If you plan on parsing XML
from URLs or if you want to be able to call the get-linked function, you
will first need to call start.
parse
For this and subsequent API functions, we will use the following sample data, part of the lxml codebase and available in ./test/data.xml:
<life>
<bacteria division="domain">
<bacterium>spirochetes</bacterium>
<bacterium>proteobacteria</bacterium>
<bacterium>cyanobacteria</bacterium>
</bacteria>
<archaea division="domain">
<archaum></archaum>
</archaea>
<eukaryota division="domain">
<eukaryotum>slime molds</eukaryotum>
<eukaryotum>fungi</eukaryotum>
<eukaryotum>plants</eukaryotum>
<eukaryotum>animals</eukaryotum>
</eukaryota>
</life>
Parse the data normally:
lfe> (lxml:start)
#(ok (lxml))
lfe> (lxml:parse #(file "test/data.xml"))
(#(tag "life")
#(attr ())
#(content
#("life"
()
(#("bacteria"
(#("division" "domain"))
(#("bacterium" () ("spirochetes"))
#("bacterium" () ("proteobacteria"))
#("bacterium" () ("cyanobacteria"))))
#("archaea" (#("division" "domain")) (#("archaum" () ())))
#("eukaryota"
(#("division" "domain"))
(#("eukaryotum" () ("slime molds"))
#("eukaryotum" () ("fungi"))
#("eukaryotum" () ("plants"))
#("eukaryotum" () ("animals")))))))
#(tail "\n\n"))
Parse the data, requesting a raw result:
lfe> (lxml:parse #(file "test/data.xml") '(#(result-type raw)))
#(ok
#("life"
()
(#("bacteria"
(#("division" "domain"))
(#("bacterium" () ("spirochetes"))
#("bacterium" () ("proteobacteria"))
#("bacterium" () ("cyanobacteria"))))
#("archaea" (#("division" "domain")) (#("archaum" () ())))
#("eukaryota"
(#("division" "domain"))
(#("eukaryotum" () ("slime molds"))
#("eukaryotum" () ("fungi"))
#("eukaryotum" () ("plants"))
#("eukaryotum" () ("animals"))))))
"\n\n")
When parsing the data normally, a 4-element property list is returned. The keys and values of the proplist are as follows:
- tag - the top-level tag of the parsed XML data
- attr - the attributes of the top-level tag
- content - the content of the top-level tag; this includes the full XML result, minus any trailing characters
- tail - any trailing characters
Note that XML tags with no attributes simply have an empty list in the "attributes" portion of the parsed results.
Normal parsing also converts nested tag names from strings to atoms.
When asking for raw results, the XML data is passed directly to erlsom and the result is returned without any filtering. As the example above shows, a successful parse is returned as a 3-tuple where the elements of the tuple are as follows:
- status - ok (or error on failure)
- result - the parsed data structure (or an error message on failure)
- tail - any trailing characters
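The relationship between the raw tuple and the property list can be sketched as follows. This is an illustration based only on the two example outputs shown above, not lxml's actual implementation; the module result-sketch and the function raw->proplist are hypothetical names.

```lfe
(defmodule result-sketch
  (export (raw->proplist 1)))

;; Hypothetical helper: reshape a raw #(ok element tail) result into the
;; tag/attr/content/tail property list described above.
(defun raw->proplist
  ;; Successful parse: pull the tag and attributes out of the top-level
  ;; element, and keep the whole element as the content.
  (((tuple 'ok (= (tuple tag attrs _) element) tail))
   (list (tuple 'tag tag)
         (tuple 'attr attrs)
         (tuple 'content element)
         (tuple 'tail tail)))
  ;; Failed parse: hand the error tuple back unchanged.
  (((= (tuple 'error _) err))
   err))
```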
get-data
By default, get-data operates on a parsed result set:
lfe> (lxml:get-data (lxml:parse #(file "test/data.xml")))
#("life"
()
(#("bacteria"
(#("division" "domain"))
(#("bacterium" () ("spirochetes"))
#("bacterium" () ("proteobacteria"))
#("bacterium" () ("cyanobacteria"))))
#("archaea" (#("division" "domain")) (#("archaum" () ())))
#("eukaryota"
(#("division" "domain"))
(#("eukaryotum" () ("slime molds"))
#("eukaryotum" () ("fungi"))
#("eukaryotum" () ("plants"))
#("eukaryotum" () ("animals"))))))
However, by passing a file, url, or xml tuple instead, one may call get-data in the following, more concise manner:
lfe> (lxml:get-data #(file "test/data.xml"))
#(life ()
(#(bacteria ...)))
Or:
lfe> (lxml:get-data #(url "http://example.com/data.xml"))
#(life ()
(#(bacteria ...)))
Or:
lfe> (lxml:get-data #(xml "<life> ... </life>"))
#(life ()
(#(bacteria ...)))
The get-data function is provided as a convenience, as more often than not,
one cares about the #(content ...) tuple in the parsed property list, and
this is what get-data returns.
get-data also provides a convenience wrapper around parse: if you provide
a tuple with any of the following keys, the parse command is called under
the covers, relieving the user of any need to make that call. The valid
keys are:
- file
- url
- xml
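Because the parsed result is a property list, extracting the content entry can be sketched in one line with the standard proplists module. The module content-sketch and the function content are hypothetical names for this illustration and say nothing about how get-data is actually written.

```lfe
(defmodule content-sketch
  (export (content 1)))

;; Hypothetical helper: return the value stored under the content key of a
;; parsed property list, or the atom undefined if the key is missing.
(defun content (parsed)
  (proplists:get_value 'content parsed))
```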
get-in
First let's set some data:
lfe> (set data (lxml:get-data #(file "test/data.xml")))
Using get-in trivially, with one key:
lfe> (lxml:get-in '("life") data)
(#("bacteria"
(#("division" "domain"))
(#("bacterium" () ("spirochetes"))
#("bacterium" () ("proteobacteria"))
#("bacterium" () ("cyanobacteria"))))
#("archaea" (#("division" "domain")) (#("archaum" () ())))
#("eukaryota"
(#("division" "domain"))
(#("eukaryotum" () ("slime molds"))
#("eukaryotum" () ("fungi"))
#("eukaryotum" () ("plants"))
#("eukaryotum" () ("animals")))))
Using get-in with two keys:
lfe> (lxml:get-in '("life" "bacteria") data)
(#("bacterium" () ("spirochetes"))
#("bacterium" () ("proteobacteria"))
#("bacterium" () ("cyanobacteria")))
Using get-in with three:
lfe> (lxml:get-in '("life" "bacteria" "bacterium") data)
"spirochetes"
For a series of elements that share the same name, the first is returned by default. As such, the above is identical to this:
lfe> (lxml:get-in '("life" "bacteria" 1) data)
"spirochetes"
To get the third bacterium:
lfe> (lxml:get-in '("life" "bacteria" 3) data)
"cyanobacteria"
For each of these, we could also have used file, url, or xml:
lfe> (lxml:get-in '("life" "bacteria") #(file "test/data.xml"))
(#(bacterium () ("spirochetes"))
#(bacterium () ("proteobacteria"))
#(bacterium () ("cyanobacteria")))
get-in is inspired by the Clojure function of the same name and provides
an easy means of extracting data nested to any depth. A list of keys is
provided, as well as parsed XML data. It is expected that each subsequent key
references a child element. The parsed XML tree is walked until the last
key is reached, at which point the data at that node is returned.
Keys may either be atoms (keys in the proplist) or integers (1-based indices). This allows for situations when sibling elements have the same name and can only be distinguished by index.
As you may have guessed, get-in provides the same optional use as that
of get-data and supports the same tuple keys:
- file
- url
- xml
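The tree walk described above can be sketched in a few clauses. This is a simplified illustration that assumes string tag names, as in the raw examples, and does not handle the file, url, or xml tuples; the module get-in-sketch is a hypothetical name and the code is not lxml's implementation. A missing key simply crashes with a function_clause error.

```lfe
(defmodule get-in-sketch
  (export (get-in 2)))

(defun get-in
  ;; No keys left and the node holds a single text value: unwrap it.
  ((() (list text)) (when (is_list text)) text)
  ;; No keys left: return whatever has been reached.
  ((() data) data)
  ;; Root element: the key must match its tag; descend into its children.
  (((cons key keys) (tuple tag _ children)) (when (=:= key tag))
   (get-in keys children))
  ;; Integer key: pick the nth child (1-based) and descend into it.
  (((cons key keys) children) (when (is_integer key) (is_list children))
   (get-in keys (children-of (lists:nth key children))))
  ;; Name key: pick the first child whose tag matches and descend into it.
  (((cons key keys) children) (when (is_list children))
   (get-in keys (children-of (lists:keyfind key 1 children)))))

;; The third element of a #(tag attrs children) tuple.
(defun children-of
  (((tuple _ _ children)) children))
```

Using lists:keyfind/3 for name keys is what makes the first of several same-named siblings the default, matching the behaviour described earlier.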
get-attr-in
TBD
get-linked
TBD
map
Get the raw data of our sample XML file, without any of lxml's post-parse-processing:
lfe> (set `#(ok ,data ,_) (lxml:parse #(file "test/data.xml") '(#(result-type raw))))
#(ok
#("life"
()
(#("bacteria"
(#("division" "domain"))
(#("bacterium" () ("spirochetes"))
#("bacterium" () ("proteobacteria"))
#("bacterium" () ("cyanobacteria"))))
#("archaea" (#("division" "domain")) (#("archaum" () ())))
#("eukaryota"
(#("division" "domain"))
(#("eukaryotum" () ("slime molds"))
#("eukaryotum" () ("fungi"))
#("eukaryotum" () ("plants"))
#("eukaryotum" () ("animals"))))))
"\n")
Using lxml:map/2, we can modify the content of all tags. For example, the following will uppercase all the content data:
lfe> (lxml:map #'string:to_upper/1 data)
#("life"
()
(#("bacteria"
(#("division" "domain"))
(#("bacterium" () ("SPIROCHETES"))
#("bacterium" () ("PROTEOBACTERIA"))
#("bacterium" () ("CYANOBACTERIA"))))
#("archaea" (#("division" "domain")) (#("archaum" () ())))
#("eukaryota"
(#("division" "domain"))
(#("eukaryotum" () ("SLIME MOLDS"))
#("eukaryotum" () ("FUNGI"))
#("eukaryotum" () ("PLANTS"))
#("eukaryotum" () ("ANIMALS"))))))
In the raw data we got back, the tags and attribute keys are all strings. What if we'd like to convert those to atoms? We can use lxml:map/4 and just use the identity function for processing the content, since we don't want that modified:
lfe> (lxml:map #'list_to_atom/1 #'lxml:key->atom/1 #'lxml:ident/1 data)
#(life ()
(#(bacteria
(#(division "domain"))
(#(bacterium () ("spirochetes"))
#(bacterium () ("proteobacteria"))
#(bacterium () ("cyanobacteria"))))
#(archaea (#(division "domain")) (#(archaum () ())))
#(eukaryota
(#(division "domain"))
(#(eukaryotum () ("slime molds"))
#(eukaryotum () ("fungi"))
#(eukaryotum () ("plants"))
#(eukaryotum () ("animals"))))))
map functions typically operate over lists of arbitrary data; the function
passed to the map function is what is expected to know something about the
list data. The map functions in lxml are significantly different from
standard map functions in the following ways:
- They operate on nested data,
- They understand that the lists will contain either the 3-tuple of parsed XML data, or a list of attribute tuples, and
- There are potentially three functions one may pass: one to operate on the tag, one on the attributes, and another on the contents.
As a result of the last bullet point, lxml:map comes in three arities:
- lxml:map/2 - takes a content-manipulating function and parsed XML data
- lxml:map/3 - takes an attribute-manipulating function, a content-manipulating function, and parsed XML data
- lxml:map/4 - takes a tag-manipulating function, an attribute-manipulating function, a content-manipulating function, and parsed XML data
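To make the three-function idea concrete, here is a minimal sketch of a nested map over the 3-tuple structure. It assumes the attribute function receives each attribute tuple whole, as the key->atom example above suggests; the module map-sketch and the function map-tree are hypothetical names, and this is not lxml's own code.

```lfe
(defmodule map-sketch
  (export (map-tree 4)))

(defun map-tree
  ;; An element: transform the tag, each attribute tuple, and then recurse
  ;; into every child with the same three functions.
  ((tag-fn attr-fn content-fn (tuple tag attrs children))
   (tuple (funcall tag-fn tag)
          (lists:map attr-fn attrs)
          (lists:map (lambda (child)
                       (map-tree tag-fn attr-fn content-fn child))
                     children)))
  ;; Anything that is not a 3-tuple is treated as text content.
  ((_ _ content-fn text)
   (funcall content-fn text)))
```

The lower arities could be layered on top of this by passing an identity function for the tag and attribute arguments.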
Previous Versions
User Guide
The User Guide is available for the following versions: