Help:Data import with a script

From HackerspaceWiki

Usually users do not want to upload complete ontologies, nor does the wiki support the full expressivity of the ontology language. Furthermore, an ontology import is not a task that has to happen often: it is basically used to prime the wiki with a pre-existing data source -- something that will usually be done only once per data source. And different data sources need to be treated differently.

Therefore the most promising way to import ontologies -- and many other external data sources -- into a wiki is to write a script that reads the external data source, creates MediaWiki markup based on it, and uploads that markup to an appropriate page in the wiki. If the page already exists, the script author has to decide how to handle that case.

Let's take a look at an example. Assume we have a file with the following content:

Hydrogen, H, 1
Helium, He, 2
...

I.e., a comma-separated list of all elements with their name, chemical symbol, and element number, one entry per line. A script could parse the data line by line, create wiki text like "'''Hydrogen''' (Chemical symbol: H) is element # 1 in the [[element table]]. [[Category:Element]]", and upload that text to the page Hydrogen, assuming it does not exist yet.
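As a minimal sketch of that line-by-line transformation (the helper name make_wiki_text is illustrative, not from any library):

```python
def make_wiki_text(line):
    # split one comma-separated entry into its three fields,
    # stripping the surrounding whitespace
    name, symbol, number = [field.strip() for field in line.split(",")]
    # build the wiki markup described above
    return ("'''%s''' (Chemical symbol: %s) is element # %s in the "
            "[[element table]]. [[Category:Element]]"
            % (name, symbol, number))

print(make_wiki_text("Hydrogen, H, 1"))
# '''Hydrogen''' (Chemical symbol: H) is element # 1 in the [[element table]]. [[Category:Element]]
```

A real import script would loop over the lines of the file and pass each result to the upload library.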

For that we need a library that allows simple read and write operations on the wiki. Since Semantic MediaWiki is based on MediaWiki, we can reuse libraries created for MediaWiki for that task -- in PHP, for example, the integrated API (not finished yet), or in Python the pyWikipediaBot framework (stable and maintained).

The data source does not have to be an ontology, but ontologies often have certain advantages for data exchange: first, there are a number of available libraries for parsing ontologies in standard formats (so there is no need to write a parser); second, it is easy to map and merge data from different sources and then extract exactly the data we need.
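The merging point can be illustrated with a small, library-free sketch (the data and variable names here are made up): two sources that describe the same elements under a shared key are combined into exactly the records we need.

```python
# Hypothetical example data: one source knows the chemical symbols,
# another the element numbers, both keyed by the element name.
symbols = {"Hydrogen": "H", "Helium": "He"}
numbers = {"Hydrogen": 1, "Helium": 2}

# merge the two sources on the shared key
merged = {name: (symbols[name], numbers[name]) for name in symbols}
print(merged["Hydrogen"])
# ('H', 1)
```

With real ontologies, shared URIs play the role of the dictionary key, and an RDF library performs this merge automatically when several files are loaded into one graph.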

Here is an example with an ontology, which can also be used for testing this feature. We will use Python as the scripting language, but this is not a requirement. Let's assume the file is saved as elements.rdf.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY ex 'http://aifb.uni-karlsruhe.de/WBS/ex#'>
    <!ENTITY owl 'http://www.w3.org/2002/07/owl#'>
    <!ENTITY xsd 'http://www.w3.org/2001/XMLSchema#'>
]>

<rdf:RDF
    xml:base="http://aifb.uni-karlsruhe.de/WBS/ex#"
    xmlns:ex="http://aifb.uni-karlsruhe.de/WBS/ex#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

<owl:Ontology rdf:about="#"/>

<owl:Class rdf:about="&ex;Element" rdfs:label="Element" />

<ex:Element rdf:about="&ex;Hydrogen" rdfs:label="Hydrogen">
  <ex:elementSymbol>H</ex:elementSymbol>
  <ex:elementNumber>1</ex:elementNumber>
</ex:Element>

</rdf:RDF>

It is easy to imagine a much bigger data file. Assuming an installed and configured pyWikipediaBot framework, the following script would upload the data:

# load the required libraries
from rdflib import Graph, URIRef, Literal, Namespace, RDF
import wikipedia, login, category

# note that you need to setup the appropriate family file
family = "testwiki"

i = Graph()
# Create the required namespaces
i.bind("ex", "http://aifb.uni-karlsruhe.de/WBS/ex#")
RDF = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")
EX = Namespace("http://aifb.uni-karlsruhe.de/WBS/ex#")
# Load the file. If loaded like this, python needs to be able
# to find the file, e.g. put it in the same directory
i.load("elements.rdf")

# login to the wiki
ex = wikipedia.Site('en')
login.LoginManager('password', False, ex)

# iterates through everything that has the type Element
# (note, only explicit assertions -- rdflib does not do reasoning here!)
# iterates through everything that has the type Element
# (note, only explicit assertions -- rdflib does not do reasoning here!)
for s in i.subjects(RDF["type"], EX["Element"]):
  for name in i.objects(s, RDFS["label"]):  # reads the label
    # gets the page with that name
    page = wikipedia.Page(ex, name)
    if page.exists(): # if the page already exists
      print name + " exists already, I did not change it."
    else: # create the text for the page and upload it
      text = "'''" + name + "''' "
      for symbol in i.objects(s, EX["elementSymbol"]):
        text += "(Chemical symbol: " + symbol + ") "
      for number in i.objects(s, EX["elementNumber"]):
        text += "is element number " + number + " in the [[element table]]."
      text += "\n\n[[Category:Element]]"
      # Now that we have created the text, let's upload it
      page.put(text, 'Added from ontology')
      print "Added page " + name

# close the wikipedia library
wikipedia.stopme()
print "Script ended."

Running the script on an empty wiki with the above data file should lead to the following output on the command line:

Added page Hydrogen
Script ended.

In the wiki, a page "Hydrogen" should have been created, and the content of the page should be:

'''Hydrogen''' (Chemical symbol: H) is element number 1 in the [[element table]].

[[Category:Element]]

Running it a second time should send the following output to the command line:

Hydrogen exists already, I did not change it.
Script ended.

No changes in the wiki should have happened.

Based on this starting point, scripts and data sources may be of much higher complexity and do more clever things: for example, instead of leaving an existing page untouched, the script could analyze the existing content and try to add or update data from the data source.
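As a rough sketch of that idea (ensure_category is a hypothetical helper, not part of pyWikipediaBot), an update step could check for a missing piece of markup and append it only when absent, so repeated runs leave the page unchanged:

```python
def ensure_category(text, category):
    # hypothetical helper: append the category tag only if it is missing
    tag = "[[Category:%s]]" % category
    if tag in text:
        return text  # already categorized, leave the page text untouched
    return text.rstrip() + "\n\n" + tag

page_text = "'''Hydrogen''' (Chemical symbol: H) is element number 1."
page_text = ensure_category(page_text, "Element")
# a second call makes no further change (the update is idempotent)
assert ensure_category(page_text, "Element") == page_text
```

The same pattern extends to individual facts: fetch the existing page text, check which statements from the data source are already present, and write the page back only if something was actually added.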

A more general script based on the script described above can be found here.
