Indexing with Gleaner

Indexing with Gleaner#

compose options

Gleaner (app)#

The Gleaner applications performs the retrieval and loading of JSON-LD documents from the web following structured data on the web patterns. Gleaner is available for Linux, Mac OS X and Windows.

While Gleaner is a stand alone app, it needs to interact with an object store to support data storage and other operations. These dependencies are met within the Gleaner Indexing Services or Data Service Docker compose files.

Warning

This documentation is in development. The primary testing environments are Linux and other UNIX based platforms such as Mac OS X. If you are on Windows, there may be some issues. If you can use a Linux subsystem on Windows, you may experience better results. We will test with Windows eventually and update documentation as needed.

Quick Start steps#

This quick start guide is focused on setting up and testing Gleaner in a local environnement. It is similar to how you might run Gleaner in a production environment but lacks the routing and other features likely desired for such a situation.

Note

This documentation assumes a basic understanding of Docker and experience with basic Docker activities like starting and stopping containers. It also assumes an understanding of using a command line interface and editing configuration files in the YAML format.

Command

From this point down, the documentation will attempt to put all commands you should issue in this admonition style box.

In the end, this is the table of applications and config files you will need. In this guide we will go through downloading, setting them up and running Gleaner to index documents from the web.

Table 1 Required Applications and Their Config Files#
Gleaner	Docker	Minio Client
config.yaml	`setenv.sh`	`load2blaze.sh`
schemaorg-current-https.jsonld	gleaner-DS-NoRouter.yml

Grab Gleaner and the support files we need#

We will need to get the Gleaner binary for your platform and also the Gleaner configuration file template. To do this, visit the Gleaner Releases page and pick the release Ocean InfoHubdev rc1. Under the Assets drop down you should see the files we need. Get:

Gleaner for your platform
Gleaner config template: template_v2.0.yaml
Gleaner indexing service compose file: gleaner-IS.yml
Helper environment setup script: setenvIS.sh

For this demonstration, we will be running on linux, so this would look something like:

Command

curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/gleaner
curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/gleaner-IS.yml
curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/setenvIS.sh
curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/template_v2.0.yaml

Note

You can download these with any tool you wish or through the browser. Above we downloaded used the command line curl tool. For GitHub, be sure to add the -L to inform curl to follow redirects to the object to download.

Command

You may need to change the permission on your gleaner file to ensure it can be run. On Linux this would look something like the following.

chmod 755 gleaner

We then need to visit Schema.org for Developers to pull down the appropriate JSON-LD context. For this work we will want to pull down the schemaorg-current-https in JSON-LD format.
It also should work to do something similar to the following:

Command

curl -O https://schema.org/version/latest/schemaorg-current-https.jsonld

About the compose file(s)#

The above steps have collected the resources for the indexer. We now want to set up the services that Gleaner will use to perform the indexing. To do that we use Docker or an appropriate run time alternative like Podman or others. For this example, we will assume you are using the Docker client.

As noted, a basic understanding of Docker and the ability to issue Docker cli commands to start and stop containers is required. If you are new do Docker, we recommend you visit and read: Get Started with Docker.

We need to select the type of services we wish to run. The various versions of these Docker compose file can be found in the Gleaner-compose deployment directory.

Why pick one over the other?

Choose Gleaner IS if you simply wish to retrieve the JSON-LD into a data warehouse to use in your own workflows

Choose Gleaner DS if you wish to build out a graph and want to use the default contains used by Gleaner.

Note

We wont look at this file in detail here since there will hopefully be no required edits. You can see the file in detail in the Index Services section.

Edit environment variables setup script#

We have Docker and the appropriate compose file. The compose files require a set of environment variables to be populated to provide the local hosts information needed to run. You can set these yourself or use or reference the setenv.sh file in the Gleaner-compose repository in the
Gleaner-compose deployment directory. You may also need to visit information about permissions at Post-installation steps for Linux if you are having permission issues.

Let’s take a look at the script.

#!/bin/bash

# Object store keys
export MINIO_ACCESS_KEY=worldsbestaccesskey
export MINIO_SECRET_KEY=worldsbestsecretkey

# local data volumes
export GLEANER_BASE=/tmp/gleaner/
mkdir -p ${GLEANER_BASE}
export GLEANER_OBJECTS=${GLEANER_BASE}/datavol/s3
export GLEANER_GRAPH=${GLEANER_BASE}/datavol/graph

You may wish to edit file to work better with your environment. By default it will attempt to use localhost to resolve with and host local runtime data in a /tmp/gleaner directory.

Spin up the containers#

Load our environment variables to the shell:

Command

source setenv.sh

Then start the containers:

Command

docker-compose -f gleaner-IS.yml up -d

If all has gone well, you should be able to see your running containers with

Command

docker ps

and see results similar to:

CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS                    NAMES
c4b7097f5e06        nawer/blazegraph                 "docker-entrypoint.s…"   8 seconds ago       Up 7 seconds        0.0.0.0:9999->9999/tcp   test_triplestore_1
ca08c24963a0        minio/minio:latest               "/usr/bin/docker-ent…"   8 seconds ago       Up 7 seconds        0.0.0.0:9000->9000/tcp   test_s3system_1
24274eba0d34        chromedp/headless-shell:latest   "/headless-shell/hea…"   8 seconds ago       Up 7 seconds        0.0.0.0:9222->9222/tcp   test_headless_1

Edit Gleaner config file#

We have all the files we need and we have our support services running. The next and final step is to edit our Gleaner configuration file. This will let Gleaner know the location of the support services, the JSON-LD context file and the locations of the resources we wish to index.

Let’s take a look at the full configuration file first and then break down each section.

---
minio:
  address: 0.0.0.0
  port: 9000
  accessKey: worldsbestaccesskey      
  secretKey: worldsbestsecretkey  
  ssl: false
  bucket: gleaner
gleaner:
  runid: oih # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill: true
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
summoner:
  after: ""      # "21 May 20 10:00 UTC"   
  mode: full  # full || diff:  If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 1
  delay: 0  # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1) 
  headless: http://0.0.0.0:9222  # URL for headless see docs/headless
millers:
  graph: true
  #geojson: false
sitegraphs:
- name: aquadocs
  url: https://oih.aquadocs.org/aquadocs.json 
  headless: false
  pid: https://www.re3data.org/repository/aquadocs
  properName: AquaDocs
  domain: https://aquadocs.org 
sources:
- name: samplesearth
  url: https://samples.earth/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/samplesearth
  properName: Samples Earth (DEMO Site)
  domain: https://samples.earth  
- name: marinetraining
  url: https://www.marinetraining.eu/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/marinetraining
  properName: Marine Training EU
  domain: https://marinetraining.eu/
- name: marineie
  url: http://data.marine.ie/geonetwork/srv/eng/portal.sitemap
  headless: true
  pid: https://www.re3data.org/repository/marineie
  properName: Marine Institute Data Catalogue
  domain: http://data.marine.ie
- name: oceanexperts
  url: https://oceanexpert.org/assets/sitemaps/sitemapTraining.xml
  headless: false
  pid: https://www.re3data.org/repository/oceanexpert
  properName: OceanExpert UNESCO/IOC Project Office for IODE 
  domain: https://oceanexpert.org/
# - name: obis
#   url: https://obis.org/sitemap/sitemap_datasets.xml
#   headless: false
#   pid: https://www.re3data.org/repository/obis
#   properName: Ocean Biodiversity Information System
#   domain: https://obis.org  

Object store#

minio:
  address: 0.0.0.0
  port: 9000
  accessKey: worldsbestaccesskey      
  secretKey: worldsbestsecretkey  
  ssl: false
  bucket: gleaner

The minio section defines the IP and port of the object store. For this case, we are using minio and these are the IP and port from our docker compose steps above. Note, if you were to use Ceph or AWS S3, this section is still labeled minio. You simply need to update the property values.

Gleaner#

gleaner:
  runid: oih # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill: true

This passes a few high level concpets.

runid:
summon
mill

Context sections#

context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld

Comments for the context sections

Summoner section#

summoner:
  after: ""      # "21 May 20 10:00 UTC"   
  mode: full  # full || diff:  If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 1
  delay: 0  # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1) 
  headless: http://0.0.0.0:9222  # URL for headless see docs/headless

Comments for the summoner sections

Millers section#

millers:
  graph: true
  #geojson: false

Comments for the miller sections

Site graphs section#

sitegraphs:
- name: aquadocs
  url: https://oih.aquadocs.org/aquadocs.json 
  headless: false
  pid: https://www.re3data.org/repository/aquadocs
  properName: AquaDocs
  domain: https://aquadocs.org 

Comments for the sitegrpah sections

Sources section#

sources:
- name: samplesearth
  url: https://samples.earth/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/samplesearth
  properName: Samples Earth (DEMO Site)
  domain: https://samples.earth  
- name: marinetraining
  url: https://www.marinetraining.eu/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/marinetraining
  properName: Marine Training EU
  domain: https://marinetraining.eu/
- name: marineie
  url: http://data.marine.ie/geonetwork/srv/eng/portal.sitemap
  headless: true
  pid: https://www.re3data.org/repository/marineie
  properName: Marine Institute Data Catalogue
  domain: http://data.marine.ie
- name: oceanexperts
  url: https://oceanexpert.org/assets/sitemaps/sitemapTraining.xml
  headless: false
  pid: https://www.re3data.org/repository/oceanexpert
  properName: OceanExpert UNESCO/IOC Project Office for IODE 
  domain: https://oceanexpert.org/
# - name: obis
#   url: https://obis.org/sitemap/sitemap_datasets.xml
#   headless: false
#   pid: https://www.re3data.org/repository/obis
#   properName: Ocean Biodiversity Information System
#   domain: https://obis.org  

Comments for the sources sections

Run gleaner#

For this example we are going to run Gleaner directly. In a deployed instance you may run Gleaner via a script or cron style service. We will document that elsewhere.

We can do a quick test of the setup.

Command

 ./gleaner -cfg template_v2.0 -setup

For now, we are ready to run Gleaner. Try:

Command

 ./gleaner -cfg template_v2.0

Note

Leave the suffix like .yaml off the name of the config file. The config system can also read json and other formats. So simply leave the suffix off and let the config code inspect the contents.

Load results to a graph and test#

You have set up the server environment and Gleaner and done your run. Things look good but you don’t have a graph you can work with yet. You need to load the JSON-LD into the triplestore in order to start playing.

Minio Object store#

To view the object store you could use your browser and point it on the default minio port at 9000. This typically something like localhost:9000.

If you wish to continue to use the command line you can use the Minio client at Minio Client Quickstart guide.

Once you have it installed and working, you can write an entry for our object store with:

Command

 ./mc alias set minio http://0.0.0.0:9000 worldsbestaccesskey worldsbestsecretkey

Load Triplestore#

We now want to load these objects, which are JSON-LD files holding RDF based graph data, into a graph database. We use the term, triplestore, to define a graph database designed to work with the RDF data model and provide SPARQL query support over that graph data.

Simple script loading
Nabu
Try out a simple SPARQL query

References#

The following are some reference which may provide more information on the various technologies used in this approach.

Indexing with Gleaner

Contents

Indexing with Gleaner#

Gleaner (app)#

Quick Start steps#

Grab Gleaner and the support files we need#

About the compose file(s)#

Edit environment variables setup script#

Spin up the containers#

Edit Gleaner config file#

Object store#

Gleaner#

Context sections#

Summoner section#

Millers section#

Site graphs section#

Sources section#

Run gleaner#

Load results to a graph and test#

Minio Object store#

Load Triplestore#

References#