Indexing with Gleaner#


Gleaner (app)#

The Gleaner application performs the retrieval and loading of JSON-LD documents from the web, following structured data on the web patterns. Gleaner is available for Linux, Mac OS X and Windows.

While Gleaner is a stand-alone app, it needs to interact with an object store to support data storage and other operations. These dependencies are met by the Gleaner Indexing Services (IS) or Data Services (DS) Docker compose files.

Warning

This documentation is in development. The primary testing environments are Linux and other UNIX based platforms such as Mac OS X. If you are on Windows, there may be some issues. If you can use a Linux subsystem on Windows, you may experience better results. We will test with Windows eventually and update documentation as needed.

Quick Start steps#

This quick start guide is focused on setting up and testing Gleaner in a local environment. It is similar to how you might run Gleaner in a production environment but lacks the routing and other features likely desired for such a situation.

Note

This documentation assumes a basic understanding of Docker and experience with basic Docker activities like starting and stopping containers. It also assumes an understanding of using a command line interface and editing configuration files in the YAML format.

Command

From this point down, the documentation will attempt to put all commands you should issue in this admonition style box.

The table below lists the applications and config files you will need. In this guide we will go through downloading them, setting them up and running Gleaner to index documents from the web.

Table 1 Required Applications and Their Config Files#

Gleaner: config.yaml, schemaorg-current-https.jsonld

Docker: setenv.sh, gleaner-DS-NoRouter.yml

Minio Client: load2blaze.sh

Grab Gleaner and the support files we need#

We will need to get the Gleaner binary for your platform and also the Gleaner configuration file template. To do this, visit the Gleaner Releases page and pick the release Ocean InfoHub dev rc1. Under the Assets drop-down you should see the files we need. Get:

  • Gleaner for your platform

  • Gleaner config template: template_v2.0.yaml

  • Gleaner indexing service compose file: gleaner-IS.yml

  • Helper environment setup script: setenvIS.sh

For this demonstration, we will be running on Linux, so this would look something like:

Command

curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/gleaner
curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/gleaner-IS.yml
curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/setenvIS.sh
curl -L -O https://github.com/earthcubearchitecture-project418/gleaner/releases/download/2.0.25/template_v2.0.yaml

Note

You can download these with any tool you wish or through the browser. Above we used the command line curl tool. For GitHub, be sure to add the -L flag to tell curl to follow redirects to the object being downloaded.

Command

You may need to change the permissions on your gleaner file to ensure it can be run. On Linux this would look something like the following.

chmod 755 gleaner

We then need to visit Schema.org for Developers to pull down the appropriate JSON-LD context. For this work we want the schemaorg-current-https context in JSON-LD format.
It should also work to do something similar to the following:

Command

curl -O https://schema.org/version/latest/schemaorg-current-https.jsonld
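
If you want to sanity check the download, and assuming you have the jq tool installed (it is not required by Gleaner), you can confirm the file parses as JSON:

Command

jq empty schemaorg-current-https.jsonld   # silent, with exit status 0, if the file is valid JSON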

About the compose file(s)#

The above steps have collected the resources for the indexer. We now want to set up the services that Gleaner will use to perform the indexing. To do that we use Docker or an appropriate runtime alternative such as Podman. For this example, we will assume you are using the Docker client.

As noted, a basic understanding of Docker and the ability to issue Docker CLI commands to start and stop containers is required. If you are new to Docker, we recommend you visit and read: Get Started with Docker.

We need to select the type of services we wish to run. The various versions of these Docker compose files can be found in the Gleaner-compose deployment directory.

Why pick one over the other?

Choose Gleaner IS if you simply wish to retrieve the JSON-LD into a data warehouse to use in your own workflows.

Choose Gleaner DS if you wish to build out a graph and want to use the default containers used by Gleaner.

Note

We won't look at this file in detail here since there will hopefully be no required edits. You can see the file in detail in the Index Services section.

Edit environment variables setup script#

We have Docker and the appropriate compose file. The compose files require a set of environment variables to be populated to provide the local host information needed to run. You can set these yourself or use or adapt the setenv.sh file in the Gleaner-compose repository, in the
Gleaner-compose deployment directory. You may also need to review the permissions information at Post-installation steps for Linux if you are having permission issues.

Let’s take a look at the script.

#!/bin/bash

# Object store keys
export MINIO_ACCESS_KEY=worldsbestaccesskey
export MINIO_SECRET_KEY=worldsbestsecretkey

# local data volumes
export GLEANER_BASE=/tmp/gleaner/
mkdir -p ${GLEANER_BASE}
export GLEANER_OBJECTS=${GLEANER_BASE}/datavol/s3
export GLEANER_GRAPH=${GLEANER_BASE}/datavol/graph

You may wish to edit this file to work better with your environment. By default it will resolve against localhost and keep local runtime data in a /tmp/gleaner directory.
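
If /tmp is not a good home for runtime data on your system, a minimal edit might point GLEANER_BASE somewhere more durable; the directory used below is only an example:

# example only: keep runtime data under the current user's home directory
export GLEANER_BASE=${HOME}/gleaner-data/
mkdir -p ${GLEANER_BASE}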

Spin up the containers#

Load our environment variables to the shell:

Command

source setenv.sh
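
You can verify the variables are now set in your shell with a quick check:

Command

env | grep -E 'MINIO|GLEANER'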

Then start the containers:

Command

docker-compose -f gleaner-IS.yml up -d

If all has gone well, you should be able to see your running containers with

Command

docker ps

and see results similar to:

CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS                    NAMES
c4b7097f5e06        nawer/blazegraph                 "docker-entrypoint.s…"   8 seconds ago       Up 7 seconds        0.0.0.0:9999->9999/tcp   test_triplestore_1
ca08c24963a0        minio/minio:latest               "/usr/bin/docker-ent…"   8 seconds ago       Up 7 seconds        0.0.0.0:9000->9000/tcp   test_s3system_1
24274eba0d34        chromedp/headless-shell:latest   "/headless-shell/hea…"   8 seconds ago       Up 7 seconds        0.0.0.0:9222->9222/tcp   test_headless_1
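
Later, when you want to shut the services down, the matching command stops and removes the same containers:

Command

docker-compose -f gleaner-IS.yml down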

Edit Gleaner config file#

We have all the files we need and we have our support services running. The next and final step is to edit our Gleaner configuration file. This will let Gleaner know the location of the support services, the JSON-LD context file and the locations of the resources we wish to index.

Let’s take a look at the full configuration file first and then break down each section.

---
minio:
  address: 0.0.0.0
  port: 9000
  accessKey: worldsbestaccesskey
  secretKey: worldsbestsecretkey
  ssl: false
  bucket: gleaner
gleaner:
  runid: oih # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill: true
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
summoner:
  after: ""      # "21 May 20 10:00 UTC"
  mode: full  # full || diff:  If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 1
  delay: 0  # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1)
  headless: http://0.0.0.0:9222  # URL for headless see docs/headless
millers:
  graph: true
  #geojson: false
sitegraphs:
- name: aquadocs
  url: https://oih.aquadocs.org/aquadocs.json
  headless: false
  pid: https://www.re3data.org/repository/aquadocs
  properName: AquaDocs
  domain: https://aquadocs.org
sources:
- name: samplesearth
  url: https://samples.earth/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/samplesearth
  properName: Samples Earth (DEMO Site)
  domain: https://samples.earth
- name: marinetraining
  url: https://www.marinetraining.eu/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/marinetraining
  properName: Marine Training EU
  domain: https://marinetraining.eu/
- name: marineie
  url: http://data.marine.ie/geonetwork/srv/eng/portal.sitemap
  headless: true
  pid: https://www.re3data.org/repository/marineie
  properName: Marine Institute Data Catalogue
  domain: http://data.marine.ie
- name: oceanexperts
  url: https://oceanexpert.org/assets/sitemaps/sitemapTraining.xml
  headless: false
  pid: https://www.re3data.org/repository/oceanexpert
  properName: OceanExpert UNESCO/IOC Project Office for IODE
  domain: https://oceanexpert.org/
# - name: obis
#   url: https://obis.org/sitemap/sitemap_datasets.xml
#   headless: false
#   pid: https://www.re3data.org/repository/obis
#   properName: Ocean Biodiversity Information System
#   domain: https://obis.org
Object store#
minio:
  address: 0.0.0.0
  port: 9000
  accessKey: worldsbestaccesskey
  secretKey: worldsbestsecretkey
  ssl: false
  bucket: gleaner

The minio section defines the IP and port of the object store. For this case, we are using minio and these are the IP and port from our docker compose steps above. Note, if you were to use Ceph or AWS S3, this section is still labeled minio. You simply need to update the property values.
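
A quick way to confirm Gleaner will be able to reach the object store at this address and port is to hit Minio's health endpoint. This assumes the Minio image used in the compose file exposes the standard health endpoint:

Command

curl -I http://0.0.0.0:9000/minio/health/live   # expect an HTTP 200 response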

Gleaner#
gleaner:
  runid: oih # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill: true

This sets a few high-level options:

  • runid: a name for this run; the output objects are grouped under this identifier in the object store

  • summon: whether to visit the web sites and pull down the JSON-LD documents

  • mill: whether to process (mill) the retrieved documents into the products selected in the millers section

Context sections#
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld

The context section controls caching of the JSON-LD context, and the contextmaps entries map both the http and https schema.org prefixes to a local copy of the context document so it does not have to be fetched for every document processed. Make sure the file entries point at the context file you downloaded earlier (for example ./schemaorg-current-https.jsonld).

Summoner section#
summoner:
  after: ""      # "21 May 20 10:00 UTC"
  mode: full  # full || diff:  If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 1
  delay: 0  # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1)
  headless: http://0.0.0.0:9222  # URL for headless see docs/headless

The summoner section controls the retrieval (summon) stage: after restricts indexing to resources changed after the given date, mode selects a full or diff crawl (diff compares the sitemap against what is already in Gleaner, fetching only new documents and removing missing ones), threads and delay control concurrency and pacing of requests (a non-zero delay forces threads to 1), and headless is the URL of the headless Chrome service used for pages that require JavaScript rendering to expose their JSON-LD.
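
If summoning stalls on headless sources, it is worth confirming the headless Chrome service from the compose file is actually reachable at the URL above. This assumes the chromedp/headless-shell container exposes the standard DevTools HTTP endpoint:

Command

curl http://0.0.0.0:9222/json/version   # should return a small JSON document describing the browser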

Millers section#
millers:
  graph: true
  #geojson: false

The millers section selects which products Gleaner generates from the retrieved documents. Here only the graph miller is enabled; the geojson miller is commented out.

Site graphs section#
sitegraphs:
- name: aquadocs
  url: https://oih.aquadocs.org/aquadocs.json
  headless: false
  pid: https://www.re3data.org/repository/aquadocs
  properName: AquaDocs
  domain: https://aquadocs.org

The sitegraphs section lists providers that publish a single pre-built site graph, one JSON-LD document describing the whole holding (such as the AquaDocs example above), rather than a sitemap of individual pages.

Sources section#
sources:
- name: samplesearth
  url: https://samples.earth/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/samplesearth
  properName: Samples Earth (DEMO Site)
  domain: https://samples.earth
- name: marinetraining
  url: https://www.marinetraining.eu/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/marinetraining
  properName: Marine Training EU
  domain: https://marinetraining.eu/
- name: marineie
  url: http://data.marine.ie/geonetwork/srv/eng/portal.sitemap
  headless: true
  pid: https://www.re3data.org/repository/marineie
  properName: Marine Institute Data Catalogue
  domain: http://data.marine.ie
- name: oceanexperts
  url: https://oceanexpert.org/assets/sitemaps/sitemapTraining.xml
  headless: false
  pid: https://www.re3data.org/repository/oceanexpert
  properName: OceanExpert UNESCO/IOC Project Office for IODE
  domain: https://oceanexpert.org/
# - name: obis
#   url: https://obis.org/sitemap/sitemap_datasets.xml
#   headless: false
#   pid: https://www.re3data.org/repository/obis
#   properName: Ocean Biodiversity Information System
#   domain: https://obis.org

The sources section lists the providers to index from their sitemaps. For each source, url points at the sitemap, headless indicates whether pages need the headless browser to render their JSON-LD, pid is a persistent identifier for the repository (re3data records in these examples), properName is a human readable name, and domain is the provider's base URL. The obis entry is commented out and will be skipped.
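
Before a long run it can be worth confirming that a source's sitemap actually resolves. For example, for the samplesearth entry above:

Command

curl -s https://samples.earth/sitemap.xml | head -n 5   # should show the start of the sitemap XML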

Run gleaner#

For this example we are going to run Gleaner directly. In a deployed instance you may run Gleaner via a script or cron style service. We will document that elsewhere.

We can do a quick test of the setup.

Command

 ./gleaner -cfg template_v2.0 -setup

For now, we are ready to run Gleaner. Try:

Command

 ./gleaner -cfg template_v2.0

Note

Leave the suffix (e.g. .yaml) off the name of the config file. The config system can also read JSON and other formats, so simply leave the suffix off and let the config code inspect the contents.

Load results to a graph and test#

You have set up the server environment and Gleaner and done your run. Things look good but you don’t have a graph you can work with yet. You need to load the JSON-LD into the triplestore in order to start playing.

Minio Object store#

To view the object store you can point your browser at the default Minio port, 9000. This is typically something like localhost:9000.

If you wish to continue to use the command line, you can use the Minio Client; see the Minio Client Quickstart guide.

Once you have it installed and working, you can register an alias for our object store with:

Command

 ./mc alias set minio http://0.0.0.0:9000 worldsbestaccesskey worldsbestsecretkey
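
With the alias in place, you can list what a run produced in the gleaner bucket named in the config file:

Command

./mc ls --recursive minio/gleaner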

Load Triplestore#

We now want to load these objects, which are JSON-LD files holding RDF-based graph data, into a graph database. We use the term triplestore for a graph database designed to work with the RDF data model and provide SPARQL query support over that graph data.

  • Simple script loading

  • Nabu

  • Try out a simple SPARQL query
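
Once data has been loaded, for example with the load2blaze.sh helper listed in Table 1, a quick SPARQL count over HTTP confirms the triplestore is answering queries. The endpoint path shown here (/blazegraph/namespace/kb/sparql) is the Blazegraph default and may differ for your image or configuration:

Command

curl http://0.0.0.0:9999/blazegraph/namespace/kb/sparql \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"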

References#

The following are some references which may provide more information on the various technologies used in this approach.