Resolves transitive relations in a graph and removes cycles.
Maps old URIs in triple files to new URIs: reads one or more triple files that contain the URI mapping (e.g. owl:sameAs interlanguage links) and rewrites the URIs in the given datasets accordingly.
As of DBpedia releases 3.8 and 3.9, the following datasets should be canonicalized:
article-categories category-labels disambiguations disambiguations-redirected external-links geo-coordinates homepages images infobox-properties infobox-properties-redirected infobox-property-definitions instance-types labels long-abstracts mappingbased-properties mappingbased-properties-redirected page-in-link-counts (redirected) page-links page-links-redirected page-out-link-counts (redirected) persondata persondata-redirected pnd short-abstracts skos-categories specific-mappingbased-properties
Example calls: ../run CanonicalizeUris /data/dbpedia interlanguage-links .nt.gz labels,short-abstracts,long-abstracts -en-uris .nt.gz,.nq.gz en en 10000-
../run CanonicalizeUris /data/dbpedia interlanguage-links .nt.gz article-categories,category-labels,disambiguations,disambiguations-redirected,external-links,geo-coordinates,homepages,images,infobox-properties,infobox-properties-redirected,infobox-property-definitions,instance-types,labels,long-abstracts,mappingbased-properties,mappingbased-properties-redirected,page-in-link-counts,page-in-link-counts-redirected,page-links,page-links-redirected,page-out-link-counts,persondata,persondata-redirected,pnd,short-abstracts,skos-categories,specific-mappingbased-properties -en-uris .nt.gz,.nq.gz en en 10000-
TODO: merge with MapObjectUris?
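The core of the canonicalization step is a plain dictionary substitution over both URI positions. A minimal Python sketch, not the actual Scala implementation (`canonicalize` and its arguments are hypothetical names):

```python
def canonicalize(triples, mapping):
    """Rewrite subject and object URIs through `mapping`.

    `mapping` holds language-specific URI -> canonical URI pairs, as read
    from an interlanguage-links dataset. In this sketch, URIs without a
    mapping entry pass through unchanged; literals are never mapping keys,
    so they pass through as well.
    """
    for s, p, o in triples:
        yield mapping.get(s, s), p, mapping.get(o, o)
```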
Example call: ../run CountTypes /data/dbpedia instance-types .ttl.gz instance-types-counted.txt true 10000-
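A types-counted report boils down to tallying the objects of rdf:type triples. A Python sketch of the idea (function name hypothetical; the real tool is Scala):

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def count_types(triples):
    """Tally how often each class occurs as the object of an rdf:type
    triple; the result can be written out as a types-counted report."""
    return Counter(o for s, p, o in triples if p == RDF_TYPE)
```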
Generate Wacko Wiki source text for http://wiki.dbpedia.org/Downloads and all its sub pages.
Example call:
../run CreateDownloadPage src/main/data/lines-bytes-packed.txt
Create a dataset file with owl:sameAs links to Freebase.
Example calls:
URIs and N-Triples escaping: ../run CreateFreebaseLinks /data/dbpedia .nt.gz freebase-rdf-<date>.gz freebase-links.nt.gz
IRIs and Turtle escaping: ../run CreateFreebaseLinks /data/dbpedia .ttl.gz freebase-rdf-<date>.gz freebase-links.ttl.gz
See https://developers.google.com/freebase/data for a reference of the Freebase RDF data dumps
Encodes non-ASCII chars in N-Triples files.
Example call: ../run DecodeHtmlEntities /data/dbpedia/links gutenberg _fixed _links.nt.gz
Example call: ../run DecodeHtmlText /data/dbpedia labels,short-abstracts,long-abstracts -fixed .nt.gz false 10000-
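The decoding itself is standard HTML-entity replacement; in Python it is a one-liner via the standard library (a sketch of the idea, not the tool's actual Scala code; the function name is hypothetical):

```python
import html

def decode_entities(literal: str) -> str:
    """Replace HTML entities such as &amp; or &#233; with the
    characters they stand for."""
    return html.unescape(literal)
```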
Encodes non-ASCII chars in N-Triples files. DOES NOT ESCAPE DOUBLE QUOTES (") AND BACKSLASHES (\) - we assume that the file is mostly in correct N-Triples format and just contains a few non-ASCII chars.
Example call: ../run FixNTriplesEncoding /data/dbpedia/links bbcwildlife,italian-public-schools _fixed _links.nt.gz
Generates a list of existing abstracts as a TSV file.
Maps old object URIs in triple files to new object URIs: reads one or more triple files that contain the URI mapping (e.g. transitive redirects) and rewrites the object URIs in the given datasets accordingly.
Redirects SHOULD be resolved in the following datasets:
disambiguations infobox-properties mappingbased-properties page-links persondata topical-concepts
Redirects seem to be so rare in categories that it doesn't make sense to resolve these:
article-categories skos-categories
The following datasets DO NOT have object URIs that can be redirected:
category-labels external-links flickr-wrappr-links geo-coordinates homepages images infobox-property-definitions infobox-test instance-types iri-same-as-uri labels specific-mappingbased-properties
Maybe we should resolve redirects in interlanguage-links, but we would have to integrate redirect resolution into interlanguage link resolution. We're pretty strict when we generate interlanguage-links-same-as. If we resolve redirects in interlanguage-links, we would probably gain a few interlanguage-links (tenths of a percent), but we would not eliminate errors.
Example call: ../run MapObjectUris /data/dbpedia transitive-redirects .nt.gz infobox-properties,mappingbased-properties,... -redirected .nt.gz,.nq.gz 10000-
The following should be redirected (as of 2012-07-11): disambiguations,infobox-properties,mappingbased-properties,page-links,persondata,topical-concepts (specific-mappingbased-properties is not necessary, as it has only literal values)
TODO: merge with CanonicalizeUris?
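Object-URI mapping is the same dictionary substitution as canonicalization, restricted to the object position. A Python sketch (names hypothetical; the real tool is Scala):

```python
def map_object_uris(triples, mapping):
    """Apply a URI mapping (e.g. a transitive-redirects map) to the
    object position only; subjects and predicates stay untouched.
    Literal objects never appear as mapping keys, so they pass through.
    """
    for s, p, o in triples:
        yield s, p, mapping.get(o, o)
```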
Maps old subject URIs in triple files to new subject URIs: reads one or more triple files that contain the URI mapping (e.g. transitive redirects) and rewrites the subject URIs in the given datasets accordingly.
Redirects in subject position SHOULD be resolved in the following datasets:
anchor-text
This is because during the extraction, the targets of the links are used as subjects in the anchor texts dataset and thus no redirect resolving is performed on them.
Example call: ../run MapSubjectUris /data/dbpedia transitive-redirects .nt.gz anchor-text -redirected .nt.gz,.nq.gz 10000-
Split inter-language links into bidirectional and unidirectional links.
Example calls:
'fr,de' means specific languages. '-' means no language uses the generic domain. '-fr-de' is the file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as-fr-de.ttl.gz and enwiki-20120601-interlanguage-links-see-also-fr-de.ttl.gz: ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-fr-de.txt.gz -fr-de .ttl.gz - fr,de
'10000-' means languages are selected by article count range. 'en' uses the generic domain. '-' means no file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as.ttl.gz: ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-ttl.txt.gz - .ttl.gz en 10000-
'-' means don't write a dump file: ../run ProcessInterLanguageLinks /data/dbpedia - -fr-de .ttl.gz - fr,de
If no languages are given, links are read from the dump file, not from the triple files: ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links.txt.gz - .ttl.gz
Generate links for all DBpedia I18N chapters in NT format. '-chapters' is the file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as-chapters.nt.gz: ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-chapters-nt.txt.gz -chapters .nt.gz en cs,en,fr,de,el,it,ja,ko,pl,pt,ru,es
Generate links for all languages that have a namespace on mappings.dbpedia.org in NT format. '-mapped' is the file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as-mapped.nt.gz: ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-mapped-nt.txt.gz -mapped .nt.gz en @mappings
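The split itself reduces to a symmetry check on the link set: a link becomes bidirectional only if its reverse also exists. A Python sketch of the idea, assuming links are given as (source page, target page) pairs (names hypothetical; the real tool is Scala):

```python
def split_links(links):
    """Split directed interlanguage links into bidirectional pairs
    (candidates for owl:sameAs) and one-way links (see-also)."""
    link_set = set(links)
    same_as = {(a, b) for (a, b) in link_set if (b, a) in link_set}
    see_also = link_set - same_as
    return same_as, see_also
```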
Generate separate triple files for each language from Wikidata link file.
Input format:
ips_row_id  ips_item_id  ips_site_id  ips_site_page
55          3596065      abwiki       Џьгьарда
56          3596037      abwiki       Џьырхәа
58          3596033      abwiki       Аацы
...
17374035    868895       zh_yuewiki   As50
17374052    552300       zh_yuewiki   Baa, Baa, Black Sheep
17374062    813114       zh_yuewiki   Beatcollective
...
257464664   3176059      frwiki       Jeanne Granier
257464665   3176059      enwiki       Jeanne Granier
257464677   7465026      ptwiki       36558 2000 QP105
257464678   8275441      frwiki       Catégorie:Îles Auckland
...
Fields are tab-separated. Titles contain spaces. Lines are sorted by row ID. For us this means they're basically random.
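Because the rows arrive in effectively random order, links have to be grouped by item id before per-language files can be written. A Python sketch of the parsing step (function name hypothetical; the real tool is Scala):

```python
from collections import defaultdict

def group_by_item(lines):
    """Group wikilinks table rows by Wikidata item id.

    Each line holds ips_row_id, ips_item_id, ips_site_id, ips_site_page,
    tab-separated; titles may contain spaces, which is why only tabs are
    valid field separators.
    """
    items = defaultdict(dict)
    for line in lines:
        _row_id, item_id, site, title = line.rstrip("\n").split("\t")
        items[item_id][site] = title
    return items
```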
Example call:
../run ProcessWikidataLinks process.wikidata.links.properties
Maps old quads/triples to new quads/triples.
Decodes DBpedia URIs that percent-encode too many characters and encodes them following our new rules.
Example call: ../run RecodeUris /data/dbpedia/links .nt.gz _fixed.nt.gz bbcwildlife.nt.gz,bookmashup.nt.gz
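The decode-then-re-encode round trip can be sketched with the standard library. Note that the `safe` character set below is purely illustrative; DBpedia's actual rules for which characters stay unescaped are defined in the extraction framework, not here:

```python
from urllib.parse import quote, unquote

def recode_uri(uri: str) -> str:
    """Decode an over-percent-encoded URI, then re-encode it with a
    more permissive safe set (illustrative, not DBpedia's exact rules)."""
    decoded = unquote(uri)
    return quote(decoded, safe=":/?#[]@!$&'()*+,;=-._~")
```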
Removes HTML tags from the property values of generated triples. This might help if there are some tags left after the abstract extraction.
Example call: ../run DecodeHtmlText /data/dbpedia short-abstracts,long-abstracts -tagstripped .nt.gz 10000-
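Since the values are plain abstracts with a few stray tags rather than full HTML documents, a crude tag-stripping pass is enough. A Python sketch (name hypothetical; the real tool is Scala):

```python
import re

TAG = re.compile(r"<[^>]+>")

def strip_tags(literal: str) -> str:
    """Remove leftover HTML tags from a property value."""
    return TAG.sub("", literal)
```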
Replace triples in a dataset by their transitive closure. All triples must use the same property. Cycles are removed.
Example call: ../run ResolveTransitiveLinks /data/dbpedia redirects transitive-redirects .nt.gz 10000-
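For example, with redirects A -> B and B -> C, the output contains A -> C and B -> C, while a chain that runs into a cycle (e.g. X -> Y -> X) is dropped entirely. A Python sketch of that resolution (name hypothetical; the real tool is Scala):

```python
def resolve_transitive(links: dict) -> dict:
    """Replace each link by its final transitive target; drop cycles.

    `links` maps source -> target, all under the same implicit property
    (e.g. a redirects map). Returns source -> final target, omitting any
    source whose chain ends up in a cycle.
    """
    resolved = {}
    for start in links:
        seen = {start}
        target = links[start]
        while target in links:
            if target in seen:      # cycle detected: discard this chain
                target = None
                break
            seen.add(target)
            target = links[target]
        if target is not None:
            resolved[start] = target
    return resolved
```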
Generates an SQL import file that contains all cache items.
Example call: ../run UnmodifiedFeederCacheGenerator /data/dbpedia .nt.gz 2013-02-01 en
Generates language links from the Wikidata sameAs dataset as created by the org.dbpedia.extraction.mappings.WikidataSameAsExtractor. This code assumes the subjects to be ordered, in particular, it assumes that there is *exactly* one continuous block for each subject.
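The one-continuous-block-per-subject assumption is what allows streaming: consecutive runs can be grouped without buffering the whole dataset. A Python sketch of that grouping (name hypothetical; the real tool is Scala):

```python
from itertools import groupby

def blocks_by_subject(quads):
    """Yield (subject, block) pairs, relying on the input being sorted so
    that each subject forms exactly one continuous run of quads."""
    for subject, run in groupby(quads, key=lambda q: q[0]):
        yield subject, list(run)
```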
Split multistream Wikipedia dumps (e.g. [1]) into chunks of configurable size. Note: this script only works with multistream dumps!
[1] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
Usage: ../run WikipediaDumpSplitter /path/to/multistream/dump/enwiki-latest-pages-articles-multistream.xml.bz2 /path/to/multistream/dump/index/enwiki-latest-pages-articles-multistream-index.txt.bz2 /output/directory 64
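The multistream index lists the byte offset of each embedded bz2 stream (as offset:page-id:title lines), so chunking amounts to grouping consecutive stream offsets into byte ranges below the size limit. A Python sketch of that planning step only (the real script additionally copies the raw bz2 streams; names are hypothetical):

```python
def plan_chunks(offsets, total_size, max_chunk_bytes):
    """Group bz2 stream offsets (from the multistream index) into chunks
    of at most `max_chunk_bytes` each. Returns (start, end) byte ranges;
    a single stream larger than the limit is kept whole."""
    bounds = sorted(set(offsets)) + [total_size]
    chunks = []
    start = bounds[0]
    for prev, nxt in zip(bounds, bounds[1:]):
        # close the current chunk just before the stream that would
        # push it over the limit
        if nxt - start > max_chunk_bytes and prev > start:
            chunks.append((start, prev))
            start = prev
    chunks.append((start, bounds[-1]))
    return chunks
```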
Escapes a Unicode string according to Turtle / N-Triples format. DOES NOT ESCAPE DOUBLE QUOTES (") AND BACKSLASHES (\) - we assume that the file is mostly in correct N-Triples format and just contains a few non-ASCII chars.
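The escaping rule described above can be sketched as follows. This is a Python illustration, not the framework's Scala Turtle utilities, and it handles only the non-ASCII case described here (not control characters):

```python
def escape_non_ascii(s: str) -> str:
    """Escape non-ASCII characters as \\uXXXX / \\UXXXXXXXX sequences.

    Deliberately leaves double quotes and backslashes untouched, matching
    the behaviour described above.
    """
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)              # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)  # basic-plane escape
        else:
            out.append("\\U%08X" % cp)  # supplementary-plane escape
    return "".join(out)
```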