package org.dbpedia.extraction.scripts


Type Members

  1. class Counter extends AnyRef

  2. class DateFinder[T] extends AnyRef

  3. class ProcessInterLanguageLinks extends AnyRef

  4. class ProcessWikidataLinks extends AnyRef

  5. class TurtleEscaper extends AnyRef

    Escapes a Unicode string according to Turtle / N-Triples format. Does not escape double quotes (") or backslashes (\): we assume that the file is mostly in correct N-Triples format and just contains a few non-ASCII chars.
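    The escaping rule can be sketched as follows (a hypothetical Python re-implementation for illustration, not the Scala class itself):

```python
def turtle_escape(s: str) -> str:
    """Escape non-ASCII chars as \\uXXXX / \\UXXXXXXXX for Turtle / N-Triples.

    Like the TurtleEscaper described above, this deliberately does NOT
    escape double quotes or backslashes.
    """
    out = []
    for ch in s:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F:       # printable ASCII: copy unchanged
            out.append(ch)
        elif cp <= 0xFFFF:          # Basic Multilingual Plane
            out.append("\\u%04X" % cp)
        else:                       # supplementary planes
            out.append("\\U%08X" % cp)
    return "".join(out)
```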

  6. class WikidataSameAsToLanguageLinks extends AnyRef

Value Members

  1. object CanonicalizeUris

    Maps old URIs in triple files to new URIs:

    • read one or more triple files that contain the URI mapping:
      • the predicate is ignored
      • only triples whose object URI has the target domain are used
    • read one or more files that need their URIs changed:
      • DBpedia URIs in subject, predicate or object position are mapped
      • non-DBpedia URIs and literal values are copied unchanged
      • triples containing DBpedia URIs in subject or object position that cannot be mapped are discarded
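    A minimal Python sketch of the two phases above (hypothetical function and argument names; the real tool streams compressed triple files):

```python
def build_mapping(mapping_triples, target_domain):
    # Phase 1: the predicate is ignored; only triples whose object URI
    # has the target domain are used.
    mapping = {}
    for subj, _pred, obj in mapping_triples:
        if obj.startswith("http://" + target_domain + "/"):
            mapping[subj] = obj
    return mapping

def canonicalize(triples, mapping, old_prefix):
    # Phase 2: DBpedia URIs are mapped (predicates would be handled the
    # same way); literals and non-DBpedia URIs are copied unchanged;
    # triples with an unmappable DBpedia subject or object are discarded.
    for subj, pred, obj in triples:
        if subj.startswith(old_prefix):
            if subj not in mapping:
                continue
            subj = mapping[subj]
        if obj.startswith(old_prefix):
            if obj not in mapping:
                continue
            obj = mapping[obj]
        yield subj, pred, obj
```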

    As of DBpedia release 3.8 and 3.9, the following datasets should be canonicalized:

    article-categories, category-labels, disambiguations, disambiguations-redirected, external-links, geo-coordinates, homepages, images, infobox-properties, infobox-properties-redirected, infobox-property-definitions, instance-types, labels, long-abstracts, mappingbased-properties, mappingbased-properties-redirected, page-in-link-counts (redirected), page-links, page-links-redirected, page-out-link-counts (redirected), persondata, persondata-redirected, pnd, short-abstracts, skos-categories, specific-mappingbased-properties

    Example calls:

    ../run CanonicalizeUris /data/dbpedia interlanguage-links .nt.gz labels,short-abstracts,long-abstracts -en-uris .nt.gz,.nq.gz en en 10000-

    ../run CanonicalizeUris /data/dbpedia interlanguage-links .nt.gz article-categories,category-labels,disambiguations,disambiguations-redirected,external-links,geo-coordinates,homepages,images,infobox-properties,infobox-properties-redirected,infobox-property-definitions,instance-types,labels,long-abstracts,mappingbased-properties,mappingbased-properties-redirected,page-links,page-in-link-counts,page-in-link-counts-redirected,page-links,page-links-redirected,page-out-link-counts,persondata,persondata-redirected,pnd,short-abstracts,skos-categories,specific-mappingbased-properties -en-uris .nt.gz,.nq.gz en en 10000-

    TODO: merge with MapObjectUris?

  2. object CountTypes

    Example call: ../run CountTypes /data/dbpedia instance-types .ttl.gz instance-types-counted.txt true 10000-

  3. object CreateDownloadPage

    Generate Wacko Wiki source text for http://wiki.dbpedia.org/Downloads and all its sub pages.

    Example call:

    ../run CreateDownloadPage src/main/data/lines-bytes-packed.txt

  4. object CreateFlickrWrapprLinks

  5. object CreateFreebaseLinks

    Create a dataset file with owl:sameAs links to Freebase.

    Example calls:

    URIs and N-Triples escaping:
    ../run CreateFreebaseLinks /data/dbpedia .nt.gz freebase-rdf-<date>.gz freebase-links.nt.gz

    IRIs and Turtle escaping:
    ../run CreateFreebaseLinks /data/dbpedia .ttl.gz freebase-rdf-<date>.gz freebase-links.ttl.gz

    See https://developers.google.com/freebase/data for a reference of the Freebase RDF data dumps

  6. object CreateIriSameAsUriLinks

  7. object DecodeHtmlEntities

    Decodes HTML entities in N-Triples files.

    Example call: ../run DecodeHtmlEntities /data/dbpedia/links gutenberg _fixed _links.nt.gz

  8. object DecodeHtmlText

    Example call: ../run DecodeHtmlText /data/dbpedia labels,short-abstracts,long-abstracts -fixed .nt.gz false 10000-

  9. object FixNTriplesEncoding

    Encodes non-ASCII chars in N-Triples files. Does not escape double quotes (") or backslashes (\): we assume that the file is mostly in correct N-Triples format and just contains a few non-ASCII chars.

    Example call: ../run FixNTriplesEncoding /data/dbpedia/links bbcwildlife,italian-public-schools _fixed _links.nt.gz

  10. object GenerateListOfExistingAbstracts

    Generates list of existing abstracts as a TSV file.

  11. object MapObjectUris

    Maps old object URIs in triple files to new object URIs:

    • read one or more triple files that contain the URI mapping:
      • the predicate is ignored
    • read one or more files that need their object URI changed:
      • the predicate is ignored
      • literal values and quads without a mapping for the object URI are copied unchanged
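    The mapping step can be sketched like this (hypothetical Python; the real tool streams compressed triple/quad files):

```python
def map_object_uris(triples, mapping):
    # Only the object position is rewritten; literal values and object
    # URIs without a mapping are copied unchanged.
    for subj, pred, obj in triples:
        yield subj, pred, mapping.get(obj, obj)
```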

    Redirects SHOULD be resolved in the following datasets:

    disambiguations, infobox-properties, mappingbased-properties, page-links, persondata, topical-concepts

    Redirects seem to be so rare in categories that it doesn't make sense to resolve these:

    article-categories, skos-categories

    The following datasets DO NOT have object URIs that can be redirected:

    category-labels, external-links, flickr-wrappr-links, geo-coordinates, homepages, images, infobox-property-definitions, infobox-test, instance-types, iri-same-as-uri, labels, specific-mappingbased-properties

    Maybe we should resolve redirects in interlanguage-links, but we would have to integrate redirect resolution into interlanguage link resolution. We're pretty strict when we generate interlanguage-links-same-as. If we resolve redirects in interlanguage-links, we would probably gain a few interlanguage-links (tenths of a percent), but we would not eliminate errors.

    Example call: ../run MapObjectUris /data/dbpedia transitive-redirects .nt.gz infobox-properties,mappingbased-properties,... -redirected .nt.gz,.nq.gz 10000-

    The following should be redirected (as of 2012-07-11): disambiguations, infobox-properties, mappingbased-properties, page-links, persondata, topical-concepts (specific-mappingbased-properties is not necessary; it has only literal values)

    TODO: merge with CanonicalizeUris?

  12. object MapSubjectUris

    Maps old subject URIs in triple files to new subject URIs:

    • read one or more triple files that contain the URI mapping:
      • the predicate is ignored
    • read one or more files that need their subject URI changed:
      • the predicate is ignored

    Redirects in subject position SHOULD be resolved in the following datasets:

    anchor-text

    This is because during the extraction, the targets of the links are used as subjects in the anchor texts dataset and thus no redirect resolving is performed on them.

    Example call: ../run MapSubjectUris /data/dbpedia transitive-redirects .nt.gz anchor-text -redirected .nt.gz,.nq.gz 10000-

  13. object ProcessFreebaseLinks

    See https://developers.google.com/freebase/data for a reference of the Freebase RDF data dumps

  14. object ProcessInterLanguageLinks

    Split inter-language links into bidirectional and unidirectional links.
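    The split can be sketched as follows (hypothetical Python; a link is bidirectional when its reverse also exists, matching the same-as / see-also output datasets):

```python
def split_links(links):
    # links: iterable of (source_uri, target_uri) pairs collected from the
    # interlanguage-links datasets of all configured languages.
    link_set = set(links)
    same_as, see_also = [], []
    for src, dst in sorted(link_set):
        if (dst, src) in link_set:
            same_as.append((src, dst))   # bidirectional -> same-as dataset
        else:
            see_also.append((src, dst))  # one-directional -> see-also dataset
    return same_as, see_also
```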

    Example calls:

    'fr,de' means specific languages. '-' means no language uses the generic domain. '-fr-de' is the file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as-fr-de.ttl.gz and enwiki-20120601-interlanguage-links-see-also-fr-de.ttl.gz:
    ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-fr-de.txt.gz -fr-de .ttl.gz - fr,de

    '10000-' means languages by article count range. 'en' uses the generic domain. '-' means no file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as.ttl.gz:
    ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-ttl.txt.gz - .ttl.gz en 10000-

    '-' means don't write a dump file:
    ../run ProcessInterLanguageLinks /data/dbpedia - -fr-de .ttl.gz - fr,de

    If no languages are given, links are read from the dump file, not from triple files:
    ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links.txt.gz - .ttl.gz

    Generate links for all DBpedia I18N chapters in nt format. '-chapters' is the file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as-chapters.ttl.gz:
    ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-chapters-nt.txt.gz -chapters .nt.gz en cs,en,fr,de,el,it,ja,ko,pl,pt,ru,es

    Generate links for all languages that have a namespace on mappings.dbpedia.org in nt format. '-mapped' is the file name part; full names are e.g. enwiki-20120601-interlanguage-links-same-as-mapped.ttl.gz:
    ../run ProcessInterLanguageLinks /data/dbpedia interlanguage-links-mapped-nt.txt.gz -mapped .nt.gz en @mappings

  15. object ProcessWikidataLinks

    Generate separate triple files for each language from Wikidata link file.

    Input format:

    ips_row_id  ips_item_id  ips_site_id  ips_site_page
    55          3596065      abwiki       Џьгьарда
    56          3596037      abwiki       Џьырхәа
    58          3596033      abwiki       Аацы
    ...
    17374035    868895       zh_yuewiki   As50
    17374052    552300       zh_yuewiki   Baa, Baa, Black Sheep
    17374062    813114       zh_yuewiki   Beatcollective
    ...
    257464664   3176059      frwiki       Jeanne Granier
    257464665   3176059      enwiki       Jeanne Granier
    257464677   7465026      ptwiki       36558 2000 QP105
    257464678   8275441      frwiki       Catégorie:Îles Auckland
    ...

    Fields are tab-separated. Titles contain spaces. Lines are sorted by row ID. For us this means they're basically random.
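    Parsing a line of this format can be sketched as (hypothetical Python; splitting on at most three tabs keeps any tabs-free title intact even when it contains spaces):

```python
def parse_link_line(line):
    # Fields are tab-separated; the title is the fourth field and may
    # contain spaces, so split at most three times.
    row_id, item_id, site_id, title = line.rstrip("\n").split("\t", 3)
    return int(row_id), int(item_id), site_id, title
```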

    Example call:

    ../run ProcessWikidataLinks process.wikidata.links.properties

  16. object QuadMapper

    Maps old quads/triples to new quads/triples.

  17. object QuadReader

  18. object RecodeUris

    Decodes DBpedia URIs that percent-encode too many characters and encodes them following our new rules.
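    The gist can be sketched as follows (hypothetical Python; the 'safe' character set here is a rough assumption, not DBpedia's exact encoding rules): fully percent-decode, then re-encode only the characters that genuinely need escaping.

```python
from urllib.parse import unquote, quote

def recode_uri(uri):
    # Undo over-eager percent-encoding, then re-encode only what a URI
    # really requires. The safe set below is an illustrative guess.
    decoded = unquote(uri)
    return quote(decoded, safe=":/?#[]@!$&'()*+,;=~-._")
```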

    Example call: ../run RecodeUris /data/dbpedia/links .nt.gz _fixed.nt.gz bbcwildlife.nt.gz,bookmashup.nt.gz

  19. object RemoveRemainingTags

    Removes HTML tags from the property values of generated triples. This might help if there are some tags left after the abstract extraction.

    Example call: ../run RemoveRemainingTags /data/dbpedia short-abstracts,long-abstracts -tagstripped .nt.gz 10000-

  20. object ResolveTransitiveLinks

    Replace triples in a dataset by their transitive closure. All triples must use the same property. Cycles are removed.
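    The closure computation can be sketched as follows (hypothetical Python; e.g. for redirects, each chain is followed to its final target and links on a cycle are dropped):

```python
def resolve_transitive(links):
    # links: dict mapping source -> target, all with the same property.
    resolved = {}
    for start in links:
        target = links[start]
        seen = {start}
        while target in links:
            if target in seen:   # cycle detected: drop this link
                target = None
                break
            seen.add(target)
            target = links[target]
        if target is not None:
            resolved[start] = target
    return resolved
```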

    Example call: ../run ResolveTransitiveLinks /data/dbpedia redirects transitive-redirects .nt.gz 10000-

  21. object TypeConsistencyCheck

    Created by Markus Freudenberg.

    Takes the mapping-based properties dataset and the assigned rdf types and tries to classify the statements as correct or wrong.

    A statement is wrong when the type of the object IRI is disjoint with the property definition. Correct statements fall into different cases that we skip for now: 1) correct type/subtype, 2) not the correct type/subtype, but not disjoint, 3) the object IRI is untyped. All of 1-3 are for now kept together and not split.

    TODO: this needs special care for English where nt/ttl use different IRIs/URIs

  22. object UnmodifiedFeederCacheGenerator

    Generates an SQL import file that contains all cache items.

    Example call: ../run UnmodifiedFeederCacheGenerator /data/dbpedia .nt.gz 2013-02-01 en

  23. object WikidataSameAsToLanguageLinks

    Generates language links from the Wikidata sameAs dataset as created by the org.dbpedia.extraction.mappings.WikidataSameAsExtractor. This code assumes the subjects to be ordered, in particular, it assumes that there is *exactly* one continuous block for each subject.
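    Why the ordering assumption matters: when each subject forms exactly one contiguous block, the input can be grouped subject-by-subject in a single streaming pass, as in this hypothetical Python sketch:

```python
from itertools import groupby

def subject_blocks(triples):
    # Assumes the input is ordered so that each subject forms exactly one
    # contiguous block; groupby then yields each subject's triples in one
    # pass without buffering the whole file.
    for subject, group in groupby(triples, key=lambda t: t[0]):
        yield subject, list(group)
```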

  24. object WikipediaDumpSplitter

    Split multistream Wikipedia dumps (e.g. [1]) into chunks of configurable size.

    [1] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2

    Note: This script only works with multistream dumps!

    Usage: ../run WikipediaDumpSplitter /path/to/multistream/dump/enwiki-latest-pages-articles-multistream.xml.bz2 /path/to/multistream/dump/index/enwiki-latest-pages-articles-multistream-index.txt.bz2 /output/directory 64
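    The multistream index file lists one "offset:page_id:title" line per page, where the offset marks the bz2 stream containing that page; reading it can be sketched as (hypothetical Python, on an already-decompressed index for brevity):

```python
def parse_index_line(line):
    # Index lines have the form "offset:page_id:title"; titles may
    # themselves contain ':', so split at most twice.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title

def stream_offsets(lines):
    # Each distinct offset marks the start of one independent bz2 stream;
    # a splitter can cut the dump only at these boundaries.
    offsets = []
    for line in lines:
        offset, _, _ = parse_index_line(line)
        if not offsets or offsets[-1] != offset:
            offsets.append(offset)
    return offsets
```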
