Tabular data in the form of CSV files is the common input format in a data analytics pipeline. However, a lack of understanding of the semantic structure and meaning of the content may hinder the data analytics process. Gaining this semantic understanding is therefore valuable for data integration, data cleaning, data mining, machine learning and knowledge discovery tasks. For example, understanding what the data represents can help assess which transformations are appropriate for it.
Tables on the Web may also be the source of highly valuable data. The addition of semantic information to Web tables may enhance a wide range of applications, such as web search, question answering, and knowledge base (KB) construction.
Tabular data to Knowledge Graph (KG) matching is the process of assigning semantic tags from Knowledge Graphs (e.g., Wikidata or DBpedia) to the elements of the table. In practice, however, this task is often difficult because metadata (e.g., table and column names) is missing, incomplete or ambiguous.
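As a rough illustration of what assigning semantic tags to table elements can look like, the following minimal sketch annotates the cells of one column with entities and then derives a column type by majority vote. It is not the method of any SemTab system: the `TOY_KG` dictionary, the `ex:` identifiers and the function names `annotate_cell` and `annotate_column` are all hypothetical stand-ins, and a real system would query an actual KG such as Wikidata or DBpedia and handle ambiguity far more carefully.

```python
from collections import Counter

# Toy stand-in for a knowledge graph: lowercase label -> (entity ID, type ID).
# All identifiers are made up for illustration only.
TOY_KG = {
    "berlin": ("ex:Berlin", "ex:City"),
    "paris": ("ex:Paris", "ex:City"),
    "oslo": ("ex:Oslo", "ex:City"),
    "germany": ("ex:Germany", "ex:Country"),
}


def normalise(cell):
    """Trim whitespace and lowercase a cell value before lookup."""
    return cell.strip().lower()


def annotate_cell(cell):
    """Return the toy entity ID for a cell, or None if no label matches."""
    entry = TOY_KG.get(normalise(cell))
    return entry[0] if entry else None


def annotate_column(cells):
    """Return the most frequent type among the matched cells, or None."""
    types = [TOY_KG[normalise(c)][1] for c in cells if normalise(c) in TOY_KG]
    if not types:
        return None
    return Counter(types).most_common(1)[0][0]


# A column with no header, as is common when metadata is missing.
column = ["Berlin", " Paris", "Oslo", "Atlantis"]
cell_annotations = [annotate_cell(c) for c in column]
column_type = annotate_column(column)

print(cell_annotations)  # ['ex:Berlin', 'ex:Paris', 'ex:Oslo', None]
print(column_type)       # ex:City
```

Even in this toy setting, the unmatched cell ("Atlantis") shows why missing or ambiguous information makes the task hard: the column type has to be inferred from the cells that do match.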
The SemTab challenge aims at benchmarking systems dealing with the tabular data to KG matching problem, so as to facilitate their comparison on the same basis and the reproducibility of the results.
The 2021 edition of this challenge will be collocated with the 20th International Semantic Web Conference and the 16th International Workshop on Ontology Matching.
The ground truths are now open:
Target Knowledge Graphs: Schema.org (version: May 2021), DBPedia (version: 2016-10), Wikidata (version: 20210828)
The code of the AICrowd evaluators is also available here.
See full ISWC program here with the relevant links to the sessions. Material from the SemTab sessions: posters and recorded oral presentations.
Results of all three rounds are available here. A summary of the SemTab 2021 results is available here.
Prizes sponsored by IBM Research:
The results of the challenge will be presented on October 27 (Wednesday). Three teams will also present their systems.
October 27, Session 4D (EDT (US): 10:20-11:20. CET (EU): 16:20-17:20. CST (China): 22:20-23:20):
SemTab will be present during the ISWC Posters & Demos/Social sessions. We will use wonder.me together with the other ISWC Semantic Web challenges.
SemTab will also be present at the Ontology Matching (OM) workshop on October 25 (14:30-15:30 CET). See full OM program here. We will also use wonder.me for the OM poster session (note that the wonder.me rooms are different).
Posters:
We have a discussion group for the challenge where we share the latest news with the participants and discuss issues that arise during the evaluation rounds.
Please register your system using this google form.
Note that participants can join SemTab at any Round for any of the tasks/tracks.
As in previous editions, SemTab includes the following tasks organised into several evaluation rounds:
The challenge will be run with the support of the AICrowd platform and the STILTool system.
This new track aims at addressing applications in real-world settings that take advantage of the output of the matching systems. Challenging dataset proposals are also more than welcome.
Bio-Track: Due to advances in biological research techniques, new data is constantly being produced in the biomedical domain, and it is commonly published in unstructured or tabular formats. This data is not trivial to integrate semantically, due not only to its sheer volume but also to the complexity of the biological relations between entities. Specifically, for tabular data annotation, the representation of the data can have a significant impact on performance, since each entity can be represented by alphanumeric codes (e.g., chemical formulas or gene names) or even have multiple synonyms. The domain would therefore greatly benefit from automated methods that map entities, entity types and properties to existing datasets, speeding up the integration of new data.
We encourage participants to submit a system paper using easychair. The paper should be no more than 12 pages long (excluding references) and formatted using the LNCS Style. System papers will be reviewed by 1-2 challenge organisers.
Accepted system papers will be published as a volume of CEUR-WS. By submitting a paper, the authors accept the CEUR-WS publishing rules.
This challenge is organised by Kavitha Srinivas (IBM Research), Ernesto Jiménez-Ruiz (City, University of London; University of Oslo), Oktie Hassanzadeh (IBM Research), Jiaoyan Chen (University of Oxford), Vasilis Efthymiou (FORTH - ICS), Vincenzo Cutrona (University of Milano - Bicocca), Juan Sequeda (data.world), Daniela Oliveira (Universidade de Lisboa), Catia Pesquita (Universidade de Lisboa), Nora Abdelmageed (University of Jena), and Madelon Hulsebos (University of Amsterdam). If you have any problems working with the datasets or any suggestions related to this challenge, do not hesitate to contact us via the discussion group.
The challenge is currently supported by the SIRIUS Centre for Research-driven Innovation and IBM Research.
BiodivTab is credited to Nora Abdelmageed, Sirko Schindler, Birgitta König-Ries, Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Germany. The tables provided in this challenge are based on real biodiversity research datasets, but have been adapted for the challenge. In the form provided here, they may be used for the challenge only. Any publication on challenge results must cite the underlying datasets. These citations will be made available after the challenge deadline.