Generative AI Approaches to Maritime Data Parsing

Supervisors

Giorgio Orsi (Oxford Martin Fellow)

Suitable for

Abstract

This project lies within the fields of Natural Language Processing (NLP) and Information Extraction (IE), using Generative AI (GenAI) technologies.

The specific problem at hand is IE from unstructured and semi-structured text documents used by the maritime industry. These documents include, among others:

● Port agent records (import/export bills of lading), containing information such as the identities of the vessels loading or discharging cargo at a port, the products carried, the quantities, and the ports and dates of arrival, departure, and clearance. Commercial information, such as the charterer, shipper, consignee, and notify party, is often present. Depending on the country and the port, specific identifying codes may also be present.

● Inspection reports, containing information about safety and commercial inspections of cargoes carried out at various ports by specialised companies. These reports contain highly detailed technical information about the cargoes, including technical specifications, e.g., API classifications for fuel. They also often include technical information about the vessel itself, e.g., whether a scrubber is present or how the vessel fares against safety standards.

● Fixtures, containing information about the fixture of a vessel for the transport of a specific cargo, including the type of product, the identity of the vessel, the rate at which it was fixed, the terms of the contract, and the laycan dates.

● Port lineups, containing predicted vessel arrival schedules and dates at ports across the world.
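To make the extraction target concrete, the fixture fields listed above could be captured in a structured record such as the one below. This is a minimal sketch only: the class name `FixtureRecord`, the field names, the helper `fixture_from_json`, and the assumption that a model returns a JSON object with ISO-formatted dates are all hypothetical choices made for illustration, not part of the project specification.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FixtureRecord:
    """Hypothetical target structure for one parsed fixture."""
    vessel_name: str
    product: str
    rate: Optional[str]            # kept as text, since rate conventions vary
    contract_terms: Optional[str]
    laycan_start: Optional[date]
    laycan_end: Optional[date]


def _parse_date(value: Optional[str]) -> Optional[date]:
    # Assumes ISO dates (YYYY-MM-DD); missing values stay None.
    return date.fromisoformat(value) if value else None


def fixture_from_json(raw: str) -> FixtureRecord:
    """Build a FixtureRecord from a JSON string (e.g., a model's output)."""
    data = json.loads(raw)
    return FixtureRecord(
        vessel_name=data["vessel_name"],
        product=data["product"],
        rate=data.get("rate"),
        contract_terms=data.get("contract_terms"),
        laycan_start=_parse_date(data.get("laycan_start")),
        laycan_end=_parse_date(data.get("laycan_end")),
    )
```

Making the optional fields explicit reflects the fact that noisy source documents may omit some values; a real schema would likely need further fields and validation.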

These documents are often the result of manual data entry and OCR conversion (e.g., from paper filings), resulting in a substantial amount of noise and errors. In academic terms, these documents fall within the category of Noisy Unstructured Text (NUT), which poses challenges for parsing and information extraction. Thanks to the rapid advances in Large Language Models (LLMs) and Generative AI, the state of the art in NLP and IE has progressed considerably. Modern LLMs like GPT-o and Claude show strong parsing and question answering capabilities, both in general language understanding and in specialised domains, e.g., medical and legal, where an extraordinary amount of data is available.

The language understanding performance of LLMs in specialised domains depends directly on the availability of data for that specific domain. While a large amount of general-knowledge data about the energy and maritime industries is available, data is extremely scarce when it comes to their technologies, operations, entities, and processes. Most of this data is still collected and managed manually by traders, brokers, port agents, and government agencies, without specific requirements for disclosure. As a result, it is currently extremely challenging to train or even fine-tune LLMs on the type of documents used within the energy and maritime industries. Moreover, these fields are extremely technical, often matching the level and depth of technicalities found in the medical and legal domains.

The main aim of this project is to demonstrate that custom rule-based parsers, written in an imperative or declarative programming language, can be replaced entirely by LLMs. Additionally, we aim to demonstrate that this can be achieved at the level of accuracy required by the maritime industry (>95% accuracy).
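The text does not define how accuracy is measured. One plausible reading is field-level accuracy against manually annotated reference records, sketched below. The function names, the dictionary-based record format, and the whitespace and case normalisation are assumptions made for illustration; the project may adopt a different metric.

```python
from typing import Any, Dict, List, Optional


def _normalise(value: Any) -> Optional[str]:
    # Collapse whitespace and ignore case so trivial OCR spacing
    # differences are not counted as errors (an assumed convention).
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def field_accuracy(predicted: List[Dict[str, Any]],
                   gold: List[Dict[str, Any]]) -> float:
    """Fraction of gold fields whose predicted value matches after normalisation."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of records")
    total = 0
    correct = 0
    for pred_rec, gold_rec in zip(predicted, gold):
        for field, gold_value in gold_rec.items():
            total += 1
            if _normalise(pred_rec.get(field)) == _normalise(gold_value):
                correct += 1
    return correct / total if total else 0.0
```

Under this reading, a parser would meet the stated target when `field_accuracy(predicted, gold) > 0.95` on a held-out set of annotated documents.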

The solution must be multi-modal (i.e., able to extract information from images, text, and eventually audio) and should not depend on the type of information being parsed. For example, a single model needs to be able to process fixtures, port agent records, port lineups, and inspection reports, without requiring multiple models or different conditioning. The solution must scale to process documents in real time as contracts (i.e., fixtures) are signed and vessels enter or leave ports. Moreover, the solution must be transparent and explainable in order to comply with the strict regulatory requirements that the maritime and energy industries enforce, including confidentiality.

Skills and Experience Required:

● Motivated by working in an intellectually engaging environment with the top minds in the industry, where constructive and friendly challenges and debates are encouraged, not avoided.

● Strong foundation in software engineering and machine learning, with coursework in advanced machine learning or data science preferred.

● Proficiency in Python, especially its machine learning and natural language processing libraries.

● An understanding of the fundamentals of LLMs, Generative AI, and Retrieval-Augmented Generation (RAG) is a plus.
