Current focus: performance for resolving broken files

Several people are now contributing to the repository, so it would be good to first have a discussion about the current focus. I would prefer to do this in an interactive meeting.

This project aims to resolve several problems:
 - conversion from different formats to NeTEx
 - conversion from NeTEx to different formats
 - conversion between various existing NeTEx profiles
 - having a consistent internal relationship model, and reporting when that consistency fails
 - actually making valid NeTEx out of the "NeTEx" some organisations publish

With respect to architecture we follow the basic principles used in column store databases, currently expressed in a key-value store (MDBX). It is important to be able to look up a key (id + version); this is done via a hash. This hash resolves to an index, and the index can be used to find the actual serialised content. For referential mapping we store the forward relationship from the referencing object to the referenced object; this is 1:N. If the user is also interested in traversing the inverse quickly, we can build this inverse index at the cost of disk space.
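To make the lookup path above concrete, here is a minimal in-memory sketch. It is not the project's actual code: plain Python dicts stand in for the MDBX tables, and `ToyStore`, `key_hash`, and the choice of BLAKE2 as the hash are assumptions made only for illustration.

```python
import hashlib
from collections import defaultdict


def key_hash(object_id: str, version: str) -> bytes:
    """Hash an (id, version) key. The hash function is an assumption."""
    raw = f"{object_id}\x00{version}".encode("utf-8")
    return hashlib.blake2b(raw, digest_size=8).digest()


class ToyStore:
    """Hypothetical stand-in for the key-value layout described above."""

    def __init__(self):
        self.hash_to_index = {}           # key hash -> index
        self.content = []                 # index -> serialised content
        self.forward = defaultdict(set)   # referencing hash -> referenced hashes

    def put(self, object_id, version, payload, references=()):
        h = key_hash(object_id, version)
        if h in self.hash_to_index:
            # Overwrite the content of an already known key.
            self.content[self.hash_to_index[h]] = payload
        else:
            self.hash_to_index[h] = len(self.content)
            self.content.append(payload)
        for ref_id, ref_version in references:
            self.forward[h].add(key_hash(ref_id, ref_version))

    def get(self, object_id, version):
        index = self.hash_to_index.get(key_hash(object_id, version))
        return None if index is None else self.content[index]

    def build_inverse(self):
        """Build the optional inverse index (referenced -> referencing)."""
        inverse = defaultdict(set)
        for source, targets in self.forward.items():
            for target in targets:
                inverse[target].add(source)
        return inverse
```

The inverse index duplicates every stored relationship, which is the disk space cost mentioned above; in this sketch it is only built when asked for.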

What is the current "problem"? There are datasets that are very poorly serialised. They don't validate against the identity-key-constraints defined in the NeTEx XML Schema. We know that the data itself can "make sense": a human would spot the errors and would be able to resolve the issue. Given that there are so many, fixing them by hand is not feasible. This problem has multiple variants, of varying complexity:
 1. The producer has produced a file in which the attributes 'id' (and 'version') match, but "forgot" to mention which class it references. Either it is ambiguous which class is meant due to the abstract nature of the schema (FromPointRef, NoticedObjectRef), or the producer uses a child class, for example FromTimingPointRef while referencing a ScheduledStopPoint.
 2. The elements to be resolved are part of an **embedded** structure such as Quay (under StopPlace), or DayType and DayTypeAssignment (under ServiceCalendar). We **do not** have an index for these embedded object instances. Even if we only stored the ids of embedded properties, this would explode the storage layer (think of storing the id of every individual TimetabledPassingTime).

Based on our experience we have created some strategies to avoid a brute force computation. For example, if we are resolving a broken reference from a TimingPointRef, the topology of the schema lets us limit the search to TimingPoint, ScheduledStopPoint and FareScheduledStopPoint. This has been implemented and is part of the code base. If we search for a broken NoticedObjectRef, we need to scan the **entire** document. As likely follows from the above, we are unhappy with embeddings; we would rather have lists of objects at a single location, with references to them. Implementing it that way opens a new can of worms, for example the inheritance of validity at individual objects.
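The narrowing strategy can be sketched as follows. This is an illustration and not the implementation in the code base: `CANDIDATES`, `resolve_reference`, and the per-class set of `(id, version)` keys are hypothetical, and only the TimingPointRef entry is taken from the text above.

```python
# Reference element -> classes it may point at, derived from the schema
# topology. Only the TimingPointRef entry comes from the description above.
CANDIDATES = {
    "TimingPointRef": (
        "TimingPoint",
        "ScheduledStopPoint",
        "FareScheduledStopPoint",
    ),
}


def resolve_reference(ref_element, object_id, version, keys_by_class):
    """Return every class that holds (object_id, version).

    keys_by_class is a hypothetical mapping: class name -> set of keys.
    Reference elements without an entry in CANDIDATES (for example
    NoticedObjectRef) fall back to scanning every class.
    """
    classes = CANDIDATES.get(ref_element)
    if classes is None:
        classes = tuple(keys_by_class)  # brute force: the entire document
    key = (object_id, version)
    return [cls for cls in classes if key in keys_by_class.get(cls, ())]
```

An empty result means the reference stays broken; more than one match means it is still ambiguous and needs another rule to decide.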

The current problem: it is very costly to resolve broken embeddings. For this to work, all parent objects that may host the types we are interested in must be searched. This already limits the set. Then, for each parent object, we iterate over all of its embeddings. If the embedding is found, we can fix the broken relationship at the referencing object and add the relationship to our internal relationship model. I'll try to document all the tricks we are currently using to limit the scope, but in essence this is the core _unresolved_ performance issue.
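A rough sketch of that search, to show where the cost comes from. Everything here is an assumption for illustration: `HOSTS`, `find_embedded`, and the `parents_by_class` mapping are hypothetical, and the XML is parsed without the NeTEx namespace to keep the example short.

```python
import xml.etree.ElementTree as ET

# Embedded class -> parent classes that may host it (from the examples above).
HOSTS = {
    "Quay": ("StopPlace",),
    "DayType": ("ServiceCalendar",),
    "DayTypeAssignment": ("ServiceCalendar",),
}


def find_embedded(target_class, object_id, version, parents_by_class):
    """Locate the parent object that embeds (object_id, version).

    parents_by_class is a hypothetical mapping:
    parent class name -> {(id, version): serialised XML bytes}.
    Returns (parent_class, parent_key), or None when nothing matches.
    """
    for parent_class in HOSTS.get(target_class, ()):
        parents = parents_by_class.get(parent_class, {})
        for parent_key, xml_bytes in parents.items():
            # Every candidate parent has to be deserialised and walked.
            root = ET.fromstring(xml_bytes)
            for element in root.iter(target_class):
                if (element.get("id") == object_id
                        and element.get("version") == version):
                    return parent_class, parent_key
    return None
```

Without an index on embedded ids, each broken reference repeats this walk over all candidate parents, so the work grows with the number of broken references times the number of parent objects.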

Current focus: performance for resolving broken files #117
