[WIP] [#604] Normalize organisations names#116
Conversation
9d9841a to
5f2ef1c
Compare
58e9ca7 to
5f2ef1c
Compare
|
So, played a bit with the failing build issue, here are some observations:
There must be a |
|
Just for the record, this discussion summarizes the issue we are facing here. Apparently the most elegant solution comes from the libs (fastcluster and pyhacrf) which shouldn't use numpy as requirement on their setup.py @nightsh both workarounds you've suggested work, and if we choose to go for any of them, I would also suggest to avoid the tox deps and add the custom command on the commands target, right before the tests itselves |
2c70917 to
5f2ef1c
Compare
vitorbaptista
left a comment
There was a problem hiding this comment.
Apart from my comments, could you explain why we need a training file? I thought we would simply read the contents from the DB when re-generating the clusters.
| """ | ||
| CLUSTER_QUERY = "SELECT canonical " + \ | ||
| "FROM organisation_clusters " + \ | ||
| "WHERE '%s'=ANY(variations)" % organisation |
There was a problem hiding this comment.
Instead of binding the organisation parameter via string interpolation, rely on SQLAlchemy for that. The query would become SELECT canonical FROM organisation_clusters WHERE '?'=ANY(variations) and you'd call it as conn['warehouse'].query(CLUSTER_QUERY, organisation) (if I remember correctly)
| logger.debug('Organisation "%s" normalized as "%s"', organisation, normalized_form) | ||
| except StopIteration: | ||
| logger.debug('Organisation "%s" not normalized', organisation) | ||
| pass |
There was a problem hiding this comment.
Forgot to remove the pass
| cluster_ghent = { | ||
| 'canonical': 'Ghent University Hospital', | ||
| 'variations': ['Ghent University Hospital', 'Ghent University'] | ||
| } |
There was a problem hiding this comment.
These variations feel like should be in different clusters, as they look like different organisations.
|
|
||
| @pytest.mark.usefixtures('organisation_cluster') | ||
| def test_organisation_normalizer(self, conn, test_input, expected): | ||
| assert helpers.get_canonical_organisation_name(conn, test_input) == expected No newline at end of file |
There was a problem hiding this comment.
I think this test is doing too much. First of all, the pattern of the organisation_cluster fixture is confusing, as it simply adds two fixtures to the DB without returning anything (which is probably the reason you've used @pytest.mark.usefixtures instead of adding it as a parameter). There's also the issue that we're hardcoding the data we expect to be created by the fixture here. I think a fixture is the wrong abstraction here, at least with the very simple fixture tools we have.
Instead of doing this, please remove the organisation_cluster fixture and create the data you need manually in the tests for the 3 cases you're testing. You could create a _create_organisation_cluster(self, conn, canonical_name, variations) to help keep the code DRY.
There was a problem hiding this comment.
Resolved. Please check if the current approach meets your suggestion
|
All code issues were resolved. We're still looking into how to train the clustering algorithm to avoid incorrect clusters |
opentrials/opentrials#604