Usability issues

I was trying copyVAE on two different datasets to get an idea for its performance and noticed a few small things that I wanted to point out, as potential improvement suggestions or in case someone else runs into similar issues. I used a trisomy12 dataset and the DCIS dataset the authors used for benchmarking in the paper. 


 

**1. Access to the sorted anndata object as output**
Since the anndata object gets sorted by the absolute position in `bin_genes_from_anndata` it would be great to write this sorted adata object to file somewhere. After running copyVAE I want to be able to trace back the resulting matrices to the cells they were originally annotated with. Since they got sorted, the original input order is not maintained anymore and can't be used for labelling. One option would be to just write the "`data`" object to file, for example after the following fields are set:
```
data.obsm["latent"] = z
data.obsm["copy_number"] = copy_bin
```
 

Using my trisomy12 dataset, which I imported using the `sc.read_10x_mtx()` function (versus the DCIS dataset which I imported as `sc.AnnData() `from a pd dataframe) before saving it as .h5ad for copyVAE, I came across a runtime issue: 

**2. Datastructure missmatch**
The data imported with sc.read_10x_mtx comes in the format
```
type(data.X)
<class 'scipy.sparse._csc.csc_matrix'>
```

and after calling `todense()` on it stays a matrix:
```
type(x)
<class 'numpy.matrix'>
```

In comparison to the DCIS dataset, which is type `numpy.ndarray`. The code seems to work with the ndarray, but not matrix.

Because this further down the line in `auto_corr()` called from `find_normal_cluster()` leads to a ValueError in the line 
`res += np.sum(cluster_data[i,:] * cluster_data[j,:])`, because of the dimensions.
` ValueError: shapes (1,529) and (1,529) not aligned: 529 (dim 1) != 1 (dim 0)`

This can be, for example, fixed by converting 
```
try:
 x = data.X.todense()
 except AttributeError:
 x = data.X

 if(isinstance(x, np.matrix)):
 x = np.asarray(x)
```
Either this or defining the input has to be in array form and not matrix form.


 

**4. Gene name mapping**
I have noticed that the `gene_ids` in adata are overwritten by the `gene_name` 

```
 if 'gene_name' in adata.var:
 adata.var['gene_ids'] = adata.var['gene_name']
```

Only to later on match on the gene names when mapping, but from the (now overwritten with gene names) gene_ids variable in adata.

```
map_chr = dict(gene_map[['Gene name', 'chr']].values)
 adata.var['chr'] = adata.var.gene_ids.map(map_chr)
```

Why not match on gene_names in adata, instead of gene_ids and overwriting? Or alternatively matching gene_ids from adata and "Gene Stable ID" from the gene_map? I was trying to set the metadata correctly in the anndata object in my trisomy dataset, to match how the code expects it to be, but this part seemed unnecessarily confusing.

 

**5. Output decimals and not integers ("whole copy numbers")**
I have noticed that the output from both datasets are fractions and not whole numbers. While the differences in CN, for example in the trisomy12 dataset, are there in a fractional state, they are not really big enough so they wouldn't disappear when rounding to whole copy numbers. Would it make sense to add a functionality to get whole copy numbers?

 

**6. Nice to Have: Add a parameter for setting the output_path directory.**
Would just make things a bit more convenient: `parser.add_argument("-o", "--output_path", help="output path prefix")`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usability issues #3

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Usability issues #3

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions