Portal tables utils

library(nfportalutils)
synapser::synLogin(authToken = Sys.getenv("SYNAPSE_AUTH_TOKEN"))

NF Portal Tables Overview

The tables behind the NF Data Portal have grown over time and need maintenance like any other infrastructure. This vignette walks through the original use cases for these utils, which cover tasks such as:

Adding new publications to Portal - Publications
Cleaning or migrating data, e.g. correcting an annotation “rnaSeq” to “RNA-seq” within Portal - Files
Adding new people to Portal - People
Registering a new study in Portal - Studies **
Filling in related studies for each study in Portal - Studies **

Tasks marked ** are now rarely done manually (they have been automated as part of larger workflows), so they are covered more briefly. When requirements change, knowing what these utils do helps in working out where, what, and how to update.

How this will work

We’ll make copies of the relevant portal tables for update-type operations. This also gives contributors who develop these utils an example workflow to follow when testing.

First set your private development project id, which will be the parent project to host the table copies.

project_id <- "syn26462036" # the NF-dev-playground project

Portal - Publications updates

There are two options for adding publications: add_publication_from_pubmed and add_publication_from_unpaywall. The PubMed option is the default; the Unpaywall option is for publications that have no pmid.

Start with creating a table copy to work with.

PUBS_COPY <- copy_table("syn16857542", destination_id = project_id)

Adding 1-2 publications at a time

The minimum information needed is pmid (the new publication to add) and study_id (the linked study). add_publication_from_pubmed pulls in author, journal, etc. from PubMed. The study_table_id argument needs more explanation: it must point to a table where studyId, studyName, and fundingAgency can be looked up, so that fundingAgency is filled in consistently.

Since this is a demo, the papers are not actually related to the study or accurately classified. The commands show adding papers with and without the additional disease_focus and manifestation labels (which were once manually derived).


STUDY_TABLE <- "syn52694652" # we will READ ONLY from this table
nfportalutils::add_publication_from_pubmed(
  pmid = 38383787,
  study_id = "syn11672851",
  disease_focus = "Neurofibromatosis type 1",
  manifestation = c("MPNST"),
  publication_table_id = PUBS_COPY,
  study_table_id = STUDY_TABLE,
  dry_run = FALSE
)


nfportalutils::add_publication_from_pubmed(
  pmid = 38383777,
  study_id = "syn11672851",
  publication_table_id = PUBS_COPY,
  study_table_id = STUDY_TABLE,
  dry_run = FALSE
)
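
To picture the role of study_table_id described earlier, the following minimal sketch performs the same kind of lookup on a small in-memory data frame. It is not the package's implementation: the helper lookup_study and all IDs, names, and agencies are made up for illustration, and the real study table lives on Synapse.

```r
# Hypothetical stand-in for the study table (made-up values).
studies <- data.frame(
  studyId = c("syn0000001", "syn0000002"),
  studyName = c("Example Study A", "Example Study B"),
  fundingAgency = c("Agency X", "Agency Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the lookup columns for the given study IDs,
# stopping early if any ID is missing from the study table.
lookup_study <- function(study_id, study_table) {
  idx <- match(study_id, study_table$studyId)
  if (anyNA(idx)) {
    stop("Unknown study id(s): ", paste(study_id[is.na(idx)], collapse = ", "))
  }
  study_table[idx, c("studyId", "studyName", "fundingAgency")]
}

hit <- lookup_study("syn0000002", studies)
print(hit)
```

Looking up the funding agency from a single table, rather than typing it per publication, is what keeps the values consistent across rows.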

Adding publications in large batch, from a spreadsheet

Large batches are usually collected in a spreadsheet and should be added with add_publications_from_file instead. The spreadsheet needs the columns [studyId, pmid, diseaseFocus, manifestation] filled out; other columns are ignored.

An example of this format comes with the package and is shown below.


example_csv <- system.file("extdata", "pubs_example.csv", package = "nfportalutils")
new_pubs <- read.csv(example_csv)
knitr::kable(new_pubs)

pmid	studyId	diseaseFocus	manifestation	comments
38383777	syn11672851	NA		Drug-Target Explorer
38383780	syn4939902	Neufibromatosis type 1	MPNST	Johns Hopkins Biobank project
38375882	syn51133914|syn51133929	Neufibromatosis type 1	MPNST|Plexiform neurofibroma	DHART project 1and DHART project 2 produced collaborative paper

Several things to note:

Rarely, a publication is associated with multiple studies; list them with a “|” (pipe) separator.
Indicating nulls is more nuanced in the spreadsheet version because STRING and STRING_LIST columns differ: for diseaseFocus, use “NA”, while manifestation can be left blank.
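
The two conventions above can be tried out in base R before uploading anything. This is only a sketch with made-up rows: split_list is a hypothetical helper, not a package function, and the package may parse the file differently.

```r
# Made-up rows using the same column layout as the example spreadsheet.
csv_text <- "pmid,studyId,diseaseFocus,manifestation
1,syn0000001,NA,
2,syn0000002|syn0000003,Neurofibromatosis type 1,MPNST|Plexiform neurofibroma"

# read.csv treats the literal string NA as missing by default,
# while a blank character field stays an empty string.
pubs <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Hypothetical helper: split pipe-separated values into a list of character
# vectors; a blank cell becomes an empty vector.
split_list <- function(x, sep = "|") {
  lapply(strsplit(x, sep, fixed = TRUE), function(v) v[nzchar(v)])
}

study_ids <- split_list(pubs$studyId)
manifestations <- split_list(pubs$manifestation)

str(study_ids)
str(manifestations)
```

Note the fixed = TRUE argument: without it, strsplit would interpret “|” as a regular-expression alternation rather than a literal separator.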


add_publications_from_file(
  file = example_csv,
  publication_table_id = PUBS_COPY,
  study_table_id = STUDY_TABLE,
  list_sep = "|",
  dry_run = FALSE
)

Check the new additions in the UI.

To conclude this part of the vignette, clean up the table copy.

synapser::synDelete(PUBS_COPY)

To dig deeper, review the source code or try some experiments around these questions:

What happens when you try to add a pmid that already exists in the table?
What happens when the pmid is incorrect due to a typo?

Expected results:

#1. A publication that already exists should be skipped with a message saying so. 
#2. It fails.
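
Both cases can also be screened for before calling the package. The sketch below is an assumption about how such a pre-check might look, not what the package does internally: check_pmid is a hypothetical helper and the existing pmids are hard-coded rather than queried from the table. A format check only catches malformed input; a well-formed but nonexistent pmid would still only fail at lookup time.

```r
# Made-up stand-in for the pmids already present in the publications table.
existing_pmids <- c("38383787", "38383777")

# Hypothetical helper: classify a candidate pmid before attempting to add it.
check_pmid <- function(pmid, existing) {
  pmid <- as.character(pmid)
  if (!grepl("^[0-9]+$", pmid)) {
    return("invalid")    # not purely digits, likely a typo
  }
  if (pmid %in% existing) {
    return("duplicate")  # already in the table, so skip
  }
  "new"
}

status_duplicate <- check_pmid(38383777, existing_pmids)
status_invalid <- check_pmid("3838378x", existing_pmids)
status_new <- check_pmid(12345, existing_pmids)
```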

Portal - Files corrections

TO DO.

Portal - People updates

Create the table copy.

PEOPLE_COPY <- copy_table("syn16857542", destination_id = project_id)

This relatively simple util finds people who have made contributions but are not yet in the people table, and adds them.


add_people_from_table(
  people_table_id = PEOPLE_COPY,
  people_column = "ownerId",
  source_table_id = "syn16858331", # READ ONLY from the source table, which is Portal - Files
  source_column = "createdBy",
  dry_run = FALSE
)
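
Conceptually, finding new people amounts to a set difference between the contributors in the source column and the ids already in the people column. The sketch below illustrates that idea on made-up data frames; it is an assumption about the general approach, not the package's actual code, and the ids are invented.

```r
# Made-up stand-ins for the people table and the source (files) table.
people <- data.frame(ownerId = c("1111", "2222"), stringsAsFactors = FALSE)
files <- data.frame(
  createdBy = c("2222", "3333", "3333", "4444"),
  stringsAsFactors = FALSE
)

# Contributors that appear in the source column but not yet in the people column.
new_people <- setdiff(unique(files$createdBy), people$ownerId)
print(new_people)
```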

Portal - Studies updates

Register new study