library(nfportalutils)
synapser::synLogin(authToken = Sys.getenv("SYNAPSE_AUTH_TOKEN"))
NF Portal Tables Overview
The tables behind the NF Data Portal have grown over time and need maintenance like any other infrastructure. This vignette walks through the original use cases for these utils, covering tasks such as:
- Add new publications to Portal - Publications
- Clean or migrate data, e.g. correct an annotation “rnaSeq” to “RNA-seq” within Portal - Files
- Add new people to Portal - People
- Register a new study to Portal - Studies **
- For each study in Portal - Studies, fill in related studies **
Tasks marked ** are now rarely done manually (they have been automated as part of larger workflows), so they’ll be covered more briefly. When requirements change, knowing what these utils do helps you work out where, what, and how to update.
How this will work
We’ll make copies of the relevant portal tables for update-type operations. This also gives contributors who need to develop and test these utils an example workflow to follow.
First set your private development project id, which will be the parent project to host the table copies.
project_id <- "syn26462036" # the NF-dev-playground project
Portal - Publications updates
There are actually two options for adding publications: `add_publication_from_pubmed` and `add_publication_from_unpaywall`. The pubmed option is the default, and unpaywall is an additional option if there is no pmid.
Start with creating a table copy to work with.
PUBS_COPY <- copy_table("syn16857542", destination_id = project_id)
Adding 1-2 publications at a time
The minimum information needed is `pmid` (the new pub to add) and `study_id` (the linked study). This can use `add_publication_from_pubmed`, which pulls in author, journal, etc. from PubMed. What might need further explanation is the involvement of `study_table_id`: this needs to be a table where `studyId`, `studyName`, `fundingAgency` can be looked up to help fill `fundingAgency` with consistency.
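To make the role of the study table more concrete, here is a minimal base R sketch of that kind of lookup. It is not the package's implementation: the `studies` data frame, its made-up IDs and agencies, and the helper `lookup_study_fields` are all hypothetical stand-ins for the real Synapse table.

```r
# Hypothetical stand-in for the study table; the real one lives on Synapse.
studies <- data.frame(
  studyId = c("syn0000001", "syn0000002"),
  studyName = c("Example Study A", "Example Study B"),
  fundingAgency = c("Agency X", "Agency Y"),
  stringsAsFactors = FALSE
)

# Look up studyName and fundingAgency for one or more study IDs.
# Unknown IDs stop with an error instead of silently producing NA rows.
lookup_study_fields <- function(study_id, study_table) {
  idx <- match(study_id, study_table$studyId)
  if (anyNA(idx)) {
    stop("Unknown study_id: ", paste(study_id[is.na(idx)], collapse = ", "))
  }
  study_table[idx, c("studyName", "fundingAgency"), drop = FALSE]
}

found <- lookup_study_fields("syn0000002", studies)
print(found)
```

Taking the value from a single reference table, rather than typing it by hand for each publication, is what keeps `fundingAgency` spelled the same way across rows.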
Since this is a demo, the papers are not actually related or accurately classified at all. Commands show adding papers with and without additional `disease_focus` and `manifestation` labels (which were once manually derived).
STUDY_TABLE <- "syn52694652" # we will READ ONLY from this table
# With disease_focus and manifestation labels
nfportalutils::add_publication_from_pubmed(
  pmid = 38383787,
  study_id = "syn11672851",
  disease_focus = "Neurofibromatosis type 1",
  manifestation = c("MPNST"),
  publication_table_id = PUBS_COPY,
  study_table_id = STUDY_TABLE,
  dry_run = FALSE
)
# Without the optional labels
nfportalutils::add_publication_from_pubmed(
  pmid = 38383777,
  study_id = "syn11672851",
  publication_table_id = PUBS_COPY,
  study_table_id = STUDY_TABLE,
  dry_run = FALSE
)
Adding publications in large batch, from a spreadsheet
Large batches are often put in a spreadsheet and should instead use `add_publications_from_file`. The spreadsheet needs to have [`studyId`, `pmid`, `diseaseFocus`, `manifestation`] columns filled out. Other columns will be ignored.
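Before handing a spreadsheet to the util, it can help to confirm that the required columns are present. The sketch below is one possible pre-check in base R; `check_pub_columns` and the toy `sheet` data frame are hypothetical and not part of nfportalutils.

```r
# Column names taken from the requirement stated above.
required_cols <- c("studyId", "pmid", "diseaseFocus", "manifestation")

# Hypothetical helper: stop if a required column is missing,
# otherwise return only the required columns (extras are dropped).
check_pub_columns <- function(df, required = required_cols) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  df[, required, drop = FALSE]
}

# Toy sheet with one extra column, which the check drops.
sheet <- data.frame(
  pmid = 38383777,
  studyId = "syn11672851",
  diseaseFocus = NA,
  manifestation = "",
  comments = "not used",
  stringsAsFactors = FALSE
)
checked <- check_pub_columns(sheet)
print(names(checked))
```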
An example of this format comes with the package and is shown below.
example_csv <- system.file("extdata", "pubs_example.csv", package = "nfportalutils")
new_pubs <- read.csv(example_csv)
knitr::kable(new_pubs)
pmid | studyId | diseaseFocus | manifestation | comments |
---|---|---|---|---|
38383777 | syn11672851 | NA | | Drug-Target Explorer |
38383780 | syn4939902 | Neufibromatosis type 1 | MPNST | Johns Hopkins Biobank project |
38375882 | syn51133914\|syn51133929 | Neufibromatosis type 1 | MPNST\|Plexiform neurofibroma | DHART project 1and DHART project 2 produced collaborative paper |
Several things to note:
- Rarely, there may be multiple studies associated with one publication, so they need to be listed with a “|” (pipe) separator.
- Indicating nulls is more nuanced in the spreadsheet version because STRING and STRING_LIST columns are handled differently: for diseaseFocus, use “NA”, while for manifestation the cell can be left blank.
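The two notes above can be illustrated with a small base R sketch of how pipe-separated cells and the two kinds of nulls look once a CSV is read in. The inline CSV text, the placeholder IDs, and the helper `split_list_cell` are hypothetical; how `add_publications_from_file` parses these internally is not shown here.

```r
# Toy CSV text standing in for a spreadsheet export (values are made up).
csv_text <- paste(
  "pmid,studyId,diseaseFocus,manifestation",
  "1,synA,NA,",
  "2,synB|synC,Some focus,X|Y",
  sep = "\n"
)
pubs <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Hypothetical helper: split one cell on a literal separator.
# Blank or NA cells become an empty character vector (an empty list value).
split_list_cell <- function(x, sep = "|") {
  if (is.na(x) || !nzchar(x)) return(character(0))
  trimws(strsplit(x, sep, fixed = TRUE)[[1]])
}

study_ids <- lapply(pubs$studyId, split_list_cell)
manifestations <- lapply(pubs$manifestation, split_list_cell)
print(study_ids)
print(manifestations)
```

Note the `fixed = TRUE` argument: without it, `strsplit` treats "|" as a regular-expression alternation rather than a literal pipe character.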
add_publications_from_file(
file = example_csv,
publication_table_id = PUBS_COPY,
study_table_id = STUDY_TABLE,
list_sep = "|",
dry_run = FALSE
)
Check the new additions in the UI.
To conclude this part of the vignette, clean up the table copy.
synapser::synDelete(PUBS_COPY)
To dig deeper, review the source code or try some experiments to answer these questions:
- What happens when you try to add a pmid that already exists in the table?
- What happens when the pmid is incorrect due to a typo?
Expected results:
#1. A publication that already exists should be skipped with a message saying so.
#2. It fails.
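As a rough illustration of those two outcomes, the following base R sketch classifies a candidate pmid before any attempt to add it. This is an assumption about what a pre-check could look like, not the package's actual logic; `classify_pmid` is a hypothetical helper and pmids are treated as character strings.

```r
# Hypothetical pre-check: label a pmid as "invalid", "duplicate", or "new".
# Only malformed values are caught here; a well-formed but wrong pmid
# would still have to fail at the PubMed lookup step.
classify_pmid <- function(pmid, existing_pmids) {
  if (!grepl("^[0-9]+$", pmid)) return("invalid")
  if (pmid %in% existing_pmids) return("duplicate")
  "new"
}

existing <- c("38383787", "38383777")
candidates <- c("38383777", "3838378O", "38375882") # second one has a letter O
status <- vapply(candidates, classify_pmid, character(1), existing_pmids = existing)
print(status)
```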
Portal - People updates
Create the table copy.
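# NOTE: this ID is the same one used for the Portal - Publications copy above;
# confirm the Portal - People table ID before running.
PEOPLE_COPY <- copy_table("syn16857542", destination_id = project_id)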
This relatively simple util finds new people that have made contributions and adds them to the people table.
add_people_from_table(
  people_table_id = PEOPLE_COPY,
  people_column = "ownerId",
  source_table_id = "syn16858331", # READ ONLY from the source table, which is Portal - Files
  source_column = "createdBy",
  dry_run = FALSE
)
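Conceptually, finding new people amounts to a set difference between the contributor IDs in the source column and the IDs already in the people table. The base R sketch below shows that idea on toy data; the data frames, their made-up IDs, and `find_new_people` are hypothetical, and the real util may differ in its details.

```r
# Toy stand-ins for the two tables (IDs are made up).
people <- data.frame(ownerId = c("1001", "1002"), stringsAsFactors = FALSE)
files <- data.frame(
  createdBy = c("1002", "1003", "1003", NA, "1004"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: contributors in the source column who are not yet
# in the people column, ignoring missing values and repeats.
find_new_people <- function(people_ids, source_ids) {
  candidates <- unique(source_ids[!is.na(source_ids)])
  setdiff(candidates, people_ids)
}

new_ids <- find_new_people(people$ownerId, files$createdBy)
print(new_ids)
```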
Portal - Studies updates
Augment with ‘related studies’
TO DO.
Here are some other things to dig into via the source code and/or docs:
- Why are there both `n_k` and `n_clust`? Why is this used with `n_k` instead of `n_clust`?
Answer:
# That's just because `n_clust` generates results as clusters with highly variable numbers of member studies, i.e. there could be 20 studies around this one mainstream topic vs 2-3 in this more arcane topic. The table breaks when list length exceeds a certain limit. Historically, clusters have been used and a max of four studies selected as a workaround. But using `n_k` can give better related results with more control.
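The difference described above can be seen on a toy example. The sketch below assumes that `n_clust` corresponds to cutting studies into a fixed number of clusters and `n_k` to picking a fixed number of nearest neighbours per study; that reading, the similarity matrix, and `top_k_related` are all illustrative assumptions rather than the package's implementation.

```r
# Toy symmetric similarity matrix between five hypothetical studies.
ids <- paste0("study", 1:5)
sim <- matrix(c(
  1.0, 0.9, 0.8, 0.1, 0.2,
  0.9, 1.0, 0.7, 0.2, 0.1,
  0.8, 0.7, 1.0, 0.3, 0.2,
  0.1, 0.2, 0.3, 1.0, 0.6,
  0.2, 0.1, 0.2, 0.6, 1.0
), nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

# Cluster approach: group sizes depend on the data, so the number of
# related studies per study varies.
clusters <- cutree(hclust(as.dist(1 - sim)), k = 2)
cluster_sizes <- table(clusters)

# Nearest-neighbour approach: every study gets exactly k related studies.
top_k_related <- function(sim, k) {
  lapply(setNames(rownames(sim), rownames(sim)), function(id) {
    scores <- sim[id, colnames(sim) != id]
    names(sort(scores, decreasing = TRUE))[seq_len(min(k, length(scores)))]
  })
}
related <- top_k_related(sim, k = 2)
print(cluster_sizes)
print(related)
```

With a fixed `k`, the list length stored per study is bounded by construction, which avoids the list-length limit mentioned in the answer.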