Slack Office Hours Recapped

In case you missed it, the NanoString team is hosting Office Hours in Slack. The questions and answers are full of helpful information, so we're recapping them here!

Office Hours: Learn the science (March 29, 2021)
Get an understanding of the science behind the dataset with the scientist who helped create it. Speaker: Stephanie Zimmerman, PhD - NanoString Scientist

Q: Where does the data come from? 

WTA (Whole Transcriptome Atlas) measures the expression of every gene in the transcriptome in a distinct region of interest (ROI) in the tissue, so the form of the data is counts x gene x ROI. The way we get the count data is by hybridizing probes that are specific to each gene across the tissue. These probes have a photo-cleavable linker that is released when illuminated with UV light. The NanoString DSP instrument can illuminate just the probes in a region of interest that the researcher designates and collect the released tags from that region. We then sequence the tags to get counts for each region and each gene. You can watch a video about how it works here.
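As a toy illustration of that counts x gene x ROI structure (the gene names and count values below are made up, not from the dataset), the data can be pictured as a small genes-by-ROIs matrix:

```python
import numpy as np

# Hypothetical gene subset and ROI IDs; real WTA data covers the
# whole transcriptome across many ROIs.
genes = ["GAPDH", "WT1", "PTPRC"]
roi_ids = [1, 2, 3, 4]
counts = np.array([
    [1200, 980, 1500, 1100],   # GAPDH counts across the four ROIs
    [300,  20,  450,  15],     # WT1
    [80,   95,  60,   120],    # PTPRC
])
print(counts.shape)  # (genes, ROIs) = (3, 4)
```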

Q: What kind of research questions might this technology assist with? Specifically, in the case of diabetic kidney disease. 

In the past, most gene expression studies have focused on bulk tissue - in other words, looking at gene expression in the kidney as a whole. However, our tissues are actually very structured, and different structures have very different gene expression and are affected differently by disease. In the kidney specifically, the main structures are the glomeruli and the tubules - the glomeruli filter your blood and the tubules collect the filtrate and absorb water and small molecules. The glomeruli are damaged by diabetes, but the degree of damage varies from one glomerulus to another. Tubules are also damaged, but that damage has been less studied and understood. We hope that by profiling these structures separately we can learn how different regions of the kidney are affected by disease.

Q: Is there a relation between the scanned images and the text (csv) datasets? 

Yes. On the scanned images you can visualize the ROIs that we profiled. The text file contains gene expression data for each ROI. So you can take the gene expression data and match it up to where it came from on the image by scan ID and ROI number.  This can be really interesting, because you can actually visualize how gene expression is different in different parts of the tissue and correlate it with spatial features from the image. This is a very new technology, so we are just learning about how best to view gene expression in a spatial context and integrate the gene expression data with the image data. That's why I'm so excited for this hackathon!
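To sketch what that matching might look like, here is a minimal pandas example. The column names (`scan_id`, `roi_id`, `x`, `y`) and all values are hypothetical stand-ins for whatever identifiers the actual files use:

```python
import pandas as pd

# Hypothetical expression counts per ROI (one gene shown for brevity)
expr = pd.DataFrame({
    "scan_id": ["disease1", "disease1"],
    "roi_id": [1, 2],
    "GAPDH": [1200, 980],
})

# Hypothetical ROI positions on the scanned image
rois = pd.DataFrame({
    "scan_id": ["disease1", "disease1"],
    "roi_id": [1, 2],
    "x": [5120, 7300],
    "y": [2048, 3900],
})

# Join expression counts to image coordinates on scan ID + ROI number
merged = expr.merge(rois, on=["scan_id", "roi_id"], how="inner")
print(merged)
```

With the merged table in hand, each count row carries the image location it came from, which is what enables plotting expression directly on the tissue.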

Q: And again, still about the scanned images (and by the way, are they really scanned, these tissues being so small?): is knowledge about their magnifications useful? I had really wished to know that.

At the bottom of the images there is a ruler with the pixel/micrometer equivalence. The full scans are 20x, and the tissue slices are on the order of 10x20 mm (obviously each one is a bit different). The tissues are mounted on a slide and scanned with 4 wavelengths of light to visualize the 4 different morphology markers on the tissue (in the case of this dataset, DNA and 3 proteins: WT1, Pan cytokeratin, and CD45 - but you can use other markers for different experiments). The full scan images available are a composite of the 4 single-channel scans, and are at 20x magnification.

Q: What are the units of the ROI Coordinates in the sample annotations file? What set of images do they reference? Those of high resolution?

Those refer to the full-resolution TIFF images, not the full-scan PNG images. They are in pixels, indexed to the top-left corner of the image, with x extending to the right and y extending down. We also have full-resolution single-channel images available for two of the seven slides, if people want to extract the image data from those.
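A minimal sketch of cropping an ROI under those conventions (top-left origin, x rightward, y downward), using a toy numpy array in place of a real full-resolution TIFF:

```python
import numpy as np

# Toy stand-in for a full-resolution single-channel scan.
# Real scans are large TIFFs; numpy indexes them as (row=y, col=x).
scan = np.arange(100 * 100, dtype=np.uint16).reshape(100, 100)

def crop_roi(img, x, y, width, height):
    """Crop an ROI given top-left-origin pixel coordinates
    (x grows rightward, y grows downward)."""
    return img[y:y + height, x:x + width]

roi = crop_roi(scan, x=10, y=20, width=30, height=15)
print(roi.shape)  # (15, 30): rows correspond to y, columns to x
```

The row/column swap is the usual pitfall here: image coordinates are given as (x, y), but array indexing is (y, x).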

Q: How do the tissue structures in the images - the glomeruli, tubules, and vessels - relate to each other, and how were the ROIs selected?

So we essentially took a 2D slice through a section of the kidney and are looking at cross-sections of tubules, glomeruli, and blood vessels (and other cells too, but those are the main structures). We selected individual glomeruli (they look kind of like circular clumps of cells in the image) that were annotated by a pathologist as being very diseased or more normal. For the tubules we didn't select based on pathology because it's not as well studied in diabetic kidney disease, instead we captured example ROIs across the tissue. The tissue structures are connected, but since it's a 2D slice it's hard to see the connections.

Office Hours: Learn the tools (April 2, 2021)
Get an introduction to the GeoMx platform and how to interact with the data.
Speaker: Nicole Ortogero, PhD - NanoString Bioinformatics Scientist

Q: What’s your favorite part of the toolset? Or what would you do if you were building for the hackathon? 

A: My favorite part of the toolset is the subsetting functions. I love that I can simply say what features or samples to keep, and all my data as well as metadata will be trimmed and carried over to my new object in just one short method call. I did think of one really fun project: you could use VR to traverse your slide landscape and show what genes were lighting up, and where on the slide, with fun graphics like fireworks. Very abstract and not entirely useful for analysis, but fun.

Q: What limitations do you consider current visualizations to have in integrating gene expression data and immunofluorescence images?

A: Plex is one of the biggest hurdles. Say you want to show how a group of 50 genes is expressed on an IF image, which already carries color information of its own. How would you clearly show the differences in expression in one visualization?

Q: In the data provided, some images have no values for Segments, neg, WT1, or PanCK. Is the data missing for them, or does it have some meaning?

A: No segments means no segmentation was performed, so the region of interest was not further subsetted into segments and was just taken as is.

Q: What do you anticipate being the most challenging part of this hackathon and what tips do you have?

A: The most challenging part of this hackathon will be learning the technology as it is new and most people are not familiar with it. I highly recommend going through the resources provided.

Q: What is the scope of this challenge? Developing additional features using R for the above library 'NanoString GeoMx Tools'? OR we are supposed to develop a web application and try new visualizations?

A: All of those options are within scope for this challenge. We are hoping people can get creative in making tools to help GeoMx scientists, be it to analyze or visualize their data.

Q: Do bioinformatics scientists find some value in machine learning models with low interpretability? Let's say you found cells affected by disease are clustered in some space projected by UMAP (or t-SNE), but the axes of that graph don't have a particular association with some measurable quantity.

A: Yes, we commonly use dimension reduction tools such as UMAP, t-SNE, PCA, etc. for GeoMx data. We use them as most do, for QC - such as checking that like clusters with like, or finding outliers - or, even better, to find new unique clusters that may have biological importance. With GeoMx data, you can relate these clusters back to their physical location, which is the true beauty of spatial data. We can see how these clusters interact with each other in space!
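As a rough sketch of that workflow, using entirely synthetic counts and made-up slide coordinates, with PCA standing in for UMAP/t-SNE and a crude sign-based split standing in for real clustering:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expression matrix: 20 ROIs x 50 genes, two simulated groups.
counts = rng.poisson(5, size=(20, 50)).astype(float)
counts[:10] += 10.0  # up-shift the first 10 ROIs to create a second cluster

# Log-transform and center, then PCA via SVD
X = np.log1p(counts)
X -= X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]          # each ROI's position in PC space

# Hypothetical ROI centroids on the slide (pixels from the top-left)
xy = rng.uniform(0, 10000, size=(20, 2))

# Label clusters by the sign of PC1, then see where each cluster sits
# on the slide; with real data you would plot xy colored by cluster.
cluster = (pcs[:, 0] > 0).astype(int)
for c in (0, 1):
    cx, cy = xy[cluster == c].mean(axis=0)
    print(f"cluster {c}: n={np.sum(cluster == c)}, slide centroid=({cx:.0f}, {cy:.0f})")
```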

Q: For some of us who are completely new to this domain, is it possible to provide some pointers, such as you mentioned in the above messages about "VR to traverse" and "plex"?

A: As of right now, one of the biggest issues is integrating the spatial data with expression data. There is so much information that comes from both pieces of that puzzle. I already talked about making easy-to-read expression overlays on IF images, but there are also pathway, cluster, cell deconvolution, etc. types of results that come from expression data that can be integrated with an image. In addition, you can look at it from the image perspective first: if we look at the proximity of cells on my IF image and translate that into numerical data, how can I incorporate that with the expression data and the corresponding results previously listed? There are also challenges from a statistical perspective, such as background modeling and outlier detection. If I build a model to predict, say, normal vs. disease, does my model improve when I add spatial + expression data? Obviously it most likely will, but I'm sure feature selection will be interesting.


Office Hours: Learn the possibilities (April 9, 2021)

Get a closer look at what has been developed using the GeoMx Digital Spatial Profiler data already. Speaker: Tyler Hether, PhD - Data Scientist

Q: If you have any favorite projects or thoughts on what could be an interesting hackathon project concept, please share!

A: One of the challenges (and opportunities) we see in some of our GeoMx applications is the merging of spatial data with expression data. I think computational biology and bioinformatics have focused on the complexities of DNA/RNA, but the integration of spatial components can lead to new biological insights that you might not have seen by looking at IF and gene expression separately.

One project that I worked on recently was presented at AGBT and the Keystone conference (single-cell biology). A link to that is here: […]

Q: Could you give an example of how the joint analysis of gene expression data and immunofluorescence images led to new insights about a particular disease?

What quantitative methods do you use to characterize features in images?

A: @Tyler Hether's been developing methods around these types of questions, integrating a totally different field (landscape ecology), and has a poster up on the inspiration board about analysis of colorectal cancer. He can speak to some ways we've started to integrate the data, but there's tons that can be done. Other methods beyond the ones he may discuss include analyzing the single-channel TIFFs or ROI images for the % composition of a given cell type colored by immunofluorescent stains, and looking for downstream associations with that in the count data across the dataset as a whole. I'd take a look into the field of digital pathology for more information or inspiration.
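A minimal sketch of that % composition idea, thresholding a toy single-channel array in place of a real TIFF. The threshold here is arbitrary; a real analysis would choose it from the image's intensity histogram:

```python
import numpy as np

# Toy single-channel IF image (e.g. a PanCK channel); real inputs would
# be one of the single-channel TIFFs, loaded with an imaging library.
rng = np.random.default_rng(1)
channel = rng.integers(0, 50, size=(200, 200)).astype(np.uint16)
channel[50:100, 50:150] = 200   # a brightly stained region

# Percent of pixels above an intensity threshold, as a crude proxy for
# the fraction of the area positive for that stain.
threshold = 100
pct_positive = 100.0 * np.mean(channel > threshold)
print(f"{pct_positive:.1f}% of pixels above threshold")
```

The resulting per-ROI percentage can then be joined to the count data, e.g. to test whether expression of some gene set tracks with the stained fraction.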

Q: What’s your favorite project you’ve worked on?

A: This last year, as we know, has been impacted greatly by the pandemic. One of the more exciting sets of projects has been centered around understanding the gene expression profiles of Covid-19 patients who unfortunately died of the disease. In a recent set of published papers, we compared RNA expression of specific compartments of Covid-19 lung (vascular, alveoli, etc.) and compared those results to FFPE samples from other forms of ARDS (viral and bacterial) as well as normal healthy control lungs from donors. In these projects, the blending of multiple 'omics data gives us a better understanding of what's happening to these patients. Here are two recent papers from that project:

Q: I saw on the inspiration page you chose to give an example with diabetic kidney disease. What other types of kidney diseases would be most helpful to reference or include as inspiration when building our projects?

A: Diabetic kidney disease (type II) is the type of tissue available in our dataset, but recent work in diabetes shows a strong role for the immune system in disease progression in DKD. I'd recommend looking at other autoimmune diseases which impact the kidney. Some resources might include:

Q: What are the biomarkers that are used to identify DKD? Are there any resources for that, as well as for currently used techniques?

A: There are a couple of good open-access resources I can point you towards to better understand the disease. Here's a review article you might find helpful:

At the same time, most biomarkers for this disease are likely blood- or urine-based. All of our samples are kidney resections (where the whole tissue is removed), so these are likely late-stage disease; I'll work to find out more information about what we know about the patient samples so you have a more complete understanding of how far along disease progression these samples are. Some of these biomarkers may not be directly related to the tissue structures we are profiling, but new ones may well be found in this dataset!

Q: In the images of glomeruli and tubules, is the gene expression only of the polygon? I ask because I've seen images where structures outside the polygon are also colored.

A: The tubules are segmented based on PanCK-Pos and PanCK-Neg. You will get two samples (dccs) for that ROI: one for each area of interest (AOI). For example, in kidney_norm@phenoData@data you'll see the same ROI ID for two samples. The way our system works is that we define the region we want to profile (the white line in this example) and then decide how we want to collect data from within that region. If there's no additional color inside the image (just the outline), then we collected everything bounded by the white line.

With segmentation, we collect two separate parts of the tissue bounded by the white line as distinct areas; they will be shown as different colors in the ROI images.

That allows us to study the tissue structures that are defined by different IF staining patterns, even though they are physically next to each other.
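A small pandas sketch of how segmented AOIs might be grouped back to their parent ROI; the sample IDs, column names, and segment labels below are hypothetical stand-ins for the actual annotation fields:

```python
import pandas as pd

# Hypothetical sample annotations: a segmented ROI yields two AOIs
# (two dcc files), one per segment; an unsegmented ROI yields one.
samples = pd.DataFrame({
    "sample_id": ["d1.dcc", "d2.dcc", "d3.dcc"],
    "roi_id": [7, 7, 8],
    "segment": ["PanCK-pos", "PanCK-neg", None],  # None = no segmentation
})

# Group AOIs back to their parent ROI
for roi, group in samples.groupby("roi_id"):
    print(roi, list(group["segment"]))
```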

Q: Could you clarify what IF is?

A: IF stands for immunofluorescence. IF is why there are yellow, purple, and magenta regions, each representing a different protein's expression.

Q: One question that came up in previous office hours was related to the ROI coordinates provided.

A: As a follow-up to that, we are working on uploading extra annotation information about the ROI locations (hopefully today; we'll post a note when it's live) that describes them more specifically and provides ROI coordinates for the smaller PNG files found in the ROI reports as well. Those might be easier for people to test on if you're having trouble loading the very large TIFF images, and the PNGs will be compatible with most image-editing software and platforms.