Statistics

3. Statistical Analyses #

Now that you have quantitative data on your vessels, you can perform statistics to investigate what those numbers mean.

  • 3.1 Normalization #

    Before you can apply statistical analyses, you need to normalize your data. The script Area_slide.py (downloadable from the user slrenne's erivessel repository) lets you normalize the number of vessels by the tissue sample area. It segments the entire sample area using the L channel, also known as the luminance channel.

    To begin, set up the environment.

    #python
    import cv2
    import numpy as np
    import matplotlib.pyplot as plt
    import os
    import glob
    import pandas as pd
    from tqdm import tqdm 
    

    The following function takes the path to an image as input and returns the area of tissue present, optionally saving visualizations of the results. The image is read using OpenCV, then converted from BGR to RGB (for display) and from BGR to LAB (L = luminance, A = green/red, B = blue/yellow). Only the L channel is extracted, since brightness is what separates tissue from the background.

    #python
    def calculate_tissue_area(image_path, save_visualizations=False, output_folder=None):
    
      img = cv2.imread(image_path)
      if img is None:
        print(f"Error: Could not read image {image_path}")
        return 0, None, None
    
      img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
      lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
      l_channel = lab[:,:,0]
    

    You then apply a binary inverse threshold: pixels brighter than 220 (almost white) are set to 0 (black), while all others are set to 255 (white). This isolates the darker tissue from the bright background. Next, you perform morphological operations to clean up the binary image and find the contours of its white regions, which should correspond to tissue.

    #python
      _, binary = cv2.threshold(l_channel, 220, 255, cv2.THRESH_BINARY_INV)
    
      kernel = np.ones((5,5), np.uint8)
      binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
      binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    
      contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
      mask = np.zeros_like(l_channel)
    
      min_contour_area = 1000  # Adjust based on your image size
      for contour in contours:
          if cv2.contourArea(contour) > min_contour_area:
              cv2.drawContours(mask, [contour], 0, 255, -1)
    

    Here, you count how many white pixels are present; each pixel corresponds to one unit of tissue area. Additionally, the mask is used to extract only the tissue from the RGB image by setting pixels outside the tissue to black.

    #python
      area_pixels = cv2.countNonZero(mask)
      filled_tissue = cv2.bitwise_and(img_rgb, img_rgb, mask=mask)
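
    To see the thresholding and pixel-counting logic in isolation, here is a minimal sketch on a synthetic image. It uses pure NumPy, with np.where standing in for cv2.threshold and np.count_nonzero for cv2.countNonZero; the image size and intensity values are made up for illustration.

    ```python
    import numpy as np

    # Synthetic "slide": bright background (value 240) with one darker
    # rectangle of "tissue" (value 120); all values are hypothetical.
    l_channel = np.full((100, 100), 240, dtype=np.uint8)
    l_channel[20:60, 30:80] = 120  # a 40 x 50 tissue region

    # Same rule as cv2.threshold(l_channel, 220, 255, cv2.THRESH_BINARY_INV):
    # pixels brighter than 220 become 0, all others become 255.
    binary = np.where(l_channel > 220, 0, 255).astype(np.uint8)

    # Tissue area = number of white pixels, as in cv2.countNonZero(mask)
    area_pixels = int(np.count_nonzero(binary))
    print(area_pixels)  # 40 * 50 = 2000
    ```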
    

    If save_visualizations is enabled, the function generates a 2x2 subplot grid with the original image, the binary threshold result, the tissue mask, and the masked tissue with its area shown in the title. The plot is saved in the specified output_folder with an _analysis.png suffix.

    #python
      if save_visualizations:
          if output_folder is None:
              output_folder = os.path.dirname(image_path) or '.'
    
          plt.figure(figsize=(15, 10))
    
          plt.subplot(2, 2, 1)
          plt.imshow(img_rgb)
          plt.title('Original Image')
          plt.axis('off')
    
          plt.subplot(2, 2, 2)
          plt.imshow(binary, cmap='gray')
          plt.title('Thresholded Image')
          plt.axis('off')
    
          plt.subplot(2, 2, 3)
          plt.imshow(mask, cmap='gray')
          plt.title('Tissue Mask')
          plt.axis('off')
    
          plt.subplot(2, 2, 4)
          plt.imshow(filled_tissue)
          plt.title(f'Filled Tissue (Area: {area_pixels} pixels)')
          plt.axis('off')
    
          plt.tight_layout()
    
          base_name = os.path.splitext(os.path.basename(image_path))[0]
          plt.savefig(f"{output_folder}/{base_name}_analysis.png", dpi=300)
          plt.close()
    

    Finally, the calculate_tissue_area function returns the tissue area in pixels, the binary mask, and the RGB image with only the tissue visible.

      return area_pixels, mask, filled_tissue
    

    Example output: a binary image showing the background in black and the total tissue area in white.

    [Figure: binary image showing tissue as white and background as black]

    The next function, process_folder, processes all image files in a given folder and applies the calculate_tissue_area function to each of them. The results are saved to a CSV file.

    #python
    def process_folder(input_folder, output_folder, file_extensions=None):
      if file_extensions is None:
          file_extensions = ['.jpg', '.jpeg', '.png', '.tif', '.tiff']
    
      os.makedirs(output_folder, exist_ok=True)
    
      image_files = []
      for ext in file_extensions:
          image_files.extend(glob.glob(os.path.join(input_folder, f'*{ext}')))
          image_files.extend(glob.glob(os.path.join(input_folder, f'*{ext.upper()}')))
    
      if not image_files:
          print(f"No image files found in {input_folder} with extensions {file_extensions}")
          return None
    
      print(f"Found {len(image_files)} image files to process")
    
      results = []
    
      for img_path in tqdm(image_files, desc="Processing slides"):
          try:
              filename = os.path.basename(img_path)
              area, mask, filled_tissue = calculate_tissue_area(img_path)
    
              img = cv2.imread(img_path)
              base_name = os.path.splitext(filename)[0]
              cv2.imwrite(f"{output_folder}/{base_name}_original.png", img)
    
              if mask is not None:
                  cv2.imwrite(f"{output_folder}/{base_name}_mask.png", mask)
    
              results.append({
                  'Image': filename,
                  'Tissue Area (pixels)': area
              })
    
          except Exception as e:
              print(f"Error processing {img_path}: {e}")
    
      if results:
          df = pd.DataFrame(results)
          csv_path = os.path.join(output_folder, "tissue_areas.csv")
          df.to_csv(csv_path, index=False)
          print(f"Results saved to {csv_path}")
          return df
      else:
          print("No results were generated")
          return None   
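
    As a quick sketch of what this step produces, the CSV written by process_folder has one row per slide with the filename and its area. Here it is read back with the standard csv module; the file contents below are hypothetical.

    ```python
    import csv
    import io

    # Hypothetical contents of tissue_areas.csv (two made-up slides)
    csv_text = """Image,Tissue Area (pixels)
    slide_01.png,182340
    slide_02.png,201118
    """

    rows = list(csv.DictReader(io.StringIO(csv_text)))
    areas = {r["Image"]: int(r["Tissue Area (pixels)"]) for r in rows}
    print(areas["slide_01.png"])  # 182340
    ```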
    

    The final section of this script is the main execution block. It defines the input and output folders, calls the processing function, and prints a confirmation or a warning message depending on whether results were produced.

    #python
    if __name__ == "__main__":
      input_folder = 'Input'
      output_folder = 'Output'
    
      print(f"Processing slides from {input_folder}")
      print(f"Results will be saved to {output_folder}")
    
      df = process_folder(input_folder, output_folder)
    
      if df is not None:
          print("Processing complete!")
          print(f"CSV file with area measurements saved to {output_folder}/tissue_areas.csv")
      else:
          print("No results were generated.")
    

    Now that you have calculated the area of each tissue section in pixels, you can proceed to your desired statistical analyses, normalizing by tissue area.
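
    For example, once you also have per-slide vessel counts, normalization reduces to dividing each count by its slide's tissue area. The counts and areas below are hypothetical.

    ```python
    # Hypothetical vessel counts per slide and areas from tissue_areas.csv
    vessel_counts = {"slide_01.png": 45, "slide_02.png": 61}
    tissue_areas = {"slide_01.png": 182340, "slide_02.png": 201118}

    # Vessels per megapixel of tissue: a size-independent density measure
    density = {name: vessel_counts[name] / tissue_areas[name] * 1e6
               for name in vessel_counts}
    for name, d in density.items():
        print(f"{name}: {d:.1f} vessels per megapixel")
    ```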

  • 3.2 Bayesian Modeling #

    This section presents a complete example of how to apply a hierarchical Bayesian model. In our case, the primary focus of the analysis is the minor axis length of blood vessels, a morphological feature that can vary across treatments and experimental models. Specifically, the goal of this statistical approach is to determine whether different treatments have a significant effect on vessel morphology, while accounting for possible variability introduced by different experimental models. The analysis uses partial pooling, which allows for more robust estimation of effects even in the presence of small sample sizes or hierarchical data structures.

    The full code modeling.R is available in the erivessel repository by the user slrenne. Start by loading the rethinking package, which is built on top of Stan and lets you easily define and fit Bayesian models using the ulam function. The dataset is read from a CSV file.

    #r
    library(rethinking)
    db <- read.csv('./input/db.csv', stringsAsFactors = FALSE)
    

    To prepare the data for modeling, first convert the model and treatment columns to factors. This ensures that they are treated as categorical variables when passed to the model. Then apply a natural logarithmic transformation to the axis_minor_length variable, which is often necessary to normalize right-skewed biological measurements.

    #r
    db$model <- as.factor(db$model)
    db$treatment <- as.factor(db$treatment)
    db$min_log <- log(db$axis_minor_length)
    

    For the next step, you need to structure the data for the model. The ulam function expects a list of named vectors. Here, you standardize the log-transformed minor axis lengths so that they have a mean of 0 and a standard deviation of 1; this helps with model convergence and interpretation. This is also where you convert the factor variables into numeric indices.

    #r
    d <- list(
      axis_length = standardize(db$min_log),
      mod = as.numeric(db$model),
      rx = as.numeric(db$treatment))
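
    The standardize() function from rethinking subtracts the mean and divides by the sample standard deviation. A minimal Python check of that transformation, with made-up log lengths:

    ```python
    import math

    x = [4.0, 4.4, 4.9]  # hypothetical log minor-axis lengths
    m = sum(x) / len(x)
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))  # sample SD, as in R

    z = [(v - m) / sd for v in x]  # standardized values: mean 0, sample SD 1
    print(z)
    ```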
    

    Then, specify a hierarchical Bayesian model using ulam. This model assumes that the standardized log-minor-axis length is normally distributed around a linear predictor mu that is the sum of effects from both the model type and the treatment group. Each group effect is given a group-specific prior with partial pooling. This setup allows you to estimate the overall average effects (a_bar, b_bar) as well as group-level deviations (a[mod], b[rx]), while controlling for uncertainty.

    #r
    m <- ulam(
      alist(
        axis_length ~ dnorm(mu, sigma),
        mu <- a[mod] + b[rx],
        a[mod] ~ dnorm(a_bar, sigma_a),
        b[rx] ~ dnorm(b_bar, sigma_b),
        c(a_bar, b_bar) ~ dnorm(0,0.7),
        c(sigma, sigma_a, sigma_b) ~ dexp(1)
      ), data=d, chains=4, cores=4, iter=1000)
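
    In mathematical notation, the model above is (with y_i the standardized log minor-axis length of vessel i):

    ```latex
    \begin{aligned}
    y_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
    \mu_i &= a_{\mathrm{mod}[i]} + b_{\mathrm{rx}[i]} \\
    a_j &\sim \mathrm{Normal}(\bar{a}, \sigma_a), \qquad
    b_k \sim \mathrm{Normal}(\bar{b}, \sigma_b) \\
    \bar{a}, \bar{b} &\sim \mathrm{Normal}(0, 0.7) \\
    \sigma, \sigma_a, \sigma_b &\sim \mathrm{Exponential}(1)
    \end{aligned}
    ```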
    

    After fitting the model, you can examine summary statistics and convergence diagnostics. The dashboard() function opens an interactive window, while precis() provides tabular summaries of the posterior distributions. You then extract the full posterior samples.

    #r
    dashboard(m)
    precis(m)
    post <- extract.samples(m)
    

    Define a helper function p_link() to compute posterior predictions for any combination of model and treatment. The dest() function reverses the standardization, multiplying by the standard deviation (0.378) and adding the mean (4.44) of the log-transformed data, so that predictions return to the log scale.

    #r
    p_link <- function(mod = 1, rx = 1) {
      x <- with(post, a[, mod] + b[, rx])
      return(x)}
    dest <- function(x) {
      x <- x * 0.378
      x <- x + 4.44
      return(x)}
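
    Since standardize() computes (x - mean) / sd, dest() simply inverts it with the sample's constants (0.378 and 4.44, presumably the SD and mean of db$min_log). A quick round-trip check in Python:

    ```python
    MEAN, SD = 4.44, 0.378  # constants used by the script's dest() helper

    def standardize(x):
        return (x - MEAN) / SD

    def dest(z):
        return z * SD + MEAN  # same operation as the R helper above

    x = 4.9  # hypothetical log minor-axis length
    print(dest(standardize(x)))  # recovers x up to floating-point rounding
    ```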
    

    Next, generate simulated values from the model for a specific combination of inputs, transform those predictions back to the original axis scale, and visualize the results as a posterior predictive check. This plot lets you visually assess whether the model produces values similar to those observed in the dataset.

    #r
    post_mu <- p_link(d$mod, d$rx)
    sim_min <- rnorm(length(d$axis_length), post_mu[1,], 1)
    sim_min <- dest(sim_min)
    plot(NULL, xlim = c(0,200), ylim = c(0,0.03), 
        main = 'Posterior Predictive Check', 
        xlab = 'Minor Axis Size',
        ylab = 'Density')
    dens(db$axis_minor_length, lwd = 3, add = TRUE)
    dens(exp(sim_min), lwd = 3, add = TRUE, col = 2)
    

    To then assess the direct effect of treatments without adjusting for model type, simulate predictions for each treatment level independently.

    #r
    p_link <- function(rx = 1) {
      x <- with(post, rnorm(length(post$a_bar), b[, rx], sigma_b))
      x <- dest(x)
      return(x)
    }
    p_raw_bsim <- sapply(1:4, function(i) p_link(i))
    

    Lastly, compute contrasts between each treatment and the control group, plotting the posterior density of these differences to visualize the magnitude and uncertainty of treatment effects. The resulting plot is saved as a high-resolution PNG file and highlights how much each treatment deviates from the control group in terms of its predicted effect on vessel morphology.

    #r
    png('output/contrastRXnoModel.png', 
        width = 3000, height = 2000, res = 300)
    par(mai = c(1.1, 0.8, 0.8, 0.4) + 0.02)
    col <- c("#332288","#AA4499","#117733")
    s_factor <- 20
    new.lab <- c("DTX", "ERI1", "ERI2")
    plot(NULL, ylim=c(-0.8,1), xlim = c(0.5,3.5),
        xlab = '', ylab = 'Probability Density', xaxt = 'n', 
        main = 'Treatment Contrast vs CTRL')
    abline(h = 0)
    for(i in 1:3) {
      obj <- p_raw_bsim[,i+1] - p_raw_bsim[,1]
      y <- density(obj)$x
      x <- density(obj)$y
      polygon(i + x/s_factor, y, col = scales::alpha(col[[i]],0.6), border = FALSE)
      lines(i + x/s_factor, y, lwd = 1)
      polygon(i - x/s_factor, y, col = scales::alpha(col[[i]],0.6), lwd = 2, border = FALSE)
      lines(i - x/s_factor, y, lwd = 1)
    }
    axis(1, at = 1:3, labels = FALSE)
    text(x = 1:3+0.05, -0.92, labels = new.lab, srt=40, adj=1, xpd=TRUE)
    dev.off()
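
    The contrast computed inside the loop is simply the elementwise difference between two columns of posterior draws. A minimal sketch of that idea with simulated draws (all numbers hypothetical):

    ```python
    import random

    random.seed(42)
    n_draws = 1000
    # Hypothetical posterior draws for CTRL and one treatment (log scale)
    ctrl = [random.gauss(4.40, 0.10) for _ in range(n_draws)]
    dtx = [random.gauss(4.60, 0.10) for _ in range(n_draws)]

    # The contrast: per-draw difference, summarizing how far the treatment
    # shifts the predicted minor-axis length relative to control
    contrast = [t - c for t, c in zip(dtx, ctrl)]
    mean_contrast = sum(contrast) / len(contrast)
    print(f"posterior mean contrast: {mean_contrast:.3f}")
    ```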
    

    In conclusion, this code shows you how to move from raw morphological measurements to robust statistical inferences using hierarchical Bayesian modeling. The structure is modular and can be adapted to other vessel features (e.g., area, eccentricity) or extended to include spatial variables as covariates.