peak_performance.pipeline#

The pipeline module defines functions for analyzing hundreds of peaks in a data pipeline.

Defines steps for a pipeline to process LC-MS-MS data.

exception peak_performance.pipeline.InputError#

Bases: Exception

Base type of exceptions related to information given by the user.

exception peak_performance.pipeline.ParsingError#

Bases: Exception

Base type of parsing exceptions.

class peak_performance.pipeline.UserInput(path: str | PathLike, files: Sequence[str], raw_data_file_format: str, peak_model: Sequence[str], retention_time_estimate: Sequence[float] | Sequence[int], peak_width_estimate: float | int, pre_filtering: bool, minimum_sn: float | int, timeseries: ndarray, acquisition: str, precursor: float | int, product_mz_start: float | int, product_mz_end: float | int)#

Bases: object

Collect all information required from the user and format it in the correct manner.

Attributes:
acquisition

Getting the value of the acquisition attribute (name of a single acquisition).

precursor

Getting the value of the precursor attribute which can be one of the following: Either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.

product_mz_end

Getting the value of the product_mz_end attribute.

product_mz_start

Getting the value of the product_mz_start attribute.

timeseries

Getting the value of the timeseries attribute.

user_info

Create a dictionary with the necessary user information based on the class attributes.

property acquisition#

Getting the value of the acquisition attribute (name of a single acquisition).

property precursor#

Getting the value of the precursor attribute, which can be either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.

property product_mz_end#

Getting the value of the product_mz_end attribute. (End of the mass to charge ratio range of the product ion in the TOF.)

property product_mz_start#

Getting the value of the product_mz_start attribute.

property timeseries#

Getting the value of the timeseries attribute. (NumPy Array containing time (at first position) and intensity (at second position) data as NumPy arrays.)

property user_info#

Create a dictionary with the necessary user information based on the class attributes.

peak_performance.pipeline.detect_raw_data(path: str | PathLike, *, data_type: str = '.npy')#

Detect all files of the given data type (default: .npy) with time and intensity data for peaks in a given directory.

Parameters:
path

Path to the folder containing raw data.

data_type

Data format of the raw data files (e.g. ‘.npy’).

Returns:
files

List with names of all files of the specified data type in path.
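
A minimal sketch of this kind of suffix-based file detection, using only the standard library. The helper name list_files_with_suffix and the example file names are hypothetical and not part of PeakPerformance; the actual implementation may differ.

```python
import tempfile
from pathlib import Path


def list_files_with_suffix(path, data_type=".npy"):
    """Return the sorted names of all files in `path` ending with `data_type`.

    Hypothetical helper for illustration, not the PeakPerformance function.
    """
    folder = Path(path)
    return sorted(
        entry.name
        for entry in folder.iterdir()
        if entry.is_file() and entry.name.endswith(data_type)
    )


# Usage: a throwaway directory with two .npy files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("A1_1_100_101.npy", "A2_1_100_101.npy", "notes.txt"):
        Path(tmp, name).touch()
    detected = list_files_with_suffix(tmp)
```

Only the file names are returned, not full paths, which matches the description of the files return value above.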

peak_performance.pipeline.excel_template_prepare(path_raw_data: str | PathLike, path_peak_performance: str | PathLike, raw_data_files: List[str] | Tuple[str], unique_identifiers: List[str] | Tuple[str])#

Function to copy Template.xlsx from the PeakPerformance directory to the directory containing the raw data files. Subsequently, update Template.xlsx with a list of all raw data files and of all unique_identifiers.

Parameters:
path_raw_data

Path to the folder containing raw data.

path_peak_performance

Path to the folder containing PeakPerformance.

raw_data_files

List with names of all files of the specified data type in path_raw_data.

unique_identifiers

List with all unique combinations of targeted molecules (i.e. experiment number or precursor ion m/z ratio and product ion m/z ratio range).

peak_performance.pipeline.initiate(path: str | PathLike, *, run_dir: str = '')#

Create a folder for the results, a zip file inside that folder, and the df_summary DataFrame.

Parameters:
path

Path to the directory containing the raw data.

run_dir

Name of the directory created to store the results of the current run (default: current date and time).

Returns:
df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

path

Updated path variable pointing to the newly created folder for this batch.
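
The folder-creation step can be sketched as follows. This is an illustration under assumptions: the helper name make_run_directory and the timestamp format are hypothetical, and the zip file and df_summary created by initiate() are left out.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_directory(path, run_dir=""):
    """Create and return a results folder inside `path`.

    If `run_dir` is empty, fall back to a name built from the current date
    and time (the exact format used here is an assumption).
    """
    if not run_dir:
        run_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    results = Path(path) / run_dir
    # exist_ok=False: never silently reuse the folder of an earlier run.
    results.mkdir(parents=True, exist_ok=False)
    return results


with tempfile.TemporaryDirectory() as tmp:
    named = make_run_directory(tmp, run_dir="batch_01")
    named_exists = named.is_dir()
    stamped = make_run_directory(tmp)
    stamped_exists = stamped.is_dir()
```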

peak_performance.pipeline.model_selection(path_raw_data: str | PathLike, *, ic: str = 'loo')#

Method to select the best model for every signal (i.e. combination of experiment number or precursor ion m/z ratio and product ion m/z ratio). This is realized by analyzing one representative sample of the batch with all models and comparing the results based on an information criterion.

Parameters:
path_raw_data

Path to the folder containing raw data.

ic

Information criterion to be used for model selection (“loo”: Pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion).

Returns:
comparison_results

DataFrame containing all rankings from model selection.

model_dict

Dict with unique identifiers as keys and model types as values.

peak_performance.pipeline.model_selection_check(result_df: DataFrame, ic: str, elpd_threshold: str | float = 35) str#

During model selection, double peak models are sometimes incorrectly preferred due to their increased complexity. Therefore, they have to outperform single peak models by an empirically determined value of the elpd.

Parameters:
result_df

DataFrame with the result of model comparison via az.compare().

ic

Information criterion to be used for model selection (“loo”: Pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion).

elpd_threshold

Threshold of the elpd difference between a double and a single peak model for the double peak model to be accepted.

Returns:
selected_model

Name of the selected model type.
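
The threshold rule described above can be sketched with a plain dictionary instead of the az.compare() DataFrame. This is a simplified illustration, not the actual implementation: the function name select_with_threshold is hypothetical, the model names are illustrative, double peak models are assumed to carry "double" in their name, and elpd values are assumed to be on a scale where higher is better.

```python
def select_with_threshold(elpd_by_model, elpd_threshold=35):
    """Pick a model, accepting a double peak model only if it beats the best
    single peak model by more than `elpd_threshold`.

    `elpd_by_model` maps model names to elpd values (higher is better).
    Assumption: double peak models have "double" in their name.
    """
    singles = {m: v for m, v in elpd_by_model.items() if "double" not in m}
    doubles = {m: v for m, v in elpd_by_model.items() if "double" in m}
    best_single = max(singles, key=singles.get) if singles else None
    best_double = max(doubles, key=doubles.get) if doubles else None
    if best_single is None:
        return best_double
    if best_double is None:
        return best_single
    # The double peak model must clear the threshold, not merely rank first.
    if doubles[best_double] - singles[best_single] > elpd_threshold:
        return best_double
    return best_single


# Double peak model leads by 20 elpd units: below the threshold of 35.
close_call = select_with_threshold(
    {"normal": -510.0, "skew_normal": -500.0, "double_normal": -480.0}
)
# Double peak model leads by 50 elpd units: above the threshold.
clear_win = select_with_threshold(
    {"normal": -510.0, "skew_normal": -500.0, "double_normal": -450.0}
)
```

In the first call the single peak model is kept even though the double peak model has the higher elpd, which is the behavior the threshold is meant to enforce.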

peak_performance.pipeline.parse_data(path: str | PathLike, filename: str, raw_data_file_format: str) Tuple[ndarray, str, float, float, float]#

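Load the time series from a raw data file and extract the acquisition name, the precursor, and the product ion m/z range belonging to it.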

Parameters:
path

Path to the raw data files.

filename

Name of a raw data file containing a NumPy array with a time series (time as first, intensity as second element of the array).

raw_data_file_format

Data format (suffix) of the raw data (e.g. ‘.npy’).

Returns:
timeseries

Updated NumPy array containing time and intensity data as NumPy arrays in first and second row, respectively. NaN values have been replaced with zeroes.

acquisition

Name of a single acquisition.

precursor

Can be one of the following: Either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.

product_mz_start

Start of the mass to charge ratio range of the product ion in the TOF.

product_mz_end

End of the mass to charge ratio range of the product ion in the TOF.
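
As an illustration of how the acquisition, precursor, and product m/z range could be recovered from a standardized file name, here is a minimal sketch. It assumes a name layout of `<acquisition>_<precursor>_<product m/z start>_<product m/z end>` followed by the suffix; that layout, the helper name split_file_name, and the example file name are assumptions for this sketch, and loading the NumPy array itself is left out.

```python
def split_file_name(filename, raw_data_file_format=".npy"):
    """Split an assumed `<acquisition>_<precursor>_<mz start>_<mz end>` name.

    Hypothetical helper; the name layout is an assumption of this sketch.
    """
    if not filename.endswith(raw_data_file_format):
        raise ValueError(
            f"{filename!r} does not end with {raw_data_file_format!r}"
        )
    stem = filename[: len(filename) - len(raw_data_file_format)]
    # Split from the right so underscores in the acquisition name survive.
    parts = stem.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"{filename!r} does not have four name parts")
    acquisition, precursor, mz_start, mz_end = parts
    return acquisition, float(precursor), float(mz_start), float(mz_end)


parsed = split_file_name("A1t1R1Part2_110_109.9_110.1.npy")
with_underscore = split_file_name("run_7_4_85.9_86.1.npy")
```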

peak_performance.pipeline.parse_files_for_model_selection(signals: DataFrame) Dict[str, str]#

Function to parse the file names for model selection.

Parameters:
signals

DataFrame containing the signals tab of Template.xlsx.

Returns:
files_for_selection

Dict with file names as keys and unique identifiers as values.

peak_performance.pipeline.parse_unique_identifiers(raw_data_files: Sequence[str]) List[str]#

Get a set of all mass traces based on the standardized raw data file names (excluding acquisitions). Used to automatically fill out the unique_identifiers column in the Template.xlsx’ signals tab.

Parameters:
raw_data_files

Names of all files of the specified data type in path_raw_data.

Returns:
unique_identifiers

List with all unique combinations of targeted molecules (i.e. experiment number or precursor ion m/z ratio and product ion m/z ratio range).
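
A minimal sketch of collecting unique identifiers from file names. It assumes the file names follow an `<acquisition>_<precursor>_<product m/z start>_<product m/z end>` layout plus a suffix, and that the identifier is everything after the acquisition part; both the layout and the helper name unique_identifiers_from_names are assumptions, not confirmed details of PeakPerformance.

```python
def unique_identifiers_from_names(raw_data_files):
    """Return the sorted unique mass trace identifiers of the given files.

    Assumption: names look like `<acquisition>_<precursor>_<start>_<end>.<ext>`
    and the identifier is the name without acquisition and suffix.
    """
    identifiers = set()
    for name in raw_data_files:
        stem = name.rsplit(".", 1)[0]  # drop the file suffix
        parts = stem.rsplit("_", 3)  # keep underscores in the acquisition
        if len(parts) != 4:
            raise ValueError(f"{name!r} does not have four name parts")
        identifiers.add("_".join(parts[1:]))
    return sorted(identifiers)


identifiers = unique_identifiers_from_names(
    [
        "A1_110_109.9_110.1.npy",
        "A2_110_109.9_110.1.npy",
        "A1_4_85.9_86.1.npy",
    ]
)
```

Two acquisitions of the same mass trace collapse into a single identifier, so the three files above yield two entries.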

peak_performance.pipeline.pipeline(path_raw_data: str | PathLike, raw_data_file_format: str)#

Function to run the complete PeakPerformance pipeline.

Parameters:
path_raw_data

Path to the raw data files. Files should be in the given raw_data_file_format (e.g. ‘.npy’). The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.

raw_data_file_format

Data format (suffix) of the raw data (e.g. ‘.npy’).

Returns:
path_results

Path variable pointing to the newly created folder for this batch.

peak_performance.pipeline.pipeline_loop(path_raw_data: str | PathLike, path_results: str | PathLike, raw_data_file_format: str, df_summary: DataFrame, *, restart: bool = False)#

Function to run the complete PeakPerformance pipeline.

Parameters:
path_raw_data

Path to the raw data files. Files should be in the given raw_data_file_format (e.g. ‘.npy’). The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.

path_results

Path to the directory for the results of a given batch run of PeakPerformance.

raw_data_file_format

Data format (suffix) of the raw data (e.g. ‘.npy’).

df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

restart

If a pipeline broke for some reason, it can be restarted by setting restart to True. That way, already analyzed files won’t be analyzed again.

peak_performance.pipeline.pipeline_read_template(path_raw_data: str | PathLike)#

Function to read and check the input settings and data from Template.xlsx when running the data pipeline.

Parameters:
path_raw_data

Path to the raw data files. Files should be in the raw data file format used for the pipeline (e.g. ‘.npy’). The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.

Returns:
pre_filtering

If True, potential peaks will be filtered based on retention time and signal to noise ratio before sampling.

plotting

If True, PeakPerformance will plot results.

peak_width_estimate

Rough estimate of the average peak width in minutes expected for the LC-MS method with which the data was obtained.

minimum_sn

Minimum signal to noise ratio for a signal to be recognized as a peak during pre-filtering.

df_signals

Read-out of the signals tab from Template.xlsx as a DataFrame.

unique_identifiers

List of unique identifiers from the signals tab of Template.xlsx.

peak_performance.pipeline.pipeline_restart(path_raw_data: str | PathLike, raw_data_file_format: str, path_results: str | PathLike)#

Function to restart a broken PeakPerformance pipeline. Files which are in the results directory of the broken pipeline will not be analyzed again. WARNING: This only works once! If a pipeline fails more than once, copy all files (except the Excel report sheets) into one directory and specify this directory as the path_results argument.

Parameters:
path_raw_data

Path to the raw data files. Files should be in the given raw_data_file_format (e.g. ‘.npy’). The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.

raw_data_file_format

Data format (suffix) of the raw data (e.g. ‘.npy’).

path_results

Path variable pointing to the directory of the broken PeakPerformance batch.

Returns:
path_results_new

Path variable pointing to the newly created folder for the restarted batch.
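
The idea of skipping already analyzed files on restart can be sketched as a comparison of file stems. This assumes that every analyzed signal left a result file (for example a saved .nc file) with the same stem as its raw data file; that naming convention and the helper name files_still_to_analyze are assumptions of this sketch.

```python
from pathlib import Path


def files_still_to_analyze(raw_data_files, result_files):
    """Return the raw data files without a matching result file.

    Assumption: a result file shares its stem with the raw data file.
    """
    done = {Path(name).stem for name in result_files}
    return [name for name in raw_data_files if Path(name).stem not in done]


remaining = files_still_to_analyze(
    ["A1_1_100_101.npy", "A2_1_100_101.npy", "A3_1_100_101.npy"],
    ["A1_1_100_101.nc", "summary.xlsx"],
)
```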

peak_performance.pipeline.posterior_predictive_sampling(pmodel, idata)#

Performs posterior predictive sampling for signals recognized as peaks.

Parameters:
pmodel

A PyMC model.

idata

Previously sampled inference data object.

Returns:
idata

Inference data object updated with the posterior predictive samples.

peak_performance.pipeline.postfiltering(filename: str, idata, ui: UserInput, df_summary: DataFrame)#

Method to filter out false positive peaks after sampling based on the obtained uncertainties of several peak parameters.

Parameters:
filename

Name of the raw data file.

idata

Inference data object resulting from sampling.

ui

Instance of the UserInput class.

df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

Returns:
acceptance

True if the signal was accepted as a peak; the data is then saved and the pipeline continues with the next signal. False if the signal was not accepted as a peak; it is then either re-sampled with more tuning samples or discarded.

resample

True: re-sample with more tuning samples, False: don’t.

discard

True: discard sample.
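
How a caller might act on the three return values can be sketched as below. The function name next_step and the returned labels are hypothetical and only illustrate the decision order implied by the descriptions above.

```python
def next_step(acceptance, resample, discard):
    """Translate the three post-filtering flags into one action label."""
    if acceptance:
        return "save"  # accepted as a peak: store results, go to next signal
    if resample:
        return "resample"  # try again with more tuning samples
    if discard:
        return "discard"  # give up on this signal
    raise ValueError("rejected signal must be either re-sampled or discarded")


accepted = next_step(True, False, False)
retry = next_step(False, True, False)
dropped = next_step(False, False, True)
```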

peak_performance.pipeline.prefiltering(filename: str, ui: UserInput, noise_width_guess: float, df_summary: DataFrame)#

Optional method to skip signals where clearly no peak is present. Saves a lot of computation time.

Parameters:
filename

Name of the raw data file.

ui

Instance of the UserInput class.

noise_width_guess

Estimated width of the noise of a particular measurement.

Returns:
found_peak

True, if any peak candidate was found within the time frame; False, if not.

df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.prepare_model_selection(path_raw_data: str | PathLike, path_template: str | PathLike)#

Function to prepare model selection by providing and mostly filling out an Excel template Template.xlsx. After this step, the user has to provide relevant information in Template.xlsx which is finally used for model selection.

Parameters:
path_raw_data

Path to the folder containing raw data.

path_template

Path to the folder containing Template.xlsx from PeakPerformance.

peak_performance.pipeline.report_add_data_to_summary(filename: str, idata, df_summary: DataFrame, ui: UserInput, is_peak: bool, rejection_cause: str = '')#

Extracts the relevant information from idata, concatenates it to the summary DataFrame, and saves the DataFrame as an Excel file. Error handling prevents the pipeline from stopping if saving fails (e.g. because the file is open in another program).

Parameters:
idata

Inference data object resulting from sampling.

df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

ui

Instance of the UserInput class.

is_peak

Boolean stating whether a signal was recognized as a peak (True) or not (False).

rejection_cause

Cause for rejecting a given signal.

Returns:
df_summary

Updated DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
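
The error-tolerant saving described above can be sketched like this. To stay dependency-free the sketch writes a CSV file instead of an Excel file, and the helper name try_save_rows is hypothetical; it only illustrates catching a failed write so that a loop over many signals can continue.

```python
import csv
import tempfile
from pathlib import Path


def try_save_rows(rows, target):
    """Write `rows` to `target` as CSV; report failure instead of raising."""
    try:
        with open(target, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
    except OSError as error:
        # Covers locked files (PermissionError) and missing folders alike.
        print(f"Could not save {target}: {error}")
        return False
    return True


rows = [["filename", "area"], ["A1_1_100_101.npy", 1234.5]]
with tempfile.TemporaryDirectory() as tmp:
    saved = try_save_rows(rows, Path(tmp) / "summary.csv")
    # Writing into a folder that does not exist fails but does not raise.
    failed = try_save_rows(rows, Path(tmp) / "missing" / "summary.csv")
```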

peak_performance.pipeline.report_add_nan_to_summary(filename: str, ui: UserInput, df_summary: DataFrame, rejection_cause: str)#

Method to add NaN values to the summary DataFrame in case a signal did not contain a peak.

Parameters:
ui

Instance of the UserInput class.

df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

rejection_cause

Cause for rejecting a given signal.

Returns:
df_summary

Updated DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.report_area_sheet(path: str | PathLike, df_summary: DataFrame)#

Save a different, more minimalist report sheet focussing on the area data.

Parameters:
path

Path to the directory containing the raw data.

df_summary

DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.report_save_idata(idata, ui: UserInput, filename: str)#

Saves inference data object as a .nc file.

Parameters:
idata

Inference data object resulting from sampling.

ui

Instance of the UserInput class.

filename

Name of a raw data file containing a NumPy array with a time series (time as first, intensity as second element of the array).

peak_performance.pipeline.sampling(pmodel, **sample_kwargs)#

Performs sampling.

Parameters:
pmodel

A PyMC model.

**sample_kwargs

The keyword arguments are used in pm.sample().

tune

Number of tuning samples (default = 2000).

draws

Number of samples after tuning (default = 2000).

Returns:
idata

Inference data object.

peak_performance.pipeline.selected_models_to_template(path_raw_data: str | PathLike, signals: DataFrame, model_dict: Mapping[str, str])#

Function to update Template.xlsx with the selected model types.

Parameters:
path_raw_data

Path to the folder containing raw data.

signals

DataFrame containing the signals tab of Template.xlsx.

model_dict

Dict with unique identifiers as keys and model types as values.

peak_performance.pipeline.selection_loop(path_raw_data: str | PathLike, *, files_for_selection: Mapping[str, str], raw_data_files: List[str] | Tuple[str], ic: str, signals: DataFrame)#

Function containing the loop over all filenames intended for the model selection. Involves sampling every model featured by PeakPerformance, computing the log-likelihood and an information criterion, and comparing the results to ascertain the best model for every file.

Parameters:
path_raw_data

Path to the folder containing raw data.

files_for_selection

Dict with file names as keys and unique identifiers as values.

raw_data_files

List of raw data files returned by the detect_raw_data() function. Needed here only to determine the file format.

ic

Information criterion to be used for model selection (“loo”: Pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion).

Returns:
result_df

DataFrame containing the ranking and scores of the model selection.

model_dict

Dict with unique identifiers as keys and model types as values.