peak_performance.pipeline

`peak_performance.pipeline`#

The pipeline module defines functions for analyzing hundreds of peaks in a data pipeline.

Defines steps for a pipeline to process LC-MS-MS data.

exception peak_performance.pipeline.InputError#

Bases: Exception

Base type of exceptions related to information given by the user.

exception peak_performance.pipeline.ParsingError#

Bases: Exception

Base type of parsing exceptions.

class peak_performance.pipeline.UserInput(path: str | PathLike, files: Sequence[str], raw_data_file_format: str, peak_model: Sequence[str], retention_time_estimate: Sequence[float] | Sequence[int], peak_width_estimate: float | int, pre_filtering: bool, minimum_sn: float | int, timeseries: ndarray, acquisition: str, precursor: float | int, product_mz_start: float | int, product_mz_end: float | int)#

Bases: object

Collect all information required from the user and format it correctly.

Attributes:

acquisition: Getting the value of the acquisition attribute (name of a single acquisition).
precursor: Getting the value of the precursor attribute, which is either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.
product_mz_end: Getting the value of the product_mz_end attribute.
product_mz_start: Getting the value of the product_mz_start attribute.
timeseries: Getting the value of the timeseries attribute.
user_info: Create a dictionary with the necessary user information based on the class attributes.

property acquisition#: Getting the value of the acquisition attribute (name of a single acquisition).

property precursor#: Getting the value of the precursor attribute, which is either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.

property product_mz_end#: Getting the value of the product_mz_end attribute. (End of the mass to charge ratio range of the product ion in the TOF.)

property product_mz_start#: Getting the value of the product_mz_start attribute.

property timeseries#: Getting the value of the timeseries attribute. (NumPy Array containing time (at first position) and intensity (at second position) data as NumPy arrays.)

property user_info#: Create a dictionary with the necessary user information based on the class attributes.

peak_performance.pipeline.detect_raw_data(path: str | PathLike, *, data_type: str = '.npy')#

Detect all .npy files with time and intensity data for peaks in a given directory.

Parameters:

path: Path to the folder containing raw data.
data_type: Data format of the raw data files (e.g. ‘.npy’).

Returns:

files: List with names of all files of the specified data type in path.
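
The detection step can be approximated with the standard library alone. The following is a minimal sketch, not the PeakPerformance implementation: the helper name `list_files_with_suffix` is hypothetical, it assumes that the file suffix alone decides whether a file counts as raw data, and sorting the names is a choice made here for reproducible output.

```python
import os
from pathlib import Path
from typing import List, Union


def list_files_with_suffix(
    path: Union[str, os.PathLike], *, data_type: str = ".npy"
) -> List[str]:
    """Return the sorted names of all files in `path` whose suffix equals `data_type`."""
    folder = Path(path)
    return sorted(
        entry.name
        for entry in folder.iterdir()
        # Skip sub-directories and files with any other suffix.
        if entry.is_file() and entry.suffix == data_type
    )
```

Only file names are returned, not full paths, which matches the description of `files` above.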

peak_performance.pipeline.excel_template_prepare(path_raw_data: str | PathLike, path_peak_performance: str | PathLike, raw_data_files: List[str] | Tuple[str], unique_identifiers: List[str] | Tuple[str])#

Function to copy Template.xlsx from the peak performance directory to the directory containing the raw data files. Subsequently, update Template.xlsx with a list of all raw data files and of all unique_identifiers.

Parameters:

path_raw_data: Path to the folder containing raw data.
path_peak_performance: Path to the folder containing PeakPerformance.
raw_data_files: List with names of all files of the specified data type in path_raw_data.
unique_identifiers: List with all unique combinations of targeted molecules. (i.e. experiment number or precursor ion m/z ratio and product ion m/z ratio range)

peak_performance.pipeline.initiate(path: str | PathLike, *, run_dir: str = '')#

Create a folder for the results, a zip file inside that folder, and the df_summary DataFrame.

Parameters:

path: Path to the directory containing the raw data.
run_dir: Name of the directory created to store the results of the current run (default: current date and time).

Returns:

df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
path: Updated path variable pointing to the newly created folder for this batch.
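
The folder and archive creation can be pictured roughly as follows. This is a sketch under stated assumptions rather than the library's code: the function name `create_run_directory`, the timestamp format, and the archive name `idata.zip` are all hypothetical, and the creation of df_summary is left out.

```python
import zipfile
from datetime import datetime
from pathlib import Path


def create_run_directory(path, *, run_dir: str = "") -> Path:
    """Create a results folder inside `path` and an empty zip archive inside it."""
    if not run_dir:
        # Assumed timestamp format; the format PeakPerformance uses is not documented here.
        run_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    results = Path(path) / run_dir
    # Fail loudly instead of silently reusing the folder of an earlier run.
    results.mkdir(parents=True, exist_ok=False)
    # "idata.zip" is a hypothetical archive name used only for illustration.
    with zipfile.ZipFile(results / "idata.zip", mode="w"):
        pass
    return results
```

Returning the new folder mirrors the updated `path` value described above.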

peak_performance.pipeline.model_selection(path_raw_data: str | PathLike, *, ic: str = 'loo')#

Method to select the best model for every signal (i.e. combination of experiment number or precursor ion m/z ratio and product ion m/z ratio). This is realized by analyzing one representative sample of the batch with all models and comparing the results based on an information criterion.

Parameters:

path_raw_data: Path to the folder containing raw data.
ic: Information criterion to be used for model selection. (“loo”: Pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion)

Returns:

comparison_results: DataFrame containing all rankings from model selection.
model_dict: Dict with unique identifiers as keys and model types as values.

peak_performance.pipeline.model_selection_check(result_df: DataFrame, ic: str, elpd_threshold: str | float = 35) → str#

During model selection, double peak models are sometimes incorrectly preferred due to their increased complexity. Therefore, they have to outperform single peak models by an empirically determined value of the elpd.

Parameters:

result_df: DataFrame with the result of model comparison via az.compare().
ic: Information criterion to be used for model selection. (“loo”: Pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion)
elpd_threshold: Threshold of the elpd difference between a double and a single peak model for the double peak model to be accepted.

Returns:

selected_model: Name of the selected model type.
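
The threshold rule can be illustrated without ArviZ. The sketch below is not the library's implementation: it takes a plain list of (model name, elpd) pairs instead of the comparison DataFrame, the helper name `select_with_threshold` is hypothetical, and it assumes that double peak models can be recognized by the substring "double" in their name, that a higher elpd is better, and that reaching the threshold exactly is sufficient.

```python
from typing import List, Tuple


def select_with_threshold(
    ranking: List[Tuple[str, float]], elpd_threshold: float = 35.0
) -> str:
    """Pick a model from `ranking`, a list of (model name, elpd) pairs sorted best first.

    Assumption: names of double peak models contain the substring "double".
    """
    best_name, best_elpd = ranking[0]
    if "double" not in best_name:
        # A single peak model ranked first, so no extra check is needed.
        return best_name
    singles = [(name, elpd) for name, elpd in ranking if "double" not in name]
    if not singles:
        return best_name
    single_name, single_elpd = singles[0]
    # Accept the double peak model only if its elpd lead reaches the threshold.
    if best_elpd - single_elpd >= elpd_threshold:
        return best_name
    return single_name
```

With the default threshold of 35, a double peak model that leads the best single peak model by an elpd difference of 10 would be rejected in favor of the single peak model.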

peak_performance.pipeline.parse_data(path: str | PathLike, filename: str, raw_data_file_format: str) → Tuple[ndarray, str, float, float, float]#

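Load the time series from a raw data file and return it together with the acquisition name, the precursor, and the product m/z range.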

Parameters:

path: Path to the raw data files.
filename: Name of a raw data file containing a NumPy array with a time series (time as first, intensity as second element of the array).
raw_data_file_format: Data format (suffix) of the raw data, default is ‘.npy’.

Returns:

timeseries: Updated NumPy array containing time and intensity data as NumPy arrays in first and second row, respectively. NaN values have been replaced with zeroes.
acquisition: Name of a single acquisition.
precursor: Can be one of the following: Either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.
product_mz_start: Start of the mass to charge ratio range of the product ion in the TOF.
product_mz_end: End of the mass to charge ratio range of the product ion in the TOF.
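
How the metadata could be recovered from a file name is sketched below. The naming scheme `<acquisition>_<precursor>_<product_mz_start>_<product_mz_end><suffix>` is an assumption made for this example, as is the helper name `split_file_name`; loading the array and replacing NaN values are not shown.

```python
from typing import Tuple


def split_file_name(
    filename: str, raw_data_file_format: str = ".npy"
) -> Tuple[str, float, float, float]:
    """Split an assumed `<acquisition>_<precursor>_<start>_<end><suffix>` file name."""
    if not filename.endswith(raw_data_file_format):
        raise ValueError(f"{filename!r} does not end with {raw_data_file_format!r}")
    stem = filename.removesuffix(raw_data_file_format)  # requires Python 3.9+
    # rsplit keeps underscores that belong to the acquisition name itself.
    acquisition, precursor, mz_start, mz_end = stem.rsplit("_", 3)
    return acquisition, float(precursor), float(mz_start), float(mz_end)
```

A name with fewer than four underscore-separated fields raises a `ValueError` during unpacking, so malformed names are not silently accepted.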

peak_performance.pipeline.parse_files_for_model_selection(signals: DataFrame) → Dict[str, str]#

Function to parse the file names for model selection.

Parameters:

signals: DataFrame containing the signals tab of Template.xlsx.

Returns:

files_for_selection: Dict with file names as keys and unique identifiers as values.

peak_performance.pipeline.parse_unique_identifiers(raw_data_files: Sequence[str]) → List[str]#

Get a set of all mass traces based on the standardized raw data file names (excluding acquisitions). Used to automatically fill out the unique_identifiers column in the signals tab of Template.xlsx.

Parameters:

raw_data_files: Names of all files of the specified data type in path_raw_data.

Returns:

unique_identifiers: List with all unique combinations of targeted molecules. (i.e. experiment number or precursor ion m/z ratio and product ion m/z ratio range)
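
Dropping the acquisition from each file name and de-duplicating the remainder might look like this. It is a sketch only: the naming scheme `<acquisition>_<precursor>_<product_mz_start>_<product_mz_end><suffix>`, the helper name `unique_mass_traces`, and the `suffix` parameter are assumptions made for illustration.

```python
from typing import List, Sequence


def unique_mass_traces(raw_data_files: Sequence[str], suffix: str = ".npy") -> List[str]:
    """Collect the distinct mass traces, i.e. file names without acquisition and suffix."""
    identifiers = set()
    for name in raw_data_files:
        stem = name.removesuffix(suffix)  # requires Python 3.9+
        # Drop the leading acquisition field; keep precursor and product m/z range.
        _, precursor, mz_start, mz_end = stem.rsplit("_", 3)
        identifiers.add("_".join((precursor, mz_start, mz_end)))
    # Sorting is a choice made here so that the result is reproducible.
    return sorted(identifiers)
```

Two acquisitions that measured the same mass trace therefore contribute a single identifier.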

peak_performance.pipeline.pipeline(path_raw_data: str | PathLike, raw_data_file_format: str)#

Function to run the complete PeakPerformance pipeline.

Parameters:

path_raw_data: Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
raw_data_file_format: Data format (suffix) of the raw data, default is ‘.npy’.

Returns:

path_results: Path variable pointing to the newly created folder for this batch.

peak_performance.pipeline.pipeline_loop(path_raw_data: str | PathLike, path_results: str | PathLike, raw_data_file_format: str, df_summary: DataFrame, *, restart: bool = False)#

Function to run the complete PeakPerformance pipeline.

Parameters:

path_raw_data: Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
path_results: Path to the directory for the results of a given batch run of PeakPerformance.
raw_data_file_format: Data format (suffix) of the raw data, default is ‘.npy’.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
restart: If a pipeline broke for some reason, it can be restarted by setting restart to True. That way, already analyzed files won’t be analyzed again.

peak_performance.pipeline.pipeline_read_template(path_raw_data: str | PathLike)#

Function to read and check the input settings and data from Template.xlsx when running the data pipeline.

Parameters:

path_raw_data: Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.

Returns:

pre_filtering: If True, potential peaks will be filtered based on retention time and signal to noise ratio before sampling.
plotting: If True, PeakPerformance will plot results.
peak_width_estimate: Rough estimate of the average peak width in minutes expected for the LC-MS method with which the data was obtained.
minimum_sn: Minimum signal to noise ratio for a signal to be recognized as a peak during pre-filtering.
df_signals: Read-out of the signals tab from Template.xlsx as a DataFrame.
unique_identifiers: List of unique identifiers from the signals tab of Template.xlsx.

peak_performance.pipeline.pipeline_restart(path_raw_data: str | PathLike, raw_data_file_format: str, path_results: str | PathLike)#

Function to restart a broken PeakPerformance pipeline. Files which are in the results directory of the broken pipeline will not be analyzed again. WARNING: This only works once! If a pipeline fails more than once, copy all files (except the Excel report sheets) into one directory and specify this directory as the path_results argument.

Parameters:

path_raw_data: Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
raw_data_file_format: Data format (suffix) of the raw data, default is ‘.npy’.
path_results: Path variable pointing to the directory of the broken PeakPerformance batch.

Returns:

path_results_new: Path variable pointing to the newly created folder for the restarted batch.
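
The idea of skipping files that already have results can be sketched as a simple set lookup. This is not how PeakPerformance necessarily tracks finished files: the helper name `files_still_to_analyze` and the convention that a finished file leaves a result named `<raw file stem>.nc` in the results directory are assumptions made for this example.

```python
from pathlib import Path
from typing import List, Sequence


def files_still_to_analyze(
    raw_data_files: Sequence[str], path_results, *, result_suffix: str = ".nc"
) -> List[str]:
    """Return the raw data files without a matching result file in `path_results`.

    Assumption: a finished file leaves a result named `<raw file stem><result_suffix>`.
    """
    finished = {
        entry.stem
        for entry in Path(path_results).iterdir()
        if entry.suffix == result_suffix
    }
    # Keep the original order of the raw data files.
    return [name for name in raw_data_files if Path(name).stem not in finished]
```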

peak_performance.pipeline.posterior_predictive_sampling(pmodel, idata)#

Performs posterior predictive sampling for signals recognized as peaks.

Parameters:

pmodel: A PyMC model.
idata: Previously sampled inference data object.

Returns:

idata: Inference data object updated with the posterior predictive samples.

peak_performance.pipeline.postfiltering(filename: str, idata, ui: UserInput, df_summary: DataFrame)#

Method to filter out false positive peaks after sampling based on the obtained uncertainties of several peak parameters.

Parameters:

filename: Name of the raw data file.
idata: Inference data object resulting from sampling.
ui: Instance of the UserInput class.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

Returns:

acceptance: True if the signal was accepted as a peak -> save data and continue with next signal. False if the signal was not accepted as a peak -> re-sampling with more tuning samples or discard signal.
resample: True: re-sample with more tuning samples, False: don’t.
discard: True: discard sample.

peak_performance.pipeline.prefiltering(filename: str, ui: UserInput, noise_width_guess: float, df_summary: DataFrame)#

Optional method to skip signals where clearly no peak is present. Saves a lot of computation time.

Parameters:

filename: Name of the raw data file.
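ui: Instance of the UserInput class.
noise_width_guess: Estimated width of the noise of a particular measurement.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.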

Returns:

found_peak: True, if any peak candidate was found within the time frame; False, if not.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.prepare_model_selection(path_raw_data: str | PathLike, path_template: str | PathLike)#

Function to prepare model selection by providing and mostly filling out an Excel template Template.xlsx. After this step, the user has to provide relevant information in Template.xlsx which is finally used for model selection.

Parameters:

path_raw_data: Path to the folder containing raw data.
path_template: Path to the folder containing Template.xlsx from PeakPerformance.

peak_performance.pipeline.report_add_data_to_summary(filename: str, idata, df_summary: DataFrame, ui: UserInput, is_peak: bool, rejection_cause: str = '')#

Extracts the relevant information from idata, concatenates it to the summary DataFrame, and saves the DataFrame as an Excel file. Error handling keeps the pipeline from stopping if saving fails (e.g. because the file is currently open elsewhere).

Parameters:

filename: Name of the raw data file.
idata: Inference data object resulting from sampling.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
ui: Instance of the UserInput class.
is_peak: Boolean stating whether a signal was recognized as a peak (True) or not (False).
rejection_cause: Cause for rejecting a given signal.

Returns:

df_summary: Updated DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.report_add_nan_to_summary(filename: str, ui: UserInput, df_summary: DataFrame, rejection_cause: str)#

Method to add NaN values to the summary DataFrame in case a signal did not contain a peak.

Parameters:

filename: Name of the raw data file.
ui: Instance of the UserInput class.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
rejection_cause: Cause for rejecting a given signal.

Returns:

df_summary: Updated DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.report_area_sheet(path: str | PathLike, df_summary: DataFrame)#

Save a different, more minimalist report sheet focussing on the area data.

Parameters:

path: Path to the directory containing the raw data.
df_summary: DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.

peak_performance.pipeline.report_save_idata(idata, ui: UserInput, filename: str)#

Saves inference data object as a .nc file.

Parameters:

idata: Inference data object resulting from sampling.
ui: Instance of the UserInput class.
filename: Name of a raw data file containing a NumPy array with a time series (time as first, intensity as second element of the array).

peak_performance.pipeline.sampling(pmodel, **sample_kwargs)#

Performs sampling.

Parameters:

pmodel: A PyMC model.
**sample_kwargs: The keyword arguments are passed to pm.sample().
tune: Number of tuning samples (default = 2000).
draws: Number of samples after tuning (default = 2000).

Returns:

idata: Inference data object.
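
The documented defaults for `tune` and `draws` can be combined with user-supplied keyword arguments as sketched below. The helper name `with_sampling_defaults` is hypothetical and the actual call to PyMC is left out, so this only illustrates how overrides and defaults might be merged.

```python
from typing import Any, Dict


def with_sampling_defaults(**sample_kwargs: Any) -> Dict[str, Any]:
    """Fill in the documented defaults (tune=2000, draws=2000) unless overridden."""
    settings: Dict[str, Any] = {"tune": 2000, "draws": 2000}
    # Values supplied by the caller take precedence over the defaults.
    settings.update(sample_kwargs)
    return settings
```

The resulting dictionary could then be unpacked into `pm.sample(**settings)` inside the model context.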

peak_performance.pipeline.selected_models_to_template(path_raw_data: str | PathLike, signals: DataFrame, model_dict: Mapping[str, str])#

Function to update Template.xlsx with the selected model types.

Parameters:

path_raw_data: Path to the folder containing raw data.
signals: DataFrame containing the signals tab of Template.xlsx.
model_dict: Dict with unique identifiers as keys and model types as values.

peak_performance.pipeline.selection_loop(path_raw_data: str | PathLike, *, files_for_selection: Mapping[str, str], raw_data_files: List[str] | Tuple[str], ic: str, signals: DataFrame)#

Function containing the loop over all filenames intended for the model selection. Involves sampling every model featured by PeakPerformance, computing the log-likelihood and an information criterion, and comparing the results to ascertain the best model for every file.

Parameters:

path_raw_data: Path to the folder containing raw data.
files_for_selection: Dict with file names as keys and unique identifiers as values.
raw_data_files: List of raw data files returned by the detect_raw_data() function. Is needed here only to get access to the file format.
ic: Information criterion to be used for model selection. (“loo”: Pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion)
signals: DataFrame containing the signals tab of Template.xlsx.

Returns:

result_df: DataFrame containing the ranking and scores of the model selection.
model_dict: Dict with unique identifiers as keys and model types as values.
