peak_performance.pipeline#
The pipeline module defines the steps of a data pipeline for processing LC-MS/MS data and analyzing hundreds of peaks.
- exception peak_performance.pipeline.InputError#
Bases: Exception
Base type of exceptions related to information given by the user.
- exception peak_performance.pipeline.ParsingError#
Bases: Exception
Base type of parsing exceptions.
- class peak_performance.pipeline.UserInput(path: str | PathLike, files: Sequence[str], raw_data_file_format: str, peak_model: Sequence[str], retention_time_estimate: Sequence[float] | Sequence[int], peak_width_estimate: float | int, pre_filtering: bool, minimum_sn: float | int, timeseries: ndarray, acquisition: str, precursor: float | int, product_mz_start: float | int, product_mz_end: float | int)#
Bases: object
Collect all information required from the user and format it in the correct manner.
- Attributes:
- acquisition
Getting the value of the acquisition attribute (name of a single acquisition).
- precursor
Getting the value of the precursor attribute, which can be one of the following: either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.
- product_mz_end
Getting the value of the product_mz_end attribute.
- product_mz_start
Getting the value of the product_mz_start attribute.
- timeseries
Getting the value of the timeseries attribute.
- user_info
Create a dictionary with the necessary user information based on the class attributes.
- property acquisition#
Getting the value of the acquisition attribute (name of a single acquisition).
- property precursor#
Getting the value of the precursor attribute, which can be one of the following: either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.
- property product_mz_end#
Getting the value of the product_mz_end attribute. (End of the mass to charge ratio range of the product ion in the TOF.)
- property product_mz_start#
Getting the value of the product_mz_start attribute. (Start of the mass to charge ratio range of the product ion in the TOF.)
- property timeseries#
Getting the value of the timeseries attribute. (NumPy Array containing time (at first position) and intensity (at second position) data as NumPy arrays.)
- property user_info#
Create a dictionary with the necessary user information based on the class attributes.
- peak_performance.pipeline.detect_raw_data(path: str | PathLike, *, data_type: str = '.npy')#
Detect all files of the given data type (by default .npy) with time and intensity data for peaks in a given directory.
- Parameters:
- path
Path to the folder containing raw data.
- data_type
Data format of the raw data files (e.g. ‘.npy’).
- Returns:
- files
List with names of all files of the specified data type in path.
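As a rough illustration of what such a detection step involves, the following sketch lists the files with a given suffix in a folder. It uses only the standard library; the helper name `list_files_with_suffix` is hypothetical and is not part of PeakPerformance, and the actual function may behave differently (for example with respect to ordering or error handling).

```python
from pathlib import Path


def list_files_with_suffix(path, data_type=".npy"):
    """Return the sorted names of all files in `path` ending in `data_type`.

    Hypothetical helper for illustration only, not the library function.
    """
    folder = Path(path)
    # Only regular files directly inside the folder are considered (no recursion).
    return sorted(
        entry.name
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix == data_type
    )
```

Sorting the names is a choice made here so that the result is reproducible across operating systems.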
- peak_performance.pipeline.excel_template_prepare(path_raw_data: str | PathLike, path_peak_performance: str | PathLike, raw_data_files: List[str] | Tuple[str], unique_identifiers: List[str] | Tuple[str])#
Function to copy Template.xlsx from the peak performance directory to the directory containing the raw data files. Subsequently, update Template.xlsx with a list of all raw data files and of all unique_identifiers.
- Parameters:
- path_raw_data
Path to the folder containing raw data.
- path_peak_performance
Path to the folder containing PeakPerformance.
- raw_data_files
List with names of all files of the specified data type in path_raw_data.
- unique_identifiers
List with all unique combinations of targeted molecules. (i.e. experiment number or precursor ion m/z ratio and product ion m/z ratio range)
- peak_performance.pipeline.initiate(path: str | PathLike, *, run_dir: str = '')#
Create a folder for the results, a zip file inside that folder, and the df_summary DataFrame.
- Parameters:
- path
Path to the directory containing the raw data.
- run_dir
Name of the directory created to store the results of the current run (default: current date and time).
- Returns:
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- path
Updated path variable pointing to the newly created folder for this batch.
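The folder handling described above could look roughly like the sketch below. It is an assumption-laden illustration: the helper name `create_run_directory`, the timestamp format, and the archive name `idata.zip` are all hypothetical, and the creation of df_summary is left out.

```python
import zipfile
from datetime import datetime
from pathlib import Path


def create_run_directory(path, run_dir=""):
    """Create a results folder inside `path` and an empty zip archive in it.

    Hypothetical helper for illustration only, not the library function.
    """
    if not run_dir:
        # Fall back to the current date and time so repeated runs do not collide.
        # The exact timestamp format is an assumption.
        run_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    results = Path(path) / run_dir
    results.mkdir(parents=True, exist_ok=False)
    # Assumed archive name; the real file name may differ.
    with zipfile.ZipFile(results / "idata.zip", mode="w"):
        pass
    return results
```

Using `exist_ok=False` makes the sketch fail loudly instead of silently writing into the folder of an earlier run.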
- peak_performance.pipeline.model_selection(path_raw_data: str | PathLike, *, ic: str = 'loo')#
Method to select the best model for every signal (i.e. combination of experiment number or precursor ion m/z ratio and product ion m/z ratio). This is realized by analyzing one representative sample of the batch with all models and comparing the results based on an information criterion.
- Parameters:
- path_raw_data
Path to the folder containing raw data.
- ic
Information criterion to be used for model selection. (“loo”: pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion)
- Returns:
- comparison_results
DataFrame containing all rankings from model selection.
- model_dict
Dict with unique identifiers as keys and model types as values.
- peak_performance.pipeline.model_selection_check(result_df: DataFrame, ic: str, elpd_threshold: str | float = 35) -> str#
During model selection, double peak models are sometimes incorrectly preferred due to their increased complexity. Therefore, they have to outperform single peak models by an empirically determined value of the elpd (expected log pointwise predictive density).
- Parameters:
- result_df
DataFrame with the result of model comparison via az.compare().
- ic
Information criterion to be used for model selection. (“loo”: pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion)
- elpd_threshold
Threshold of the elpd difference between a double and a single peak model for the double peak model to be accepted.
- Returns:
- selected_model
Name of the selected model type.
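The threshold rule can be illustrated with a minimal sketch that works on a plain mapping instead of the DataFrame returned by az.compare(). The helper name `select_model`, the convention that double peak models are recognized by a name starting with "double", and the use of `>=` for the comparison are assumptions made for this example only.

```python
def select_model(elpd_by_model, elpd_threshold=35.0):
    """Pick a model name from a mapping of model name -> elpd (higher is better).

    Hypothetical helper for illustration only, not the library function.
    """
    ranked = sorted(elpd_by_model, key=elpd_by_model.get, reverse=True)
    best = ranked[0]
    # Assumption: double peak models are identified by their name prefix.
    if not best.startswith("double"):
        return best
    singles = [name for name in ranked if not name.startswith("double")]
    if not singles:
        return best
    best_single = singles[0]
    # Accept the double peak model only if it wins by a clear margin.
    if elpd_by_model[best] - elpd_by_model[best_single] >= elpd_threshold:
        return best
    return best_single
```

With the default threshold of 35, a double peak model that leads the best single peak model by only 20 elpd units would be passed over in favor of the single peak model.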
- peak_performance.pipeline.parse_data(path: str | PathLike, filename: str, raw_data_file_format: str) -> Tuple[ndarray, str, float, float, float]#
Load the time series from a raw data file and extract the acquisition name and mass trace information belonging to it.
- Parameters:
- path
Path to the raw data files.
- filename
Name of a raw data file containing a NumPy array with a time series (time as first, intensity as second element of the array).
- raw_data_file_format
Data format (suffix) of the raw data, default is ‘.npy’.
- Returns:
- timeseries
Updated NumPy array containing time and intensity data as NumPy arrays in first and second row, respectively. NaN values have been replaced with zeroes.
- acquisition
Name of a single acquisition.
- precursor
Can be one of the following: Either the experiment number of the signal within the acquisition (each experiment = one mass trace) or the mass to charge ratio of the precursor ion selected in Q1.
- product_mz_start
Start of the mass to charge ratio range of the product ion in the TOF.
- product_mz_end
End of the mass to charge ratio range of the product ion in the TOF.
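How the returned acquisition, precursor, and product m/z values might be derived from a file name is sketched below. The naming scheme `<acquisition>_<precursor>_<product_mz_start>_<product_mz_end><suffix>` is an assumption made for this example, as is the helper name `split_file_name`; the sketch also raises a plain ValueError rather than the module's ParsingError.

```python
from pathlib import Path


def split_file_name(filename, raw_data_file_format=".npy"):
    """Split '<acquisition>_<precursor>_<mz start>_<mz end><suffix>' into parts.

    Hypothetical helper; the naming scheme itself is an assumption.
    """
    name = Path(filename).name
    if not name.endswith(raw_data_file_format):
        raise ValueError(f"{filename!r} does not end in {raw_data_file_format!r}")
    stem = name[: len(name) - len(raw_data_file_format)]
    # Split from the right so that underscores in the acquisition name survive.
    parts = stem.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"{filename!r} does not follow the assumed naming scheme")
    acquisition, precursor, mz_start, mz_end = parts
    return acquisition, float(precursor), float(mz_start), float(mz_end)
```

Splitting from the right is a deliberate choice here: it keeps acquisition names that themselves contain underscores intact.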
- peak_performance.pipeline.parse_files_for_model_selection(signals: DataFrame) -> Dict[str, str]#
Function to parse the file names for model selection.
- Parameters:
- signals
DataFrame containing the signals tab of Template.xlsx.
- Returns:
- files_for_selection
Dict with file names as keys and unique identifiers as values.
- peak_performance.pipeline.parse_unique_identifiers(raw_data_files: Sequence[str]) -> List[str]#
Get a set of all mass traces based on the standardized raw data file names (excluding acquisitions). Used to automatically fill out the unique_identifiers column in the Template.xlsx’ signals tab.
- Parameters:
- raw_data_files
Names of all files of the specified data type in path_raw_data.
- Returns:
- unique_identifiers
List with all unique combinations of targeted molecules. (i.e. experiment number or precursor ion m/z ratio and product ion m/z ratio range)
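A minimal sketch of collecting the mass traces from a list of file names follows. It assumes that each file name consists of an acquisition name followed by three underscore-separated fields (precursor or experiment number, product m/z start, product m/z end); both this scheme and the helper name `unique_mass_traces` are hypothetical.

```python
from pathlib import Path


def unique_mass_traces(raw_data_files):
    """Collect the distinct mass trace identifiers from standardized file names.

    Hypothetical helper; assumes '<acquisition>_<a>_<b>_<c><suffix>' file names.
    """
    identifiers = set()
    for name in raw_data_files:
        stem = Path(name).stem  # drop the file suffix
        parts = stem.rsplit("_", 3)
        if len(parts) != 4:
            raise ValueError(f"{name!r} does not follow the assumed naming scheme")
        # Everything after the acquisition name identifies the mass trace.
        identifiers.add("_".join(parts[1:]))
    return sorted(identifiers)
```

A set removes the duplicates that arise when the same mass trace was measured in several acquisitions.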
- peak_performance.pipeline.pipeline(path_raw_data: str | PathLike, raw_data_file_format: str)#
Function to run the complete PeakPerformance pipeline.
- Parameters:
- path_raw_data
Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
- raw_data_file_format
Data format (suffix) of the raw data, default is ‘.npy’.
- Returns:
- path_results
Path variable pointing to the newly created folder for this batch.
- peak_performance.pipeline.pipeline_loop(path_raw_data: str | PathLike, path_results: str | PathLike, raw_data_file_format: str, df_summary: DataFrame, *, restart: bool = False)#
Function to run the complete PeakPerformance pipeline.
- Parameters:
- path_raw_data
Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
- path_results
Path to the directory for the results of a given batch run of PeakPerformance.
- raw_data_file_format
Data format (suffix) of the raw data, default is ‘.npy’.
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- restart
If a pipeline broke for some reason, it can be restarted by setting restart to True. That way, already analyzed files won’t be analyzed again.
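One conceivable way to skip files that were already analyzed is sketched below. It assumes that every analyzed raw data file leaves behind a result file with the same stem and a `.nc` suffix in the results directory; this bookkeeping and the helper name `files_still_to_analyze` are assumptions, and the real pipeline may track progress differently.

```python
from pathlib import Path


def files_still_to_analyze(raw_data_files, path_results, result_suffix=".nc"):
    """Return the raw data files that have no result file in `path_results` yet.

    Hypothetical helper for illustration only, not the library function.
    """
    # Assumption: a finished analysis is marked by '<stem><result_suffix>'.
    done = {
        entry.stem
        for entry in Path(path_results).iterdir()
        if entry.suffix == result_suffix
    }
    return [name for name in raw_data_files if Path(name).stem not in done]
```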
- peak_performance.pipeline.pipeline_read_template(path_raw_data: str | PathLike)#
Function to read and check the input settings and data from Template.xlsx when running the data pipeline.
- Parameters:
- path_raw_data
Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
- Returns:
- pre_filtering
If True, potential peaks will be filtered based on retention time and signal to noise ratio before sampling.
- plotting
If True, PeakPerformance will plot results.
- peak_width_estimate
Rough estimate of the average peak width in minutes expected for the LC-MS method with which the data was obtained.
- minimum_sn
Minimum signal to noise ratio for a signal to be recognized as a peak during pre-filtering.
- df_signals
Read-out of the signals tab from Template.xlsx as a DataFrame.
- unique_identifiers
List of unique identifiers from the signals tab of Template.xlsx.
- peak_performance.pipeline.pipeline_restart(path_raw_data: str | PathLike, raw_data_file_format: str, path_results: str | PathLike)#
Function to restart a broken PeakPerformance pipeline. Files which are in the results directory of the broken pipeline will not be analyzed again. WARNING: This only works once! If a pipeline fails more than once, copy all files (except the Excel report sheets) into one directory and specify this directory as the path_results argument.
- Parameters:
- path_raw_data
Path to the raw data files. Files should be in the given raw_data_file_format, default is ‘.npy’. The .npy files are expected to be (2, ?)-shaped 2D NumPy arrays with time and intensity in the first dimension.
- raw_data_file_format
Data format (suffix) of the raw data, default is ‘.npy’.
- path_results
Path variable pointing to the directory of the broken PeakPerformance batch.
- Returns:
- path_results_new
Path variable pointing to the newly created folder for the restarted batch.
- peak_performance.pipeline.posterior_predictive_sampling(pmodel, idata)#
Performs posterior predictive sampling for signals recognized as peaks.
- Parameters:
- pmodel
A PyMC model.
- idata
Previously sampled inference data object.
- Returns:
- idata
Inference data object updated with the posterior predictive samples.
- peak_performance.pipeline.postfiltering(filename: str, idata, ui: UserInput, df_summary: DataFrame)#
Method to filter out false positive peaks after sampling based on the obtained uncertainties of several peak parameters.
- Parameters:
- filename
Name of the raw data file.
- idata
Inference data object resulting from sampling.
- ui
Instance of the UserInput class.
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- Returns:
- acceptance
True if the signal was accepted as a peak -> save data and continue with next signal. False if the signal was not accepted as a peak -> re-sampling with more tuning samples or discard signal.
- resample
True: re-sample with more tuning samples, False: don’t.
- discard
True: discard sample.
- peak_performance.pipeline.prefiltering(filename: str, ui: UserInput, noise_width_guess: float, df_summary: DataFrame)#
Optional method to skip signals where clearly no peak is present. Saves a lot of computation time.
- Parameters:
- filename
Name of the raw data file.
- ui
Instance of the UserInput class.
- noise_width_guess
Estimated width of the noise of a particular measurement.
- Returns:
- found_peak
True, if any peak candidate was found within the time frame; False, if not.
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
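The idea of such a pre-filter can be sketched with a simple height check near the expected retention time. This is not the library's algorithm: the helper name `has_peak_candidate`, the window of one peak width on either side of the retention time estimate, and the definition of the signal to noise ratio as height above the lowest point divided by the noise width are all assumptions.

```python
def has_peak_candidate(time, intensity, retention_time_estimate,
                       peak_width_estimate, noise_width_guess, minimum_sn):
    """Check for a sufficiently tall point near the expected retention time.

    Hypothetical helper for illustration only, not the library function.
    """
    if noise_width_guess <= 0:
        raise ValueError("noise_width_guess must be positive")
    baseline = min(intensity)
    # Assumed search window: one peak width on either side of the estimate.
    lower = retention_time_estimate - peak_width_estimate
    upper = retention_time_estimate + peak_width_estimate
    for t, y in zip(time, intensity):
        in_window = lower <= t <= upper
        # Height above the lowest point, in multiples of the noise width.
        if in_window and (y - baseline) / noise_width_guess >= minimum_sn:
            return True
    return False
```

Because the check needs only one pass over the data, it is far cheaper than sampling a model for a signal that contains no peak.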
- peak_performance.pipeline.prepare_model_selection(path_raw_data: str | PathLike, path_template: str | PathLike)#
Function to prepare model selection by providing and mostly filling out an Excel template Template.xlsx. After this step, the user has to provide relevant information in Template.xlsx which is finally used for model selection.
- Parameters:
- path_raw_data
Path to the folder containing raw data.
- path_template
Path to the folder containing Template.xlsx from PeakPerformance.
- peak_performance.pipeline.report_add_data_to_summary(filename: str, idata, df_summary: DataFrame, ui: UserInput, is_peak: bool, rejection_cause: str = '')#
Extracts the relevant information from idata, concatenates it to the summary DataFrame, and saves the DataFrame as an Excel file. Error handling prevents stop of the pipeline in case the saving doesn’t work (e.g. because the file was opened by someone).
- Parameters:
- idata
Inference data object resulting from sampling.
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- ui
Instance of the UserInput class.
- is_peak
Boolean stating whether a signal was recognized as a peak (True) or not (False).
- rejection_cause
Cause for rejecting a given signal.
- Returns:
- df_summary
Updated DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- peak_performance.pipeline.report_add_nan_to_summary(filename: str, ui: UserInput, df_summary: DataFrame, rejection_cause: str)#
Method to add NaN values to the summary DataFrame in case a signal did not contain a peak.
- Parameters:
- ui
Instance of the UserInput class.
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- rejection_cause
Cause for rejecting a given signal.
- Returns:
- df_summary
Updated DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- peak_performance.pipeline.report_area_sheet(path: str | PathLike, df_summary: DataFrame)#
Save a different, more minimalist report sheet focussing on the area data.
- Parameters:
- path
Path to the directory containing the raw data.
- df_summary
DataFrame for collecting the results (i.e. peak parameters) of every signal of a given pipeline.
- peak_performance.pipeline.report_save_idata(idata, ui: UserInput, filename: str)#
Saves inference data object as a .nc file.
- Parameters:
- idata
Inference data object resulting from sampling.
- ui
Instance of the UserInput class.
- filename
Name of a raw data file containing a NumPy array with a time series (time as first, intensity as second element of the array).
- peak_performance.pipeline.sampling(pmodel, **sample_kwargs)#
Performs sampling.
- Parameters:
- pmodel
A PyMC model.
- **sample_kwargs
The keyword arguments are used in pm.sample().
- tune
Number of tuning samples (default = 2000).
- draws
Number of samples after tuning (default = 2000).
- Returns:
- idata
Inference data object.
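How user-supplied keyword arguments might be combined with the documented defaults is sketched below. The helper name `resolve_sample_kwargs` is hypothetical, and whether the library merges arguments in this way is an assumption; only the default values of 2000 for `tune` and `draws` are taken from the parameter list above.

```python
def resolve_sample_kwargs(**sample_kwargs):
    """Merge user-supplied keyword arguments over the documented defaults.

    Hypothetical helper for illustration only, not the library function.
    """
    defaults = {"tune": 2000, "draws": 2000}
    # Later entries win, so explicit arguments override the defaults.
    return {**defaults, **sample_kwargs}
```

The resulting dictionary could then be unpacked into pm.sample() inside a PyMC model context.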
- peak_performance.pipeline.selected_models_to_template(path_raw_data: str | PathLike, signals: DataFrame, model_dict: Mapping[str, str])#
Function to update Template.xlsx with the selected model types.
- Parameters:
- path_raw_data
Path to the folder containing raw data.
- signals
DataFrame containing the signals tab of Template.xlsx.
- model_dict
Dict with unique identifiers as keys and model types as values.
- peak_performance.pipeline.selection_loop(path_raw_data: str | PathLike, *, files_for_selection: Mapping[str, str], raw_data_files: List[str] | Tuple[str], ic: str, signals: DataFrame)#
Function containing the loop over all file names intended for model selection. Involves sampling every model featured by PeakPerformance, computing the log-likelihood and an information criterion, and comparing the results to ascertain the best model for every file.
- Parameters:
- path_raw_data
Path to the folder containing raw data.
- files_for_selection
Dict with file names as keys and unique identifiers as values.
- raw_data_files
List of raw data files returned by the detect_raw_data() function. Is needed here only to get access to the file format.
- ic
Information criterion to be used for model selection. (“loo”: pareto-smoothed importance sampling leave-one-out cross-validation, “waic”: widely applicable information criterion)
- Returns:
- result_df
DataFrame containing the ranking and scores of the model selection.
- model_dict
Dict with unique identifiers as keys and model types as values.