Vascular age - example

Setup

Import libraries

  • The ‘WFDB’ toolbox provides tools for reading data stored in the ‘WaveForm DataBase’ format (the format used on PhysioNet)

  • The ‘Neurokit2’ toolbox provides tools for physiological signal processing

! pip install wfdb
! pip install neurokit2
import neurokit2 as nk2
import wfdb

Q: Can you find out more about these packages? What is the 'Waveform DataBase Format', and what tasks could 'Neurokit' be used for?
Hint: See the WFDB documentation and the Neurokit documentation.
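
For instance, a quick way to explore these packages from within Python is to check their versions and read their built-in documentation (a minimal sketch, using functions which appear later in this tutorial):

print(wfdb.__version__)  # the installed version of the WFDB toolbox
print(nk2.__version__)   # the installed version of the NeuroKit2 toolbox
help(wfdb.rdrecord)      # documentation for reading a WFDB record
help(nk2.ppg_clean)      # documentation for NeuroKit2's PPG filtering function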

  • The following toolboxes are often used in data science, and provide tools for data analysis

import numpy as np
from scipy import signal
from statistics import median
import matplotlib.pyplot as plt

Setup database

  • Specify the name of the database on PhysioNet (here).

database_name = 'pulse-transit-time-ppg/1.1.0' # The name of the database on Physionet

Q: Do look into PhysioNet and the datasets it hosts.

  • Find a list of records in the database

record_list = wfdb.get_record_list(database_name)

# Output the result
print(f"The '{database_name}' database contains {len(record_list)} records")
The 'pulse-transit-time-ppg/1.1.0' database contains 66 records
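
For example, we could take a quick look at the first few record names (each record is named after its subject and activity, e.g. 's1_sit'):

print(record_list[:5])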

Import required functions

The following functions have been written to perform tasks specifically for this analysis.

It’s helpful to appreciate what the functions do (as described in the bullet points), and to be aware of the inputs to the functions (which are provided in the brackets on the first line of each function).

  • This function finds suitable records in the database. It finds records which meet the following criteria:

    • required_signals: Contain the required signals

    • required_duration: Are of at least the required duration

    • required_activity: Were collected during the required activity

def find_suitable_records(required_signals, required_duration, required_activity, do_verbose=False):
    """
    Description: Finds suitable records in the Pulse Transit Time PPG dataset.
    """    
    
    matching_recs = {'dir':[], 'name':[], 'length':[], 'start_sbp':[], 'end_sbp':[], 'delta_sbp':[], 'age':[]}

    for record in record_list:
        if do_verbose:
            print('Record: {}'.format(record), end="", flush=True)
    
        # check whether this record corresponds to a suitable activity
        if required_activity not in record:
            if do_verbose:
                print('   (not required activity)')
            continue

        record_data = wfdb.rdheader(record,
                                    pn_dir=database_name,
                                    rd_segments=True)

        # Check whether the required signals are present in the record
        sigs_present = record_data.sig_name
        if not all(x in sigs_present for x in required_signals):
            if do_verbose:
                print('   (missing signals)')
            continue

        seg_length = record_data.sig_len/(record_data.fs)

        if seg_length < required_duration:
            if do_verbose:
                print(f' (too short at {seg_length/60:.1f} mins)')
            continue
    
        # This record does meet the requirements, so extract information and data from it
    
        # Information
        matching_recs['dir'].append(database_name)
        matching_recs['name'].append(record_data.record_name)
        matching_recs['length'].append(seg_length)
    
        # Blood pressure measurements
        str_to_find = '<bp_sys_start>: '
        curr_len = len(str_to_find)
        start_el = record_data.comments[0].index(str_to_find) + curr_len
        str_to_find = '<bp_sys_end>'
        end_el = record_data.comments[0].index(str_to_find) - 1
        matching_recs['start_sbp'].append(int(record_data.comments[0][start_el:end_el]))
        str_to_find = '<bp_sys_end>: '
        curr_len = len(str_to_find)
        start_el = record_data.comments[0].index(str_to_find) + curr_len
        str_to_find = '<bp_dia_start>'
        end_el = record_data.comments[0].index(str_to_find) - 1
        matching_recs['end_sbp'].append(int(record_data.comments[0][start_el:end_el]))
        matching_recs['delta_sbp'].append(matching_recs['end_sbp'][-1]-matching_recs['start_sbp'][-1])

        # ages
        str_to_find = '<age>: '
        curr_len = len(str_to_find)
        start_el = record_data.comments[0].index(str_to_find) + curr_len
        str_to_find = '<bp_sys_start>'
        end_el = record_data.comments[0].index(str_to_find) - 1
        matching_recs['age'].append(float(record_data.comments[0][start_el:end_el]))

        if do_verbose:
            print('   (met requirements)')
    
    if do_verbose:
        print(f"A total of {len(matching_recs['dir'])} out of {len(record_list)} records met the requirements.")
    
    return matching_recs
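
As an aside, the metadata in the header comments could be parsed with a regular expression rather than the manual string indexing used above. A minimal sketch, assuming the comments follow the '<field>: value' format relied upon by the function above (parse_header_comment is a hypothetical helper, not part of the dataset or the WFDB toolbox):

import re

def parse_header_comment(comment):
    # extract '<field>: value' pairs into a dictionary, e.g. '<age>: 29' becomes {'age': '29'}
    return {m.group(1): m.group(2).strip(' ,')
            for m in re.finditer(r'<(\w+)>:\s*([^<]+)', comment)}

# hypothetical usage, with record_data as returned by wfdb.rdheader above:
# fields = parse_header_comment(record_data.comments[0])
# age = float(fields['age'])
# start_sbp = int(fields['bp_sys_start'])
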
  • This function extracts a short segment of data from a longer record. It uses the following inputs:

    • start_seconds: the start time at which to start extracting data.

    • n_seconds_to_load: the duration of the segment to be extracted.

    • record_name: the name of the record from which data are to be extracted.

    • record_dir: the folder (a.k.a. directory) which contains the record.

    • required_signals: the signals to be extracted.

def extract_segment_of_data(start_seconds, n_seconds_to_load, record_name, record_dir, required_signals, do_verbose=False):

    # Get sampling frequency from header data
    record_data = wfdb.rdheader(record_name,
                                pn_dir=record_dir,
                                rd_segments=True)
    fs = record_data.fs
    
    # Specify timings of segment to be extracted
    sampfrom = int(fs * start_seconds)                      # the starting sample number (converted to an integer)
    sampto = int(fs * (start_seconds + n_seconds_to_load))  # the final sample number (converted to an integer)

    # Load segment data
    segment_data = wfdb.rdrecord(record_name=record_name,
                                 channel_names = required_signals,
                                 sampfrom=sampfrom,
                                 sampto=sampto,
                                 pn_dir=record_dir)

    if do_verbose:
        print(f"{n_seconds_to_load} seconds of data extracted from record: {record_name}")
    
    return segment_data
  • This function finds peaks in PPG signal. It takes the following inputs:

    • raw: raw PPG signal

    • fs: sampling frequency (in Hz)

def get_ppg_peaks(raw, fs):
    ppg_clean = nk2.ppg_clean(raw, sampling_rate=fs)  # filter the raw PPG signal
    peaks = nk2.ppg_findpeaks(ppg_clean, method="elgendi", show=False)  # detect peaks using the 'elgendi' method
    return peaks['PPG_Peaks']
  • This function finds peaks in an ECG signal. It takes the following inputs:

    • raw: raw ECG signal

    • fs: sampling frequency (in Hz)

def get_ecg_peaks(raw, fs):
    ecg_clean = nk2.ecg_clean(raw, sampling_rate=fs)  # filter the raw ECG signal
    signals, info = nk2.ecg_peaks(ecg_clean, method="neurokit", show=False)  # detect R-waves
    peaks = info["ECG_R_Peaks"]  # extract the R-wave sample numbers
    return peaks
  • This function finds PPG pulse onsets as the minimum values between each pair of identified PPG peaks. It takes the following inputs:

    • ppg: A PPG signal

    • pks: the peaks in a PPG signal

    • fs: the sampling frequency (in Hz)

def get_ppg_onsets(ppg, pks, fs):
    ons = np.empty(0)
    for i in range(len(pks) - 1):
        start = pks[i]
        stop = pks[i + 1]
        ibi = ppg[start:stop]  # the PPG samples between this pair of peaks
        aux_ons = np.argmin(ibi)  # the index of the minimum value within this inter-beat interval
        ons = np.append(ons, aux_ons + start)  # convert to a sample number within the whole signal

    ons = ons.astype(int)
    return ons
  • A function to find the sample number in a list which immediately follows a specified point. It takes the following inputs:

    • array: the list of sample numbers (e.g. a list of PPG peak sample numbers)

    • value: the sample number of the specified point (e.g. an ECG R-wave sample number)

def find_next(array, value):
    array = np.asarray(array)
    diffs = array - value  # how far each element is beyond the specified point
    diffs = np.where(diffs>0, diffs, 100000)  # ignore elements at or before the specified point
    idx = diffs.argmin()  # the element closest to (but after) the specified point
    return array[idx]
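
As a quick illustration of its behaviour (using hypothetical sample numbers):

print(find_next([100, 600, 1100], 450))  # prints 600, the first sample number after 450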

Extracting a dataset

Finding suitable records in the database

  • Specify the required signals

name_of_ecg_signal = "ecg" # the name of the ecg signal in the data
name_of_ppg_signal = "pleth_1" # the name of the ppg signal in the data
required_signals = [name_of_ecg_signal, name_of_ppg_signal]

Q: Can you find out the details of the 'pleth_1' signal from the database description? How was it measured? Is it ideal for vascular ageing assessment? What alternative signals are available?

  • Specify the required duration in seconds

required_duration = 1*60 # in seconds
  • Specify the required activity

required_activity = "sit"

Q: Similarly, can you find out what different activities the data in this database were collected during? See the database description.

  • Find records which meet these requirements

do_verbose = True # this tells the function to output some text to explain what it is doing.
matching_recs = find_suitable_records(required_signals, required_duration, required_activity, do_verbose)
Record: s1_walk   (not required activity)
Record: s1_run   (not required activity)
Record: s1_sit   (met requirements)
Record: s2_walk   (not required activity)
Record: s2_run   (not required activity)
Record: s2_sit   (met requirements)
Record: s3_sit   (met requirements)
Record: s3_walk   (not required activity)
Record: s3_run   (not required activity)
Record: s4_sit   (met requirements)
Record: s4_run   (not required activity)
Record: s4_walk   (not required activity)
Record: s5_walk   (not required activity)
Record: s5_run   (not required activity)
Record: s5_sit   (met requirements)
Record: s6_run   (not required activity)
Record: s6_sit   (met requirements)
Record: s6_walk   (not required activity)
Record: s7_walk   (not required activity)
Record: s7_run   (not required activity)
Record: s7_sit   (met requirements)
Record: s8_sit   (met requirements)
Record: s8_run   (not required activity)
Record: s8_walk   (not required activity)
Record: s9_walk   (not required activity)
Record: s9_sit   (met requirements)
Record: s9_run   (not required activity)
Record: s10_run   (not required activity)
Record: s10_walk   (not required activity)
Record: s10_sit   (met requirements)
Record: s11_sit   (met requirements)
Record: s11_walk   (not required activity)
Record: s11_run   (not required activity)
Record: s12_walk   (not required activity)
Record: s12_run   (not required activity)
Record: s12_sit   (met requirements)
Record: s13_walk   (not required activity)
Record: s13_sit   (met requirements)
Record: s13_run   (not required activity)
Record: s14_run   (not required activity)
Record: s14_walk   (not required activity)
Record: s14_sit   (met requirements)
Record: s15_walk   (not required activity)
Record: s15_sit   (met requirements)
Record: s15_run   (not required activity)
Record: s16_sit   (met requirements)
Record: s16_run   (not required activity)
Record: s16_walk   (not required activity)
Record: s17_run   (not required activity)
Record: s17_walk   (not required activity)
Record: s17_sit   (met requirements)
Record: s18_run   (not required activity)
Record: s18_walk   (not required activity)
Record: s18_sit   (met requirements)
Record: s19_walk   (not required activity)
Record: s19_sit   (met requirements)
Record: s19_run   (not required activity)
Record: s20_run   (not required activity)
Record: s20_walk   (not required activity)
Record: s20_sit   (met requirements)
Record: s21_walk   (not required activity)
Record: s21_sit   (met requirements)
Record: s21_run   (not required activity)
Record: s22_walk   (not required activity)
Record: s22_run   (not required activity)
Record: s22_sit   (met requirements)
A total of 22 out of 66 records met the requirements.
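
As a quick check of the metadata that has been extracted (a minimal sketch):

print(f"Ages range from {min(matching_recs['age'])} to {max(matching_recs['age'])} years")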

Extract data from a record

  • Specify details of the segment of data to be extracted

start_seconds = 0
n_seconds_to_load = 60
subj_no = 0  # In python, zero refers to the first item in a list
record_name = matching_recs['name'][subj_no]
record_dir = matching_recs['dir'][subj_no]
  • Extract this segment of data

do_verbose = True
segment_data = extract_segment_of_data(start_seconds, n_seconds_to_load, record_name, record_dir, required_signals, do_verbose)
60 seconds of data extracted from record: s1_sit

Signal processing

Extract signals

  • Extract PPG and ECG signals from the segment of data

# identify the columns in the data which contain the PPG and ECG signals, and extract them
ppg_col = segment_data.sig_name.index(name_of_ppg_signal)  # the column containing the PPG signal
ecg_col = segment_data.sig_name.index(name_of_ecg_signal)  # the column containing the ECG signal
ppg = segment_data.p_signal[:, ppg_col]  # extract the PPG data
ecg = segment_data.p_signal[:, ecg_col]  # extract the ECG data
  • Extract sampling frequency

fs = segment_data.fs

Invert PPG signal

This is usually done to make the PPG signal look like a blood pressure signal. Consider why this is the case.

ppg = -1*ppg

Filter signals

Signals are filtered to reduce noise, which makes it easier for the subsequent signal processing steps to perform well.

  • Filter PPG

ppg = nk2.ppg_clean(ppg, sampling_rate=fs)

Info: You can find out more about this function here.

  • Filter ECG

ecg = nk2.ecg_clean(ecg, sampling_rate=fs)

Q: You can find out more about this function here. What frequency content does it exclude from the signal?
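
As an aside, similar filtering could be implemented directly using the scipy.signal module imported earlier. A minimal sketch of a zero-phase band-pass filter for the PPG signal, with illustrative cut-off frequencies (these are assumptions, not the values used by the NeuroKit functions above):

low_cutoff = 0.5   # Hz - intended to remove low-frequency baseline wander (assumed value)
high_cutoff = 8.0  # Hz - intended to remove high-frequency noise (assumed value)
sos = signal.butter(4, [low_cutoff, high_cutoff], btype='bandpass', fs=fs, output='sos')
ppg_bandpassed = signal.sosfiltfilt(sos, ppg)  # filter forwards and backwards to avoid phase distortion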

Beat detection

The next step is to detect beats in the signals.

PPG beat detection

  • Detect beats

ppg_peaks = get_ppg_peaks(ppg, fs)
  • Plot results

plt.figure(figsize=(20,5))  # setup the figure
plt.plot(ppg)  # plot the PPG signal
heights = ppg[ppg_peaks]  # extract the heights of the PPG peaks
plt.scatter(ppg_peaks, heights, marker='o', color='green') # plot points indicating the PPG peaks (plot the sample numbers against the heights, using green circular markers)
plt.show()
[Figure: the PPG signal with detected peaks marked in green]

ECG beat detection

  • Detect beats

ecg_peaks = get_ecg_peaks(ecg, fs)
  • Plot results

plt.figure(figsize=(20,5))
plt.plot(ecg)
heights = ecg[ecg_peaks]
plt.scatter(ecg_peaks, heights, marker='o', color='green')
plt.show()
[Figure: the ECG signal with detected R-peaks marked in green]

Extract timings

ECG timings

The ECG timings are simply the times of the R-waves, which indicate the times of ventricular depolarisation (i.e. the electrical activity which prompts ventricular contraction).

ecg_timings = ecg_peaks

PPG timings

Ideally, we wouldn’t use the times of the systolic peaks of PPG pulse waves, because there can be a variable delay between the pulse onset and the systolic peak. Therefore, we use an alternative timing reference, such as the time of the pulse onset:

  • Getting the times of PPG pulse wave onsets

ppg_timings = get_ppg_onsets(ppg,ppg_peaks,fs)
  • Plotting the results

plt.figure(figsize=(20,5))
plt.plot(ppg)
heights_peaks = ppg[ppg_peaks]
heights_onsets = ppg[ppg_timings]
plt.scatter(ppg_peaks, heights_peaks, marker='o', color='green')
plt.scatter(ppg_timings, heights_onsets, marker='o', color='red')
plt.show()
[Figure: the PPG signal with peaks (green) and pulse onsets (red) marked]

Extract pulse arrival times

  • Find pairs of R-waves and subsequent PPG pulse onsets

rel_ppg_timings = []
for ecg_timing in ecg_timings:
    current_ppg_timing = find_next(ppg_timings, ecg_timing)
    rel_ppg_timings.append(current_ppg_timing)
  • Calculate pulse arrival times as the difference between the times of R-waves and subsequent PPG pulse onsets

pats = (np.array(rel_ppg_timings) - ecg_timings)/fs  # divide by the sampling frequency to convert from samples to seconds
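
As a quick sanity check, we can summarise the beat-by-beat pulse arrival times for this segment:

print(f"Median PAT: {np.median(pats):.3f} s over {len(pats)} beats")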

Modelling

Extract dataset

Extract a dataset of:

  • Parameter(s) derived from PPG and ECG signals: pulse arrival time

  • Reference parameter(s): age

Q: What alternative reference parameters would be ideal?

Setup:

subj_nos = list(range(len(matching_recs['name'])))  # one entry per matching record

Extract median pulse arrival times

median_pats = []
for subj_no in subj_nos:

    # specify this subject's record name and directory:
    record_name = matching_recs['name'][subj_no]
    record_dir = matching_recs['dir'][subj_no]

    # extract this subject's signals
    segment_data = extract_segment_of_data(start_seconds, n_seconds_to_load, record_name, record_dir, required_signals)
    fs = segment_data.fs
    ppg = segment_data.p_signal[:, segment_data.sig_name.index(name_of_ppg_signal)]  # the PPG signal
    ecg = segment_data.p_signal[:, segment_data.sig_name.index(name_of_ecg_signal)]  # the ECG signal
        
    # invert PPG signal
    ppg = -1*ppg
    
    # filter signals
    ppg = nk2.ppg_clean(ppg, sampling_rate=fs)
    ecg = nk2.ecg_clean(ecg, sampling_rate=fs)
    
    # detect beats in signals
    ecg_peaks = get_ecg_peaks(ecg, fs)
    ppg_peaks = get_ppg_peaks(ppg, fs)
    
    # obtain timings
    ecg_timings = ecg_peaks
    ppg_timings = get_ppg_onsets(ppg,ppg_peaks,fs)
    
    # extract pulse arrival times
    rel_ppg_timings = []
    for ecg_timing in ecg_timings:
        current_ppg_timing = find_next(ppg_timings, ecg_timing)
        rel_ppg_timings.append(current_ppg_timing)
    pats = (np.array(rel_ppg_timings) - ecg_timings)/fs  # in seconds
    
    # find median pulse arrival time for this subject
    curr_median_pat = median(pats)
    median_pats.append(curr_median_pat)
    
    print(f"Subject no. {subj_no} has median PAT of {median_pats[subj_no]} seconds")
Subject no. 0 has median PAT of 0.172 seconds
Subject no. 1 has median PAT of 0.658 seconds
Subject no. 2 has median PAT of 0.192 seconds
Subject no. 3 has median PAT of 0.162 seconds
Subject no. 4 has median PAT of 0.152 seconds
Subject no. 5 has median PAT of 0.192 seconds
Subject no. 6 has median PAT of 0.15 seconds
Subject no. 7 has median PAT of 0.154 seconds
Subject no. 8 has median PAT of 0.208 seconds
Subject no. 9 has median PAT of 0.18 seconds
Subject no. 10 has median PAT of 0.222 seconds
Subject no. 11 has median PAT of 0.176 seconds
Subject no. 12 has median PAT of 0.154 seconds
Subject no. 13 has median PAT of 0.168 seconds
Subject no. 14 has median PAT of 0.214 seconds
Subject no. 15 has median PAT of 0.184 seconds
Subject no. 16 has median PAT of 0.204 seconds
Subject no. 17 has median PAT of 0.18 seconds
Subject no. 18 has median PAT of 0.17 seconds
Subject no. 19 has median PAT of 0.184 seconds
Subject no. 20 has median PAT of 0.176 seconds
Subject no. 21 has median PAT of 0.158 seconds

Extract ages

  • Cycle through each subject, extracting their age

ages = []
for subj_no in subj_nos:
    # extract this subject's age
    curr_age = matching_recs['age'][subj_no]
    ages.append(curr_age)

Inspect dataset

  • Plot the ages on the x-axis against median pulse arrival times on the y-axis

plt.scatter(ages, median_pats, marker='o', color='green')
m, b = np.polyfit(ages, median_pats, 1)
plt.plot(ages, m*np.array(ages)+b, color='red')
plt.xlabel("Age (years)")
plt.ylabel("Pulse arrival time (secs)")
plt.show()
[Figure: age against median pulse arrival time, with a line of best fit]

Q: Are there any outliers? Which point(s)?
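
One way to identify outlying points programmatically is shown below (a minimal sketch, using the same 0.3 s threshold applied in the next cell; the threshold itself is a judgement call):

outlier_subjects = [subj_no for subj_no in subj_nos if median_pats[subj_no] >= 0.3]
print(f"Possible outlier subject number(s): {outlier_subjects}")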

  • If we exclude the outlier:

rel_ages = []
rel_median_pats = []
for el in range(len(ages)):
    if median_pats[el]<0.3:
        rel_ages.append(ages[el])
        rel_median_pats.append(median_pats[el])
plt.scatter(rel_ages, rel_median_pats, marker='o', color='green')
m, b = np.polyfit(rel_ages, rel_median_pats, 1)
plt.plot(rel_ages, m*np.array(rel_ages)+b, color='red')
plt.xlabel("Age (years)")
plt.ylabel("Pulse arrival time (secs)")
plt.show()
[Figure: age against median pulse arrival time with the outlier excluded, and a line of best fit]

Taking it further

Q: Can you assess the strength of the correlation?
Hint: Consider using the 'sklearn' package
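
A minimal sketch of how this might be done, using the 'scipy' and 'sklearn' packages (this assumes the outlier-excluded rel_ages and rel_median_pats variables from above):

from scipy import stats
from sklearn.linear_model import LinearRegression

# Pearson correlation coefficient between age and median pulse arrival time
r, p_value = stats.pearsonr(rel_ages, rel_median_pats)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")

# coefficient of determination (R^2) of a simple linear regression
X = np.array(rel_ages).reshape(-1, 1)  # sklearn expects a 2D array of predictors
y = np.array(rel_median_pats)
model = LinearRegression().fit(X, y)
print(f"R^2 = {model.score(X, y):.2f}")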

Q: What other parameters could we extract from the PPG signal? See the tutorials in this book, and also this article.