Academic research and data analysis

Final-year project

My dissertation write-up component was awarded 83%. The project used genomics-linked hospital episode material to ask whether historical hospital-coded trajectories could support careful exploratory stratification in a Brugada-suspect research cohort.


Quick scan

What the project demonstrates

The strongest signal is the method: cohort construction, feature engineering, leakage control, observability checks, model comparison and disciplined interpretation.

Academic result: Dissertation write-up component: 83%. Wider Undergraduate Project module: 82, A, 30 credits.
Project type: Undergraduate Biomedical Science final-year project with applied health data science and machine-learning components.
Research context: Exploratory Brugada-suspect work using genomics-linked hospital episode material inside an approved secure research environment.
Technical work: Cohort logic, ICD/HES-derived feature engineering, observability controls, split-first preprocessing, comparison of logistic regression, random forest and neural-network (LR/RF/NN) models, and output generation.

Research question

The project asked whether pre-sequencing hospital-coded trajectories carried useful signal in a Brugada-suspect cohort, and whether that signal was strong enough to support careful retrospective stratification.

In practice, this meant building a feature pipeline around hospital activity, diagnosis/procedure groupings, observability and model evaluation, then being clear about what the analysis could and could not prove.

Interpretation boundary

This was undergraduate academic research, not a clinical decision tool. The value is in the technical workflow and the judgement used around weak, sparse and governance-bound evidence.

Methodology

Applied health data science workflow

Defined the analysis around a Brugada-suspect research cohort, with pre-sequencing hospital episode records treated as historical trajectory evidence.
Built participant-level feature tables from admitted patient care, outpatient and emergency-care sources, using ICD/OPCS groupings, temporal windows and utilisation features.
Added observability controls so record depth, available history and sparse coding were visible design constraints rather than hidden noise.
Ran preprocessing after the train/test split, so imputation, scaling and encoding were fitted on training data only and then applied unchanged to the test data.
Compared logistic regression, random forest and shallow neural-network baselines rather than relying on one model family.
Used thresholding, repeated split checks, model metrics, feature-effect outputs and subgroup/fairness review to keep the interpretation bounded.
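
The split-first rule in the list above can be illustrated with a small dependency-free sketch. It assumes median imputation and z-score scaling purely for illustration; the helper names fit_preprocessor and apply_preprocessor and the toy values are hypothetical and are not taken from the project pipeline.

```python
from statistics import mean, median, pstdev


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from training data only."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in train_values]
    centre = mean(imputed)
    spread = pstdev(imputed) or 1.0  # guard against a constant column
    return {"fill": fill, "centre": centre, "spread": spread}


def apply_preprocessor(values, params):
    """Apply already-fitted parameters without re-estimating anything."""
    return [
        ((params["fill"] if v is None else v) - params["centre"]) / params["spread"]
        for v in values
    ]


# Toy feature column that has already been split (hypothetical values).
train = [1.0, 2.0, None, 3.0]
test = [None, 10.0]

params = fit_preprocessor(train)                 # fitted on train only
train_scaled = apply_preprocessor(train, params)
test_scaled = apply_preprocessor(test, params)   # test never influences params
```

The missing test value is filled with the training median, and the extreme test value of 10.0 has no effect on the centre or spread, which is the property that keeps test information out of the fitted preprocessing.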

Selected figures

Design, observability, features and model checks

These figures show the analysis shape more clearly than a long method paragraph: how the cohort was framed, how observation depth was handled, what features were retained and how model-family comparisons were reviewed.

Figure: Analytical design diagram for the final-year project.
Analysis design: cohort, hospital-event features, model comparison and interpretation boundary.

Figure: Cohort architecture chart from the dissertation overview.
Cohort and target architecture used to frame the analysis.

Figure: Observability depth chart showing record-history and code-history coverage.
Observability checks made record depth part of the method, not an afterthought.

Figure: Bar chart showing final compact feature-domain structure.
Compact feature-domain structure across ventricular severity, conduction, syncope, pathway proxy and observability.

Figure: Compact model-family comparison across balanced accuracy, ROC AUC, PR AUC and probability error.
Model-family comparison across interpretable, tree-based and shallow neural-network baselines.

Figure: Random forest feature stability chart across repeated splits.
Feature-stability review helped test whether signal was stable enough to discuss.
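
The model-family comparison figure reports balanced accuracy and a probability error alongside ROC AUC and PR AUC. As a rough sketch of why these suit a sparse outcome, the example below computes balanced accuracy and a Brier score in plain Python. Treating "probability error" as a Brier score is an assumption, since the page does not name the measure, and the labels and probabilities are made-up toy values.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy imbalanced outcome: 8 negatives, 2 positives (hypothetical).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts the majority class.
y_pred_all_negative = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred_all_negative)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred_all_negative)

# A model that outputs the base rate for everyone.
brier = brier_score(y_true, [0.2] * 10)
```

Here plain accuracy is 0.8 while balanced accuracy is 0.5, which shows how a majority-class predictor can look acceptable on accuracy alone but carries no discriminative signal.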

Code evidence

Pipeline excerpts

The repository includes the authored pipeline structure for technical review. These excerpts show the style of implementation behind the page: feature construction, observability handling, split checks and raw-code signal-discovery controls.

Cohort and event alignment

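# Parse sequencing dates; unparseable values become NaT and are flagged.
cohort["sequencing_date"] = pd.to_datetime(cohort["sequencing_date"], errors="coerce")
cohort["sequencing_date_missing"] = cohort["sequencing_date"].isna().astype(int)
# One index row per participant, used to align every event table.
cohort_index = cohort[["participant_id", "sequencing_date"]].drop_duplicates("participant_id")

# Normalise each HES source (admitted patient care, outpatient, A&E) against the cohort index.
apc_norm = prepare_hes_events(apc, "admidate", cohort_index, "hes_apc")
op_norm = prepare_hes_events(op, "apptdate", cohort_index, "hes_op")
ae_norm = prepare_hes_events(ae, "arrivaldate", cohort_index, "hes_ae")

# Restrict admitted-care events to the core and recent pre-index windows.
apc_w_core = filter_window(apc_norm, core_window_years)
apc_w_recent = filter_window(apc_norm, recent_window_years)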

The feature build starts by normalising the cohort index and aligning hospital-event tables to pre-index windows.
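
The excerpt does not show the body of filter_window, so the following is only a sketch of what pre-index windowing could look like, written without pandas. The function name filter_preindex_window, the strict "before the index date" rule, the 365.25-day year and the toy dates are all assumptions for illustration.

```python
from datetime import date, timedelta


def filter_preindex_window(events, index_dates, window_years):
    """Keep events before each participant's index date and inside the look-back window."""
    kept = []
    for participant_id, event_date in events:
        index_date = index_dates.get(participant_id)
        if index_date is None:
            continue  # no usable index date, so the event cannot be aligned
        window_start = index_date - timedelta(days=round(365.25 * window_years))
        if window_start <= event_date < index_date:
            kept.append((participant_id, event_date))
    return kept


# Hypothetical index (sequencing) dates and hospital events.
index_dates = {"p1": date(2018, 6, 1), "p2": None}
events = [
    ("p1", date(2017, 1, 10)),  # inside a 5-year look-back
    ("p1", date(2010, 3, 2)),   # older than the window
    ("p1", date(2019, 2, 1)),   # after the index date, so excluded
    ("p2", date(2016, 5, 5)),   # participant has no index date
]

core_events = filter_preindex_window(events, index_dates, window_years=5)
```

Excluding events on or after the index date is what makes the resulting features "pre-sequencing": anything recorded later could leak information that was not available at the index point.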

Observability controls

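# Exposure inside the core window: observed pre-index years capped at the window length.
# Unknown observation time stays NaN rather than being treated as zero.
base["exposure_years_core"] = np.where(
    base["observation_years_preindex"].notna(),
    np.minimum(base["observation_years_preindex"], core_window_years),
    np.nan,
)
# Flag participants with under two years of core-window exposure; unknown stays NaN.
short_history_col = f"short_history_indicator_{core_tag}"
base[short_history_col] = np.where(
    base["exposure_years_core"].notna(),
    (base["exposure_years_core"] < 2.0).astype(float),
    np.nan,
)
# Carry forward only the observability covariates that exist in the table.
observability_covariate_cols = [
    col for col in ["exposure_years_core", f"total_event_count_{core_tag}", short_history_col]
    if col in base.columns
]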

Record depth and service-use density were carried as features and review metadata so sparse history could not disappear inside the model.
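
A worked example may make the three possible outcomes of this logic easier to see. The sketch below restates the excerpt for a single participant in plain Python. The 2.0-year threshold comes from the excerpt, but the 5-year core window is an assumed value (the page does not state core_window_years), and None stands in for NaN.

```python
def observability_features(observation_years, core_window_years=5.0, short_history_years=2.0):
    """Return capped exposure and a short-history flag for one participant."""
    if observation_years is None:
        # Unknown history stays unknown instead of being coded as 0.
        return {"exposure_years_core": None, "short_history_indicator": None}
    exposure = min(observation_years, core_window_years)
    return {
        "exposure_years_core": exposure,
        "short_history_indicator": float(exposure < short_history_years),
    }


long_history = observability_features(12.0)     # capped at the core window
short_history = observability_features(1.5)     # flagged as short history
unknown_history = observability_features(None)  # missingness is preserved
```

Keeping the unknown case distinct from the zero case matters because a participant with no recorded history and a participant with unmeasured history are different situations for interpretation.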

Split-aware modelling

# Stratified splitting needs at least two samples in every class.
class_counts = y.value_counts().sort_index()
min_class_n = int(class_counts.min()) if not class_counts.empty else 0
fallback_reasons = []
if min_class_n < 2:
    fallback_reasons.append(f"least populated class has {min_class_n} sample(s) (<2)")

# Fall back to an unstratified split when any class is too small.
use_stratify = len(fallback_reasons) == 0
stratify_arg = y if use_stratify else None
split = train_test_split(
    X, y, pid, test_size=test_size, random_state=rs, stratify=stratify_arg
)

The modelling script guards against invalid train/test splits in sparse classes before model fitting.
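
To show what the guard protects, here is a simplified pure-Python split that stratifies when every class has at least two samples and otherwise falls back to a plain shuffle. It is not scikit-learn's algorithm; guarded_split, its rounding rule and the toy labels are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter, defaultdict


def guarded_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx, used_stratify) for a list of class labels."""
    rng = random.Random(seed)
    counts = Counter(labels)
    use_stratify = bool(counts) and min(counts.values()) >= 2
    indices = list(range(len(labels)))

    if not use_stratify:
        # Fallback: plain shuffle, class proportions are not preserved.
        rng.shuffle(indices)
        n_test = max(1, round(test_size * len(indices)))
        return sorted(indices[n_test:]), sorted(indices[:n_test]), False

    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    test_idx = []
    for members in by_class.values():
        rng.shuffle(members)
        # At least one test case per class, and at least one left for training.
        n_test = min(len(members) - 1, max(1, round(test_size * len(members))))
        test_idx.extend(members[:n_test])

    test_set = set(test_idx)
    train_idx = [i for i in indices if i not in test_set]
    return train_idx, sorted(test_idx), True


labels = [0] * 8 + [1] * 4
train_idx, test_idx, used_stratify = guarded_split(labels)

sparse_labels = [0] * 8 + [1]
_, _, used_stratify_sparse = guarded_split(sparse_labels)
```

With four positives the test set receives exactly one of them, so both partitions contain the minority class. With a single positive that is impossible, which is why the guard records a reason and drops stratification instead of failing.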

Signal-discovery controls

if not args.investigation_id:
    raise RuntimeError("Signal-discovery mode requires --investigation-id.")
if str(investigation_context.get("screening_mode", "")).strip() != SIGNAL_DISCOVERY_SCREENING_MODE:
    raise RuntimeError("Signal-discovery training requires raw ICD signal discovery mode.")

if not bool(signal_discovery_context.get("comparison_only_grouped_anchor_artifacts", {}).get("not_model_input")):
    raise RuntimeError("Grouped-anchor artifacts must remain comparison-only.")

splitter, outer_n_splits, outer_n_repeats = resolve_signal_discovery_outer_splitter(
    y=y, random_state=int(args.random_state)
)

Raw-code signal discovery used explicit investigation settings and kept grouped-anchor artefacts out of model inputs.
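
The page does not show how resolve_signal_discovery_outer_splitter chooses its values. One plausible approach, offered purely as an assumption, is to cap the number of outer folds by the size of the smallest class so that every stratified fold can contain at least one minority case. The function name resolve_outer_cv and the defaults of 5 splits and 3 repeats below are hypothetical.

```python
from collections import Counter


def resolve_outer_cv(labels, max_splits=5, n_repeats=3):
    """Pick (n_splits, n_repeats) so folds never outnumber the smallest class."""
    counts = Counter(labels)
    min_class_n = min(counts.values()) if counts else 0
    n_splits = min(max_splits, min_class_n)
    if n_splits < 2:
        # Fail loudly instead of running an uninformative evaluation.
        raise ValueError(
            f"least populated class has {min_class_n} sample(s); need at least 2"
        )
    return n_splits, n_repeats


# Toy outcome with 40 negatives and 3 positives (hypothetical).
n_splits, n_repeats = resolve_outer_cv([0] * 40 + [1] * 3)
```

Raising an error for a class with fewer than two cases follows the same fail-fast style as the checks in the excerpt, where an invalid configuration stops the run before any model is trained.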


What I would discuss in review

The best review conversation would cover cohort definition, target choice, ICD/HES feature engineering, leakage prevention, observability, imbalance-aware evaluation, subgroup/fairness checks and how to communicate weak-signal results without overclaiming.

PDF

Dissertation overview

Overview PDF for the final-year project, including the research question, method shape, selected figures and result context.
