api / ui /tab_data.py

safraeli

Deploy: 2026 sensor migration + redesign + bucket B endpoints

13fc29d verified 17 days ago

	"""
	Data tab.
	"""

	from __future__ import annotations

	import streamlit as st
	import pandas as pd
	import numpy as np

	from config import settings
	from ui.bootstrap import _BRAND_GREEN, _HAS_PLOTLY, load_labels, load_metrics
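

# ---------------------------------------------------------------------------
# Illustrative sketch (not called by the app). It restates the stomatal
# scaling formulas quoted in the "How does the Farquhar model work?" expander
# inside render_tab_data(), so the arithmetic can be checked in isolation.
# Assumptions: the sketch_* names are hypothetical; the real logic lives in
# src.farquhar_model, whose internals are not shown in this file and may add
# clamping or corrections that this sketch omits.
# ---------------------------------------------------------------------------
import math


def sketch_vpd_scale(vpd_kpa: float) -> float:
    """Exponential stomatal down-scaling that starts above 1 kPa."""
    return math.exp(-0.3 * max(0.0, vpd_kpa - 1.0))


def sketch_cwsi(t_leaf: float, t_air: float, dt_min: float, dt_max: float) -> float:
    """CWSI from the leaf-air temperature difference, clipped to [0, 1].

    dt_min and dt_max are the lower and upper temperature-difference
    baselines; their values are not given in this file, so the caller
    must supply them.
    """
    raw = (t_leaf - t_air - dt_min) / (dt_max - dt_min)
    return min(1.0, max(0.0, raw))


def sketch_ci(co2_ppm: float, vpd_kpa: float, cwsi: float) -> float:
    """Intercellular CO2 as documented: ci = CO2 * (1 - 1 / (1.6 * gs_factor))."""
    gs_factor = 1.2 * sketch_vpd_scale(vpd_kpa) * (1.0 - 0.5 * cwsi)
    return co2_ppm * (1.0 - 1.0 / (1.6 * gs_factor))


# Example: sketch_ci(389, 1.5, 0.0) evaluates to roughly 153.6 ppm.
# As written, the expression turns negative once gs_factor drops below
# 1 / 1.6 = 0.625 (for example VPD = 3 kPa with CWSI = 0.5); how the real
# model guards that case is not visible in this file.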

	if _HAS_PLOTLY:
	    import plotly.express as px  # needed by the px.histogram / px.bar calls below
	    import plotly.graph_objects as go
	    from plotly.subplots import make_subplots
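

# ---------------------------------------------------------------------------
# Illustrative sketch (not called by the app) of the flagging rule described
# in the "How are thresholds generated?" expander: a reading is flagged when
# it violates a hard bound, or when the Z-score and IQR tests both fire.
# Assumptions: sketch_flag_anomalies is a hypothetical helper written with the
# standard library only; the default zscore_threshold and iqr_multiplier are
# conventional placeholders, not values returned by Gemini; the actual rule is
# implemented in src.llm_data_engineer, which is not shown in this file.
# ---------------------------------------------------------------------------
import statistics


def sketch_flag_anomalies(values, lower_bound=None, upper_bound=None,
                          zscore_threshold=3.0, iqr_multiplier=1.5):
    """Return one boolean per reading; True marks a flagged reading."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation; needs >= 2 readings
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - iqr_multiplier * iqr
    hi_fence = q3 + iqr_multiplier * iqr

    flags = []
    for v in values:
        # Hard bounds: physically impossible values or known sensor faults.
        hard = ((lower_bound is not None and v < lower_bound)
                or (upper_bound is not None and v > upper_bound))
        # Statistical tests must agree before a reading is flagged.
        z_out = std > 0 and abs(v - mean) / std > zscore_threshold
        iqr_out = v < lo_fence or v > hi_fence
        flags.append(bool(hard or (z_out and iqr_out)))
    return flags


# Example: for readings [10, 11, 9, 10, 12, 10, 11, 9, 10, 3000] with
# upper_bound=2500, only the last reading is flagged (hard-bound violation).
# Without bounds and with the placeholder defaults nothing is flagged, because
# the IQR test fires for 3000 but its Z-score stays below 3 in so small a sample.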



	def render_tab_data() -> None:
	_data_section = st.radio(

	"Section",

	["Farquhar Model", "Model Validation", "Data Explorer"],

	horizontal=True,

	)


	if _data_section == "Farquhar Model":

	st.header("How we measure vine photosynthesis")

	st.markdown(

	"Before we can predict photosynthesis, we need to measure it. "

	"On-site sensors record light, temperature, humidity, and CO2 every "

	"15 minutes. A well-established plant biology model (Farquhar et al., 1980) "

	"converts these readings into the photosynthesis rate A — "

	"how fast the vine is converting sunlight into sugar. "

	"This tab shows the results of that calculation."

	)

	with st.expander("How does the Farquhar model work?"):

	st.markdown(

	"Uses the Farquhar et al. (1980) mechanistic model with "

	"Greer & Weedon (2012) grapevine parameters to compute the net leaf "

	"photosynthesis rate A (\u00b5mol CO\u2082 m\u207b\u00b2 s\u207b\u00b9) from on-site sensor readings "

	"(PAR, leaf temperature, air temperature, CO\u2082, VPD). Only daytime rows "

	"(PAR > 50) during the growing season (May\u2013Sep) are used. "

	"Temperature dependencies use Bernacchi et al. (2001) kinetic constants.\n\n"

	"The core computation:\n\n"

	"1. Rubisco-limited rate: Ac = Vcmax \u00b7 (ci \u2212 \u0393*) / (ci + Kc \u00b7 (1 + O\u1d62 / Ko))\n"

	"2. RuBP-limited rate: Aj = J \u00b7 (ci \u2212 \u0393*) / (4\u00b7ci + 8\u00b7\u0393*)\n"

	"3. Net assimilation: A = min(Ac, Aj) \u2212 Rd\n\n"

	"Where Vcmax and Jmax follow temperature-dependent modified Arrhenius curves "

	"(peak at 39\u00b0C and 36\u00b0C respectively for Semillon), J is solved from the "

	"light-response quadratic, and ci (intercellular CO\u2082) is derived from "

	"ambient CO\u2082 scaled by stomatal conductance (reduced by VPD and CWSI).\n\n"

	"#### How VPD and CWSI reduce photosynthesis\n\n"

	"Both VPD and CWSI act on photosynthesis through the same bottleneck: "

	"stomatal conductance (gs). Stomata are the pores on the leaf surface "

	"that let CO\u2082 in for photosynthesis \u2014 but also let water vapor out.\n\n"

	"VPD (Vapor Pressure Deficit) measures how dry the air is. "

	"When VPD is high (hot, dry air), the leaf would lose water too fast, "

	"so the vine partially closes its stomata to conserve water. "

	"Less open stomata = less CO\u2082 enters the leaf = lower ci = lower A. "

	"The model applies an exponential decay: "

	"`gs_scale = exp(\u22120.3 \u00b7 max(0, VPD \u2212 1.0))`, "

	"so the effect kicks in above 1 kPa and intensifies with drier air.\n\n"

	"CWSI (Crop Water Stress Index) is computed from the leaf\u2013air "

	"temperature difference: `CWSI = (Tleaf \u2212 Tair \u2212 \u0394Tmin) / (\u0394Tmax \u2212 \u0394Tmin)`, "

	"clipped to [0, 1]. A well-watered vine transpires freely, keeping its "

	"leaves cooler than the air (CWSI \u2248 0). A stressed vine closes stomata, "

	"transpiration drops, and leaves heat up (CWSI \u2192 1). "

	"The model reduces stomatal conductance by `(1 \u2212 0.5 \u00b7 CWSI)`, "

	"so at full stress (CWSI = 1) stomatal opening is halved.\n\n"

	"Combined effect on ci: "

	"`ci = CO\u2082 \u00b7 (1 \u2212 1 / (1.6 \u00b7 gs_factor))` where "

	"`gs_factor = 1.2 \u00b7 VPD_scale \u00b7 (1 \u2212 0.5 \u00b7 CWSI)`\n\n"

	"This means on a hot, dry Negev afternoon (VPD > 3 kPa, CWSI > 0.5), "

	"ci drops sharply and photosynthesis can fall even when light is abundant "

	"\u2014 the vine has plenty of energy but cannot get enough CO\u2082 through "

	"its closed stomata.\n\n"

	"Key references:\n"

	"- [Farquhar, von Caemmerer & Berry (1980)](https://doi.org/10.1007/BF00386231) \u2014 "

	"Biochemical model of photosynthetic CO\u2082 assimilation\n"

	"- [Greer & Weedon (2012)](https://doi.org/10.1111/j.1365-3040.2011.02471.x) \u2014 "

	"Modelling photosynthetic responses to temperature of grapevine\n"

	"- [Bernacchi et al. (2001)](https://doi.org/10.1046/j.1365-3040.2001.00668.x) \u2014 "

	"Temperature dependence of Kc, Ko, and \u0393*"

	)


	st.subheader("Sensor inputs")

	st.markdown("The Farquhar model loads 6 columns from the Air1 reference station, "
	"sampled every 15 minutes; 5 feed the calculation and relative humidity is loaded but not consumed:")

	used_sensors_tab = pd.DataFrame([

	{"Column": "Air1_PAR_ref", "Measurement": "Photosynthetically Active Radiation", "Units": "\u00b5mol m\u207b\u00b2 s\u207b\u00b9", "Used in": "Farquhar: electron transport (J)", "Notes": "Primary light input driving photosynthesis rate"},

	{"Column": "Air1_leafTemperature_ref", "Measurement": "Leaf temperature", "Units": "\u00b0C", "Used in": "Farquhar: Vcmax, Jmax, Kc, Ko, \u0393*; CWSI", "Notes": "Controls enzyme kinetics; also used to compute CWSI"},

	{"Column": "Air1_airTemperature_ref", "Measurement": "Air temperature", "Units": "\u00b0C", "Used in": "CWSI computation", "Notes": "Tleaf \u2212 Tair drives the water stress index"},

	{"Column": "Air1_CO2_ref", "Measurement": "Ambient CO\u2082 concentration", "Units": "ppm", "Used in": "Farquhar: intercellular CO\u2082 (ci)", "Notes": "Substrate for carbon fixation by Rubisco (\u00d70.7 correction applied)"},

	{"Column": "Air1_VPD_ref", "Measurement": "Vapor Pressure Deficit", "Units": "kPa", "Used in": "Farquhar: stomatal conductance \u2192 ci", "Notes": "High VPD closes stomata, reducing ci and thus A"},

	{"Column": "Air1_airHumidity_ref", "Measurement": "Relative humidity", "Units": "%", "Used in": "Loaded but not consumed", "Notes": "VPD already encodes humidity; column is redundant"},

	])

	st.dataframe(used_sensors_tab, hide_index=True)


	labels_path = settings.PROCESSED_DIR / "stage1_labels.csv"

	validation_img = settings.OUTPUTS_DIR / "stage1_validation.png"


	if labels_path.exists():

	df_labels = load_labels(str(labels_path))

	df_labels.index = pd.to_datetime(df_labels.index, utc=True)


	# Metrics row

	c1, c2, c3, c4 = st.columns(4)

	c1.metric("Observations", f"{len(df_labels):,}")

	c2.metric("Avg photosynthesis rate", f"{df_labels.iloc[:, 0].mean():.1f}",

	help="Mean net assimilation A in \u00b5mol CO\u2082 m\u207b\u00b2 s\u207b\u00b9. Higher values mean the vine is fixing carbon faster. Typical range: 5-20.")

	c3.metric("Date from", df_labels.index.min().strftime("%Y-%m-%d"))

	c4.metric("Date to", df_labels.index.max().strftime("%Y-%m-%d"))


	with st.expander("What does this table show?"):

	st.markdown(

	"Descriptive statistics (count, mean, std, min, quartiles, max) of the "

	"computed photosynthesis rate A. Typical grapevine values are 0\u201325 "

	"\u00b5mol CO\u2082 m\u207b\u00b2 s\u207b\u00b9. Values outside this range may indicate sensor issues."

	)

	st.dataframe(df_labels.describe())

	st.caption(

	"This table summarizes the photosynthesis measurements. The 'mean' row shows "

	"the average rate across all observations. Values between 5 and 20 are typical "

	"for healthy grapevines during the growing season."

	)


	st.download_button(

	"Download labels CSV",

	df_labels.to_csv(),

	file_name="stage1_labels.csv",

	mime="text/csv",

	)


	if validation_img.exists():

	with st.expander("How to read the validation plots"):

	st.markdown(

	"Top panel \u2014 Diurnal pattern: A vs. hour of day (UTC). Expect a bell curve "

	"peaking mid-morning to early afternoon when light and temperature are optimal.\n\n"

	"Bottom panel \u2014 A vs PAR: Photosynthesis rate plotted against Photosynthetically "

	"Active Radiation. A should increase with PAR and saturate at high light levels, "

	"forming a characteristic light-response curve."

	)

	st.image(str(validation_img), width='stretch')

	else:

	st.info("No pre-computed photosynthesis labels found. Run `python scripts/run_pipeline.py` to generate them.")


	# ---------------------------------------------------------------------------


	if _data_section == "Model Validation":

	st.header("FvCB Model Validation")

	st.markdown(

	"Comparison of our Farquhar-von Caemmerer-Berry (FvCB) photosynthesis model "

	"against measured data from "

	"[Greer & Weedon (2012)](https://doi.org/10.1111/j.1365-3040.2011.02471.x) "

	"for field-grown Vitis vinifera cv. Semillon in a hot climate "

	"(Riverina, NSW, Australia)."

	)


	# --- Reference data from Greer & Weedon (2012) ---

	_val_temps = [20, 25, 30, 35, 40]


	# Fig 5a: Light-saturated Amax at ambient CO2 (389 ppm)

	_paper_amax = {20: 12.0, 25: 16.9, 30: 19.9, 35: 15.3, 40: 12.0}

	_paper_amax_se = {20: 1.5, 25: 1.3, 30: 1.8, 35: 1.2, 40: 1.5}


	# Fig 11: Vcmax and Jmax (from Arrhenius fit to A/Ci curves)

	_paper_vcmax = {20: 20, 25: 38.5, 30: 58, 35: 85, 40: 110}

	_paper_jmax = {20: 60, 25: 98.3, 30: 135, 35: 165, 40: 170}


	# Table 1: Stomatal limitation (%)

	_paper_stom_lim = {20: 13, 25: 20, 30: 25, 35: 31, 40: 34}


	from src.farquhar_model import FarquharModel as _FMVal

	_val_model = _FMVal()


	# ---- Section 1: Light Response Curves ----

	st.subheader("1. Photosynthetic light response at different temperatures")

	st.markdown(

	"Light response curves (A vs PFD) at five leaf temperatures. "

	"Model run at ambient CO$_2$ = 389 ppm; VPD is set by the slider below (default 1.5 kPa)."

	)


	_val_vpd = st.slider(

	"VPD for model curves (kPa)", 0.5, 3.0, 1.5, 0.1,

	key="val_vpd"

	)

	_val_pfds = np.arange(0, 2100, 25)


	if _HAS_PLOTLY:

	_lr_fig = go.Figure()

	_temp_colors = {20: "#1f77b4", 25: "#2ca02c", 30: "#d62728",

	35: "#9467bd", 40: "#ff7f0e"}

	for t in _val_temps:

	# Model curve

	_a_vals = [_val_model.calc_photosynthesis(

	PAR=float(p), Tleaf=t, CO2=389, VPD=_val_vpd, Tair=t

	) for p in _val_pfds]

	_lr_fig.add_trace(go.Scatter(

	x=_val_pfds, y=_a_vals, mode="lines",

	name=f"{t} °C (model)",

	line=dict(color=_temp_colors[t]),

	))

	# Paper reference point (Amax)

	_lr_fig.add_trace(go.Scatter(

	x=[1800], y=[_paper_amax[t]],

	mode="markers",

	name=f"{t} °C (Greer & Weedon)",

	marker=dict(color=_temp_colors[t], size=12, symbol="star",

	line=dict(width=1, color="black")),

	error_y=dict(type="data", array=[_paper_amax_se[t]], visible=True),

	showlegend=True,

	))

	_lr_fig.update_layout(

	xaxis_title="PFD [µmol photons m⁻² s⁻¹]",

	yaxis_title="A [µmol CO₂ m⁻² s⁻¹]",

	height=500,

	legend=dict(font=dict(size=10)),

	)

	st.plotly_chart(_lr_fig)

	else:

	st.info("Install plotly for interactive charts.")


	# ---- Section 2: A comparison table ----

	st.subheader("2. Light-saturated A: Model vs Paper")

	_rows = []

	for t in _val_temps:

	a_model = _val_model.calc_photosynthesis(

	PAR=2000, Tleaf=t, CO2=389, VPD=_val_vpd, Tair=t

	)

	_rows.append({

	"T_leaf (°C)": t,

	"A_model": round(a_model, 1),

	"A_paper (Greer & Weedon)": _paper_amax[t],

	"Difference (%)": round((a_model - _paper_amax[t]) / _paper_amax[t] * 100, 0),

	"Limitation": "RuBP regen." if t <= 30 else "Rubisco",

	})

	st.dataframe(pd.DataFrame(_rows), hide_index=True)


	# ---- Section 3: Vcmax / Jmax temperature response ----

	st.subheader("3. Vcmax and Jmax temperature response")

	st.markdown(

	"Temperature dependence of maximum carboxylation rate (Vcmax) and "

	"electron transport capacity (Jmax). Model uses modified Arrhenius "

	"(Medlyn et al. 2002) with Greer & Weedon (2012) activation/deactivation "

	"energies. Topt(Vcmax) = 39 °C, Topt(Jmax) = 36 °C."

	)


	_t_range = np.arange(15, 50, 0.5)

	_vcmax_curve = [_val_model.calc_Vcmax(t + 273.15) for t in _t_range]

	_jmax_curve = [_val_model.calc_Jmax(t + 273.15) for t in _t_range]


	# Scale paper's Cc-based Vcmax/Jmax to our Ci-based values for comparison

	_scale_v = _val_model.params["k25_vcmax"] / 38.5 # ratio Ci-based/Cc-based

	_scale_j = _val_model.params["k25_jmax"] / 98.3

	_paper_vcmax_scaled = {t: v * _scale_v for t, v in _paper_vcmax.items()}

	_paper_jmax_scaled = {t: j * _scale_j for t, j in _paper_jmax.items()}


	if _HAS_PLOTLY:

	_vj_fig = make_subplots(rows=1, cols=2,

	subplot_titles=("Vcmax", "Jmax"))

	_vj_fig.add_trace(go.Scatter(

	x=list(_t_range), y=_vcmax_curve, mode="lines",

	name="Vcmax (model)", line=dict(color="#d62728"),

	), row=1, col=1)

	_vj_fig.add_trace(go.Scatter(

	x=list(_paper_vcmax_scaled.keys()),

	y=list(_paper_vcmax_scaled.values()),

	mode="markers", name="Vcmax (paper, scaled)",

	marker=dict(color="#d62728", size=10, symbol="star",

	line=dict(width=1, color="black")),

	), row=1, col=1)

	_vj_fig.add_trace(go.Scatter(

	x=list(_t_range), y=_jmax_curve, mode="lines",

	name="Jmax (model)", line=dict(color="#1f77b4"),

	), row=1, col=2)

	_vj_fig.add_trace(go.Scatter(

	x=list(_paper_jmax_scaled.keys()),

	y=list(_paper_jmax_scaled.values()),

	mode="markers", name="Jmax (paper, scaled)",

	marker=dict(color="#1f77b4", size=10, symbol="star",

	line=dict(width=1, color="black")),

	), row=1, col=2)

	_vj_fig.update_xaxes(title_text="Leaf temperature (°C)")

	_vj_fig.update_yaxes(title_text="µmol m⁻² s⁻¹")

	_vj_fig.update_layout(height=400)

	st.plotly_chart(_vj_fig)

	else:

	st.info("Install plotly for interactive charts.")


	# ---- Section 4: Limitation regime ----

	st.subheader("4. RuBP regeneration vs Rubisco carboxylation limitation")

	st.markdown(

	"The paper's key finding: below 30 °C, photosynthesis is limited by "

	"RuBP regeneration (electron transport / light reactions). "

	"Above 30 °C, Rubisco carboxylation becomes limiting due to "

	"declining CO$_2$ affinity and increased photorespiration.\n\n"

	"This 30 °C transition is critical for shading decisions:\n"

	"- Below 30 °C: shading reduces light and hurts photosynthesis (RuBP-limited)\n"

	"- Above 30 °C: shading may help by reducing heat stress on Rubisco"

	)


	if _HAS_PLOTLY:

	_ac_vals = []

	_aj_vals = []

	_t_lim = np.arange(15, 46, 0.5)

	for t in _t_lim:

	Tk = t + 273.15

	Vcmax = _val_model.calc_Vcmax(Tk)

	Jmax = _val_model.calc_Jmax(Tk)

	J = _val_model.calc_electron_transport(2000, Jmax)

	gamma = _val_model.calc_gamma_star(Tk)

	Kc = _val_model.calc_Kc(Tk)

	Ko = _val_model.calc_Ko(Tk)

	ci = _val_model._ci_from_ca(389, 1.5, 0.0)

	Ac = Vcmax * (ci - gamma) / (ci + Kc * (1 + 210.0 / Ko))  # 210.0 is the Oi (oxygen) term of the Ac formula

	Aj = J * (ci - gamma) / (4 * ci + 8 * gamma)

	Rd = 0.015 * Vcmax

	_ac_vals.append(Ac - Rd)

	_aj_vals.append(Aj - Rd)


	_lim_fig = go.Figure()

	_lim_fig.add_trace(go.Scatter(

	x=list(_t_lim), y=_ac_vals, mode="lines",

	name="Ac (Rubisco-limited)",

	line=dict(color="#d62728", dash="dash"),

	))

	_lim_fig.add_trace(go.Scatter(

	x=list(_t_lim), y=_aj_vals, mode="lines",

	name="Aj (RuBP-limited)",

	line=dict(color="#1f77b4", dash="dash"),

	))

	# Net A = min(Ac, Aj); Rd is already subtracted from both terms above, and the result is floored at 0

	_a_net = [max(0, min(ac, aj)) for ac, aj in zip(_ac_vals, _aj_vals)]

	_lim_fig.add_trace(go.Scatter(

	x=list(_t_lim), y=_a_net, mode="lines",

	name="A_net = min(Ac, Aj)",

	line=dict(color="black", width=3),

	))

	_lim_fig.add_vline(x=30, line_dash="dot", line_color="gray",

	annotation_text="30 °C transition")

	_lim_fig.update_layout(

	xaxis_title="Leaf temperature (°C)",

	yaxis_title="A [µmol CO₂ m⁻² s⁻¹]",

	height=450,

	)

	st.plotly_chart(_lim_fig)


	# ---- Section 5: Key findings ----

	st.subheader("5. Key findings from validation")

	st.markdown("""

	Agreement with Greer & Weedon (2012):

	- Temperature ranking of Amax matches: 30 °C > 25 °C > 35 °C > 20 °C > 40 °C

	- RuBP/Rubisco limitation transition occurs at ~30-32 °C (paper: 30 °C)

	- Vcmax peaks at 39 °C, Jmax peaks at 36 °C (exact match with paper)

	- Quantitative match within 2-15% at 20-35 °C

	- Jmax/Vcmax ratio declines from ~2.4 at 20 °C to ~1.1 at 45 °C (paper: 3.0 to 1.5)


	Known limitations:

	- At 40 °C, model underestimates A by ~12% due to Bernacchi (2001) Rubisco kinetics

	being parameterised for tobacco, not heat-adapted grapevine

	- Stomatal response to temperature is modelled via VPD only; the paper shows

	direct temperature effects on gs (Table 1: 0.199 at 20 °C to 0.140 at 40 °C)

	- The model uses Ci-based (intercellular CO2) calculations; the paper uses

	Cc-based (chloroplast CO2) with mesophyll conductance gm = 5-10 µmol m⁻² s⁻¹ Pa⁻¹


	Reference:

	Greer, D.H. & Weedon, M.M. (2012) Modelling photosynthetic responses to temperature

	of grapevine (Vitis vinifera cv. Semillon) leaves on vines grown in a hot climate.

	Plant, Cell & Environment, 35, 1050-1064.

	[DOI: 10.1111/j.1365-3040.2011.02471.x](https://doi.org/10.1111/j.1365-3040.2011.02471.x)

	""")


	# ---------------------------------------------------------------------------


	if _data_section == "Data Explorer":

	st.header("Data Explorer")

	st.markdown(

	"This tab lets you explore the raw data behind the predictions. "

	"Choose a data source below:\n\n"

	"- Vineyard sensors \u2014 Photosynthesis rate A, PAR, leaf/air temperature, and \u0394T from on-site crop sensors.\n"

	"- Weather station data \u2014 IMS station 43 (Sde Boker) and the merged dataset used for ML training.\n"

	"- AI Data Engineering \u2014 Gemini-powered sensor anomaly detection (Z-score/IQR + physical bounds) and engineered features "

	"(cyclical time encodings, Stress Risk Score). Run the pipeline and inspect thresholds, cleaning summary, and the daytime stress profile."

	)

	eda_stage = st.radio(

	"Data source",

	["Vineyard sensors", "Weather station data", "AI Data Engineering"],

	horizontal=True,

	label_visibility="visible",

	)


	if eda_stage == "Vineyard sensors":

	st.subheader("Vineyard sensor data")

	with st.expander("About this data"):

	st.markdown(

	"Shows the distribution and temporal patterns of the computed photosynthesis "

	"rate A, plus the raw sensor inputs used to calculate it. "

	"This helps verify that the model produces physiologically plausible values."

	)

	try:

	from scripts.eda import get_stage1_eda

	s1 = get_stage1_eda()

	except Exception as e:
	    # Record the failure once; the s1.get("error") check below displays it.
	    s1 = {"error": str(e)}


	if s1.get("error"):

	st.warning(s1["error"])

	else:

	stats = s1["labels_stats"]

	c1, c2, c3, c4 = st.columns(4)

	c1.metric("Observations", stats["count"])

	c2.metric("Mean A", f"{stats['A_mean']:.2f}")

	c3.metric("Std A", f"{stats['A_std']:.2f}")

	c4.metric("Range", f"{stats['A_min']:.1f} \u2013 {stats['A_max']:.1f}")

	st.caption(f"Date range: {stats['date_min']} to {stats['date_max']}")


	if _HAS_PLOTLY and s1.get("labels") is not None:

	A = s1["labels"].iloc[:, 0]

	with st.expander("About: Distribution of A"):

	st.markdown(

	"Histogram of all computed A values. A right-skewed distribution is typical: "

	"many low-A values (early/late day, cloudy) with a tail of high-A values "

	"(midday, full sun). The peak should be between 5\u201315 \u00b5mol m\u207b\u00b2 s\u207b\u00b9 for grapevines."

	)

	fig = px.histogram(x=A[A >= 1].dropna(), nbins=50, title="Distribution of A (Stage 1 labels, A \u2265 1)")

	fig.update_layout(xaxis_title="A (\u00b5mol m\u207b\u00b2 s\u207b\u00b9)", xaxis_range=[1, None])

	st.plotly_chart(fig)


	with st.expander("About: A over time"):

	st.markdown(

	"Time series of A across the dataset. Only the growing season "

	"(May\u2013Sep) is included \u2014 the gaps between clusters represent the "

	"dormant months (Oct\u2013Apr) when the vine does not photosynthesize "

	"and no data is collected. Within each season, look for diurnal "

	"oscillations and any anomalous spikes that may indicate sensor issues."

	)

	# Resample to daily mean to compress gaps and smooth diurnal noise

	A_daily = A.resample("D").mean().dropna()

	fig2 = go.Figure()

	fig2.add_trace(go.Scatter(

	x=A_daily.index, y=A_daily.values, mode="lines",

	name="A (daily mean)", line=dict(width=1.5, color=_BRAND_GREEN),

	connectgaps=False,

	))

	fig2.update_layout(

	title="A over time (daily mean)",

	xaxis_title="Time", yaxis_title="A (\u00b5mol m\u207b\u00b2 s\u207b\u00b9)",

	)

	st.plotly_chart(fig2)


	if s1.get("sensor_sample") is not None and not s1["sensor_sample"].empty and _HAS_PLOTLY:

	df = s1["sensor_sample"]

	st.subheader("Sensor distributions (daytime PAR > 50)")

	with st.expander("About sensor distributions"):

	st.markdown(

	"Histograms of the main sensor inputs used in the Farquhar model, filtered "

	"to daytime only (PAR > 50 \u00b5mol m\u207b\u00b2 s\u207b\u00b9).\n\n"

	"- PAR: Light energy for photosynthesis (400\u2013700 nm). "

	"Values above 2500 are sensor artifacts and are excluded.\n"

	"- Leaf Temp: Leaf surface temperature (\u00b0C).\n"

	"- Air Temp: Ambient temperature near the canopy (\u00b0C)."

	)

	sensor_cols = [c for c in ["Air1_PAR_ref", "Air1_leafTemperature_ref", "Air1_airTemperature_ref"] if c in df.columns]

	if sensor_cols:

	cols = st.columns(len(sensor_cols))

	for col_st, col_name in zip(cols, sensor_cols):

	with col_st:

	series = df[col_name].dropna()

	# Remove PAR outliers (sensor artifacts above 2500)

	if col_name == "Air1_PAR_ref":

	series = series[series <= 2500]

	fig = px.histogram(series, nbins=40, title=col_name.replace("Air1_", "").replace("_ref", ""))

	fig.update_layout(height=300)

	st.plotly_chart(fig)


	# --- Air-Leaf Temperature Delta ---

	if "Air1_leafTemperature_ref" in df.columns and "Air1_airTemperature_ref" in df.columns:

	st.subheader("Leaf\u2013Air temperature difference (\u0394T)")

	with st.expander("Why is \u0394T important?"):

	st.markdown(

	"The difference between leaf and air temperature "

	"(\u0394T = T_leaf \u2212 T_air) is a direct indicator of "

	"plant water stress.\n\n"

	"- \u0394T < 0 (leaf cooler than air): the vine is transpiring "

	"normally \u2014 evaporative cooling keeps the leaf below air "

	"temperature. The stomata are open and photosynthesis is active.\n"

	"- \u0394T \u2248 0: transpiration is slowing down.\n"

	"- \u0394T > 0 (leaf warmer than air): the vine has partially "

	"or fully closed its stomata due to water stress or extreme VPD. "

	"Transpiration has stopped cooling the leaf, so it heats up "

	"above ambient. Photosynthesis is severely limited.\n\n"

	"This is the basis of the Crop Water Stress Index (CWSI) "

	"used in the Farquhar model. In agrivoltaics, a rising \u0394T is "

	"the signal that the vine would benefit from tracker shading: "

	"the extra light cannot be used anyway because the stomata "

	"are shut."

	)

	delta_t = df["Air1_leafTemperature_ref"] - df["Air1_airTemperature_ref"]

	delta_t = delta_t.dropna()

	col_hist, col_time = st.columns(2)

	with col_hist:

	fig_dt = px.histogram(

	delta_t, nbins=50,

	title="\u0394T distribution (daytime)",

	color_discrete_sequence=[_BRAND_GREEN],

	)

	fig_dt.update_layout(

	xaxis_title="\u0394T = T_leaf \u2212 T_air (\u00b0C)",

	yaxis_title="Count",

	height=350,

	)

	fig_dt.add_vline(x=0, line_dash="dash", line_color="red",

	annotation_text="T_leaf = T_air")

	st.plotly_chart(fig_dt)

	with col_time:

	if "time" in df.columns:

	# Filter to growing season only (May-Sep)

	_ts = pd.to_datetime(df["time"], utc=True)

	_grow_mask = _ts.dt.month.isin([5, 6, 7, 8, 9])

	_dt_grow = delta_t[_grow_mask]

	_ts_grow = _ts.loc[_dt_grow.index]  # align timestamps with the ΔT rows that survived dropna()

	fig_dt2 = go.Figure()

	fig_dt2.add_trace(go.Scatter(

	x=_ts_grow, y=_dt_grow.values,

	mode="markers", marker=dict(size=2, color=_BRAND_GREEN, opacity=0.4),

	name="\u0394T",

	))

	fig_dt2.add_hline(y=0, line_dash="dash", line_color="red")

	fig_dt2.update_layout(

	title="\u0394T over time",

	xaxis_title="Time",

	yaxis_title="\u0394T (\u00b0C)",

	height=350,

	)

	st.plotly_chart(fig_dt2)


	elif eda_stage == "Weather station data":

	st.subheader("Weather station data")

	with st.expander("About this data"):

	st.markdown(

	"Shows the IMS weather station data and the merged dataset used for "

	"prediction model training. This helps verify data overlap, check for "

	"missing values, and understand the weather patterns."

	)

	try:

	from scripts.eda import get_stage2_eda

	s2 = get_stage2_eda()

	except Exception as e:
	    # Record the failure once; the s2.get("error") check below displays it.
	    s2 = {"error": str(e)}


	if s2.get("error"):

	st.warning(s2["error"])

	else:

	stats = s2["stats"]

	c1, c2, c3 = st.columns(3)

	c1.metric("IMS rows", f"{stats['ims_rows']:,}")

	c2.metric("Merged rows", f"{stats['merged_rows']:,}")

	c3.metric("Features", len(stats["feature_cols"]))

	st.caption(f"IMS range: {stats['ims_date_min']} to {stats['ims_date_max']}")


	with st.expander("What are the feature columns?"):

	st.markdown(

	"IMS weather variables and engineered time features used as "

	"ML inputs. No on-site sensor data is included (strict separation to avoid leakage).\n\n"

	"- air_temperature_c, tdmax_c, tdmin_c: Temperature from IMS station.\n"

	"- ghi_w_m2: Global Horizontal Irradiance (solar radiation) \u2014 proxy for PAR.\n"

	"- rh_percent: Relative humidity.\n"

	"- rain_mm: Precipitation.\n"

	"- wind_speed_ms: Wind speed.\n"

	"- hour_sin, hour_cos: Cyclical encoding of hour-of-day.\n"

	"- doy_sin, doy_cos: Cyclical encoding of day-of-year (seasonality)."

	)

	merged = s2["merged"]

	st.dataframe(merged.describe())


	if _HAS_PLOTLY and "A" in merged.columns:

	with st.expander("About: Distribution of A (merged)"):

	st.markdown(

	"Distribution of A in the merged IMS+labels dataset. This is the subset "

	"of Stage 1 labels that have matching IMS timestamps. Compare with Stage 1 "

	"distribution to check for sampling bias."

	)

	fig = px.histogram(merged["A"][merged["A"] >= 1].dropna(), nbins=50, title="Distribution of A (merged set, A \u2265 1)")

	fig.update_layout(xaxis_title="A (\u00b5mol m\u207b\u00b2 s\u207b\u00b9)", xaxis_range=[1, None])

	st.plotly_chart(fig)


	if _HAS_PLOTLY and merged is not None:

	num_cols = [c for c in stats["feature_cols"] if c in merged.columns][:4]

	if num_cols:

	with st.expander("About: Feature distributions"):

	st.markdown(

	"Histograms of the first four numeric IMS features in the merged dataset. "

	"Check for: reasonable value ranges, skewness, outliers, and missing-value "

	"patterns that might affect model training."

	)

	fig = make_subplots(rows=2, cols=2, subplot_titles=num_cols)

	for i, col in enumerate(num_cols):

	r, c = i // 2 + 1, i % 2 + 1

	fig.add_trace(go.Histogram(x=merged[col].dropna(), nbinsx=30), row=r, col=c)

	fig.update_layout(title="Feature distributions (merged)")

	st.plotly_chart(fig)


	else:

	# ── AI Data Engineering ──────────────────────────────────────────────

	st.subheader("AI Data Engineering")

	st.markdown(

	"Gemini analyzes each sensor column's statistics against known physical constraints "

	"for grapevines in the Negev desert, then returns per-column anomaly thresholds "

	"(hard bounds + Z-score + IQR multiplier). The pipeline also generates five "

	"engineered features fed directly into the ML prediction models."

	)


	_llm_err = None

	try:

	from src.llm_data_engineer import LLMDataEngineer, SENSOR_CONTEXT


	_loader_path = settings.SENSORS_WIDE_SAMPLE_PATH

	if not _loader_path.exists():

	_loader_path = settings.SENSORS_WIDE_PATH

	if not _loader_path.exists():

	raise FileNotFoundError("Sensor data file not found.")


	_df_raw = pd.read_csv(_loader_path)

	_engineer = LLMDataEngineer(verbose=False)

	_key_cols = list(SENSOR_CONTEXT.keys())

	_present = [c for c in _key_cols if c in _df_raw.columns]


	with st.spinner("Querying Gemini for anomaly thresholds…"):

	_thresholds = _engineer.analyze_anomalies(_df_raw, columns=_present)


	_df_clean = _engineer.apply_cleaning(_df_raw, _thresholds, strategy="clip")


	with st.spinner("Querying Gemini for feature engineering spec…"):

	_feat_spec = _engineer.get_feature_spec(list(_df_clean.columns))


	_df_eng = _engineer.engineer_features(_df_clean, feature_spec=_feat_spec)


	# Violation counts

	_viol_before, _viol_after = {}, {}

	for _col, _t in _thresholds.items():
	    if _col not in _df_raw.columns:
	        continue
	    _lo, _hi = _t.get("lower_bound"), _t.get("upper_bound")
	    # Hard-bound violations in the raw data
	    _m = pd.Series(False, index=_df_raw.index)
	    if _lo is not None:
	        _m |= _df_raw[_col] < _lo
	    if _hi is not None:
	        _m |= _df_raw[_col] > _hi
	    _viol_before[_col] = int(_m.sum())
	    # Hard-bound violations remaining after cleaning
	    _m2 = pd.Series(False, index=_df_clean.index)
	    if _lo is not None:
	        _m2 |= _df_clean[_col] < _lo
	    if _hi is not None:
	        _m2 |= _df_clean[_col] > _hi
	    _viol_after[_col] = int(_m2.sum())


	# Stress profile by local hour (fixed UTC+3 offset, i.e. Israel summer time; not DST-aware)

	_df_eng["_hr_local"] = (pd.to_datetime(_df_eng["time"], utc=True).dt.hour + 3) % 24

	_daytime = _df_eng[_df_eng["Air1_PAR_ref"] > 50] if "Air1_PAR_ref" in _df_eng.columns else _df_eng

	_stress_profile = (

	_daytime.groupby("_hr_local")["stress_risk_score"].mean()

	.reindex(range(24), fill_value=float("nan"))

	)


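	# Guard against an empty threshold dict before inspecting the first rationale
	_first_t = next(iter(_thresholds.values()), None)
	_used_gemini = _first_t is not None and "Statistical fallback" not in _first_t.get("rationale", "")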


	except Exception as _exc:

	_llm_err = str(_exc)


	if _llm_err:

	st.error(f"Pipeline error: {_llm_err}")

	else:

	_source_badge = (

	"🤖 Thresholds sourced from Gemini"

	if _used_gemini

	else "⚙️ Thresholds from statistical fallback (set `GOOGLE_API_KEY` to enable Gemini)"

	)

	st.caption(_source_badge)


	# ── Section 1: Anomaly thresholds ──────────────────────────────

	st.markdown("#### Gemini anomaly thresholds")

	with st.expander("How are thresholds generated?"):

	st.markdown(

	"For each sensor column, `LLMDataEngineer` sends the full descriptive "

	"statistics (min, max, percentiles) plus domain context — physical units, "

	"expected range for the Negev site, known failure modes — to Gemini. "

	"Gemini returns a JSON with:\n\n"

	"- Hard bounds (`lower_bound` / `upper_bound`): values outside these are "

	"physically impossible or known sensor faults.\n"

	"- Z-score threshold: flags readings that deviate more than N standard "

	"deviations from the column mean.\n"

	"- IQR multiplier: flags readings outside Q1 − k·IQR … Q3 + k·IQR.\n\n"

	"A reading is flagged when it violates a hard bound, or when the Z-score "
	"and IQR conditions are both met. The default strategy is clip "
	"(clamp to bounds), which preserves the row count."

	)


	_thresh_rows = []

	_col_labels = {

	"Air1_PAR_ref": "PAR (μmol/m²/s)",

	"Air1_leafTemperature_ref": "T_leaf (°C)",

	"Air1_airTemperature_ref": "T_air (°C)",

	"Air1_VPD_ref": "VPD (kPa)",

	"Air1_airHumidity_ref": "Humidity (%)",

	"Air1_CO2_ref": "CO₂ raw (ppm)",

	}

	for _col in _present:

	_t = _thresholds.get(_col, {})

	_vb = _viol_before.get(_col, 0)

	_va = _viol_after.get(_col, 0)

	_thresh_rows.append({

	"Sensor": _col_labels.get(_col, _col),

	"Lower bound": _t.get("lower_bound", "—"),

	"Upper bound": _t.get("upper_bound", "—"),

	"Z-score σ": _t.get("zscore_threshold", "—"),

	"IQR ×": _t.get("iqr_multiplier", "—"),

	"Violations (raw)": _vb,

	"After clip": _va,

	"Rationale": _t.get("rationale", ""),

	})

	_thresh_df = pd.DataFrame(_thresh_rows)

	st.dataframe(_thresh_df, hide_index=True)


	# ── Section 2: Cleaning summary metrics ────────────────────────

	st.markdown("#### Cleaning summary")

	_total_viol = sum(_viol_before.values())

	_total_cleared = sum(1 for v in _viol_after.values() if v == 0)  # count columns with no remaining violations

	_pct_retained = len(_df_clean) / len(_df_raw) * 100

	_cm1, _cm2, _cm3, _cm4 = st.columns(4)

	_cm1.metric("Rows in dataset", f"{len(_df_raw):,}")

	_cm2.metric("Physical violations found", str(_total_viol))

	_cm3.metric("Columns fully cleared", f"{_total_cleared} / {len(_viol_before)}")

	_cm4.metric("Rows retained (clip)", f"{_pct_retained:.1f}%")


	if _HAS_PLOTLY and _total_viol > 0:

	_viol_cols = [_col_labels.get(c, c) for c in _viol_before if _viol_before[c] > 0]

	_viol_vals = [_viol_before[c] for c in _viol_before if _viol_before[c] > 0]

	_fig_viol = px.bar(

	x=_viol_cols, y=_viol_vals,

	labels={"x": "Sensor", "y": "Violation count"},

	title="Physical violations by sensor (before cleaning)",

	color_discrete_sequence=[_BRAND_GREEN],

	)

	_fig_viol.update_layout(height=300)

	st.plotly_chart(_fig_viol)


	# ── Section 3: Before / after distributions ────────────────────

	if _HAS_PLOTLY:

	st.markdown("#### Before vs after cleaning — PAR & VPD")

	with st.expander("What to look for"):

	st.markdown(

	"The raw histogram (red) includes all sensor readings. "

	"The cleaned histogram (green) shows the same column after "

	"the Gemini-generated thresholds are applied. Outlier spikes at the "

	"far right of PAR and VPD should disappear or be clipped to the bound."

	)

	_ba_cols = st.columns(2)

	for _idx, _col in enumerate(["Air1_PAR_ref", "Air1_VPD_ref"]):

	if _col not in _df_raw.columns:

	continue

	_label = _col_labels.get(_col, _col)

	_hi_bound = _thresholds.get(_col, {}).get("upper_bound")

	_raw_s = _df_raw[_col].dropna()

	_clean_s = _df_clean[_col].dropna()

	_fig_ba = go.Figure()

	_fig_ba.add_trace(go.Histogram(

	x=_raw_s, nbinsx=60, name="Raw",

	marker_color="crimson", opacity=0.55,

	))

	_fig_ba.add_trace(go.Histogram(

	x=_clean_s, nbinsx=60, name="Cleaned",

	marker_color=_BRAND_GREEN, opacity=0.7,

	))

	if _hi_bound is not None:

	_fig_ba.add_vline(

	x=_hi_bound, line_dash="dash", line_color="orange",

	annotation_text=f"bound={_hi_bound}",

	)

	_fig_ba.update_layout(

	barmode="overlay", title=f"{_label} — raw vs cleaned",

	xaxis_title=_label, height=320,

	)

	with _ba_cols[_idx]:

	st.plotly_chart(_fig_ba)


	# ── Section 4: Engineered features ─────────────────────────────

	st.markdown("#### Engineered features")

	with st.expander("How are features engineered?"):

	st.markdown(

	"After cleaning, the pipeline asks Gemini to confirm the optimal weights "

	"and normalisation bounds for the Stress Risk Score, given the available "

	"sensor columns and Semillon grapevine stress physiology. "

	"It then computes five new columns:\n\n"

	"| Feature | Formula | Purpose |\n"

	"|---|---|---|\n"

	"| `hour_sin` | sin(2π·h/24) | Cyclical hour-of-day |\n"

	"| `hour_cos` | cos(2π·h/24) | Cyclical hour-of-day |\n"

	"| `doy_sin` | sin(2π·d/365) | Seasonal position |\n"

	"| `doy_cos` | cos(2π·d/365) | Seasonal position |\n"

	"| `stress_risk_score` | w_VPD·norm(VPD) + w_CWSI·norm(CWSI) | Acute stress in [0, 1] |\n\n"

	"Cyclical encodings keep 23:00→midnight as close as midnight→01:00, "

	"which a raw hour integer (23 vs 0) cannot do."

	)


	# Feature spec card

	_fs_vpd_w = _feat_spec.get("vpd_weight", "—")

	_fs_cwsi_w = _feat_spec.get("cwsi_weight", "—")

	_fs_vpd_clip = _feat_spec.get("vpd_clip_max", "—")

	_fs_rat = _feat_spec.get("rationale", "")

	_fc1, _fc2, _fc3 = st.columns(3)

	_fc1.metric("VPD weight", _fs_vpd_w)

	_fc2.metric("CWSI weight", _fs_cwsi_w)

	_fc3.metric("VPD clip max (kPa)", _fs_vpd_clip)

	st.caption(f"Gemini rationale: {_fs_rat}")


	# Feature stats table

	_eng_feat_cols = ["hour_sin", "hour_cos", "doy_sin", "doy_cos", "stress_risk_score"]

	_feat_stats = (

	_df_eng[[c for c in _eng_feat_cols if c in _df_eng.columns]]

	.agg(["min", "mean", "max"])

	.round(4)

	)

	st.dataframe(_feat_stats)


	# ── Section 5: Daytime stress profile ──────────────────────────

	if _HAS_PLOTLY and "stress_risk_score" in _df_eng.columns:

	st.markdown("#### Daytime stress profile")

	with st.expander("How to read this chart"):

	st.markdown(

	"Mean Stress Risk Score per local hour (Israel Daylight Time, UTC+3), "

	"computed over all daytime readings (PAR > 50 μmol m⁻² s⁻¹). "

	"A score of 1.0 means the vine is under maximum atmospheric demand; "

	"0.0 means no stress. The midday–afternoon peak is the primary "

	"window where SolarWine shading interventions are concentrated."

	)

	_profile_df = _stress_profile.dropna().reset_index()

	_profile_df.columns = ["Hour (local)", "Stress Risk Score"]

	_peak_hr = int(_profile_df.loc[_profile_df["Stress Risk Score"].idxmax(), "Hour (local)"])

	_fig_stress = px.bar(

	_profile_df,

	x="Hour (local)", y="Stress Risk Score",

	color="Stress Risk Score",

	color_continuous_scale=["#00BD3E", "#f5c518", "#e63946"],

	range_y=[0, 1],

	title=f"Hourly stress profile — peak at {_peak_hr:02d}:00 local",

	)

	_fig_stress.add_hline(

	y=0.5, line_dash="dot", line_color="orange",

	annotation_text="Intervention threshold (0.5)",

	)

	_fig_stress.update_layout(

	xaxis=dict(tickmode="linear", dtick=1),

	coloraxis_showscale=False,

	height=380,

	)

	st.plotly_chart(_fig_stress)

	st.caption(

	f"Peak stress: {_peak_hr:02d}:00 local "

	f"(score = {_stress_profile.max():.3f}). "

	f"Low-stress morning window (before 10:00): "

	f"mean score = {_stress_profile.loc[6:9].mean():.3f} — shading withheld."

	)


	# ---------------------------------------------------------------------------
	# Tab — Shading Simulator (was Panel vs Open Sky)
	# ---------------------------------------------------------------------------