Today we’ll dive into the foundation of AI: data. Without data, there would be no AI. Just as we can’t make a diagnosis without a patient’s history, observations, and test results, we can’t train an AI model without data. The analogy between the two worlds doesn’t stop there. As clinicians, our diagnostic accuracy depends on a number of factors: a detailed, structured history, a complete set of test results, and trends in the patient’s observations. Similarly, the better the quality, structure and relevance of the data provided to an AI model, the safer and more powerful the model becomes.
So let’s get started!
What is Clinical Data?
Clinical Data refers to any information collected about a patient during the process of delivering healthcare. As a clinician, you read it, write it, and act on it every day.
This includes:
| Type | Examples |
| --- | --- |
| Patient Demographics | Name, Age, Sex, Ethnicity |
| Vitals | Blood pressure, heart rate, respiratory rate |
| Lab Results | Full blood counts, glucose levels, CRP and so on |
| Diagnoses | ICD-10 codes |
| Medications | Prescription history, dosages |
| Procedures | Imaging and Surgeries |
| Notes | Progress Notes, Discharge Summaries |
| Images | X-rays, CT-scans, MRI scans, pathology slides |
| Monitoring Data | ECG waveforms, oxygen saturation trends |
As a clinician, during your early years you learned patterns and trends from this data and developed your own internal algorithms. AI does something quite similar, however:
- If the data is incomplete, AI will guess
- If the data is inaccurate, AI will mislead
- If the data is biased, AI will reflect the bias
Providing data to AI is therefore a responsibility.
What are the types of data generated in a healthcare setting?
A healthcare setting generally generates three core types of data:
- Structured Data:
  - Data is organised in clear formats (rows and columns).
  - Examples include age, blood pressure, glucose levels and diagnosis codes.
  - The main benefit is that it is easy to analyse and to train models on.
- Unstructured Data:
  - This includes free text and images.
  - Examples include clinical notes, radiology images, and lab reports and results stored as free text.
  - It requires Natural Language Processing or Computer Vision to become AI-ready.
- Semi-Structured Data:
  - This is data that has some structure but is not strictly organised into rows and columns like a database. It sits between structured and unstructured data.
  - To be used in an AI model, it needs a degree of processing (parsing: sorting and extracting the relevant fields).
  - Examples include JSON files, XML-formatted EHR exports and monitor logs.
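To make the idea of parsing semi-structured data more concrete, here is a minimal sketch in Python. The export format, the field names (`patient_id`, `observations`, `heart_rate`, `spo2`) and the function `flatten_observations` are all made up for illustration; a real EHR or monitor export will look different.

```python
import json

# Hypothetical monitor export: nested and uneven, so not yet a table.
raw_export = """
{
  "patient_id": "P001",
  "observations": [
    {"time": "08:00", "heart_rate": 82, "spo2": 97},
    {"time": "12:00", "heart_rate": 90}
  ]
}
"""

def flatten_observations(export_text):
    """Turn one nested export into flat rows, one row per observation."""
    record = json.loads(export_text)
    rows = []
    for obs in record.get("observations", []):
        rows.append({
            "patient_id": record["patient_id"],
            "time": obs.get("time"),
            "heart_rate": obs.get("heart_rate"),
            # A field that was never recorded becomes None instead of vanishing.
            "spo2": obs.get("spo2"),
        })
    return rows

rows = flatten_observations(raw_export)
for row in rows:
    print(row)
```

Each printed row now has the same columns, which is the "rows and columns" shape that structured data needs. Note that the second observation has no oxygen saturation, and the sketch keeps that gap visible as `None` rather than hiding it.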

Why Do the Structure and Quality of Clinical Data Matter in AI?
Think of data like ingredients in a recipe. Even the best chef can’t cook a good meal with spoiled or mislabeled ingredients. So, for you to have a safe AI tool you need your data to be clean, structured and reliable.
This means Structured Data is “AI ready data”. AI models use structured data to read and learn patterns. This is where the Quality of Data becomes important.
If the data contains:
- Errors (BP recorded as 1200 by mistake)
- Bias (only males are included)
- Missingness (no oxygen saturations for half the patients)
- Confusion (does SOB stand for shortness of breath, or something else?)
then predictions from AI models could cause:
- Inaccurate alerts
- Wrong diagnoses
- Missed risks
- Harm to patients
Therefore, it is important for us as clinicians to ensure that the data we collect is standardised and of high quality.
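The first three problems in the list above (errors, bias and missingness) can be checked for with very little code. The sketch below is illustrative only: the patient rows are invented, the field names are assumptions, and the plausibility window for systolic blood pressure is an arbitrary choice for the example, not a clinical standard.

```python
# Invented patient rows; None marks a value that was never recorded.
patients = [
    {"id": "P001", "sex": "M", "systolic_bp": 120, "spo2": 97},
    {"id": "P002", "sex": "M", "systolic_bp": 1200, "spo2": None},
    {"id": "P003", "sex": "M", "systolic_bp": 135, "spo2": None},
    {"id": "P004", "sex": "M", "systolic_bp": 110, "spo2": 95},
]

# Assumed plausibility window in mmHg, chosen only for this example.
SYSTOLIC_RANGE = (50, 300)

def implausible_bp(rows):
    """Errors: ids whose systolic BP falls outside the plausibility window."""
    low, high = SYSTOLIC_RANGE
    return [
        r["id"] for r in rows
        if r["systolic_bp"] is not None and not low <= r["systolic_bp"] <= high
    ]

def missing_fraction(rows, field):
    """Missingness: fraction of rows where the field was not recorded."""
    return sum(r[field] is None for r in rows) / len(rows)

def group_counts(rows, field):
    """Bias: how many rows fall into each group, e.g. each sex."""
    counts = {}
    for r in rows:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts

print(implausible_bp(patients))
print(missing_fraction(patients, "spo2"))
print(group_counts(patients, "sex"))
```

On this toy data the checks flag patient P002 (the BP of 1200), report that half of the oxygen saturations are missing, and show that every patient is male. Checks like these do not fix the data, but they tell you where to look before a model is trained on it.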
Hold on – What about Unstructured Data?
Unstructured data, as I mentioned earlier, is not organised in a predefined format like rows and columns. This includes narrative text, visual media and audio notes. Working with such data is hard for computers and requires special techniques, such as Natural Language Processing (for text), Computer Vision (for images), speech-to-text (for audio) and frame-by-frame image analysis (for video).
Unstructured data is often rich in meaning – but is:
- Inconsistent – every clinician writes differently
- Time consuming to clean
- Hidden inside silos (like scanned PDFs and audio files)
Thus, whilst unstructured data contains hidden gold if we can extract it properly, structured data is preferred for training AI models.
That’s it for today. Stay tuned for the next blog as we dive deeper into the world of Clinical AI, understanding it in byte-sized posts.
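To give a flavour of what "extracting the gold" can mean, here is a deliberately simple sketch that pulls a blood pressure reading out of free-text notes using a pattern match. This is a toy stand-in, not real Natural Language Processing: the notes are invented, and the pattern `BP_PATTERN` and the function `extract_bp` are assumptions for this example that only handle the spellings shown.

```python
import re

# Matches spellings such as "BP 128/82" or "bp: 150/95" (illustrative only).
BP_PATTERN = re.compile(r"\bBP[:\s]*(\d{2,3})\s*/\s*(\d{2,3})\b", re.IGNORECASE)

def extract_bp(note):
    """Return the first BP reading found in a note, or None if there is none."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    return {
        "systolic_bp": int(match.group(1)),
        "diastolic_bp": int(match.group(2)),
    }

# Invented notes, each written in a slightly different style.
notes = [
    "Pt reviewed on ward. BP 128/82, HR 76. Comfortable at rest.",
    "Complains of SOB on exertion. bp: 150/95 on arrival.",
    "Blood pressure slightly raised today, will recheck.",
]

results = [extract_bp(n) for n in notes]
for note, result in zip(notes, results):
    print(result, "<-", note)
```

The first two notes yield structured values, but the third returns nothing because the clinician described the reading in words instead of numbers. That is the inconsistency problem in miniature, and it is why unstructured data takes so much more effort to make AI-ready.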

