1. Preface
Electronic medical records are in use at many urban hospitals, yet the data they contain is rarely fed into machine learning to support intelligent diagnosis.
This article walks through a hands-on case study of doing exactly that.
2. Feasibility Analysis
1. Functionality:
The user enters a description of their physical symptoms, and the system returns the most likely disease category together with its probability (%).
2. Analysis:
① The symptom description entered by the user is a Chinese string.
② Feature processing comes first: the text has to be segmented into words (English text can skip this step); see the segmentation sketch right after this list.
③ Each record is tagged with a label and fed into Spark ML for training.
④ The trained model then makes the prediction.
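As an illustration of step ②, here is a minimal sketch of how jieba turns a raw symptom string into the space-separated token string that Spark's Tokenizer can work with. The input sentence is reconstructed from the first row of the training table shown later:

import jieba

raw = "月经周期不固定"  # raw symptom text, as a user would type it
tokens = jieba.cut(raw, cut_all=True)  # full-mode segmentation, same setting as the training code below
print(' '.join(tokens))  # e.g. "月经 月经周期 周期 不 固定", matching row 1 of the training table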
3. Code
1. Development environment
Train the model and save it:
# -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
# pip freeze > requirements.txt
# pip install -r requirements.txt
import jieba
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, Tokenizer
# Read the raw data
df = pd.read_excel("C:/Users/linhongcun/Desktop/t_sickness.xlsx")
# Chinese word segmentation with jieba
for indexs in df.index:
    # Column 2 holds the raw symptom description
    string = df.loc[indexs].values[2]
    # print(string)
    # Full-mode segmentation, then join the tokens with spaces
    con = jieba.cut(string, cut_all=True)
    content = list(con)
    c = ' '.join(content)
    # print(c)
    # Write the segmented text back into the symptom column
    df.iloc[indexs, 2] = c
# 0. Build the SparkSession
spark = SparkSession.builder.master("local").appName("sickness").getOrCreate()
# 1. Training samples
training = spark.createDataFrame(df)
training.show(truncate=False)
""" 必须要有字段为 label 作为预测点——1妇科疾病、2神经系统疾病、3循环系统疾病、4呼吸系统疾病、5消化系统疾病
+---+--------+--------------------------+-----+
|id |name |symptom |label|
+---+--------+--------------------------+-----+
|1 |月经失调 |月经 月经周期 周期 不 固定 |1 |
|2 |痛经 |月经 来潮 前后 腹部 疼痛 |1 |
|3 |盆腔炎 |发热 下腹 下腹部 腹部 疼痛 |1 |
|4 |膀胱炎 |尿急 瘙痒 灼热 |1 |
|5 |附件炎 |月经 量 增多 痛经 严重 |1 |
|6 |阴道炎 |阴道 灼热 痛痒 白带 腥臭 |1 |
|7 |乳腺炎 |乳房 红肿 热 痛 有 硬块 |1 |
|8 |宫颈炎 |以 白带 增多 为主 主要 主要症状 症状 |1 |
|9 |经前期紧张综合征|情绪 不稳 稳定 易怒 |1 |
|10 |更年期综合征 |月经 紊乱 潮热 盗汗 多疑 易怒 |1 |
|11 |乳腺增生 |乳房 胀痛 并 有 肿块 出现 |1 |
|12 |葡萄胎 |闭经 腹痛 |1 |
|13 |子宫肌瘤 |下腹 下腹部 腹部 出现 梨 大小 的 肿块 |1 |
|14 |宫颈癌 |月经 之外 的 出血 |1 |
|15 |卵巢肿瘤 |腹痛 下腹 出现 肿块 |1 |
|16 |乳腺癌 |最 可怕 的 肿块 往往 没有 痛感 |1 |
|17 |淋病 |尿 痛 尿急 尿道 尿道口 道口 口红 红肿|1 |
|18 |头痛 |头部 出现 反复 反复无常 无常 的 疼痛 |2 |
|19 |眩晕 |感觉 周围 物体 旋转 站立 不稳 |2 |
|20 |晕动病 |乘车 时 头晕 恶心 |2 |
+---+--------+--------------------------+-----+
only showing top 20 rows
"""
# 2. Pipeline stages: tokenizer, hashingTF, lr
tokenizer = Tokenizer(inputCol="symptom", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
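# Note: HashingTF defaults to 2**18 = 262144 hash buckets, which is why the sparse
# feature vectors in the prediction output below have size 262144. For a data set this
# small a narrower space would also work, e.g. (illustrative only, not used in this run):
# HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=1024)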
# 3. Fit the model on the training samples
model = pipeline.fit(training)
# 4. Test data
test = spark.createDataFrame([
    (0, "兴奋"),  # agitation / excitement
    (1, "拒食"),  # refusal to eat
    (2, "胀痛"),  # distending pain
    (3, "咳嗽")   # cough
], ["id", "symptom"])
test.show(truncate=False)
"""
+---+-------+
|id |symptom|
+---+-------+
|0 |兴奋 |
|1 |拒食 |
|2 |胀痛 |
|3 |咳嗽 |
+---+-------+
"""
# 5. Model prediction
prediction = model.transform(test)
prediction.show(truncate=False)
""" 正确率100%——1妇科疾病、2神经系统疾病、3循环系统疾病、4呼吸系统疾病、5消化系统疾病
+---+-------+-----+-----------------------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+----------+
|id |symptom|words|features |rawPrediction |probability |prediction|
+---+-------+-----+-----------------------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+----------+
|0 |兴奋 |[兴奋] |(262144,[85159],[1.0]) |[-2.477496467830346,0.23894253214354277,2.502729581080473,-0.026096802888310178,-0.2103237120516836,-0.027755130453694155]|[0.005142032398933459,0.07778023963377388,0.7482030950579408,0.05967111542618967,0.04963127430859228,0.05957224317457003] |2.0 |
|1 |拒食 |[拒食] |(262144,[70639],[1.0]) |[-2.479710267998232,0.22296399538676587,0.40711232995063407,0.12542881442781045,-0.22694517875278053,1.9511503069858014] |[0.007096899804584044,0.10588274523648426,0.12729161551306126,0.09604310718632243,0.06751995141297448,0.5961656808465735] |5.0 |
|2 |胀痛 |[胀痛] |(262144,[204799],[1.0])|[-2.4936994604638585,0.815143389891353,0.29659283389333035,-0.024482861902893505,-0.2759726416266452,1.6824187402087265] |[0.00764809839775098,0.2092019162066145,0.12455524293424604,0.09034843000203284,0.07025868007154462,0.4979876323878109] |5.0 |
|3 |咳嗽 |[咳嗽] |(262144,[222472],[1.0])|[-2.484678277185686,0.17511466926473163,0.36108768539188024,-0.20941504781420683,2.233242879410125,-0.0753519090668317] |[0.0060495948111361445,0.08646885081373447,0.10414219855157512,0.05886546146075085,0.6771632986309767,0.06731059573182673]|4.0 |
+---+-------+-----+-----------------------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+----------+
"""
# 6. Save the pipeline definition and the fitted model (raw strings avoid backslash escapes on Windows)
pipeline.write().overwrite().save(r'C:\LLLLLLLLLLLLLLLLLLL\BigData_AI\pyspark\pipeline')
model.write().overwrite().save(r'C:\LLLLLLLLLLLLLLLLLLL\BigData_AI\pyspark\model')
2. Production environment
Load the saved pipeline and model directly:
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
# 0. Build the SparkSession
spark = SparkSession.builder.master("local").appName("medical").getOrCreate()
# 2. Load the saved pipeline definition (instead of rebuilding it)
loadedPipeline = Pipeline.load(r'C:\LLLLLLLLLLLLLLLLLLL\BigData_AI\pyspark\pipeline')
# 3. Load the fitted model (no retraining needed)
loadedPipelineModel = PipelineModel.load(r'C:\LLLLLLLLLLLLLLLLLLL\BigData_AI\pyspark\model')
# 4. Test data
test = spark.createDataFrame([
    (0, "兴奋")  # agitation / excitement
], ["id", "symptom"])
# 5. Model prediction: 1 gynecological, 2 nervous system, 3 circulatory, 4 respiratory, 5 digestive diseases
prediction = loadedPipelineModel.transform(test)
prediction.select("prediction").show(truncate=False)
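The functionality described in section 2 also promises a probability percentage. Below is a minimal sketch of how it could be pulled out of the prediction DataFrame; the label-to-name mapping simply mirrors the legend above and is illustrative:

# Inspect the predicted label together with the full probability vector
prediction.select("id", "prediction", "probability").show(truncate=False)

# Turn the prediction into a category name plus a percentage
label_names = {1: "gynecological", 2: "nervous system", 3: "circulatory",
               4: "respiratory", 5: "digestive"}
row = prediction.select("prediction", "probability").first()
label = int(row["prediction"])
print("%s (%.1f%%)" % (label_names[label], 100 * row["probability"][label]))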
4. Training Data
① Diseases and their symptom features
② Disease categories
Both are available on GitHub: https://github.com/larger5/SparkML_TrainingData.git
Readers can first import the data into a MySQL database and then export it as an .xls (or similar) file for training.
A requirements.txt is included as well; pay attention to the Spark version it pins, as other versions will raise errors.
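If you only want to smoke-test the pipeline without downloading the data set, a few rows in the same schema (id, name, symptom, label) can be built directly in pandas. The raw symptom strings below are reconstructed from the segmented rows shown earlier and serve only as stand-ins for t_sickness.xlsx:

import pandas as pd

# Minimal stand-in for the Excel file, using rows from the training table above
df = pd.DataFrame({
    "id": [1, 2, 18],
    "name": ["月经失调", "痛经", "头痛"],
    "symptom": ["月经周期不固定", "月经来潮前后腹部疼痛", "头部出现反复无常的疼痛"],
    "label": [1, 1, 2],
})
# From here on, the development-environment code in section 3 applies unchanged:
# segment the symptom column with jieba, then call spark.createDataFrame(df).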