【发布时间】:2020-04-03 07:09:32
【问题描述】:
如何在CRAFT corpus 上使用 spaCy 为生物医学 NER 构建命名实体识别 (NER) 模型?
我很难将该语料库中给出的xml 文件预处理为spacy 使用的任何格式,我们将不胜感激。
我首先将xml 文件转换为json 格式,但spacy 不接受。 spacy 期望什么格式的训练数据? 我什至尝试构建自己的 NER 模型,但无法预处理 xml 文件,如 article 中给出的那样。
这是一个使用 spacy 训练 NER 模型的示例,包括训练数据的预期格式(来自spacy's docs):
import random
import spacy
TRAIN_DATA = [
("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})]
nlp = spacy.blank("en")
optimizer = nlp.begin_training()
for i in range(20):
random.shuffle(TRAIN_DATA)
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
我正在使用的 XML 文件可在线获取 here。示例记录如下所示:
<passage>
<infon key="section_type">ABSTRACT</infon>
<infon key="type">abstract</infon>
<offset>141</offset>
<text>
Breast cancer is the most frequent tumor in women, and in nearly two-thirds of cases, the tumors express estrogen receptor alpha (ERalpha, encoded by ESR1). Here, we performed whole-exome sequencing of 16 breast cancer tissues classified according to ESR1 expression and 12 samples of whole blood, and detected 310 somatic mutations in cancer tissues with high levels of ESR1 expression. Of the somatic mutations validated by a different deep sequencer, a novel nonsense somatic mutation, c.2830 C>T; p.Gln944*, in transcriptional regulator switch-independent 3 family member A (SIN3A) was detected in breast cancer of a patient. Part of the mutant protein localized in the cytoplasm in contrast to the nuclear localization of ERalpha, and induced a significant increase in ESR1 mRNA. The SIN3A mutation obviously enhanced MCF7 cell proliferation. In tissue sections from the breast cancer patient with the SIN3A c.2830 C>T mutation, cytoplasmic SIN3A localization was detected within the tumor regions where nuclear enlargement was observed. The reduction in SIN3A mRNA correlates with the recurrence of ER-positive breast cancers on Kaplan-Meier plots. These observations reveal that the SIN3A mutation has lost its transcriptional repression function due to its cytoplasmic localization, and that this repression may contribute to the progression of breast cancer.
</text>
<annotation id="38">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="246" length="23"/>
<text>estrogen receptor alpha</text>
</annotation>
<annotation id="39">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="271" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="40">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="291" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="41">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="392" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="42">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="512" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="43">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="720" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="44">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="868" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="45">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="915" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="46">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="930" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="47">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1048" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="48">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1087" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="49">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1201" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="50">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1331" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="51">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="185" length="5"/>
<text>women</text>
</annotation>
<annotation id="52">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="762" length="7"/>
<text>patient</text>
</annotation>
<annotation id="53">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="1031" length="7"/>
<text>patient</text>
</annotation>
<annotation id="54">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="397" length="10"/>
<text>expression</text>
</annotation>
<annotation id="55">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="517" length="10"/>
<text>expression</text>
</annotation>
<annotation id="56">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="1054" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="57">
<infon key="identifier">CVCL:0031</infon>
<infon key="type">CellLine</infon>
<location offset="964" length="4"/>
<text>MCF7</text>
</annotation>
<annotation id="58">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1494" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="59">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="346" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="60">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="743" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="61">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1017" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="62">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="477" length="6"/>
<text>cancer</text>
</annotation>
<annotation id="63">
<infon key="identifier">p.Q944*</infon>
<infon key="type">ProteinMutation</infon>
<location offset="642" length="9"/>
<text>p.Gln944*</text>
</annotation>
<annotation id="64">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="1130" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="65">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="176" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="66">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="630" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="67">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1258" length="14"/>
<text>breast cancers</text>
</annotation>
<annotation id="68">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="231" length="6"/>
<text>tumors</text>
</annotation>
<annotation id="69">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="141" length="13"/>
<text>Breast cancer</text>
</annotation>
</passage>
【问题讨论】:
-
请在最后添加一些你已经尝试过的东西来展示你的努力。此外,更多的描述也将帮助这里的人们了解问题所在。
-
@VPK 我希望现在更清楚了。
-
XML 数据是什么样的? spacy 期望什么数据格式?我敢打赌,如果你把这些东西放在问题中,你会得到答案
-
@SamH。 XML 数据可以在以下位置找到:ncbi.nlm.nih.gov/research/pubtator-api/publications/export/… 另外,我认为 spacy 的 ner 格式更像:[(text) {list of entity}],例如- [("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})] 在 spacy 的网站上给出.. spacy 使用 JSON 格式。因此,我尝试将其转换为 JSON 格式,但 spacy-train 在尝试将其用作 TRAIN DATA 后出现错误。
标签: python nlp bioinformatics spacy named-entity-recognition