【Question title】: Mapping data from a text file and creating a multidimensional array in MATLAB
【Posted】: 2012-10-23 12:30:29
【Question】:

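I have just written a MATLAB script that reads a text file made up of [Term] entries and builds vectors from it: one with the GO terms, one with their is_a relations, and one with their part_of relations (this is from the bioinformatics domain).

The code is as follows: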

    clear all;
    % This code is for opening and getting information from a text file
    s={}; 
    fid = fopen('gos.txt'); 
    tline = fgetl(fid); 
    while ischar(tline) 
       s=[s;tline]; 
       tline = fgetl(fid); 
    end 
    fclose(fid); % release the file handle once all lines are read
    %To generate the GO_Terms vector from the text file
    tok = regexp(s, '^id: (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    GO_Terms   = cellfun(@(x)x{1}, {tok{idx}})'


    %To generate the is_a relations vector from the text file
    tok = regexp(s, '^is_a: (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    is_a_relations  = cellfun(@(x)x{1}, {tok{idx}})'


    %To generate the part_of relations vector from the text file
    tok = regexp(s, '^relationship: part_of (GO:\w*)', 'tokens'); 
    idx = ~cellfun('isempty', tok); 
    part_of_relations = cellfun(@(x)x{1}, {tok{idx}})'

The result is as follows:

s= 



    '[Term]'    
    'id: GO:0008150'
    'name: biological_process'
    'namespace: biological_process'
    [1x180 char]
    [1x445 char]

    '[Term]'    
    'id: GO:0016740'
    'name: transferase activity'
    'namespace: molecular_function'
    'xref: Reactome:REACT_25050 "Molybdenum ion transfer onto molybdopterin, Homo sapiens"'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'
    '//relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0016787'
    'name: hydrolase activity'
    'namespace: molecular_function'
    'xref: Reactome:REACT_110436 "Hydrolysis of phosphatidylcholine, Bos taurus"'
    [1x92  char]
    'xref: Reactome:REACT_87959 "Hydrolysis of phosphatidylcholine, Gallus gallus"'
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
    'relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0006810'
    'name: transport'
    'namespace: biological_process'
    'alt_id: GO:0015457'
    'alt_id: GO:0015460'
    [1x255 char]
    'subset: goslim_aspergillus'
    'synonym: "transport accessory protein activity" RELATED [GOC:mah]'
    'is_a: GO:0016787 ! biological_process'
    'relationship: part_of GO:0008150 ! biological_process'

    '[Term]'    
    'id: GO:0006412'
    'name: translation'
    'namespace: biological_process'
    'alt_id: GO:0006416'
    [1x522 char]
    'subset: gosubset_prok'
    'synonym: "protamine kinase activity" NARROW []'
    'is_a: GO:0016740 ! transferase activity'
    '//relationship: part_of GO:0006464 ! cellular protein modification process'

    '[Term]'    
    'id: GO:0016779'
    'name: nucleotidyltransferase activity'
    'namespace: molecular_function'
    'is_a: GO:0016740 ! transferase activity'

    '[Term]'    
    'id: GO:0004386'
    'helicases, Xenopus tropicalis"'
    [1x100 char]
    'is_a: GO:0016787 ! hydrolase activity'

    '[Term]'    
    'id: GO:0003774'
    'name: motor activity'
    'namespace: molecular_function'
    [1x178 char]
    'is_a: GO:0016787 ! hydrolase activity'
    [1x110 char]

    '[Term]'    
    'id: GO:0016298'
    'name: lipase activity'
    'namespace: molecular_function'
    'holesterol ester + H2O -> cholesterol + fatty acid, Caenorhabditis elegans"'
    'is_a: GO:0016787 ! hydrolase activity'

    '[Term]'    
    'id: GO:0016192'
    'name: vesicle-mediated transport'
    'namespace: biological_process'
    'alt_id: GO:0006899'
    [1x429 char]
    'subset: goslim_aspergillus'
    'synonym: "vesicular transport" EXACT [GOC:mah]'
    'is_a: GO:0006810 ! transport'

    '[Term]'    
    'id: GO:0005215'
    'name: transporter activity'
    'namespace: molecular_function'
    [1x92  char]
    '//is_a: GO:0003674 ! molecular_function'
    'is_a: GO:0006412 ! molecular_function (to be removed later)'
    'relationship: part_of GO:0006810 ! transport'

    '[Term]'    
    'id: GO:0030533'
    'name: triplet codon-amino acid adaptor activity'
    'namespace: molecular_function'
    'is_a: GO:0004672 ! RNA binding (added by Zaid, to be removed later)'
    'relationship: part_of GO:0005215 ! translation'





GO_Terms = 

    'GO:0008150'
    'GO:0016740'
    'GO:0016787'
    'GO:0006810'
    'GO:0006412'
    'GO:0004672'
    'GO:0016779'
    'GO:0004386'
    'GO:0003774'
    'GO:0016298'
    'GO:0016192'
    'GO:0005215'
    'GO:0030533'


is_a_relations = 

    'GO:0008150'
    'GO:0016740'
    'GO:0016787'
    'GO:0008150'
    'GO:0016740'
    'GO:0016740'
    'GO:0016787'
    'GO:0016787'
    'GO:0016787'
    'GO:0006810'
    'GO:0006412'
    'GO:0004672'


part_of_relations = 

    'GO:0008150'
    'GO:0008150'
    'GO:0006810'
    'GO:0016192'
    'GO:0006810'
    'GO:0005215'

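I want to collect this data in a multidimensional array whose first column is 'GO_Term', whose second column is 'is_a_relations', and whose third column is 'part_of_relations'.

The problem is that not every [Term] in the text file has values for the second and third columns (the 'is_a' and 'part_of' relations). So how can I map each GO_Term to the is_a and part_of relations (if any) found in its own [Term] paragraph of the text file?

【Comments】:

    Tags: matlab text create-table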


    【Solution 1】:

    In this case you have to go through the file one [Term] at a time and build the map along the way:

    % find start and end positions of every [Term] marker in s 
    terms = [find(~cellfun('isempty', regexp(s, '\[Term\]'))); numel(s)+1];
    
    % for every [Term] section, run the previously implemented regexps
    % and save the results into a map - a cell array with 3 columns
    map = cell(0,3);
    for term=1:numel(terms)-1
        % extract single [Term]  data
        s_term = s(terms(term):terms(term+1)-1);
    
        % match regexps
        %To generate the GO_Terms vector from the text file
        tok = regexp(s_term, '^id: (GO:\w*)', 'tokens');
        idx = ~cellfun('isempty', tok); 
        GO_Terms   = cellfun(@(x)x{1}, {tok{idx}})';
    
        %To generate the is_a relations vector from the text file
        tok = regexp(s_term, '^is_a: (GO:\w*)', 'tokens'); 
        idx = ~cellfun('isempty', tok); 
        is_a_relations  = cellfun(@(x)x{1}, {tok{idx}})';
    
        %To generate the part_of relations vector from the text file
        tok = regexp(s_term, '^relationship: part_of (GO:\w*)', 'tokens'); 
        idx = ~cellfun('isempty', tok); 
        part_of_relations = cellfun(@(x)x{1}, {tok{idx}})';
    
        % map. note the end+1 - here we create a new map row. Only once!
        map{end+1,1} = GO_Terms;
        map{end,  2} = is_a_relations;
        map{end,  3} = part_of_relations;
    end
    

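    map is now a cell array with 3 columns. Some entries are empty, which means that the particular [Term] entry has no corresponding value.

    【Discussion】:

    • Sounds good, but why, when I display the map array, does it show only the cell dimensions {1x1 cell} rather than the GO term (GO:******)...
    • @Gloria Check the contents of map{1,1} and map{2,2} ;) It is a cell array. You should read the documentation before you start programming.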
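As a rough illustration of how the empty entries can be used, the sketch below finds the terms that lack a relation and counts the relations per term. The three-row `map` is a hypothetical stand-in written by hand, not output from the real file, and it assumes that each entry is either a cell array of strings or an empty array.

```matlab
% Hypothetical stand-in for the map built by the loop above (assumption):
% each entry is a cell array of strings, or [] when the [Term] has none.
map = { {'GO:0008150'}, [],             [];
        {'GO:0016787'}, {'GO:0016740'}, {'GO:0008150'};
        {'GO:0016779'}, {'GO:0016740'}, [] };

% Logical masks: true where a [Term] has no relation of that kind
no_is_a    = cellfun(@isempty, map(:,2));
no_part_of = cellfun(@isempty, map(:,3));

% Number of is_a relations recorded for each [Term]
n_is_a = cellfun(@numel, map(:,2));

% Rows of the map that have at least one part_of relation
with_part_of = map(~no_part_of, :);
```

Because the masks are ordinary logical vectors, they can be combined (for example `no_is_a & no_part_of`) to pick out root terms that have neither kind of relation.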
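The {1x1 cell} display mentioned in the comments comes from the nesting: every entry of map is itself a cell array. A minimal sketch of how the contents might be unwrapped is shown below. The two-row `map` and the variable name `flat` are hypothetical, and `strjoin` is assumed to be available (it was introduced in MATLAB R2013a).

```matlab
% Hypothetical stand-in for two rows of the map (assumption, not real output)
map = { {'GO:0016787'}, {'GO:0016740'}, {'GO:0008150'};
        {'GO:0016779'}, {'GO:0016740'}, [] };

% One level of brace indexing returns the inner cell array,
% a second level returns the string stored inside it.
inner = map{1,1};      % 1x1 cell
id    = map{1,1}{1};   % character array with the GO id

% Flatten every entry to a plain string so that displaying the
% result shows the GO ids instead of {1x1 cell}.
flat = cell(size(map));
for k = 1:numel(map)
    if isempty(map{k})
        flat{k} = '';                            % no relation recorded
    else
        flat{k} = strjoin(map{k}(:)', ', ');     % join multiple ids
    end
end
disp(flat)
```

Joining with a comma keeps terms that have several is_a or part_of relations on a single row, at the cost of having to split the string again if the individual ids are needed later.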