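Target site: https://www.ncbi.nlm.nih.gov/sra/?term=SRR3301029
Content to scrape: all of the text in the "Full" view, including the abstract, which only appears after clicking "show Abstract"
Here is the page source (only the main part is shown):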
<div><p class="details expand e-hidden"><b><a href="/sra/SRX1664908[accn]">SRX1664908</a>: GSM2099562: Schip-1 (9721); Arabidopsis thaliana; Bisulfite-Seq</b><br />6 ILLUMINA (Illumina Genome Analyzer IIx) runs: 92M spots, 9.2G bases, 6.1Gb downloads</p><div class="rprt"><p class="title"><a href="" ref="ordinalpos=1&ncbi_uid=2388247&link_uid=2388247"></a></p><p class="rprtbody"><div id="ResultView" uid="2388247"><div class="sra-full-data">Submitted by: <span>NCBI (GEO)</span></div><div class="sra-full-data">Study: <span>Patterns of Population Epigenomic Diversity in Arabidopsis thaliana (Methyl-Seq)<div class="expand-body"><a href="/bioproject/PRJNA187927" title="Link to BioProject">PRJNA187927</a> • <a href="//trace.ncbi.nlm.nih.gov/Traces/sra?study=SRP018263" title="Link to SRA Study">SRP018263</a> • <a href="/sra?term=SRP018263">All experiments</a> • <a href="/Traces/study?acc=SRP018263">All runs</a></div><div class="expand e-hidden expand-body"><a href="#" class="expand-handler"><span class="more">show Abstract</span><span class="less">hide Abstract</span></a><div class="expand-body">Natural epigenetic variation provides a source for the generation of phenotypic diversity, but to understand its contribution to phenotypic diversity, its interaction with genetic variation requires further investigation. 
Overall design: MethylC-seq from naturally-occurring Arabidopsis accessions</div></div></span></div><div class="sra-full-data">Sample: <span>Schip-1 (9721)<div class="expand-body"><a href="/biosample/SAMN04581607" title="Link to BioSample">SAMN04581607</a> • SRS1363562 • <a href="/sra?term=SAMN04581607">All experiments</a> • <a href="/Traces/study?acc=SAMN04581607">All runs</a></div></span><div class="expand-body">Organism: <span><a href="/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=3702">Arabidopsis thaliana</a></span></div></div><div class="expand showed sra-full-data">Library: <div class="expand-body"><div>Instrument: <span>Illumina Genome Analyzer IIx</span></div><div>Strategy: <span>Bisulfite-Seq</span></div><div>Source: <span>GENOMIC</span></div><div>Selection: <span>RANDOM</span></div><div>Layout: <span>SINGLE</span></div><div>Construction protocol: <span>DNA was isolated using a Qiagen Plant DNeasy kit (Qiagen, Valencia, CA) following the manufacturer’s recommendations. Approximately one to three micrograms of genomic DNA was sonicated to ~100 bp using the Covaris S2 System using the following parameters: cycle number = 6, duty cycle = 20%, intensity = 5, cycles/burst = 200 and time = 60 seconds. Sonicated DNA was purified using Qiagen DNeasy minielute columns (Qiagen). Each sequencing library was constructed similar to genomic DNA libraries except the ligation was performed with methylated adapters provided by Illumina. Ligation products were purified with AMPure XP beads (Beckman) at a ratio of 1.8X of beads to sample. 
Up to 450 ng of ligated DNA was bisulfite treated using the MethylCode Kit (Invitrogen, Carlsbad, CA) following the manufacturer’s guidelines and then PCR amplified using Pfu Cx Turbo (Agilent, Santa Clara, CA) using the following PCR conditions (2 minutes at 95C, 4 cycles of 15 seconds at 98C, 30 seconds at 60C, 4 minutes at 72C and 10 minutes at 72C).</span></div></div></div><div class="sra-full-data">Experiment attributes: <div class="expand-body"><div>GEO Accession: <span>GSM2099562 </span></div></div></div><div class="sra-full-data">Links: <div></div></div><div class="sra-full-data">Runs: <span>6 runs, 92M spots, 9.2G bases, <a href="/Traces/study?acc=SRX1664908" title="All runs for this experiment">6.1Gb</a></span></div><table border="0" cellpadding="1" cellspacing="0"><thead><tr class="sra-run-list-header"><th width="20%">Run</th><th width="20%" align="right"># of Spots</th><th width="20%" align="right"># of Bases</th><th width="20%" align="right">Size</th><th width="20%" align="right">Published</th></tr></thead><tbody><tr><td align="left"><a href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301026">SRR3301026</a></td><td align="right">25,373,382</td><td align="right">2.5G</td><td align="right">1.7Gb</td><td>2016-10-06</td></tr><tr><td align="left"><a href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301027">SRR3301027</a></td><td align="right">10,160,043</td><td align="right">1G</td><td align="right">719.1Mb</td><td>2016-10-06</td></tr><tr><td align="left"><a href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301028">SRR3301028</a></td><td align="right">11,110,198</td><td align="right">1.1G</td><td align="right">787.2Mb</td><td>2016-10-06</td></tr><tr><td align="left"><a href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301029"><span class="highlight" style="background-color:">SRR3301029</span></a></td><td align="right">14,257,985</td><td align="right">1.4G</td><td align="right">917.9Mb</td><td>2016-10-06</td></tr><tr><td align="left"><a 
href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301030">SRR3301030</a></td><td align="right">14,436,020</td><td align="right">1.4G</td><td align="right">1,002.5Mb</td><td>2016-10-06</td></tr><tr><td align="left"><a href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301031">SRR3301031</a></td><td align="right">16,695,235</td><td align="right">1.7G</td><td align="right">1Gb</td><td>2016-10-06</td></tr></tbody></table><br /></div></p><div class="aux"><div class="resc"><dl class="rprtid"><dt>ID:</dt> <dd>2388247</dd> </dl></div><p class="links nohighlight"></p></div></div></div>
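Difficulties: too many nested tags / the page contains a table / some of the text carries hyperlinks
I could not get the scraping to work, so I gave up on it. If an expert comes across this, please help me see how it can be solved.
I did think of a workaround: since I can't scrape the page, doing it by hand with some batch helpers at least saves a lot of effort. Still, it was maddening and gave me a headache.
1. First, generate the URLs in bulk. You have to look at your own URLs and spot the pattern yourself.
Here I use my own case as an example: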
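For anyone who wants to pick the scraping back up, here is a minimal sketch, not a tested solution for the live site. It relies on one thing visible in the snippet above: the abstract text is already in the markup, inside the `expand e-hidden` div, so it may not need a click at all. The `TextExtractor` class, the `extract_text` helper, and the shortened `sample` string are my own illustrative names, and whether a plain HTTP request to the live page returns the same markup is an assumption I have not verified. It uses only Python's standard `html.parser`, which walks through nested tags, table cells, and links alike and hands back the text between them.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, one chunk per node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text chunks found in `html`, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Shortened fragment modelled on the markup above (not the full page).
sample = (
    '<div class="sra-full-data">Submitted by: <span>NCBI (GEO)</span></div>'
    '<div class="expand e-hidden expand-body">'
    '<a href="#" class="expand-handler"><span class="more">show Abstract</span></a>'
    '<div class="expand-body">Overall design: MethylC-seq from '
    'naturally-occurring Arabidopsis accessions</div></div>'
    '<table><tr><td><a href="//trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3301029">'
    'SRR3301029</a></td><td align="right">14,257,985</td></tr></table>'
)

chunks = extract_text(sample)
print("\n".join(chunks))
```

The link text and table cells come out as ordinary chunks, so the three difficulties listed above mostly reduce to deciding how to join the chunks back together. Labels such as "show Abstract" are also emitted and would need filtering.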
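kaitou = 'https://www.ncbi.nlm.nih.gov/sra/?term=SRR'  # URL prefix shared by every run
for i in range(3301026, 3301125, 1):
    print(kaitou + str(i))
3301026 is the start and 3301125 is the stop value. Note that range() excludes the stop, so the last URL printed is for SRR3301124.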
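2. Once you have the URLs, open them in bulk with https://www.a-site.cn/tool/kai/, or search Baidu for a browser extension that opens a list of URLs, then copy and paste your own URLs into it.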
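If you would rather not depend on a website or an extension, the same step can be sketched with Python's standard `webbrowser` module. This is only one possible approach: `build_urls` and `open_all` are hypothetical helper names, and the one-second pause is an arbitrary choice to avoid firing off every tab at once.

```python
import time
import webbrowser

kaitou = 'https://www.ncbi.nlm.nih.gov/sra/?term=SRR'  # same prefix as in step 1


def build_urls(start, stop):
    """Return one URL per run number in [start, stop), like range()."""
    return [kaitou + str(i) for i in range(start, stop)]


def open_all(urls, opener=webbrowser.open_new_tab, delay=1.0):
    """Call `opener` on each URL, pausing `delay` seconds between calls."""
    for url in urls:
        opener(url)
        time.sleep(delay)


# Uncomment to open the first five runs in the default browser:
# open_all(build_urls(3301026, 3301031))
```

Opening the runs a few at a time, rather than all hundred in one go, keeps the browser usable.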
3. Create the docx files in bulk (Windows batch script):
@echo off
for /L %%x in (3301026,1,3301126) do @echo %%x>SRR%%x.docx
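For readers who are not on Windows, the batch loop can be approximated in Python. This is a sketch under a few assumptions: `create_placeholders` is a hypothetical helper name, the file-name prefix is a parameter so you can match whatever naming you use, and, as with the batch script, each file is just plain text (the run number) saved with a .docx extension rather than a real Word-format document.

```python
from pathlib import Path


def create_placeholders(start, end, folder=".", prefix="SRR"):
    """Create <prefix><n>.docx for every n in [start, end], like `for /L`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    created = []
    for n in range(start, end + 1):  # +1 because `for /L` includes the end value
        path = folder / f"{prefix}{n}.docx"
        path.write_text(f"{n}\n", encoding="utf-8")  # same content as `echo %%x`
        created.append(path)
    return created
```

Calling `create_placeholders(3301026, 3301126)` would create the same number of files as the batch loop in the current directory.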
4. Open the files in bulk with WPS. Handy shortcuts: Ctrl+S (save) / Ctrl+W (close)