【Question title】:Web scraping publications from a Google Scholar profile via simplehtmldom PHP
【Posted】:2018-12-17 09:41:20
【Question】:

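I'm trying to scrape the publications from a Google Scholar profile, but I don't know how to get every publication from the profile. I know from this question that a profile page can show at most 100 publications per page: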

Google Scholar profile scrape PHP

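I just want to know how to apply that URL in my PHP code so that I can fetch every publication from the profile and insert them into an array.

With the following code I can put every publication from a single page into an array: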

<?php 
set_time_limit(0);  
include 'simple_html_dom.php';

$data = json_decode(file_get_contents('php://input'),true);
$scholarID =  $data["gScholarID"];
$kodeDosen = $data["kodeDosen"];
$page = 1;
$offset = ($page - 1)* 100;
$cStart = 0+$offset;
$profile = 'https://scholar.google.com/citations?user='.$scholarID.'&hl=en&cstart='.$cStart.'&view_op=list_works&pagesize=100';
$html = file_get_html($profile);
$table = $html->find('#gsc_a_t',0);
$rowData = array();

foreach($table->find('tr.gsc_a_tr') as $row){
    $paperjudul  = $row->find('td.gsc_a_t a', 0)->plaintext;
    $paper['kodeDosen'] = $kodeDosen;
    $paper['judul'] = $paperjudul;
    $cited   = $row->find('td.gsc_a_c', 0)->plaintext;
    if($cited === ''){
        $cited = 0;
    }
    $cited = preg_replace('/[\*]+/', '', $cited);
    $paper['citedBy'] = $cited;
    $paper['namaJurnal']    = $row->find('td.gsc_a_t .gs_gray', 1)->plaintext;
    if($paper['namaJurnal'] === ''){
        $paper['namaJurnal'] = 'n/a';
    }
    $paper['periode']   = $row->find('td.gsc_a_y', 0)->plaintext;
    if($paper['periode'] === ' '){
        $paper['periode'] = 'n/a';
    }
    $paper['status'] = 'Published';
    $rowData[] = $paper;
}

print_r($rowData);


?>

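I just want to know how to apply this code to multiple pages so that I can get all the publications from a Google Scholar profile.

【Comments】:

  • You haven't tried anything, have you?
  • I edited the question to add the PHP code I'm using.
  • Sounds like you are stealing data from Google Scholar. What are you trying to achieve?
  • @Raptor I'm trying to scrape a Scholar profile page so that I can get information about the papers in the profile and use it in my university assignment application. The application only prints information about the papers. The Scholar ID is supplied by users who have a Google Scholar profile, so that their papers can be inserted into the database that holds the paper information they have uploaded and that the application displays.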

Tags: php simple-html-dom google-scholar


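【Solution 1】:

I found an approach that works. First, I loop over the pages, checking each one for the marker that indicates the page has no publications to show, and collect the URLs of the pages that do contain publications. Then I loop over those URLs and scrape the publications from each one. Here is the code I used: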

<?php 
set_time_limit(0);  
include 'simple_html_dom.php';
include 'connectdb.php';

$scholarID =  $_GET["gScholarID"];
$kodeDosen = $_GET["kodeDosen"];

$page = 1;
$finalPage = false;
$urlArray = array(); // URLs of the pages that actually list publications

$sqlTest = 'INSERT INTO tbl_publikasi(kodeDosen,jenis,namaJurnal,judul,status,tipePublikasi,periode,tahun,citedCount) VALUES ';
$response = array();


while (!$finalPage) {
    $offset = ($page - 1)* 100;
    $cStart = 0+$offset;
    $profile = 'https://scholar.google.com/citations?user='.$scholarID.'&hl=en&cstart='.$cStart.'&view_op=list_works&pagesize=100';
    $html = file_get_html($profile);
    if(is_object($html)){
        $empty = $html->find('td.gsc_a_e',0);
        if($empty){
            $finalPage = true;
            unset($html);
        }
        else{
            $urlArray[] = $profile;
            $page++;
        }
    }
    else{
        // The fetch failed; stop here, otherwise the loop would never end.
        $response['success'] = 0;
        $response['message'] = "URL tidak valid ";
        break;
    }

}

if($finalPage){
    foreach ($urlArray as $urlPublikasi) {
        $html = file_get_html($urlPublikasi);
        $table = $html->find('#gsc_a_t',0);
        $rowData = array();
        if($table){
            foreach($table->find('tr.gsc_a_tr') as $row){
                $paper['kodeDosen'] = $kodeDosen;
                $paperjudul  = $row->find('td.gsc_a_t a', 0)->plaintext;
                $paper['judul'] = $paperjudul;
                $cited   = $row->find('td.gsc_a_c', 0)->plaintext;
                if($cited === ''){
                    $cited = 0;
                }
                $cited = preg_replace('/[\*]+/', '', $cited);
                $paper['citedBy'] = trim($cited);
                $paper['jenis'] = 'Scholar';
                $paper['namaJurnal']    = $row->find('td.gsc_a_t .gs_gray', 1)->plaintext;
                if($paper['namaJurnal'] === ''){
                    $paper['namaJurnal'] = 'n/a';
                }
                $paper['periode'] = 'n/a';
                $paper['tahun']   = $row->find('td.gsc_a_y', 0)->plaintext;
                if($paper['tahun'] === ' '){
                    $paper['tahun'] = '0000';
                }
                $paper['tipePublikasi'] = 'Scholar'; 
                $paper['status'] = 'Published';
                $rowData[] = $paper;
            }

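            foreach ($rowData as $paperValue) {
                // Escape every value so quotes in scraped titles or in the
                // request parameters cannot break (or inject into) the SQL.
                $kode = mysqli_real_escape_string($conn, $kodeDosen);
                $judul = mysqli_real_escape_string($conn, $paperValue['judul']);
                $jenis = mysqli_real_escape_string($conn, $paperValue['jenis']);
                $citedCount = mysqli_real_escape_string($conn, $paperValue['citedBy']);
                $namaJurnal = mysqli_real_escape_string($conn, $paperValue['namaJurnal']);
                $periode = mysqli_real_escape_string($conn, $paperValue['periode']);
                $tahun = mysqli_real_escape_string($conn, trim($paperValue['tahun']));
                $status = mysqli_real_escape_string($conn, $paperValue['status']);
                $tipePublikasi = mysqli_real_escape_string($conn, $paperValue['tipePublikasi']);
                $sqlTest .= "('".$kode."','".$jenis."','".$namaJurnal."','".$judul."','".$status."','".$tipePublikasi."','".$periode."','".$tahun."','".$citedCount."'),";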

            }
            $query = rtrim($sqlTest, ',');
            $query .= "ON DUPLICATE KEY UPDATE idPublikasi=LAST_INSERT_ID(idPublikasi), kodeDosen = VALUES(kodeDosen), jenis = VALUES(jenis), 
            namaJurnal=VALUES(namaJurnal),status=VALUES(status),
            tipePublikasi = VALUES(tipePublikasi),periode=VALUES(periode),tahun = VALUES(tahun),citedCount = VALUES(citedCount)";
        }
        else{
            $response['success'] = 0;
            $response['message'] = "Tabel Publikasi tidak ditemukan ";
        }

    }



     if (mysqli_query($conn, $query)) {
        $response['success'] = 1;
        $response['message'] = "Array Uploaded Successfully";
     } 
     else {
        $response['success'] = 0;
        $response['message'] = "Array Upload Failed, Alasan : ".mysqli_error($conn);
     }


}
else{
    $response['success'] = 0;
    $response['message'] = "Gagal ditemukan ";
}

echo json_encode($response);




?>
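
The code above downloads every page twice: once to find the last page and once more to scrape it. As a rough sketch only, the paging logic could be separated from the fetching so that each page is requested a single time. The function names scholarPageUrl and collectAllRows, the $fetchRows callback, and the $maxPages cap are my own assumptions for illustration and are not part of simple_html_dom or of the original answer.

```php
<?php
// Hypothetical helper: builds the list_works URL for a 1-based page number,
// using the same query parameters as the code above.
function scholarPageUrl(string $scholarID, int $page, int $pageSize = 100): string
{
    $query = http_build_query([
        'user'     => $scholarID,
        'hl'       => 'en',
        'cstart'   => ($page - 1) * $pageSize,
        'view_op'  => 'list_works',
        'pagesize' => $pageSize,
    ], '', '&');
    return 'https://scholar.google.com/citations?' . $query;
}

// Hypothetical helper: requests page after page until one yields no rows.
// $fetchRows is any callable that takes a URL and returns an array of rows.
// $maxPages is an assumed safety cap so a misbehaving fetch cannot loop forever.
function collectAllRows(string $scholarID, callable $fetchRows, int $maxPages = 50): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $rows = $fetchRows(scholarPageUrl($scholarID, $page));
        if (count($rows) === 0) {
            break; // empty page: nothing more to collect
        }
        $all = array_merge($all, $rows);
    }
    return $all;
}
```

In practice $fetchRows would presumably wrap file_get_html and the tr.gsc_a_tr row loop from the answer, returning an empty array when the fetch fails or when the td.gsc_a_e "no publications" cell is present.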
