A question about collecting hard-to-reach URLs when writing a crawler in Node.js.

This is my first time writing a crawler in Node.js. I want the crawler to collect every link on a page.

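After reading some posts on the forum, I tried two approaches: (1) using the request and cheerio modules to parse the DOM tree and pull out the URLs; (2) using regular expressions to match URLs.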
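For reference, the second approach (regex matching) can be sketched roughly as follows. This is only a minimal illustration using Node's built-in `URL` class; the function name `extractLinks` and the choice of attributes (`href`, `src`, `action`) are assumptions made for this example, not something from the original post. Like any static approach, it only sees URLs that are already present in the HTML text.

```javascript
// Hypothetical helper (not from the thread): collect href/src/action values
// from raw HTML with a regular expression and resolve them against baseUrl.
function extractLinks(html, baseUrl) {
  // Matches attr="value", attr='value' or attr=value (unquoted).
  var pattern = /\b(?:href|src|action)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  var seen = new Set();
  var match;
  while ((match = pattern.exec(html)) !== null) {
    var raw = match[1] !== undefined ? match[1]
            : match[2] !== undefined ? match[2]
            : match[3];
    raw = raw.trim();
    // Skip empty values, pure fragments and non-navigable schemes.
    if (!raw || raw.charAt(0) === '#' || /^(?:javascript|mailto|data):/i.test(raw)) {
      continue;
    }
    try {
      var resolved = new URL(raw, baseUrl); // resolves relative URLs
      resolved.hash = '';                   // drop the #fragment
      seen.add(resolved.href);              // Set removes duplicates
    } catch (e) {
      // Not a parseable URL; ignore it.
    }
  }
  return Array.from(seen);
}
```

Calling `extractLinks(body, 'http://demo.aisec.cn/demo/aisec/')` on a downloaded page body would return absolute, de-duplicated URLs in the order they first appear.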

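Both approaches break down in some cases, though: neither can find URLs that are produced by AJAX requests or by JavaScript code. For example, some links are only generated after the user clicks a button. Could anyone point me to a module or technique that handles these cases?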

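P.S. The address I am actually testing the crawler against is http://demo.aisec.cn/demo/aisec/ and I would like the crawler to pick up every link on it. Any advice is appreciated!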

nnabuuu #1 • 22 days ago

PhantomJS is all you need for this. As far as I remember, PhantomJS can already be integrated into Node.js.

asfman #2 • 22 days ago

I don't think PhantomJS is integrated into Node.js; you still have to install PhantomJS separately. Once it is installed, run npm install spooky. You can look at the spooky project on GitHub to see how it is used.

    // Fragment from inside an HTTP request handler: `fetchUrl` and `res`
    // are defined by the surrounding code, which is not shown here.
    var Spooky = require('spooky');
    var spooky = new Spooky({
        child: { transport: 'http' },
        casper: {
            pageSettings: { loadImages: false, loadPlugins: false },
            verbose: false
        }
    }, function (err) {
        if (err) {
            var e = new Error('Failed to initialize SpookyJS');
            e.details = err;
            throw e;
        }

        spooky.start(fetchUrl);
        spooky.on('html', function (doc) {
            // console.log(doc.url); // the URL that was finally fetched
            var cheerio = require('cheerio');
            var $ = cheerio.load(doc.html);
            var product = {};
            // TODO: fill in `product` from the parsed page
            res.json(product);
        });
        spooky.then(function () {
            // Runs in the CasperJS context; evaluate() runs inside the page.
            this.emit('html', this.evaluate(function () {
                return {
                    url: location.href,
                    html: document.querySelector('html').outerHTML
                };
            }));
        });
        spooky.run();
    });

    spooky.on('error', function (e, stack) {
        res.status(500).json({ error: (stack ? JSON.stringify(stack) : 'spooky error') });
    });

nodevc #3 • 22 days ago

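@asfman I installed all three libraries (phantomjs, casperjs and spooky) and then ran hello.js from the examples directory in spooky. The program failed with an error. I am running on Windows, and all three libraries were installed with npm. The error is: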

    events.js:72
            throw er; // Unhandled 'error' event
                  ^
    Error: spawn ENOENT
        at errnoException (child_process.js:1011:11)
        at Process.ChildProcess._handle.onexit (child_process.js:802:34)

Any idea what is causing this?
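A `spawn ENOENT` error generally means that Node's `child_process.spawn` could not find the executable it was asked to start, so one thing worth checking is whether `phantomjs` and `casperjs` can actually be found on the PATH. The sketch below is a small diagnostic for that. `findOnPath` is a hypothetical helper written for this example (it is not part of spooky, CasperJS or PhantomJS), and the idea that a missing or non-launchable executable is the cause here is an assumption, not something confirmed in the thread. On Windows, npm usually installs command-line tools as `.cmd` wrappers, which `spawn` may not be able to launch directly, so a result ending in `.cmd` or `.bat` is itself a hint.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical diagnostic helper: look for `command` in each directory of a
// PATH-style string. On Windows, also try the extensions listed in PATHEXT
// (for example .EXE, .CMD, .BAT), which are separated by ';'.
function findOnPath(command, pathVar, pathExt, delimiter) {
  delimiter = delimiter || path.delimiter;
  var dirs = (pathVar || '').split(delimiter).filter(Boolean);
  var exts = [''].concat((pathExt || '').split(';').filter(Boolean));
  for (var i = 0; i < dirs.length; i++) {
    for (var j = 0; j < exts.length; j++) {
      var candidate = path.join(dirs[i], command + exts[j]);
      try {
        if (fs.statSync(candidate).isFile()) {
          return candidate;
        }
      } catch (e) {
        // Not in this directory with this extension; keep looking.
      }
    }
  }
  return null;
}

// Report whether the two external tools that spooky relies on can be found.
['phantomjs', 'casperjs'].forEach(function (name) {
  var found = findOnPath(name, process.env.PATH, process.env.PATHEXT);
  console.log(name + ': ' + (found || 'NOT FOUND on PATH'));
});
```

If either tool is reported as not found, adding its install directory to PATH (or reinstalling it) is a reasonable first step before looking for other causes.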

nnabuuu #4 • 22 days ago

@asfman There are some unofficial solutions. They seem to cost some performance, but they should be fine for ordinary use.

See https://github.com/sgentle/phantomjs-node