How do I deduplicate with MongoDB in Scrapy, and how should the download pipeline be managed?

    Ewig · Oct 9, 2018 · 3468 views
    from scrapy.pipelines.files import FilesPipeline
    from scrapy import Request
    from scrapy.conf import settings

    import pymongo


    class XiaoMiQuanPipeLines(object):
        def __init__(self):
            # Connection parameters come from the project settings.
            host = settings["MONGODB_HOST"]
            port = settings["MONGODB_PORT"]
            dbname = settings["MONGODB_DBNAME"]
            sheetname = settings["MONGODB_SHEETNAME"]

            client = pymongo.MongoClient(host=host, port=port)
            mydb = client[dbname]
            self.post = mydb[sheetname]

        # Scrapy calls process_item(item, spider), so both arguments are required.
        def process_item(self, item, spider):
            url = item['file_url']
            name = item['name']

            # Attempted duplicate check on the (url, name) pair.
            result = self.post.aggregate(
                [
                    {"$group": {"_id": {"url": url, "name": name}}}
                ]
            )
            if result:
                pass
            else:
                self.post.insert({"url": url, "name": name})
            return item


    class DownLoadPipelines(FilesPipeline):

        def file_path(self, request, response=None, info=None):
            # Save the file under the name carried in the request meta.
            return request.meta.get('filename', '')

        def get_media_requests(self, item, info):
            file_url = item['file_url']
            meta = {'filename': item['name']}
            yield Request(url=file_url, meta=meta)


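    I've written two pipelines here. The first checks for duplicates: if the item is a duplicate, it should not be downloaded; if it is not a duplicate, it should be written to the database and then downloaded. I'm using aggregate to deduplicate on a compound key (url plus name).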
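One way the flow described above could be sketched is shown below. It is a sketch under stated assumptions, not the original poster's code. As far as I can tell, `aggregate()` returns a cursor object, which is truthy even when it yields nothing, and a `$group` stage built from constant values does not filter by them, so the check in the post never reaches the insert branch. The sketch swaps in a plain `find_one` lookup on the `(file_url, name)` pair and raises Scrapy's `DropItem`, which stops later pipelines (including the download pipeline) from seeing the item. The class name `DedupPipeline` and the module path `myproject.pipelines` are hypothetical; the setting names are the ones from the post.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch can be run without Scrapy installed
    class DropItem(Exception):
        pass


class DedupPipeline:
    """Drop items whose (file_url, name) pair is already stored (hypothetical name)."""

    def __init__(self, collection):
        # `collection` is any object with find_one/insert_one,
        # e.g. pymongo.MongoClient(...)[dbname][sheetname].
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        import pymongo  # imported lazily so the class has no hard dependency

        s = crawler.settings
        client = pymongo.MongoClient(s["MONGODB_HOST"], s.getint("MONGODB_PORT"))
        return cls(client[s["MONGODB_DBNAME"]][s["MONGODB_SHEETNAME"]])

    def process_item(self, item, spider):
        key = {"file_url": item["file_url"], "name": item["name"]}
        if self.collection.find_one(key) is not None:
            # Dropping here means no later pipeline processes this item.
            raise DropItem("duplicate: %s" % item["file_url"])
        # Pass a copy, because insert_one adds an _id field to the dict it is given.
        self.collection.insert_one(dict(key))
        return item


# settings.py: lower numbers run first, so the duplicate check precedes the download.
# The module path "myproject.pipelines" is an assumption.
ITEM_PIPELINES = {
    "myproject.pipelines.DedupPipeline": 100,
    "myproject.pipelines.DownLoadPipelines": 200,
}
```

A check followed by an insert is not atomic, so two concurrent crawls could still store the same pair twice; a unique index on the collection closes that gap.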
    9 replies    2019-01-24 18:33:58 +08:00
    #1 · watsy0007 · Oct 9, 2018
    ```python
    from pymongo import MongoClient

    # `config` and `generator_summary` are not defined in this snippet
    # (presumably they live elsewhere in the author's project).


    class MongoCache:
        db = None

        def __init__(self):
            # Connect once and share the database handle across instances.
            if MongoCache.db is None:
                MongoCache.create_instance()

        @staticmethod
        def create_instance():
            client = MongoClient(config.MONGO_URL)
            MongoCache.db = client['spider']

        def create(self, table, unique_key, origin_data):
            # Insert only if the key is new; return None for a duplicate.
            if self.exists(table, unique_key):
                return None

            summaries = {k: generator_summary(v) for (k, v) in origin_data.items()}

            return self.db[table].insert_one({
                'unique_key': unique_key,
                'data': origin_data,
                'summaries': summaries
            }).inserted_id

        def get(self, table, unique_key):
            data = self.db[table].find_one({'unique_key': unique_key})
            if data is None:
                return None
            return data['data']

        def exists(self, table, unique_key):
            data = self.db[table].find_one({'unique_key': unique_key})
            return data is not None

        def is_changed(self, table, unique_key, origin_data):
            # True if the record is new or any field's summary differs.
            if not self.exists(table, unique_key):
                return True

            last_summaries = self.db[table].find_one({'unique_key': unique_key})['summaries']
            for (k, v) in origin_data.items():
                summary = generator_summary(v)
                last_summary = last_summaries.get(k, None)
                # print('{} -> {} | {} -> {}'.format(k, v, summary, last_summary))
                if last_summary is None or last_summary != summary:
                    return True
            return False

        def change_fields(self, table, unique_key, origin_data):
            # Return only the fields whose summary differs from the stored one.
            if not self.exists(table, unique_key):
                return origin_data
            changes = {}
            last_summaries = self.db[table].find_one({'unique_key': unique_key})['summaries']
            for (k, v) in origin_data.items():
                last_summary = last_summaries.get(k, None)
                if last_summary is None or last_summary != generator_summary(v):
                    changes[k] = v
            return changes

        def update(self, table, unique_key, origin_data):
            if not self.exists(table, unique_key):
                return origin_data
            new_summaries = {k: generator_summary(v) for (k, v) in origin_data.items()}
            self.db[table].update_one({'unique_key': unique_key},
                                      {'$set': {'data': origin_data, 'summaries': new_summaries}})
            return origin_data
    ```
    #2 · watsy0007 · Oct 9, 2018
    V2EX doesn't support markdown...

    https://gist.github.com/watsy0007/779c27fb0ceab283cc434b5eec10b7c4

    It wraps the common methods I use for data processing.
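The snippet in reply #1 calls `generator_summary` without showing it. Purely as an assumption, it could be a stable digest of each field's value, which makes change detection a per-field comparison of digests. The helper below is a hypothetical stand-in to illustrate the idea, not the author's implementation, and `changed_fields` is likewise an illustrative name.

```python
import hashlib
import json


def generator_summary(value):
    """Hypothetical stand-in: a stable digest of one field's value."""
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def changed_fields(last_summaries, new_data):
    """Return the fields of new_data whose digest differs from the stored one."""
    return {
        k: v
        for k, v in new_data.items()
        if last_summaries.get(k) != generator_summary(v)
    }


old = {"name": "report.pdf", "pages": 8}
stored = {k: generator_summary(v) for k, v in old.items()}
new = {"name": "report.pdf", "pages": 9}
print(changed_fields(stored, new))  # {'pages': 9}
```

Storing digests rather than comparing raw values keeps the comparison cheap when individual fields are large, at the cost of one hash per field on every write.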
    #3 · picone · Oct 9, 2018
    I just add a unique index in Mongo and catch the index-conflict exception.
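The unique-index approach could look roughly like the sketch below. It assumes a compound index over the `file_url` and `name` fields, matching the pair the original post wants to deduplicate on; the helper names `ensure_unique_index` and `insert_if_new` are hypothetical. `create_index` with `unique=True` and `pymongo.errors.DuplicateKeyError` are PyMongo's own.

```python
try:
    from pymongo.errors import DuplicateKeyError
except ImportError:  # fallback so the sketch can be run without PyMongo installed
    class DuplicateKeyError(Exception):
        pass


def ensure_unique_index(collection):
    """Create a compound unique index on (file_url, name); 1 means ascending."""
    collection.create_index([("file_url", 1), ("name", 1)], unique=True)


def insert_if_new(collection, file_url, name):
    """Insert the pair and return True, or return False if it already exists."""
    try:
        collection.insert_one({"file_url": file_url, "name": name})
    except DuplicateKeyError:
        # The server rejected the write because the pair is already stored.
        return False
    return True
```

Because the server enforces uniqueness, there is no gap between checking and inserting. In a Scrapy pipeline, a `False` result would be the point at which to raise `DropItem` so the file is not downloaded again.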
    #4 · Ewig (OP) · Oct 12, 2018
    @picone Is yours a compound key? I mean url and name together.
    #5 · picone · Oct 12, 2018
    #6 · Ewig (OP) · Oct 12, 2018
    @picone db.XiaoMiQuan.find()
    { "_id" : ObjectId("5bbf14dbc96b5b3f5627d11d"), "file_url" : "https://baogaocos.seedsufe.com/2018/07/19/doc_1532004923556.pdf", "name" : "AMCHAM-中国的“一带一路”:对美国企业的影响(英文)-2018.6-8 页.pdf" }
    This is how I'm writing it now. Is this right?
    #7 · pyfrog · Jan 24, 2019
    @Ewig Want me to send you all the PDFs from their whole site?
    #8 · Ewig (OP) · Jan 24, 2019
    @pyfrog Their site keeps getting updated, doesn't it?
    #9 · pyfrog · Jan 24, 2019
    @Ewig Yeah, I'll just give you the server directly.