Issue 16099
Created on 2012-10-01 12:58 by XapaJIaMnu, last changed 2015-10-08 09:34 by berker.peksag. This issue is now closed.
Files

| File name | Uploaded | Description | Edit |
|---|---|---|---|
| robotparser.patch | XapaJIaMnu, 2012-10-01 12:58 | patch for robotparser.py | |
| robotparser.patch | XapaJIaMnu, 2012-10-01 13:37 | same patch for python3X | |
| robotparser.patch | XapaJIaMnu, 2012-10-07 18:20 | Changes + test cases + documentation | review |
| robotparser_reformatted.patch | XapaJIaMnu, 2012-10-07 19:56 | Changes, test cases, documentation, reformatted | review |
| robotparser_v2.patch | XapaJIaMnu, 2013-12-10 00:22 | V2 with fixes | review |
| robotparser_v3.patch | XapaJIaMnu, 2014-05-27 09:29 | V3 crawl delay and request rate patch | review |
Messages (17)
**msg171711** - Nikolay Bogoychev (XapaJIaMnu), 2012-10-01 12:58

Robotparser doesn't support two quite important optional parameters from the robots.txt file. I have implemented them as follows (the parser is initialized in the usual way: rp = robotparser.RobotFileParser(); rp.set_url(...); rp.read()):

crawl_delay(useragent) - returns the time in seconds that you need to wait between crawls. If none is specified, or it doesn't apply to this user agent, returns -1.

request_rate(useragent) - returns a list in the form [requests, seconds]. If none is specified, or it doesn't apply to this user agent, returns -1.
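The behavior described above can be sketched in a few lines. This is a simplified stand-alone illustration, not the patch itself: `parse_rules` and the fallback-to-`'*'` lookup are invented here, and robots.txt grouping rules (multiple consecutive `User-agent` lines) are ignored.

```python
def parse_rules(robots_txt):
    """Collect Crawl-delay and Request-rate per user agent (happy path only)."""
    rules, agents = {}, []
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line or ':' not in line:
            continue
        key, value = (part.strip() for part in line.split(':', 1))
        key = key.lower()
        if key == 'user-agent':
            agents = [value]  # simplification: last User-agent line wins
            rules.setdefault(value, {})
        elif key == 'crawl-delay' and agents:
            rules[agents[0]]['delay'] = int(value)
        elif key == 'request-rate' and agents:
            requests, seconds = value.split('/')
            rules[agents[0]]['rate'] = [int(requests), int(seconds)]
    return rules

def crawl_delay(rules, useragent):
    # -1 when no rule is specified or none applies, as proposed above
    return rules.get(useragent, rules.get('*', {})).get('delay', -1)

def request_rate(rules, useragent):
    return rules.get(useragent, rules.get('*', {})).get('rate', -1)

rules = parse_rules("User-agent: *\nCrawl-delay: 5\nRequest-rate: 3/20\n")
print(crawl_delay(rules, 'mybot'))   # 5 (falls back to the '*' group)
print(request_rate(rules, 'mybot'))  # [3, 20]
```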
**msg171712** - Christian Heimes (christian.heimes), 2012-10-01 13:16

Thanks for the patch. New features must be implemented in Python 3.4. Python 2.7 is in feature freeze and therefore doesn't get new features.
**msg171715** - Nikolay Bogoychev (XapaJIaMnu), 2012-10-01 13:37

Okay, sorry, I didn't know that (: Here's the same patch (same functionality) for Python 3. Feedback is welcome, as always (:
**msg171719** - Christian Heimes (christian.heimes), 2012-10-01 13:52

We have a team that mentors new contributors. If you are interested in getting your patch into Python 3.4, please read http://pythonmentors.com/ . The people are really friendly and will help you with every step of the process.
**msg172327** - Nikolay Bogoychev (XapaJIaMnu), 2012-10-07 18:20

Okay, here's a proper patch with a documentation entry and test cases. Please review and comment.
**msg172338** - Nikolay Bogoychev (XapaJIaMnu), 2012-10-07 19:56

Reformatted patch.
**msg205567** - Nikolay Bogoychev (XapaJIaMnu), 2013-12-08 14:41

Hey, it has been more than a year since the last activity. Is there anything else I should do to get someone from the Python dev team to review my changes and perhaps give some feedback?

Nick
**msg205641** - Berker Peksag (berker.peksag), 2013-12-09 02:31

I left a few comments on Rietveld.
**msg205755** - Nikolay Bogoychev (XapaJIaMnu), 2013-12-10 00:22

Thank you for the review! I have addressed your comments and released a v2 of the patch. Highlights:

- No longer crashes when provided with a malformed crawl-delay parameter or robots.txt.
- Returns None when a parameter is missing or its syntax is invalid.
- Simplified several functions.
- Extended tests.

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst
File Doc/library/urllib.robotparser.rst (right):

http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
Doc/library/urllib.robotparser.rst:56: .. method:: crawl_delay(useragent)

On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is crawl_delay used for search engines? Google recommends you to set crawl speed
> via Google Webmaster Tools instead.
> See https://support.google.com/webmasters/answer/48620?hl=en.

The crawl-delay and request-rate parameters are targeted at the custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools applies only to Google's crawler; web admins typically set different rates for Google/Yahoo/Bing than for all other user agents.

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py
File Lib/urllib/robotparser.py (right):

http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
Lib/urllib/robotparser.py:168: for entry in self.entries:

On 2013/12/09 03:30:54, berkerpeksag wrote:
> Is there a better way to calculate this? (perhaps O(1)?)

I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, since all previously implemented functions use the

    for entry in self.entries:
        if entry.applies_to(useragent):

logic. I don't think this matters particularly here, as these two functions only need to be called once per domain, and robots.txt seldom contains more than 3 entries. That is why I have followed the design laid out by the original developer.

Thanks,
Nick
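For reference, the dictionary-based lookup floated in the review could look roughly like the hypothetical sketch below. It is not part of any patch: the `EntryIndex` class and its fallback behavior are invented for illustration, and `useragents` merely mirrors the attribute name on `robotparser`'s internal `Entry` objects.

```python
from types import SimpleNamespace

class EntryIndex:
    """Hypothetical O(1) lookup for robots.txt entries, keyed by user agent."""

    def __init__(self, entries):
        self._by_agent = {}
        for entry in entries:
            for agent in entry.useragents:
                # First entry naming an agent wins, matching robots.txt order.
                self._by_agent.setdefault(agent.lower(), entry)

    def find(self, useragent):
        # Match on the token before any version suffix, falling back to '*'.
        key = useragent.split('/')[0].lower()
        return self._by_agent.get(key, self._by_agent.get('*'))

# Demo with stand-in entries (real code would index robotparser's Entry objects).
entries = [
    SimpleNamespace(useragents=['figtree'], delay=3),
    SimpleNamespace(useragents=['*'], delay=10),
]
index = EntryIndex(entries)
print(index.find('FigTree/0.1').delay)  # 3
print(index.find('unknown-bot').delay)  # 10 (falls back to '*')
```

As the thread notes, the payoff is small: queries run once per domain over a handful of entries, so the linear scan the module already uses is perfectly adequate.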
**msg205761** - Nikolay Bogoychev (XapaJIaMnu), 2013-12-10 00:41

Oh... sorry for the spam, but could you please verify my documentation link syntax? I'm not entirely sure I got it right.
**msg208721** - Nikolay Bogoychev (XapaJIaMnu), 2014-01-21 23:30

Hey, just a friendly reminder that there hasn't been any activity for a month and I have released a v2, pending review (:
**msg219212** - Nikolay Bogoychev (XapaJIaMnu), 2014-05-27 09:29

Updated patch; all comments addressed. Sorry for the six-month delay. Please review.
**msg223099** - Nikolay Bogoychev (XapaJIaMnu), 2014-07-15 10:38

Hey, just a friendly reminder that there has been no activity for a month and a half and v3 is pending review (:
**msg225916** - Nikolay Bogoychev (XapaJIaMnu), 2014-08-26 13:15

Hey, just a friendly reminder that the patch is pending review and there has been no activity for 3 months (:
**msg252483** - Nikolay Bogoychev (XapaJIaMnu), 2015-10-07 20:01

Hey, friendly reminder that there has been no activity on this issue for more than a year.

Cheers,
Nick
**msg252521** - Roundup Robot (python-dev), 2015-10-08 09:27

New changeset dbed7cacfb7e by Berker Peksag in branch 'default':
Issue #16099: RobotFileParser now supports Crawl-delay and Request-rate
https://hg.python.org/cpython/rev/dbed7cacfb7e
**msg252525** - Berker Peksag (berker.peksag), 2015-10-08 09:34

I've finally committed your patch to default. Thank you for not giving up, Nikolay :) Note that the link in the example section currently doesn't work; I will open a new issue for that.
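As committed, the feature differs slightly from the original proposal: both methods return None rather than -1 when no applicable rule exists, and request_rate returns a named tuple (with `requests` and `seconds` fields) instead of a list. On Python 3.6+ the final API can be exercised like this (example.org and the sample robots.txt are illustrative):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the robots.txt content as an iterable of lines,
# so no network access is needed for a quick demo.
rp.parse("""\
User-agent: *
Crawl-delay: 10
Request-rate: 3/20
Disallow: /private/
""".splitlines())

print(rp.crawl_delay("*"))          # 10
rate = rp.request_rate("*")         # named tuple: RequestRate(requests=3, seconds=20)
print(rate.requests, rate.seconds)  # 3 20
print(rp.crawl_delay("mybot"))      # 10 -- falls back to the '*' entry
print(rp.can_fetch("mybot", "http://example.org/private/page"))  # False
```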
History

| Date | User | Action | Args |
|---|---|---|---|
| 2015-10-08 09:34:21 | berker.peksag | set | status: open -> closed; versions: + Python 3.6, - Python 3.5; messages: + msg252525; resolution: fixed |
| 2015-10-08 09:27:17 | python-dev | set | nosy: + python-dev; messages: + msg252521 |
| 2015-10-07 20:01:21 | XapaJIaMnu | set | messages: + msg252483 |
| 2014-08-26 13:15:36 | XapaJIaMnu | set | messages: + msg225916 |
| 2014-07-15 10:38:02 | XapaJIaMnu | set | messages: + msg223099 |
| 2014-07-07 10:35:56 | berker.peksag | set | assignee: berker.peksag |
| 2014-05-27 09:29:06 | XapaJIaMnu | set | files: + robotparser_v3.patch; messages: + msg219212 |
| 2014-05-13 04:21:46 | rhettinger | set | assignee: rhettinger -> (no value) |
| 2014-05-12 15:01:15 | rhettinger | set | assignee: rhettinger; nosy: + rhettinger |
| 2014-01-21 23:30:03 | XapaJIaMnu | set | messages: + msg208721 |
| 2013-12-10 00:41:57 | XapaJIaMnu | set | messages: + msg205761 |
| 2013-12-10 00:22:51 | XapaJIaMnu | set | files: + robotparser_v2.patch; messages: + msg205755 |
| 2013-12-09 02:31:39 | berker.peksag | set | nosy: + berker.peksag; messages: + msg205641 |
| 2013-12-08 14:41:56 | XapaJIaMnu | set | messages: + msg205567 |
| 2012-11-02 07:34:19 | hynek | set | nosy: + orsenthil |
| 2012-10-08 06:43:39 | hynek | set | nosy: + hynek |
| 2012-10-07 20:03:39 | christian.heimes | set | keywords: + needs review; stage: test needed -> patch review |
| 2012-10-07 19:56:11 | XapaJIaMnu | set | files: + robotparser_reformatted.patch; messages: + msg172338 |
| 2012-10-07 18:20:53 | XapaJIaMnu | set | files: + robotparser.patch; messages: + msg172327 |
| 2012-10-01 13:52:20 | christian.heimes | set | keywords: + easy, - gsoc; messages: + msg171719 |
| 2012-10-01 13:37:51 | XapaJIaMnu | set | files: + robotparser.patch; messages: + msg171715 |
| 2012-10-01 13:16:01 | christian.heimes | set | versions: + Python 3.4, - Python 2.7; nosy: + christian.heimes; messages: + msg171712; keywords: + gsoc |
| 2012-10-01 12:58:25 | XapaJIaMnu | create | |