Google has shut down Code Search and the Social Graph API. Replicating those services outside of Google is hard; they presuppose a scalable, well-oiled crawling/indexing/serving machine. Along the same lines, the HTTP Archive tracks 55K URLs (with a goal of 1M by the end of 2012). Only Google has been able to release interesting n-gram data (from books too). W3Techs also only looks at the Alexa top 1 million. Presumably this is because crawling is hard. If Foursquare wanted to know how many pages have a link to a URL, would they want to crawl the whole web?

"Aleph" would be an infrastructure company that offers a few services:

* Large (billion and up) archive of sites on the internet. This would include not just raw HTML, but the full set of resources necessary to recreate that page (think crawling with headless WebKit and capturing all HTTP requests).
* Ability to do quick (semi-interactive) analyses over the crawl data that has been pre-processed (think regular expressions, or Dremel-like queries for attributes).
* Ability to run arbitrary MapReduces over the data (see ZeroVM for a way to do this safely); a rough sketch of such a job is at the end of this post.
* Ability to import new datasets (whether web-like or not).

For the web index to be useful for some applications, it would need to have a PageRank-like attribute per page to expose importance (and/or a number of visits/traffic). In theory the Internet Archive has a lot of this data.

Build a Stream Spigot tool for filtering of feeds. Unlike Feed Rinse or Yahoo! Pipes, it wouldn't involve taking a feed URL, giving it to the tool, and getting a filtered URL to subscribe to in return. Instead, it would work by connecting to your Google Reader account and then marking matching items as read, so that they don't show up (if using the "new items only" view). This would have the advantage of not having to re-subscribe when deciding to filter something, benefiting from the regular crawl speed (vs. filtered feed URLs generally having only one subscriber and thus being in the slow crawl bucket) and having item metadata be preserved (though that matters less now that likes and shares are gone).

The filtering could either be done periodically, or, for feeds that are PubSubHubbub-enabled, it could be triggered by PSHB pings. It does mean that there is a window during which items that should be filtered out are visible, but the filtering is likely to be cheap enough (get items since the last check, apply a read tag) that it can be done quite frequently; a sketch of one such pass is at the end of this post. In addition to the read tag, a "filtered" state tag should also be applied to the items, so that these items can be differentiated during subsequent API calls.

Notes on setting up a Chromium checkout on the Mac:

* Visit Chromium Access and get credentials.
* Run svn ls --username svn:///chrome/trunk/src and svn ls --username svn:///blink/trunk to verify said credentials.
* Add ~/Developer/source/chromium to the list of paths that Spotlight excludes.
* Some things in the faster Linux builds page apply to the Mac too (disable NaCl, use shared libraries): `export GYP_DEFINES="component=shared_library disable_nacl=1"`
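To make the Aleph idea above a bit more concrete, here is a minimal sketch of the kind of MapReduce a customer might run over the crawl, answering the Foursquare-style question of how many pages link to a given URL. The record format (a stream of (url, html) pairs) and the function names are assumptions for illustration, not part of any real service.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

HREF_RE = re.compile(r'href="([^"]+)"')

def map_page(url: str, html: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (target_url, 1) for every outgoing link on one crawled page."""
    for target in HREF_RE.findall(html):
        yield target, 1

def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce phase: sum the emitted counts into a backlink count per target URL."""
    totals: Counter = Counter()
    for target, count in pairs:
        totals[target] += count
    return totals

# Tiny in-memory "crawl" standing in for the billion-page archive.
crawl = [
    ("http://example.com/a", '<a href="https://foursquare.com/">fsq</a>'),
    ("http://example.com/b", '<a href="https://foursquare.com/">fsq</a> and <a href="http://example.com/a">a</a>'),
]

backlinks = reduce_counts(pair for url, html in crawl for pair in map_page(url, html))
print(backlinks["https://foursquare.com/"])  # 2 pages in this toy crawl link to it
```

In the actual service, the same map/reduce pair would run against the stored crawl inside something like ZeroVM rather than over an in-memory list.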
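And for the feed-filtering idea, here is a minimal sketch of one filtering pass: fetch items that arrived since the last check, mark anything that matches a filter pattern as read, and also tag it "filtered". The `reader_client` object and its `items_since`/`add_tags` helpers are hypothetical stand-ins for Google Reader API calls, not real endpoints.

```python
import re
import time

# Hypothetical user-configured filter patterns.
FILTER_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"\bsponsored\b", r"giveaway")]

def is_filtered(item: dict) -> bool:
    """True if the item's title or summary matches any filter pattern."""
    text = item.get("title", "") + " " + item.get("summary", "")
    return any(pattern.search(text) for pattern in FILTER_PATTERNS)

def run_filter_pass(reader_client, feed_url: str, last_check: int) -> int:
    """One filtering pass over a single feed; returns the timestamp for the next pass.

    Could be invoked from a periodic job or from a PubSubHubbub callback.
    """
    now = int(time.time())
    # Hypothetical call: items published in feed_url since last_check.
    for item in reader_client.items_since(feed_url, last_check):
        if is_filtered(item):
            # Mark as read so the item disappears from the "new items only" view,
            # and add a separate "filtered" tag so later API calls can tell these
            # apart from items the user actually read.
            reader_client.add_tags(item["id"], ["read", "filtered"])
    return now
```

The window during which a to-be-filtered item remains visible is then just the interval between passes (or the PSHB ping latency).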