Archive for the ‘Programming’ Category
Twitter comes clean
Twitter developer manager Alex Payne has updated the Twitter FAQ with the actual, real, honest story on the return of Track to its users. First, the relevant text:
When will the firehose be ready?
By late January, early February 2009. For at least Q1 2009, the “firehose” (the near-realtime stream of all public status updates on Twitter) will only be available to a small group of trusted partners. The firehose is a stream HTTP solution; a client connects to it and the stream begins, ceasing only when the client disconnects. Once we’re confident in the stability of the service, we’ll add partners on a case-by-case basis. We may allow a wider selection of clients to consume subsets of the public stream (that is, updates from a collection of user IDs or matching specific search terms). We do not intend to allow anonymous, unregulated public access to this stream for any number of legal, financial, and technical reasons.
Now, the translation:
Real soon now, especially now that FriendFeed has a quarter of our page views with a stunningly familiar hockeystick of growth, we will release the firehose to trusted partners. Trusted means those vendors who will agree not to allow access to… see below. The firehose is the full stream of our data that has been blocked from its contributors since May 2008. Once we’re sure it is stable, we’ll continue to make it available while adding what must be semi-trusted cases. It’s also possible we’ll deliver a subset of the firehose (analogous to somewhat pregnant) defined as Track on identities and keywords. The keyword here is “may”. Finally, we won’t allow anonymous unregulated access, period. That is, even though we have numerous partners and untrusted startups currently recording Twitter notices and storing them for unregulated anonymous access since Twitter began.
FriendFeed co-founder Bret Taylor appeared on NewsGang Live Friday, and told me relations with Twitter continue to be good. The two companies are working through some problems with the rate limiting curbs introduced by Twitter several weeks ago, but Taylor anticipates a resolution shortly. Several third-party Track projects, most notably Dustin Sallings’ TwitterSpy, have been disabled due to the 20,000 API call limit imposed. Sallings is blunt in this FriendFeed thread:
They’re going to offer a friendfeed-style HTTP firehose to a limited group. My suspicion is that that group will be limited more by how threatening a business is than even by how much twitter’s traffic may be reduced by such a partnership. I might be wrong, but the only ideas they seem to have for making money from their business involve removing value their customers want.
Meanwhile, Taylor says FriendFeed is moving forward with enhanced realtime tools to help model Twitter and other data. Rooms will gain new controls for aggregating multiple streams, a major search-related announcement is coming later this week, broader filtering and track functionality awaits a several-month rewrite of some parts of the core architecture, and most importantly, FriendFeed will continue to employ an open, inward and outward-facing data strategy. This is in sharp contrast to both Twitter and Facebook, who allow ingress but limit outbound flow.
There are several efforts underway to work around or via the back channel with Twitter to reengage track services. Services such as Twhirl that have released betas with “track” support may fall into both categories, but eventually Twitter will find a happy medium where monetization will begin to flow. In the meantime, FriendFeed continues to offer a more conversational and flexible model, making it a significant competitor for user contributions. Even now, it’s trivial to maintain a Twitter presence via FriendFeed that would require a fundamental change in developer relations to undermine.
Now that Twitter has achieved a certain stability and clarity in its rate-limiting strategy, the next phase will focus on identifying and rationalizing its trusted partners. The fundamental value proposition of track – the filtering of micromessages based on a combination of identity and conversational context – can now be achieved in FriendFeed with greater fidelity and, soon, realtime alert mechanisms that allow more personalized and affinity-powered flow regulation. The result: time-efficient information at the center of the user experience.
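To make the idea of track concrete, here is a minimal sketch of filtering a stream of updates by identity and keyword. It is not Twitter's or FriendFeed's implementation; the `Update` type, the `track` function, and the plain substring matching are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    # Hypothetical shape of a status update; real payloads carry far more fields.
    user_id: int
    text: str


def matches_track(update, user_ids=frozenset(), keywords=()):
    """Return True if the update comes from a tracked user or contains a tracked keyword."""
    if update.user_id in user_ids:
        return True
    text = update.text.lower()
    # Naive case-insensitive substring match, chosen only to keep the sketch short.
    return any(keyword.lower() in text for keyword in keywords)


def track(stream, user_ids=frozenset(), keywords=()):
    """Lazily yield only the updates that match the track rules."""
    for update in stream:
        if matches_track(update, user_ids, keywords):
            yield update
```

Because `track` is a generator, it can sit on top of any iterable of updates, including a long-lived stream, without buffering the whole feed.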
Over time, Twitter’s huge audience size and mainstream media acceptance will become less significant, forcing Twitter to name its price for its unique value even as it is watered down by more flexible tools and micro-community adoption of its competitors. Regardless of the anger in the community, which clearly has been discounted as a small minority in Twitter’s game plan, the clarity of Twitter’s rate limiting and brute force approach in managing its developer community now stands in sharp contrast to FriendFeed’s approach.
Google, Yahoo spiders can now crawl through Flash sites
As anyone who has had the pleasure of doing web design and development through marketing agencies knows, Flash tends to be wildly popular among clients and wildly unpopular among, well, pretty much everyone else. Part of the reason is that Flash is so inherently un-Googleable; anything that goes into a Flash-only site is basically invisible to search engines and therefore, the world. That will no longer be the case, however, as Adobe announced today that it has teamed up with Google and Yahoo to make Flash files indexable by search engines.
This announcement has been a long time coming, as Flash developers have been wishing for ways to make their content searchable for close to a decade. Adobe acknowledges this in its announcement, saying that although search engines are able to index static text and links within Flash SWF files, “[Rich Internet Applications] and dynamic Web content have been generally difficult to fully expose to search engines because of their changing states—a problem also inherent in other RIA technologies.”
This announcement may also result in some major usability changes (for the better) for Flash on the web. In a post to its Webmaster Central Blog, Google wrote that it can now index all kinds of textual content in SWF files, like that included in Flash gadgets, buttons, menus, entirely self-contained Flash web sites, “and everything in between.” Google can now also follow URLs embedded within Flash files to add to the crawling pipeline. This new indexing technology does not, however, include FLV files (video files that are found on sites like YouTube) because those are generated as videos and don’t contain any text elements like an SWF file does.
Google says it’s able to do this by developing an algorithm that “explores Flash files in the same way that a person would,” by clicking buttons and manually going through Flash content. “Our algorithm remembers all of the text that it encounters along the way, and that content is then available to be indexed,” wrote the company. “We can’t tell you all of the proprietary details, but we can tell you that the algorithm’s effectiveness was improved by utilizing Adobe’s new Searchable SWF library.”
Of course, Google (and eventually Yahoo) won’t be able to index everything embedded within a Flash file—at least not yet. Anything that is image-related, including text that is embedded into images, will be invisible to the search engines for the time being. Google also noted that it can’t execute certain JavaScripts that may be embedded into a Flash file, and that while it indexes content that is contained in a separate HTML or XML file, it won’t be counted as part of the content in the Flash file. These are all issues that are being worked on, however, and are likely to change in the future.
Yahoo is also working with Adobe to index SWF files, but doesn’t appear to be as far along as Google just yet. One player that is noticeably missing is Microsoft, though. From Adobe’s announcement and the language used by Google, it appears as if each search engine has to work with Adobe to make this possible—meaning that Microsoft has either been excluded by Adobe for this round or has decided to voluntarily sit this one out. Either way, with searchable SWF files down, usability experts can now focus all of their attention on other Flash-related concerns, like blatant design perversion and excessive animation abuse.
Firefox 3 and Safari 4 in browser speed race
Most of today’s web sites and web applications are built using the JavaScript scripting language. Some may say that a trend towards the fine-tuning of JavaScript interpreters in modern browsers was just a matter of time since any such optimization translates into performance gains. Mozilla recently launched the browser speed race with Firefox 3, which delivers more speed than any previous Firefox version. Apple answered with Safari 4, claiming the browser’s JavaScript engine has been accelerated by 53%. Welcome to the browser speed race.
Safari 4 has just been seeded to the developers at Apple’s developer conference. The manufacturer claims that the software has a 53% faster JavaScript engine than the current version, 3.1 (based on the SunSpider JavaScript Performance test conducted on an iMac with an Intel Core 2 Duo processor at 2.8 GHz, with 2 GB of RAM and running under Mac OS X Snow Leopard). Although Firefox 3 RC3 was the first to deliver significant JavaScript performance improvement, Apple apparently is exceeding those gains with Safari 4.
Apple uses a new and improved JavaScript interpreter code-named SquirrelFish, which is provided on an open-source basis from the WebKit project, the same organization that makes the open-source engine used by Safari to render web pages. According to the WebKit project, the SquirrelFish engine is 1.6 times faster than the JavaScript engine in Safari 3.1.
SquirrelFish does its magic by compiling JavaScript source into so-called bytecode, a compact intermediate form much better suited to run-time execution than the original source text, which is longer and more complicated to interpret and is therefore slower.
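The general idea of compiling source to bytecode once and then executing the compact form can be sketched with a toy example. This is not SquirrelFish's design: the opcode names are invented, the machine is a simple stack machine, and Python's own `ast` module stands in for a JavaScript parser, handling only numeric arithmetic.

```python
import ast
import operator

# Hypothetical opcodes for this toy example only.
BINARY_OPS = {
    ast.Add: ("ADD", operator.add),
    ast.Sub: ("SUB", operator.sub),
    ast.Mult: ("MUL", operator.mul),
}
OP_FUNCS = {name: func for name, func in BINARY_OPS.values()}


def compile_expr(source):
    """Compile an arithmetic expression into a flat list of (opcode, argument) pairs."""
    code = []

    def emit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            code.append(("PUSH", node.value))
        elif isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            # Operands first, operator last: the order a stack machine needs.
            emit(node.left)
            emit(node.right)
            code.append((BINARY_OPS[type(node.op)][0], None))
        else:
            raise ValueError("unsupported syntax in toy compiler")

    emit(ast.parse(source, mode="eval").body)
    return code


def run(code):
    """Execute the bytecode on an operand stack and return the final value."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OP_FUNCS[opcode](left, right))
    return stack.pop()
```

The parsing cost is paid once in `compile_expr`; `run` then loops over a flat list with no further syntax analysis, which is the property that makes bytecode attractive for code that executes repeatedly.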
Why JavaScript performance matters
Most of today’s web applications and web 2.0 sites rely on the JavaScript scripting language originally created by current Mozilla CTO Brendan Eich while he was employed by Netscape. JavaScript acts as glue that connects a user interface rendered in a web browser with a database and programming logic running in a web server. The browser’s JavaScript engine is solely responsible for interpreting and executing JavaScript commands embedded in HTML code. As a result, a browser’s JavaScript engine performance is directly related to the performance and responsiveness of a web application, contributing to an improved user experience.
The fact that many applications grow in size and become more bloated with each release means that a browser that can run web applications faster and make user interfaces more responsive on any computer is actually a big deal. You don’t have to have any specific market forecasting talent to predict that this trend may be impacting browser market shares: Speed can directly translate into more usability for most of us. Clearly, JavaScript handling is on its way to becoming a powerful weapon in the browser market.
SpiderMonkey, SquirrelFish, Tamarin and more
Mozilla was the first to introduce significant speed gains with Firefox 3 beta 5 (the final version is expected to ship by mid-June). Firefox has its Gecko engine to render web pages, which is generally considered to be slightly slower than Safari’s WebKit, which is largely responsible for the “fastest browser in the world” status Safari enjoys. Firefox’s JavaScript implementation is based on Mozilla’s own decade-old SpiderMonkey technology, which many considered to be the fastest JavaScript interpreter until SquirrelFish came out.
Although in beta, Firefox 3 scored with many reviewers who are praising the browser’s performance improvements, with WSJ’s Walt Mossberg declaring the browser a “winner.” But now that the SquirrelFish/Safari combination appears to be offsetting the speed gains in Firefox 3 and may set a new benchmark, we can expect more direct competition between Mozilla and Apple. Mozilla has plans to expand SpiderMonkey with Adobe’s JavaScript engine called Tamarin, included in Flash 9, which has a so-called “tracing” feature designed to enable faster code execution. However, the SunSpider JavaScript benchmark claims that SquirrelFish is at least 1.9 times faster than Tamarin.
Mozilla plans to wedge Tamarin into Firefox and match the APIs of both technologies. “There are areas in which SpiderMonkey is faster than Tamarin and areas where it’s not. We’re looking to build hybrids that are best-of-breed for both worlds and we’re going to pull those into the Firefox release when ready,” Mozilla co-founder Mike Shaver recently said.
Can IE8 compete?
The big variable in this game is Microsoft’s Internet Explorer 8, currently in beta 1 phase. IE8 is expected to deliver speed gains in JavaScript performance as well. However, Microsoft is facing a tough task. The fact that the software giant is often criticized for delivering bloated and inefficient software certainly doesn’t help. In our tests, the first beta of IE8 shows no noticeable speed gains in running web applications.
Quite the opposite is the case, actually. Websites and web applications run noticeably slower than in IE7. The whole browsing experience generally appears to be less responsive. Of course, IE8 is in an early development stage and you can bet Microsoft is going to tweak its performance. The only problem is that the software giant will have to work to raise the stakes in the browser race. If IE8 under-delivers, the market could respond with further market share erosion for IE. It is evident now that JavaScript engine performance has become a key metric in the newest race for the title of fastest browser.
The battle ahead is nicely summed up by Mozilla co-founder Mike Shaver, who said the following: “They [Apple] have dropped SquirrelFish in now and got a big speed up there. We’ve got more coming on our side. You’ll see this leapfrog pattern over and over. We’re not going to let anybody slack on that and the other browser vendors need to keep up, too.”
According to Net Applications, Firefox 3 captured almost one fifth (18.41%) of the browser market in May, followed by Safari 3.1 which hit 6.25%. Microsoft’s Internet Explorer continues on its pace of a slow but steady decline, ending up at 73.75% in May. Microsoft has scheduled the second beta of IE8 for an August release, with a generally expected final release in the fourth quarter of this year.
Twitter at scale: Will it work?
Only two days ago the contact messaging application Twitter suffered another bout of downtime, leaving some users frustrated and others asking why the platform continues to suffer problems.
Techcrunch recently spoke to an individual who is familiar with the technical problems at Twitter as well as the challenges that lie ahead for the startup. He reiterated his belief that the problems lie not with Blaine Cook (the former head of engineering who was shown the door), nor with NTT (their host) but with the early lack of understanding of how complex their problems would be.
The issue is that group messaging is very difficult to achieve at a grand scale. Other large sites such as WordPress and Digg are mostly dealing with known problems, such as how to serve a large number of pages or a large number of images. Twitter is unique in that it needs to parse a large number of messages and deliver them to multiple recipients, with each user having unique connections to other users.
Social networks have similar complexity issues, but they only usually need to route a message to a single user (or at the most to a defined group). Even so, social networks like Friendster struggled for years with technical and scaling issues. Twitter is specifically dealing with text messages, and in most cases with active users those messages are very frequent and go out to hundreds of contacts (or followers, as they are referred to in Twitter). Every new Twitter user and every new connection results in an exponentially greater computational requirement.
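The delivery problem described above can be illustrated with a rough sketch of message fan-out, in which every post is copied into the inbox of each follower. The `FanOut` class and its in-memory dictionaries are assumptions for illustration only and say nothing about how Twitter actually stores or routes messages.

```python
from collections import defaultdict


class FanOut:
    """Toy in-memory fan-out: one inbox write per follower per post."""

    def __init__(self):
        self.followers = defaultdict(set)   # author -> set of follower names
        self.inboxes = defaultdict(list)    # follower -> list of (author, text)
        self.deliveries = 0                 # total inbox writes performed so far

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, text):
        # Each post costs as many writes as the author has followers.
        for follower in self.followers[author]:
            self.inboxes[follower].append((author, text))
            self.deliveries += 1
```

Counting `deliveries` makes the cost visible: a single post from a user with hundreds of followers triggers hundreds of writes, and frequent posters multiply that work.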
Some of the best web applications are able to efficiently solve very complex problems to produce simple results for users (e.g., Google). The success of these applications is due to the innovative efforts by developers to solve large technical challenges, where they have often had to break new ground for solutions. For Twitter to reach a similar point of reliability they too will need a very comprehensive, ground-breaking solution.
The source that I spoke to also commented on how ill-prepared the Twitter team were and are for their current and future challenges. The small team contains a handful of engineers, with only a person or two committed to infrastructure and architecture. He goes on to point out that at Digg the team for network and systems alone is bigger than the total engineering team at Twitter, and that at Digg they are led by well-known “A-list rockstars”.
The problems at Twitter are often attributed to their use of RubyOnRails, a web development framework. Twitter is almost certainly the largest site running on Rails, so fans of the framework and its developers have been quick to deflect the criticism and point it back at the engineers at Twitter. Utilizing a framework that has never conquered large-scale territory must certainly add to the risk and work required to find a solution. As an out-of-the-box framework, Rails certainly doesn’t lend itself to large-scale application development.
Rails enabled Twitter to be developed quickly, to get to launch quickly and then to improve with new features relatively rapidly as well. But the old adage of “Good, Fast, Cheap – pick two” certainly applies and Rails would do itself no harm by conceding that it isn’t a platform that can compete with Java or C when it comes to intensive tasks. Twitter is at a crossroads as an application and Rails has served its purpose very well to date, but you are unlikely to see a computational cluster built with Ruby at Apache any time soon.
What we see at Twitter today is a very useful and popular service, but one with very complex underlying technical challenges to overcome. Twitter will require not only a new architecture approach and a big injection of the best minds they can find ($15 million can help), but will also need a little patience from users and those of us observing.
Googlebot crawls through HTML forms
Google will stop at nothing in its quest to index the world’s information. Last year it ate through 100 exabytes of data, but there’s still a lot that it can’t get access to. This is known as the deep web (or hidden web, or invisible web, etc.), and it is estimated that the majority of online data is hidden safely from Google’s prying eyes: private intranets, unlinked pages, some non-textual content, and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.
“For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made,” explained Jayant Madhavan and Alon Halevy in a blog post. “If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
Google, which says that the crawling of dynamic form results doesn’t affect the “crawling, ranking, or selection of other web pages in any significant way,” also assured webmasters today that their enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won’t be crawled.
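The URL-generation step described in the blog post can be sketched roughly as follows. This is only an illustration of combining candidate input values into GET query URLs; the `candidate_urls` function, its `limit` parameter, and the example URL are assumptions, and the sketch does not check robots.txt or judge whether the resulting pages are worth indexing.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, fields, limit=10):
    """Build GET URLs from combinations of candidate values.

    `fields` maps each form input name to a list of values to try,
    for example words taken from the site or the options of a select menu.
    """
    names = sorted(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        if len(urls) >= limit:
            break  # Cap the number of generated URLs to avoid a combinatorial blow-up.
        urls.append(action + "?" + urlencode(list(zip(names, combo))))
    return urls
```

For instance, `candidate_urls("http://example.com/search", {"q": ["flash", "track"], "sort": ["date"]})` yields one URL per value of `q`, each with `sort=date` appended. Since the number of combinations is the product of the value counts across inputs, the cap matters as soon as a form has more than a couple of fields.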
It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never — and should never — get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.
It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That’s mildly disappointing as we were looking forward to befriending Googlebot on MySpace…