Archive for the 'Programming' Category

Google, Yahoo spiders can now crawl through Flash sites

As anyone who has had the pleasure of doing web design and development through marketing agencies knows, Flash tends to be wildly popular among clients and wildly unpopular among, well, pretty much everyone else. Part of the reason is that Flash is so inherently un-Googleable; anything that goes into a Flash-only site is basically invisible to search engines and, therefore, to the world. That will no longer be the case, however, as Adobe announced today that it has teamed up with Google and Yahoo to make Flash files indexable by search engines.

This announcement has been a long time coming, as Flash developers have been wishing for ways to make their content searchable for close to a decade. Adobe acknowledges this in its announcement, saying that although search engines are able to index static text and links within Flash SWF files, “[Rich Internet Applications] and dynamic Web content have been generally difficult to fully expose to search engines because of their changing states—a problem also inherent in other RIA technologies.”

This announcement may also result in some major usability changes (for the better) for Flash on the web. In a post to its Webmaster Central Blog, Google wrote that it can now index all kinds of textual content in SWF files, like that included in Flash gadgets, buttons, menus, entirely self-contained Flash web sites, “and everything in between.” Google can now also follow URLs embedded within Flash files to add to the crawling pipeline. This new indexing technology does not, however, include FLV files (video files that are found on sites like YouTube) because those are generated as videos and don’t contain any text elements like an SWF file does.

Google says it’s able to do this by developing an algorithm that “explores Flash files in the same way that a person would,” by clicking buttons and manually going through Flash content. “Our algorithm remembers all of the text that it encounters along the way, and that content is then available to be indexed,” wrote the company. “We can’t tell you all of the proprietary details, but we can tell you that the algorithm’s effectiveness was improved by utilizing Adobe’s new Searchable SWF library.”
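Google has not published how its algorithm works, so the following is only a minimal sketch of the general idea of "clicking through" content and remembering text along the way. It assumes a Flash site can be modeled as a graph of screens connected by button clicks; the `SITE` data and the `collect_text` function are hypothetical and exist purely for illustration.

```python
from collections import deque

def collect_text(states, start):
    """Breadth-first walk over a toy state graph, gathering text once per state.

    `states` maps a state name to (text_on_screen, [states reachable by a click]).
    """
    seen = {start}
    queue = deque([start])
    collected = []
    while queue:
        name = queue.popleft()
        text, clicks = states[name]
        collected.append(text)
        for target in clicks:
            # Skip screens we have already visited so loops terminate.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected

# Hypothetical Flash site: each button click leads to another screen.
SITE = {
    "home": ("Welcome", ["menu", "about"]),
    "menu": ("Products", ["home", "detail"]),
    "about": ("About us", ["home"]),
    "detail": ("Widget specs", []),
}

print(collect_text(SITE, "home"))
```

Tracking visited screens is what keeps the walk from looping forever when buttons lead back to earlier screens.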

Of course, Google (and eventually Yahoo) won’t be able to index everything embedded within a Flash file, at least not yet. Anything that is image-related, including text that is embedded into images, will be invisible to the search engines for the time being. Google also noted that it can’t execute certain types of JavaScript that may be embedded into a Flash file, and that while it indexes content that is contained in a separate HTML or XML file, that content won’t be counted as part of the content in the Flash file. These are all issues that are being worked on, however, and are likely to change in the future.

Yahoo is also working with Adobe to index SWF files, but doesn’t appear to be as far along as Google just yet. One player that is noticeably missing is Microsoft, though. From Adobe’s announcement and the language used by Google, it appears as if each search engine has to work with Adobe to make this possible—meaning that Microsoft has either been excluded by Adobe for this round or has decided to voluntarily sit this one out. Either way, with searchable SWF files down, usability experts can now focus all of their attention on other Flash-related concerns, like blatant design perversion and excessive animation abuse.


Firefox 3 and Safari 4 in browser speed race

Most of today’s web sites and web applications are built using the JavaScript scripting language. Some may say that a trend towards the fine-tuning of JavaScript interpreters in modern browsers was just a matter of time, since any such optimization translates into performance gains. Mozilla recently launched the browser speed race with Firefox 3, which is faster than any previous Firefox version. Apple answered with Safari 4, claiming the browser’s JavaScript engine has been accelerated by 53%. Welcome to the browser speed race.

Safari 4 has just been seeded to developers at Apple’s developer conference. Apple claims that the software has a 53% faster JavaScript engine than the current version 3.1 (based on the SunSpider JavaScript performance test conducted on an iMac with an Intel Core 2 Duo processor at 2.8 GHz and 2 GB of RAM, running Mac OS X Snow Leopard). Although Firefox 3 RC3 was the first to deliver significant JavaScript performance improvements, Apple apparently is exceeding those gains with Safari 4.

Apple uses a new and improved JavaScript interpreter code-named SquirrelFish, which is provided on an open-source basis from the WebKit project, the same organization that makes the open-source engine used by Safari to render web pages. According to the WebKit project, the SquirrelFish engine is 1.6 times faster than the JavaScript engine in Safari 3.1.

SquirrelFish does its magic by compiling JavaScript source into so-called bytecode, a compact instruction format that is much better suited to run-time execution than the original script text, which is longer and more complicated to interpret and is therefore slower.
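To make the bytecode idea concrete, here is a minimal sketch in Python that flattens a small arithmetic expression into a list of simple instructions and then executes that list. It is not a description of SquirrelFish's internals: the instruction names (`PUSH`, `BINARY`), the tuple-based expression format, and the stack-machine design are all assumptions chosen for brevity.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def compile_expr(node, code=None):
    """Flatten a nested tuple expression like ("+", 1, ("*", 2, 3)) into bytecode."""
    if code is None:
        code = []
    if isinstance(node, tuple):
        op, left, right = node
        compile_expr(left, code)
        compile_expr(right, code)
        code.append(("BINARY", op))
    else:
        code.append(("PUSH", node))
    return code

def run(code):
    """Execute the flat instruction list on a simple operand stack."""
    stack = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        else:  # "BINARY": pop two operands, push the result
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()

program = compile_expr(("+", 1, ("*", 2, 3)))
print(program)
print(run(program))  # prints 7
```

The point of the sketch is that the expensive work of understanding the nested structure happens once, at compile time; running the program is then a tight loop over flat instructions.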

Why JavaScript performance matters
Most of today’s web applications and web 2.0 sites rely on the JavaScript scripting language, originally created by current Mozilla CTO Brendan Eich while he was employed by Netscape. JavaScript acts as the glue that connects a user interface rendered in a web browser with a database and programming logic running on a web server. The browser’s JavaScript engine is solely responsible for interpreting and executing JavaScript commands embedded in HTML code. As a result, the performance of a browser’s JavaScript engine directly affects the performance and responsiveness of a web application, and with it the user experience.

The fact that many applications grow in size and become more bloated with each release means that a browser that can run web applications faster and make user interfaces more responsive on any computer is actually a big deal. You don’t need any special market forecasting talent to predict that this trend may affect browser market shares: speed can directly translate into more usability for most of us. Clearly, JavaScript handling is on its way to becoming a powerful weapon in the browser market.

SpiderMonkey, SquirrelFish, Tamarin and more
Mozilla was the first to introduce significant speed gains with Firefox 3 beta 5 (the final version is expected to ship by mid-June). Firefox uses its Gecko engine to render web pages, which is generally considered to be slightly slower than Safari’s WebKit, which in turn is largely responsible for the “fastest browser in the world” status Safari enjoys. Firefox’s JavaScript implementation is based on Mozilla’s own decade-old SpiderMonkey technology, which many considered to be the fastest JavaScript interpreter until SquirrelFish came out.

Although still in beta, Firefox 3 scored with many reviewers, who are praising the browser’s performance improvements, with WSJ’s Walt Mossberg declaring the browser a “winner.” But now that the SquirrelFish/Safari combination appears to be offsetting the speed gains in Firefox 3 and may set a new benchmark, we can expect more direct competition between Mozilla and Apple. Mozilla has plans to expand SpiderMonkey with Adobe’s JavaScript engine called Tamarin, included in Flash 9, which has a so-called “tracing” feature designed to enable faster code execution. However, SunSpider JavaScript benchmark results indicate that SquirrelFish is at least 1.9 times faster than Tamarin.

Mozilla plans to wedge Tamarin into Firefox and match the APIs of both technologies. “There are areas in which SpiderMonkey is faster than Tamarin and areas where it’s not. We’re looking to build hybrids that are best-of-breed for both worlds and we’re going to pull those into the Firefox release when ready,” Mozilla co-founder Mike Shaver recently said.

Can IE8 compete?
The big variable in this game is Microsoft’s Internet Explorer 8, currently in beta 1 phase. IE8 is expected to deliver speed gains in JavaScript performance as well. However, Microsoft is facing a tough task. The fact that the software giant is often criticized for delivering bloated and inefficient software certainly doesn’t help. In our tests, the first beta of IE8 shows no noticeable speed gains in running web applications.

Quite the opposite is the case, actually. Websites and web applications run noticeably slower than in IE7. The whole browsing experience generally appears to be less responsive. Of course, IE8 is in an early development stage and you can bet Microsoft is going to tweak its performance. The only problem is that the software giant will have to work to raise the stakes in the browser race. If IE8 under-delivers, the market could respond with further market share erosion for IE. It is evident now that JavaScript engine performance has become a key metric in the newest race for the title of fastest browser.

The battle ahead is nicely summed up by Mozilla co-founder Mike Shaver, who said the following: “They [Apple] have dropped SquirrelFish in now and got a big speed up there. We’ve got more coming on our side. You’ll see this leapfrog pattern over and over. We’re not going to let anybody slack on that and the other browser vendors need to keep up, too.”

According to Net Applications, Firefox 3 captured almost one fifth (18.41%) of the browser market in May, followed by Safari 3.1, which hit 6.25%. Microsoft’s Internet Explorer continues its slow but steady decline, ending up at 73.75% in May. Microsoft has scheduled the second beta of IE8 for an August release, with the final release generally expected in the fourth quarter of this year.


Twitter at scale: Will it work?

Only two days ago the contact messaging application Twitter suffered another bout of downtime, leaving some users frustrated and others asking why the platform continues to suffer problems.

TechCrunch recently spoke to an individual who is familiar with the technical problems at Twitter as well as the challenges that lie ahead for the startup. He reiterated his belief that the problems lie not with Blaine Cook (the former head of engineering who was shown the door), nor with NTT (their host), but with the early lack of understanding of how complex their problems would be.

The issue is that group messaging is very difficult to achieve at a grand scale. Other large sites such as WordPress and Digg are mostly dealing with known problems, such as how to serve a large number of pages or a large number of images. Twitter is unique in that it needs to parse a large number of messages and deliver them to multiple recipients, with each user having unique connections to other users.

Social networks have similar complexity issues, but they usually only need to route a message to a single user (or at most to a defined group). Even so, social networks like Friendster struggled for years with technical and scaling issues. Twitter is specifically dealing with text messages, and in most cases with active users those messages are very frequent and go out to hundreds of contacts (or followers, as they are referred to in Twitter). Every new Twitter user and every new connection results in an exponentially greater computational requirement.
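To get a feel for why delivery to followers is costly, here is a minimal sketch that copies each message into the inbox of every follower and counts the work done. It does not reflect how Twitter is actually built, which the source does not describe; the `fan_out` function and the sample data are hypothetical. In this simplified model, total delivery work is the sum of each author's follower count over all messages sent.

```python
from collections import defaultdict

def fan_out(followers, messages):
    """Deliver each (author, text) message to every follower's inbox.

    Returns the inboxes and the total number of deliveries performed.
    """
    inboxes = defaultdict(list)
    deliveries = 0
    for author, text in messages:
        for user in followers.get(author, ()):
            inboxes[user].append((author, text))
            deliveries += 1
    return dict(inboxes), deliveries

# Hypothetical follower graph and message stream.
FOLLOWERS = {"alice": ["bob", "carol", "dave"], "bob": ["alice"]}
MESSAGES = [("alice", "hello"), ("bob", "hi"), ("alice", "again")]

inboxes, deliveries = fan_out(FOLLOWERS, MESSAGES)
print(deliveries)  # prints 7: two messages to 3 followers, one message to 1
```

Three messages already cause seven deliveries here; with frequent posters and hundreds of followers each, the multiplier grows quickly.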

Some of the best web applications are able to efficiently solve very complex problems to produce simple results for users (e.g., Google). The success of these applications is due to the innovative efforts by developers to solve large technical challenges, where they have often had to break new ground for solutions. For Twitter to reach a similar point of reliability, they too will need a very comprehensive, ground-breaking solution.

The source that I spoke to also commented on how ill-prepared the Twitter team was and is for its current and future challenges. The small team contains a handful of engineers, with only a person or two committed to infrastructure and architecture. He goes on to point out that at Digg the team for network and systems alone is bigger than the total engineering team at Twitter, and that at Digg they are led by well-known “A-list rockstars”.

The problems at Twitter are often attributed to their use of Ruby on Rails, a web development framework. Twitter is almost certainly the largest site running on Rails, so fans of the framework and its developers have been quick to deflect the criticism and point it back at the engineers at Twitter. Using a framework that has never conquered large-scale territory must certainly add to the risk and work required to find a solution. As an out-of-the-box framework, Rails certainly doesn’t lend itself to large-scale application development.

Rails enabled Twitter to be developed quickly, to get to launch quickly and then to add new features relatively rapidly as well. But the old adage of “Good, Fast, Cheap - pick two” certainly applies, and Rails would do itself no harm by conceding that it isn’t a platform that can compete with Java or C when it comes to intensive tasks. Twitter is at a crossroads as an application and Rails has served its purpose very well to date, but you are unlikely to see a computational cluster built with Ruby at Apache any time soon.

What we see at Twitter today is a very useful and popular service, but one with very complex underlying technical challenges to overcome. Twitter will require not only a new architecture approach and a big injection of the best minds they can find ($15 million can help), but will also need a little patience from users and those of us observing.


Googlebot crawls through HTML forms

Google will stop at nothing in its quest to index the world’s information. Last year it ate through 100 exabytes of data, but there’s still a lot that it can’t get access to. This is known as the deep web (or hidden web, or invisible web, etc.), and it is estimated that the majority of online data is hidden safely from Google’s prying eyes: private intranets, unlinked pages, some non-textual content, and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.

“For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made,” explained Jayant Madhavan and Alon Halevy in a blog post. “If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
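The process described in that quote, choosing values for each input and generating the URLs a user's query would have produced, can be sketched roughly as follows. This is only an illustration of the idea for GET forms, not Googlebot's actual logic; the `candidate_urls` function, the field names, and the example.com address are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action, fields):
    """Build one GET URL per combination of candidate field values.

    `fields` maps an input name to the list of values to try for it.
    """
    names = list(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls

# Hypothetical form: a text box seeded with words from the site, plus a select menu.
FORM_FIELDS = {
    "q": ["flash", "crawler"],
    "category": ["news", "blog"],
}

for url in candidate_urls("http://example.com/search", FORM_FIELDS):
    print(url)
```

Because the number of combinations multiplies with every extra field, a real crawler would need some way to prune them; the quote's mention of checking whether a result is "valid, interesting, and includes content not in our index" hints at that kind of filtering.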

Google, which says that the crawling of dynamic form results doesn’t affect the “crawling, ranking, or selection of other web pages in any significant way,” also assured webmasters today that their enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won’t be crawled.
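For webmasters who want to check what a given robots.txt would block, Python's standard `urllib.robotparser` module can evaluate rules locally. The sketch below assumes a site that disallows its `/search` form results; the robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of search form results.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

def allowed(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/search?q=flash"))  # prints False
print(allowed("http://example.com/about"))           # prints True
```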

It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never — and should never — get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.

It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That’s mildly disappointing as we were looking forward to befriending Googlebot on MySpace…


Amazon takes on Oracle and IBM with SimpleDB

Companies can now go ahead and fire their expensive database administrators, those engineers who keep the Oracle or IBM databases humming. Amazon has just added an enterprise-class database called SimpleDB to its suite of cloud-based IT infrastructure, which also includes storage (S3) and computation (EC2) available by the drink. Today, Amazon is taking sign-ups for the SimpleDB beta, which should start in a few weeks. As it points out on the new SimpleDB page:

Amazon SimpleDB is a web service for running queries on structured data in real time. This service works in close conjunction with Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), collectively providing the ability to store, process and query data sets in the cloud. These services are designed to make web-scale computing easier and more cost-effective for developers.

Traditionally, this type of functionality has been accomplished with a clustered relational database that requires a sizable upfront investment, brings more complexity than is typically needed, and often requires a DBA to maintain and administer. In contrast, Amazon SimpleDB is easy to use and provides the core functionality of a database - real-time lookup and simple querying of structured data - without the operational complexity. Amazon SimpleDB requires no schema, automatically indexes your data and provides a simple API for storage and access. This eliminates the administrative burden of data modeling, index maintenance, and performance tuning. Developers gain access to this functionality within Amazon’s proven computing environment, are able to scale instantly, and pay only for what they use.

This will be especially attractive for Web startups. Amazon has just taken another major infrastructure cost off the table for them. Relational databases are expensive to buy and maintain. Whatever features or performance SimpleDB lacks, it should make up for in price. Amazon wants to democratize the database by making it available to more businesses, and even individuals, thus leveling the playing field between big companies and startups even more.

And since SimpleDB operates at Web scale, larger companies will wake up to the cost saving opportunities of such a service as well. IBM, for one, is already trying to preempt any customer defections with its copycat Blue Cloud initiative. If speed is of the essence, you might still want to keep your database on your own servers. But the Web is where most software will one day live, whether consumer or enterprise. And Amazon’s got nothing to lose by speeding that day along.

Pricing for SimpleDB is as follows:

Machine Utilization - $0.14 per Amazon SimpleDB Machine Hour consumed

Data Transfer
$0.10 per GB - all data transfer in
$0.18 per GB - first 10 TB / month data transfer out
$0.16 per GB - next 40 TB / month data transfer out
$0.13 per GB - data transfer out / month over 50 TB

Data transfer “in” and “out” refers to transfer into and out of Amazon SimpleDB. Data transferred between Amazon SimpleDB and other Amazon Web Services is free of charge (i.e., $0.00 per GB).

Structured Data Storage - $1.50 per GB-month
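To see how these prices add up, here is a rough monthly cost estimate built from the figures listed above. It is only a sketch: the `monthly_cost` function is hypothetical, it assumes 1 TB = 1024 GB for the transfer tiers, and it ignores any charges or free allowances not mentioned in this post.

```python
# (tier size in GB, price per GB) for data transfer out; assumes 1 TB = 1024 GB.
TRANSFER_OUT_TIERS = [
    (10 * 1024, 0.18),
    (40 * 1024, 0.16),
    (float("inf"), 0.13),
]

def monthly_cost(machine_hours, gb_in, gb_out, gb_stored):
    """Estimate a month's SimpleDB bill in dollars from the listed prices."""
    cost = machine_hours * 0.14 + gb_in * 0.10 + gb_stored * 1.50
    remaining = gb_out
    for size, price in TRANSFER_OUT_TIERS:
        # Charge each tier only for the portion of traffic that falls in it.
        used = min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return round(cost, 2)

# 100 machine hours, 50 GB in, 200 GB out, 10 GB stored:
# 14.00 + 5.00 + 36.00 + 15.00
print(monthly_cost(100, 50, 200, 10))  # prints 70.0
```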
