Category Archives: EC2

Expand Excel’s horizons – look to the cloud

http://www.flickr.com/photos/donabelandewen/470780785/I’ve added a new facility to microETL’s SQLScript  – the TIMER command.

 

 

 

TIMER takes up to 4 arguments:

  • 1st, the SQLScript to call.
  • 2nd, an optional SQLScript to call when the primary script finishes or is cancelled.
  • 3rd, the number of  milliseconds secondsto wait between each call to the primary script, defaults to 1.
  • 4th, the mode of the controlling user form; MODELESS (the default) users may continue to navigate about the workbook or MODAL, the user must exit the form before any access is allowed to the workbook. Note: in MODELESS mode, if the user attempts to change any workbook cells, the TIMER (and any other TIMERs running) will stop.

So how to demonstrate TIMER in action?

I thought I’d set my sights to a further horizon, leave the humble laptop behind; I’d go all modern and use the cloud.

A conversation with Giles Thomas of  Resovler One and now Dirigible fame (the first a desktop Python-power spreadsheet, the other a cloud-based one), lead to my discovery of PiCloud. This is an Amazon EC2 based service that allows the execution of Python code remotely; handling all the messy business of provisioning of servers and marshalling of code and results.

It is not only easier than using EC2 directly but can also be much cheaper to use, as PiCloud charge by the second (rather than EC2s “use any portion of an hour, pay for the full hour” payment terms).

I thought I’d give it a go.

I used the same data and web site as Daniel Ferry’s VBA ‘threading’  example; a list of properties to check against the realestateabc.com “What’s your home worth?” site; returning various bits of information such as estimated value, size, last sale date and price achieved etc.

Being Python I used the wonderful Beautiful Soup package to extract the needed data from the site’s HTML. I also submitted a separate request for each property, but if this were for real, I would most likely submit fewer jobs to handle multiple properties as most of the cost of each job in this example was probably due to set-up and tear down rather than the url fetching and parsing. Not to mention the time to submit 100 jobs in the first place.

Above you can see the two SQLScripts involved. The LaunchCloud script submits the jobs and makes a call to TIMER to run the CloudChecker script every second. The CloudChecker script checks for results from PiCloud and pastes them back to the source table. When no more results due, it issues a STOP command which forces the TIMER to stop.

Above is the pyGetDetails script which is passed to PiCloud (via the cloud.call() statement) to do the actual fetching and parsing of the URL associated with each property.

Download latest version of microETL here. The workbook is called PiCloud (2007/2010 format), requires Python 2.7 and a subscription to PiCloud.

Advertisements

The 64-bit question, and the birth of MicroETL

Like the original $64 question, I have for a long time being pondering the Excel 64-bit question. Whether “To take it, or leave it”?

When first announced, I believed 64bit Excel would only be of interest to a minority of demented quants in investment banks hacking their way to yet another evil model to bankrupt the world. The problem of incompatible add-ins, COM controls etc. would also make its widespread adoption less likely.

But three things changed my mind:

  • first, the appearance of Windows 7 powered 64-bit PCs on many worker’s desks
  • second, my embedding of SQLite into Excel which enables me to effectively use the extra memory available to a 64-bit instance to hold, analyse and transform a huge amount of raw relational data (in the form of a “:memory:” database)
  • and finally, the arrival on the scene of PowerPivot.

Although a PowerPivot workbook doesn’t allow more than 4Gig of memory to be addressed (this is a SharePoint restriction, as a 4G workbook is likely to compress down to <= the 2G upload limit imposed by SharePoint), this represents a doubling of the 32bit limits in memory terms, but much more in raw data terms as PowerPivot can achieve up to 10x-20x times data compression.

So, I decided to bite the bullet; fired up an AWS 64-bit image and started to upgrade my xLite code-base to handle this new world.

I decided on the TDN-GCC MinGW-w64 distro to re-compile 64-bit versions of my SQLite wrapper  libraries, and set about converting  my VBA code to handle both 64-bit and 32-bit dlls. So far, easy peasy.

The problems started when I went to convert my JavaScript & Python wrappers. I couldn’t get JSDB to compile to either 32-bit or 64-bit using MinGW-w64, so I abandoned the attempt as my main use of JSDB is as a “command-line data-crunching tool” similar to, and used along side, CSVFix and SFK (Swiss File Knife). Having the ability to call JavaScript in-process was relatively new and I could live without it.

More worrying was the problem with Python embedding, in-process Python would be harder to live without. The APSW SQLite library which I had used is 32-bit only. Luckily the Python 2.7 version of its SQLite3 standard module  (aka pysqlite) now allows the loading of SQLite C extensions. This enables me to load databases under Python and extract the C handle using a specially written extension, thus enabling Excel/VBA to attach to a Python opened SQLite shared memory.

Without this new  loaded extension trick I would have had to modify the SQLite3 library itself as the previous 2.6 version would not divulge SQLite’s handle, while APSW did, which was  the reason I picked APSW in the first place.  As a result, xLite’s Python functionality now requires Python 2.7.

So the end result is a somewhat reduced-functionality xLite  (no in-process JavaScript, missing APSW functionality such as Python-based SQLite virtual tables), but running on the ultimate data-smithing platform:

  • Excel 64-bit combining with
  • the relational power of SQLite 64bit,
  • the fast and easy scripting power of Python 64-bit,
  • and topped off with the king of pivots, 64-bit PowerPivot.

A week well spent, I think!

Oh, I’ve also decided to rename the add-in to MicroETL. This is mainly to reflect the fact that xLite (SQLite embedded in Excel) is just one element within the tool. The  pure-VBA detached-ADO recordset functionality, embedded Python and xLiteScript elements can exist independently of xLite.  That, and I’m fed-up getting enquiries about X-Lite http://www.counterpath.com/x-lite.html and I own the http://www.microETL.com URL 😉

Windows on EC2 = SMEs on EC2

The announcement that Win2003 is now an an option on EC2, is very significant, that and EC2’s exit from beta status with an SLA in tow, means that AWS is now very much more appealing to the great unwashed, the SMEs. i.e. the businesses who form the backbone of most of our economies.

Large companies and start-ups are comfortable in the world of Linux servers but most small companies are Windows to the core.  This may not be “right”, this may not be how it “should be”, but it is so.   Even within large companies, departmental computing is largely a Windows only enclave, with MS Office (and Excel in particular) as the backbone and MS SQL Server as the database of choice (or is that, no choice).

The other interesting thing is that my fear that EC2 SQL Server Standard instances would be licensed as per Oracle has not come to pass (Oracle while making a “big thing” of their recent EC2 cloud conversion, still insist on traditional licensing for EC2 database instances). SQL Server Standard is available on a pay-as-you-go model, brilliant!.

Even if running Win2003 as a server doesn’t catch your fancy and in fact you would much rather get rid of your existing Window’s laptop to be replaced by a cool new Apple Mac. Unfortunately you still need the ability to run Windows-only software, why not use EC2 as your on-demand pay-as-you-go Window’s desktop replacement?  Simply configure a Windows AMI with your required software (you may have to use something like this, if software is only available on CD); you could then use Jungle Disk to easily share data (via S3) between your new shiny Mac and the AMI.  Power up and down as required, easier than using VMWare or Parallels and @ 12.5c per hour, probably cheaper too.

Clouds no longer pass by Windows.

Amazon today announced that later this year, Windows Server woud be available on EC2. No details on cost and licensing etc. but this is major.  Up until now, that portion of the business world who are pure MS shops (a very large percentage especially amongst SMEs) were excluded from taking advantage of Amazon’s amazing (and getting more amazing everyday) EC2 platform

From my point of view, as with Oracle’s announcement last week, this releases yet more of my “legacy” skillset for deployment in the clouds. Although I’ve been involved with  *nix servers for 20 years or so, as corporate servers became more locked-down (and removed to the control of 3rd party data centres) I lost day-to-day experience of using them; in latter years my main ‘hands-on’ platform was Windows, either my own PC or local departmental NT servers. Windows on EC2 will allow me to use a whole new set of Windows only software (e.g. RSSBus or XLsgen) and of course SQLServer.

The lack of SQLServer on EC2 has been a major problem for me as a datasmith; there’s an awful lot of data out there sitting in SQLServer databases, but currently if I need to “cloud burst” such datasets I would have to first extract the data to, say, csv files and then load the data on to a Linux compatible database. But with a SQLServer instance running in the cloud, I could simply use SQLServer’s native backup/replication tools.  No more need to download data to my “ground-based” PCs resulting in quicker turnaround and fewer data security risks.

On the licensing front,  I’m presuming that the OS licence will be on a pay-as-you-go basis, but what about SQLServer and other server products?  Will MS do an Oracle on it, i.e. require a traditional upfront use-it-or-lose-it payment or will they the go the radical (but I thing inevitable) path of a licence-by-the-hour. 

First RedHat, then Sun, then Oracle and now Microsoft; the mighty beasts of our industry have acknowledged there’s a new mighty beast on the prowl, dressed as a humble bookseller no less!

Amazon’s SAN in the cloud is a mirage…

This morning I got very excited.  While quickly scanning the headlines of the 1000+ unread feeds that had accumulated in my Google Reader this week, one heading in particular caught my attention, “Amazon Elastic Block Store goes live!“.

The post from the Right Scale folks gives a detailed overview of the new  Amazon ‘SAN storage in the cloud’ service, aka Elastic Block Store, aka EBS.  Alas, this particular cloud offering was a mirage, the post was subsequently removed (but can still be viewed on Robert Scoble’s Shared Items) it seems the post was a work-in-progress and not intended for publishing, yet!

Why was I so excited?  Amazon EC2 had two major shortcomings when it launched 2 or so years ago; the first, ephemeral IP addresses, was solved by the new Elastic IP feature; the second, ephemeral storage volumes (when you shutdown an instance the disks are wiped!) is due to be solved by EBS.  With both of these problems solved, EC2, already near perfect, would be perfect.

The article does a good job of explaining the new service…

EBS starts out really simple: you create a volume from 1GB to 1TB in size and then you mount it on a device on an instance, format it, and off you go. Later you can detach it, let it sit for a while, and then reattach it to a different instance. You can also snapshot the volume at anytime to S3, and if you want to restore your snapshot you can create a fresh volume from the snapshot.

The thing that caught my eye in the above paragraph was the snapshot facility.  Snapshots are to be stored on S3 via an EC2-specific incremental-snapshot API.  This means the volumes will come with a built-in back-up facility. This is important as EBS drives reside in one availability zone (that of the instance that they are mounted against) and do not have the data replication security offered by S3.  It also means that disk systems can be restored quickly and simply from snapshots without the overhead  (and bugs!) of writing an S3 specific incremental backup and restore utility.

Back to waiting…

UPDATE: 20th August

Wait over…