Friday, July 31, 2009

Tools: experiment with many, master a few of them

I am admittedly a tinkerer: I really enjoy reading other people's code and experimenting with new languages and technologies. That said, I make an effort to truly master only a few technologies: Ruby, Java, and Lisp for programming languages (my paid work is split fairly evenly between them), and I specialize in AI, cloud deployments, and Rails and Java-based web apps.

I may be in the process of adopting two new core technologies. I have been using relational databases since 1986 and have a long-term liking for PostgreSQL (less so for MySQL). The "NoSQL" meme has become popular with a lot of justification: for many applications you can get easier scalability and/or better performance on a single server using other types of data stores. Google's AppEngine datastore (built on their Bigtable infrastructure) is clearly less convenient to develop with than a relational database, but it may be well worth the extra effort for the scalability and very low hosting fees.

I have been spending a fair amount of time using the AppEngine data store this year, and more recently I am learning to effectively use Tokyo Cabinet with Ruby and Java. Different non-relational data stores have different strengths and weaknesses and in an ideal world I would learn several. In a practical world, this is not possible, so I am looking at Tokyo Cabinet as my "new PostgreSQL" - very much good enough for a wide range of projects where a relational database is not a good fit.
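To make the comparison concrete, here is a sketch of the key-value access pattern these stores encourage. Since the tokyocabinet gem may not be installed everywhere, this example uses Ruby's standard-library PStore as a stand-in; Tokyo Cabinet's Ruby binding follows a similar open/put/get shape, but with an on-disk hash or B+ tree file instead of marshaled Ruby objects. The keys and record contents here are made up for illustration.

```ruby
require 'pstore'
require 'tmpdir'

# Open a persistent key-value store backed by a file on disk.
path  = File.join(Dir.mktmpdir, 'notes.pstore')
store = PStore.new(path)

# Writes happen inside a transaction; no schema, just keys and values.
store.transaction do
  store['article:1'] = { title: 'NoSQL stores', tags: %w[scalability performance] }
  store['article:2'] = { title: 'PostgreSQL tips', tags: %w[sql] }
end

# Read-only transaction: fetch by key, or list all keys.
store.transaction(true) do
  puts store['article:1'][:title]  # => NoSQL stores
  puts store.roots.sort.inspect    # => ["article:1", "article:2"]
end
```

The appeal over SQL for this kind of data is clear: no table definitions, no migrations, and values can be arbitrary structured objects.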

I was talking with a new customer yesterday about Ruby deployments and efficiency issues in general. Ruby is a "slow" language, but Rails apps do not need to run slowly, since so much of the heavy lifting is done by highly efficient external processes like Sphinx, PostgreSQL, Tokyo Cabinet, nginx, etc.

Monday, July 27, 2009

If you have a Google Wave account, then try my Robot extension

I wrote a Robot extension that reads text on a wave (and any child blips added to it), and adds its own child blips with some NLP analysis (entity extraction and auto-tagging). No big deal, but fun to write.

Give it a try: knowledge-books@appspot.com

Saturday, July 25, 2009

Have fun with the AMI containing the examples from my latest Ruby book

I have prepared an Amazon Machine Image (AMI) with most of the examples in my Ruby book Scripting Intelligence: Web 3.0 Information Gathering and Processing. Because I will be periodically updating the AMI, you should search for the latest version. This is simple to do: after you log in to the Amazon Web Services (AWS) Management Console, select “Start an AMI,” then choose the “Community AMIs” tab and enter markbookimage in the AMI ID search field. Choose the AMI with the largest index.

I have Ruby, Rails, Sesame, Redland, AllegroGraph, D2R, Hadoop, Solr, PostgreSQL, Tomcat, Nutch, etc. pre-installed and configured. I use this AMI for new projects and for new experiments because it contains most of the tools and frameworks that I use. If you know how to use Amazon AWS, it is easy to clone your own copy with whatever additional software you need, hook up a persistent disk volume, etc. If you have not yet learned how to effectively use AWS, this might be a good time to do so. I like AWS because I can start a server, use it for development all day, and have it cost less than a dollar.

There are README files for the examples. For more details on how the examples work, please consider buying my book if you have fun with the AMI.

Writing Wave robots that use blip titles and text

If you follow the Java Wave robot tutorial, getting started is reasonably easy. It took me a short while to figure out how to access the titles and text of both new root blips (i.e., the start of a new Wave) and child blips (i.e., new blips added to a root blip). Here is some code where I reworked the example code (this is in the servlet that handles incoming JSON-encoded messages from the Wave platform):
  public void processEvents(RobotMessageBundle events) {
    Wavelet wavelet = events.getWavelet();

    if (events.wasSelfAdded()) {
      Blip blip = wavelet.appendBlip();
      TextView textView = blip.getDocument();
      textView.append("I'm alive and ready for testing");
    }

    for (Event event : events.getBlipSubmittedEvents()) {
      // some of my tests:
      Blip blip = event.getBlip();
      if (!blip.getBlipId().equals(wavelet.getRootBlipId())) {
        String text = blip.getDocument().getText();
        makeDebugBlip(wavelet, "blip submitted events: child blip: " + text);
      }
    }

    for (Event event : events.getEvents()) {
      Blip eventBlip = event.getBlip();
      // from original example:
      if (event.getType() == EventType.WAVELET_PARTICIPANTS_CHANGED) {
        Blip blip = wavelet.appendBlip();
        TextView textView = blip.getDocument();
        String s3 = textView.getText();
        textView.append("Hello, everybody - a test... " + s3);
      }
      if (eventBlip.getBlipId().equals(wavelet.getRootBlipId())) {
        String title = wavelet.getTitle();
        String text = eventBlip.getDocument().getText();
        makeDebugBlip(wavelet, "blip submitted events: root blip: title: " + title + " text: " + text);
      }
    }
  }

  public void makeDebugBlip(Wavelet wavelet, String text) {
    Blip blip = wavelet.appendBlip();
    TextView textView = blip.getDocument();
    textView.append(text);
  }
There are APIs for changing the data in blips; this example simply adds new child blips to a wave. Note that child blips do not have their own titles. Here I am dealing with blips that are just text, but blips can also contain images, video, sound, etc.

One bit of advice: the Wave platform is definitely cool (in many ways), but it is being actively developed and modified. Sometimes things just stop working for a while, so I have adopted the practice of walking away from my Wave development experiments for a few hours when my robots stop working. Twice, things simply started working again with no changes on my end. For debugging robot code (Java or Python AppEngine web apps), make sure to enable DEBUG output in logging: the AppEngine Logs page is your new friend.

Thursday, July 23, 2009

Wave may end up being the new Internet coolness

I continue having fun "kicking the tires." I do wish that I had a completely local Wave robot development environment, but I expect that will be forthcoming. The edit, compile, run cycle takes a while because I need to:
  • Modify robot code
  • Build and upload the code to Java AppEngine
  • Create new test waves, invite the robot, etc.
The development cycle for Gadgets is quicker: you can simply edit a Gadget XML file remotely on whatever server you use to publish it.

I am having a bit of an AppEngine performance issue. I am used to being able to cache (reasonably) static data in memory (loaded from JAR files in WEB-INF/lib). With AppEngine, your web app can run on any server, and web app startup time should be very quick (and loading data into memory from JAR files at startup is not quick). I am not happy about it, but I may keep frequently used static data in the data store instead. I don't think that using JCache + memcached is an option, because if I look up a key and it is not in memcached, I don't know whether the key is undefined or has simply expired from memcached.
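For what it's worth, one common workaround for that ambiguity is negative caching: store a sentinel value meaning "looked up and known absent," so that a nil from the cache always means "ask the backing store again." It does not fully solve the problem - the sentinel entry can itself expire - but it narrows the window. Here is a minimal sketch using a plain Hash as a stand-in for a memcached client (a real client would also set expiration times); the names and data are invented for illustration.

```ruby
# Sentinel marking "we already checked; this key is known to be absent".
NOT_FOUND = '__not_found__'

def lookup(cache, backing_store, key)
  cached = cache[key]
  return nil if cached == NOT_FOUND   # known absent: skip the backing store
  return cached unless cached.nil?    # cache hit
  value = backing_store[key]          # cache miss (or expired): hit the store
  cache[key] = value.nil? ? NOT_FOUND : value
  value
end

backing = { 'ruby' => 'a dynamic language' }
cache   = {}
lookup(cache, backing, 'ruby')     # store hit; value is now cached
lookup(cache, backing, 'fortran')  # absent; sentinel is now cached
```

The design choice is a trade-off: negative caching avoids hammering the data store for keys that are genuinely undefined, at the cost of briefly serving stale "absent" answers if a key is later added.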

Tuesday, July 21, 2009

Google Wave gadgets

The gadget tutorial was easy to follow. I am starting with the stateful counter example and experimenting with that. The makeRequest API can be used to call remote web services from inside gadgets. Other APIs let you process events inside a wave (from user actions, new or changed content, etc.). Cool stuff. There are many gadget containers, but I was never interested in writing gadgets myself until I started experimenting with the Wave platform.

Cool: just wrote my first Google Wave "robot" JSON web service

It is a placeholder for now, but it will eventually use my KBtextmaster code to perform natural language processing on new replies to any wave that has my robot added as a participant. Following these instructions, it took only about 30 minutes to get this going (it would have been 20 minutes, but I first compiled the Java AppEngine JSON web service with JDK 1.5 - after a rebuild with JDK 1.6, everything worked as advertised).

I have been working on the Common Lisp version of KBtextmaster over the last week, and the Java version badly needs a code cleanup as well (both versions contain some of my code going back over ten years). I'll post the public URL for my robot in a week or so, when I get a new version of KBtextmaster plugged in.

Monday, July 20, 2009

Book project, Google Wave, and a kayaking video

Except for some consulting work, my big project is a new book on using AllegroGraph for writing Semantic Web applications. Lots of work, but also a lot of fun.

I received a Google Wave Sandbox invitation today. I am going to try to spend an hour or two a day with Wave to get up to speed. Fortunately, I am 100% up to speed using the Java AppEngine (initially, Wave Robots, etc. get hosted on AppEngine, either Java or Python versions) and I have some experience with GWT - so I should already be in good shape -- but I need to write some code :-)

My wife took a short video of me kayaking yesterday.

Sunday, July 19, 2009

Gambit-C Scheme has become my new C

I might be writing an article about this soon: Scheme is a high-level language - great for all-around development - and Gambit-C can be used (once an application is developed in the very productive Emacs + Slime + Gambit-C environment) to create small and very efficient native applications. BTW, if you use an OS X or Windows installer, also get the source distribution for its examples directory.

In Unix tradition, I like to build a set of tools as command line applications, and Gambit-C is very nice for this.

Saturday, July 18, 2009

Common Lisp RDFa parser and work on my new AllegroGraph book

I am working on a 'three purpose' task this morning: writing an RDFa parser in Common Lisp. I need this for my new book project (semantic web application programming with AllegroGraph), I need this for one of my own (possibly commercial) projects, and to release as an open source project. I am building this on top of Gary King's CL-HTML-Parser, so Gary did the heavy lifting, and I am just adding the bits that I need.

Thursday, July 09, 2009

Measurement promotes success

Computer science involves a lot of effort spent measuring things: profiling code, tracking memory use, looking for inefficiencies in network connections, determining how many database queries are required to render a typical web page in an application, etc.
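In that spirit, measuring code can be as simple as Ruby's built-in Benchmark library. A minimal sketch (the string-building task and iteration count are made up, and the timings will vary by machine and Ruby version):

```ruby
require 'benchmark'

n = 10_000
# Measure CPU time for building one large string by repeated appends.
report = Benchmark.measure do
  s = String.new          # mutable string buffer
  n.times { |i| s << i.to_s }
end
puts format('appending %d chunks took %.4f s of CPU time', n, report.total)
```

Benchmark.measure returns a Benchmark::Tms object, so you can pull out user vs. system time separately instead of eyeballing wall-clock output.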

I have also started measuring something else: how I spend my time. I used to track only billable time and leave time spent learning new languages and frameworks, writing experimental code, etc. unmeasured. I now use a time tracking application on my MacBook to track 16 different categories (billable and learning/research - I also track time on Reddit, Slashdot, etc.). The overhead for these measurements is probably 2 or 3 minutes a day, plus a few minutes to review time spent at the end of a day or week. For me, this is useful information.

Wednesday, July 08, 2009

Continuing to work on my AllegroGraph book

I started this book late last year, but set it aside to write my Apress Ruby book Scripting Intelligence: Web 3.0 Information Gathering and Processing.

I don't think that the market will be large for an AllegroGraph (AG) book, but after using AG on one customer project and experimenting (off and on) with it for several years, I decided that it was a Semantic Web technology worth mastering. AG is a commercial product, but a free server version (supporting Lisp, Ruby, Java, and Python clients) is available that is limited to 50 million RDF triples (a large limit, so many projects can simply use the free version).

AG supports the REST-style APIs of Sesame (an open-source Java RDF data store), so if you stick with SPARQL and RDFS-only reasoning, you keep the portability to switch to a BSD-licensed alternative. That said, my reason for using AG is all of the proprietary extra goodies!

In addition to a few Lisp, Python, Ruby, and Java client examples, I am going to incorporate a lot of useful Common Lisp utilities for information processing that I have been working on for many years: this will motivate me to package up a great deal of my Common Lisp code and release it with an open source license. I plan on releasing the book for free as a PDF file and as a physical book for people who want to purchase it. The book and the open source examples should be available before the end of this year.

Tuesday, July 07, 2009

W3C killing off XHTML2 in favor of HTML5: bad for the Semantic Web?

As a practical matter, HTML5 looks good for writing human facing next generation web applications with multimedia support and more intuitive elements like <header>, <nav>, <section>, <footer>, etc.

The problem that I have with the W3C's decision (assuming that I understand it correctly) is that, in my opinion, the value of the web goes way beyond supporting manual web browsing and enjoying digital media assets. I think that the web should evolve into a ubiquitous decision support system - this needs software agents that can help you no matter whose computer you may be using, what type of small device (phone, web pad) you may be using, etc. In this context, decision support means help in making dozens of decisions each day. User-specific information filters, search agents, and personalized information repositories will require machine-readable data with well-defined semantics.

One approach is to have content management systems like Drupal and Plone publish information in parallel, both:
  • HTML5 web pages for human consumption
  • RDF/RDFS/RDFS+/OWL for consumption by software agents
It is very easy (a few lines of Ruby) to convert either entire databases or subsets from SQL queries to RDF, and since many web pages are created from information in relational databases, it might be OK to use this "dual publishing" scheme. Similarly, data in RDF repositories can be used, instead of relational databases, to publish web pages. However, I would prefer generating one format of web page with semantic information embedded as, for example, RDFa and microformats.
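To give a flavor of that "few lines of Ruby" claim, here is a hypothetical sketch of the database-to-RDF direction: rows from a SQL query (represented as an array of hashes, the shape most Ruby database adapters yield) are mapped to N-Triples. The base URI, table name, and columns are invented for illustration; a real converter would also escape literals and map columns to well-known vocabularies.

```ruby
# Hypothetical base URI for minting subject and predicate URIs.
BASE = 'http://example.com/db'

# Map one row (a hash of column => value) to an array of N-Triples lines,
# using the primary key column to build the subject URI.
def row_to_ntriples(table, pk, row)
  subject = "<#{BASE}/#{table}/#{row[pk]}>"
  row.reject { |col, _| col == pk }.map do |col, val|
    %(#{subject} <#{BASE}/#{table}##{col}> "#{val}" .)
  end
end

# Stand-in for a SQL result set, e.g. from "SELECT id, name, role FROM people".
rows = [
  { id: 1, name: 'Mark',  role: 'author' },
  { id: 2, name: 'Carol', role: 'editor' }
]

triples = rows.flat_map { |row| row_to_ntriples('people', :id, row) }
puts triples
```

The first emitted line is `<http://example.com/db/people/1> <http://example.com/db/people#name> "Mark" .` - a software agent can load such output directly into any RDF store.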

My vision of the web is an increasing amount of data and information that we need customizable software agents or interfaces to cull out just what an individual user needs.

HTML5 needs a well designed notation for embedding extensible semantic information that does not rely on XML's extensibility.

Wednesday, July 01, 2009

PragPub - free monthly magazine for developers

While I love writing, I also like to read other people's efforts. I find that I learn a lot reading code that other people write. I started seriously reading other people's code in the 1970s - a habit I never outgrew. When I read what other people write, in addition to the content I also pay attention to their writing technique: how they introduce a topic, make points, provide examples, the level of detail they use, etc. Check out the new PragPub - good reading.