Tue 13 Apr 2021
Tags: web, serverless, gcp
Let's say you have a list of URLs you need to fetch for some reason -
perhaps to check that they still exist, perhaps to parse their content
for updates, whatever.
If the list is small - say up to 1000 URLs - this is pretty easy to do
using just curl(1) or wget(1) e.g.
INPUT=urls.txt
wget --execute robots=off --adjust-extension --convert-links \
--force-directories --no-check-certificate --no-verbose \
--timeout=120 --tries=3 -P ./tmp --warc-file=${INPUT%.txt} \
-i "$INPUT"
This iterates over all the URLs in urls.txt and fetches them one by one,
capturing them in WARC format.
Easy.
But if your URL list is long - thousands or millions of URLs - this is
going to be too slow to be practical. This is a classic
Embarrassingly Parallel
problem, so to make this scalable the obvious solution is to split your
input file up and run multiple fetches in parallel, and then merge your
output files (i.e. a kind of map-reduce job).
But then you need to run this on multiple machines, and setting up,
managing, and tearing down those machines becomes the core of the
problem. Really, though, you don't want to worry about machines at all -
you just want an operating system instance available that you can make
use of.
This is the promise of so-called
serverless
architectures such as AWS "Lambda" and Google Cloud's "Cloud Functions",
which provide a container-like environment for computing, without
actually having to worry about managing the containers. The serverless
environment spins up instances on demand, and then tears them down
after a fixed period of time or when your job completes.
So to try out this serverless paradigm on our web fetch problem, I've
written cloudfunc-geturilist,
a Google Cloud Platform "Cloud Function" written in Go, that is
triggered by input files being written into an input Google Cloud
Storage bucket, and writes its output files to another GCS output
bucket.
See the README instructions if you'd like to try it out (which you can
do using a GCP free tier account).
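The only real preparation is splitting your big URL list into chunk files
to drop into the input bucket. Here's a rough Perl sketch of that step
(split -l 300 works just as well, and the file naming is arbitrary):
#!/usr/bin/perl
# Split urls.txt into 300-line chunk files (urls-0001.txt, urls-0002.txt, ...)
# ready to be copied into the input GCS bucket (e.g. with gsutil cp).
use strict;
use warnings;

my $chunk_size = 300;
my ($count, $chunk, $out) = (0, 0, undef);

open my $in, '<', 'urls.txt' or die "Cannot read urls.txt: $!";
while (my $url = <$in>) {
    if ($count++ % $chunk_size == 0) {
        close $out if $out;
        my $name = sprintf 'urls-%04d.txt', ++$chunk;
        open $out, '>', $name or die "Cannot write $name: $!";
    }
    print $out $url;
}
close $out if $out;
print "wrote $chunk chunk files\n";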
In terms of scalability, this seems to work pretty well. The biggest run
I've done so far has been 100k URLs, split into 334 input files each
containing 300 URLs. With MAX_INSTANCES=20, cloudfunc-geturilist processes
these 100k URLs in about 18 minutes; with MAX_INSTANCES=100 that drops to
5 minutes. All at a cost of a few cents.
That's a fair bit quicker than having to run up 100 container instances
myself, or than using wget!
Sun 12 Oct 2014
Tags: web, urls, personal_cloud
I wrote a really simple personal URL shortener a couple of years ago, and
have been using it happily ever since. It's called shrtn
("shorten"), and is just a simple perl script that captures (or generates) a
mapping between a URL and a code, records it in a simple text db, and then generates
a static html file that uses HTML meta-redirects to point your browser towards
the URL.
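The generated pages themselves are trivial - something along these lines
(a minimal sketch of the idea, not the actual shrtn code):
#!/usr/bin/perl
# Write a static meta-redirect page for a given code and target URL
# e.g. mkredirect abc123 https://www.example.com/some/long/url
use strict;
use warnings;

my ($code, $url) = @ARGV;
die "usage: $0 <code> <url>\n" unless $code and $url;

open my $fh, '>', "$code.html" or die "Cannot write $code.html: $!";
print $fh <<"HTML";
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="0; url=$url">
</head>
<body>
<p>Redirecting to <a href="$url">$url</a> ...</p>
</body>
</html>
HTML
close $fh;
Because the output is just static files, there's nothing to run or secure
on the server side - which is most of the appeal.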
It was originally based on posts from
Dave Winer
and Phil Windley,
but was interesting enough that I felt the itch to implement my own.
I just run it on my laptop (shrtn <url> [<code>]), and it has settings to
commit the mapping to git and push it out to a remote repo (for backup),
and to push the generated html files up to a webserver somewhere (for
serving the html).
Most people seem to like the analytics side of personal URL shorteners
(seeing who clicks your links), but I don't really track that side of it
at all (it would be easy enough to add Google Analytics to your html
files to do that, or just do some analysis on the access logs). I
mostly wanted it initially to post nice short links when microblogging,
where post length is an issue.
Surprisingly though, the most interesting use case in practice is the
ability to give custom mnemonic codes to URLs I use reasonably often, or
cite to other people a bit. If I find myself sharing a URL with more
than a couple of people, it's easier just to create a shortened version and
use that instead - it's simpler, easier to type, and easier to remember for
next time.
So my shortener has sort of become a cross between a Level 1 URL cache
and a poor man's bookmarking service. For instance:
If you don't have a personal url shortener you should give it a try - it's
a surprisingly interesting addition to one's personal cloud. And all you
need to try it out is a domain and some static webspace somewhere to host
your html files.
Too easy.
[ Technical Note: html-based meta-redirects work just fine with browsers,
including mobile and text-only ones. They don't work with most spiders and
bots, however, which may be a bug or a feature, depending on your usage. For a
personal url shortener meta-redirects probably work just fine, and you gain
all the performance and stability advantages of static html over dynamic
content. For a corporate url shortener where you want bots to be able to
follow your links, as well as people, you probably want to use http-level
redirects instead. In which case you either go with a hosted option, or look
at something like YOURLS for a slightly more heavyweight
self-hosted option. ]
Wed 10 Sep 2014
Tags: finance, billing, web
You'd think that 20 years into the Web we'd have billing all sorted out.
(I've got in view here primarily bill/invoice delivery, rather than
payments, and consumer-focussed billing, rather than B2B invoicing).
We don't. Our bills are probably as likely to still come on paper as in
digital versions, and the current "e-billing" options all come with
significant limitations (at least here in Australia - I'd love to hear
about awesome implementations elsewhere!)
Here, for example, are a representative set of my current vendors, and
their billing delivery options (I'm not picking on anyone here, just
grounding the discussion in some specific examples).
So that all looks pretty reasonable, you might say. All your vendors have
some kind of e-billing option. What's the problem?
The current e-billing options
Here's how I'd rate the various options available:
email: email is IMO the best current option for bill delivery - it's
decentralised, lightweight, push-rather-than-pull, and relatively easy
to integrate/automate. Unfortunately, not everyone offers it, and sometimes
(e.g. Citibank) they insist on putting passwords on the documents they send
out via email on the grounds of 'security'. (On the other hand, emails
are notoriously easy to fake, so faking a bill email is a straightforward
attack vector if you can figure out customer-vendor relationships.)
(Note too that most of the non-email e-billing options still use email
for sending alerts about a new bill, they just don't also send the bill
through as an attachment.)
web (i.e. a company portal of some kind which you log into and can
then download your bill): this is efficient for the vendor, but pretty
inefficient for the customer - it requires going to the particular
website, logging in, and navigating to the correct location before you
can view or download your bill. So it's an inefficient, pull-based
solution, requiring yet another username/password, and with few
integration/automation options (and security issues if you try).
BillPayView
/ Australia Post Digital Mailbox:
for non-Australians, these are free (for consumers) solutions for
storing and paying bills offered by a consortium of banks
(BillPayView) and Australia Post (Digital Mailbox) respectively.
These provide a pretty decent user experience in that your bills are
centralised, and they can often parse the bill payment options and
make the payment process easy and less error-prone. On the other
hand, centralisation is a two-edged sword, as it makes it harder to
change providers (can you get your data out of these providers?);
it narrows your choices in terms of bill payment (or at least makes
certain kinds of payment options easier than others); and it's
basically still a web-based solution, requiring login and navigation,
and very difficult to automate or integrate elsewhere. I'm also
suspicious of 'free' services from corporates - clearly there is value
in driving you through their preferred payment solutions and/or in the
transaction data itself, or they wouldn't be offering it to you.
Also, why are there limited providers at all? There should be a
standard in place so that vendors don't have to integrate separately
with each provider, and so that customers have maximum choice in whom
they wish to deal with. Wins all-round.
And then there's the issue of formats. I'm not aware of any Australian
vendors that bill customers in any format except PDF - are there any?
PDFs are reasonable for human consumption, but billing should really be
done (instead of, or as well as) in a format meant for computer consumption,
so they can be parsed and processed reliably. This presumably means billing
in a standardised XML or JSON format of some kind (XBRL?).
How billing should work
Here's a strawman workflow for how I think billing should work:
the customer's profile with the vendor includes a billing delivery
URL, which is a vendor-specific location supplied by the customer to
which their bills are to be HTTP POST-ed. It should be an HTTPS URL to
secure the content during transmission, and the URL should be treated
by the vendor as sensitive, since its possession would allow someone
to post fake invoices to the customer
if the vendor supports more than one bill/invoice format, the customer
should be able to select the format they'd like
the vendor posts invoices to the customer's URL and gets back a URL
referencing the customer's record of that invoice. (The vendor might,
for instance, be able to query that record for status information, or
they might supply a webhook of their own to have status updates on the
invoice pushed back to them.)
the customer's billing system should check that the posted invoice has
the correct customer details (at least, for instance, the vendor/customer
account number), and ideally should also check the bill payment methods
against an authoritative set maintained by the vendor (this provides
protection against someone injecting a fake invoice into the system with
bogus bill payment details)
the customer's billing system is then responsible for facilitating the
bill payment manually or automatically at or before the due date, using
the customer's preferred payment method. This might involve billing
calendar feeds, global or per-vendor preferred payment methods, automatic
checks on invoice size against vendor history, etc.
all billing data (ideally fully parsed, categorised, and tagged) is then
available for further automation / integration e.g. personal financial
analytics, custom graphing, etc.
This kind of solution would give the customer full control over their
billing data, the ability to choose a billing provider that's separate from
(and more agile than) their vendors and banks, as well as significant
flexibility to integrate and automate further. It should also be pretty
straightforward on the vendor side - it just requires a standard HTTP POST
and provides immediate feedback to the vendor on success or failure.
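To make the vendor side concrete, here's a hypothetical sketch of that
first step in Perl - the delivery URL, the JSON fields, and the response
handling are all invented for illustration, since the whole point is that
no such standard exists yet:
#!/usr/bin/perl
# Hypothetical vendor-side sketch: POST a machine-readable invoice to the
# customer's billing delivery URL and record the returned invoice URL.
use strict;
use warnings;
use JSON;
use LWP::UserAgent;

my $invoice = {
    vendor           => 'Acme Utilities',
    customer_account => '123456789',
    invoice_number   => 'INV-2014-0042',
    currency         => 'AUD',
    amount_due       => '104.50',
    due_date         => '2014-10-08',
    payment_methods  => [
        { type => 'bpay', biller_code => '987654', reference => '123456789' },
    ],
};

my $ua  = LWP::UserAgent->new;
my $res = $ua->post(
    'https://bills.example.com/in/3f9a2c',   # customer-supplied delivery URL, treated as a secret
    'Content-Type' => 'application/json',
    Content        => encode_json($invoice),
);
die 'billing POST failed: ' . $res->status_line . "\n" unless $res->is_success;
print 'customer invoice record: ' . $res->decoded_content . "\n";
The response giving the customer's invoice record URL is what enables the
status-query and webhook options mentioned in the workflow above.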
Why doesn't this exist already - it doesn't seem hard?
Fri 29 May 2009
Tags: delicious, feeds, web
I've been playing with using delicious
as a lightweight URL database lately, mostly for use by
greasemonkey
scripts of various kinds (e.g.
squatter_redirect).
For this kind of use I really just need a lightweight anonymous
http interface to the bookmarks, and delicious provides a number of
nice lightweight RSS and JSON feeds
suitable for this purpose.
But it turns out the feed I really need isn't currently available.
I mostly want to be able to ask, "Give me the set of bookmarks stored
for URL X by user Y", or even better, "Give me the set of bookmarks
stored for URL X by users Y, Z, and A".
Delicious have a feed for recent bookmarks by URL:
http://feeds.delicious.com/v2/{format}/url/{url md5}
and a feed for all a user's bookmarks:
http://feeds.delicious.com/v2/{format}/{username}
and feeds for a user's bookmarks limited by tag(s):
http://feeds.delicious.com/v2/{format}/{username}/{tag[+tag+...+tag]}
but not one for a user limited by URL, or for URL limited by user.
Neither alternative approach is both feasible and reliable: searching
by url will only return the most recent set of N bookmarks; and searching
by user and walking the entire (potentially large) set of their bookmarks
is just too slow.
So for now I'm having to work around the problem by adding a special
hostname tag to my bookmarks (e.g. squatter_redirect=www.openfusion.net),
and then using the username+tag feed as a proxy for my username+domain
search.
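Concretely, that means polling a feed of the form (following the
username+tag template above):
http://feeds.delicious.com/v2/{format}/{username}/squatter_redirect=www.openfusion.net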
Any cluesticks out there? Any nice delicious folk want to whip up a shiny
new feed for the adoring throngs? :-)
Tue 24 Feb 2009
Tags: web, disqus
I'm trying out disqus, since I like the idea
of being able to track/collate my comments across multiple endpoints,
rather than have them locked in to various blogging systems. So this
is a test post to try out commenting. Please feel free to comment ad
nauseam below (and sign up for a disqus account, if you don't already
have one).
Wed 26 Nov 2008
Tags: web, qtcba
Was thinking this morning about my interactions with the web over
the last couple of weeks, and how I've been frustrated with not
being able to (simply) get answers to relatively straightforward
questions from the automated web. This is late 2008, and Google
and Google Maps and Wikipedia and Freebase etc. etc. have clearly
pushed back the knowledge boundaries here hugely, but at the same
time lots of relatively simple questions are as yet largely
unanswerable.
By way of qualification, I mean they are not answerable in an automated
fashion, not that they cannot be answered by asking the humans on
the web (invoking the 'lazyweb'). I also don't mean that these
questions are impossible to answer given the time and energy to
collate the results available - I mean that they are not simply
and reasonably trivially answerable, more or less without work on
my part. (e.g. "How do I get to address X" was kind of answerable
before Google Maps, but they were arguably the ones who made it
more-or-less trivial, and thereby really solved the problem.)
So in the interests of helping delineate some remaining frontiers,
and challenging ourselves, here's my catalogue of questions from
the last couple of weeks:
what indoor climbing gyms are there in Sydney?
where are the indoor climbing gyms in Sydney (on a map)?
what are the closest gyms to my house?
how much are the casual rates for adults and children for the
gyms near my house?
what are the opening hours for the gyms near my house?
what shops near my house sell the Nintendo Wii?
what shops near my house have the Wii in stock?
what shops near my house are selling Wii bundles?
what is the pricing for the Wii and Wii bundles from shops near my
house?
of the shops near my house that sell the Wii, who's open late on
Thursdays?
of the shops near my house that sell the Wii, what has been the best
pricing on bundles over the last 6 months?
trading off distance to travel against price, where should I buy a Wii?
what are the "specials" at the supermarkets near my house this week?
given our grocery shopping habits and the current specials, which
supermarket should I shop at this week?
I need cereal X - do any of the supermarkets have it on special?
That's a useful starting set from the last two weeks. Anyone else? What
are your recent questions-that-cannot-be-answered? (And if you blog, tag
with #qtcba pretty please).
Thu 29 May 2008
Tags: banking, finance, web
Heard via @chieftech on twitter that the
Banking Technology 2008
conference is on today. It's great to see the financial world engaging with
developments online and thinking about new technologies and the Web 2.0 space, but
the agenda strikes me as somewhat weird, perhaps driven mainly by the vendors they
could get willing to spruik their wares?
How, for instance, can you have a "Banking Technology" conference and not have
at least one session on 'online banking'? Isn't this the place where your
technology interfaces with your customers? Weird.
My impression of the state of online banking in Australia is pretty
underwhelming. As a geek who'd love to see some real technology innovation
impact our online banking experiences, here are some wishlist items dedicated
to the participants of Banking Technology 2008. I'd love to see the following:
Multiple logins to an account e.g. a readonly login for downloading
things, a bill-paying login that can make payments to existing vendors,
but not configure new ones, etc. This kind of differentiation would allow
automation (scripts/services) using 'safe' accounts, without having to
put your master online banking details at risk.
API access to certain functions e.g. balance checking, transaction
downloads, bill payment to existing vendors, internal transfers, etc.
Presumably dependent upon having multiple logins (previous), to help
mitigate security issues.
Tagging functionality - the ability to interactively tag transactions (e.g.
'utilities', 'groceries', 'leisure', etc.), and to get those tags included
in transaction reporting and/or downloading. Further, allow autotagging of
transactions via descriptions/type/other party details etc.
Alert conditions - the ability to setup various kinds of alerts on
various conditions, like low or negative balances, large withdrawals,
payroll deposit, etc. I'm not so much thinking of plugging into particular
alert channels here (email, SMS, IM, etc), just the ability to set 'flags'
on conditions.
RSS support - the ability to configure various kinds of RSS feeds of
'interesting' data. Authenticated, of course. Examples: per-account
transaction feeds, an alert condition feed (low balance, transaction
bouncing/reversal, etc.), bill payment feed, etc. Supplying RSS feeds
also means that such things can be plugged into other channels like email,
IM, twitter, SMS, etc.
Web-friendly interfaces - as Eric Schmidt of Google says, "Don't fight the
internet". In the online banking context, this means DON'T use technologies
that work against the goodness of the web (e.g. frames, graphic-heavy design,
Flash, RIA silos, etc.), and DO focus on simplicity, functionality, mobile
clients, and web standards (HTML, CSS, REST, etc.).
Web 2.0 goodness - on the nice-to-have front (and with the proviso that it
degrades nicely for non-javascript clients) it would be nice to see some
ajax goodness allowing more friendly and usable interfaces and faster
response times.
Other things I've missed? Are there banks out there already offering any of
these?
Mon 21 Apr 2008
Tags: fire eagle, location, web, microformats
I've been thinking about Yahoo's new fire eagle
location-broking service over the last few days. I think it is a really
exciting service - potentially a game changer - and has the potential to
move publishing and using location data from a niche product to something
really mainstream. Really good stuff.
But as I posted here, I also think fire
eagle (at least as it's currently formulated) is probably only usable by
a relatively small section of the web - roughly the relatively
sophisticated "web 2.0" sites who are comfortable with web services and api
keys and protocols like OAuth.
For the rest of the web - the long web 1.0 tail - the technical bar is
simply too high for fire eagle as it stands to be useful and usable.
In addition, fire eagle as it currently stands is unicast, acting as a
mediator between you and some particular app acting as a producer or a consumer
of your location data. But, at least on the consumer side, I want some kind
of broadcast service, not just a per-app unicast one. I want to be able to
say "here's my current location for consumption by anyone", and allow that
to be effectively broadcast to anyone I'm interacting with.
Clearly my granularity/privacy settings might be different for my public
location, and I might want to be able to blacklist certain sites or parties
if they prove to be abusers of my data, but for lots of uses a broadcast
public location is exactly what I want.
How might this work in the web context? Say I'm interacting with an
e-commerce site, and if they had some broad idea of my location (say,
postcode, state, country) they could default shipping addresses for me,
and show me shipping costs earlier in the transaction (subject to change,
of course, if I want to ship somewhere else). How can I communicate my
public location data to this site?
So here's a crazy super-simple proposal: use Microformat HTTP Request
Headers.
HTTP Request Headers are the only way the browser can pass information
to a website (unless you consider cookies a separate mechanism, and they
aren't really useful here because they're domain specific). The
HTTP spec
even carries over the
"From"
header from email, to allow browsers to communicate who the user is to
the website, so there's some kind of precedent for using HTTP headers for
user info.
Microformats are useful here because they're
really simple, and they provide useful standardised vocabularies around
addresses (adr) and geocoding
(geo).
So how about (for example) we define a couple of custom HTTP request
headers for public location data, and use some kind of microformat-inspired
serialisation (like e.g. key-value pairs) for the location data? For
instance:
X-Adr-Current: locality=Sydney; region=NSW; postal-code=2000; country-name=Australia
X-Geo-Current: latitude=33.717718; longitude=151.117158
For websites, the usage is then about as trivial as possible: check for
the existence of the HTTP header, do some very simple parsing, and use
the data to personalise the user experience in whatever ways are
appropriate for the site.
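Here's a rough sketch of that server-side parsing, assuming a plain
CGI-style environment where request headers arrive as HTTP_* environment
variables (the header names are the proposed ones above; everything else
is illustrative):
#!/usr/bin/perl
# Parse the proposed X-Adr-Current / X-Geo-Current request headers
# (semicolon-separated key=value pairs) from the CGI environment.
use strict;
use warnings;

sub parse_location_header {
    my ($raw) = @_;
    return {} unless defined $raw;
    my %data;
    for my $pair (split /;\s*/, $raw) {
        my ($key, $value) = split /=/, $pair, 2;
        $data{$key} = $value if defined $value;
    }
    return \%data;
}

my $adr = parse_location_header($ENV{HTTP_X_ADR_CURRENT});
my $geo = parse_location_header($ENV{HTTP_X_GEO_CURRENT});

# e.g. default the shipping address form to the visitor's current postcode
if (my $postcode = $adr->{'postal-code'}) {
    print "Defaulting shipping postcode to $postcode\n";
}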
On the browser side we'd need some kind of simple fire eagle client that
would pull location updates from fire eagle and then publish them via
these HTTP headers. A firefox plugin would probably be a good proof of
concept.
I think this is simple, interesting and useful, though it obviously
requires websites to make use of it before it's of much value in the real
world.
So is this crazy, or interesting?
Tue 15 Apr 2008
Tags: location, fire eagle, web, web2.0
Brady Forrest asked in a recent
post
what kinds of applications people would most like to see working with Yahoo's
new location-broking service Fire Eagle (currently
in private beta).
It's clear that most of the shiny new web 2.0 sites and apps might be able to
benefit from such personal location info:
photo sites that can do automagic geotagging
calendar apps that adapt to our current timezone
search engines that can take proximity into account when weighting results
social networks that can show us people in town when we're somewhere new
maps and mashups that start where you are, rather than with some static default
etc.
And such sites and apps will no doubt be early adopters of fire eagle and
whatever other location brokers might bubble up in the next little while.
Two things struck me with this list though. First, that's a lot of sites and
apps right there, and unless the friction of authorising new apps to have
access to my location data is very low, the pain of micromanaging access is
going to get old fast. Is there some kind of 'public' client level access in
fire eagle that isn't going to require individual app approval?
Second, I can't help thinking that this still leaves most of the web out in
the cold. Think about all the non-ajax sites that you interact with doing
relatively simple stuff that could still benefit from access to your public
location data:
the shipping address forms you fill out at every e-commerce site you buy from
store locators and hours pages that ask for a postcode to help you (every time!)
timetables that could start with nearby stations or routes or lines if they
knew where you were
intelligent defaults or entry points for sites doing everything from movie
listings to real estate to classifieds
This is the long tail of location: the 80% of the web that won't be using ajax
or comet or OAuth or web service APIs anytime soon. I'd really like my location
data to be useful on this end of the web as well, and it's just not going to
happen if it requires sites to register api keys and use OAuth and make web
service api calls. The bar is just too high for lots of casual web developers,
and an awful lot of the web is still custom php or asp scripts written by
relative newbies (or maybe that's just here in Australia!). If it's not almost
trivially easy, it won't be used.
So I'm interested in how we do location at this end of the web. What do we
need on top of fire eagle or similar services to make our location data
ubiquitous and immediately useful to relatively non-sophisticated websites?
How do we deal with the long tail?
Wed 09 Apr 2008
Tags: web, perl
I've been playing around with SixApart's
TheSchwartz for the last few days.
TheSchwartz is a lightweight reliable job queue, typically used for
handling relatively high latency jobs that you don't want to try and
handle from a web process e.g. for sending out emails, placing orders
into some external system, etc. Basically interacting with anything
which might be down or slow or which you don't really need right away.
Actually, TheSchwartz is a job queue library rather than a job queue
system, so some assembly is required. Like most Danga/SixApart
software, it's lightweight, performant, and well-designed, but also
pretty light on documentation. If you're not comfortable reading the
(perl) source, it might be a challenging environment to set up.
Notes from the last few days:
Don't use the version on CPAN, get the latest code from
subversion
instead. At the moment the CPAN version is 1.04, but current
svn is at 1.07, and has some significant additional
functionality.
Conceptually TheSchwartz is very simple - jobs with opaque
function names and arguments are inserted into a database
for workers with a particular 'ability'; workers periodically
check the database for jobs matching the abilities they have,
and grab and execute them. Jobs that succeed are marked
completed and removed from the queue; jobs that fail are
logged and left on the queue to be retried after some time
period up to a configurable number of retries.
TheSchwartz has two kinds of clients - those that submit jobs, and
workers that perform them. Both are considered 'clients', which can be
confusing if you're thinking in terms of client-server interaction.
There are three main classes to deal with: TheSchwartz, which is the
main client functionality class; TheSchwartz::Job, which models the jobs
that are submitted to the job queue; and TheSchwartz::Worker, which is a
role-type class modelling a particular ability that a worker is able to
perform.
New worker abilities are defined by subclassing TheSchwartz::Worker and
defining your new functionality in a work() method. work() receives the
job object from the queue as its only argument and does its stuff, marking
the job as completed or failed after processing (there's a minimal sketch
at the end of these notes). A useful real example worker is
TheSchwartz::Worker::SendEmail (also by Brad Fitzpatrick, and available
on CPAN) for sending emails from TheSchwartz.
Depending on your application, it may make sense for workers
to just have a single ability, or for them to have multiple
abilities and service more than one type of job. In the latter
case, TheSchwartz tries to use unused abilities whenever it
can to avoid certain kinds of jobs getting starved.
You can also subclass TheSchwartz itself to modify the standard
functionality, and I've found that useful where I've wanted more
visibility of what workers are doing than you get out of the box.
You don't appear at this point to be able to subclass TheSchwartz::Job
however - TheSchwartz always uses this as the class when autovivifying
jobs for workers.
There are a bunch of other features I haven't played with yet,
including job priorities, the ability to coalesce jobs into
groups to be processed together, and the ability to delay jobs
until a certain time.
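And here's the minimal sketch promised above, pulling those pieces
together - a worker ability, a submitting client, and a worker loop. The
DSN, class names, and job arguments are placeholders, and error handling
is omitted:
#!/usr/bin/perl
use strict;
use warnings;

# A worker ability: subclass TheSchwartz::Worker and define work()
package MyApp::Worker::Ping;
use base 'TheSchwartz::Worker';

sub work {
    my ($class, $job) = @_;
    my $args = $job->arg;        # the opaque arguments supplied at insert time
    print "pinging $args->{host}\n";
    # ... do the slow/unreliable work here ...
    $job->completed;             # or $job->failed($message) to leave it queued for retry
}

package main;
use TheSchwartz;

my @databases = ( { dsn => 'dbi:mysql:theschwartz', user => 'schwartz', pass => 'secret' } );

# Submitting client: insert a job for the 'MyApp::Worker::Ping' ability
my $client = TheSchwartz->new( databases => \@databases );
$client->insert( 'MyApp::Worker::Ping', { host => 'www.example.com' } );

# Worker (normally a separate long-running process): declare abilities and work
my $worker = TheSchwartz->new( databases => \@databases );
$worker->can_do('MyApp::Worker::Ping');
$worker->work_until_done;        # or $worker->work to loop forever polling for jobs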
I've actually been using it to set up a job queue system for a cluster,
which is a slightly different application than it was intended for,
but so far it's been working really well.
I still feel like I'm getting to grips with the breadth of things it
could be used for though - more experimentation required.
I'd be interested in hearing of examples of what people are using it
for as well.
Recommended.
Wed 05 Mar 2008
Tags: billing, finance, web
Was thinking over the weekend about places where I waste time, areas of
inefficiency in my extremely well-ordered life (cough splutter).
One of the more obvious was bill handling. I receive paper bills during
the month from the likes of Energy Australia, Sydney Water, David Jones,
our local council for rates, etc. These all go into a pending file in the
filing cabinet, in date order, and I then periodically check that file
during the month and pay any bills that are coming due. If I get busy or
forgetful I may miss a due date and pay a bill late. If a bill gets lost
in the post I may not pay it at all. And the process is all dependent on
me polling my billing file at some reasonable frequency.
There are variants to this process too. Some of my friends do all their
bills once a month, and just queue the payments in their bank accounts
for future payment on or near the due date. That's a lower workload
system than mine, but for some (mostly illogical) reason I find myself
not really trusting future-dated bill payments in the same way as
immediate ones.
There's also a free (for users) service available in Australia called
BPay View
which allows you to receive your bills electronically directly into your
internet banking account, and pay them from there. This is nice in that
it removes the paper and data entry pieces of the problem, but it's
still a pull model - I still have to remember to check the BPay View
page periodically - and it's limited to vendors that have signed up for
the program.
As I see it, there are two main areas of friction in this process:
using a pull model i.e. the process all being dependent on me
remembering to check my bill status periodically and pay those that
are coming due. My mental world is quite cluttered enough without
having to remember administrivia like bills.
the automation friction around paper-based or PDF-based bills,
and the consequent data entry requirements, the scope for user
errors, etc.
BPay View mostly solves the second of these, but it's a solution that's
closely coupled with your Internet Banking provider. This has security
benefits, but it also limits you to your Internet Banking platform. For
me, the first of these is a bigger issue, so I'd probably prefer a
solution that was decoupled from my internet banking, and accept a few
more issues with #2.
So here's what I want:
a billing service that receives bills from vendors on my behalf
and enters them into its system. Ideally this is via email (or even
a web service) and an XML bill attachment; in the real world it
probably still involves paper bills and data entry for the short to
medium term.
a flexible notification system that pushes alerts to me when bills
are due based on per-vendor criteria I configure. This should
include at least options like email, IM, SMS, twitter, etc.
Notifications could be fire-once or fire-until-acknowledged, as the
user chooses.
for bonus points, an easy method of transferring bills into my
internet banking. The dumb solution is probably just a per-bill
view from which I can cut and paste fields; smarter solutions
would be great, but are probably dependent on the internet
banking side. Or maybe we do some kind of per-vendor pay online
magic, if it's possible to figure out the security side of not
storing credit card info. Hmmm.
That sounds pretty tractable. Anyone know anything like this?
Thu 27 Dec 2007
Tags: web, rss
As the use of RSS and Atom becomes increasingly widespread (we have people
talking about Syndication-Oriented Architecture now), it seems to me that
one of the use cases that isn't particularly well covered off is transient
or short-term feeds.
In this category are things like short-term blogs (e.g. the feeds on the
advent blogs I was reading this year:
Catalyst 2007 and
24 Ways 2007), or comment feeds, for tracking the
comments on a particular post.
Transient feeds require at least the ability to auto-expire a feed after
some period of time (e.g. 30 days after the last entry) or after a certain
date, and secondarily, the ability to add feeds almost trivially to your
newsreader (I'm currently just using the thunderbird news reader, which
is reasonable, but requires about 5 clicks to add a feed).
Anyone know of newsreaders that offer this functionality?
Thu 08 Nov 2007
Tags: web, advertising
Great quote from Dave Winer on
Why Google launched OpenSocial:
Advertising is on its way to being obsolete. Facebook is just another
step along the path. Advertising will get more and more targeted until
it disappears, because perfectly targeted advertising is just
information.
I don't see Facebook seriously threatening Google, as Dave does, but that
quote is a classic, and long-term (surely!) spot on the money.
I'm much more in agreement with Tim O'Reilly's
critique of OpenSocial.
Somehow OpenSocial seems all backwards from the company whose maps openness
help make mashups a whole new class of application.
It smells a lot like OpenSocial was hastily conceived just to get
something out the door in advance of the Facebook announcements today,
by Googlers who don't quite grok the power of the open juice.
Thu 04 Oct 2007
Tags: web, rant, hardware
Today I've been reminded that while the web revolution continues
apace - witness Web 2.0, ajax, mashups, RESTful web services, etc. -
much of the web hasn't yet made it to Web 1.0, let alone Web 2.0.
Take ecommerce.
One of this afternoon's tasks was this: order some graphics cards
for a batch of workstations. We had a pretty good idea of the kind
of cards we wanted - PCIe Nvidia 8600GT-based cards. The unusual
twist today was this: ideally we wanted ones that would only take
up a single PCIe slot, so we could use them okay even if the
neighbouring slot was filled i.e.
select * from graphics_cards
where chipset_vendor = 'nvidia'
and chipset = '8600GT'
order by width desc;
or something. Note that we don't even really care much about price.
We just need some retailer to expose the data on their cards in a
useful sortable fashion, and they would get our order.
In practice, this is Mission Impossible.
Mostly, merchants will just allow me to drill down to their
graphics cards page and browse the gazillion cards they have
available. If I'm lucky, I'll be able to get a view that only
includes Nvidia PCIe cards. If I'm very lucky, I might even be
able to drill down to only 8000-series cards, or even 8600GTs.
Some merchants also allow ordering on certain columns, which
is actually pretty useful when you're buying on price. But none
seem to expose RAM or clockspeeds in list view, let alone card
dimensions.
And even when I manually drill down to the cards themselves,
very few have much useful information there. I did find two
sites that actually quoted the physical dimensions for some
cards, but in both cases the numbers they were quoting
seemed bogus.
Okay, so how about we try and figure it out from the
manufacturer's websites?
This turns out to be Mission Impossible II. The manufacturers'
websites are all controlled by their marketing departments and
largely consist of flash demos and brochureware. Even finding
a particular card is an impressive feat, even if you have the
merchant's approximation of its name. And when you do they often
have less information than the retailers'. If there is any
significant data available for a card, it's usually in a pdf
datasheet or a manual, rather than available on a webpage.
Arrrghh!
So here are a few free suggestions for all and sundry, born
out of today's frustration.
For manufacturers:
use part numbers - all products need a unique identifier,
like books have an ISBN. That means I don't have to try and
guess whether your 'SoFast HyperFlapdoodle 8600GT' is the
same thing as the random mislabel the merchant put on it.
provide a standard url for getting to a product page given
your part number. I know, that's pretty revolutionary, but
maybe take a few tips from google instead of just listening
to your marketing department e.g.
http://www.supervidio.com.tw/?q=sofast-hf-8600gt-256
keep old product pages around, since people don't just buy
your latest and greatest, and products take a long time to
clear in some parts of the world
include some data on your product pages, rather than
just your brochureware. Put it way down the bottom of the
page so your marketing people don't complain as much. For
bonus points, mark it up with semantic microformat-type
classes to make parsing easier.
alternatively, provide dedicated data product pages, perhaps
in xml, optimised for machine use rather than marketing.
They don't even have to be visible via browse paths, just
available via search urls given product ids.
For merchants:
include manufacturer's part numbers, even if you want to
use your own as the primary key. It's good to let your
customers get additional information from the manufacturer,
of course.
provide links at least to the manufacturer's home page, and
ideally to individual product pages
invest in your web interface, particularly in terms of
filtering results. If you have 5 items that are going to
meet my requirements, I want to be able to filter down to
exactly and only those five, instead of having to hunt for
them among 50. Price is usually an important determiner of
shopping decisions, of course, but if I have two merchants
with similar pricing, one of whom let me find exactly the
target set I was interested in, guess who I'm going to buy
from?
do provide as much data as possible as conveniently as
possible for shopping aggregators, particularly product
information and stock levels. People will build useful
interfaces on top of your data if you let them, and will
send traffic your way for free. Pricing is important, but
it's only one piece of the equation.
simple and useful beats pretty and painful - in particular,
don't use frames, since they break lots of standard web
magic like bookmarking and back buttons; don't do things
like magic javascript links that don't work in standard
browser fashion; and don't open content in new windows for
me - I can do that myself
actively solicit feedback from your customers - very few
people will give you feedback unless you make it very clear
you welcome and appreciate it, and when you get it, take it
seriously
End of rant.
So tell me, are there any clueful manufacturers and merchants
out there? I don't like just hurling brickbats ...
Tue 02 Oct 2007
Tags: web, firefox, greasemonkey, top list
I've been meaning to document the set of firefox extensions I'm currently
using, partly to share with others, partly so they're easy to find and install
when I start using a new machine, and partly to track the way my usage changes
over time. Here's the current list:
Obligatory Extensions
Greasemonkey - the
fantastic firefox user script manager, allowing
client-side javascript scripts to totally transform any web page before it
gets to you. For me, this is firefox's "killer feature" (and see below for
the user scripts I recommend).
Flash Block - disable
flash and shockwave content from running automatically, adding placeholders
to allow running manually if desired (plus per-site whitelists, etc.)
AdBlock Plus - block
ad images via a right-click menu option
Chris Pederick's
Web Developer Toolbar - a
fantastic collection of tools for web developers
Joe Hewitt's Firebug -
the premiere firefox web debugging tool - its html and css inspection
features are especially cool
Daniel Lindkvist's
Add Bookmark Here
extension, adding a menu item to bookmark toolbar dropdowns to add the
current page directly in the right location
Optional Extensions
Michael Kaply's Operator -
a very nice microformats toolbar, for discovering
the shiny new microformats embedded in web pages, and providing operations you
can perform on them
Zotero - a very
interesting extension to help capture and organise research information,
including webpages, notes, citations, and bibliographic information
Colorful Tabs - tabs +
eye candy - mmmmm!
Chris Pederick's
User Agent Switcher -
for braindead websites that only think they need IE
ForecastFox - nice
weather forecast widgets in your firefox status bar (and not just
US-centric)
Greasemonkey User Scripts
So what am I missing here?
Updates:
Since this post, I've added the following to my must-have list:
Tony Murray's Print Hint -
helps you find print stylesheets and/or printer-friendly versions of pages
the Style Sheet Chooser II
extension, which extends firefox's standard alternate stylesheet selection
functionality
Ron Beck's JSView
extension, allowing you to view external javascript and css styles used
by a page
The It's All Text
extension, allowing textareas to be edited using the external editor of
your choice.
The Live HTTP Headers
plugin - invaluable for times when you need to see exactly what is going on
between your browser and the server
Gareth Hunt's Modify Headers
plugin, for setting arbitrary HTTP headers for web development
Sebastian Tschan's Autofill Forms
extension - amazingly useful for autofilling forms quickly and efficiently
Wed 12 Sep 2007
Tags: web, web2.0, lifebits, microformats, data blogging
Following on from my earlier data blogging post, and along the
lines of Jon Udell's
lifebits scenarios,
here's the first in a series of posts exploring some ideas about how data blogging
might be interesting in today's Web 2.0 world.
Easy one first: Reviews.
When I write a review on my blog of a book I've read or a movie I've seen,
it should be trivial to syndicate this as a review to multiple relevant
websites. My book reviews might go to Amazon (who else does good user
book review aggregation out there?), movies reviews to IMDB, Yahoo Movies,
Netflix, etc.
I'm already writing prose, so I should just be able to mark it up as an
hReview microformat, add some tags to control syndication,
and have that content available via one or more RSS or Atom feeds.
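For reference, a book review marked up as an hReview looks roughly like
this (class names as per the hReview draft; the book, date, and rating
here are made up):
<div class="hreview">
  <span class="item"><span class="fn">The Mythical Man-Month</span></span>
  reviewed by <span class="reviewer vcard"><span class="fn">A. Reader</span></span>
  on <abbr class="dtreviewed" title="2007-09-12">12 September 2007</abbr> -
  rating: <span class="rating">4</span> out of 5.
  <div class="description">
    <p>Still the classic explanation of why adding people to a late project
    makes it later ...</p>
  </div>
</div>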
I should then just be able to go to my Amazon account, give it the url
for the feed I want it to monitor for reviews, and - voila! - instant
user-driven content syndication.
This is a win-win isn't it? Amazon gets to use my review on its website,
but I get to retain a lot more control in the process:
I can author content using my choice of tools instead of filling out a
textarea on the Amazon website
I can easily syndicate content to multiple sites, and/or syndicate
content selectively as well
I can make updates and corrections according to my policies, rather than
Amazon's (Amazon would of course still be able to decide what to do with
such updates)
I should be able to revoke access to my content to specific websites
if they do stupid stuff
I and my readers get the benefit of retaining and aggregating my content
on my blog, and all your standard blogging magic (comments, trackbacks,
tagclouds, etc.) still apply
It would probably also be nice if Amazon included a link back to the
review on my blog which would drive additional traffic my way, and allow
interested Amazon users to follow any further conversations (comments and
trackbacks etc.) that have happened there.
So are there any sites out there already doing this?
Thu 06 Sep 2007
Tags: web, web2.0, lifebits, microformats, data blogging, inverted web
I've been spending some time thinking about
a couple of
intriguing posts
by Jon Udell, in which he discusses a hypothetical "lifebits" service
which would host his currently scattered "digital assets" and syndicate
them out to various services.
Jon's partly interested in the storage and persistence guarantees such a
service could offer, but I find myself most intrigued by the way in which
he inverts the current web model, applying the publish-and-subscribe
pull-model of the blogging world to traditional upload/push environments
like Flickr or MySpace, email, and even health records.
The basic idea is that instead of creating your data in some online app,
or uploading your data to some Web 2.0 service, you instead create it in
your own space - blog it, if you like - and then syndicate it to the
service you want to share it with. You retain control and authority over
your content, you get to syndicate it to multiple services instead of
having it tied to just one, and you still get the nice aggregation and
folksonomy effects from the social networks you're part of.
I think it's a fascinating idea.
One way to think of this is as a kind of "data blogging", where we blog
not ideas for consumption by human readers, but structured data of
various kinds for consumption by upstream applications and services.
Data blogs act as drivers of applications and transactions, rather than
of conversations.
The syndication piece is presumably pretty well covered via RSS and Atom.
We really just need to define some standard data formats between the
producers - that's us, remember! - and the consumers - which are the
applications and services - and we've got most of the necessary components
ready to go.
Some of the specialised XML vocabularies out there are presumably useful
on the data formats side. But perhaps the most interesting possibility is
the new swag of microformats currently being
put to use in adding structured data to web pages. If we can blog
people and organisations,
events,
bookmarks,
map points,
tags, and
social networks, we've got halfway
decent coverage of a lot of the Web 2.0 landscape.
Anyone else interested in inverting the web?
Sun 19 Aug 2007
Tags: blosxom, web
I've been trying out a few of my
blosxom wishlist
ideas over the last few days, and have now got an experimental version of
blosxom I'm calling
blosphemy (Gr. to speak against, to speak evil of).
It supports the following features over current blosxom:
loads the main blosxom config from an external config file
(e.g. blosxom.conf) rather than inline in blosxom.cgi.
This is similar to what is currently done in the debian blosxom
package.
supports loading the list of plugins to use from an external config
file (e.g. plugins.conf) rather than deriving it by walking the
plugin directory (but falls back to current behaviour for backwards
compatibility).
uses standard perl @INC to load blosxom plugins, instead of hardcoding
the blosxom plugin directory. This allows blosxom to support CPAN
blosxom plugins as well as stock $plugin_dir ones.
uses a multi-value $plugin_path instead of a single value $plugin_dir
to search for plugins. The intention with this is to allow, for
instance, standard plugins to reside in /var/www/blosxom/plugins,
but to allow the user to add their own or modify existing ones by
copying them to (say) $HOME/blosxom/plugins.
These changes isolate blosxom configuration from the cgi and plugin
directories (configs can live in e.g. $HOME/blosxom/config for tarball/home
directory installs, or /etc/blosxom for package installs), allowing nice
clean upgrades. I've been upgrading using RPMs while developing, and the
RPM upgrades are now working really smoothly.
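For the curious, the config files end up looking something like this -
the variable names are the stock blosxom ones, but the values, paths, and
the one-plugin-per-line plugins.conf format shown here are purely
illustrative:
# /etc/blosxom/blosxom.conf - main config, perl syntax, loaded by blosxom.cgi
$blog_title       = "My Weblog";
$blog_description = "Assorted technical notes";
$url              = "http://www.example.com/blog";
$datadir          = "/var/www/blosxom/data";
$plugin_state_dir = "/var/www/blosxom/state";
# multi-value plugin path, searched in order
$plugin_path      = "/var/www/blosxom/plugins:$ENV{HOME}/blosxom/plugins";
1;

# /etc/blosxom/plugins.conf - plugins to load, one per line, in order
interpolate_fancy
entries_index
writeback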
If anyone would like to try it out, releases are at:
I've tried to keep the changes fairly minimalist and clean, so that
some or all of them can be migrated upstream easily if desired. They
should also be pretty much fully backward compatible with the current
blosxom.
Comments and feedback welcome.