Avi Das

Home for my code, thoughts and more.

How Browserify Improves Client-side Development

For a more modular, maintainable Frontend

As Single Page Applications gain in popularity, the size of front-end codebases keeps growing rapidly. To keep these codebases maintainable, modularity becomes a priority: the easier it is to modularize code, the more incentive developers have to do so. The ease of writing modular code with CommonJS has fueled the explosive growth of packages published on npm, which has helped the Node ecosystem greatly. Browserify brings that ease to client-side development by leveraging the CommonJS module system. When used with build tools such as Grunt or Gulp, you can write modular client-side code just like you would write your server-side Node code, and Browserify takes care of the bundling for you. There is much less excuse these days to make everything global and attach it to the window object!

Leveraging npm modules

Package Manager Traction

The graph above is a big selling point when evaluating the value Browserify can bring to your client-side workflow. It compares the rate at which packages are published to different package managers: Bower, PyPI, RubyGems and npm. npm leads the pack easily. Recently, the jQuery plugin registry stopped accepting new plugins, with new packages being published on npm instead. Cordova announced the same change, moving its plugins to npm. npm is now hosting a much broader range of modules than just server-side Node.js modules, and Browserify can help you leverage those modules on the front end. The flip side, as a module publisher, is that publishing on npm now gives you access to a much broader audience, since people might use your module in the browser, on custom hardware and so on.

How it works

In the CommonJS syntax, the “exports” object is the public API of a module and “require” can be used to include a module in your JavaScript file. Since browsers do not have require available, Browserify traverses the dependency trees of all the required modules and bundles the dependencies into one self-contained file that you can include with a script tag in the browser. Browserify is aware of package.json and the order in which node_modules are resolved. Moreover, it supports built-in Node modules, e.g. path, and globals, e.g. Buffer, so you have access to those on the client side as well.
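As a minimal sketch (the file names and the greet function here are made up for illustration), a module and its consumer look just like they would in Node:

// greet.js -- exports defines this module's public API
module.exports = function greet(name) {
  return 'Hello, ' + name + '!';
};

// main.js -- require pulls the module in, exactly as it would on the server
var greet = require('./greet');
document.body.textContent = greet('Browserify');

Running browserify main.js -o bundle.js then walks the dependency tree starting from main.js and produces a single bundle.js that you can drop into a script tag.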

Transforms

Core Browserify only bundles modules written in the CommonJS syntax, adhering to the single responsibility principle. However, there are other ways of modularizing client-side code, AMD and global variables being the two usual ones. Instead of handling every possible module style, Browserify exposes a transforms API so that plugins can be built which preprocess a file into JavaScript in CommonJS syntax, which Browserify can then consume. This means that you can write modular code just like in your Node codebases, regardless of what module system your dependencies adhere to. There are also a lot of people writing in languages that compile to JavaScript, such as CoffeeScript or TypeScript. To handle all this, there are transforms available for AMD (deamdify), Bower modules (debowerify), globals (deglobalify), CoffeeScript (coffeeify), Harmony (es6ify) etc. A simple search for Browserify on GitHub or npm brings up thousands of modules and attests to the ecosystem around it. Delegating to transforms helps keep the footprint of Browserify smaller while making it more extensible.
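As a sketch of how a transform plugs in (the file names here are hypothetical), you can register one through the Node API before bundling:

var browserify = require('browserify');
var fs = require('fs');

// Register the coffeeify transform, then bundle as usual; the command-line
// equivalent is: browserify -t coffeeify app.coffee -o bundle.js
browserify('./app.coffee')
  .transform('coffeeify')
  .bundle()
  .pipe(fs.createWriteStream('bundle.js'));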

Verifying X509 Certificate Chain of Trust in Python

Executing network spoofing and man-in-the-middle attacks has become easier than ever. This is more of an issue if a client keeps a port open for you to send push notifications, since the open port can be detected by methods such as port scanning. As such, it is important to sign data, and to ship the signature and metadata about verifying the data against the signature along with the data itself. This provides a way for the client to verify that the data received is unaltered, from the correct sender and intended for the correct recipient. Python’s pyopenssl has a handy method called verify for checking the authenticity of data.

OpenSSL.crypto.verify(certificate, signature, data, digest)
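A minimal usage sketch, assuming the certificate, payload and signature bytes arrive with the message (the variable names here are illustrative):

from OpenSSL import crypto

# cert_pem, payload and signature are assumed to come with the message
certificate = crypto.load_certificate(crypto.FILETYPE_PEM, cert_pem)

# verify() returns None on success and raises crypto.Error on a bad signature
try:
    crypto.verify(certificate, signature, payload, 'sha256')
    print('Signature is valid')
except crypto.Error:
    print('Signature check failed')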

The problem then becomes how to provide the certificate while retaining the flexibility to update it without clients needing to modify their certificate stores every time. Providing a URL from which the cert can be downloaded gives you that flexibility, but leaves the door open for the same kind of attacks.

Therefore, clients will need to ensure that the downloaded certificate is trustworthy before using it to verify the authenticity of a message. The openssl command-line tool has a verify command that can check a certificate against a chain of trusted certificates, going all the way back to the root CA. The built-in ssl module has create_default_context(), which can build a certificate chain while creating a new SSLContext, but it does not expose that functionality for ad hoc post-processing when you are not opening new connections.
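For reference, that terminal check is a single command (the file names here are hypothetical):

openssl verify -CAfile trusted_chain.pem downloaded_cert.pem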

pyopenssl provides some very handy abstractions for exactly this purpose:

  • X509Store: The chain of certificates you have chosen to trust, going back to the root Certificate Authority

  • X509StoreContext: Takes an X509Store and a new certificate, which you can then validate against your store by calling verify_certificate. It raises an exception if an intermediate or root CA is missing from the chain or the certificate is invalid.

The full example of verifying a downloaded certificate against a trust chain is given below:

import requests
from OpenSSL import crypto

def _verify_certificate_chain(cert_url, trusted_certs):

    # Download the certificate from the url and load it
    cert_response = requests.get(cert_url)
    certificate = crypto.load_certificate(crypto.FILETYPE_PEM, cert_response.content)

    # Create a certificate store and add your trusted certs
    try:
        store = crypto.X509Store()

        # Assuming the certificates are in PEM format in a trusted_certs list
        for _cert in trusted_certs:
            store.add_cert(_cert)

        # Create a certificate context using the store and the downloaded certificate
        store_ctx = crypto.X509StoreContext(store, certificate)

        # Verify the certificate, returns None if it can validate the certificate
        store_ctx.verify_certificate()

        return True

    except Exception as e:
        print(e)
        return False

Using this can be really useful for client libraries where you cannot rely on the system to provide the certificates, so you can ship your trust chain along with the library. pyopenssl also has other useful abstractions for checks against the certificate: get_subject() provides information about the certificate such as the common name, has_expired() checks whether the certificate is within its valid time range, and features such as blacklisting potentially compromised certificates are possible as well. Thus pyopenssl is really handy when you need SSL abstractions beyond the standard library without having to execute openssl shell calls via a subprocess.
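A short sketch of those extra checks, reusing the certificate loaded above (the blacklist here is hypothetical):

# Inspect who the certificate was issued for, e.g. its common name
subject = certificate.get_subject()
print(subject.CN)

# Reject certificates outside their validity window
if certificate.has_expired():
    raise ValueError('Certificate has expired')

# Reject certificates whose serial number is on a known-compromised list
COMPROMISED_SERIALS = set()  # hypothetical blacklist, maintained elsewhere
if certificate.get_serial_number() in COMPROMISED_SERIALS:
    raise ValueError('Certificate has been blacklisted')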

Nodeconf 2015: Unconf With the Right Intentions

Conferences can be a great way to get the creative juices flowing, meet people in the community and share stories and problems. They offer great opportunities to learn from core developers building the frameworks that your software depends on.

Nodeconf managed to achieve all this, in the rather unusual form of an unconference. An unconference means that the structure, events, presentations and talks are decided by the community rather than a committee. That does make Nodeconf a conference that is not for everyone. Understanding the format and structure of Nodeconf is important before you make the hike to Walnut Creek Ranch next year.

I thought I would distill the reasons why you might or might not be interested in attending Nodeconf, as well as how to get the most out of it. You might be interested in Nodeconf if you:

  1. Build for the web: For a lot of attendees, Nodeconf will feel like living in the future, as many attendees are deeply involved in making the decisions and tradeoffs that will shape the future of the web. In particular, the discussions around packaging and parceling front-end assets in npm (Modular UI) were really interesting, as was Isomorphic JS, which covered the challenges involved in writing identical client-side and server-side code. The JavaScript landscape is a fast-evolving one, and Nodeconf offers a fantastic perspective on how the decision making works.

  2. Publish on npm/github: As someone who maintains projects on npm and github, the discussions around distributing node modules were very insightful. Issues such as broadening adoption, getting contributors for github modules and standards for publishing on npm came up and maintainers of hugely popular modules shared their experiences. Picking a good module scope, having really good examples for beginners to start with and publishing with concise yet searchable package descriptions were all emphasized.

Building Realtime User Monitoring and Targeting Platform With Node, Express and Socket.io

Being able to target users and send them targeted notifications can be key to turning visitors into conversions and tightening your funnel. Offerings such as Mailchimp and Mixpanel provide ways to reach out to users, but in most of those cases you only get to do so after the fact. There are situations, however, where it would be really powerful to track users as they are navigating your website and send targeted notifications to them in the moment.

Use Cases

Imagine that a buyer is looking for cars and is interested in vehicles of a particular model and brand. It is very likely that he/she will visit several sites to compare prices. If the buyer has already looked at a few results, there may be an item that fits this user’s profile. If you are able to prompt and reach out as the user is browsing through results, it could make the difference between a sale and the user buying from a different site. This is particularly useful for high-price, high-option scenarios, e.g. real estate, car or electronics purchases. For use cases where the price is low or the options are fewer, e.g. a SaaS offering with three tiers, this level of fine-grained tracking may not be necessary. However, if you have a fledgling SaaS startup, you may want to do this in the spirit of doing things that don’t scale.

Prerequisites

This article assumes that you have node and npm installed on your system. It would also be useful to get familiar with Express.js, the de facto web framework on top of Node.js. Socket.io is a Node.js module that abstracts WebSocket, JSON polling and other protocols to enable simultaneous bidirectional communication between connected parties. This article makes heavy use of Socket.io terminology, so it would be good to be familiar with sending and receiving events, broadcasts, namespaces and rooms.

Install and run

Start by cloning the repo, installing the dependencies and running the app:

git clone git@github.com:avidas/socketio-monitoring.git
cd socketio-monitoring
npm install
npm start

By default this will start the server on port 8080. Navigate to localhost:8080/admin in a browser, e.g. Chrome. Now, in a different browser, e.g. Firefox, navigate to localhost:8080 and browse around. You will see that the admin page gets updated with the URL endpoints as you navigate your way through the website in Firefox. You can even send an alert to the user on Firefox by pressing the Send Offer button in Chrome!

Walkthrough

Let’s get into how this works. When an admin visits localhost:8080/admin, she joins a Socket.io namespace called adminchannel.

var adminchannel = io.of('/adminchannel');

When a new user visits a page, we get the user’s Express sessionID by calling req.sessionID and pass it to the templating engine for rendering. The session ID ensures that we can identify a user across pages and browser tabs.

res.render('index', {a:req.sessionID});

The template sets the value of sessionID as a hidden input field on the page, with the id “user_session_id”.

<body>
  <input type="hidden" id="user_session_id" value="<%= a %>" />
  <div id="device" style="font-size: 45px;">2015 Tesla Cars</div>
  <a href="/about">About</a>
  <br />
  <a href="/">Home</a>
</body>

After the page has loaded, it emits a pageChange Socket.io event. Accompanying the event are the URL endpoint of the current page and the sessionID.

  var userSID = document.getElementById('user_session_id').value;
  var currentURL = window.location.pathname; // endpoint of the page being viewed
  var socket = io();

  var userData = {
    page: currentURL,
    sid: userSID
  };
  socket.emit('pageChange', userData);

On the server side, when pageChange is received, a Socket.io event called alertAdmin is sent to the adminchannel namespace. This ensures that only the admins are alerted that the user with a particular session ID and socket ID has navigated to a different page. Since anyone with access to the /admin endpoint joins the adminchannel namespace, this easily scales to multiple admins.

  socket.on('pageChange', function(userData){
    userData.socketID = socket.id;
    userData.clientIDs = clientIDs;
    console.log('user with session id ' + userData.sid + ' and socket id ' + userData.socketID + ' changed page to ' + userData.page);
    adminchannel.emit('alertAdmin', userData);
  });

When alertAdmin is received on the client side, the UI dashboard is updated so that the admins have a realtime view of users navigating the site. This is done via jQuery, which appends each new page change to an HTML list as users navigate through the site.

  adminsocket.on('alertAdmin', function(userData){
    var val = " User with session id " + userData.sid + " and with socket id " + userData.socketID + " has navigated to " + userData.page;
    userDataGlob = userData;
    // Create the list once, then append one entry per page change
    if ($('#panel ul').length === 0) {
      $('<ul/>').appendTo('#panel');
    }
    $("#panel ul").append('<li> ' + val + ' <button type="button" class="offerClass" id="' + userData.socketID + '">Send Offer</button></li>');
  });

Now, the admin may choose to send notifications to a particular user. When the admin clicks the “Send Offer” button, a Socket.io event called adminMessage is sent to the default namespace on the server, along with the user-specific data.

  //Allow admin to send adminMessage
  $('body').on('click', '.offerClass', function () {
    socket.emit('adminMessage', userDataGlob);
  });

When adminMessage is received on the server side, we broadcast the message to that specific user. Since every socket automatically joins a room identified by its socket ID, we can send a notification to just that user with socket.broadcast.to(userData.socketID), emitting an event called adminBroadcast with the data.

Here, you could instead have chosen to broadcast the message to all users, or to a particular room which a subset of users had joined, as sketched after the handler below. Thus, you can fine-tune how you want to reach out to users as well.

  socket.on('adminMessage', function(userData) {
    socket.broadcast.to(userData.socketID).emit('adminBroadcast', userData);
  });
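As a sketch of those alternatives (the room name here is made up), widening the audience is just a different emit target:

  // notify every connected user except the sender
  socket.broadcast.emit('adminBroadcast', userData);

  // notify only users who joined a specific room, e.g. one per product category
  socket.broadcast.to('tesla-shoppers').emit('adminBroadcast', userData);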

Finally, on the user’s client side, when adminBroadcast is received the user is alerted with a notification. You can easily use this for more complex use cases, such as dynamically updating page results or updating an ads section to show offers, by setting up event listeners.

  socket.on('adminBroadcast', function(userData){
    alert('Howdy there ' + userData.sid + ' ' + userData.socketID + ' ' + userData.page);
  })

There you have an end-to-end way in which a set of admins can track a set of users on a website and send notifications. This system can be particularly valuable when the user’s primary reason for visiting comes with purchasing intent. E-commerce and SaaS platforms have recognized the importance of user segmentation and targeted outreach; this system lets you minimize the latency of that outreach. As a bonus, you get to rely on fully open source tools with broad user bases and support.

This particular example used URL endpoints as part of the data payload, but you can really stretch it to any user event. For example, you can easily track where the user’s cursor is and send that information back in real time. One can imagine high frequency trading firms using this technique in bots to track real-time user behavior, e.g. a user’s cursor hovering over a buy button for a ticker, as information gathered for their trades. How much you want to track and react to can be an exercise in finding the boundary between being responsive and being creepy to your users.

Props to my friend Shah for working with me on this. If you are doing some level of realtime tracking on your site, I would love to hear about it. Please feel free to send over any other feedback as well.

Bug Hunting With Git Bisect

In large projects using Git, feature development often happens in separate branches before they are ready to merge. However, once the merge happens and tests break, it is often challenging to figure out the commit that introduced the bug. Git bisect is an excellent tool for tracking down that commit. It works in a binary-search-like fashion: you mark good and bad commits, and it halves the problem space of commits each time.
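A minimal manual session looks something like this (the tag name is illustrative):

git bisect start
git bisect bad                # the commit you are on is broken
git bisect good v1.2.0        # last known good commit or tag
# git checks out a commit halfway in between; build, test, then mark it
git bisect good               # or: git bisect bad
# repeat until git prints the first bad commit, then clean up
git bisect reset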

However, this process can be quite manual, so git bisect has a run command. It lets you supply a test script; based on the exit status of that script, it automatically checks out the middle commits and keeps searching until it finds the breaking commit.

Another neat feature is its ability to log the output, and to record and replay a bisect for further debugging. The git-scm book has excellent documentation covering the complete API and technical details.
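The record-and-replay part boils down to two commands:

git bisect log > bisect.log     # save every good/bad decision made so far
git bisect replay bisect.log    # re-run the same bisection later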

There are still a few manual steps, as you would want to stash to save and recover the state of uncommitted work, get back to HEAD afterwards and view the log available for record and replay.

For reusability, I wrapped the git bisect setup into a handy bash function:

# Stash current work and git bisect between the given good and bad
# commit ids, running the given script on each candidate commit.
# git bisect run expects the script to exit 0 when a commit is good
# and non-zero (anything but 125) when it is bad.
gbisect() {
    if [ "$#" -lt 3 ]; then
        echo "usage: gbisect good-commit-id bad-commit-id script <arguments>"
    else
        git stash # stash current state
        git bisect start # initialize git bisect
        git bisect good "$1"
        shift
        git bisect bad "$1"
        shift
        git bisect run "$@" # run the script until the breaking commit is found

        git bisect log
        git bisect reset # return to the commit checked out before bisecting

        git stash list
        git stash apply # restore the stashed work
    fi
}

If you are using mocha as a test runner, you could use the script as follows:

gbisect 23df33 56dg23 mocha -t 15000

Git is like an iceberg, in a good way. Instead of perusing heavy books on something, I generally like to learn as I run into challenges. Once something clicks, though, it pays off many times over in your workflow if you use Git for both work and personal projects.

Scipy 2014: Python as Expression to Push Boundaries in Science

It’s not every day that the person sitting next to you interacts with Mars rovers or is trying to build a data pipeline to handle petabyte-scale genomics data. But that was perhaps the key takeaway from my first Python conference: a large number of people pushing the boundaries of scientific disciplines and using Python as their means of expression.

I have been using Python for a while now, both at work and for hobby projects, but until recently I have mostly been on the periphery of contributing to open source projects. When I learned the Scientific Python conference was happening right near me in Austin, I was immediately interested. If you buy that there is such a thing as language wars, scientific computing has been one of Python’s key wins. With libraries such as NumPy, Matplotlib and Pandas (and of course IPython), Python has dominated the scientific computing landscape alongside R and Julia.

When such a strong ecosystem is matched by a very welcoming community, there is a recipe for a conference worth attending. Well, if you can get past the impostor syndrome of being in the room with the highest density of PhDs I have ever encountered.

Takeaways

  1. Python catching up in areas where it lagged: Performance, distribution, scalability and reproducibility were some of the main themes at the conference, addressing some of the historic shortcomings of the language. Sometimes this comes via the adoption of new tools, such as Docker for containerizing work environments for remote co-working researchers. Dependencies on other languages have been one of the major pain points in working with the scientific Python libraries, so it is great to see Conda and HashDist (which I just discovered) take that head on. Interoperability and scalability are two of the main problems Blaze is solving, and Bokeh and Plotly take on the problems of publishing and sharing interactive visualizations in Python.

  2. New tools for my workflow: There are many tools which deserve a space here, but I was primarily excited to discover pyspark, yt, plotly, sumatra/vistrails, hashdist and airspeed velocity. Version control and workflow control are familiar territory for software engineers, but the idea of event control was new to me, something explored in a Birds of a Feather discussion.

  3. Birds of a Feather talks are revealing: Birds of a Feather discussions were sometimes my favorite, with candid sharing of pain points and their solutions from community members. It was also good to learn what the open problems in various areas are, as they often indicate valuable areas to focus on.

Best Chrome Productivity Extensions

How much time during the day do you spend on an internet browser? As a software engineer, my answer is a scary amount of time. Something I have personally struggled with is avoiding distracting sites on the internet. There is too much good content on the internet that dries up my rather finite amount of attention.

Chrome has been my weapon of choice for browsing the web for a while now. Intuitive, clean and fast, Chrome’s ascent over the past few years has been amazing to see. Extensions allow third-party developers to add new features to Chrome, and lately there are a few extensions I simply cannot do without. When you need to buckle up and get something done, these extensions will help you get there.

  1. Pocket: There are enough interesting things to read on the internet for many lifetimes. Just like YouTube’s Watch Later does for videos, Pocket lets me save articles for future reading. With clients available for iOS and Android, saved articles can be accessed anywhere. At any time on social websites, be it Facebook, Quora or Hacker News, there is a lot I would like to read. Pocket gives me the convenience of turning articles from around the internet into my own customized feed or magazine via its Flipboard integration.

  2. Session Buddy: If you are like me, you context switch a lot between projects during the day. Chrome deals with tabs beautifully, and yet I often find myself with 20 or more tabs, usually while researching a particular topic. Tabs usually follow a train of thought and fit within a particular context. Session Buddy solves this exact problem, allowing me to save a set of tabs and windows that can be resumed later. For example, at the end of the day at work, I save a bunch of tabs under the name of the project I was working on. This allows me to pick up the chain of thought from where I left off. Switching between contexts is expensive, and Session Buddy is the best solution I have found.

  3. RescueTime: Holding myself accountable for how I spend time on the web has been eye-opening. RescueTime not only tracks how much time during a day you spend on distracting sites, but the desktop application also tracks your non-browser time. Over time, it subtly nudges you to improve your productivity score by avoiding distraction, thus getting more productive (bravo on the gamification, RescueTime team!). I have discovered patterns in my work through the tool as well, such as the benefits of taking regular breaks and how much a standing desk really helps productivity. It cannot, however, account for the fact that even on productive sites you can find ways to procrastinate. (I am looking at you, StackOverflow.)

  4. Block Site: As much as tracking my time spent on the web helps, sometimes I need to simply block distracting sites. Block Site makes a great complement to RescueTime, allowing me to either visit whitelisted sites only or block blacklisted sites. I have found my productivity levels on RescueTime spike when using Block Site. It is often a reflex action for me to go to Facebook or Twitter without much thinking; Block Site counts such attempts and shames you with clever 404 comments. I love that with the latest update, you can flip one switch to turn off the distracting websites and focus on the task at hand.

  5. LastPass: Despite the rise of alternative logins, passwords are still an integral part of the web. With a million sites, having passwords lying around can be rather dangerous, and remembering them is time consuming and frustrating. LastPass securely stores your passwords for various sites, syncing them across devices. This saves you from rummaging through memory trying to dig up a password or, worse, storing passwords in flat files. It can also auto-login to your chosen sites. One of my favorite features is that it warns you if you are using duplicate passwords across sites, making your web experience more secure.

  6. Adblock Plus: OK, this one is controversial. Adblock Plus stops advertisements on websites, keeping you focused and saving time, particularly with video ads; it can also make certain sites load faster. Ads power the web, of course, and any website with ads as its primary revenue stream should be aware of adblockers. I particularly enjoy the cat-and-mouse chase between YouTube and Adblock: YouTube refuses to play videos as long as Adblock is enabled, Adblock gets smarter, YouTube catches up.

Connected Car: Meticulous.io @Angelhack

Angelhack looks for startups to come out of hackathons rather than just weekend hacks. As a result, it tends to get very product-driven, as teams compete for the Angelpad recognition. The theme this year was Apphack, and mobile apps were the key focus. Not really a surprise, as more and more startups build mobile first.

We worked on the Verizon telematics data, something I did not know about before. Car companies have been working with Verizon to install a device in cars that sends data to the cloud about speed, location, RPM, the condition of the car (e.g. time since the last fill-up), the particular model, and wear and tear. I expect Verizon is able to collect a lot more than the columns we saw. Needless to say, this is a very powerful dataset, and it is surprising that there isn’t more going on in this space.

IPython and Pandas are built for analyzing and visualizing tabular data like this. Questions such as which brand of car gave the best mileage per gallon, or which brand was driven more on the highway versus which was more popular on city roads, were fairly easy to answer. It was not clear if the data given to us was indeed a uniform sample, but the results were interesting. However, they are hardly hackathon-worthy ideas.

We needed an idea. One set of data had trip IDs tied to driver IDs; the other set had per-trip latitudes and longitudes. Aggregating them somehow would be interesting. Since trips could be tied to drivers, a geographical history of most-visited places could be built. Historical driving records could be derived from individual trips and the overall history of a driver measured. Global driving records could be compiled so that the best drivers can be found.

Once we noted those features, the next step was to think about how this would make sense as a product. Context-sensitive software is hugely in demand, and this is a great dataset for revealing drivers’ patterns. Businesses aware of customers frequenting a particular location can push deals and coupons to those individuals to drive business. The time of a visit can be used to fine-tune the deals even more, e.g. lunch deals would make sense around the Grand Central area in NYC for a busy professional. The brand of car can be used to push vehicle-specific ads. This would also be a treasure trove of information for car insurance companies figuring out insurance rates. Finally, it is quite possible that driving patterns could be used to trigger alerts that detect an accident ahead of time.

It’s hard to deny that driving patterns are valuable data. However, as with any user-specific information used for personalization experiences curated by businesses, privacy would be a big concern. It was not entirely clear who would own the data: the driver, the provider, or someone else. The nature of the tracking going on and how the data may be used should be made very clear to the driver. Since data like this could easily be used by a malicious third party to monitor you in real time while you are driving, there is inherent danger in it being available without the driver being informed. Tracking is an issue that users of any cloud-aware system need to be aware of.

We did win the Verizon prize at the hackathon, and it was nice to have our eleventh-hour idea so well received. Angelhack put on a good show, and I must mention the apps “Make it Rain” (literally, on the phone screen) and “Kanye” (which alerts you if you are near Kanye West) for the hilarity during the presentations!