Implementing HTTPOnly in PHP

Coding Horror has an article today about a little-known extension to the HTTP cookie protocol: HTTPOnly.

Essentially, HTTPOnly makes any browser cookies from the site unreadable to JavaScript (in supported browsers anyway: IE7, Opera 9.5, FF3), thus raising the bar for XSS attacks considerably.

So how do we turn it on in PHP?

If you're using a version of PHP older than 5.2:

header("Set-Cookie: hidden=value; httpOnly");

If you're using a new version of PHP (5.2+):

// Either of these options sets the session cookie (used by $_SESSION) to HttpOnly mode
ini_set("session.cookie_httponly", 1);
// or
session_set_cookie_params(0, NULL, NULL, NULL, TRUE);

//Individual cookies can be set using:
setcookie("abc", "test", NULL, NULL, NULL, NULL, TRUE); 
//or
setrawcookie("abc", "test", NULL, NULL, NULL, NULL, TRUE); 

And that's it! One simple line of code (or function argument) added to your header file helps make attacks on your site tougher to execute.

Code snippets courtesy of Ilia Alshanetsky

Visualization Strategies: Text & Documents

Whether it's a campaign speech by a presidential contender or a 300-page bestselling novel, large bodies of text are among the most requested topics for condensing into an infographic.

The purpose can vary from highlighting specific relations to contrasting points or use of language, but all of the following methods focus on distilling a volume of text down to a visualization.

Volumetric Comparisons:

Tag Clouds & Wordles:

A Tag Cloud from Many Eyes

Among the most common visualizations is the so-called 'Tag Cloud', which is just a list of words or descriptions sized by some relevance measure, usually the number of repetitions in the data set.

Tag clouds are often used to highlight and compare themes within the document, such as stripping down a U.S. State of the Union speech to its major keywords (Iraq, budget, and so on).
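
As a rough illustration (a minimal PHP sketch; the input file name and the 10-40px scaling rule are arbitrary choices, not part of any particular tool), a tag cloud boils down to counting word frequencies and mapping the counts to font sizes:

// Count word frequencies and scale them into font sizes for a simple tag cloud
$text = file_get_contents("speech.txt");        // hypothetical input file
$words = str_word_count(strtolower($text), 1);  // split the text into a word list
$counts = array_count_values($words);
arsort($counts);
$counts = array_slice($counts, 0, 50, true);    // keep the 50 most frequent words

$max = max($counts);
foreach ($counts as $word => $count) {
    $size = 10 + (int) (30 * $count / $max);    // scale between 10px and 40px
    echo "<span style=\"font-size: {$size}px\">$word</span> ";
}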

A Sample Wordle from Wordle.net

Closely related to tag clouds are wordles, which are more artistically arranged (and often vibrantly colored) versions of a text. They tend to be less directly insightful as an infographic, but often give a more personal feel to a document.

A number of free tools exist online to create Tag Clouds and Wordles; I've personally found the tag cloud tools at Many Eyes to be excellent, as well as the versions at Swivel. Wordles originated at Wordle.net.

Word Spectrum Diagrams

[caption id="attachment_181" align="aligncenter" width="399" caption="World Spectrum Diagram from the Google Data Set"]World Spectrum Diagram from the Google Data Set[/caption]

Chris Harrison introduced these Word Spectrum Diagrams with a project meant to show related word bi-grams, or common word pairs.

The technique could easily be adapted to show relations between other sets of words, such as common idiomatic phrases used by speakers, or the word preferences of an author.
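
The data behind such a diagram is easy to sketch in PHP (the layout itself is the hard part, and is Harrison's own work; the input file here is hypothetical): simply count how often each adjacent word pair appears:

// Count adjacent word pairs (bigrams) in a body of text
$text = file_get_contents("corpus.txt");        // hypothetical input file
$words = str_word_count(strtolower($text), 1);

$bigrams = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $pair = $words[$i] . " " . $words[$i + 1];
    $bigrams[$pair] = isset($bigrams[$pair]) ? $bigrams[$pair] + 1 : 1;
}
arsort($bigrams);
print_r(array_slice($bigrams, 0, 20, true));    // the 20 most common pairs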

Structure and Document Flow

Document Contrast Diagrams

[caption id="attachment183" align="alignright" width="210" caption="A DCD from Jeff Clark"][A DCD from Jeff Clark](/public/images/2008/08/dcd1_s.png)[/caption]

Document contrast diagrams use the familiar bubble technique and effective use of color to contrast topic usage in two bodies of text.

Unlike many of the other infographic techniques featured here, DCDs help highlight the key differences between the texts as well as the similarities.

They're discussed in more depth at Neoformix.
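
The comparison step underneath a DCD can be approximated in a few lines of PHP (a sketch only, with hypothetical input files; the bubble layout and coloring are where the real design work happens): count each word's frequency in both texts and look at the difference:

// Compare word frequencies across two documents
function word_counts($file) {
    $words = str_word_count(strtolower(file_get_contents($file)), 1);
    return array_count_values($words);
}

$a = word_counts("speech_a.txt");    // hypothetical inputs
$b = word_counts("speech_b.txt");

$diff = array();
foreach (array_unique(array_merge(array_keys($a), array_keys($b))) as $word) {
    $countA = isset($a[$word]) ? $a[$word] : 0;
    $countB = isset($b[$word]) ? $b[$word] : 0;
    $diff[$word] = $countA - $countB;    // positive: favored in A, negative: in B
}
arsort($diff);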

Literary Organism Maps

[caption id="attachment_186" align="alignright" width="118" caption="Literary Organism Map for Kerouac's "On the Road""]Literary Organism Map for Kerouac's "On the Road"[/caption]

This next visualization technique comes from Stefanie Posavec, and while a bit less obvious than the others, offers some intriguing possibilities.

It purports to offer a scaled 'map' of where each chapter goes within a textual context. While not as intuitive as tag clouds or word maps, variations of this style could be used to track 'threaded' text like conversation transcripts.

Her project page has even more amazing work in the same vein, although most of it is more artistic than analytical.

Word Trees

[caption id="attachment_189" align="alignright" width="150" caption="Word Tree for "I Have a Dream""]Word Tree for "I Have a Dream"[/caption]

Word Trees are another document flow visualization from the folks at IBM Many Eyes.

They help provide context to unstructured text, showing the relations between major words and phrases and their follow-ups at a glance.

Font scaling helps to show importance and relative frequency among the parts, and easy searching lets a user follow a path from one concept to another throughout the text body.

By treating follow-up phrases as links (almost like web hyperlinks), a user can easily navigate a speech by subject, which makes following a large body of text by theme a breeze.
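
To get a feel for the data involved (a rough sketch, not Many Eyes' actual implementation; the file and root phrase are placeholders), you can collect the words that immediately follow a chosen root phrase and count them, which gives the first level of branches in the tree:

// Collect the words that immediately follow a root phrase, with counts
$text = strtolower(file_get_contents("speech.txt"));    // hypothetical input
$root = "i have a dream";                               // root phrase to branch from
$words = str_word_count($text, 1);
$rootWords = explode(" ", $root);
$len = count($rootWords);

$branches = array();
for ($i = 0; $i <= count($words) - $len - 1; $i++) {
    if (array_slice($words, $i, $len) == $rootWords) {
        $next = $words[$i + $len];
        $branches[$next] = isset($branches[$next]) ? $branches[$next] + 1 : 1;
    }
}
arsort($branches);
print_r($branches);    // each following word and how often it appears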

An obvious application of this technique would be highlighting correspondence between two parties, whether by letters or telephone/email transcripts.

Document Arc Diagrams

[caption id="attachment195" align="alignright" width="150" caption="A Snippet from a Document Arc Diagram"][A Snippet from a Document Arc Diagram](/public/images/2008/08/documentarc_diagrams.jpg)[/caption]

Another winner from the NeoFormix team is the Document Arc Diagram.

Hovering over a text fragment highlights all the related text fragments, or 'document arcs'. This makes the relations visible at a glance when one sentence may be linked to many contexts or related words.
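
A crude approximation of that 'relatedness' step (a sketch only, not Neoformix's actual method; the input file, the four-letter cutoff, and the two-word threshold are all arbitrary): treat two sentences as linked whenever they share enough significant words:

// Link sentences that share at least two significant words (longer than four letters)
function significant_words($sentence) {
    $keep = array();
    foreach (str_word_count(strtolower($sentence), 1) as $w) {
        if (strlen($w) > 4) { $keep[] = $w; }
    }
    return $keep;
}

$text = file_get_contents("document.txt");           // hypothetical input
$sentences = preg_split('/(?<=[.!?])\s+/', $text);   // naive sentence splitter

$links = array();
for ($i = 0; $i < count($sentences); $i++) {
    for ($j = $i + 1; $j < count($sentences); $j++) {
        $shared = array_intersect(significant_words($sentences[$i]),
                                  significant_words($sentences[$j]));
        if (count($shared) >= 2) {
            $links[] = array($i, $j);    // an 'arc' between sentences $i and $j
        }
    }
}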

The site includes a handy generator so that anyone can build a custom interactive diagram.

Large Corpus Techniques

Transcript Analysis

[caption id="attachment_197" align="alignright" width="150" caption="Transcript Analysis focusing on the word "Iraq""]Transcript Analysis focusing on the word "Iraq"[/caption]

The New York Times offered this visualization after one of the 2008 Democratic primary debates.

Annotated speech blocks showed the frequency of custom words, and highlighted occurrences and context.

Although this required a highly annotated text, it allowed a depth of searching and comparison that the other techniques can't match.

The biggest challenge to recreating this for other purposes would be the large amount of document tagging required (presumably by hand), although certain transcript type data could probably be automatically tagged and fed into a similar system.
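
The simplest piece to automate for transcript-style data is finding every occurrence of a chosen word along with a little surrounding context (a minimal sketch; the file name, keyword, and five-word window are placeholders, and the Times' hand annotation is obviously far richer):

// Find each occurrence of a keyword and print a window of surrounding words
$text = file_get_contents("transcript.txt");    // hypothetical input
$keyword = "iraq";
$words = str_word_count(strtolower($text), 1);

foreach ($words as $i => $word) {
    if ($word == $keyword) {
        $context = array_slice($words, max(0, $i - 5), 11);    // 5 words either side
        echo implode(" ", $context) . "\n";
    }
}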

Directed Sentence Diagrams

[caption id="attachment199" align="alignright" width="150" caption="A Sample Directed Sentence Diagram from Neoformix.com"][A Sample Directed Sentence Diagram from Neoformix.com](/public/images/2008/08/neosentdrawsotu2000.png)[/caption]

The final technique is a bit unconventional, but Directed Sentence Diagrams (again from Neoformix) are designed to show the topic 'flow' in a body of work via color and Cartesian length.

While sentence drawings aren't new, the idea of 'directing' them as well as color coding them to show sentence length and topic imparts much more information into the space.

The sparse filling and line lengths give a great overall picture of the proportion of the document (or speaking time) given over to each topic, and a careful analysis allows the reader to follow the outline of an argument's points or the meandering of a story's subjects.

Perhaps even more so than the other diagrams, this one requires a well-annotated text, since every sentence must have metadata on its subject, but for certain sample sets (particularly persuasive speeches or editorial-style arguments) the technique is particularly effective.

Formatting the Data

One of the things in common with all text visualization techniques is effective use of sizing and color. Gradients and gradual font sizes can show relative importance, while opposing colors and sharp contrast can highlight points of contention.

Other primary challenges for the designer include trimming the text down to its vital elements, and pre-processing the data to remove stop words or apply some sort of stemming to ensure a clean final product. At a bare minimum, removal of common words such as 'he' and 'the' is generally necessary for any word-prevalence visualization.

Automated text processing is often critical with larger bodies, whether it be simple techniques like using PHP's str_replace function on particles and stop words, or more advanced methods like stemming with Python's Natural Language Toolkit.
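
A minimal sketch of that first pass in PHP (using array_diff() on a tokenized word list rather than str_replace() on the raw string, and with a stop-word list that's only a stand-in for a much longer one):

// Strip a small set of stop words before counting frequencies
$stopWords = array("the", "a", "an", "he", "she", "it", "and", "or", "of", "to", "in");

$text = strtolower(file_get_contents("speech.txt"));    // hypothetical input
$words = str_word_count($text, 1);
$filtered = array_diff($words, $stopWords);

$counts = array_count_values($filtered);
arsort($counts);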

Visualizing Similarity with Circos

The magazine American Scientist has a cover image this month featuring a circular visualization of similarity between the human and canine genomes.

A Sample Image generated from Circos

It's done using the Circos project, a Perl visualization framework for linked data (it's written with an eye towards genomics, but appears to work for any tabular data).

They even have an online generator that takes tabular data and offers a range of settings. If you're confused about how to read the charts, the about page explains it.

The output seems a little busy, but that's probably mainly a consequence of the input data. I'm planning to give it a try tonight and report back with my findings.

Preventing Wordpress Post Updates from Changing RSS

A minor WordPress annoyance I've run across lately is that every time you update a post, the date on the post changes, so it moves to the top of your RSS feed. Thankfully, Ciaran Gultnieks has a solution.

Implied Copyright Infringement and File Sharing Protocols

Wired.com's Threat Level blog is reporting that the judge in the only file sharing case to actually go to court is holding a hearing today on the issue of 'implied' copyright infringement: the act of making copyrighted files available when the copyright holder has no actual evidence that they were ever downloaded.

According to the article "The RIAA and the Motion Picture Association of America have told the judge that it's impossible to know whether others had downloaded copyrighted music from Thomas' Kazaa share folder."

Supposing the judge rules that the RIAA needs to have actual proof that the files were downloaded, what would that mean for current file sharing protocols?

I'm NOT a low-level network programmer, nor am I a lawyer, so this is just informed conjecture. Anyone with more intimate knowledge of the protocols or legal statutes involved is invited to leave a comment correcting any mistakes/misinterpretations.

Also, if it isn't obvious, this is solely a discussion of the laws as they apply in the U.S.

Protocols:

Sharing via the Web (Personal Pages, Rapidshare, MegaUpload, etc):

This would be the most cut-and-dried of cases if the world of the web were standardized and every ISP/server kept traffic logs. The RIAA could just find a site that is sharing copyrighted files (and that is hosted in the United States), and subpoena the ISP for download logs.

Unfortunately for the recording industry, standards on log file retention vary widely, and are easily configurable by the site owner themselves in most cases. The chances of any site still having log files proving infringement by the time the legal system gets around to them are slim (even if it's just a notice from the RIAA telling the site owner to retain all logs for possible coming legal action).

Currently under provisions of the DMCA the industry just sends a 'takedown notice' requesting the material be removed, and things will likely continue to operate this way in the future, regardless of how the judge rules in the current case.

Peer-to-Peer (Kazaa/Limewire/Gnutella):

All of these protocols involve one user sharing a file, central servers indexing that file, and users contacting the sharer and negotiating the download. If there are logs of each individual download, they would reside on the sharer's computer (and possibly the sharee's) and be trivial to erase.

Critically, there is no way for a third party to detect how many times a file has been shared, or to prove that it has ever been shared, short of downloading a copy themselves (which is worthless, because the judge in the case specifically noted that "You can't infringe your own copyright").

In theory, if there is a local log file of all transfers, the RIAA would need to subpoena the ISP to get the IP address of the sharer (as they already do), and then get access to the alleged sharer's hard drive. This adds another (major) step to the process, seriously complicating things. In fact, it would likely make the current style of scattershot lawsuits prohibitive to continue, since any subpoena-based seizure that didn't pan out could leave the RIAA liable for the time and effort expended by the defendant.

Bittorrent:

Under the current implementation, every bittorrent peer in the 'swarm' can see what pieces of the material being shared each other peer has.

It would, however, be difficult to determine beyond doubt that a user downloaded a complete copy of a file, since anyone else with the exact same file could in theory join the swarm and begin seeding without downloading a copy from the original source.

Presumably the only 'proof' of a complete share would come from watching a peer join the swarm, start with zero chunks, then gain each chunk from one seeder and eventually begin to seed the entire file. Not only would this be possible to monitor for, it would be reasonably trivial to write a third-party monitoring program to log it.

A large part of the bittorrent protocol revolves around grabbing multiple pieces from different sources (seeders), and the legal question of sharing only part of a copyrighted file is still unresolved. This means the industry would either have to argue for a new legal precedent, or work only with the (substantially smaller) number of cases where a file is distributed solely from one peer.

Protocol encryption has no bearing on this discussion, since, although it masks the traffic, by nature the swarm must be aware of the other members' IP addresses. Encryption merely masks the data in transit.

The requirement of publicizing IP addresses within the swarm is widely known as one of the primary weaknesses of the bittorrent protocol, and a few groups are working on replacements, but none are widely distributed (yet).

Can't the RIAA just download the files?

The judge in this case has ruled that any downloads the RIAA makes can *not* be considered unauthorized downloads, because the RIAA authorized them by the very act of downloading them. This means they have to 'catch' a third-party download in progress.

Those of a conspiracy bent might suggest that, over a more open protocol like bittorrent, the RIAA would just have a third party such as MediaDefender download the files while the RIAA monitors the transactions. This would be against the law (since the RIAA authorized it and failed to reveal that in court), not to mention a gross violation of legal ethics, but both groups have been known to engage in shady tactics in the past.

So how does this affect sharers?

If the judge does rule against implied infringement, it will raise the bar substantially for future lawsuits and (more critically) endanger any current cases that may go to trial (and perhaps make a number of the accused rethink settling).

On a user level, it's doubtful we'll see a return to Kazaa-style sharing, considering the massive benefits of bittorrent for the (at the moment, only slightly increased) risk.

Much of the legal precedent in this area is yet to be set, since the congressional legislation is by turns both sweepingly broad and not particularly technical. By the time the law catches up with the current technology, the market (both legal and illegal) will have moved on to the next big thing.