Visualization Strategies: Text & Documents

Whether it's a campaign speech by a presidential contender or a 300-page bestselling novel, large bodies of text are among the most requested subjects for condensing into an infographic.

The purpose can vary from highlighting specific relations to contrasting points or use of language, but all of the following methods focus on distilling a volume of text down to a visualization.

Volumetric Comparisons

Tag Clouds & Wordles

A Tag Cloud from Many Eyes

Among the most common visualizations is the so-called 'Tag Cloud': a list of words or descriptions sized by some relevance measure, usually the number of repetitions in the data set.

Tag clouds are often used to highlight and compare themes within a document, for example by stripping a U.S. State of the Union address down to major keywords such as Iraq, Budget, etc.
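
The sizing step can be sketched in a few lines of Python. This is only an illustration of scaling words by repetition count, not how Many Eyes or any other tool does it; the function name, the point sizes, and the sample text are all made up for the example.

```python
import re
from collections import Counter


def tag_cloud_sizes(text, top_n=10, min_pt=12, max_pt=48):
    """Map the top_n most frequent words to font sizes in points."""
    # Lowercase the text and pull out runs of letters (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = highest - lowest or 1  # avoid dividing by zero when all counts match
    # Linear interpolation between min_pt and max_pt by repetition count.
    return {
        word: round(min_pt + (count - lowest) / span * (max_pt - min_pt))
        for word, count in counts
    }


# Hypothetical sample text, not taken from a real speech.
sample = "budget iraq budget economy iraq budget"
sizes = tag_cloud_sizes(sample)
print(sizes)
```

In practice the stop word removal discussed under "Formatting the Data" would run first, otherwise words like "the" dominate the cloud.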

A Sample Wordle from Wordle.net

Closely related to tag clouds are wordles, which are more artistically arranged (and often vibrantly colored) versions of a text. They tend to be less directly insightful as an infographic, but often give a more personal feel to a document.

A number of free online tools exist to create tag clouds and wordles. I've personally found the tag cloud tools at Many Eyes to be excellent, as well as the versions at Swivel. Wordles originated at Wordle.net.

Word Spectrum Diagrams

Word Spectrum Diagram from the Google Data Set

Chris Harrison introduced Word Spectrum Diagrams in a project meant to show related word bigrams, or common word pairs.

The technique could easily be adapted to show relations between other sets of words, such as common idiomatic phrases used by speakers, or the word preferences of an author.
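
For anyone wanting to try this on their own text, the raw material is just a count of adjacent word pairs. The snippet below is a minimal sketch of that counting step only (the function name and sample sentence are invented for illustration); it says nothing about how the original project laid out or weighted its pairs.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs (bigrams) in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    # zip pairs each word with the word that follows it.
    return Counter(zip(words, words[1:]))


# Hypothetical sample sentence.
pairs = bigram_counts("war and peace and war and peace")
for pair, count in pairs.most_common(3):
    print(pair, count)
```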

Structure and Document Flow

Document Contrast Diagrams

[A DCD from Jeff Clark](/public/images/2008/08/dcd1_s.png)

Document contrast diagrams use the familiar bubble technique and effective use of color to contrast topic usage in two bodies of text.

Unlike many of the other infographic techniques featured here, DCDs highlight key differences between the texts as well as the similarities.

They're discussed in more depth at Neoformix.
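
The data behind such a contrast can be approximated by splitting the vocabulary of two texts into words unique to each and words they share. The following is a rough sketch under that assumption, not the method used at Neoformix; the function name and sample strings are hypothetical.

```python
import re
from collections import Counter


def contrast(text_a, text_b):
    """Split vocabulary into words unique to each text and words both use."""
    count_a = Counter(re.findall(r"[a-z]+", text_a.lower()))
    count_b = Counter(re.findall(r"[a-z]+", text_b.lower()))
    # Shared words keep both counts so a bubble could be sized per document.
    shared = {w: (count_a[w], count_b[w]) for w in count_a.keys() & count_b.keys()}
    only_a = {w: c for w, c in count_a.items() if w not in count_b}
    only_b = {w: c for w, c in count_b.items() if w not in count_a}
    return only_a, shared, only_b


# Hypothetical sample texts.
only_a, shared, only_b = contrast("taxes jobs jobs", "jobs war")
print(only_a, shared, only_b)
```

Each of the three groups could then be drawn in its own color, with bubble size driven by the counts.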

Literary Organism Maps

Literary Organism Map for Kerouac's "On the Road"

This next visualization technique comes from Stefanie Posavec, and while a bit less obvious than the others, offers some intriguing possibilities.

It purports to offer a scaled 'map' of where each chapter goes within a textual context. While not as intuitive as tag clouds or word maps, variations of this style could be used to track 'threaded' text like conversation transcripts.

Her project page has even more amazing work in the same vein, although most of it is more artistic than analytical.

Word Trees

Word Tree for "I Have a Dream"

Word Trees are another document flow visualization from the folks at IBM Many Eyes.

They help provide context to unstructured text, showing the relations between major words and phrases and their follow-ups at a glance.

Font scaling helps to show importance and relative frequency among the parts, and easy searching lets a user follow a path from one concept to another throughout the text body.

By treating follow-up phrases as links (almost like a web hyperlink), a user can easily navigate a speech by subject, which makes following a large body of text by theme a breeze.

An obvious application of this technique would be highlighting correspondence between two parties, whether by letters or telephone/email transcripts.
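
The underlying structure is simple enough to sketch: for every occurrence of a root word, record the words that follow it as branches and count how often each branch is taken. The code below is an illustrative guess at that structure, not the Many Eyes implementation; the function name, the `depth` parameter, and the sample text are assumptions, and sentence boundaries are ignored for brevity.

```python
import re


def word_tree(text, root, depth=3):
    """Build a nested dict of the words that follow each occurrence of root."""
    words = re.findall(r"[a-z']+", text.lower())
    tree = {"count": 0, "next": {}}
    for i, word in enumerate(words):
        if word != root:
            continue
        tree["count"] += 1
        node = tree
        # Walk up to `depth` follow-up words, creating branches as needed.
        for follower in words[i + 1 : i + 1 + depth]:
            node = node["next"].setdefault(follower, {"count": 0, "next": {}})
            node["count"] += 1
    return tree


# Hypothetical sample text loosely modeled on a repeated phrase.
speech = "I have a dream that one day. I have a dream today."
tree = word_tree(speech, "dream", depth=2)
print(tree)
```

The `count` on each branch is what a renderer would use for font scaling.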

Document Arc Diagrams

[A Snippet from a Document Arc Diagram](/public/images/2008/08/documentarc_diagrams.jpg)

Another winner from the Neoformix team is the Document Arc Diagram.

Hovering over a text fragment highlights all the related text fragments, or 'document arcs'. This makes the relations visible at a glance, even when one sentence is linked to many contexts or related words.

The site includes a handy generator so that anyone can build a custom interactive diagram.

Large Corpus Techniques

Transcript Analysis

Transcript Analysis focusing on the word "Iraq"

The New York Times offered this visualization after one of the 2008 Democratic primary debates.

Annotated speech blocks showed the frequency of custom words, and highlighted occurrences and context.

Although this required a highly annotated text, it allowed unmatched in-depth searching and comparison of the text.

The biggest challenge to recreating this for other purposes would be the large amount of document tagging required (presumably by hand), although certain transcript-type data could probably be automatically tagged and fed into a similar system.
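
As a rough idea of what automatic tagging could look like for a transcript that already identifies its speakers, the sketch below counts a keyword per speaker and keeps a few words of context around each hit. The `(speaker, text)` tuple format, the function name, and the sample lines are all hypothetical; this is not how the New York Times built its graphic.

```python
import re


def keyword_hits(transcript, keyword, window=3):
    """Return per-speaker counts and short context snippets for a keyword.

    `transcript` is a list of (speaker, text) tuples, a made-up format.
    """
    counts = {}
    snippets = []
    for speaker, text in transcript:
        words = re.findall(r"[\w']+", text)
        for i, word in enumerate(words):
            if word.lower() == keyword.lower():
                counts[speaker] = counts.get(speaker, 0) + 1
                # Keep up to `window` words on either side for context.
                start = max(0, i - window)
                snippets.append((speaker, " ".join(words[start : i + window + 1])))
    return counts, snippets


# Invented sample lines, not quotes from any real debate.
debate = [
    ("Speaker A", "We must end the war in Iraq now."),
    ("Speaker B", "Iraq is not the only issue. The economy matters more than Iraq."),
]
counts, snippets = keyword_hits(debate, "Iraq")
print(counts)
print(snippets)
```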

Directed Sentence Diagrams

[A Sample Directed Sentence Diagram from Neoformix.com](/public/images/2008/08/neosentdrawsotu2000.png)

The final technique is a bit unconventional, but Directed Sentence Diagrams (again from Neoformix) are designed to show the topic 'flow' in a body of work via color and Cartesian length.

While sentence drawings aren't new, the idea of 'directing' them as well as color coding them to show sentence length and topic imparts much more information into the space.

The sparse filling and line lengths give a great overall picture of the percentage of the document (or speaking time) devoted to a given topic, and a careful analysis allows the reader to follow the outline of an argument's points or the meandering of a story's subjects.

Perhaps even more so than the other diagrams, this one requires a well-annotated text, since every sentence must have metadata on its subject, but for certain sample sets (particularly persuasive speeches or editorial-style arguments) the technique is particularly effective.
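
To make the annotation requirement concrete, here is a small sketch that takes sentences already tagged with a topic, draws one line segment per sentence (length from word count, direction from topic), and totals the share of words per topic. The topic-to-direction table, function names, and sample sentences are invented for illustration, and the real Neoformix diagrams may map topics and lengths quite differently.

```python
# Hypothetical mapping from topic to a unit direction on the page.
DIRECTIONS = {"economy": (1, 0), "war": (0, 1), "health": (-1, 0)}


def sentence_path(annotated):
    """Turn (topic, sentence) pairs into the corner points of a drawn path."""
    x, y = 0, 0
    points = [(x, y)]
    for topic, sentence in annotated:
        dx, dy = DIRECTIONS[topic]
        length = len(sentence.split())  # segment length = word count
        x, y = x + dx * length, y + dy * length
        points.append((x, y))
    return points


def topic_share(annotated):
    """Fraction of all words spent on each topic."""
    totals = {}
    for topic, sentence in annotated:
        totals[topic] = totals.get(topic, 0) + len(sentence.split())
    grand_total = sum(totals.values())
    return {topic: words / grand_total for topic, words in totals.items()}


# Invented, hand-tagged sample sentences.
annotated = [
    ("economy", "Jobs are growing again"),
    ("war", "The war must end"),
    ("economy", "Taxes will fall"),
]
points = sentence_path(annotated)
shares = topic_share(annotated)
print(points)
print(shares)
```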

Formatting the Data

One thing all text visualization techniques have in common is effective use of sizing and color. Gradients and gradual font sizes can show relative importance, while opposing colors and sharp contrast can highlight points of contention.

Other primary challenges for the designer include trimming the text down to its vital elements, and pre-processing the data to remove stop words or apply some sort of stemming to ensure a clean final product. At bare minimum, removal of common function words such as 'he' and 'the' is generally necessary for any word prevalence visualization.

Automated text processing is often critical with larger bodies of text, whether through simple techniques like using PHP's str_replace function to strip stop words, or more advanced methods like stemming with Python's Natural Language Toolkit (NLTK).
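
A bare-bones version of that clean-up might look like the following. The stop word list is a tiny illustrative sample and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as the ones NLTK provides; both are assumptions made to keep the example dependency-free.

```python
import re

# A tiny illustrative stop word list; real lists run to hundreds of words.
STOP_WORDS = {"a", "an", "and", "he", "in", "is", "of", "she", "the", "to"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tokens(text):
    """Lowercase, drop stop words, and reduce the rest to rough stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


# Hypothetical sample sentence.
tokens = clean_tokens("He walked to the markets and the walking paths")
print(tokens)
```

The cleaned tokens can then be fed into any of the counting steps used by the visualizations above.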