EzDevInfo.com

word_cloud

A little word cloud generator in Python

geom_wordcloud : is this a pipe dream

I deal a bit with textual data across various grouping variables. I'm thinking of creating a method to make faceted wordcloud plots using Ian Fellows' wordcloud package. I like the way ggplot2 facets social variables. I'm deciding how to approach this problem (faceted wordcloud plot).

Is it possible to use Fellows' work as a geom (I've never made a geom but may learn if this is doable) or will ggplot not play nicely because one is grid and one is base (and wordcloud also uses some C coding) or some other problem? How difficult is this (I know this is dependent on my abilities but would like some ball park answer)? Please advise if using base graphics may be the more sensible approach to this problem. I foresee this may be approached using panes from the plotrix package to give it the aesthetic feel that ggplot's faceting gives.

Maybe this is a foolish concept considering the size of word clouds and the way faceting quickly limits the available space.


Source: (StackOverflow)

change specific word color in wordcloud

I would like to build a word cloud with R (I have done so with the package wordcloud) and then color specific words a certain color. Currently the behavior of the function is to color words according to frequency (which can be useful) but word size already does this so I'd want to use color for additional meaning.

Any idea on how to color specific words in wordcloud? (If there's another wordcloud function in R I'm unaware of I'm more than willing to go that route.)

A mock example and my attempt (I tried to treat the color argument in the same manner I would for a regular plot from the plot function):

library(wordcloud)

x <- paste(rep("how do keep the two words as one chunk in the word cloud", 3), 
           collapse = " ")
X <- data.frame(table(strsplit(x, " ")))
COL <- ifelse(X$Var1 %in% c("word", "cloud", "words"), "red", "black")
wordcloud(X$Var1, X$Freq, color=COL)

EDIT: I wanted to add that the new version of wordcloud (Jan 10, 2010; version 2.0) [Thank you Ian Fellows & David Robinson] now has this feature along with some other terrific additions. Here is the code to accomplish the original goal within wordcloud:

wordcloud(X$Var1, X$Freq, color=COL, ordered.colors=TRUE, random.color=FALSE)

Source: (StackOverflow)


Word cloud generator for Rails

Is there a Ruby/Rails library that I can use to generate word clouds (output should be an image file) like those on Wordle.net?


Source: (StackOverflow)

php word cloud translation

I have created a word cloud and I'd like to add a translation function to my webpage. The word cloud shows many words in different colors and font sizes. My system analyzes a text, generates the word cloud words, and then echoes them out.

I used Google Translate and the result looks like this: (screenshot). As you can see, the words I actually want translated remain unchanged. How can I solve this problem?

Thank you


Source: (StackOverflow)

D3: Using force layout for word clouds

I'm working on a tag visualization where tags transition between different force-directed layouts.

I had few issues figuring out how to transition from a bubble chart to a node chart, but I'm a bit stuck as to how to get the charts to transition into a word cloud. My difficulties largely stem from my inexperience at writing custom clustering/collision detection functions.

I declare the forces as globals and then stop and start them when the user clicks a button:

var force1 = d3.layout.force()
    .size([width, height])
    .charge(0)
    .gravity(0.02)
    .on("tick", ticka);

//layout for node chart
var force2 = d3.layout.force()
    .size([width, height])
    .charge(-50)
    .gravity(0.005)
    .linkDistance(120)
    .on("tick", tickb);

//layout for bubble chart
var force3 = d3.layout.force()
    .size([width, height])
    .charge(0)
    .gravity(0.02)
    .on("tick", tickc);

Relevant node/link functions are added to the force when a function that draws the nodes is called (as data changes according to a slider value).

The code for creating node data is as follows:

nodes = splicedCounts.map(function(d, e) {
    var choice;
    var i = 0,
    r = d[1],
    d = { count: d[1],
          sentiment: d[2]/d[1],
          cluster: i,
          radius: radScale(r),
          name: d[0],
          index: e,
          x: Math.cos(i / m * 2 * Math.PI) * 200 + width / 2 + Math.random(),
          y: Math.sin(i / m * 2 * Math.PI) * 200 + height / 2 + Math.random()
    };
    if (!clusters[i] || (r > clusters[i].radius))
        clusters[i] = d;
    return d;
});

In order to keep this question relatively brief, the code I use for drawing the bubble chart is derivative of this example: http://bl.ocks.org/mbostock/7881887 and the code for drawing the node chart is similarly generic (I am happy to provide this code if it would help to solve my issue).

This is where my issue comes in:

I found this nice example for collision detection between rectangles and incorporated it into my code. However, since I'm using svg text and the font size changes on transition, I opted to estimate the text size/bounding box size based on text-length and radius.

The entire "tick" function for the word chart is below.

function tickc(e) {
    node = nodeGroup.selectAll(".node");
    var nodeText = nodeGroup.selectAll(".node text");
    node.each(cluster(5 * e.alpha * e.alpha));
    var k = e.alpha;
    nodeText.each(function(a, i) {
        var compWidth = d3.select(this).attr("bWidth");
        var compHeight = d3.select(this).attr("bHeight");
        nodes.slice(i + 1).forEach(function(b) {
          // console.log(a);
          var lineWidthA = a["name"].length * a["radius"]/2.5;
          var lineHeightA = a["radius"]/0.9;

          var lineWidthB = b["name"].length * b["radius"]/2.5;
          var lineHeightB = b["radius"]/0.9;
          dx =  (a.x - b.x)
          dy =  (a.y - b.y)    
          adx = Math.abs(dx)
          ady = Math.abs(dy)
          mdx = (1 + 0.07) * (lineWidthA + lineWidthB)/2
          mdy = (1 + 0.07) * (lineHeightA + lineHeightB)/2
          if (adx < mdx  &&  ady < mdy) {          
            l = Math.sqrt(dx * dx + dy * dy)

            lx = (adx - mdx) / l * k
            ly = (ady - mdy) / l * k

            // choose the direction with less overlap
            if (lx > ly  &&  ly > 0)
                 lx = 0;
            else if (ly > lx  &&  lx > 0)
                 ly = 0;

            dx *= lx
            dy *= ly
            a.x -= dx
            a.y -= dy
            b.x += dx
            b.y += dy
          }
        });
  });
node.select("circle")
    .attr("cx", function(d) { return d.x; })
    .attr("cy", function(d) { return d.y; });
node.select("text")
    .attr("x", function(d) { return d.x; })
    .attr("y", function(d) { return d.y; });
}
// Move d to be adjacent to the cluster node.
function cluster2(alpha) {
  return function(d) {
    var cluster = clusters[d.cluster];
    if (cluster === d) return;
    var x = d.x - cluster.x,
    y = d.y - cluster.y,
    l = Math.sqrt(x * x + y * y),
    r = (d["name"].length * d["radius"]) + (cluster["name"].length * cluster["radius"]);

  };
}

I was unsure of how to conclude the clustering function so as to move the nodes appropriately. I tried to adapt the standard cluster function, i.e.

// Move d to be adjacent to the cluster node.
function cluster(alpha) {
  return function(d) {
    var cluster = clusters[d.cluster];
    if (cluster === d) return;
    var x = d.x - cluster.x,
        y = d.y - cluster.y,
        l = Math.sqrt(x * x + y * y),
        r = d.radius + cluster.radius;
    if (l != r) {
      l = (l - r) / l * alpha;
      d.x -= x *= l;
      d.y -= y *= l;
      cluster.x += x;
      cluster.y += y;
    }
  };
} 

to be more similar to the aforementioned rectangular cluster force layout but without luck (I'm afraid I no longer have copies of my exact attempts).

I'm afraid I can't attach images due to my lack of reputation but I can try to find a way to provide them if it would help. The overlap problem with the word cloud is minor (most words resolve into adjacent but not touching positions) but, if possible, I'd like it to resolve as perfectly as the bubble chart. I'm pretty sure that these issues arose from a.) the unfinished cluster function and b.) my hack at using text length and radius to estimate text size rather than proper bounding box coords, but I'm not sure exactly how to fix these things.
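As a point of comparison for the rectangle collision step described above, here is a minimal sketch of one common way to separate two overlapping label boxes: measure the overlap on each axis and push the pair apart along the axis where the overlap is smaller. It is not the linked example's code. The helper names `estimateBox` and `separate` are made up for illustration, and the box estimate simply reuses the text-length and radius ratios from the tick function in the question.

```javascript
// Hypothetical helper: estimate a label's box from its text length and radius,
// using the same rough ratios as the tick function in the question.
function estimateBox(node) {
  return {
    width: node.name.length * node.radius / 2.5,
    height: node.radius / 0.9
  };
}

// Push two overlapping labels apart along the axis with the smaller overlap.
// `strength` (0..1) plays the role of the force layout's alpha.
// Returns true if the pair overlapped and was moved, false otherwise.
function separate(a, b, strength) {
  var boxA = estimateBox(a), boxB = estimateBox(b);
  var dx = a.x - b.x, dy = a.y - b.y;
  var overlapX = (boxA.width + boxB.width) / 2 - Math.abs(dx);
  var overlapY = (boxA.height + boxB.height) / 2 - Math.abs(dy);
  if (overlapX <= 0 || overlapY <= 0) return false; // boxes do not intersect

  if (overlapX < overlapY) {
    // Each node moves half of the overlap, in opposite directions.
    var shiftX = overlapX / 2 * strength * (dx < 0 ? -1 : 1);
    a.x += shiftX;
    b.x -= shiftX;
  } else {
    var shiftY = overlapY / 2 * strength * (dy < 0 ? -1 : 1);
    a.y += shiftY;
    b.y -= shiftY;
  }
  return true;
}
```

Inside a tick handler this would be called for each pair of nodes with `e.alpha` as the strength. Measuring the rendered text (for example with `getBBox()` on the SVG text element) instead of estimating it would likely give tighter results, at some cost per tick.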


Source: (StackOverflow)

How to create a word cloud from a corpus in Python?

In "Creating a subset of words from a corpus in R", the answerer easily converts a term-document matrix into a word cloud.

Is there a similar function in a Python library that turns a raw text file, an NLTK corpus, or a Gensim MmCorpus into a word cloud?

The result should look somewhat like this: (image)


Source: (StackOverflow)

Controlling word placement in D3 cloud -- bundling words closer together based on some attribute

I am working on an interactive implementation of a D3 tag cloud that relies on each term having its own category or class. I've managed to include a category attribute to the terms by modifying d3.layout.cloud.js like this:

cloud.start = function() {
  var board = zeroArray((size[0] >> 5) * size[1]),
      bounds = null,
      n = words.length,
      i = -1,
      tags = [],
      data = words.map(function(d, i) {
    return {
      // Added by me
      epidem_category: d['epidem_category'],
      other_details: d['other_details'],
      id: d['id'],
      ///////
      text: text.call(this, d, i),
      size: ~~fontSize.call(this, d, i),
      font: font.call(this, d, i),
      rotate: rotate.call(this, d, i),
      padding: cloudPadding.call(this, d, i)
    };
  }).sort(function(a, b) { return b.size - a.size; });

I can now access d.epidem_category as well as d.id when drawing the cloud to give certain categories either a different fill color or rotation value:

 canvas
 .selectAll("text")
 .data(words)
 .enter()
 .append("text")
 .style("font-size", function(d) { return d.size + "px"; })
 .style("font-family", 'Gentium Book Basic')
 .style("fill", function(d, i) { return entity_cloud.set_color(d.epidem_category);})
 .attr("text-anchor", "middle")
 .attr("transform", function(d) {
   return "translate(" + [d.x, d.y] + ")rotate(" + entity_cloud.set_rotation(d.epidem_category) + ")";
   })
 .attr("class", function(d) { return d.epidem_category })
 .attr("id", function(d, i) { return d.id })
 .text(function(d) { return d.text; })

My problem is that I would now also like to control the placement of the words -- I would like all terms of the same category to appear bundled together in the cloud. I thought I might be able to control this by reordering my input array by category, based on the algorithm described on Jason Davies's tag cloud demo page:

Attempt to place the word at some starting point: usually near the middle, or somewhere on a central horizontal line.

So by that logic, if the first 10 or so words are of the same category, they should appear bundled together somewhere in the middle, and the other categories would follow in a circular pattern. Testing this did not produce the anticipated result, however. In fact, I could see hardly any change in the layout at all.

Does anyone have any ideas of how to achieve a layout where terms are bundled together based on some attribute?


Source: (StackOverflow)

Creating a Corpus with Spanish Text in R

Trying to do some text-mining and wordcloud visualization on Spanish text. I actually have 9 different .txt files, but will just post one for reproduction.

"Nos los representantes del pueblo de la Nación ARGENTINA, reunidos en Congreso General Constituyente por voluntad y elección de las provincias que la componen, en cumplimiento de pactos preexistentes, con el objeto de constituir la unión nacional, afianzar la justicia, consolidar la paz interior, proveer la defensa común, promover el bienestar general, y asegurar los beneficios de la libertad, para nosotros, para nuestra posteridad, y para todos los hombres del mundo que quieran habitar en el suelo argentino: invocando la protección de Dios, fuente de toda razón y justicia: ordenamos, decretamos y establecemos esta Constitución, para la Nación ARGENTINA."

The file is saved as a .txt file. Below is my naïve attempt to generate the term-document-matrix with the correct encoding. When I inspect it, I am not getting the text as it is in the original file ("constitución" becomes "constitucif3n," for example). I'm new to text-mining, and knowing that the solution probably involves a wide variety of co-dependent adjustments, I figured I'd ask here instead of searching for 4 hours. Thanks in advance.

#Generate Term-Document-Matrix

#Convert Text to Corpus and Clean
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("spanish"))
  return(corpus.tmp)
}

generateTDM <- function(path) {
  cor.tmp <- Corpus(DirSource(directory=path, encoding="ISO8859-1"))
  cor.cl <- cleanCorpus(cor.tmp)
  tdm.tmp <- TermDocumentMatrix(cor.cl)
  tdm.s <- removeSparseTerms(tdm.tmp, 0.7)
}

tdm <- generateTDM(pathname)
tdm.m <- as.matrix(tdm)

Source: (StackOverflow)

d3 cloud layout - show all tags

Background: I'm currently working on a small web project for data visualization, and I want to create a tag cloud or word cloud with the cloud layout of the JavaScript D3 framework (d3 cloud layout).

I've built some examples that almost satisfy my requirements. The only thing that doesn't work is that some words/tags aren't displayed in my tag cloud. As far as I understand the placement algorithm, this happens because it finds no suitable place to position all tags without overlapping other tags.

My question: How can I display all available tags, and is there a setting in the framework to do this? I'm rather new to JavaScript, so I had quite a hard time understanding the whole cloud layout and the positioning algorithm well enough to find a way to achieve my goal.


Source: (StackOverflow)

Responsive width with wordcloud2.js (canvas html5 element)

With wordcloud2.js you can create beautiful and easy wordclouds on canvas-elements. I don't really have problems with this script, actually only with the canvas-element in general: I'd like to have a responsive width (in this case relating to the browser-width).

It shows the correct width (100%), but the canvas is just upscaled and the "image" is distorted. If I save the "png" it has the old/basic resolution given by the script.

How to fix it?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>canvas</title>
<script src="//code.jquery.com/jquery-1.11.0.min.js"></script>
<script src="//code.jquery.com/jquery-migrate-1.2.1.min.js"></script>
<link href='http://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>
<script src="js/wordcloud2.js"></script>

<style type="text/css">    
#canvas_cloud{
width: 100%;
height:500px;
}
</style>

</head>

<body>

<canvas id="canvas_cloud"></canvas>

<script>

var options = 
{
  list : [ 
  ["Pear", "9"],
  ["Grape", "3"],
  ["Pineapple", "8"], 
  ["Apple", "5"]
  ],
  gridSize: Math.round(16 * document.getElementById('canvas_cloud').offsetWidth / 1024),
  weightFactor: function (size) {
    return Math.pow(size, 1.9) * document.getElementById('canvas_cloud').offsetWidth / 1024;
  }
}

WordCloud(document.getElementById('canvas_cloud'), options); 

</script>

</body>
</html>
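Some background that may help here: a canvas has two sizes, the size it is displayed at (set by CSS) and the size of its drawing buffer (set by the `width` and `height` attributes, which default to 300 by 150). Setting only the CSS size stretches the default buffer, which matches the distortion described above. The sketch below copies the displayed size into the buffer size before drawing. The function name `syncCanvasSize` is made up for illustration and this is not part of wordcloud2.js.

```javascript
// Match the canvas drawing buffer to its displayed (CSS) size.
// `canvas` only needs offsetWidth/offsetHeight and writable width/height,
// which a real <canvas> element provides.
function syncCanvasSize(canvas) {
  canvas.width = canvas.offsetWidth;
  canvas.height = canvas.offsetHeight;
  return canvas;
}

// In the page this would run before drawing, and again after a resize:
//   var el = syncCanvasSize(document.getElementById('canvas_cloud'));
//   WordCloud(el, options);
```

Assigning to `width` or `height` clears the canvas, so the cloud has to be redrawn after each resize.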

Source: (StackOverflow)

Scatterplot with two legends with corresponding colors

I've got the following plot: (image)

So I've got two groups of words that I've coloured in red and blue. Besides that I've got two legends where each legend corresponds to one group.

My code is as follows:

install.packages("wordcloud")
library(wordcloud)

textplot(cor_met_u1, cor_met_u2, 1:length(cor_met_u1) ,ylim=c(-1,1), xlim=c(-1,1), col ="red", show.lines=F)
par(new=T)
textplot(cor_met_v1, cor_met_v2, 1:length(cor_met_v1),ylim=c(-1,1), xlim=c(-1,1),show.lines=F,col="blue")

legend("topright", inset=c(-0.1,0), legend=objwoorden, title="Object names",cex=0.7,col="red")
legend("topright", inset=c(0.1,0), legend=trefwoorden, title="Keywords",cex=0.7,col="blue")

Now I would like to adapt the following things, but I can't find how to do this:

  • The legend titled Object names: I would like every word in this legend to be red, and every word to have its corresponding number in the plot as its key.

  • The legend titled Keywords: likewise, I would like every word in this legend to be blue, and every word to have its corresponding number in the plot as its key.

  • My legends don't have enough space, so parts of them aren't plotted. How can I reduce the space used for plotting the points and increase the space for the legends?

My data (the red part):

cor_met_u1 <- c(-8.553663e-01, -7.726949e-01, -7.308201e-01, -6.992058e-01, -6.675692e-01, -5.971927e-01, -5.870302e-01, -4.856212e-01, -4.612918e-01, -4.185641e-01, -4.106425e-01,  3.816280e-01,  3.184851e-01,  8.766928e-03, 9.121623e-03, 9.227969e-03, -3.477085e-02,  1.248777e-02,  2.982004e-03,  3.970818e-03, 3.970818e-03, 3.970818e-03,  4.099181e-03, -2.823043e-03,  2.702839e-02,-1.683602e-03, -2.231668e-02,  4.884192e-02, -1.177896e-02, -2.984341e-02, -1.120810e-02,  1.449123e-02, -2.223017e-02,  2.764716e-02,  1.514186e-02, 3.261371e-03, -1.661866e-03, -1.661866e-03, -1.661866e-03, -1.661866e-03, -1.661866e-03,  4.787548e-05, -5.408560e-04, -1.331249e-02,  1.669416e-02, 1.739344e-02)
cor_met_u2 <- c(-2.246893e-03, -2.632274e-03, -1.049068e-03, -2.192703e-03, -1.948807e-03, -5.081165e-04,  9.637142e-04, -6.389820e-04, -1.113667e-03, -2.423015e-01, -4.794701e-05, -1.412691e-03, -1.321541e-03, -9.755640e-01, -9.682569e-01, -9.530348e-01, -9.129931e-01, -8.893264e-01, -8.197392e-01, -8.077923e-01,-8.077923e-01, -8.077923e-01, -8.069009e-01, -8.060184e-01, -7.557130e-01,-7.496069e-01, -7.100768e-01, -6.772976e-01, -6.075918e-01, -5.945667e-01,-5.296330e-01, -5.198169e-01, -4.598129e-01, -4.484590e-01, -4.466080e-01, -4.401859e-01, -3.982912e-01, -3.982912e-01, -3.982912e-01, -3.982912e-01,-3.982912e-01, -3.956812e-01, -3.681578e-01, -3.640512e-01, -3.532156e-01,-3.064998e-01)
objwoorden <- c('subcha', 'subchange', 'executant', 'information', 'authorization', 'change', 'origin', 'admi', 'acount', 'start', 'telnummer', 'device', 'mgmt', 'krn', 'uitoef', 'doel', 'titel', 'child', 'calculator', 'bckup', 'execid', 'fgr', 'vanuit','content', 'personeelsnummer', 'enkel', 'niveau', 'value', 'indicator', 'verschil', '1jaar', 'parent', 'jaarmaand','volgnummer', 'parentvolgnummers', 'plt2', 'rsum', 'gebruiksart', 'herstellingskost', 'leeggoedverschil', 'voorraadverschil',                 'kasverschil', 'begindatummaand', 'jaarmaand1jaar', 'descr', 'excid') 

Source: (StackOverflow)

R - Removing corpus wordset from larger corpus to find unique words

I have two corpora (which I turn into DocumentTermMatrices, data frames, and then wordclouds), one of which is a subset of the other. To be exact, one is a corpus of text regarding just one university and the other is the corpus of text regarding all the universities in that conference.

Is there a way in R to extract just the words unique to the smaller wordset? This is roughly what I've been running so far for each corpus (this is for the 'conference' corpus):

> SECDraft = read.csv("SECDraftScouting.csv", stringsAsFactors=FALSE)
> SECcorpus = Corpus(VectorSource(SECDraft$Report))
> SECcorpus = tm_map(SECcorpus, tolower)
> SECcorpus = tm_map(SECcorpus, PlainTextDocument)
> SECcorpus = tm_map(SECcorpus, removePunctuation)
> SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
> SECfrequencies = DocumentTermMatrix(SECcorpus)
> SECallReports = as.data.frame(as.matrix(SECfrequencies))
> wordcloud(colnames(SECallReports), colSums(SECallReports), random.order = FALSE, max.words = 200, scale=c(2, 0.25))

thanks guys!


Source: (StackOverflow)

How do I remove words from a wordcloud?

I'm creating a wordcloud using the wordcloud package in R, and the help of "Word Cloud in R".

I can do this easily enough, but I want to remove words from this wordcloud. I have words in a file (actually an excel file, but I could change that), and I want to exclude all these words, of which there are a couple hundred. Any suggestions?

require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
ap.corpus=Corpus(DataframeSource(data.frame(as.character(data.merged2[,6]))))
ap.corpus=tm_map(ap.corpus, removePunctuation)
ap.corpus=tm_map(ap.corpus, tolower)
ap.corpus=tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
ap.tdm=TermDocumentMatrix(ap.corpus)
ap.m=as.matrix(ap.tdm)
ap.v=sort(rowSums(ap.m),decreasing=TRUE)
ap.d=data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)

Source: (StackOverflow)

d3.js cloud from external .csv or .txt file?

I'm trying to create a word cloud using D3. To do this, I'm modifying Jason Davies' code: https://github.com/jasondavies/d3-cloud/blob/master/examples/simple.html

I want to change the code so that instead of using a word array, I can just link to a .txt or a .csv file with a larger amount of text.

I tried using the d3.text() and d3.csv() methods, but I'm doing something wrong. Since both methods call for a URL, I used a data URL generator (http://dataurl.net/#dataurlmaker) to turn a text file into a URL. I then changed the code and inserted the dataurl as follows:

var fill = d3.scale.category20();
var text = d3.text(data:text/plain;base64,RGVsbCwgdGhl....continued....more...URLdata)

  d3.layout.cloud().size([300, 300])
  .words(text.map(function(d) {
    return {text: d, size: 10 + Math.random() * 90};
  }))
  .rotate(function() { return ~~(Math.random() * 2) * 90; })
  .font("Impact")
  .fontSize(function(d) { return d.size; })
  .on("end", draw)
  .start();

The second option I tried was to insert the text into a script tag in the html and then reference that in the JS code like so:

<!DOCTYPE html>
<script src="../lib/d3/d3.js"></script>
<script id="text" type="text/plain">Dell, the company, has...more..text...</script>
<script src="../d3.layout.cloud.js"></script>
<body>

<script>
var fill = d3.scale.category20();
var text = d3.select("#text");

  d3.layout.cloud().size([300, 300])
      .words(text.map(function(d) {
        return {text: d, size: 10 + Math.random() * 90};
        }))

etc........

Could someone help me figure out a way to read in a .txt or .csv file? Thanks!
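Whichever way the file is loaded, the layout's `.words()` call needs an array of objects, not one raw string. Here is a minimal sketch of that conversion step, assuming plain English text. The function names `countWords` and `toCloudWords`, the file name `words.txt`, and the size range are made up for illustration.

```javascript
// Split raw text into lowercase words and count how often each occurs.
// Note: this simple pattern keeps only a-z and apostrophes, so accented
// letters and digits are dropped.
function countWords(text) {
  var counts = Object.create(null); // no inherited keys such as "constructor"
  (text.toLowerCase().match(/[a-z']+/g) || []).forEach(function (w) {
    counts[w] = (counts[w] || 0) + 1;
  });
  return counts;
}

// Convert counts into the {text, size} objects the cloud layout expects.
// minSize/maxSize are illustrative font sizes in pixels.
function toCloudWords(counts, minSize, maxSize) {
  var words = Object.keys(counts);
  var max = Math.max.apply(null, words.map(function (w) { return counts[w]; }));
  return words.map(function (w) {
    return { text: w, size: minSize + (maxSize - minSize) * counts[w] / max };
  });
}

// With D3 v3 (the version the question's code appears to use), the file is
// read asynchronously and the text arrives in the callback:
//   d3.text("words.txt", function (error, text) {
//     if (error) throw error;
//     var words = toCloudWords(countWords(text), 10, 100);
//     // pass `words` to d3.layout.cloud().words(words) as in the example
//   });
```

Here the font size follows word frequency instead of the random sizes in the original example. The file has to be served from the same site as the page (or with suitable CORS headers) for the request to succeed.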


Source: (StackOverflow)

How to make R word cloud display most frequent term in lighter shade of color

I created a word cloud in R with the code:

wordcloud(words$term, words$freq, random.order=FALSE, colors=colorRampPalette(brewer.pal(9,"Blues"))(32), scale=c(5, .5))

It works fine, except that it colors the terms so that the most frequent appear in the darkest shade of the color and the least frequent in the lightest shade. I want it the other way round. Any pointers? Thanks.


Source: (StackOverflow)