Comments on Coffee and Econometrics in the Morning: Splitting Vectors of Uneven Strings
Blog author: Tony Cookson (http://www.blogger.com/profile/12565713889808330198)

Tony Cookson, 2011-08-03 19:17 (-07:00):
That's a nice *string* of comments.

Anonymous, 2011-08-03 09:44 (-07:00):
When myvec is really big, so that each string could be millions of characters long, you don't want to process everything that comes after the first 3 words. Because of this, strsplit and regular expressions that involve (.+) will be slow.

A fast solution in that case is:

    substr( myvec,1,attr(regexpr("[^ ]* [^ ]* [^ ]* ", myvec),"match.length"))

Here:

    > myvec=rep( paste( rep(c(1:10," "),10^5),collapse=""),10)
    > system.time({ substr( myvec,1,attr(regexpr("[^ ]* [^ ]* [^ ]* ", myvec),"match.length"))})
       user  system elapsed
      0.026   0.000   0.027
    > system.time( { sub("(([[:alpha:]]+[[:blank:]]){3})(.+)","\\1",myvec)})
       user  system elapsed
      3.146   0.019   3.181
    > system.time( { do.call(rbind,lapply(myvec, FUN=function(x) paste(strsplit(x," ")[[1]][1:3], collapse="")))})
       user  system elapsed
      2.522   0.010   2.558

Anonymous, 2011-08-01 04:45 (-07:00):
This is actually even shorter:

    sub("(([[:alpha:]]+[[:blank:]]){3})(.+)","\\1",myvec)

Anonymous, 2011-08-01 04:34 (-07:00):
You'd better use a regexp, e.g.:

    myvec <- sub("([[:alpha:]]+[[:blank:]][[:alpha:]]+[[:blank:]][[:alpha:]]+)(.+)","\\1",myvec)
    myvec <- gsub(" ","",myvec)

Big cat, 2011-07-30 10:27 (-07:00):
Clever solution!

Anonymous, 2011-07-30 07:04 (-07:00):
How's this for doing it in one line?

    do.call(rbind,lapply(myvec, FUN=function(x) paste(strsplit(x," ")[[1]][1:3], collapse="")))
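
To see how the suggestions in this thread relate to each other, here is a minimal sketch in R that runs three of them on a small made-up vector. Several things here are assumptions and not part of the original comments: the contents of `myvec`, the names `first3_split`, `first3_sub`, and `first3_substr`, the use of `vapply` in place of `do.call(rbind, lapply(...))`, and the added `^`/`$` anchors. The regex and `substr` variants in the thread leave the spaces in place, so the sketch strips them with `gsub` to make the three results directly comparable.

```r
# Toy input: an assumption for illustration; the post's own myvec is not shown in this thread.
myvec <- c("alpha beta gamma delta epsilon",
           "one two three four",
           "red green blue yellow orange purple")

# 1. strsplit approach: split on single spaces, keep the first three words,
#    and paste them together without a separator.
first3_split <- vapply(strsplit(myvec, " ", fixed = TRUE),
                       function(w) paste(w[1:3], collapse = ""),
                       character(1), USE.NAMES = FALSE)

# 2. sub approach: capture three "letters + blank" groups, discard the rest,
#    then remove the remaining spaces.
first3_sub <- gsub(" ", "",
                   sub("^(([[:alpha:]]+[[:blank:]]){3})(.+)$", "\\1", myvec))

# 3. regexpr + substr approach: find only the length of the three-word prefix,
#    cut the string there, then remove the spaces.
m <- regexpr("^[^ ]* [^ ]* [^ ]* ", myvec)
first3_substr <- gsub(" ", "", substr(myvec, 1, attr(m, "match.length")))

print(first3_split)
print(identical(first3_split, first3_sub))
print(identical(first3_split, first3_substr))
```

On input like this, where every string has at least four space-separated alphabetic words, the three approaches should agree. They are not interchangeable in general: with fewer than three words, the `strsplit` version pastes in `NA` values, and the `regexpr` version finds no match and returns an empty string, so it is worth checking the shape of the data before picking one.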