Extracting URLs from unformatted text

I have only found examples of extracting substrings from formatted text such as HTML files, but in my case I need to output a list of URLs like this:

... 
https://twitter.com/user1/status/xyza 
https://twitter.com/user1/status/xyzb
https://twitter.com/user1/status/xyzc
https://twitter.com/user2/status/xyza
https://twitter.com/user2/status/xyzb
...

The source is a very large (100+ MB) unstructured file. Here is an excerpt of my input:

n          3\\n        \\n      \\n  \\n    \\n      \\n      Retweeted\\n    \\n      \\n        \\n          3\\n        \\n      \\n  \\n\\n      \\n  \\n    \\n      \\n        \\n      \\n      Like\\n    \\n      \\n        \\n          5\\n        \\n      \\n  \\n    \\n      \\n        \\n      \\n      Liked\\n    \\n      \\n        \\n          5\\n        \\n      \\n  \\n\\n      \\n\\n        \\n    \\n  \\n      \\n        \\n        More\\n      \\n  \\n  \\n  \\n    \\n    \\n  \\n  \\n    \\n      \\n        Copy link to Tweet\\n      \\n      \\n        Embed Tweet\\n      \\n        \\n  \\n\\n\\n\\n\\n  \\n\\n    \\n\\n      \\n\\n      \\n        \\n  \\n    \\n      \\n  \\n\\n      \\n    \\n\\n  \\n\\n\\n      \\n\\n\\n    \\n      \\n          \\n\\n    \\n        \\n          \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n        \\n        \\n  \\n    \\n  \\n      \\n\\n    \\n        \\n\\n    \\n\\n          Back to top ↑\\n\\n  \\n\\n\\n    \\n  \\n    \\n  \\n\\n\\n  \\n\\n\\n    \\n  \\n    Loading seems to be taking a while.\\n    \\n      Twitter may be over capacity or experiencing a momentary hiccup. 
Try again or visit Twitter Status for more information.\\n    \\n  \\n\\n\\n\\n      \\n    \\n  \\n\\n      \\n    \\n\\n\\n\\n\\n\\n  \\n\\n\\n  \\n    \\n      Suggested by Twitter\\n      \\n        \\n      \\n    \\n   \\n\\n    \\n  \\n    \\n  \\n    \\n    false\\n  \\n  \\n    \\n    \\n  \\n\\n  \\n\\n\\n\\n  \\n      \\n  \\n    \\n      \\n        © 2015 Twitter\\n        About\\n        Help\\n        Terms\\n        Privacy\\n        Cookies\\n        Ads info\\n      \\n    \\n  \\n\\n\\n  \\n\\n\\n\\n      \\n    \\n  \\n\\n\\n    \\n  \\n  \\n\\n\\n\\n    \\n    \\n  \\n\\n  \\n\\n  \\n\\n    \\n  \\n\\n  \\n    \\n\\n\",\"meta_tags\":[{},{\"content\":\"0; URL=https://mobile.twitter.com/i/nojs_router?path=%2FTerriBauman%2Fstatus%2F680996161843380224\"},{\"name\":\"robots\",\"content\":\"NOODP\"},{\"name\":\"msapplication-TileImage\",\"content\":\"//abs.twimg.com/favicons/win8-tile-144.png\"},{\"name\":\"msapplication-TileColor\",\"content\":\"#00aced\"},{\"name\":\"swift-page-name\",\"content\":\"permalink\"},{\"content\":\"article\"},{\"content\":\"https://twitter.com/TerriBauman/status/680996161843380224\"},{\"content\":\"Terri Bauman on Twitter\"},{\"content\":\"https://pbs.twimg.com/media/BcaVtMKCEAAyz9f.jpg:large\"},{\"content\":\"true\"},{\"content\":\"“Social Media Jobs: https://t.co/NDDK4WaRA4 Please Retweet to spread words #OnlineJobs 
#Jobs”\"},{\"content\":\"Twitter\"},{\"content\":\"2231777543\"}],\"links\":[\"https://twitter.com/\",\"https://twitter.com/about\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/#supported_languages\",\"https://twitter.com/?lang=id\",\"https://twitter.com/?lang=msa\",\"https://twitter.com/?lang=cs\",\"https://twitter.com/?lang=da\",\"https://twitter.com/?lang=de\",\"https://twitter.com/?lang=en-gb\",\"https://twitter.com/?lang=es\",\"https://twitter.com/?lang=fil\",\"https://twitter.com/?lang=fr\",\"https://twitter.com/?lang=it\",\"https://twitter.com/?lang=hu\",\"https://twitter.com/?lang=nl\",\"https://twitter.com/?lang=no\",\"https://twitter.com/?lang=pl\",\"https://twitter.com/?lang=pt\",\"https://twitter.com/?lang=ro\",\"https://twitter.com/?lang=fi\",\"https://twitter.com/?lang=sv\",\"https://twitter.com/?lang=vi\",\"https://twitter.com/?lang=tr\",\"https://twitter.com/?lang=el\",\"https://twitter.com/?lang=ru\",\"https://twitter.com/?lang=uk\",\"https://twitter.com/?lang=he\",\"https://twitter.com/?lang=ar\",\"https://twitter.com/?lang=fa\",\"https://twitter.com/?lang=mr\",\"https://twitter.com/?lang=hi\",\"https://twitter.com/?lang=bn\",\"https://twitter.com/?lang=gu\",\"https://twitter.com/?lang=ta\",\"https://twitter.com/?lang=kn\",\"https://twitter.com/?lang=th\",\"https://twitter.com/?lang=ko\",\"https://twitter.com/?lang=ja\",\"https://twitter.com/?lang=zh-cn\",\"https://twitter.com/?lang=zh-tw\",\"https://twitter.com/login\",\"https://twitter.com/account/begin_password_reset\",\"https://tw
itter.com/signup\",\"https://twitter.com/TerriBauman\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/hashtag/Entrepreneur?src=hash\",\"https://twitter.com/hashtag/SocialMediaExpert?src=hash\",\"https://twitter.com/hashtag/SocialMediaMarketer?src=hash\",\"https://twitter.com/hashtag/BusinessOwner?src=hash\",\"https://twitter.com/hashtag/InternetMarketer?src=hash\",\"https://twitter.com/hashtag/SocialMediaJobs?src=hash\",\"https://t.co/ZciT91kZwP\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\",\"https://twitter.com/#\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"http://support.twitter.com/forums/2681
0/entries/78525\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"https://twitter.com/account/begin_password_reset\",\"https://twitter.com/signup\",\"https://twitter.com/signup\",\"https://twitter.com/login\",\"http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code\",\"https://twitter.com/TerriBauman/status/680996164058001408\",\"https://twitter.com/TerriBauman/status/680977383365578752\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://t.co/NDDK4WaRA4\",\"https://twitter.com/hashtag/OnlineJobs?src=hash\",\"https://twitter.com/hashtag/Jobs?src=hash\",\"https://t.co/SJvkM1yWUI\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/cakafete\",\"https://twitter.com/KassemAlYateem\",\"https://twitter.com/Worldspacetech1\",\"https://twitter.com/ElisaBW\",\"https://twitter.com/patrickarrelle\",\"https://twitter.com/AcousticsPro1\",\"https://twitter.com/#\",\"http://status.twitter.com\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\"]}"},{"url":"http://status.twitter.com/page/2","result":"{\"date_crawled\":\"2015-12-27T10:01:58Z\",\"title\":\"Twitter Status\",\"lossyHTML\":\"\\n\\n\\r\\n\\r\\n    \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n            \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n        \\r\\n                \\r\\n        \\r\\n\\r\\n        \\r\\n        Twitter Status\\r\\n        \\n\\r\\n        \\r\\n         \\r\\n\\r\\n        \\r\\n\\r\\n    \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\r\\n    
\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n        \\r\\n\\r\\n\\r\\n\\r\\n    \\r\\n    \\r\\n        \\r\\n            \\r\\n                \\r\\n                    Updates on the status of the Twitter service.\\r\\n\\r\\n\\r\\n\\r\\n\\r\\nRelated Links\\r\\nOfficial Company Blog\\r\\n\\r\\nOfficial Help Documents\\r\\n\\r\\nDeveloper Community\\r\\n\\r\\n\\r\\n\\r\\n                    Archive\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n                    Powered by Tumblr\\r\\n                \\r\\n\\r\\n                \\r\\n            \\r\\n            \\r\\n\\r\\n\\r\\n            \\r\\n                \\r\\n                    \\r\\n       

I tried the following:

grep 'https://' input.txt | grep 'status' >> output.txt

I have seen examples that use sed and awk, but they are very hard to understand, and they almost always rely on selecting columns, which is not possible in my case.

Answer 1

To get the URLs that have exactly two slashes after the scheme, try this with GNU grep:

grep -o 'http[s]*://[^/][^\\]*' file

To get the URLs that have two or more slashes after the scheme (such as http:////support.twitter.com):

grep -o 'http[s]*://[^\\]*' file
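To narrow the result down to the status URLs the question asks for, the first command can be piped through a second grep and then de-duplicated. The following is only a sketch: the `sample` string is a made-up stand-in for one line of the real file, and the `/status/` filter and `sort -u` step are assumptions about the desired output, not part of the original answer.

```shell
# Hypothetical one-line sample that mimics the escaped JSON in the input
# (each URL is followed by a literal backslash and a double quote).
sample='[\"https://twitter.com/about\",\"https://twitter.com/user1/status/xyza\",\"https://twitter.com/user1/status/xyza\",\"http:////support.twitter.com\",\"https://twitter.com/user2/status/xyzb\"]'

# 1. grep -o prints each match on its own line instead of the whole input line.
# 2. The second grep keeps only URLs that contain a /status/ path segment.
# 3. sort -u removes the repeated URLs.
urls=$(printf '%s\n' "$sample" \
    | grep -o 'http[s]*://[^/][^\\]*' \
    | grep '/status/' \
    | sort -u)

printf '%s\n' "$urls"
```

With the real data, `printf '%s\n' "$sample"` would be replaced by the file name passed to the first grep, and the output could be redirected to output.txt.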

Recommended reading: the Stack Overflow Regular Expressions FAQ

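[s]*: The asterisk quantifier (*) means that the preceding expression may match zero or more times. Here the preceding expression is a character class (written in brackets) that contains only the character s. Writing s* is simpler and equivalent.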

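[^\\]*: Matches any character other than a backslash, zero or more times. The backslash is written twice with the intent of keeping it from escaping the ]. In grep's POSIX bracket expressions a backslash is already literal, so the doubled backslash is harmless and [^\] would match the same characters.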
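The difference between the two patterns can be checked on a short test string. This is a minimal sketch; the `line` value is a hypothetical sample built from two URLs that appear in the input, and the variable names are illustrative only.

```shell
# Hypothetical sample: one normal URL and one with four slashes after the scheme.
line='\"https://twitter.com/tos\",\"http:////support.twitter.com\"'

# [^/] right after :// rejects a third slash, so the four-slash URL is skipped.
two_slashes=$(printf '%s\n' "$line" | grep -o 'http[s]*://[^/][^\\]*')

# Without [^/], any run of non-backslash characters after :// is accepted.
two_or_more=$(printf '%s\n' "$line" | grep -o 'http[s]*://[^\\]*')

# s* behaves the same as [s]* here.
plain_s=$(printf '%s\n' "$line" | grep -o 'https*://[^\\]*')

printf 'two slashes only:\n%s\n' "$two_slashes"
printf 'two or more slashes:\n%s\n' "$two_or_more"
```

In every case the match stops at the first backslash, which is why the trailing `\"` of the escaped JSON is not part of the printed URL.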
